
Sunday, November 10, 2013

lucene scaling & term vector & merge segment

from Lucene in Action, 2nd edition


SCALING
One particularly tricky area is scaling of your search application. The vast majority of
search applications don’t have enough content or simultaneous search traffic to
require scaling beyond a single computer. Lucene indexing and searching throughput
allows for a sizable amount of content on a single modern computer. Still, such
applications may want to run two identical computers to ensure there’s no single
point of failure (no downtime) in the event of hardware failure. This approach also
enables you to pull one computer out of production to perform maintenance and
upgrades without affecting ongoing searches.
There are two dimensions to scaling: net amount of content, and net query
throughput. If you have a tremendous amount of content, you must divide it into
shards, so that a separate computer searches each shard. A front-end server sends a
single incoming query to all shards, and then coalesces the results into a single result
set. If instead you have high search throughput during your peak traffic, you’ll have to
take the same index and replicate it across multiple computers. A front-end load balancer
sends each incoming query to the least loaded back-end computer. If you
require both dimensions of scaling, as a web scale search engine will, you combine
both of these practices.
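The fan-out-and-coalesce step described above can be sketched in plain Java. This is a conceptual illustration, not Lucene's distributed API: the Shard interface, Hit record, and search method are all hypothetical names invented here to show how a front-end merges per-shard top hits into one ranked result set.

```java
import java.util.*;

// Hypothetical sketch (not Lucene API): a front-end node fans a query out to
// shards and coalesces their top hits into one ranked result set.
public class ShardedSearchSketch {
    // A scored hit returned by one shard; shardId disambiguates local doc IDs.
    record Hit(int shardId, int docId, float score) {}

    // Each "shard" here is just a function from a query to its local top hits.
    interface Shard { List<Hit> search(String query, int topK); }

    // Send the query to every shard, then sort the combined hits by
    // descending score and keep the global top K.
    static List<Hit> search(List<Shard> shards, String query, int topK) {
        List<Hit> all = new ArrayList<>();
        for (Shard s : shards) all.addAll(s.search(query, topK)); // fan-out
        all.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        return all.subList(0, Math.min(topK, all.size()));        // coalesce
    }

    public static void main(String[] args) {
        Shard a = (q, k) -> List.of(new Hit(0, 1, 0.9f), new Hit(0, 2, 0.3f));
        Shard b = (q, k) -> List.of(new Hit(1, 7, 0.7f));
        System.out.println(search(List.of(a, b), "lucene", 2));
    }
}
```

A real deployment additionally has to reconcile scoring statistics across shards (as the next paragraph notes, term weights are computed per index), which this sketch deliberately ignores.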
A number of complexities are involved in building such an architecture. You’ll
need a reliable way of replicating the search index across computers. If a computer
has some downtime, planned or not, you need a way to bring it up-to-date before putting
it back into production. If there are transactional requirements, so that all searchers
must “go live” on a new index commit simultaneously, that adds complexity. Error
recovery in a distributed setting can be complex. Finally, important functionality like
spell correction and highlighting, and even how term weights are computed for scoring,
are impacted by such a distributed architecture.

Merge segment
Each segment, in turn, consists of multiple files of the form _X.<ext>, where X is the segment's name and <ext> is the extension that identifies which part of the index that file corresponds to. There are separate files to hold the different parts of the index (term vectors, stored fields, inverted index, and so on). If you're using the compound file format (which is enabled by default but can be changed using IndexWriter.setUseCompoundFile), then most of these index files are collapsed into a single compound file: _X.cfs. This reduces the number of open file descriptors during searching, at a small cost in searching and indexing performance. Chapter 11 covers this trade-off in more detail.
There’s one special file, referred to as the segments file and named segments_<N>, that references all live segments. This file is important! Lucene first opens this file, and then opens each segment referenced by it. The value <N>, called “the generation,” is an integer that increases by one every time a change is committed to the index.
Naturally, over time the index will accumulate many segments, especially if you open and close your writer frequently. This is fine. Periodically, IndexWriter will select segments and coalesce them by merging them into a single new segment and then removing the old segments. The selection of segments to be merged is governed by a separate MergePolicy. Once merges are selected, their execution is done by the MergeScheduler.  
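The coalescing step can be illustrated with a toy model. This is not Lucene's on-disk format or merge code, just a sketch of the idea: each "segment" is a sorted map from term to posting list, and a merge folds several of them into one new segment.

```java
import java.util.*;

// Toy illustration (not Lucene internals): merging several small "segments",
// each a sorted term -> posting-list map, into one new segment, the way
// IndexWriter coalesces accumulated segments over time.
public class SegmentMergeSketch {
    static TreeMap<String, List<Integer>> merge(List<TreeMap<String, List<Integer>>> segments) {
        TreeMap<String, List<Integer>> merged = new TreeMap<>();
        for (TreeMap<String, List<Integer>> seg : segments)
            seg.forEach((term, postings) ->
                merged.computeIfAbsent(term, t -> new ArrayList<>()).addAll(postings));
        return merged; // in Lucene, the old segments would now be removed
    }
}
```

In Lucene itself, which segments to fold together is the MergePolicy's decision, and running the merge (possibly on background threads) is the MergeScheduler's job.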

term vectors
What exactly are term vectors? Term vectors are a mix between an indexed field and a stored field. They're similar to a stored field because you can quickly retrieve all term vector fields for a given document: term vectors are keyed first by document ID. But then they're keyed secondarily by term, meaning they store a miniature inverted index for that one document. Unlike a stored field, where the original String content is stored verbatim, term vectors store the actual separate terms that were produced by the analyzer, allowing you to retrieve all terms for each field, and the frequency of their occurrence within the document, sorted in lexicographic order. Because the tokens coming out of an analyzer also have position and offset information (see section 4.2.1), you can choose separately whether these details are also stored in your term vectors by passing these constants as the fourth argument to the Field constructor: TermVector.YES, TermVector.WITH_POSITIONS, TermVector.WITH_OFFSETS, TermVector.WITH_POSITIONS_OFFSETS, or TermVector.NO.
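The "miniature inverted index for one document" can be sketched without Lucene at all: take the analyzed terms of one field and record each unique term with its in-document frequency, in lexicographic order. The class and method names below are invented for illustration; only the data shape mirrors what a term vector stores.

```java
import java.util.*;

// Conceptual sketch: a term vector is a tiny per-document inverted index --
// the analyzer's terms for one field, each with its in-document frequency,
// kept in lexicographic order (a TreeMap gives us that ordering for free).
public class TermVectorSketch {
    static SortedMap<String, Integer> termVector(String[] analyzedTerms) {
        TreeMap<String, Integer> tv = new TreeMap<>();
        for (String t : analyzedTerms) tv.merge(t, 1, Integer::sum); // count occurrences
        return tv;
    }
}
```

A real term vector can additionally carry the positions and offsets of each occurrence, which is what the constants above control.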


TermVectorMapper
Sometimes, the parallel array structure returned by IndexReader.getTermFreqVector
may not be convenient for your application. Perhaps instead of sorting by Term, you’d
like to sort the term vectors according to your own criteria. Or maybe you’d like to
only load certain terms of interest. All of these can be done with a recent addition to
Lucene, TermVectorMapper. This is an abstract base class that, when passed to
IndexReader.getTermFreqVector methods, separately receives each term, with
optional positions and offsets and can choose to store the data in its own manner.
Table 5.2 describes the methods that a concrete TermVectorMapper implementation
(subclass) must implement.
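The callback pattern can be made concrete with a simplified stand-in. The method names below mirror Table 5.2, but the signatures are illustrative, not Lucene's exact API (for instance, the real map method also receives offset information); FrequentTermsMapper is a hypothetical subclass that keeps only terms above a frequency threshold instead of building the default parallel arrays.

```java
import java.util.*;

// Simplified stand-in for the TermVectorMapper callback shape: the reader
// pushes each term at the mapper, which stores the data however it likes.
abstract class MapperSketch {
    abstract void setDocumentNumber(int docId);
    abstract void setExpectations(String field, int numTerms,
                                  boolean storeOffsets, boolean storePositions);
    abstract void map(String term, int frequency, int[] positions);
    boolean isIgnoringPositions() { return true; } // return false to receive positions
}

// Hypothetical concrete mapper: keep only terms occurring at least minFreq times.
class FrequentTermsMapper extends MapperSketch {
    final int minFreq;
    final List<String> kept = new ArrayList<>();
    FrequentTermsMapper(int minFreq) { this.minFreq = minFreq; }
    void setDocumentNumber(int docId) {}                  // not needed here
    void setExpectations(String f, int n, boolean o, boolean p) {}
    void map(String term, int frequency, int[] positions) {
        if (frequency >= minFreq) kept.add(term);         // filter as terms arrive
    }
}
```

The point of the abstraction is exactly this kind of filtering: the mapper sees each term once and can discard, re-sort, or restructure the data without the reader ever materializing the full parallel-array form.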
Lucene includes a few public core implementations of TermVectorMapper, described
in table 5.3. You can also create your own implementation.
As we’ve now seen, term vectors are a powerful advanced functionality. We saw two
examples where you might want to use them: automatically assigning documents to
categories, and finding documents similar to an existing example. We also saw
Lucene’s advanced API for controlling exactly how term vectors are loaded. We’ll now
see how to load stored fields using another advanced API in Lucene: FieldSelector.
Table 5.2 Methods that a custom TermVectorMapper must implement

Method               Purpose
setDocumentNumber    Called once per document to tell you which document is currently being loaded.
setExpectations      Called once per field to tell you how many terms occur in the field, and whether positions and offsets are stored.
map                  Called once per term to provide the actual term vectors data.
isIgnoringPositions  You should return false only if you need to see the positions data for the term vectors.
isIgnoringOffsets    You should return false only if you need to see the offsets data for the term vectors.

Table 5.3 Built-in implementations of TermVectorMapper

Class                          Purpose
PositionBasedTermVectorMapper  For each field, stores a map from the integer position to the terms, and optionally the offsets, that occurred at that position.
SortedTermVectorMapper         Merges term vectors for all fields into a single SortedSet, sorted according to a Comparator that you specify. One comparator is provided in the Lucene core, TermVectorEntryFreqSortedComparator, which sorts first by frequency of the term and second by the term itself.
FieldSortedTermVectorMapper    Just like SortedTermVectorMapper, except the fields aren't merged together; instead, each field stores its sorted terms separately.
