1. Query model
There are three common theoretical models of search:
Pure Boolean model—Documents either match or don’t match the provided
query, and no scoring is done. In this model there are no relevance scores associated
with matching documents, and the matching documents are unordered;
a query simply identifies a subset of the overall corpus as matching the query.
Vector space model—Both queries and documents are modeled as vectors in a
high-dimensional space, where each unique term is a dimension. Relevance, or
similarity, between a query and a document is computed by a vector distance
measure between these vectors.
Probabilistic model—In this model, you compute the probability that a document
is a good match to a query using a full probabilistic approach.
Lucene’s approach combines the vector space and pure Boolean models, and offers
you controls to decide which model you’d like to use on a search-by-search basis.
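The geometric idea behind the vector space model can be illustrated with a minimal plain-Java sketch. This is not Lucene's actual scoring implementation (which also folds in term weighting, norms, and other factors); it only shows cosine similarity between raw term-frequency vectors, where a smaller angle between query and document vectors means a better match:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {
    // Build a term-frequency vector from whitespace-separated text.
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity: dot product divided by the product of vector lengths.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termVector("lucene search");
        Map<String, Integer> doc1 = termVector("lucene is a search library for full text search");
        Map<String, Integer> doc2 = termVector("cooking recipes and kitchen tips");
        // The document sharing terms with the query scores higher.
        System.out.println(cosine(query, doc1) > cosine(query, doc2)); // prints true
    }
}
```

In practice you never compute this by hand; Lucene's IndexSearcher does the scoring for you.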
2. Scaling: projects under the Apache Lucene umbrella
Lucene provides no facilities for scaling. However, both Solr and Nutch, projects
under the Apache Lucene umbrella, provide support for index sharding and replication.
The Katta open source project, hosted at http://katta.sourceforge.net and based
on Lucene, also provides this functionality. Elasticsearch, at http://www.elasticsearch.com, is another option that’s also open source and based on Lucene.
There are also related projects such as Hibernate Search and Tika.
3. Lucene index structure
Lucene stores the input in a data structure known as an inverted index. This data structure makes efficient use of disk space while allowing quick keyword lookups. What makes this structure inverted is that it uses tokens extracted from input documents as lookup keys instead of treating documents as the central entities, much like the index of this book references the page number(s) where a concept occurs. In other words, rather than trying to answer the question “What words are contained in this document?” this structure is optimized for providing quick answers to “Which documents contain word X?”
If you think about your favorite web search engine and the format of your typical query, you’ll see that this is exactly the query that you want to be as quick as possible. Inverted indexes are at the core of today’s web search engines. Lucene’s index directory has a unique segmented architecture, which we describe next.
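The inverted structure itself is easy to sketch in plain Java (Lucene's real implementation is far more compact and disk-oriented): a map from each token to the IDs of the documents that contain it, so answering "Which documents contain word X?" is a single lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted set of IDs of documents containing that term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // "Which documents contain word X?" -- a single map lookup.
    List<Integer> search(String term) {
        return new ArrayList<>(postings.getOrDefault(term.toLowerCase(), new TreeSet<>()));
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "Lucene stores the input in an inverted index");
        index.addDocument(1, "The index of this book references page numbers");
        System.out.println(index.search("index"));  // prints [0, 1]
        System.out.println(index.search("lucene")); // prints [0]
    }
}
```

Note that the forward question, "What words are contained in document 5?", would require scanning every posting list in this structure, which is exactly why it's called inverted.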
4. Lucene index segments
Every Lucene index consists of one or more segments, as depicted in figure 2.2.
Each segment is a standalone index, holding a subset of all indexed documents. A
new segment is created whenever the writer flushes buffered added documents and
pending deletions into the directory. At search time, each segment is visited separately
and the results are combined.
Each segment, in turn, consists of multiple files, of the form _X.<ext>, where X is
the segment’s name and <ext> is the extension that identifies which part of the index
that file corresponds to. There are separate files to hold the different parts of the
index (term vectors, stored fields, inverted index, and so on). If you’re using the compound
file format (which is enabled by default, but which you can change using
IndexWriter.setUseCompoundFile), then most of these index files are collapsed into a
single compound file: _X.cfs. This reduces the number of open file descriptors during
searching, at a small cost in search and indexing performance. Chapter 11 covers
this trade-off in more detail.
There’s one special file, referred to as the segments file and named segments_<N>,
that references all live segments. This file is important! Lucene first opens this file, and
then opens each segment referenced by it. The value <N>, called “the generation,” is
an integer that increases by one every time a change is committed to the index.
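To make the _X.<ext> naming scheme concrete, here is a small plain-Java sketch (not Lucene code) that groups a hypothetical directory listing by segment name; the file names below are invented examples of the pattern, not an actual index:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SegmentFileGrouper {
    // Group index file names of the form _X.<ext> by segment name X.
    // Files not matching the pattern (such as segments_<N>) are skipped.
    static Map<String, List<String>> groupBySegment(List<String> fileNames) {
        Map<String, List<String>> segments = new TreeMap<>();
        for (String name : fileNames) {
            if (name.startsWith("_") && name.contains(".")) {
                String segmentName = name.substring(0, name.indexOf('.'));
                segments.computeIfAbsent(segmentName, k -> new ArrayList<>()).add(name);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> files = List.of("_0.cfs", "_1.fdt", "_1.fdx", "segments_2");
        System.out.println(groupBySegment(files)); // prints {_0=[_0.cfs], _1=[_1.fdt, _1.fdx]}
    }
}
```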
5. Field options for indexing
Index.ANALYZED—Use the analyzer to break the field’s value into a stream of
tokens; good for normal text fields such as body, title, and abstract.
Index.NOT_ANALYZED—Index the field’s value as a single token, without analysis;
good for exact-match fields such as URLs, file system paths, dates, personal names,
Social Security numbers, and telephone numbers.
Index.ANALYZED_NO_NORMS—A variant of Index.ANALYZED that doesn’t store norms
in the index, saving memory at the cost of index-time boosts and field-length
normalization.
Index.NOT_ANALYZED_NO_NORMS—A variant of Index.NOT_ANALYZED that likewise
doesn’t store norms.
Index.NO—Don’t index the field’s value at all, so it can’t be searched.
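To see why analysis matters for exact-match values, here is a tiny sketch using a crude stand-in for an analyzer (a lowercase-and-split rule, not a real Lucene analyzer): a URL indexed as if by Index.ANALYZED would be shredded into tokens, whereas Index.NOT_ANALYZED keeps the whole value as a single token you can match exactly.

```java
import java.util.Arrays;

public class AnalyzedVsNotAnalyzed {
    // Crude stand-in for an analyzer: lowercase and split on non-word characters.
    static String[] analyze(String value) {
        return value.toLowerCase().split("\\W+");
    }

    public static void main(String[] args) {
        String url = "http://www.elasticsearch.com";
        // ANALYZED-style: the value is broken into tokens -- bad for exact matching.
        System.out.println(Arrays.toString(analyze(url))); // prints [http, www, elasticsearch, com]
        // NOT_ANALYZED-style: the whole value stays one token.
        System.out.println(url);
    }
}
```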
6. Term vector options
Term vectors are a mix between an indexed
field and a stored field. They’re similar to a stored field because you can quickly
retrieve all term vector fields for a given document: term vectors are keyed first by
document ID. But then they’re keyed secondarily by term, meaning they store a miniature
inverted index for that one document.
TermVector.YES—Records the unique terms that occurred, and their counts,
in each document, but doesn’t store any positions or offsets information
TermVector.WITH_POSITIONS—Records the unique terms and their counts,
and also the positions of each occurrence of every term, but no offsets
TermVector.WITH_OFFSETS—Records the unique terms and their counts, with
the offsets (start and end character position) of each occurrence of every term,
but no positions
TermVector.WITH_POSITIONS_OFFSETS—Stores unique terms and their counts,
along with positions and offsets
TermVector.NO—Doesn’t store any term vector information
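The "miniature inverted index for one document" can be sketched in plain Java (again, not Lucene's actual storage format): for each term in a single document, record its count, its token positions, and the start/end character offsets of each occurrence, mirroring what TermVector.WITH_POSITIONS_OFFSETS retains.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermVectorSketch {
    // Per-term record: occurrence count, token positions, and
    // [start, end) character offsets of each occurrence.
    static class TermInfo {
        int count;
        List<Integer> positions = new ArrayList<>();
        List<int[]> offsets = new ArrayList<>();
    }

    // Build the term vector -- an inverted index over one document's terms.
    static Map<String, TermInfo> termVector(String text) {
        Map<String, TermInfo> vector = new HashMap<>();
        Matcher m = Pattern.compile("\\w+").matcher(text.toLowerCase());
        int position = 0;
        while (m.find()) {
            TermInfo info = vector.computeIfAbsent(m.group(), k -> new TermInfo());
            info.count++;
            info.positions.add(position++);
            info.offsets.add(new int[] { m.start(), m.end() });
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, TermInfo> tv = termVector("to be or not to be");
        TermInfo be = tv.get("be");
        System.out.println(be.count);     // prints 2
        System.out.println(be.positions); // prints [1, 5]
    }
}
```

Dropping positions or offsets (the WITH_POSITIONS, WITH_OFFSETS, and YES variants above) simply means omitting the corresponding lists, trading functionality for index size.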
Note
From Lucene in Action, 2nd Edition