
Monday, January 12, 2015

Fast Incremental Indexing

Fast Incremental Indexing for Full-Text Information Retrieval

Lucene offers powerful features through a simple API:

Scalable, High-Performance Indexing

  • over 150GB/hour on modern hardware
  • small RAM requirements -- only 1MB heap
  • incremental indexing as fast as batch indexing
  • index size roughly 20-30% the size of text indexed

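The incremental-indexing claim above — adding documents is as cheap as batch indexing — can be illustrated with a toy inverted index in Java. This is a hedged sketch of the general technique, not Lucene's actual implementation; all class and method names here are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: maps each term to the list of document IDs containing it.
// Adding a document touches only that document's own terms, so the cost of an
// incremental update is proportional to the new text, not the whole collection.
class ToyInvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
        return docId;
    }

    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.addDocument("fast incremental indexing");   // doc 0
        idx.addDocument("batch indexing of documents"); // doc 1
        System.out.println(idx.lookup("indexing"));     // [0, 1]
    }
}
```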
Powerful, Accurate and Efficient Search Algorithms

  • ranked searching -- best results returned first
  • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  • fielded searching (e.g. title, author, contents)
  • sorting by any field
  • multiple-index searching with merged results
  • allows simultaneous update and searching
  • flexible faceting, highlighting, joins and result grouping
  • fast, memory-efficient and typo-tolerant suggesters
  • pluggable ranking models, including the Vector Space Model and Okapi BM25
  • configurable storage engine (codecs)
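The "ranked searching -- best results returned first" item above can be sketched with a simple term-frequency scorer. This is an illustrative toy, not Lucene's actual scoring (Lucene's pluggable models include the Vector Space Model and Okapi BM25, as listed); the class below is invented for this example:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy ranked search: score each document by how many query-term occurrences it
// contains, then return matching document IDs sorted best-first.
class ToyRankedSearch {
    private final List<String[]> docs = new ArrayList<>();

    void add(String text) {
        docs.add(text.toLowerCase().split("\\W+"));
    }

    List<Integer> search(String query) {
        String[] qTerms = query.toLowerCase().split("\\W+");
        Map<Integer, Integer> scores = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            int score = 0;
            for (String term : docs.get(docId))
                for (String q : qTerms)
                    if (term.equals(q)) score++;
            if (score > 0) scores.put(docId, score);
        }
        List<Integer> hits = new ArrayList<>(scores.keySet());
        hits.sort(Comparator.comparing(scores::get).reversed());
        return hits;
    }

    public static void main(String[] args) {
        ToyRankedSearch s = new ToyRankedSearch();
        s.add("lucene search");                 // doc 0: score 2 for the query below
        s.add("lucene lucene indexing search"); // doc 1: score 3
        s.add("unrelated text");                // doc 2: no match
        System.out.println(s.search("lucene search")); // [1, 0]
    }
}
```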

Conclusions

If IR systems are to satisfy the demand for applications that
can manage an ever-increasing repository of information,
they must be able to efficiently add documents to large existing
collections. The main bottleneck in that operation is
updating the index structure used to manage the collection.
The traditional solution to this problem is to re-index the
entire collection, an operation with costs proportional to
the size of the whole collection. This solution is clearly
unacceptable.
We have proposed an alternative solution that yields
costs proportional to the size of the update. Using the
data management facilities of a persistent object store, we
have designed a more sophisticated inverted file index that
provides fast incremental updates. More importantly, we
have implemented our scheme in an operational full-text
information retrieval system and verified its performance
empirically.
The results we present show that our scheme maintains
a nearly constant per posting update cost as the size of the
collection grows, indicating excellent potential for scale. In
fact, we have used our scheme to index the full 2 Gbyte TIPSTER
collection in 13 batches and have found the trends
described in Section 4 to hold. Our scheme requires considerably
less disk space during indexing than traditional
techniques, and allows much of the processing for a new
batch of documents to be done independently from the existing
index. This last point is particularly important for
the eventual support of simultaneous query processing and
collection updating, since the period of time during which
index structures must be locked for updating can be minimized.
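The point about processing a new batch independently of the existing index can be sketched as a build-then-merge step: the batch's postings are constructed on their own, and only the final merge touches the main index, so the lock window is proportional to the batch rather than the collection. This is a hedged illustration of the general idea, not the paper's actual persistent-object-store implementation; all names below are invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: index a batch of documents into its own postings map, then merge
// that map into the main index. Only merge() needs exclusive access to the
// main index, so most of the batch's processing happens independently.
class BatchMergeSketch {
    static Map<String, List<Integer>> buildBatch(List<String> docs, int firstDocId) {
        Map<String, List<Integer>> batch = new HashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = firstDocId + i;
            for (String term : docs.get(i).toLowerCase().split("\\W+"))
                if (!term.isEmpty())
                    batch.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
        return batch;
    }

    static void merge(Map<String, List<Integer>> mainIndex,
                      Map<String, List<Integer>> batch) {
        // Append each batch posting list to the corresponding main list;
        // cost is proportional to the size of the update, not the collection.
        batch.forEach((term, ids) ->
            mainIndex.computeIfAbsent(term, t -> new ArrayList<>()).addAll(ids));
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> mainIndex = new HashMap<>();
        merge(mainIndex, buildBatch(List.of("old document"), 0));
        merge(mainIndex, buildBatch(List.of("new incremental document"), 1));
        System.out.println(mainIndex.get("document")); // [0, 1]
    }
}
```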
We found that best performance is achieved when documents
are added in the largest batches possible, both in terms
of incremental indexing time and resultant query processing
speed. We have also shown that our scheme provides
a good level of performance with small batch updates, and
have suggested techniques to improve both small batch update
and query processing performance. These techniques
bear further investigation and represent future work.
Finally, we have achieved these results using “off-the-shelf”
data management technology, continuing to show
that the data management facilities in IR systems need not
be custom built to achieve high performance.
