
Monday, January 12, 2015

Fast Incremental Indexing

Fast Incremental Indexing for Full-Text Information Retrieval

Lucene offers powerful features through a simple API:

Scalable, High-Performance Indexing

  • over 150GB/hour on modern hardware
  • small RAM requirements -- only 1MB heap
  • incremental indexing as fast as batch indexing
  • index size roughly 20-30% the size of text indexed

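The incremental-indexing claim above — adding documents is as cheap as batch indexing — can be illustrated with a toy inverted index in Java. This is a hedged sketch of the general technique, not Lucene's actual implementation; all class and method names here are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: maps each term to the list of document IDs containing it.
// Adding a document touches only that document's own terms, so the cost of an
// incremental update is proportional to the new text, not the whole collection.
class ToyInvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
        return docId;
    }

    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.addDocument("fast incremental indexing");   // doc 0
        idx.addDocument("batch indexing of documents"); // doc 1
        System.out.println(idx.lookup("indexing"));     // [0, 1]
    }
}
```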
Powerful, Accurate and Efficient Search Algorithms

  • ranked searching -- best results returned first
  • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  • fielded searching (e.g. title, author, contents)
  • sorting by any field
  • multiple-index searching with merged results
  • allows simultaneous update and searching
  • flexible faceting, highlighting, joins and result grouping
  • fast, memory-efficient and typo-tolerant suggesters
  • pluggable ranking models, including the Vector Space Model and Okapi BM25
  • configurable storage engine (codecs)
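The "ranked searching -- best results returned first" item above can be sketched with a simple term-frequency scorer. This is an illustrative toy, not Lucene's actual scoring (Lucene's pluggable models include the Vector Space Model and Okapi BM25, as listed); the class below is invented for this example:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy ranked search: score each document by how many query-term occurrences it
// contains, then return matching document IDs sorted best-first.
class ToyRankedSearch {
    private final List<String[]> docs = new ArrayList<>();

    void add(String text) {
        docs.add(text.toLowerCase().split("\\W+"));
    }

    List<Integer> search(String query) {
        String[] qTerms = query.toLowerCase().split("\\W+");
        Map<Integer, Integer> scores = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            int score = 0;
            for (String term : docs.get(docId))
                for (String q : qTerms)
                    if (term.equals(q)) score++;
            if (score > 0) scores.put(docId, score);
        }
        List<Integer> hits = new ArrayList<>(scores.keySet());
        hits.sort(Comparator.comparing(scores::get).reversed());
        return hits;
    }

    public static void main(String[] args) {
        ToyRankedSearch s = new ToyRankedSearch();
        s.add("lucene search");                 // doc 0: score 2 for the query below
        s.add("lucene lucene indexing search"); // doc 1: score 3
        s.add("unrelated text");                // doc 2: no match
        System.out.println(s.search("lucene search")); // [1, 0]
    }
}
```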

Conclusions

If IR systems are to satisfy the demand for applications that
can manage an ever-increasing repository of information,
they must be able to efficiently add documents to large existing
collections. The main bottleneck in that operation is
updating the index structure used to manage the collection.
The traditional solution to this problem is to re-index the
entire collection, an operation with costs proportional to
the size of the whole collection. This solution is clearly
unacceptable.
We have proposed an alternative solution that yields
costs proportional to the size of the update. Using the
data management facilities of a persistent object store, we
have designed a more sophisticated inverted file index that
provides fast incremental updates. More importantly, we
have implemented our scheme in an operational full-text
information retrieval system and verified its performance
empirically.
The results we present show that our scheme maintains
a nearly constant per posting update cost as the size of the
collection grows, indicating excellent potential for scale. In
fact, we have used our scheme to index the full 2 Gbyte TIPSTER
collection in 13 batches and have found the trends
described in Section 4 to hold. Our scheme requires considerably
less disk space during indexing than traditional
techniques, and allows much of the processing for a new
batch of documents to be done independently from the existing
index. This last point is particularly important for
the eventual support of simultaneous query processing and
collection updating, since the period of time during which
index structures must be locked for updating can be minimized.
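The point about processing a new batch independently of the existing index can be sketched as a build-then-merge step: the batch's postings are constructed on their own, and only the final merge touches the main index, so the lock window is proportional to the batch rather than the collection. This is a hedged illustration of the general idea, not the paper's actual persistent-object-store implementation; all names below are invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: index a batch of documents into its own postings map, then merge
// that map into the main index. Only merge() needs exclusive access to the
// main index, so most of the batch's processing happens independently.
class BatchMergeSketch {
    static Map<String, List<Integer>> buildBatch(List<String> docs, int firstDocId) {
        Map<String, List<Integer>> batch = new HashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = firstDocId + i;
            for (String term : docs.get(i).toLowerCase().split("\\W+"))
                if (!term.isEmpty())
                    batch.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
        return batch;
    }

    static void merge(Map<String, List<Integer>> mainIndex,
                      Map<String, List<Integer>> batch) {
        // Append each batch posting list to the corresponding main list;
        // cost is proportional to the size of the update, not the collection.
        batch.forEach((term, ids) ->
            mainIndex.computeIfAbsent(term, t -> new ArrayList<>()).addAll(ids));
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> mainIndex = new HashMap<>();
        merge(mainIndex, buildBatch(List.of("old document"), 0));
        merge(mainIndex, buildBatch(List.of("new incremental document"), 1));
        System.out.println(mainIndex.get("document")); // [0, 1]
    }
}
```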
We found that best performance is achieved when documents
are added in the largest batches possible, both in terms
of incremental indexing time and resultant query processing
speed. We have also shown that our scheme provides
a good level of performance with small batch updates, and
have suggested techniques to improve both small batch update
and query processing performance. These techniques
bear further investigation and represent future work.
Finally, we have achieved these results using “off-the-shelf”
data management technology, continuing to show
that the data management facilities in IR systems need not
be custom built to achieve high performance.
