
Wednesday, September 4, 2013

lucene query objects

Searching by term: TermQuery

TermQuerys are especially useful for retrieving documents by a key. If documents were indexed using Field.Index.NOT_ANALYZED, the same value can be used to retrieve these documents.

Term t = new Term("isbn", "9781935182023");
Query query = new TermQuery(t);
TopDocs docs = searcher.search(query, 10);

Searching within a term range: TermRangeQuery

Only use this query for textual ranges, such as finding all names that begin with N through Q. The following code illustrates TermRangeQuery, searching for all books whose title begins with any letter from d to j. Note that the title2 field in our book index is simply the lowercased title, indexed as a single token using Field.Index.NOT_ANALYZED_NO_NORMS:
TermRangeQuery query = new TermRangeQuery("title2", "d", "j", true, true);
TopDocs matches = searcher.search(query, 100);  
       
Searching within a numeric range: NumericRangeQuery

 If you indexed your field with NumericField, you can efficiently search a particular range for that field using NumericRangeQuery. Under the hood, Lucene translates the requested range into the equivalent set of brackets in the previously indexed trie structure. Each bracket is a distinct term in the index whose documents are OR’d together.  

NumericRangeQuery query =
    NumericRangeQuery.newIntRange("pubmonth", 200605, 200609, true, true);
TopDocs matches = searcher.search(query, 10);
 
Searching on a string: PrefixQuery    

PrefixQuery matches documents containing terms beginning with a specified string.
 
Term term = new Term("category", "/technology/computers/programming");
PrefixQuery query = new PrefixQuery(term);
TopDocs matches = searcher.search(query, 10);        // matches this category and all subcategories
int programmingAndBelow = matches.totalHits;
matches = searcher.search(new TermQuery(term), 10);  // matches only the exact category term
int justProgramming = matches.totalHits;

Combining queries: BooleanQuery

The query types discussed here can be combined in complex ways using BooleanQuery, which is a container of Boolean clauses. A clause is a subquery that can be required, optional, or prohibited.

TermQuery searchingBooks = new TermQuery(new Term("subject", "search"));
Query books2010 = NumericRangeQuery.newIntRange("pubmonth", 201001, 201012,
                                                true, true);
BooleanQuery searchingBooks2010 = new BooleanQuery();
searchingBooks2010.add(searchingBooks, BooleanClause.Occur.MUST);
searchingBooks2010.add(books2010, BooleanClause.Occur.MUST);
Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
TopDocs matches = searcher.search(searchingBooks2010, 10);
   
BooleanQuery.add has two overloaded method signatures. One accepts only a BooleanClause, and the other accepts a Query and a BooleanClause.Occur instance. A BooleanClause is simply a container of a Query and a BooleanClause.Occur instance, so we omit coverage of it. BooleanClause.Occur.MUST means exactly that: only documents matching that clause are considered. BooleanClause.Occur.SHOULD means the term is optional. BooleanClause.Occur.MUST_NOT means any documents matching this clause are excluded from the results.      
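As a rough sketch of the other two occur values (the subject terms here are made up, and searcher is assumed to be the IndexSearcher opened above), a query that requires at least one of two optional terms while excluding a third could be built like this:

TermQuery java = new TermQuery(new Term("subject", "java"));
TermQuery lucene = new TermQuery(new Term("subject", "lucene"));
TermQuery outdated = new TermQuery(new Term("subject", "outdated"));

BooleanQuery eitherTopic = new BooleanQuery();
eitherTopic.add(java, BooleanClause.Occur.SHOULD);        // optional; when no MUST clause is present,
eitherTopic.add(lucene, BooleanClause.Occur.SHOULD);      // at least one SHOULD clause has to match
eitherTopic.add(outdated, BooleanClause.Occur.MUST_NOT);  // documents matching this clause are excluded

TopDocs matches = searcher.search(eitherTopic, 10);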

BooleanQuerys are restricted to a maximum number of clauses; 1,024 is the default. This limitation is in place to prevent queries from accidentally adversely affecting performance. A TooManyClauses exception is thrown if the maximum is exceeded.    

Should you ever have the unusual need of increasing the number of clauses allowed, there’s a setMaxClauseCount(int) method on BooleanQuery, but be aware of the performance cost of executing such queries.
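If you do hit the limit, a minimal sketch of handling it might look like the following (hugeQuery is a hypothetical BooleanQuery with more than 1,024 clauses, and the new limit of 2,048 is an arbitrary example):

try {
    searcher.search(hugeQuery, 10);
} catch (BooleanQuery.TooManyClauses e) {
    BooleanQuery.setMaxClauseCount(2048);  // static, so the new limit applies JVM-wide
    TopDocs matches = searcher.search(hugeQuery, 10);
}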

Searching by phrase: PhraseQuery

An index by default contains positional information of terms, as long as you didn’t create pure Boolean fields by indexing with the omitTermFreqAndPositions option. PhraseQuery uses this information to locate documents where terms are within a certain distance of one another. The maximum allowable positional distance between terms to be considered a match is called slop. Distance is the number of positional moves of terms needed to reconstruct the phrase in order.

Let’s take a sample phrase and see how the slop factor plays out. First we need a little test infrastructure, which includes a setUp() method to index a single document, a tearDown() method to close the directory and searcher, and a custom matched(String[], int) method to construct, execute, and assert that a phrase query matched the test document:

protected void setUp() throws IOException {
  dir = new RAMDirectory();
  IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
                                       IndexWriter.MaxFieldLength.UNLIMITED);
  Document doc = new Document();
  doc.add(new Field("field", "the quick brown fox jumped over the lazy dog",
                    Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close();
  searcher = new IndexSearcher(dir);
}

private boolean matched(String[] phrase, int slop) throws IOException {
  PhraseQuery query = new PhraseQuery();
  query.setSlop(slop);
  for (String word : phrase) {
    query.add(new Term("field", word));
  }
  TopDocs matches = searcher.search(query, 10);
  return matches.totalHits > 0;
}

PhraseQuery supports multiple-term phrases. Regardless of how many terms are used
for a phrase, the slop factor is the maximum total number of moves allowed to put the
terms in order.
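
For instance, assuming the setUp() fixture above (which indexes “the quick brown fox jumped over the lazy dog”), the matched() helper could be exercised like this:

assertTrue(matched(new String[] {"quick", "brown", "fox"}, 0));  // exact phrase occurs in the document
assertFalse(matched(new String[] {"quick", "fox"}, 0));          // "quick fox" is not an exact phrase
assertTrue(matched(new String[] {"quick", "fox"}, 1));           // one move closes the gap over "brown"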

Searching by wildcard: WildcardQuery

Wildcard queries let you query for terms with missing pieces but still find matches.
Two standard wildcard characters are used: * for zero or more characters, and ? for
a single character.

private void indexSingleFieldDocs(Field[] fields) throws Exception {
  IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
                                       IndexWriter.MaxFieldLength.UNLIMITED);
  for (Field f : fields) {
    Document doc = new Document();
    doc.add(f);
    writer.addDocument(doc);
  }
  writer.optimize();
  writer.close();
}

public void testWildcard() throws Exception {
  indexSingleFieldDocs(new Field[] {
      new Field("contents", "wild", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "child", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "mild", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "mildew", Field.Store.YES, Field.Index.ANALYZED)
  });

  IndexSearcher searcher = new IndexSearcher(directory);
  Query query = new WildcardQuery(new Term("contents", "?ild*"));
  TopDocs matches = searcher.search(query, 10);
  assertEquals("child no match", 3, matches.totalHits);
  assertEquals("score the same", matches.scoreDocs[0].score,
               matches.scoreDocs[1].score, 0.0);
  assertEquals("score the same", matches.scoreDocs[1].score,
               matches.scoreDocs[2].score, 0.0);
  searcher.close();
}

Searching for similar terms: FuzzyQuery

Lucene’s FuzzyQuery matches terms similar to a specified term. The Levenshtein distance algorithm determines how similar terms in the index are to a specified target term. Edit distance is another name for Levenshtein distance; it’s a measure of similarity between two strings, where distance is measured as the number of character deletions, insertions, or substitutions required to transform one string into the other. For example, the edit distance between three and tree is 1, because only one character deletion is needed.

public void testFuzzy() throws Exception {
  indexSingleFieldDocs(new Field[] {
      new Field("contents", "fuzzy", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "wuzzy", Field.Store.YES, Field.Index.ANALYZED)
  });

  IndexSearcher searcher = new IndexSearcher(directory);
  Query query = new FuzzyQuery(new Term("contents", "wuzza"));
  TopDocs matches = searcher.search(query, 10);
  assertEquals("both close enough", 2, matches.totalHits);
  assertTrue("wuzzy closer than fuzzy",
             matches.scoreDocs[0].score != matches.scoreDocs[1].score);
  Document doc = searcher.doc(matches.scoreDocs[0].doc);
  assertEquals("wuzza bear", "wuzzy", doc.get("contents"));
  searcher.close();
}
