
Wednesday, November 6, 2013

Tokens and tokenization

Reference: Taming Text (Ingersoll, Morton, and Farris, Manning)

I can't believe that the Carolina Hurricanes won the 2005-2006 Stanley Cup.

Sentence split by whitespace:
I | can't | believe | that | the | Carolina | Hurricanes | won | the | 2005-2006 | Stanley | Cup.

Sentence split by Solr StandardTokenizer:
I | can't | believe | that | the | Carolina | Hurricanes | won | the | 2005 | 2006 | Stanley | Cup

Sentence split by OpenNLP english.Tokenizer:
I | ca | n't | believe | that | the | Carolina | Hurricanes | won | the | 2005-2006 | Stanley | Cup | .

Sentence split by OpenNLP SimpleTokenizer:
I | can | ' | t | believe | that | the | Carolina | Hurricanes | won | the | 2005 | - | 2006 | Stanley | Cup | .
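To reproduce part of the comparison above, here is a minimal sketch using OpenNLP's WhitespaceTokenizer and SimpleTokenizer. It assumes the opennlp-tools jar is on the classpath; the class name is illustrative. (The model-based english.Tokenizer corresponds to TokenizerME loaded with a trained token model such as en-token.bin, which requires a model download and is not shown here.)

```java
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class TokenizerComparison {
    public static void main(String[] args) {
        String sentence = "I can't believe that the Carolina Hurricanes "
                + "won the 2005-2006 Stanley Cup.";

        // Whitespace tokenization: punctuation stays attached ("Cup.")
        String[] ws = WhitespaceTokenizer.INSTANCE.tokenize(sentence);

        // SimpleTokenizer splits on character-class boundaries,
        // so "can't" becomes "can", "'", "t"
        String[] simple = SimpleTokenizer.INSTANCE.tokenize(sentence);

        System.out.println(String.join(" | ", ws));
        System.out.println(String.join(" | ", simple));
    }
}
```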

Other common techniques applied at the token level include these (a small sketch of the first two follows the list):
- Case alterations: lowercasing all tokens can be helpful in searching.
- Stopword removal: filtering out common words like the, and, and a. Commonly occurring words like these often add little value (note we didn't say no value) to applications that don't rely on sentence structure.
- Expansion: adding synonyms or expanding acronyms and abbreviations in a token stream can allow applications to handle alternative inputs from users.
- Part-of-speech tagging: assigning the part of speech to a token; covered in more detail in the following section.
- Stemming: reducing a word to a root or base form, for example, dogs to dog.
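A minimal sketch of case alteration and stopword removal in plain Java; the tiny stopword set and class names here are illustrative, not a real stopword list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TokenFilters {
    // Illustrative stopword set; production lists are much longer
    private static final Set<String> STOPWORDS = Set.of("the", "and", "a", "that");

    public static List<String> normalize(String[] tokens) {
        return Arrays.stream(tokens)
                .map(String::toLowerCase)            // case alteration
                .filter(t -> !STOPWORDS.contains(t)) // stopword removal
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String[] tokens = {"I", "can't", "believe", "that", "the",
                           "Carolina", "Hurricanes", "won"};
        System.out.println(normalize(tokens));
        // -> [i, can't, believe, carolina, hurricanes, won]
    }
}
```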

For part-of-speech tagging, the book uses the OpenNLP Maximum Entropy Tagger, available at http://opennlp.apache.org/. Don't worry too much about the phrase maximum entropy; it's just a way of saying the tagger uses statistics to figure out which part of speech is most likely for a word. The OpenNLP English POS tagger uses part-of-speech tags from the Penn Treebank Project (http://www.cis.upenn.edu/~treebank) to label words in sentences with their part of speech.
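A hedged sketch of running the tagger through the current OpenNLP API. It assumes the pre-trained English maxent model (en-pos-maxent.bin) has been downloaded from the OpenNLP site into the working directory.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.SimpleTokenizer;

public class PosTagExample {
    public static void main(String[] args) throws Exception {
        // Assumption: en-pos-maxent.bin sits in the working directory
        try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
            POSTaggerME tagger = new POSTaggerME(new POSModel(modelIn));

            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                    "The Carolina Hurricanes won the Stanley Cup.");
            // Tags come from the Penn Treebank tag set, e.g. DT, NNP, VBD
            String[] tags = tagger.tag(tokens);

            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}
```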
