Table 1.1 Processing text presents challenges at many levels, from handling character encodings to
inferring meaning in the context of the world around us.
Level | Challenges
Character
– Character encodings, such as ASCII, Shift-JIS, Big 5, Latin-1, UTF-8, UTF-16.
– Case (upper and lower), punctuation, accents, and numbers all require different treatments in different applications.
Words and morphemes
– Word segmentation: dividing text into words. Fairly easy for English and other languages that use whitespace; much harder for languages like Chinese and Japanese.
– Assigning part of speech.
– Identifying synonyms; synonyms are useful for searching.
– Stemming: the process of shortening a word to its base or root form.
For example, a simple stemmer reduces “words” to “word”.
– Abbreviations, acronyms, and spelling also play important roles in understanding words.
Multiword and sentence
– Phrase detection: quick red fox, hockey legend Bobby Orr, and big brown shoe are all examples of phrases.
– Parsing: breaking sentences down into subject-verb and other relationships often yields useful information about words and their relationships to each other.
– Sentence boundary detection is a well-understood problem in English, but is still not perfect.
– Coreference resolution: “Jason likes dogs, but he would never buy one.” In this example, he refers to Jason. The need for coreference resolution can also span sentences.
– Words often have multiple meanings; using the context of a sentence or more may help choose the correct sense. This process is called word sense disambiguation and is difficult to do well.
– Combining the definitions of words and their relationships to each other to determine the meaning of a sentence.
Multisentence and paragraph
At this level, processing becomes more difficult in an effort to find deeper understanding of an author’s intent. Algorithms for summarization often require being able to identify which sentences are more important than others.
Document
Similar to the paragraph level, understanding the meaning of a document often requires knowledge that goes beyond what’s contained in the actual document. Authors often expect readers to have a certain background or possess certain reading skills. For example, most of this book won’t make much sense if you’ve never used a computer and done some programming, whereas most newspapers assume at least a sixth-grade reading level.
Multidocument and corpus
At this level, people want to quickly find items of interest as well as group related documents and read summaries of those documents. Applications that can aggregate and organize facts and opinions and find relationships are particularly useful.
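To make the lower levels of the table concrete, here is a minimal Python sketch of the character, word, and sentence challenges. It is not code from the book: the function names (normalize_chars, tokenize, naive_stem, split_sentences) are hypothetical, the suffix list is an arbitrary toy choice, and only the standard library is used. Real applications would normally rely on a proper tokenizer, stemmer, and sentence detector.

```python
import re
import unicodedata


def normalize_chars(raw, encoding="utf-8"):
    """Character level: decode bytes, fold case, and strip accents."""
    text = raw.decode(encoding)  # raises UnicodeDecodeError on a wrong encoding
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # Drop the combining marks that NFKD splits off from accented letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Word level: runs of letters, digits, and underscores.

    Assumes words are delimited by whitespace or punctuation, so it
    does not segment languages such as Chinese or Japanese.
    """
    return re.findall(r"\w+", text)


def naive_stem(word):
    """Word level: toy stemmer that strips one of a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def split_sentences(text):
    """Sentence level: split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


tokens = tokenize(normalize_chars("Café WORDS".encode("utf-8")))
print(tokens)                           # ['cafe', 'words']
print([naive_stem(t) for t in tokens])  # ['cafe', 'word']
print(split_sentences("Dr. Smith likes dogs. He would never buy one."))
# ['Dr.', 'Smith likes dogs.', 'He would never buy one.']
```

The last call splits after the abbreviation “Dr.”, which illustrates why sentence boundary detection is still not perfect even in English. Likewise, the Latin-1 bytes for “Café” fail to decode as UTF-8, so the encoding has to be known or guessed before any other processing can start.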
From Taming Text