All that is comes from the mind: processing text challenges

Table 1.1 Processing text presents challenges at many levels, from handling character encodings to inferring meaning in the context of the world around us.

Level

Challenges

Character

– Character encodings, such as ASCII, Shift-JIS, Big 5, Latin-1, UTF-8, UTF-16.

– Case (upper and lower), punctuation, accents, and numbers all require different treatments in different applications.

Words and morphemes

– Word segmentation: dividing text into words. Fairly easy for English and other languages that use whitespace; much harder for languages like Chinese and Japanese.

– Assigning part of speech.
– Identifying synonyms; synonyms are useful for searching.
– Stemming: the process of shortening a word to its base or root form. For example, a simple stemming of “words” is “word”.
– Abbreviations, acronyms, and spelling also play important roles in understanding words.

Multiword and sentence

– Phrase detection: quick red fox, hockey legend Bobby Orr, and big brown shoe are all examples of phrases.
– Parsing: breaking sentences down into subject-verb and other relationships often yields useful information about words and their relationships to each other.
– Sentence boundary detection is a well-understood problem in English, but is still not perfect.
– Coreference resolution: “Jason likes dogs, but he would never buy one.” In this example, he is a coreference to Jason. The need for coreference resolution can also span sentences.
– Words often have multiple meanings; using the context of a sentence or more may help choose the correct word. This process is called word sense disambiguation and is difficult to do well.
– Combining the definitions of words and their relationships to each other to determine the meaning of a sentence.
Multisentence and paragraph
At this level, processing becomes more difficult in an effort to find deeper understanding of an author’s intent. Algorithms for summarization often require being able to identify which sentences are more important than others.
Document
Similar to the paragraph level, understanding the meaning of a document often requires knowledge that goes beyond what’s contained in the actual document. Authors often expect readers to have a certain background or possess certain reading skills. For example, most of this book won’t make much sense if you’ve never used a computer and done some programming, whereas most newspapers assume at least a sixth-grade reading level.
Multidocument and corpus
At this level, people want to quickly find items of interest as well as group related documents and read summaries of those documents. Applications that can aggregate and organize facts and opinions and find relationships are particularly useful.
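
The character-level challenges in the table can be made concrete with a small Python sketch. It is only an illustration, not something taken from the book: the helper name normalize_for_matching is made up for this example, and stripping accents and case is just one possible treatment that suits loose matching but would be wrong for applications where case or accents carry meaning.

```python
import unicodedata


def normalize_for_matching(text: str) -> str:
    """Hypothetical helper: strip accents and case for loose matching."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is a more aggressive lowercase intended for comparisons.
    return without_marks.casefold()


word = "Café"

# The same text is a different byte sequence in each encoding.
utf8_bytes = word.encode("utf-8")      # 5 bytes
latin1_bytes = word.encode("latin-1")  # 4 bytes

# Decoding with the wrong encoding silently produces garbage.
garbled = utf8_bytes.decode("latin-1")

print(len(utf8_bytes), len(latin1_bytes))
print(garbled)
print(normalize_for_matching(word))
```

Decoding bytes with the wrong encoding does not raise an error here. It quietly yields wrong characters, which is part of why encodings are listed as a challenge in their own right.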

From Taming Text
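
As a rough illustration of the word-level rows in the table, the following Python sketch splits text into words and applies a very naive stemmer. Both functions (tokenize and naive_stem) and the suffix list are assumptions made up for this example. They are not the approach used in the book, and real stemmers such as the Porter stemmer apply far more careful rules.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())


def naive_stem(word: str) -> str:
    """Strip one common English suffix; a toy rule, not a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


english = tokenize("The quick red fox jumps.")
stems = [naive_stem(w) for w in english]

# Chinese has no whitespace between words, so the same approach returns
# the whole sentence ("I like dogs") as a single token.
chinese = tokenize("我喜欢狗")

print(english)
print(stems)
print(chinese)
```

The pattern-based split works for the English sentence but returns the Chinese sentence as a single token, which is the segmentation difficulty the table describes. The stemmer also mishandles many words (for example, it turns “horses” into “hors”), so it should be read as a sketch of the idea only.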

Friday, August 9, 2013