Fuzzy string matching immediately opens up a number of questions for which the answers aren’t so clear. For instance:
How many characters need to match?
What if the letters are the same but not in the same order?
What if there are extra letters?
Are some letters more important than others?
Different approaches to fuzzy matching answer these questions differently.
Some approaches focus on character overlap as their primary means of looking at string similarity. Other approaches model the order in which the characters occur more directly, whereas still others look at multiple letters simultaneously.
We’ll break these approaches down into three sections.
In the first, titled “Character overlap measures,” we’ll address the character overlap approach by looking at the Jaccard measure and some of its variations, as well as the Jaro-Winkler distance.
Our next section, titled “Edit distance measures,” will look at the character order approach and some of its variations.
Finally, we’ll consider the simultaneous approach in the section titled “N-gram edit distance.”
Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris. “Taming Text.” iBooks.
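To make the contrast in the excerpt concrete, here is a minimal Python sketch of two of the families it names: a character-overlap measure (Jaccard over sets of characters) and an order-sensitive measure (Levenshtein edit distance). This is not the book's code. The function names `jaccard` and `levenshtein` are my own, and treating each string as a plain set of characters is an assumption made for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Character-overlap similarity: |A ∩ B| / |A ∪ B| over sets of characters.

    Assumption for this sketch: character order and repeats are ignored.
    """
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Same letters, different order: the two measures disagree.
    print(jaccard("listen", "silent"))      # 1.0
    print(levenshtein("listen", "silent"))  # 4
```

The pair "listen" and "silent" shows why the question "What if the letters are the same but not in the same order?" matters: the overlap measure scores them as identical, while the edit distance reports four edits.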