|
In information theory and computer science, the Levenshtein distance is a string metric which is one way to measure edit distance. The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.[1] It is useful in applications that need to determine how similar two strings are, such as spell checkers. Not to be confused with information technology, information science, or informatics. ...
Computer science, or computing science, is the study of the theoretical foundations of information and computation and their implementation and application in computer systems. ...
It has been suggested that String Metrics be merged into this article or section. ...
In information theory and computer science, the edit distance between two strings of characters is the number of operations required to transform one of them into the other. ...
In computer programming and formal language theory, (and other branches of mathematics), a string is an ordered sequence of symbols. ...
Vladimir Iosifovich Levenshtein (Russian: ) (born 1935) is a Russian scientist who did research in information theory and error-correcting codes. ...
Year 1965 (MCMLXV) was a common year starting on Friday (link will display full calendar) of the 1965 Gregorian calendar. ...
In computing terms, a spelling checker (also spell checker) is a software program designed to verify the spelling of words in a file, helping a user ensure his/her spelling is correct. ...
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since these three edits change one into the other, and there is no way to do it with fewer than three edits: - kitten → sitten (substitution of 's' for 'k')
- sitten → sittin (substitution of 'i' for 'e')
- sittin → sitting (insert 'g' at the end)
It can be considered a generalization of the Hamming distance, which is used for strings of the same length and only considers substitution edits. There are also further generalizations of the Levenshtein distance that consider, for example, exchanging two characters as an operation, like in the Damerau-Levenshtein distance algorithm. In information theory, the Hamming distance between two strings of equal length is the number of positions for which the corresponding symbols are different. ...
Damerau-Levenshtein distance is a string metric for measuring edit distance in information theory and computer science. ...
The algorithm A commonly-used bottom-up dynamic programming algorithm for computing the Levenshtein distance involves the use of an (n + 1) × (m + 1) matrix, where n and m are the lengths of the two strings. This algorithm is based on the Wagner-Fischer algorithm for edit distance. Here is pseudocode for a function LevenshteinDistance that takes two strings, s of length m, and t of length n, and computes the Levenshtein distance between them: In mathematics and computer science, dynamic programming is a method of solving problems exhibiting the properties of overlapping subproblems and optimal substructure (described below) that takes much less time than naive methods. ...
Pseudocode (derived from pseudo and code) is a compact and informal high-level description of a computer programming algorithm that uses the structural conventions of programming languages, but omits detailed subroutines, variable declarations or language-specific syntax. ...
int LevenshteinDistance(char s[1..m], char t[1..n]) // d is a table with m+1 rows and n+1 columns declare int d[0..m, 0..n] for i from 0 to m d[i, 0] := i for j from 0 to n d[0, j] := j for i from 1 to m for j from 1 to n if s[i] = t[j] then cost := 0 else cost := 1 d[i, j] := minimum( d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion d[i-1, j-1] + cost // substitution ) return d[m, n] Two examples of the resulting matrix (the minimum steps to be taken are highlighted): | | k | i | t | t | e | n | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | | s | 1 | 1 | 2 | 3 | 4 | 5 | 6 | | i | 2 | 2 | 1 | 2 | 3 | 4 | 5 | | t | 3 | 3 | 2 | 1 | 2 | 3 | 4 | | t | 4 | 4 | 3 | 2 | 1 | 2 | 3 | | i | 5 | 5 | 4 | 3 | 2 | 2 | 3 | | n | 6 | 6 | 5 | 4 | 3 | 3 | 2 | | g | 7 | 7 | 6 | 5 | 4 | 4 | 3 | | | | S | a | t | u | r | d | a | y | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | | u | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | | n | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 | | d | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 | | a | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 | | y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 | | The invariant maintained throughout the algorithm is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations. At the end, the bottom-right element of the array contains the answer. In mathematics, an invariant is something that does not change under a set of transformations. ...
This algorithm is essentially part of a solution to the Longest common subsequence problem (LCS), in the particular case of 2 input lists. The longest common subsequence problem (LCS) is finding a longest sequence which is a subsequence of all sequences in a set of sequences (often just two). ...
Proof of correctness As mentioned earlier, the invariant is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations. This invariant holds since: In mathematics, an invariant is something that does not change under a set of transformations. ...
- It is initially true on row and column 0 because
s[1..i] can be transformed into the empty string t[1..0] by simply dropping all i characters. Similarly, we can transform s[1..0] to t[1..j] by simply adding all j characters. - The minimum is taken over three distances, each of which is feasible:
- If we can transform
s[1..i] to t[1..j-1] in k operations, then we can simply add t[j] afterwards to get t[1..j] in k+1 operations. - If we can transform
s[1..i-1] to t[1..j] in k operations, then we can do the same operations on s[1..i] and then remove the original s[i] at the end in k+1 operations. - If we can transform
s[1..i-1] to t[1..j-1] in k operations, we can do the same to s[1..i] and then do a substitution of t[j] for the original s[i] at the end if necessary, requiring k+cost operations. - The operations required to transform
s[1..n] into t[1..m] is of course the number required to transform all of s into all of t, and so d[n,m] holds our result. This proof fails to validate that the number placed in d[i,j] is in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d[i,j] is smaller than the minimum of the three, and use this to show one of the three is not minimal. Reductio ad absurdum (Latin: reduction to the absurd) also known as an apagogical argument, reductio ad impossibile, or proof by contradiction, is a type of logical argument where one assumes a claim for the sake of argument, derives an absurd or ridiculous outcome, and then concludes that the original assumption...
Possible improvements Possible improvements to this algorithm include: - We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
- We can store the number of insertions, deletions, and substitutions separately, or even the positions at which they occur, which is always
j. - We can normalize the distance to the interval
[0,1]. - If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[2]
- We can give different penalty costs to insertion, deletion and substitution. We can also give penalty costs that depend on which characters are inserted, deleted or substituted.
- The initialization of
d[i,0] can be moved inside the main outer loop. - By initializing the first row of the matrix with 0, the algorithm can be used for fuzzy string search of a string in a text [3]. This modification gives the end-position of matching substrings of the text. To determine the start-position of the matching substrings, the number of insertions and deletions can be stored separately and used to compute the start-position from the end-position [4]
- This algorithm parallelizes poorly, due to a large number of data dependencies. However, all the
cost values can be computed in parallel, and the algorithm can be adapted to perform the minimum function in phases to eliminate dependencies. - By examining diagonals instead of rows, and by using lazy evaluation, we can find the Levenshtein distance in O(m (1 + d)) time (where d is the Levenshtein distance), which is much faster than the regular dynamic programming algorithm if the distance is small. [5]
For other uses, see Big O. In computational complexity theory, big O notation is often used to describe how the size of the input data affects an algorithms usage of computational resources (usually running time or memory). ...
For other uses, see Big O. In computational complexity theory, big O notation is often used to describe how the size of the input data affects an algorithms usage of computational resources (usually running time or memory). ...
Fuzzy string searching is the name for a category of techniques for finding one or more substrings of a text that approximately match some given pattern string. ...
Parallel computing is the simultaneous execution of the same task (split up and specially adapted) on multiple processors in order to obtain results faster. ...
A data dependency in computer science is a situation whereby computer instructions refer to the results of preceding instructions that have not yet been completed. ...
In computer programming, lazy evaluation is a technique that attempts to delay computation of expressions until the results of the computation are known to be needed. ...
Upper and lower bounds The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include: - It is always at least the difference of the sizes of the two strings.
- It is at most the length of the longer string.
- It is zero if and only if the strings are identical.
- If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
- If the strings are called
s and t, the number of characters (not counting duplicates) found in s but not in t is a lower bound. In information theory, the Hamming distance between two strings of equal length is the number of positions for which the corresponding symbols are different. ...
See also Image File history File links Wikibooks-logo-en. ...
Wikibooks logo Wikibooks, previously called Wikimedia Free Textbook Project and Wikimedia-Textbooks, is a wiki for the creation of books. ...
Agrep (Approximate grep) is a fuzzy string searching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. ...
The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates-Gonnet algorithm) is a fuzzy string searching algorithm developed by Udi Manber and Sun Wu in 1991 based on work done by Ricardo Baeza-Yates and Gaston Gonnet. ...
Damerau-Levenshtein distance is a string metric for measuring edit distance in information theory and computer science. ...
In computing, diff is a file comparison utility that outputs the differences between two files. ...
Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. ...
Fuzzy string searching is the name for a category of techniques for finding one or more substrings of a text that approximately match some given pattern string. ...
In information theory, the Hamming distance between two strings of equal length is the number of positions for which the corresponding symbols are different. ...
The Hamming weight of a string of bits is the number of 1s in it. ...
The Jaccard index, also known as Jaccard distance and Jaccard similarity coefficent, is a statistic used for comparing the similarity and diversity of sample sets. ...
The Jaro-Winkler distance (Winkler, 1999) is a measure of similarity between two strings. ...
In computer science, Levenshtein automata are a family of finite state automata that can recognize the set V of all words in a formal language for which the Levenshtein distance to an arbitrary word W does not exceed a particular constant. ...
The longest common subsequence problem (LCS) is finding a longest sequence which is a subsequence of all sequences in a set of sequences (often just two). ...
Lucene is a free/open source information retrieval library, originally implemented in Java. ...
Taxicab geometry, considered by Hermann Minkowski in the 19th century, is a form of geometry in which the usual metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the (absolute) differences of their coordinates. ...
In mathematics, a metric space is a set where a notion of distance between elements of the set is defined. ...
The NeedlemanâWunsch algorithm performs a global alignment on two sequences (called A and B here). ...
Several equivalence relations in mathematics are called similarity. ...
Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. ...
The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences. ...
References - ^ В.И. Левенштейн (1965) Двоичные коды с исправлением выпадений, вставок и замещений символов. Доклады Академий Наук СССР 163.4:845–848. Appeared in English as: V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10 (1966):707–710.
- ^ Dan Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA, 1997.
- ^ Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.
- ^ Bruno Woltzenlogel Paleo. An approximate gazetteer for GATE based on levenshtein distance. Student Section of the European Summer School in Logic, Language and Information (ESSLLI), 2007.
- ^ L. Allison, Lazy Dynamic-Programming can be Eager. Inf. Proc. Letters 43(4) pp207-212, Sept' 1992 http://www.csse.monash.edu.au/~lloyd/tildeStrings/Alignment/92.IPL.html
External links General Architecture for Text Engineering or GATE is a Java software toolkit originally developed at the University of Sheffield since 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for all sorts of natural language processing tasks, including information extraction in many languages. ...
Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. ...
Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured or semistructured information from unstructured machine-readable documents. ...
|