|
String searching algorithms are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text. In various branches of mathematics and computer science, strings are sequences of various simple objects (symbols, tokens, characters, etc. ...
Generally, string is a thin piece of fiber which is used to tie, bind, or hang other objects. ...
A pattern is a form, template, or model (or, more abstractly, a set of rules) which can be used to make or to generate things or parts of a thing, especially if the things that are generated have enough in common for the underlying pattern to be inferred or discerned...
Let Σ be an alphabet (finite set). Formally, both the pattern and searched text are concatenation of elements of Σ. The Σ may be usual human alphabet (A-Z). Other applications may use binary alphabet (Σ = {0,1}) or DNA alphabet (Σ = {A,C,G,T}) in bioinformatics. Bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. ...
Basic classification
The various algorithms can be classified by the number of patterns each uses. Flowcharts are often used to represent algorithms. ...
Single pattern algorithms where m is the length of the pattern and n is the length of the searchable text. The Rabin-Karp algorithm is a string searching algorithm that seeks a pattern, i. ...
In the theory of computation, a finite state machine (FSM) or finite state automaton (FSA) is an abstract machine that has only a finite, constant amount of memory. ...
The Knuth-Morris-Pratt string searching algorithm searches for occurrences of a pattern string within a main string by employing the simple observation that when a mismatch occurs, we have enough knowledge simply by possessing the pattern to determine where the next match could begin, thus bypassing re-examination of...
The Boyer-Moore string search algorithm is a particularly efficient string searching algorithm. ...
The bitap algorithm computes the levenshtein distance between two strings assuming it is smaller than a small integer n. ...
Algorithms using finite set of patterns The Aho-Corasick algorithm is a string searching algorithm discovered by Alfred V. Aho and Margaret J. Corasick. ...
Algorithms using infinite number of patterns Naturally, the patterns can not be enumerated in this case. They are represented usually by a regular grammar or regular expression. In computer science a right regular grammar is a formal grammar (N, Σ, P, S) such that all the production rules in P are of one of the following forms: A → a - where A is a non-terminal in N and a is a terminal in Σ A → aB - where...
A regular expression (abbreviated as regexp, regex, or regxp, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings, according to certain syntax rules. ...
Other classification Other classification approaches are possible. One of the most common uses preprocessing as main criteria. Classes of string searching algorithms | Text not preprocessed | Text preprocessed | | Patterns not preprocessed | Elementary algorithms | Index methods | | Patterns preprocessed | Constructed search engines | Signature methods | Naïve string search The simplest and least efficient way to see where one string occurs inside another is to check each place it could be, one by one, to see if it's there. So first we see if there's a copy of the needle in the first few characters of the haystack; if not, we look to see if there's a copy of the needle starting at the second character of the haystack; if not, we look starting at the third character, and so forth. In the normal case, we only have to look at one or two characters for each wrong position to see that it is a wrong position, so in the average case, this takes O(n + m) steps, where n is the length of the haystack and m is the length of the needle; but in the worst case, searching for a string like "aaaab" in a string like "aaaaaaaaab", it takes O(nm) steps. The Big O notation is a mathematical notation used to describe the asymptotic behavior of functions. ...
The Big O notation is a mathematical notation used to describe the asymptotic behavior of functions. ...
Stubs KMP computes a deterministic finite state automaton that recognizes inputs with the string to search for as a suffix, so it doesn't need to back up. Boyer-Moore starts searching from the end of the needle, so it can usually jump ahead a whole needle-length at each step. Baeza-Yates and Gonnet uses bits in a word to keep track of whether the previous j characters were a prefix of the search string, and is therefore adaptable to fuzzy string searching. The bitap algorithm is an application of Baeza-Yates' approach. In the theory of computation, a finite state machine (FSM) or finite state automaton (FSA) is an abstract machine that has only a finite, constant amount of memory. ...
Fuzzy string searching is the name for a category of techniques for finding one or more substrings of a text that approximately match some given pattern string. ...
The bitap algorithm computes the levenshtein distance between two strings assuming it is smaller than a small integer n. ...
Index methods Faster search allow algorithms based on preprocessing of the text. After building an index, for example suffix tree or suffix array, these algorithms allow to find the pattern very fast, using binary search in the index. Suffix tree for the string BANANA$. Suffix links drawn dashed. ...
A suffix array (invented by Gene Myers and Udi Manber) is an indexing structure for locating substrings in a larger string. ...
In computer science, binary search is a search algorithm for searching a set of sorted data for a particular value. ...
External links - Huge (maintained) list of pattern matching links
- StringSearch - Implementations of many String-Matching-Algorithms in Java (BNDM, Boyer-Moore-Horspool, Boyer-Moore-Horspool-Raita, Shift-Or)
- Exact String Matching Algorithms—Animation in Java
|