|
The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates-Gonnet algorithm) is a fuzzy string searching algorithm developed by Udi Manber and Sun Wu in 1991[1][2] based on work done by Ricardo Baeza-Yates and Gaston Gonnet[3]. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast. Fuzzy string searching is the name for a category of techniques for finding one or more substrings of a text that approximately match some given pattern string. ...
Udi Manber is one of the authors of Agrep and GLIMPSE. He also has an interest in computer networks. ...
Ricardo Baeza-Yates is a Chilean researcher and a director of the Center of the Web Research at the University of Chile. ...
Gastón H. Gonnet is a Uruguayan computer scientist and entrepreneur. ...
In information theory, the Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution. ...
In computer science, a mask is some data that, along with an operation, are used in order to extract information stored elsewhere. ...
In computer programming, a bitwise operation operates on one or two bit patterns or binary numerals at the level of their individual bits. ...
The bitap algorithm is perhaps best known as one of the underlying algorithms of the Unix utility agrep, written by Manber, Wu. Manber and Wu's original paper gives extensions of the algorithm to deal with fuzzy matching of general regular expressions. Wikibooks has more about this subject: Guide to UNIX Unix or UNIX is a computer operating system originally developed in the 1960s and 1970s by a group of AT&T Bell Labs employees including Ken Thompson, Dennis Ritchie, and Douglas McIlroy. ...
A programming tool is a program or application that software developers use to create, debug, or maintain other programs and applications. ...
Agrep (Approximate grep) is a fuzzy string searching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. ...
A regular expression (abbreviated as regexp, regex, or regxp, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings, according to certain syntax rules. ...
Due to the data structures required by the algorithm, it performs best on patterns less than a constant length (typically the word length of the machine in question), and also prefers inputs over a small alphabet. Once it has been implemented for a given alphabet and word length m, however, its running time is completely predictable — it runs in O(mn) operations, no matter the structure of the text or the pattern. In computer hardware terminology, word size (word length) is the number of bits that a CPU can process at one time (the word). ...
To analyze an algorithm is to determine the amount of resources (such as time and storage) necessary to execute it. ...
Big O notation is a mathematical notation used to describe the asymptotic behavior of functions. ...
Exact searching
The bitap algorithm for exact string searching, in full generality, looks like this when implemented in C: String searching algorithms are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text. ...
The C Programming Language, Brian Kernighan and Dennis Ritchie, the original edition that served for many years as an informal specification of the language The C programming language is a standardized imperative computer programming language developed in the early 1970s by Dennis Ritchie for use on the Unix operating system. ...
#include <stdlib.h> #include <string.h> typedef char BIT; const char *bitap_search(const char *text, const char *pattern) { const char *result = NULL; int m = strlen(pattern); BIT *R; int i, k; if (pattern[0] == '0') return text; /* Initialize the bit array R */ R = malloc((m+1) * sizeof *R); for (k=1; k <= m; ++k) R[k] = 0; R[0] = 1; for (i=0; text[i] != '0'; ++i) { /* Update the bit array */ for (k=m; k >= 1; --k) R[k] = R[k-1] && (text[i] == pattern[k-1]); if (R[m]) { result = (text+i - m) + 1; break; } } free(R); return result; } Bitap distinguishes itself from other well-known string searching algorithms in its natural mapping onto simple bitwise operations, as in the following modification of the above program. Notice that in this implementation, counterintuitively, each bit with value zero indicates a match, and each bit with value 1 indicates a non-match. The same algorithm can be written with the intuitive semantics for 0 and 1, but in that case we must introduce another instruction into the inner loop to set R |= 1. In this implementation, we take advantage of the fact that right-shifting a value shifts in zeros on the right, which is precisely the behavior we need. In computer programs, an important form of control flow is the loop. ...
Notice also that we require CHAR_MAX additional bitmasks in order to convert the (text[i] == pattern[k-1]) condition in the general implementation into bitwise operations. Therefore, the bitap algorithm performs better when applied to inputs over smaller alphabets. #include <string.h> #include <limits.h> const char *bitap_bitwise_search(const char *text, const char *pattern) { int m = strlen(pattern); unsigned long R; unsigned long pattern_mask[CHAR_MAX+1]; int i; if (pattern[0] == '0') return text; if (m > 31) return "The pattern is too long!"; /* Initialize the bit array R */ R = ~1; /* Initialize the pattern bitmasks */ for (i=0; i <= CHAR_MAX; ++i) pattern_mask[i] = ~0; for (i=0; i < m; ++i) pattern_mask[pattern[i]] &= ~(1UL << i); for (i=0; text[i] != '0'; ++i) { /* Update the bit array */ R |= pattern_mask[text[i]]; R <<= 1; if (0 == (R & (1UL << m))) return (text+i - m) + 1; } return NULL; } Fuzzy searching To perform fuzzy string searching using the bitap algorithm, it is necessary to extend the bit array R into a second dimension. Instead of having a single array R that changes over the length of the text, we now have k distinct arrays R1..k. Array Ri holds a representation of the prefixes of pattern that match any suffix of the current string with k or fewer errors. In this context, an "error" may be an insertion, deletion, or substitution; see Levenshtein distance for more information on these operations. In information theory, the Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution. ...
The implementation below performs fuzzy matching (returning the first match with up to k errors) using the fuzzy bitap algorithm. However, it only pays attention to substitutions, not to insertions or deletions. As before, the semantics of 0 and 1 are reversed from their intuitive meanings. #include <stdlib.h> #include <string.h> #include <limits.h> const char *bitap_fuzzy_bitwise_search(const char *text, const char *pattern, int k) { const char *result = NULL; int m = strlen(pattern); unsigned long *R; unsigned long pattern_mask[CHAR_MAX+1]; int i, d; if (pattern[0] == '0') return text; if (m > 31) return "The pattern is too long!"; /* Initialize the bit array R */ R = malloc((k+1) * sizeof *R); for (i=0; i <= k; ++i) R[i] = ~1; /* Initialize the pattern bitmasks */ for (i=0; i <= CHAR_MAX; ++i) pattern_mask[i] = ~0; for (i=0; i < m; ++i) pattern_mask[pattern[i]] &= ~(1UL << i); for (i=0; text[i] != '0'; ++i) { /* Update the bit arrays */ unsigned long old_Rd1 = R[0]; R[0] |= pattern_mask[text[i]]; R[0] <<= 1; for (d=1; d <= k; ++d) { unsigned long tmp = R[d]; /* Substitution is all we care about */ R[d] = (old_Rd1 & (R[d] | pattern_mask[text[i]])) << 1; old_Rd1 = tmp; } if (0 == (R[k] & (1UL << m))) { result = (text+i - m) + 1; break; } } return result; } External links and references - ^ Udi Manber, Sun Wu. "Fast text searching with errors." Technical Report TR-91-11. Department of Computer Science, University of Arizona, Tucson, June 1991. (gzipped PostScript)
- ^ Udi Manber, Sun Wu. "Fast text search allowing errors." Communications of the ACM, 35(10), October 1992.
- ^ Ricardo A. Baeza-Yates, Gastón H. Gonnet. "A New Approach to Text Searching." Communications of the ACM, 35(10): pp. 74–82, October 1992.
- Libbitap, a free implementation that shows how the algorithm can easily be extended for most regular expressions. Unlike the code above, it places no limit on the pattern length.
|