|
A hash function is any well-defined procedure or mathematical function for turning some kind of data into a relatively small integer, that may serve as an index into an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes. Flowcharts are often used to graphically represent algorithms. ...
This article is about functions in mathematics. ...
For other uses, see Data (disambiguation). ...
The integers consist of the positive natural numbers (1, 2, 3, …) the negative natural numbers (−1, −2, −3, ...) and the number zero. ...
In computer science, an index can be: an integer which identifies an array element a pointer data element. ...
For the DNA microarray in life sciences, see DNA microarray. ...
Hash functions are mostly used to speed up table lookup or data comparison tasks — such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on. This article is principally about managing and structuring the collections of data held on computers. ...
In computer science, a database record is a description of a single item as stored in a database. ...
A file in a computer system is a stream (sequence) of bits stored as a single unit, typically in a file system on disk or magnetic tape. ...
Look up nucleic acid in Wiktionary, the free dictionary. ...
Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomizing functions, error correcting codes, and cryptographic hash functions. Although these concepts overlap to some extent, each has its own uses and requirements. The HashKeeper database maintained by the National Drug Intelligence Center, for instance, is more aptly described as a catalog of file fingerprints than of hash values. A checksum is a form of redundancy check, a simple way to protect the integrity of data by detecting errors in data that are sent through space (telecommunications) or time (storage). ...
A check digit is a form of redundancy check used for error detection, the decimal equivalent of a binary checksum. ...
In computer science, a fingerprinting algorithm is a procedure that maps an arbitrarily large data item (such as a computer file) to a much shorter bit string, its fingerprint, that uniquely identifies the original data for all practical purposes [1]. Fingerprints are typically used to avoid the comparison and transmission...
It has been suggested that this article or section be merged with Forward error correction. ...
In cryptography, a cryptographic hash function is a hash function with certain additional security properties to make it suitable for use as a primitive in various information security applications, such as authentication and message integrity. ...
HashKeeper is a database application of value primarily to those conducting forensic examinations of computers on a somewhat regular basis. ...
The National Drug Intelligence Center (NDIC), established in 1993, is a component of the U.S. Department of Justice and a member of the Intelligence Community. ...
A typical hash function at work Image File history File links Hash_function. ...
Image File history File links Hash_function. ...
Applications
Hash tables Hash functions are mostly used in hash tables, to quickly locate a data record (for example, a dictionary definition) given its search key (the headword). Specifically, the hash function is used to map the search key to the index of a slot in the table where the corresponding record is supposedly stored. In computer science, a hash table is a data structure that speeds up searching for information by a particular aspect of that information, called a key. ...
For other uses, see Dictionary (disambiguation). ...
In general, a hashing function may map several different keys to the same hash value. Therefore, each slot of a hash table contains (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called bucket indices. This article is about sets in mathematics. ...
Thus, the hash function only hints at the record's location — it only tells where one should start looking for it. Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries. In the Java programming language, for example, hash functions are central to the way the language allows objects to be stored in hashing collections such as the standard HashMap and HashSet classes. To this end, Java's Object parent class provides a hashCode() method mandating the generation of a 32-bit integer hash value. Java is an object-oriented programming language developed by James Gosling and colleagues at Sun Microsystems in the early 1990s. ...
Finding duplicate records To find duplicated records in a large unsorted file, one may use a hash function to map each file record to an index into a table T, and collect in each bucket T[i] a list of the numbers of all records with the same hash value i. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every bucket T[i] which contains two or more members, fetching those records, and comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs). Look up list in Wiktionary, the free dictionary. ...
Finding similar records Hash functions can also be used to locate table records whose key is similar, but not identical, to a given key; or pairs of records in a large file which have similar keys. For that purpose, one needs a hash function that maps similar keys to hash values that differ by at most m, where m is a small integer (say, 1 or 2). If one builds a table of T of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in nearby buckets. Then one need only check the records in each bucket T[i] against those in buckets T[i+k] where k ranges between -m and m.
Finding similar substrings The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a document repository or a genomic database. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above. The Rabin-Karp algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings. The Rabin-Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp that uses hashing to find a substring in a text. ...
String searching algorithms are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text. ...
In computational complexity theory, big O notation is often used to describe how the size of the input data affects an algorithms usage of computational resources (usually running time or memory). ...
Geometric hashing This principle is widely used in computer graphics, computational geometry and many other disciplines, to locate close pairs of points in the plane or in three-dimensional space, similar shapes in a list of shapes, similar images in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a bucket grids), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or the grid method. This article is about the scientific discipline of computer graphics. ...
In computer science, computational geometry is the study of algorithms to solve problems stated in terms of geometry. ...
UPIICSA IPN - Binary image Image processing is any form of information processing for which the input is an image, such as photographs or frames of video; the output is not necessarily an image, but can be for instance a set of features of the image. ...
In mathematics, a metric space is a set where a notion of distance between elements of the set is defined. ...
This is a list of partition topics, in the mathematical sense. ...
In computer science, geometric hashing is a method for efficiently finding geometric objects of the same or similar shape, even though they may be rotated or otherwise transformed. ...
Properties Determinism To serve its purpose, a hash function must be fast and deterministic — meaning that two identical or equivalent inputs must generate the same hash value. In computer science, a deterministic algorithm is an algorithm which, in informal terms, behaves predictably. ...
Uniformity A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions — pairs of inputs that are mapped to the same hash value — increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries. Probability is the likelihood or chance that something is the case or will happen. ...
Note that the hash values need only be uniformly distributed, not random in any sense. Thus, a good randomizing function is usually good for hashing, but the converse need not be true. Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries. In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (Ideally, no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m (see the birthday paradox). In probability theory, the birthday paradox states that in a group of 23 (or more) randomly chosen people, there is more than 50% probability that some pair of them will have the same birthday. ...
Variable range In many applications, the range of hash values may be different for each run of the program, or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters — the input data z, and the number n of allowed hash values.
Data normalization In some applications, the input data may contain features that are irrelevant for comparison purposes. When looking up a personal name, for instance, it may be desirable to ignore the distinction between upper and lower case letters. For such data, one must use a hash function that is compatible with the data equivalence criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value. In mathematics, an equivalence relation is a binary relation between two elements of a set which groups them together as being equivalent in some way. ...
Continuity A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values. In mathematics, a continuous function is a function for which, intuitively, small changes in the input result in small changes in the output. ...
Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash functions, and other related concepts. Continuity is desirable for hash functions only in some applications, such as hash tables that use linear search. In computer science, linear search is a search algorithm, also known as sequential search, that is suitable for searching a set of data for a particular value. ...
Hash function algorithms The choice of a hashing function depends strongly on the nature of the input data, and their probability distribution in the intended application. A probability distribution describes the values and probabilities that a random event can take place. ...
Injective and perfect hashing The ideal hashing function should be injective — that is, it should map each valid input to a different hash value. Such a function would directly locate the desired entry in a hash table, without any additional search. One-to-one redirects here. ...
An injective hash function whose range is all integers between 0 and n−1, where n is the number of valid inputs, is said to be perfect. Besides providing single-step lookup, a perfect hash function also results in a compact hash table, without any vacant slots. Perfect hash functions are hash functions which guarantee O(1) operations complexity when used in a hash table. ...
Unfortunately, injective and perfect hash functions exist only in very few special situations (such as mapping month names to the integers 0 to 11); and even then they are often too complicated or expensive to be of practical use. Indeed, hash functions are typically required to map a large set of valid potential inputs to a much smaller range of hash values and therefore cannot be injective.
Use of short search key as indirect index In special situations, such as short, one or two byte search keys, the search key itself can be used as the first (indirect) 'index' to a table of target index values (1-255) - with any gaps in sequence set as null (i.e. index=0) - denoting an 'invalid' key. Consider the search key to be an 8-bit character containing the alphanumeric characters A-Z, (both upper and lower case) and the numeric digits 0-9, with other possible values an error. The 'hash table' can simply be a pre-defined, and pre-compiled, static 256 byte table containing a sequential index (X'00' - 'FF') in the relevant offsets - equating to the search key itself. If no validation is required, (because the validity of the key is already known), the table need only be as large as the range of (high-low) keys. For example if the possible ASCII keys were say A thru Z, a table of only 26 bytes (containing x'00' - x'19') would be needed for the table - providing excellent locality of reference. The 'conversion' from search key to index value can be accomplished in an extremely short fixed time interval - in fact the time to execute around 5 or 6 machine instructions - or even less in some architectures - irrespective of the search value. Similar methods can also be applied for 16-bit keys (range 0-65535) with an index range but in this case requiring a maximum pre-defined table size of around 128K. If the search key range is known however, the table can often be considerably less. Path length in terms of instruction count is similar to that required for 8 bit keys. Once obtained, the index value can be used for any number of direct indexing including use in branch tables to determine program logic where appropriate and available in the language being used. Image:ASCII fullsvg There are 95 printable ASCII characters, numbered 32 to 126. ...
A system of codes directly understandable by a computers CPU is termed this CPUs native or machine language. ...
In computer programming, a branch table (sometimes known as a jump table) is a term used to describe an efficient method of transferring program control (branching) to another part of a program using a table of branch instructions. ...
Hashing uniformly distributed data If the inputs are bounded-length strings (such as telephone numbers, car license plates, invoice numbers, etc.), and each input may independently occur with uniform probability, then a hash function needs only map roughly the same number of inputs to each hash value. For instance, suppose that each input is an integer z in the range 0 to N−1, and the output must be an integer h in the range 0 to n−1, where N is much larger than n. Then the hash function could be h = z mod n (the remainder of z divided by n), or h = (z × n) ÷ N (the value z scaled down by n/N and truncated to an integer), or many other formulas. In computer programming and formal language theory, (and other branches of mathematics), a string is an ordered sequence of symbols. ...
For other uses, see Telephone (disambiguation). ...
An invoice or bill is a commercial document issued by a seller to a buyer, indicating the products, quantities and agreed prices for products or services with which the seller has already provided the buyer. ...
In mathematics, the uniform distributions are simple probability distributions. ...
Hashing data with other distributions These simple formulas will not do if the input values are not equally likely, or are not independent. For instance, most patrons of a supermarket will live in the same geographic area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that case, if n is 10000 or so, the division formula (z × n) ÷ N, which depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula z mod n, which is quite sensitive to the trailing digits, may still yield a fairly even distribution of hash values. Packaged food aisles in a Fred Meyer store in Portland, Oregon A supermarket is a departmentalized self-service store offering a wide variety of food and household merchandise. ...
When the data values are long (or variable-length) character strings — such as personal names, web page addresses, or mail messages — their distribution is usually very uneven, with complicated dependencies. For example, text in any natural language has highly non-uniform distributions of characters, and character pairs, very characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string — and depends on each character in a different way. A Uniform Resource Locator, URL (spelled out as an acronym, not pronounced as earl), or Web address, is a standardized address name layout for resources (such as documents or images) on the Internet (or elsewhere). ...
In the philosophy of language, a natural language (or ordinary language) is a language that is spoken, written, or signed by humans for general-purpose communication, as distinguished from formal languages (such as computer-programming languages or the languages used in the study of formal logic, especially mathematical logic) and...
In general, a character is a distinctive significant mark or feature. ...
Digraph has several meanings: directed graph, or digraph Digraph (orthography) Digraph (computing) This is a disambiguation page: a list of articles associated with the same title. ...
Special-purpose hash functions In many such cases, one can design a special-purpose (heuristic) hash function that yield many fewer collisions than a good general-purpose hash function. For example, suppose that the input data are file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function that extracts the numeric part k of the file name and returns k mod n would be nearly optimal. Needless to say, a function that is exceptionally good for a specific kind of data may have dismal performance on data with different distribution. In computer science, besides the common use as rule of thumb (see heuristic), the term heuristic has two well-defined technical meanings. ...
Hashing with checksum functions One can obtain good general-purpose hash functions for string data by adapting certain checksum or fingerprinting algorithms. Some of those algorithms will map arbitrary long string data z, with any typical real-world distribution — no matter how non-uniform and dependent — to a fixed length bit string, with a fairly uniform distribution. This string can be interpreted as a binary integer k, and turned into a hash value by the formula h = k mod n. This method will produce a fairly even distribution of hash values, as long as the hash range size n is small compared to the range of the checksum function. Bob Jenkins' LOOKUP3 algorithm[1] uses a 32-bit checksum. A 64-bit checksum should provide adequate hashing for tables of any feasible size.
Hashing with cryptographic hash functions Some cryptographic hash functions, such as MD5, have even stronger uniformity guarantees than checksums or fingerprints, and thus can provide very good general-purpose hashing functions. However, the uniformity advantage may be too small to offset their much higher cost. In cryptography, MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value. ...
Audio identification -
For audio identification [2] such as finding out whether an MP3 file matches one of a list of known items, one could use a conventional hash function such as MD5, but this would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another more advanced algorithm is required to find all items that would nonetheless be interpreted as identical to a human listener. Though they are not common[citation needed], hashing algorithms do exist that are robust to these minor differences. There is a service called MusicBrainz which creates a fingerprint for an audio file and matches it to its online community driven database. For acoustic emissions of ships and submarines, see Acoustic signature. ...
For other uses, see MP3 (disambiguation). ...
In cryptography, MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value. ...
In cryptography, MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value. ...
MusicBrainz (MusicBrainz. ...
Origins of the term The term "hash" comes by way of analogy with its standard meaning in the physical world, to "chop and mix". Donald Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953, and that Robert Morris used the term in a survey paper in CACM which elevated the term from technical jargon to formal terminology.[3] Wikipedia does not have an article with this exact name. ...
Wiktionary (a portmanteau of wiki and dictionary) is a multilingual, Web-based project to create a free content dictionary, available in over 151 languages. ...
Donald Ervin Knuth ( or Ka-NOOTH[1], Chinese: [2]) (b. ...
Hans Peter Luhn (July 1, 1896 in Barmen, Germany â 1964) computer scientist for IBM, creator of the Luhn algorithm and KWIC (Key Words In Context) indexing. ...
For other uses, see IBM (disambiguation) and Big Blue. ...
Year 1953 (MCMLIII) was a common year starting on Thursday (link will display full calendar) of the Gregorian calendar. ...
Robert Morris is an American cryptographer. ...
Communications of the ACM (CACM) is the flagship monthly magazine of the Association for Computing Machinery. ...
In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular division. The SHA (Secure Hash Algorithm) family is a set of related cryptographic hash functions designed by the National Security Agency (NSA) and published by the National Institute of Standards and Technology (NIST). ...
Modular arithmetic (sometimes called modulo arithmetic, or clock arithmetic because of its use in the 24-hour clock system) is a system of arithmetic for integers, where numbers wrap around after they reach a certain value â the modulus. ...
See also It has been suggested that this article or section be merged with Universal hash function. ...
The German Lorenz cipher machine, used in World War II for encryption of very high-level general staff messages Cryptography (or cryptology; derived from Greek κÏÏ
ÏÏÏÏ kryptós hidden, and the verb γÏάÏÏ gráfo write or λεγειν legein to speak) is the study of message secrecy. ...
In cryptography, a cryptographic hash function is a hash function with certain additional security properties to make it suitable for use as a primitive in various information security applications, such as authentication and message integrity. ...
A keyed-hash message authentication code, or HMAC, is a type of message authentication code (MAC) calculated using a cryptographic hash function in combination with a secret key. ...
In computer science, geometric hashing is a method for efficiently finding geometric objects of the same or similar shape, even though they may be rotated or otherwise transformed. ...
Distributed hash tables (DHTs) are a class of decentralized distributed systems that provide a lookup service similar to a hash table: (name, value) pairs are stored in the DHT, and any participating node can efficiently retrieve the value associated with a given name. ...
Perfect hash functions are hash functions which guarantee O(1) operations complexity when used in a hash table. ...
Linear Hashing is a dynamic hash table algorithm invented by Witold Litwin in 1980. ...
A rolling hash is a hash function where the input is hashed in a fixed width window that moves through the input. ...
The Rabin-Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp that uses hashing to find a substring in a text. ...
Wikipedia does not have an article with this exact name. ...
The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. ...
In computer science, a hash table is a data structure that speeds up searching for information by a particular aspect of that information, called a key. ...
In computer science, a hash list can mean any kind of list of hashes. ...
In computer science, hash trees, also known as Merkle (hash) trees or Tiger tree hashes, are an extension of the simpler concept hash list, which in turn is an extension of the old concept of hashing, for instance, a file. ...
Coalesced Hashing example Coalesced hashing, also called coalesced chaining, is a strategy of collision resolution in a hash table that forms a hybrid of separate chaining and open addressing. ...
In computer chess and other computer games, transposition tables are used to speed up the search of the game tree. ...
This is a list of hash functions including cyclic redundancy checks, checksum functions, and cryptographic hash functions. ...
Notes - ^ In the remainder of this article, the term function is used to refer to algorithms as well as the functions they compute.
Flowcharts are often used to graphically represent algorithms. ...
This article is about functions in mathematics. ...
References - ^ Jenkins, Bob (September, 1997), Hash Functions, “Algorithm Alley”, Dr. Dobb's Journal, <http://www.ddj.com/184410284>
- ^ "Robust Audio Hashing for Content Identification by Jaap Haitsma, Ton Kalker and Job Oostveen"
- ^ Knuth, Donald (1973). The Art of Computer Programming, volume 3, Sorting and Searching, 506-542.
Donald Ervin Knuth ( or Ka-NOOTH[1], Chinese: [2]) (b. ...
Cover of books The Art of Computer Programming[1] is a comprehensive monograph written by Donald Knuth which covers many kinds of programming algorithms and their analysis. ...
External links - General purpose hash function algorithms (C/C++/Pascal/Java/Python/Ruby)
- Hash Functions and Block Ciphers by Bob Jenkins
- Integer Hash Function by Thomas Wang
- The Goulburn Hashing Function by Mayur Patel
- Hash Functions by Paul Hsieh
- HSH 11/13 by Herbert Glarner
- Online Char (ASCII), HEX, Binary, Base64, etc... Encoder/Decoder with MD2, MD4, MD5, SHA1+2, etc. hashing algorithms
- FNV Fowler, Noll, Vo Hash Function
|