In text retrieval, full text search (also called free text search [citation needed]) is a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match the search words supplied by the user. Full-text searching techniques became common in online bibliographic databases in the 1970s. Most Web sites and application programs (such as word processing software) provide full text search capabilities. Some Web search engines, such as AltaVista, employ full text search techniques, while others index only a portion of the Web pages examined by their indexing systems.[1]
Indexing
When dealing with a small number of documents, the full-text search engine can scan the contents of the documents directly with each query, a strategy called serial scanning. This is what some rudimentary tools, such as grep, do when searching.
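Serial scanning can be illustrated with a minimal Python sketch. The function name `serial_scan` and the two sample documents are invented for illustration; the matching is a plain case-insensitive substring test, which is only a rough stand-in for what a tool such as grep does.

```python
def serial_scan(documents, term):
    """Return the ids of documents whose text contains `term` (case-insensitive)."""
    needle = term.lower()
    # Every query re-reads every document, so the cost of each search
    # grows with the total size of the collection.
    return [doc_id for doc_id, text in documents.items() if needle in text.lower()]


# Hypothetical two-document collection.
documents = {
    "a.txt": "Full text search examines every word.",
    "b.txt": "An index avoids scanning each document.",
}
matches = serial_scan(documents, "index")
```

Because nothing is precomputed, this approach is only practical while the collection stays small.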
However, when the number of documents to search is potentially large, or the number of search queries to perform is substantial, the problem of full text search is often divided into two tasks: indexing and searching. The indexing stage scans the text of all the documents and builds a term index or concordance. In the search stage, a specific query is answered by consulting only the index rather than the text of the original documents.
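The split into an indexing stage and a search stage might look roughly like the following Python sketch. It builds a simple inverted index (term to set of document ids); the names `tokenize`, `build_index` and `search`, and the sample documents, are assumptions made for this example rather than any particular engine's design.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Indexing stage: scan every document once and record which terms it contains."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, term):
    """Search stage: consult only the index, never the original text."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample collection.
documents = {1: "The cat sat on the mat", 2: "The dog chased the cat", 3: "Dogs bark"}
index = build_index(documents)
hits = search(index, "cat")
```

A lookup now costs roughly one dictionary access, regardless of how much text was indexed.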
The indexer makes an entry in the index for each term or word found in a document, and possibly records its relative position within the document. Usually the indexer ignores stop words, such as the English "the", which are both too common and carry too little meaning to be useful for searching. Some indexers also apply language-specific stemming to the words being indexed, so that, for example, any of the words "drives", "drove", or "driven" is recorded in the index under a single concept word, "drive".
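A minimal sketch of these indexer steps in Python follows. The stop-word set and the `STEMS` lookup table are tiny hand-written stand-ins chosen for this example; a real indexer would use a fuller stop list and an actual stemming algorithm. Positions are counted before stop words are dropped, which is one possible convention among several.

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny stop list.
STOP_WORDS = {"the", "a", "an", "of", "and"}
# Toy lookup table standing in for a real language-specific stemmer.
STEMS = {"drives": "drive", "drove": "drive", "driven": "drive"}


def index_terms(text):
    """Map each stemmed, non-stop-word term to the word positions where it occurs."""
    postings = defaultdict(list)
    for position, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        if word in STOP_WORDS:
            continue  # too common to be useful for searching
        postings[STEMS.get(word, word)].append(position)
    return dict(postings)


entry = index_terms("She drove the car and he drives the truck")
```

Here both "drove" and "drives" end up under the single term "drive", and "the" never reaches the index.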
The precision vs. recall tradeoff
Due to the ambiguities of natural language, a full text search typically produces a retrieval list that has low precision: most of the items retrieved are irrelevant. Controlled-vocabulary searching addresses this problem by tagging the documents in such a way that the ambiguities are eliminated. However, a controlled vocabulary search may have low recall: it may fail to retrieve some documents that are actually relevant to the search question. Despite the presence of many irrelevant documents in a free text search's retrieval list, a free text search may be able to locate a document that a controlled vocabulary search failed to retrieve.
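The two measures can be made concrete with a short Python sketch using the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document ids and the two result sets below are invented purely to illustrate the tradeoff.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieval list against the set of relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical outcome: the free text search returns ten documents,
# including all four relevant ones.
free_text = precision_recall(retrieved=range(1, 11), relevant={1, 2, 3, 4})
# Hypothetical outcome: the controlled vocabulary search returns only
# two documents, both relevant.
controlled = precision_recall(retrieved={1, 2}, relevant={1, 2, 3, 4})
```

In this made-up case the free text search scores (0.4, 1.0) and the controlled vocabulary search scores (1.0, 0.5), mirroring the tradeoff described above.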
The false positive problem
As anyone who has performed a free text search will readily recognize, free text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives. The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language; for example, the word "football" might refer to soccer or to American, Canadian, Gaelic, or Australian rules football, whereas the person searching is probably interested in only one of these.
Certain clustering techniques based on Bayesian algorithms (similar to the spam filter in Google) can help reduce false positive errors. In the example above, where the search term is "football", these techniques can sort the document collection into categories such as "American football", "corporate football" and so on. Depending on the words that occur in a document, it can fall into one or more of these categories. This goes one step beyond plain full text search. These techniques are being deployed extensively in the e-discovery domain. [2]
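As a rough illustration of Bayesian categorization, the sketch below uses a small supervised naive Bayes classifier with add-one smoothing. This is a swapped-in, simplified technique chosen for brevity, not the specific clustering method referred to above; the category names, training texts, and the functions `train` and `classify` are all invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the category with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data for two senses of "football".
training = [
    ("american football", "football quarterback touchdown nfl"),
    ("american football", "football touchdown helmet"),
    ("association football", "football goalkeeper penalty fifa"),
    ("association football", "football striker goalkeeper offside"),
]
model = train(training)
label = classify(model, "football goalkeeper saves penalty")
```

The word "football" alone does not separate the categories; it is the co-occurring words ("goalkeeper", "penalty") that tip the score toward one sense.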
Improving the performance of full text searching
The deficiencies of free text searching have been addressed in two ways: by providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.
Improved querying tools
- Keywords. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.
- Field-restricted search. Some search engines enable users to limit free text searches to a particular field within a stored data record, such as "Title" or "Author."
- Boolean queries. Searches that use Boolean operators (for example, "encyclopedia" AND "online" NOT "Encarta") can dramatically increase the precision of a free text search. The AND operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The NOT operator says, in effect, "Do not retrieve any document that contains this word." If the search retrieves too few documents, the OR operator can be used to increase recall; consider, for example, "encyclopedia" AND ("online" OR "Internet") NOT "Encarta". This search will also retrieve documents about online encyclopedias that use the term "Internet" instead of "online." The increase in precision that Boolean operators provide is often counter-productive, however, since it usually comes with a dramatic loss of recall. [2]
- Phrase search. A phrase search matches only those documents that contain a specified phrase, such as "Wikipedia, the free encyclopedia."
- Concordance search. A concordance search produces an alphabetical list of all principal words that occur in a text with their immediate context.
- Proximity search. A proximity search matches only those documents that contain two or more words that occur within a specified number of words of each other; a search for "Wikipedia" WITHIN2 "free" would retrieve only those documents in which the words "Wikipedia" and "free" occur within two words of each other.
- Regular expression. A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.
- Wildcard search. A search that substitutes a wildcard character, such as an asterisk, for one or more characters in a search query. For example, in the search function in Microsoft Word, using the asterisk in the search query "s*n" will find "sin", "son", "sun", etc. in a text.
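Several of the querying tools listed above can be sketched in Python over a small positional index (term to document to word positions). Everything here is illustrative: the sample documents and the helpers `docs`, `phrase` and `within` are invented, the OR is grouped with parentheses as an assumption about the intended precedence, and "within N words" is read as a position difference of at most N.

```python
import fnmatch
import re

# Hypothetical sample collection.
documents = {
    1: "Wikipedia the free encyclopedia online",
    2: "Encarta was an online encyclopedia",
    3: "A free and open Internet encyclopedia like Wikipedia",
}

# Positional index: term -> {doc_id: [word positions]}.
index = {}
for doc_id, text in documents.items():
    for position, term in enumerate(re.findall(r"[a-z]+", text.lower())):
        index.setdefault(term, {}).setdefault(doc_id, []).append(position)


def docs(term):
    """Set of documents containing the term."""
    return set(index.get(term.lower(), {}))


# Boolean query: AND is intersection, OR is union, NOT is set difference.
boolean_hits = (docs("encyclopedia") & (docs("online") | docs("Internet"))) - docs("Encarta")


def phrase(words):
    """Documents in which the words appear consecutively and in order."""
    terms = [w.lower() for w in words.split()]
    postings = [index.get(t, {}) for t in terms]
    candidates = set(postings[0]).intersection(*postings[1:])
    return {
        d for d in candidates
        if any(all(start + i in postings[i][d] for i in range(len(terms)))
               for start in postings[0][d])
    }


def within(term_a, term_b, distance):
    """Documents in which the two terms occur at most `distance` positions apart."""
    a, b = index.get(term_a.lower(), {}), index.get(term_b.lower(), {})
    return {
        d for d in a.keys() & b.keys()
        if any(abs(pa - pb) <= distance for pa in a[d] for pb in b[d])
    }


phrase_hits = phrase("the free encyclopedia")
proximity_hits = within("Wikipedia", "free", 2)
# Wildcard and regular-expression matching against the indexed terms.
wildcard_terms = sorted(t for t in index if fnmatch.fnmatchcase(t, "enc*"))
regex_terms = sorted(t for t in index if re.fullmatch(r"w.*a", t))
```

Storing positions rather than just document ids is what makes phrase and proximity queries answerable from the index alone.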
Improved search algorithms
Technological advances have greatly improved the performance of free text searching. For example, Google's PageRank algorithm gives more prominence to documents to which other Web pages have linked. This algorithm dramatically improves users' perception of search precision, a fact that explains its popularity among Internet users. See search engine for additional examples.
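The idea of giving more prominence to pages that other pages link to can be sketched with a simplified PageRank-style power iteration in Python. This is a textbook-style approximation, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate link-based ranks for a graph given as {page: [pages it links to]}.

    Every link target must itself appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Each page starts with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on, split evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "c", and "c" links to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

In this toy graph "c" ends up with the highest rank because it receives the most incoming links, and "a" ranks second because its only incoming link comes from the highly ranked "c".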
Text retrieval software
The following is a partial list of available software products whose predominant purpose is to perform full text indexing and searching. Some of these are accompanied by detailed descriptions of their theory of operation or internal algorithms, which can provide additional insight into how full text search may be accomplished.
- Fast Search & Transfer ASA (recursive acronym FAST) (OSE: FAST) is a Norwegian company based in Oslo.
- ht://Dig is a free software system for indexing and searching a finite set of sites or an intranet and is licensed under the GNU General Public License.
- Inktomi was a Californian company that provided software for Internet Service Providers, which was founded in 1996 by UC Berkeley professor Eric Brewer and graduate student Paul Gauthier.
- Lucene is a free/open source information retrieval library, originally implemented in Java by Doug Cutting.
- Sphinx is a free software search engine designed with indexing database content in mind.
- Xapian is an Open Source Probabilistic Information Retrieval library, released under the GNU General Public License (GPL).
Notes
- ^ In practice, it may be difficult to determine how a given search engine works. The search algorithms actually employed by Web search services are seldom fully disclosed out of fear that Web entrepreneurs will use search engine optimization techniques to improve their prominence in retrieval lists.
- ^ Studies have repeatedly shown that most users do not understand the negative impacts of boolean queries.[1]
See also