Text mining

Text mining, sometimes called text data mining, refers generally to the process of deriving high-quality information from text. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
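
The three stages described above (structuring the input, deriving patterns, and interpreting the output) can be illustrated with a minimal sketch. The stop-word list, the function names, and the sample documents are assumptions made up for this example; document frequency stands in for "pattern derivation" only because it is simple, not because it is the method any particular system uses.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def structure(text):
    """Structure raw text: lowercase, split into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def derive_patterns(documents):
    """Count in how many documents each term occurs (document frequency)."""
    df = Counter()
    for doc in documents:
        df.update(set(structure(doc)))
    return df

def interpret(df, min_docs=2):
    """Keep only terms that recur in at least min_docs documents, sorted."""
    return sorted(term for term, n in df.items() if n >= min_docs)

docs = [
    "Text mining is the process of deriving information from text.",
    "Data mining and text mining draw on statistics.",
    "Statistics is used in pattern learning.",
]
recurring = interpret(derive_patterns(docs))
print(recurring)  # ['mining', 'statistics', 'text']
```

In a fuller system the structuring step would typically add linguistic features (for example part-of-speech tags) and store the result in a database, as the paragraph above notes.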

History

Labour-intensive manual text-mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to advance swiftly during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (over 80%) is currently stored as text, text mining is believed to have high commercial potential. Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and to cluster similar items from different linguistic sources according to their meaning.


Sentiment analysis

Sentiment analysis may, for example, involve analyzing movie reviews to estimate how favorable a review is toward a movie.[1] Such an analysis may require a labeled data set or a labeling of the affectiveness of words. A resource for the affectiveness of words has been built for WordNet.[2]
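
As a rough illustration of the word-labeling approach, the sketch below scores a review by summing per-word affect weights. The lexicon and its weights are invented for this example and are not taken from WordNet or from the cited papers; a labeled data set and a trained classifier would be the alternative route mentioned above.

```python
import re

# Hypothetical affect lexicon (weights made up for illustration only).
AFFECT = {"good": 1, "great": 2, "enjoyable": 1,
          "bad": -1, "boring": -1, "awful": -2}

def sentiment_score(review):
    """Sum the affect weights of the words in a review; unknown words count as 0."""
    words = re.findall(r"[a-z']+", review.lower())
    return sum(AFFECT.get(w, 0) for w in words)

def label(review):
    """Map the numeric score to a coarse favorability label."""
    score = sentiment_score(review)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"

print(label("A great cast, but a boring plot."))  # favorable
```

A word-counting scorer like this ignores negation and context ("not good" still scores as positive), which is one reason labeled training data is often preferred.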


Applications

Recently, text mining has been receiving attention in many areas.


Security applications

One of the largest text mining applications in existence is probably the classified ECHELON surveillance system. Additionally, many text mining software packages, such as Aerotext, Attensity and Expert System, are marketed towards security applications, particularly the analysis of plain-text sources such as internet news.


Software and applications

Research and development departments of major companies, including IBM and Microsoft, are researching text mining techniques and developing programs to further automate the mining and analysis processes. Text mining software is also being researched by companies working in search and indexing in general, as a way to improve their results.


Academic applications

Text mining is important to publishers who hold large databases of information requiring indexing for retrieval. This is particularly true in scientific disciplines, in which highly specific information is often contained within written text. Initiatives have therefore been taken, such as Nature's proposal for an Open Text Mining Interface (OTMI) and NIH's common Journal Publishing Document Type Definition (DTD), that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.


Academic institutions have also become involved in the text mining initiative:


UK: The National Centre for Text Mining, a collaborative effort between the Universities of Manchester and Liverpool, funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils, provides customised tools and research facilities and offers advice to the academic community. With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the social sciences.


USA: The School of Information at the University of California, Berkeley is developing a program called BioText to assist bioscience researchers in text mining and analysis.


Commercial software and applications

  • AITellU - provides a range of text mining services and applications based on advanced artificial intelligence techniques.
  • Anderson Analytics - provider of text analytics and content analysis especially as it relates to consumer behavior.
  • Attensity - suite of text mining solutions that includes search, statistical and NLP based technologies for a variety of industries.
  • Autonomy - suite of text mining, clustering and categorization solutions for a variety of industries.
  • Carabao Language Kit - suite of components for text mining, categorization, sense disambiguation, idiom extraction, and named entity recognition, with tools to add a new language or edit existing one(s).
  • Clarabridge - text mining and categorization applications for customer, healthcare, and investigative analytics.
  • Clearforest - text mining software to extract meaning from various forms of textual information. (Clearforest was sold to Reuters)
  • Cortex Intelligence - text mining for Competitive Intelligence with Named Entity Recognition.
  • Crossminder - text mining company enabling multilingual searches and searches through semantic approximation.
  • Endeca Technologies - provides software to analyze and cluster unstructured text.
  • Evolutionary Software, Inc. - develops software for automatic extraction from financial reports and provides text mining consulting services.
  • Expert System S.p.A. - suite of semantic technologies and products for developers and knowledge managers.
  • Fair Isaac - leading provider of decision management solutions powered by advanced analytics (includes text analytics).
  • IBM TAKMI - research prototype.
  • IBM OmniFind Analytics Edition - commercial text mining software.
  • Inxight - provider of text analytics, search, and unstructured visualization technologies. (Inxight was sold to Business Objects that was sold to SAP)
  • Island Data - Real-time market intelligence from unstructured customer feedback.
  • Jane16 - Free online text analysis site, extraction of subject and summary of text.
  • Linguamatics - Intelligence from text with real-time, agile NLP.
  • Nstein Technologies - provider of text analytics, and asset/web content management technologies (media, e-publishing, online publishing).
  • PolyAnalyst - commercial text mining software.
  • RapidMiner - open-source data and text mining software for scientific and commercial use.
  • SAS Enterprise Miner - commercial text mining software.
  • SPSS - provider of SPSS Text Analysis for Surveys, Text Mining for Clementine, LexiQuest Mine and LexiQuest Categorize, commercial text analytics software that can be used in conjunction with SPSS Predictive Analytics Solutions.
  • TEMIS - software publisher providing information discovery solutions for the information intelligence needs of business corporations.
  • TextAnalyst - commercial text mining software
  • Textalyser - an online text analysis tool for generating text analysis statistics of web pages and other texts.
  • Topicalizer - an online text analysis tool for generating text analysis statistics of web pages and other texts.
  • Zoomhive - Text Mining, Self-structuring Asynchronous Learning Network, by means of a multi-keyword extraction algorithm (patent pending).
  • Zoomix - Self learning Text Mining software (patent pending).

Open-source software and applications

  • GATE - natural language processing and language engineering tool.
  • YALE with its Word Vector Tool plugin - data and text mining software.
  • Pimiento - a text-mining application framework written in Java.

Implications

Until recently, websites most often used text-based lexical searches; in other words, users could find documents only by the words that happened to occur in them. Text mining may allow searches to be answered directly by the semantic web; users may be able to search for content based on its meaning and context, rather than just by a specific word.
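
The limitation of purely lexical search can be seen in a small sketch of an inverted index, which maps each word to the documents containing it. The sample documents and function names are assumptions for illustration; the point is only that a query matches on exact words, so a document about "automobiles" is not found by a search for "cars".

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def lexical_search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "Cars are sold here",
    2: "Automobiles for sale",
    3: "Used cars and trucks",
}
index = build_index(docs)
print(lexical_search(index, "cars"))  # {1, 3}; document 2 is missed
```

A meaning-based search would need some additional source of knowledge (for example a thesaurus or learned word similarities) to relate "cars" to "automobiles".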


Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, by using software that extracts specific facts about businesses and individuals from news reports, large datasets can be built to facilitate social network analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis.


Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material.
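
The text does not say which technique such filters use. One common choice is a naive Bayes classifier over word counts, sketched below under that assumption; the training messages, labels, and function names are all invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(messages):
    """Count words per label from (text, label) pairs; labels are 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label with the higher log-probability, using add-one smoothing."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        score = math.log(label_counts[label] / total)
        n = sum(word_counts[label].values())
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data; both labels must appear at least once.
training = [
    ("Buy cheap pills now", "spam"),
    ("Cheap offer, buy today", "spam"),
    ("Meeting agenda for today", "ham"),
    ("Lunch at noon?", "ham"),
]
word_counts, label_counts = train(training)
print(classify("cheap pills", word_counts, label_counts))  # spam
```

The word counts learned during training are the "characteristics of messages" in this sketch: words that occur mostly in unwanted mail push a new message toward the spam label.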


Notes

  1. ^ Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan (2002). "Thumbs up? Sentiment Classification using Machine Learning Techniques". Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP): 79–86. 
  2. ^ Alessandro Valitutti, Carlo Strapparava, Oliviero Stock (2005). "Developing Affective Lexical Resources". PsychNology Journal (1): 61–83. 

References

  • Ronen Feldman and James Sanger, The Text Mining Handbook, Cambridge University Press, ISBN 9780521836579
  • Anne Kao and Steve R. Poteet (editors), Natural Language Processing and Text Mining, Springer, ISBN 184628175X
  • Manu Konchady, Text Mining Application Programming (Programming Series), Charles River Media, ISBN 1584504609
  • M. Ikonomakis, S. Kotsiantis, V. Tampakas, Text Classification Using Machine Learning Techniques, WSEAS Transactions on Computers, Issue 8, Volume 4, August 2005, pp. 966-974 (http://www.math.upatras.gr/~esdlab/en/members/kotsiantis/Text%20Classification%20final%20journal.pdf)

See also

  • Approximate nonnegative matrix factorization
  • BioCreative
  • Business intelligence
  • Computational linguistics
  • Data mining
  • Information retrieval
  • Name resolution
  • Natural language processing
  • Stop words
  • Text analytics
  • Document classification
  • Web mining
  • W-shingling

External links

  • http://www.itl.nist.gov/iaui/894.02/related_projects/muc/ MUC
  • http://projects.ldc.upenn.edu/ace/ ACE (LDC)
  • http://www.itl.nist.gov/iad/894.01/tests/ace/ ACE (NIST)
  • http://www.arts-humanities.net/text_mining (Discussion group text mining)
  • http://videolectures.net/Top/Computer_Science/Text_Mining/ Video tutorials, talks, lectures (Videolectures.Net)

The Wikipedia article included on this page is licensed under the GFDL.