|
Document classification (or document categorization) is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. Document classification tasks fall into two kinds: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification of documents, and unsupervised document classification, where the classification must be done entirely without reference to external information.
An electronic document is any computer data (other than programs or system files) intended to be used in computerized form, without being printed (although printing is usually possible).
Techniques
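Document classification techniques include naive Bayes classifiers, tf-idf weighting, latent semantic analysis, support vector machines, artificial neural networks, the k-nearest neighbor algorithm, decision trees (such as those generated by ID3), and approaches based on natural language processing.
A naive Bayes classifier (also known as Idiot's Bayes) is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions.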
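As an illustration of the naive Bayes approach, here is a minimal sketch of a word-count classifier. It assumes whitespace tokenization, add-one (Laplace) smoothing, and a tiny made-up training set; the function names `train_naive_bayes` and `classify` are hypothetical and not taken from any library.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, category) pairs (toy data, assumed)."""
    doc_counts = Counter()               # number of documents per category
    word_counts = defaultdict(Counter)   # word frequencies per category
    vocabulary = set()
    for text, category in labeled_docs:
        words = text.lower().split()
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary

def classify(text, model):
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior: fraction of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Naive assumption: words are independent given the category.
            # Add-one smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

training = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("project status and agenda", "legitimate"),
]
model = train_naive_bayes(training)
print(classify("buy cheap money", model))         # spam
print(classify("monday project meeting", model))  # legitimate
```

Working in log space keeps the product of many small probabilities from underflowing; a real classifier would also need better tokenization and far more training data.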
The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval.
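The tf-idf weight has several variants. The sketch below assumes one common form: term frequency is the count of a term divided by the document length, and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term. The function name `tf_idf` and the sample documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (assumed variant)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, w)
```

A term that occurs in every document (here "the") gets weight zero, while a term confined to a single document (such as "sat") gets the highest weight in that document.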
Latent semantic analysis (LSA) is a technique in information retrieval invented in 1990 [1]. It is sometimes called latent semantic indexing (LSI).
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression.
An artificial neural network (ANN), commonly just neural network (NN), is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing, based on a connectionist approach to computation.
In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space.
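For documents, the feature space is often a bag of words. The following sketch assumes word-count vectors compared by cosine similarity and a majority vote among the k most similar training documents; the names `cosine` and `knn_classify` and the sample data are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, labeled_docs, k=3):
    query = Counter(text.lower().split())
    # Score every training document against the query, most similar first.
    scored = sorted(
        ((cosine(query, Counter(doc.lower().split())), label)
         for doc, label in labeled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

training = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("shares rally as market opens", "finance"),
    ("team wins the cup final", "sport"),
    ("striker scores in final match", "sport"),
    ("coach praises team after match", "sport"),
]
print(knn_classify("market shares rally", training))  # finance
print(knn_classify("team final match", training))     # sport
```

There is no training step: all work happens at query time, which is simple but becomes slow as the number of stored documents grows.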
In decision theory, a decision tree is a graph of decisions and their possible consequences (including resource costs and risks), used to create a plan to reach a goal.
ID3 (Iterative Dichotomiser 3) is an algorithm used to generate a decision tree.
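ID3 grows a tree by choosing, at each node, the attribute with the highest information gain. The sketch below shows only that selection step, not the full recursive tree construction; the attribute names, sample rows, and function names are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """rows: list of dicts mapping attribute name -> value."""
    total = len(rows)
    # Group the labels by the value each row has for this attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted entropy remaining after the split.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels):
    """Attribute ID3 would split on first (highest information gain)."""
    return max(rows[0], key=lambda a: information_gain(rows, labels, a))

# Hypothetical word-presence features for four messages.
rows = [
    {"contains_free": True,  "has_attachment": False},
    {"contains_free": True,  "has_attachment": True},
    {"contains_free": False, "has_attachment": True},
    {"contains_free": False, "has_attachment": False},
]
labels = ["spam", "spam", "legitimate", "legitimate"]
print(best_attribute(rows, labels))  # contains_free
```

In this toy data `contains_free` separates the two classes perfectly (gain of 1 bit), while `has_attachment` tells us nothing (gain of 0), so the former is chosen.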
Natural language processing (NLP) is a subfield of artificial intelligence and linguistics.
Applications
A recent notable use of document classification techniques has been spam filtering, which tries to distinguish e-mail spam messages from legitimate e-mails. A mail filter is a piece of software that takes an e-mail message as input.
E-mail spam is a subset of spam that involves sending nearly identical messages to numerous recipients by e-mail.
See also
Classification may refer to taxonomic classification, statistical classification, or security classification; see also class (philosophy). A taxonomic classification that is used for statistical purposes may also be referred to as a statistical classification (for example, the International Statistical Classification of Diseases and Related Health Problems).
Supervised learning is a machine learning technique for creating a function from training data.
Unsupervised learning is a method of machine learning where a model is fit to observations.
Document retrieval is defined as the matching of some stated user query against useful parts of free-text records.
Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data.
As a broad subfield of artificial intelligence, machine learning is concerned with the development of algorithms and techniques that allow computers to learn. At a general level, there are two types of learning: inductive and deductive.
Text mining, sometimes also called text data mining, refers generally to the process of deriving high-quality information from text.
Web mining and web usage mining apply data mining techniques to discover usage patterns from the Web, in order to better understand and serve the needs of Web-based applications.
External links
Software:
- LingPipe - Java natural language processing software including a rich classification runtime and evaluation framework with classifiers based on character- and token-based language models (including Naive Bayes).
- DrugSense newsbot - Continuous classification of breaking drug-related news. Around 800 drug-related news articles a day are classified by the type of (illegal or mind-altering) drugs mentioned and by recognized drug-war propaganda themes. An open XML concepts dictionary of keywords (a thesaurus) drives the newsbot's spider/crawler, as well as the classification/categorization of documents.
- TIS eFLOW platform - a modular solution that offers advanced data capture and document classification capabilities.
- YALE (Yet Another Learning Environment) - freely available integrated open-source software environment for knowledge discovery, data mining, machine learning, visualization (e.g. of text clusterings), etc. featuring a plugin WordVectorTool for text mining tasks like text classification, text clustering, document feature set construction and transformation, etc.
- PolyAnalyst - commercial text analysis software that includes Support Vector Machines, Naive Bayes, ID3 (Decision Trees) and other classification tools.
- Bow - freely available open-source toolkit for statistical language modeling, text retrieval, classification, and clustering.
- XmlMiner - Data and text mining toolkit targeted at XML data.
|