|
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis, for checking occurrences, and for validating linguistic rules within a specific universe of texts. Linguistics is the scientific study of language. ...
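As a rough illustration of the "checking occurrences" use mentioned above, the following Python sketch counts word frequencies over a tiny made-up corpus. The texts in `TOY_CORPUS`, the function names, and the simple letters-only tokenizer are assumptions chosen for illustration; a real corpus would need more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical toy corpus: two short texts standing in for a real collection.
TOY_CORPUS = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(corpus):
    """Count how often each token occurs across all texts in the corpus."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    return counts


frequencies = word_frequencies(TOY_CORPUS)
```

Calling `frequencies.most_common(3)` then lists the three most frequent tokens, which is the kind of occurrence check a corpus makes possible.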
A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. Annotation is extra information associated with a particular point in a document or other piece of information. ...
An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual. Part-of-speech tagging is the process of marking up the words in a text with their corresponding parts of speech. ...
In linguistics, and particularly in morphology, a lemma is the canonical form of a lexeme. ...
A gloss is a note made in the margins or between the lines of a book, in which the meaning of the text in its original language is explained in another language. ...
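One possible way to picture the annotation described above is to store each word together with its part-of-speech tag and its lemma. The Python sketch below is only an illustration: the `Token` record, the hand-annotated sentence, and the coarse tag names (`DET`, `NOUN`, `VERB`) are assumptions, not a standard corpus format or tagset.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One annotated word: surface form, part-of-speech tag, and lemma."""
    form: str
    pos: str
    lemma: str


# Hand-annotated example sentence (hypothetical coarse tagset).
ANNOTATED_SENTENCE = [
    Token("The", "DET", "the"),
    Token("cats", "NOUN", "cat"),
    Token("were", "VERB", "be"),
    Token("sleeping", "VERB", "sleep"),
]


def lemmas_with_pos(tokens, pos):
    """Return the lemmas of all tokens carrying the given part-of-speech tag."""
    return [token.lemma for token in tokens if token.pos == pos]


def to_tagged_text(tokens):
    """Render the tokens in a simple word/TAG notation."""
    return " ".join(f"{token.form}/{token.pos}" for token in tokens)
```

With the tags stored alongside the words, a query such as `lemmas_with_pos(ANNOTATED_SENTENCE, "VERB")` can retrieve base forms regardless of inflection.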
Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for POS-tagging and other purposes. Corpus linguistics is the study of language as expressed in samples (corpora) or real world text. ...
Computational linguistics is an interdisciplinary field dealing with the statistical and logical modeling of natural language from a computational perspective. ...
Machine translation, sometimes referred to by the acronym MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. ...
State transitions in a hidden Markov model (example): x = hidden states, y = observable outputs, a = transition probabilities, b = output probabilities. A hidden Markov model (HMM) is a statistical model where the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine...
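To make the link between hidden Markov models and POS-tagging more concrete, here is a minimal Python sketch of Viterbi decoding, a standard way to find the most probable sequence of hidden states (tags) for a sequence of observable outputs (words). All probabilities below are invented for illustration; in practice the transition and output probabilities would be estimated from an annotated corpus. The names `viterbi`, `TAGS`, `START_P`, `TRANS_P`, and `EMIT_P` are assumptions.

```python
# Hidden states are tags, observable outputs are words.
# All numbers are made up for illustration, not estimated from a real corpus.
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.6, "VERB": 0.4}
TRANS_P = {  # transition probabilities between hidden states
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
EMIT_P = {  # output probabilities of words given a hidden state
    "NOUN": {"dogs": 0.6, "bark": 0.4},
    "VERB": {"dogs": 0.1, "bark": 0.9},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    if not observations:
        return []
    # best[t][s] = (probability of the best path ending in s at step t,
    #               the state that path was in at step t - 1)
    first = observations[0]
    best = [{s: (start_p[s] * emit_p[s].get(first, 0.0), None) for s in states}]
    for word in observations[1:]:
        column = {}
        for s in states:
            prev, prob = max(
                ((p, best[-1][p][0] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (prob * emit_p[s].get(word, 0.0), prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


tags = viterbi(["dogs", "bark"], TAGS, START_P, TRANS_P, EMIT_P)
```

Words absent from the output table get probability 0.0 here; a practical tagger would smooth these estimates and work with log probabilities to avoid numerical underflow on long sentences.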
Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Historical documents are documents that contain important information about a person, place, or event. ...
Decipherment is the analysis of documents written in ancient languages, where the language is unknown, or knowledge of the language has been lost. ...
Some notable text corpora
English language:
- The American National Corpus (ANC) is a paid membership-based collaboratory. ...
- The Bank of English is the name of the COBUILD corpus, a collection of English texts. ...
- The British National Corpus (or just BNC) is a 100-million-word collection of samples of written and spoken English from a wide range of sources. ...
- The Brown Corpus of Standard American English (or just Brown Corpus) was compiled by Henry Kucera and W. Nelson Francis at Brown University, Providence, RI as a general corpus in the field of corpus linguistics. ...
- The Oxford English Corpus is a collection of English language texts used by the makers of the Oxford English Dictionary and by Oxford University Press's language research programme. ...
- The Scottish Corpus of Texts & Speech (SCOTS) is an ongoing project to build a corpus of modern-day texts in Scottish English and varieties of Scots. ...
Historical languages:
- Electronic Text Corpus of Sumerian Literature
- Neo-Assyrian Text Corpus Project. The following works are published in the project's series State Archives of Assyria Cuneiform Texts: 1997, SAACT Volume I, The Standard Babylonian Epic of Gilgamesh, by Simo Parpola. ...
Other languages:
- The Persian Today Corpus or The Persian One-Million-word Corpus (واژه‌های پرکاربرد فارسی امروز in Persian). A book written in Persian by Hamid Hassani, published in Tehran, Iran, 2005. ...
Bilingual corpora:
- Evrokorpus English-Slovene parallel corpus
See also
For other uses, see concordance. ...
Corpus linguistics is the study of language as expressed in samples (corpora) or real world text. ...
The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. ...
A parallel text is a text in one language together with its translation in another language. ...
The success of the Google search engine was mainly due to its powerful PageRank algorithm and its simple, easy-to-use interface. ...
A translation memory, or TM, is a type of database that is used in software programs designed to aid human translators. ...
A treebank is a text corpus in which each sentence has been annotated with syntactic structure. ...
Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. ...
External links |