Gene finding

Gene finding is the area of computational biology concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.


Determining that a sequence is functional should be distinguished from determining the function of the gene or its product. The latter still demands in vivo experimentation through gene knockout and other assays, although advances in bioinformatics research are making it increasingly possible to predict the function of a gene from its sequence alone.


In its earliest days, "gene finding" was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of homologous recombination of several different genes could determine their order on a given chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other. Today, with comprehensive genome sequences and powerful computational resources available, gene finding has been redefined as a largely computational problem.


Extrinsic Approaches

In extrinsic gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of the known sequence of a messenger RNA (mRNA) or protein product. Given an mRNA sequence, it is trivial to derive the unique DNA sequence from which it must have been transcribed (in eukaryotes this corresponds to the gene's exons only, since introns are removed from the mature mRNA). Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. BLAST is a widely used system designed for this purpose.
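
The two derivations described above can be sketched in a few lines of Python. This is only an illustration, not how BLAST or any production tool works: the function names are hypothetical, the codon table is the standard genetic code, and enumerating every reverse translation is feasible only for very short peptides because the number of candidates grows multiplicatively with length.

```python
from itertools import product

# Standard genetic code in DNA letters ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def mrna_to_dna(mrna):
    """Return the coding-strand DNA for an mRNA by replacing U with T."""
    return mrna.upper().replace("U", "T")


def reverse_translate(protein):
    """Yield every DNA sequence that encodes `protein` (hypothetical helper).

    Each amino acid is expanded to all of its synonymous codons, so the
    output is the whole family of possible coding sequences.
    """
    codons_for = {}
    for codon, amino_acid in CODON_TABLE.items():
        codons_for.setdefault(amino_acid, []).append(codon)
    for combo in product(*(codons_for[aa] for aa in protein.upper())):
        yield "".join(combo)
```

For example, `mrna_to_dna("AUGGCC")` gives `"ATGGCC"`, and the peptide `"MW"` has exactly one reverse translation, `"ATGTGG"`, because methionine and tryptophan each have a single codon.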


A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene. However, applying this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms only a subset of the genes in the genome is expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, in order to collect extrinsic evidence for most or all of the genes in a complex organism, many different cell types must be studied, which itself presents further difficulties. For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons.


Despite these difficulties, extensive transcript and protein sequence databases have been generated for humans as well as other important model organisms in biology, such as mice and yeast. For example, the RefSeq database contains transcript and protein sequences from many different species, and the Ensembl system comprehensively maps this evidence to the human genome and several others. It is, however, likely that these databases are incomplete and that they contain small but significant amounts of erroneous data.


Ab Initio Approaches

Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to ab initio gene finding, in which genomic DNA sequence alone is systematically searched for certain telltale signs of protein-coding genes. These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of protein-coding sequence itself. Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.


In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame (content), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon to occur approximately every 20-25 codons (64/3 ≈ 21), or 60-75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in sequence of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
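
The stop-codon argument above can be made concrete with a small sketch. The helper names and the `min_codons` default are assumptions chosen for illustration; the scan covers only the three forward-strand reading frames, treats ATG as the only start codon, and assumes uniformly random bases for the probability estimate, all of which a real prokaryotic gene finder would refine.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def expected_stop_spacing():
    """Mean number of codons between stops in uniformly random sequence."""
    return 64 / len(STOP_CODONS)


def prob_no_stop(n_codons):
    """Probability that n random codons in a row contain no stop codon."""
    return (61 / 64) ** n_codons


def find_orfs(dna, min_codons=100):
    """Return (start, end) pairs of ATG...stop open reading frames.

    Only the forward strand is scanned. `end` is exclusive and includes
    the stop codon. `min_codons` counts codons from ATG up to, but not
    including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

Under the uniform-random assumption, `prob_no_stop(100)` is below 1%, which is why an uninterrupted frame of several hundred base pairs is already informative.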


Ab initio gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well understood than in prokaryotes, making them more difficult to recognize reliably. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and polyadenylation signals, which mark where a poly(A) tail is added.
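
As an illustration of how a signal such as a CpG island might be scored, the sketch below computes the GC fraction and the observed-to-expected CpG ratio of a sequence window. The function names are hypothetical, and the default thresholds (GC fraction of at least 0.5, ratio of at least 0.6) are commonly quoted rules of thumb that do not come from this article; treat them as adjustable assumptions.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a window.

    The expected CpG count assumes C and G occur independently:
    expected = count(C) * count(G) / length.
    """
    window = window.upper()
    n = len(window)
    if n == 0:
        return 0.0, 0.0
    c_count = window.count("C")
    g_count = window.count("G")
    cpg_count = window.count("CG")
    gc_fraction = (c_count + g_count) / n
    expected = c_count * g_count / n
    ratio = cpg_count / expected if expected else 0.0
    return gc_fraction, ratio


def is_cpg_like(window, min_gc=0.5, min_ratio=0.6):
    """Flag a window as CpG-island-like under the assumed thresholds."""
    gc_fraction, ratio = cpg_stats(window)
    return gc_fraction >= min_gc and ratio >= min_ratio
```

A gene finder would typically slide such a window along the genome and treat runs of flagged windows as one piece of evidence among several, rather than as proof of a nearby gene.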


Second, splicing mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.
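
To show why splice sites alone are a weak signal, the following sketch lists every candidate intron in a sequence under the assumption that introns begin with GT and end with AG (the canonical dinucleotides, which this article does not itself spell out). The function names and the `min_len` default are hypothetical; real gene finders score the sequence around each site instead of matching two letters.

```python
def candidate_splice_sites(dna):
    """Return (donor positions, acceptor positions) for GT and AG."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


def candidate_introns(dna, min_len=20):
    """Return (start, end) spans that start with GT and end with AG.

    `end` is exclusive. Every donor is paired with every downstream
    acceptor that yields a span of at least `min_len` bases.
    """
    donors, acceptors = candidate_splice_sites(dna)
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]
```

Because GT and AG each occur by chance about once every sixteen positions, the number of candidate pairs grows roughly quadratically with sequence length, which is one reason splice-site evidence has to be combined with content measures.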


Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic and computational linguistic models, especially hidden Markov models, in order to combine information from a variety of different signal and content measurements. The Glimmer system is a widely used and highly accurate gene finder for prokaryotes. Eukaryotic ab initio gene finders, by comparison, have achieved only limited success; a notable example is the GENSCAN program.
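
The idea of a hidden Markov model can be illustrated with a deliberately tiny example: two hidden states, "coding" and "noncoding", each emitting bases with different frequencies, decoded with the Viterbi algorithm. Every probability below is invented for illustration, and the model is far simpler than those used by Glimmer or GENSCAN, which track codon position, signals, and exon/intron structure.

```python
import math

# Toy parameters (assumptions, not estimated from any real genome).
STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for `seq` (A/C/G/T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these toy parameters, a GC-rich stretch flanked by AT-rich sequence is labeled "coding" while the flanks are labeled "noncoding". The transition probabilities act as a penalty on switching state, so isolated bases do not flip the label.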


Comparative Genomics Approaches

As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is a comparative genomics approach. This is based on the principle that the forces of natural selection cause genes and other functional elements to undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere. Genes can thus be identified by comparing the genomes of related species to detect this evolutionary pressure for conservation. In 2003, comparative genomics analysis of several yeast species led to significant revisions of the yeast gene catalog, and similar approaches will likely lead to a significantly refined understanding of the human genome within the next few years.
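
A minimal sketch of the conservation idea, assuming the two genomes have already been aligned into equal-length strings with "-" for gaps: percent identity is computed in sliding windows, and windows above a threshold are reported as candidate functional regions. The function names, window size, step, and threshold are all illustrative assumptions; published comparative methods use multiple species and explicit evolutionary models.

```python
def window_identity(seq_a, seq_b, window=30, step=10):
    """Return (start, identity) for each window of an aligned pair.

    Identity is the fraction of columns where both sequences carry the
    same base; gap columns never count as matches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window].upper()
        b = seq_b[start:start + window].upper()
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        results.append((start, matches / window))
    return results


def conserved_regions(seq_a, seq_b, window=30, step=10, threshold=0.9):
    """Return start positions of windows at or above the identity threshold."""
    return [
        start
        for start, identity in window_identity(seq_a, seq_b, window, step)
        if identity >= threshold
    ]
```

High identity by itself does not prove function, since closely related species are similar everywhere; the signal is conservation that exceeds the background rate for the species pair being compared.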


External links

  • http://www.genefinding.org
  • http://www.binf.ku.dk/users/krogh/genefinding.html
  • http://www.swbic.org/links/1.4.3.2.php
  • http://www.tigr.org/software/glimmer
  • http://www.tigr.org/software/GlimmerHMM

The Wikipedia article included on this page is licensed under the GFDL.