In bioinformatics, FASTA format is a file format used to exchange information between genetic sequence databases. A file in FASTA format looks like this:
    >SEQUENCE_1
    ;comment line 1 (optional)
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
    LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
    IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
    MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
    >SEQUENCE_2
    ;comment line 1 (optional)
    ;comment line 2 (optional)
    SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
    ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Each entry begins with a header line (starting with a '>') which gives a name and/or a unique identifier for the sequence, and often a good deal of other information too. Many sequence databases use standardized headers, which helps when extracting information from the header automatically. Often the first 'word' of the header is a unique identifier for the sequence. The header line may contain more than one header, separated by a ^A (Control-A) character (as in ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz). In that case, all of the headers refer to the same sequence below.
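The header conventions described above can be illustrated with a minimal Python sketch. The function name `split_headers` and the identifiers in the usage example are hypothetical, and the sketch assumes that the first whitespace-delimited word of each header is its identifier, which the format itself does not guarantee.

```python
def split_headers(header_line):
    """Split one FASTA header line into (identifier, description) pairs.

    Assumes headers are separated by Control-A (ASCII 0x01) and that the
    first whitespace-delimited word of each header is its identifier.
    """
    if not header_line.startswith(">"):
        raise ValueError("a FASTA header line must begin with '>'")
    # Drop the leading '>' and any line ending, then split on Control-A.
    headers = header_line[1:].rstrip("\r\n").split("\x01")
    result = []
    for header in headers:
        parts = header.strip().split(None, 1)
        identifier = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""
        result.append((identifier, description))
    return result
```

For example, `split_headers(">ID_A first description\x01ID_B second description")` yields two pairs, both of which describe the single sequence that follows the header line.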
After the header line, one or more comments, distinguished by a semicolon at the beginning of the line, may occur. Most databases and bioinformatics applications do not recognize these comments, so their use is discouraged, but they are part of the official format. After the header line and comments, one or more sequence lines may follow; each line of a sequence should have fewer than 80 characters. Sequences may be protein sequences or DNA sequences, and they can contain gaps or alignment characters (see sequence alignment).
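The layout described above (a header line, optional semicolon comments, then sequence lines) can be read with a short parser. The following Python sketch is one possible approach, not a reference implementation; the name `parse_fasta` is hypothetical, and the choice to silently ignore comment lines that appear before the first header is an assumption.

```python
def parse_fasta(lines):
    """Yield (header, comments, sequence) tuples from an iterable of lines."""
    header, comments, chunks = None, [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, comments, "".join(chunks)
            header, comments, chunks = line[1:].strip(), [], []
        elif line.startswith(";"):
            # Comment line; ignored when it precedes the first header.
            if header is not None:
                comments.append(line[1:].strip())
        elif line.strip():
            if header is None:
                raise ValueError("sequence data found before any header line")
            # Sequence lines are concatenated into one string per record.
            chunks.append(line.strip())
    if header is not None:
        yield header, comments, "".join(chunks)
```

Because the function accepts any iterable of lines, it can be passed an open file object directly, for instance `for header, comments, seq in parse_fasta(open(path)):` where `path` names a FASTA file.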
FASTA format files often have file extensions like .fa, .mpfa, .fna, or .fsa (and probably many more).
The simple format of FASTA files makes them easy to manipulate using text processing tools and scripting languages like Perl.
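As an illustration of how little code such manipulation needs, the sketch below writes a record back out with its sequence wrapped to a fixed width. It uses Python rather than Perl; the function name `format_fasta` and the default width of 60 are assumptions chosen only to stay under the 80-character guideline.

```python
def format_fasta(header, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width`."""
    if not 0 < width < 80:
        raise ValueError("width should be between 1 and 79 characters")
    lines = [">" + header]
    # Slice the sequence into consecutive chunks of at most `width` letters.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

Calling `format_fasta("SEQUENCE_1", "ACGTACGTAC", width=4)` produces a header line followed by the lines `ACGT`, `ACGT` and `AC`.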
The NCBI have gone so far as to define a standard for their FASTA header (although generally this is a bit messy). The formatdb man page has this to say on the subject of FASTA format databases: "formatdb will automatically parse the SeqID and create indexes, but the database identifiers in the FASTA definition line must follow the conventions of the FASTA Defline Format."
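To give a sense of what parsing such identifiers involves, here is a minimal Python sketch that splits a pipe-separated defline identifier into tagged fields. It follows the identifier layouts tabulated in this section, which are themselves only an attempt at a description, so the field table should be read as an assumption rather than an official specification. The names `DEFLINE_FIELDS` and `parse_defline_id` are hypothetical, and the second Protein Data Bank form (`entry:chain|PDBID|CHAIN|SEQUENCE`) is not handled.

```python
# Field names expected after each database tag (an assumption based on the
# layouts listed in this article; an empty name marks an unused slot).
DEFLINE_FIELDS = {
    "gi": ("gi-number",),
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": ("", "entry"),
    "prf": ("", "name"),
    "sp": ("accession", "name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_defline_id(identifier):
    """Split an identifier such as 'gi|...|gb|...|...' into (tag, fields) pairs."""
    tokens = identifier.split("|")
    parsed = []
    pos = 0
    while pos < len(tokens):
        tag = tokens[pos]
        if tag not in DEFLINE_FIELDS:
            raise ValueError("unrecognised database tag: %r" % tag)
        names = DEFLINE_FIELDS[tag]
        values = tokens[pos + 1:pos + 1 + len(names)]
        # Pad with empty strings if trailing fields are missing.
        values += [""] * (len(names) - len(values))
        fields = {name: value for name, value in zip(names, values) if name}
        parsed.append((tag, fields))
        pos += 1 + len(names)
    return parsed
```

With a made-up identifier, `parse_defline_id("gi|12345|gb|ABC123|LOCUS1")` returns one entry for the `gi` tag and one for the `gb` tag.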
However, the NCBI do not give a definitive description of the FASTA defline format. An attempt to describe the format is given below.

    Database                          Identifier syntax
    GenBank                           gi|gi-number|gb|accession|locus
    EMBL Data Library                 gi|gi-number|emb|accession|locus
    DDBJ, DNA Database of Japan       gi|gi-number|dbj|accession|locus
    NBRF PIR                          pir||entry
    Protein Research Foundation       prf||name
    SWISS-PROT                        sp|accession|name
    Brookhaven Protein Data Bank (1)  pdb|entry|chain
    Brookhaven Protein Data Bank (2)  entry:chain|PDBID|CHAIN|SEQUENCE
    Patents                           pat|country|number
    GenInfo Backbone Id               bbs|number
    General database identifier       gnl|database|identifier
    NCBI Reference Sequence           ref|accession|locus
    Local Sequence identifier         lcl|identifier

See also
FASTA is a sequence alignment package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985 in the article "Rapid and sensitive protein similarity searches".