Chemical file format

This article discusses some common molecular file formats, including how they are used and how to convert between them. It also lists a few sources for freely obtaining chemical data on the Internet.


Chemical information is usually provided as files or streams, and many formats have been created, with varying degrees of documentation. The format of a given file can be determined in three ways (see the Chemical MIME section):

  • file extension (usually 3 letters). This is widely used, but fragile, as common suffixes such as ".mol" and ".dat" are used by many systems, including non-chemical ones.
  • self-describing files where the format information is included in the file. Examples are CIF and CML.
  • chemical/MIME type added by a chemically-aware server.
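
The first of these approaches, guessing the format from the file extension, can be sketched as a simple lookup. This is only an illustration: the function name and dictionary are hypothetical, and the handful of extension/MIME pairs are copied from the table in the Chemical MIME section of this article, not from any registry.

```python
import os

# A few extension -> chemical MIME type pairs, copied from the table
# in the Chemical MIME section of this article (not a complete list).
CHEMICAL_MIME_BY_EXTENSION = {
    "cif": "chemical/x-cif",
    "cml": "chemical/x-cml",
    "mol": "chemical/x-mdl-molfile",
    "sdf": "chemical/x-mdl-sdfile",
    "smi": "chemical/x-daylight-smiles",
    "smiles": "chemical/x-daylight-smiles",
}


def guess_chemical_mime(filename):
    """Return a chemical MIME type guessed from the extension, or None."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return CHEMICAL_MIME_BY_EXTENSION.get(extension)


print(guess_chemical_mime("epinephrine.sdf"))
print(guess_chemical_mime("results.dat"))
```

An unknown or ambiguous suffix such as ".dat" yields no answer here, which is the fragility mentioned in the list: the extension alone says nothing about what is inside the file.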

Sources of Chemical Data

Here is a short list of sources of freely available molecular data. Many more resources exist on the Internet than are listed here.

  1. The US National Institutes of Health PubChem database is a huge source of chemical data. All of the data is two-dimensional. Data is available in SDF, SMILES, PubChem XML, and PubChem ASN.1 formats.
  2. The Protein Data Bank is an excellent source of protein molecular data. The data is three-dimensional and provided in Protein Data Bank (PDB) format.
  3. eMolecules (formerly Chmoogle) is a commercial database for molecular data. The data includes a two-dimensional structure diagram and a SMILES string for each compound. eMolecules supports fast substructure searching based on parts of the molecular structure.
  4. ChemExper is a commercial database for molecular data. The search results include a two-dimensional structure diagram and a MOL file for many compounds.
  5. New York University Library of 3-D Molecular Structures.
  6. The US Environmental Protection Agency's Distributed Structure-Searchable Toxicity (DSSTox) Database Network is a project of the EPA's Computational Toxicology Program. The database provides SDF molecular files with a focus on carcinogenic and otherwise toxic substances.


Chemical Markup Language

Chemical Markup Language (CML) is an open standard for representing molecular and other chemical data. The open source project includes an XML Schema, source code for parsing and working with CML data, and an active community. The articles "Tools for Working with Chemical Markup Language" and "XML for Chemistry and Biosciences" discuss CML in more detail. CML data files are accepted by many tools, including JChemPaint, Jmol, XDrawChem and MarvinView.


Protein Data Bank Format

The Protein Data Bank Format is commonly used for proteins but it can be used for other types of molecules as well. It was originally designed as a fixed-column-width format and thus officially has a built-in maximum number of atoms; however, many tools can read files that exceed the limit. Some PDB files contain an optional section describing atom connectivity as well as position. Because these files are sometimes used to describe macromolecular assemblies or molecules represented in explicit solvent, they can grow very large and are often compressed. Some tools, such as Jmol, can read PDB files in gzipped format. The PDB maintains the specifications of the PDB file format and its XML alternative, PDBML. The typical file extension for a PDB file is .pdb, although some older files use .ent or .brk. Some molecular modeling tools write nonstandard PDB-style files that adapt the basic format to their own needs.
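
As a rough illustration of what "fixed-column-width" means in practice, the sketch below slices one ATOM record by character position. The column ranges are an assumption taken from the commonly published PDB ATOM record layout, not from this article, and the function name and sample values are hypothetical. The five-column serial field is one reason the format has a built-in atom limit: it cannot hold a number above 99999.

```python
def parse_pdb_atom(line):
    """Slice one ATOM/HETATM record into named fields by column position."""
    return {
        "record": line[0:6].strip(),     # columns 1-6
        "serial": int(line[6:11]),       # columns 7-11 (at most 99999)
        "name": line[12:16].strip(),     # columns 13-16
        "res_name": line[17:20].strip(), # columns 18-20
        "chain_id": line[21].strip(),    # column 22
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
    }


# Illustrative record, assembled field by field so the columns are explicit.
SAMPLE = (
    "ATOM  "      # record name, columns 1-6
    "    1"       # atom serial, columns 7-11
    " "           # column 12 (unused)
    " N  "        # atom name, columns 13-16
    " "           # alternate location, column 17
    "MET"         # residue name, columns 18-20
    " "           # column 21 (unused)
    "A"           # chain identifier, column 22
    "   1"        # residue sequence number, columns 23-26
    "    "        # insertion code and padding, columns 27-30
    "  27.340"    # x, columns 31-38
    "  24.430"    # y, columns 39-46
    "   2.614"    # z, columns 47-54
    "  1.00"      # occupancy, columns 55-60
    "  9.67"      # temperature factor, columns 61-66
    "          "  # columns 67-76
    " N"          # element symbol, columns 77-78
)

atom = parse_pdb_atom(SAMPLE)
print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```

Slicing by position rather than splitting on whitespace matters here because adjacent fields in a fixed-width record may touch with no space between them.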


GROMACS format

The GROMACS file format family was created for use with the molecular simulation software package GROMACS. It closely resembles the PDB format but was designed for storing output from molecular dynamics simulations, so it allows for additional numerical precision and optionally retains information about particle velocity as well as position at a given point in the simulation trajectory. It does not allow for the storage of connectivity information, which in GROMACS is obtained from separate molecule and system topology files. The typical file extension for a GROMACS file is .gro.
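
The optional velocity columns can be illustrated with a small parser for a single atom line of a .gro file. This is a sketch under an assumption: it uses the commonly documented default layout (C format "%5d%-5s%5s%5d%8.3f%8.3f%8.3f" for positions, followed by "%8.4f%8.4f%8.4f" for velocities), which is not stated in this article. The function name and sample values are hypothetical.

```python
def parse_gro_atom(line):
    """Parse one atom line of a .gro file; velocity is None when absent."""
    atom = {
        "res_num": int(line[0:5]),
        "res_name": line[5:10].strip(),
        "atom_name": line[10:15].strip(),
        "atom_num": int(line[15:20]),
        # Three position fields, 8 characters each, starting at column 21.
        "position": tuple(float(line[20 + 8 * i:28 + 8 * i]) for i in range(3)),
        "velocity": None,
    }
    # Three optional velocity fields, 8 characters each, after the positions.
    if len(line.rstrip()) >= 68:
        atom["velocity"] = tuple(
            float(line[44 + 8 * i:52 + 8 * i]) for i in range(3)
        )
    return atom


# Illustrative lines built with the assumed format strings.
WITHOUT_VELOCITY = "%5d%-5s%5s%5d%8.3f%8.3f%8.3f" % (
    1, "SOL", "OW", 1, 0.126, 1.624, 1.679)
WITH_VELOCITY = WITHOUT_VELOCITY + "%8.4f%8.4f%8.4f" % (0.1227, -0.0580, 0.0434)

print(parse_gro_atom(WITHOUT_VELOCITY))
print(parse_gro_atom(WITH_VELOCITY))
```

Note that nothing in either line describes bonds, consistent with connectivity being kept in separate topology files.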


CHARMM format

The CHARMM molecular dynamics package can read and write a number of standard chemical and biochemical file formats; however, the CARD (coordinate) and PSF (protein structure file) formats are largely unique to CHARMM. The CARD format is fixed-column-width, resembles the PDB format, and is used exclusively for storing atomic coordinates. The PSF file contains atomic connectivity information (which describes atomic bonds) and is required before beginning a simulation. The typical file extensions used are .crd and .psf respectively. Many external analysis tools such as VMD can read the PSF format, while they typically cannot read GROMACS topology files.


Ghemical file format

The Ghemical software can use OpenBabel to import and export a number of file formats. However, by default, it uses the GPR format. A GPR file is composed of several parts, separated by tags (!Header, !Info, !Atoms, !Bonds, !Coord, !PartialCharges and !End).


The proposed MIME type for this format is application/x-ghemical.


SMILES

The Simplified Molecular Input Line Entry Specification (SMILES) is a line notation for molecules. SMILES strings include connectivity but do not include 2D or 3D coordinates.


Atoms are represented by their element symbols B, C, N, O, F, P, S, Cl, Br, and I. Aromatic atoms may be written in lower case. The symbol '=' represents double bonds and '#' represents triple bonds. Branching is indicated by (). Rings are indicated by pairs of digits.


Some examples are:

Name | Formula | SMILES String
Methane | CH4 | C
Ethanol | C2H6O | CCO
Benzene | C6H6 | C1=CC=CC=C1 or c1ccccc1
Ethylene | C2H4 | C=C
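
The notation described above is simple enough to tokenize with a single regular expression. The following sketch handles only the subset mentioned in this section (the listed element symbols, lower-case aromatic atoms, '=' and '#', parentheses, and single ring-closure digits) and counts the non-hydrogen atoms in each string. It is an illustration with hypothetical function names, not a full SMILES parser.

```python
import re

# Two-letter symbols must be tried before one-letter ones ("Cl" before "C").
ATOM = r"Cl|Br|[BCNOFPSI]|[bcnops]"
TOKEN_RE = re.compile(ATOM + r"|[=#()]|\d")
ATOM_RE = re.compile(ATOM)


def tokenize_smiles(smiles):
    """Split a SMILES string into atoms, bonds, branch marks and ring digits."""
    tokens = TOKEN_RE.findall(smiles)
    # Anything the pattern skipped is outside the subset handled here.
    if "".join(tokens) != smiles:
        raise ValueError("unsupported character in SMILES: %r" % smiles)
    return tokens


def count_atoms(smiles):
    """Count the non-hydrogen atoms written explicitly in the string."""
    return sum(1 for token in tokenize_smiles(smiles) if ATOM_RE.fullmatch(token))


for name, smiles in [("Methane", "C"), ("Ethanol", "CCO"),
                     ("Benzene", "C1=CC=CC=C1"), ("Benzene", "c1ccccc1"),
                     ("Ethylene", "C=C")]:
    print(name, tokenize_smiles(smiles), count_atoms(smiles))
```

Hydrogens are implicit in these strings, so the counts match the number of carbon and oxygen atoms in each formula rather than the total number of atoms.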


Other Common Formats

Among the most widely used industry standards are the chemical table file formats, such as Structure Data Format (SDF) files. They are text files that adhere to a strict format for representing multiple chemical structure records and associated data fields. The format was originally developed and published by Molecular Design Limited (MDL). MOL is another file format from MDL. It is documented in Chapter 4 of the white paper MDL CTfile Formats.
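
The idea of multiple structure records with associated data fields can be sketched as follows. The code assumes the usual SD file conventions, which are not spelled out in this article: each record ends with a line containing only "$$$$", and data items appear as a "> <NAME>" header followed by value lines and a blank line. The function names and the two-record sample are hypothetical, and the structure blocks are treated as opaque text.

```python
def split_sdf_records(text):
    """Split SD file text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def read_data_fields(record_lines):
    """Collect the '> <NAME>' data items of one record into a dictionary."""
    fields, name = {}, None
    for line in record_lines:
        if line.startswith(">"):
            # Header line of the form '> <NAME>'.
            name = line[line.find("<") + 1:line.rfind(">")]
            fields[name] = []
        elif name is not None:
            if line.strip() == "":
                name = None  # a blank line closes the current data item
            else:
                fields[name].append(line)
    return {key: "\n".join(value) for key, value in fields.items()}


# Illustrative two-record sample; the structure blocks are kept minimal.
SAMPLE = """\
methane
  example

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <NAME>
Methane

> <FORMULA>
CH4

$$$$
ethylene
  example

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.3000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
M  END
> <NAME>
Ethylene

$$$$
"""

for record in split_sdf_records(SAMPLE):
    print(record[0], read_data_fields(record))
```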


PubChem also has XML and ASN.1 file formats, which are export options from the PubChem online database. Both exports are text based, although ASN.1 is most often encountered as a binary format.


There are a large number of other formats listed in the table below.


Converting Between Formats

OpenBabel and JOELib are freely available open source tools specifically designed for converting between file formats. Their chemical expert systems support large atom type conversion tables. The general form of the OpenBabel command is

 babel -i input_format input_file -o output_format output_file 

For example, to convert the SDF file epinephrine.sdf to CML, use the command

 babel -i sdf epinephrine.sdf -o cml epinephrine.cml 

The resulting file is epinephrine.cml.


A number of tools intended for viewing and editing molecular structures are able to read in files in a number of formats and write them out in other formats. The tools JChemPaint (based on the Chemistry Development Kit), XDrawChem (based on OpenBabel), Chime, and Jmol fit into this category.


The Chemical MIME Project

"Chemical MIME" is a de facto approach for adding MIME types to chemical streams.

This project started in January 1994, and was first announced during the Chemistry workshop at the First WWW International Conference, held at CERN in May 1994. The first version of an Internet draft was published during May-October 1994, and the second revised version during April-September 1995. A paper presented to the CPEP (Committee on Printed and Electronic Publications) at the IUPAC meeting in August 1996 is available for discussion.

file extension | MIME type | proper name | description
acc |  | Accord Chemistry | Binary chemistry file. Can contain reactions or molecules.
adf |  | Accord Data Format | Binary chemistry file; made up of one or more records of Accord Chemistry
alc | chemical/x-alchemy | Alchemy Format |
 |  | AMBER (file format) |
csf | chemical/x-cache-csf | CAChe MolStruct CSF |
cbin, cascii, ctab | chemical/x-cactvs-binary | CACTVS format |
cdx | chemical/x-cdx | ChemDraw eXchange file |
cer | chemical/x-cerius | MSI Cerius II format |
c3d | chemical/x-chem3d | Chem3D Format |
chm | chemical/x-chemdraw | ChemDraw file |
cif | chemical/x-cif | Crystallographic Information File, Crystallographic Information Framework | Promulgated by the International Union of Crystallography
cmdf | chemical/x-cmdf | CrystalMaker Data format |
cml | chemical/x-cml | Chemical Markup Language | XML based Chemical Markup Language.
cpa | chemical/x-compass | Compass program of the Takahashi |
bsd | chemical/x-crossfire | Crossfire file |
csm, csml | chemical/x-csml | Chemical Style Markup Language |
ctx | chemical/x-ctx | Gasteiger group CTX file format |
cxf, cef | chemical/x-cxf | Chemical eXchange Format |
emb, embl | chemical/x-embl-dl-nucleotide | EMBL Nucleotide Format |
spc | chemical/x-galactic-spc | SPC format for spectral and chromatographic data |
inp, gam, gamin | chemical/x-gamess-input | GAMESS Input format |
fch, fchk | chemical/x-gaussian-checkpoint | Gaussian Checkpoint Format |
cub | chemical/x-gaussian-cube | Gaussian Cube (Wavefunction) Format |
gau, gjc, gjf | chemical/x-gaussian-input | Gaussian Input Format |
gcg | chemical/x-gcg8-sequence | Protein Sequence Format |
gen | chemical/x-genbank | ToGenBank Format |
istr, ist | chemical/x-isostar | IsoStar Library of Intermolecular Interactions |
jdx, dx | chemical/x-jcamp-dx | JCAMP Spectroscopic Data Exchange Format |
kin | chemical/x-kinemage | Kinetic (Protein Structure) Images |
mcm | chemical/x-macmolecule | MacMolecule File Format |
mmd, mmod | chemical/x-macromodel-input | MacroModel Molecular Mechanics |
mol | chemical/x-mdl-molfile | MDL Molfile |
smiles, smi | chemical/x-daylight-smiles | Simplified molecular input line entry specification | A line notation for molecules.
sdf | chemical/x-mdl-sdfile | Structure-Data File |

The definitive specification is at http://www.ch.ic.ac.uk/chemime/ which is updated when major new types appear.


Chemical MIME Support

For Unix/Linux, a tar.gz package is available that registers chemical MIME types on your system. Programs can then register as a viewer, editor or processor for these formats, so that full support for chemical MIME types is available.


chemical-mime-data: http://sourceforge.net/project/showfiles.php?group_id=159685&package_id=179318


See also

  • File format
  • OpenBabel
  • JOELib
  • OELib
  • Chemistry Development Kit
  • Chemical Markup Language (CML)

The Wikipedia article included on this page is licensed under the GFDL.