|
A chemical database is a database specifically designed to store chemical information. Most chemical databases store information on stable molecules. Chemical structures are traditionally represented using lines indicating chemical bonds between atoms and drawn on paper (2D structural formulae). While these are ideal visual representations for the chemist, they are unsuitable for computational use and especially for search and storage.
Large chemical databases are expected to handle the storage and searching of information on millions of molecules, taking terabytes of physical memory.
Representation
There are two principal techniques for representing chemical structures in digital databases:
- As connection tables, adjacency matrices or lists, with additional information on bonds (edges) and atoms (nodes), as in the MDL Molfile, PDB and CML formats
- As a linear string notation based on depth-first or breadth-first traversal of the structure, such as SMILES, SLN or InChI
These approaches have been refined to allow representation of stereochemical differences and charges as well as special kinds of bonding such as those seen in organometallic compounds. The principal advantage of a computer representation is the possibility for increased storage and fast, flexible search.
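The two kinds of representation described in this section can be illustrated with a minimal Python sketch. It is not the layout of any real file format: the `atoms` and `bonds` lists, the function names and the choice of ethanol are assumptions made for illustration, and the linear string is produced by simply joining element symbols, which only works for an unbranched, acyclic, singly bonded chain.

```python
# Hypothetical minimal connection table for ethanol (CH3-CH2-OH), heavy atoms only.
atoms = ["C", "C", "O"]            # node attributes: element symbols
bonds = [(0, 1, 1), (1, 2, 1)]     # edges: (atom index, atom index, bond order)


def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric matrix whose entry [i][j] is the bond order between i and j."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bonds:
        matrix[i][j] = order
        matrix[j][i] = order
    return matrix


def depth_first_order(n_atoms, bonds, start=0):
    """Return atom indices in depth-first order, the traversal behind linear notations."""
    neighbours = {i: [] for i in range(n_atoms)}
    for i, j, _ in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    seen, order, stack = set(), [], [start]
    while stack:
        atom = stack.pop()
        if atom in seen:
            continue
        seen.add(atom)
        order.append(atom)
        # Push neighbours in reverse so that lower indices are visited first.
        stack.extend(reversed(neighbours[atom]))
    return order


matrix = adjacency_matrix(len(atoms), bonds)
# Naive linear notation: valid only for unbranched, acyclic, singly bonded chains.
linear = "".join(atoms[i] for i in depth_first_order(len(atoms), bonds))
print(matrix)
print(linear)
```

Real linear notations add symbols for branches, ring closures, bond orders, charges and stereochemistry, all of which this sketch leaves out.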
Search
Chemists can search databases using parts of structures, parts of their IUPAC names, or constraints on properties. Chemical databases differ from general-purpose databases particularly in their support for substructure search. This kind of search is achieved by looking for subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of graph theory. The algorithms for searching are computationally intensive, often of O(n^3) or O(n^4) time complexity (where n is the number of atoms involved). The intensive component of search is called atom-by-atom searching (ABAS), in which a mapping of the search substructure atoms and bonds onto the target molecule is sought. ABAS searching usually makes use of Ullmann's algorithm or variations of it. Speedups are achieved by time amortization, that is, some of the time spent on search tasks is saved by using precomputed information. This precomputation typically involves the creation of bitstrings representing the presence or absence of molecular fragments. By looking at the fragments present in a search structure, it is possible to eliminate the need for ABAS comparison with target molecules that do not possess the fragments that are present in the search structure. This elimination is called screening (not to be confused with the screening procedures used in drug discovery). The bitstrings used for these applications are also called structural keys. The performance of such keys depends on the choice of the fragments used for constructing the keys and the probability of their presence in the database molecules. Another kind of key makes use of hash codes based on fragments derived computationally. These are called 'fingerprints', although the term is sometimes used synonymously with structural keys.
The amount of memory needed to store these structural-keys and fingerprints can be reduced by 'folding', which is achieved by combining parts of the key using bitwise-operations and thereby reducing the overall length.
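Screening with structural keys, and the folding of such keys, can be sketched in Python as follows. The fragment dictionary `FRAGMENT_BITS`, the function names and the hand-assigned fragment lists are assumptions made for illustration; a real system would derive the fragments from the stored structures and would follow a successful screen with the atom-by-atom search.

```python
# Hypothetical fragment dictionary: one bit position per predefined fragment.
FRAGMENT_BITS = {
    "benzene ring": 0, "carbonyl": 1, "hydroxyl": 2, "amine": 3,
    "halogen": 4, "nitro": 5, "ether": 6, "thiol": 7,
}
KEY_LENGTH = len(FRAGMENT_BITS)


def structural_key(fragments):
    """Set one bit for every fragment present in the molecule."""
    key = 0
    for name in fragments:
        key |= 1 << FRAGMENT_BITS[name]
    return key


def passes_screen(query_key, target_key):
    """True when every bit set in the query is also set in the target."""
    return (query_key & target_key) == query_key


def fold(key, length):
    """Halve the key length by OR-ing the upper half onto the lower half."""
    half = length // 2
    mask = (1 << half) - 1
    return (key & mask) | (key >> half)


# Fragment lists assigned by hand for illustration only.
database = {
    "phenol": structural_key(["benzene ring", "hydroxyl"]),
    "acetone": structural_key(["carbonyl"]),
    "aniline": structural_key(["benzene ring", "amine"]),
}
query = structural_key(["benzene ring", "hydroxyl"])

# Only molecules that pass the screen would go on to the expensive ABAS step.
candidates = [name for name, key in database.items() if passes_screen(query, key)]
print(candidates)
```

Because folding only merges bits with OR, a target that passes the unfolded screen also passes the folded one. The folded screen may let through additional false candidates, which the atom-by-atom search then rejects.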
Descriptors
All properties of molecules beyond their structure can be divided into physico-chemical and pharmacological attributes, also called descriptors. In addition, there exist various artificial and more or less standardized naming systems for molecules that supply more or less ambiguous names and synonyms. The IUPAC name is usually a good choice for representing a molecule's structure in a string that is both human-readable and unique, although it becomes unwieldy for larger molecules. Trivial names, on the other hand, abound with homonyms and synonyms and are therefore a bad choice as a defining database key. While physico-chemical descriptors like molecular weight, (partial) charge, solubility, etc. can mostly be computed directly from the molecule's structure, pharmacological descriptors can be derived only indirectly using involved multivariate statistics or experimental (screening, bioassay) results. To save computational effort, all of those descriptors can be stored along with the molecule's representation, and usually are.
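As a small illustration of a physico-chemical descriptor that can be computed directly from the structure and then stored with the record, the following Python sketch derives a molecular weight from atom counts. The `ATOMIC_MASS` table covers only four elements, and the record layout and field names are assumptions rather than the schema of any particular database.

```python
# Abridged standard atomic weights for a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(atom_counts):
    """Sum the atomic masses weighted by how often each element occurs."""
    return sum(ATOMIC_MASS[element] * count for element, count in atom_counts.items())


# Ethanol, C2H6O. The descriptor is computed once and stored with the record
# so that later searches do not have to recompute it.
ethanol_atoms = {"C": 2, "H": 6, "O": 1}
record = {
    "name": "ethanol",
    "smiles": "CCO",
    "molecular_weight": round(molecular_weight(ethanol_atoms), 3),
}
print(record)
```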
Similarity
There is no single definition of molecular similarity; however, the concept may be defined according to the application and is often described as an inverse of a measure of distance in descriptor space. Two molecules might be considered more similar, for instance, if the difference in their molecular weights is lower than when compared with others. A variety of other measures could be combined to produce a multivariate distance measure. Distance measures are often classified into Euclidean measures and non-Euclidean measures depending on whether the triangle inequality holds.
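One possible reading of "similarity as an inverse of distance in descriptor space" is sketched below in Python. The choice of two descriptors (molecular weight and heavy-atom count), the lack of scaling, and the mapping `1 / (1 + distance)` are all assumptions made for illustration. In practice descriptors would usually be scaled so that no single one dominates the distance.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Map a distance onto (0, 1]: identical vectors give 1.0 (an assumed convention)."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


# Descriptor vectors: (molecular weight, number of heavy atoms), left unscaled.
methanol = (32.04, 2)
ethanol = (46.07, 3)
benzene = (78.11, 6)

print(similarity(ethanol, methanol))
print(similarity(ethanol, benzene))
```

Under these assumptions ethanol comes out as more similar to methanol than to benzene, simply because the two alcohols lie closer together in this small descriptor space.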
Chemicals in the databases may be clustered into groups of 'similar' molecules based on similarities. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties may either be determined empirically or be computationally derived descriptors. One of the most popular clustering approaches is the Jarvis-Patrick (k-nearest neighbours) algorithm.
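The Jarvis-Patrick idea can be sketched in Python as follows. This is one common variant, written under stated assumptions: two items are joined when each appears in the other's list of k nearest neighbours and the two lists share at least `k_min` members. The function name, the parameters and the one-dimensional descriptor values are hypothetical.

```python
def jarvis_patrick(points, k, k_min, distance):
    """Cluster points by shared nearest neighbours (one variant of Jarvis-Patrick)."""
    n = len(points)
    # The k nearest neighbours of each point, excluding the point itself.
    neighbours = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: distance(points[i], points[j]),
        )
        neighbours.append(set(others[:k]))

    parent = list(range(n))  # union-find forest used to merge clusters

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in neighbours[j] and j in neighbours[i]
            shared = len(neighbours[i] & neighbours[j])
            if mutual and shared >= k_min:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical one-dimensional descriptor values forming two obvious groups.
values = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2]
clusters = jarvis_patrick(values, k=2, k_min=1, distance=lambda a, b: abs(a - b))
print(clusters)
```

The pairwise neighbour computation here is quadratic in the number of items, so a database with millions of molecules would need a more economical neighbour search than this sketch uses.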
In pharmacologically-oriented chemical repositories, similarity is usually defined in terms of the biological effects of compounds (ADME/tox) that can in turn be semiautomatically inferred from similar combinations of physico-chemical descriptors using QSAR methods.
Registration systems
Database systems for maintaining unique records on chemical compounds are termed registration systems. These are often used for chemical indexing, patent systems and industrial databases.
Registration systems usually enforce uniqueness of the chemical represented in the database through the use of unique representations. By applying rules of precedence for the generation of stringified notations, one can obtain unique/'canonical' string representations such as 'canonical SMILES'. Some registration systems such as the CAS system make use of algorithms to generate unique hash codes to achieve the same objective.
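How a rule of precedence and a hash code can together enforce uniqueness is sketched below in Python. This is a toy and not a canonical SMILES algorithm or the CAS procedure: `canonical_form` handles only unbranched chains written with single-letter atom symbols, where the same molecule can be written in two directions, and it keeps the lexicographically smaller string. The `Registry` class and its methods are hypothetical.

```python
import hashlib


def canonical_form(chain):
    """Toy rule of precedence: of the two reading directions of an unbranched
    chain, keep the lexicographically smaller string."""
    return min(chain, chain[::-1])


class Registry:
    """Keeps one record per distinct canonical string, keyed by a hash code."""

    def __init__(self):
        self._records = {}  # hash code -> canonical string

    def register(self, chain):
        canonical = canonical_form(chain)
        # Truncated only to keep the code short for display.
        code = hashlib.sha256(canonical.encode("ascii")).hexdigest()[:12]
        self._records.setdefault(code, canonical)
        return code

    def __len__(self):
        return len(self._records)


registry = Registry()
first = registry.register("CCO")   # ethanol written from the carbon end
second = registry.register("OCC")  # the same molecule written from the oxygen end
third = registry.register("CCC")   # propane, a different molecule
print(first == second, len(registry))
```

Both spellings of ethanol map to the same code, so the registry holds two records rather than three.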
A key difference between a registration system and a simple chemical database is the ability to accurately represent that which is known, unknown, and partially known. For example, a chemical database might store a molecule with stereochemistry unspecified, whereas a chemical registry system requires the registrar to specify whether the stereo configuration is unknown, a specific (known) mixture, or racemic. Each of these would be considered a different record in a chemical registry system.
Registration systems also preprocess molecules so that trivial differences, such as differences in halogen ions in chemicals, are not treated as distinct records.
An example is the Chemical Abstracts Service (CAS) registration system [1]. See also CAS registry number.
Tools
The computational representations are usually made transparent to chemists by graphical display of the data. Data entry is also simplified through the use of chemical structure editors. These editors internally convert the graphical data into computational representations. There are also numerous algorithms for the interconversion of various formats of representation. An open-source utility for conversion is OpenBabel. These search and conversion algorithms are implemented either within the database system itself or, as is now the trend, as external components that fit into standard relational database systems. Both Oracle and PostgreSQL based systems make use of cartridge technology that allows user-defined datatypes. These allow the user to make SQL queries with chemical search conditions. For example, a query to search for records having a benzene ring in their structure, represented as a SMILES string in a SMILESCOL column, could be
- SELECT * FROM CHEMTABLE WHERE SMILESCOL.CONTAINS('c1ccccc1')
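The idea of extending SQL with a chemical search condition can be imitated on a small scale with Python's built-in `sqlite3` module, which lets a program register its own SQL functions. The sketch below is only an analogue of a database cartridge and not the Oracle or PostgreSQL mechanism: the `KEY_CONTAINS` function, the `KEYCOL` column and the hand-assigned fragment bits are assumptions, and the function performs a structural-key screen rather than a true substructure match.

```python
import sqlite3

# Hypothetical fragment bits, assigned by hand for illustration.
BENZENE_RING, HYDROXYL, CARBONYL = 1, 2, 4


def key_contains(target_key, query_key):
    """Return 1 when every fragment bit of the query is set in the target key."""
    return int((target_key & query_key) == query_key)


conn = sqlite3.connect(":memory:")
# Register the Python function so that it can be called from SQL.
conn.create_function("KEY_CONTAINS", 2, key_contains)
conn.execute("CREATE TABLE CHEMTABLE (NAME TEXT, SMILESCOL TEXT, KEYCOL INTEGER)")
conn.executemany(
    "INSERT INTO CHEMTABLE VALUES (?, ?, ?)",
    [
        ("phenol", "Oc1ccccc1", BENZENE_RING | HYDROXYL),
        ("acetone", "CC(C)=O", CARBONYL),
        ("toluene", "Cc1ccccc1", BENZENE_RING),
    ],
)

rows = conn.execute(
    "SELECT NAME FROM CHEMTABLE WHERE KEY_CONTAINS(KEYCOL, ?) ORDER BY NAME",
    (BENZENE_RING,),
).fetchall()
names = [row[0] for row in rows]
print(names)
conn.close()
```

A production cartridge would run the full substructure match inside the database engine. The screen here only narrows the rows that such a match would need to examine.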
Algorithms for the conversion of IUPAC names to structure representations and vice versa are also used for extracting structural information from text. However, there are difficulties due to the existence of multiple dialects of IUPAC nomenclature. Work is under way to establish a unique IUPAC standard (see InChI).
References
A good discussion group for topics on this subject is the CCL (email list). http://www.ccl.net/chemistry/
See also
- Beilstein database, one of the largest databases in the area of organic chemistry
External links
- CDK, a Java open source library for chemical data handling
- Chemfolder, a PC-based software application for structure and reaction database manipulation
- ChemADVISOR, Inc., creator of the LOLI Database
- Chemical Abstracts Service
- eMolecules, the free chemistry search engine
- JOELib, a Java chemical data handling software library
- JChem, a Java-based suite of API toolkits for structure management, search, editing and other cheminformatics tasks; free for academic teaching and research and for implementation on free web pages
- OpenBabel, a program to interconvert different chemical formats
- Organic synthesis database
- ZINC, a free database for virtual screening
- JCHEM.INFO, Free Organic And Inorganic Chemicals Database, Compound Physical Data
|