The Protein Ontology

An Introduction to the Protein Ontology

What does the Protein Ontology do?

The Protein Ontology (PO) is a means of formalizing protein data and knowledge. It comprises the concepts or terms relevant to the domain, definitions of those concepts, and defined relationships between them. PO integrates protein data formats and provides a structured, unified vocabulary for representing protein synthesis concepts. It also integrates heterogeneous protein and biological data sources. PO converts the enormous amounts of data collected by geneticists and molecular biologists into information that scientists, physicians, and other health care professionals and researchers can use to understand the mapping of relationships inside protein molecules, interactions between two protein molecules, and interactions between proteins and other macromolecules at the cellular level. PO also helps to codify proteomics data for analysis by researchers.
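The three ingredients named above (concepts, their definitions, and defined relationships between them) can be pictured with a minimal Python sketch. The class names, concept names, and the "has_part" relation are illustrative assumptions and are not taken from the actual PO hierarchy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    name: str
    definition: str
    # Maps a relation name to the names of the target concepts.
    relations: Dict[str, List[str]] = field(default_factory=dict)


class Ontology:
    """Toy ontology: defined concepts plus named relationships between them."""

    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}

    def add(self, name: str, definition: str) -> None:
        self.concepts[name] = Concept(name, definition)

    def relate(self, source: str, relation: str, target: str) -> None:
        # A relationship is only allowed between concepts that are defined.
        for name in (source, target):
            if name not in self.concepts:
                raise KeyError(f"undefined concept: {name}")
        self.concepts[source].relations.setdefault(relation, []).append(target)

    def related(self, source: str, relation: str) -> List[str]:
        return self.concepts[source].relations.get(relation, [])


# Hypothetical concepts, for illustration only.
toy = Ontology()
toy.add("Protein", "A macromolecule made of one or more chains.")
toy.add("Chain", "A single polypeptide chain.")
toy.relate("Protein", "has_part", "Chain")
print(toy.related("Protein", "has_part"))
```

Requiring both ends of a relationship to be defined concepts is the point of the sketch: every term used in the vocabulary has a definition attached.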

What PO is NOT

Traditional approaches to integrating protein data generally rely on keyword searches, which immediately exclude unannotated or poorly annotated data. They also exclude proteins annotated with synonyms unknown to the user. Moreover, some biological resources do not record the source of the data retrieved in this manner, so there is no evidence to support the annotation.

An alternative approach to protein annotation relies on sequence identity, structural similarity, or functional identification. The success of this method depends on the family to which the protein belongs: some proteins have a high degree of sequence identity, structural similarity, or functional similarity that is unique to members of their family alone. Consequently, this approach cannot be generalized to integrate protein data.

Clearly, these traditional approaches are limited in their ability to capture and integrate data for protein annotation. For these reasons, we have adopted an alternative method that does not rely on keywords or similarity metrics but instead uses an ontology.
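The synonym problem described above can be illustrated with a small Python sketch. It contrasts a plain keyword search with a lookup that first resolves each label to a shared concept. The records, the synonym table, and the function names are assumptions made for illustration; they do not describe how PO itself resolves terms.

```python
# Hypothetical records whose labels are different spellings of one protein.
records = [
    {"id": "r1", "label": "hemoglobin"},
    {"id": "r2", "label": "haemoglobin"},
    {"id": "r3", "label": "Hb"},
]

# Hypothetical synonym table mapping each label to one canonical concept.
SYNONYMS = {
    "hemoglobin": "Hemoglobin",
    "haemoglobin": "Hemoglobin",
    "hb": "Hemoglobin",
}


def keyword_search(items, keyword):
    """Return ids of records whose label contains the keyword."""
    needle = keyword.lower()
    return [r["id"] for r in items if needle in r["label"].lower()]


def concept_search(items, concept):
    """Return ids of records whose label resolves to the given concept."""
    return [r["id"] for r in items if SYNONYMS.get(r["label"].lower()) == concept]


print(keyword_search(records, "hemoglobin"))  # misses the synonyms
print(concept_search(records, "Hemoglobin"))  # finds all three records
```

The keyword search returns only the record spelled exactly as the user typed it, while the concept lookup also returns the records annotated with synonyms the user may not know.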

Protein Annotation using Protein Ontology

In the context of protein data, annotation generally refers to all information about a protein other than its sequence. In a collection of protein data, each protein is labeled with at least an identifier and is usually complemented by annotations, either as free text or as codified information such as the names of the authors responsible for that protein or the submission date of the protein data. Annotation has become a challenge in proteomics given the size and complexity of protein complexes and their structures.

In the Protein Ontology, we deal mainly with two sources of protein annotation: (1) annotations taken from various protein data sources, submitted by the authors of the protein data themselves on the basis of their published experimental results; and (2) annotations obtained by an annotator or group of annotators through analysis of raw data (typically a protein sequence or an atomic structure description) with various tools that extract biological information from other protein data collections.
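One way to keep these two sources of annotation apart is to record the origin and the supporting evidence alongside every annotation. The Python sketch below assumes a simple record layout; the field names, identifiers, and values are hypothetical and do not reflect the actual PO schema.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    AUTHOR_SUBMITTED = "author_submitted"  # source (1): from published results
    DERIVED = "derived"                    # source (2): from analysis of raw data


@dataclass(frozen=True)
class Annotation:
    protein_id: str   # the identifier every protein carries
    key: str          # what is being annotated
    value: str        # free text or a codified value
    origin: Origin
    evidence: str     # data source or tool that supports the annotation


def by_origin(annotations, origin):
    """Select the annotations that come from one kind of source."""
    return [a for a in annotations if a.origin is origin]


# Hypothetical example data.
annotations = [
    Annotation("example-1", "submission_date", "2005-03-01",
               Origin.AUTHOR_SUBMITTED, "source database entry"),
    Annotation("example-1", "predicted_domain", "example domain",
               Origin.DERIVED, "hypothetical domain prediction tool"),
]

for a in by_origin(annotations, Origin.DERIVED):
    print(a.protein_id, a.key, a.value, a.evidence)
```

Storing the evidence field with each record addresses the concern raised earlier, that some resources keep no trace of where an annotation came from.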

PO Instance Store

The Protein Instance Store contains instances of protein data in the PO format. PO provides the technical and scientific infrastructure to allow evidence-based description and analysis of relationships between proteins. PO uses data sources such as PDB, SCOP, and OMIM, along with published scientific literature, to gather protein data. The PO database is represented using the Web Ontology Language (OWL). Protein data currently in PO format is available on the Downloads page.
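To give a feel for what an OWL instance looks like and how it can be read, the Python sketch below parses a tiny RDF/XML fragment with the standard library. The namespace http://example.org/po#, the class "Chain", and the property "hasResidueCount" are invented for illustration; the real PO files on the Downloads page will use their own names and structure.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/po#"  # hypothetical namespace, not the real PO one

# Hypothetical OWL (RDF/XML) fragment: one class and one instance of it.
OWL_TEXT = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:ex="http://example.org/po#">
  <owl:Class rdf:ID="Chain"/>
  <ex:Chain rdf:ID="chain_A">
    <ex:hasResidueCount>141</ex:hasResidueCount>
  </ex:Chain>
</rdf:RDF>
"""


def list_instances(owl_text):
    """Return (instance id, class name) pairs for top-level typed nodes."""
    root = ET.fromstring(owl_text.strip())
    prefix = "{" + EX_NS + "}"
    found = []
    for node in root:
        # A top-level element in the example namespace is an instance whose
        # element name is its class.
        if node.tag.startswith(prefix):
            class_name = node.tag[len(prefix):]
            found.append((node.get("{" + RDF_NS + "}ID"), class_name))
    return found


print(list_instances(OWL_TEXT))
```

A dedicated RDF or OWL library would be the more robust choice for real files, since RDF/XML allows several equivalent ways of writing the same statement; the standard-library parser is used here only to keep the sketch self-contained.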

Mining PO Instance Store

The PO instance store covers proteins from a range of species, from bacterial and plant proteins to human proteins. That PO can represent such varied data generically shows the strength of the PO format. We applied several standard hierarchical and tree mining algorithms to the PO data. We compared MB3-Miner (MB3), X3-Miner (X3), VTreeMiner (VTM), and PatternMatcher (PM) for mining embedded subtrees, and IMB3-Miner (IMB3) and FREQT (FT) for mining induced subtrees. Interestingly, on the PO dataset the number of frequent candidate subtrees generated is identical for all of the major data mining algorithms (see figure below). This indicates that the subtrees generated from the PO dataset are the same for every algorithm. The conceptual framework of PO therefore provides a powerful hierarchical classification of concepts, which in turn provides consistency and accuracy across the observations of various analysis and reasoning methodologies.
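The difference between induced and embedded subtrees can be shown on the smallest possible pattern, a pair of labels. In an induced subtree the pair must be parent and child; in an embedded subtree the second label only has to be a descendant of the first. The Python sketch below counts how many trees support a given pair under each definition. It is a simplification for illustration and is not an implementation of MB3-Miner, FREQT, or any of the other algorithms named above; the tree labels are hypothetical.

```python
# A tree is a (label, children) tuple, where children is a list of trees.

def _contains(tree, target, deep):
    """True if the root of `tree` is `target`, or (when deep) any descendant is."""
    label, children = tree
    if label == target:
        return True
    return deep and any(_contains(c, target, True) for c in children)


def has_pair(tree, upper, lower, embedded):
    """True if `lower` occurs below some `upper` node in `tree`.

    embedded=False: `lower` must be a direct child (induced pattern).
    embedded=True:  `lower` may be any descendant (embedded pattern).
    """
    label, children = tree
    if label == upper and any(_contains(c, lower, embedded) for c in children):
        return True
    return any(has_pair(c, upper, lower, embedded) for c in children)


def support(trees, upper, lower, embedded):
    """Number of trees in which the two-node pattern occurs."""
    return sum(has_pair(t, upper, lower, embedded) for t in trees)


# Hypothetical instance trees.
trees = [
    ("Protein", [("Chain", [("Residue", [])])]),
    ("Protein", [("Residue", [])]),
]

print(support(trees, "Protein", "Residue", embedded=False))
print(support(trees, "Protein", "Residue", embedded=True))
```

In the first tree "Residue" is a grandchild of "Protein", so the pair counts as an embedded pattern but not as an induced one. Full miners generalize this idea to patterns of arbitrary size and keep only those whose support reaches a chosen threshold.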

Future Work for PO

For protein functional classification, the following factors are relevant in addition to the presence of domains, motifs, or functional residues: (a) similarity of three-dimensional protein structures, (b) proximity of genes (which may indicate that the proteins they produce are involved in the same pathway), (c) metabolic functions of organisms, and (d) the evolutionary history of the protein. At present, PO’s Functional Domain Classification does not address gene proximity or the evolutionary history of proteins. These factors will be added in the future to complete the Functional Domain Classification system in PO. In addition, the Constraints defined in PO are not yet mapped back to the protein sequence, structure, and function they affect. Achieving this mapping will interlink all the concepts of PO.

 

 


All information on this website is copyright © 2002-2008 Protein Ontology Consortium. Permission to use the information contained in this ontology was given by the researchers and institutes who contributed or published the information. Users of the ontology are solely responsible for compliance with any copyright restrictions, including those applying to the author abstracts. Documents from this server are provided "AS-IS" without any warranty, expressed or implied.