Institute of Bioinformatics

DATABASES

NetPath (http://www.netpath.org):

NetPath is currently one of the largest open-source repositories of human signaling pathways and is positioned to become a community standard for meeting the challenges in functional genomics and systems biology. Signaling networks are key to deciphering many of the complex networks that govern the machinery inside the cell. Several signaling molecules play an important role in disease processes that result directly from their altered functioning, and they are now recognized as potential therapeutic targets. Understanding how to restore the proper functioning of pathways that have become deregulated in disease is needed to accelerate biomedical research. This resource aims to demystify biological pathways and to highlight the key relationships and connections between them. In addition, pathways provide a way of reducing the dimensionality of high-throughput data by grouping thousands of genes, proteins and metabolites at the functional level into just a few hundred pathways per experiment. Identifying the active pathways that differ between two conditions can have more explanatory power than a simple list of differentially expressed genes and proteins.
The scientific literature was mined thoroughly to catalog all the significant molecular interactions in single ligand-stimulated, receptor-mediated signaling pathways. Apart from protein-protein interactions, enzyme-substrate reactions, protein translocation events and gene regulation were also cataloged. Each pathway reaction is linked to its experimental evidence in the form of the PubMed IDs of the articles from which it was mined. To account for the heterogeneity inherent in the experimental validation of different pathway reactions and the diversity of publicly available data, a set of very stringent criteria was applied for curation and for the generation of the pathway maps.
These pathways are freely available for download in various formats such as BioPAX, PSI-MI and SBML. The availability of data in different formats allows interoperability between various pathway analysis software tools such as Cytoscape and VISIBIOweb. To provide a better visual interface to the molecular reactions in NetPath, pathway maps were generated using PathVisio, an improved visualization tool that incorporates features of GenMAPP. These pathway maps are available through another resource called NetSlim (http://www.netpath.org/netslim) that was also developed at the Institute. The NetSlim versions of various pathways can be downloaded in .gpml, .GenMAPP, .png and .pdf formats.
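
The idea that pathways condense a long gene list into a short, more interpretable one can be illustrated with a simple over-representation test. The sketch below is not part of NetPath; it assumes that pathway membership has already been extracted into plain gene sets, and the pathway names, gene identifiers and function names are hypothetical. It ranks pathways by a one-sided hypergeometric p-value and applies no multiple-testing correction, which a real analysis would need.

```python
from math import comb


def hypergeom_tail(overlap: int, universe: int, in_pathway: int, drawn: int) -> float:
    """P(X >= overlap) when `drawn` genes are sampled without replacement
    from `universe` genes, `in_pathway` of which belong to the pathway."""
    upper = min(in_pathway, drawn)
    favourable = sum(
        comb(in_pathway, i) * comb(universe - in_pathway, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(universe, drawn)


def rank_pathways(pathways, hits, background):
    """Return (name, overlap, p_value) tuples sorted by ascending p-value."""
    hits = hits & background  # ignore hits outside the measured universe
    ranked = []
    for name, members in pathways.items():
        members = members & background
        overlap = len(members & hits)
        p = hypergeom_tail(overlap, len(background), len(members), len(hits))
        ranked.append((name, overlap, p))
    return sorted(ranked, key=lambda row: row[2])


# Toy data (hypothetical): 100 measured genes, two pathways, seven hits.
background = {f"g{i}" for i in range(100)}
pathways = {
    "pathway_A": {f"g{i}" for i in range(0, 10)},
    "pathway_B": {f"g{i}" for i in range(10, 30)},
}
hits = {"g0", "g1", "g2", "g3", "g4", "g10", "g50"}

for name, overlap, p in rank_pathways(pathways, hits, background):
    print(f"{name}: overlap={overlap}, p={p:.2e}")
```

In this toy case five of the seven hits fall in the ten-gene pathway, so it ranks first, while the single hit in the larger pathway is consistent with chance.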

Human Protein Reference Database (http://www.hprd.org/):

The Human Protein Reference Database (HPRD) is a centralized platform that visually depicts and integrates information pertaining to each protein in the human proteome. It contains manually curated scientific information on the biology of most human proteins. HPRD is the result of an international collaborative effort between the Institute of Bioinformatics and the Pandey lab at Johns Hopkins University in Baltimore, USA. The National Center for Biotechnology Information provides links to HPRD through its gene and protein databases (e.g. Entrez Gene, RefSeq protein).

All the information in HPRD has been manually curated by expert biologists who read, interpret and analyze the published literature. This resource depicts information on human protein functions including protein–protein interactions, post-translational modifications, enzyme-substrate relationships and disease associations. The protein–protein interaction and subcellular localization data from HPRD have been used to develop a human protein interaction network. Information regarding proteins involved in human diseases is also annotated and linked to the Online Mendelian Inheritance in Man (OMIM) database.

HPRD was created using an object-oriented database in Zope, an open-source web application server that provides versatility in query functions and allows data to be displayed dynamically. As HPRD continues to evolve with newer entries, the number of unannotated genes and proteins is falling rapidly, which allows us to expand the scope of our curation. The data from HPRD can be freely accessed and used by academic users, while commercial entities are required to obtain a license for use.

Goals:

• The main goal in creating HPRD was to curate the world's literature on known and well-characterized proteins, which will in turn create a centralized knowledgebase of protein data.

• Create a more robust curation system. Curation systems need to be continually updated to include current research.

• Enable future discoveries and empower scientists in their work. As we move into next-generation sequencing technologies, the world is focused less on individual genes and more on high-throughput studies involving thousands of genes at a time.

• Study systems biology approaches and aid in biomarker discovery.

• Perform complex queries involving multiple features of proteins.

Highlights of HPRD are as follows:

• From 10,000 protein–protein interactions (PPIs) annotated for 3,000 proteins in 2003, HPRD grew to 39,194 unique PPIs annotated for 30,047 proteins, including more than 6,360 isoforms, by the end of 2012.

• More than 50% of molecules annotated in HPRD have at least one PPI and 10% have more than 10 PPIs.

• Experiments for PPIs are broadly grouped into three categories: in vitro, in vivo and yeast two-hybrid (Y2H). Sixty percent of PPIs annotated in HPRD are supported by a single experiment, whereas 26% are annotated with two of the three experimental methods.

• HPRD contains 18,000 manually curated post-translational modification (PTM) records belonging to 26 different types of modifications.

• All the phosphorylation-based motifs for any protein of interest can be analyzed using PhosphoMotifFinder in HPRD. This tool connects the proteomic data in HPRD to over 320 experimentally proven phosphorylation-based motifs curated from the literature. Phosphorylation is the leading type of protein modification, contributing 63% of the PTM data annotated in HPRD.

• HPRD data are available for download in tab-delimited and XML file formats.

• HPRD also integrates data from Human Proteinpedia, a community portal for integrating human protein data.
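
Summary figures such as the share of proteins with at least one PPI, or the share of PPIs backed by a single experimental method, can be recomputed from a tab-delimited interaction export. The following is a minimal sketch under stated assumptions: the three-column layout (interactor A, interactor B, experiment type), the sample rows and the function names are hypothetical, and the actual HPRD flat files may be organized differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout: interactor_a <TAB> interactor_b <TAB> experiment_type
SAMPLE = (
    "P1\tP2\tin vivo\n"
    "P1\tP3\tyeast two-hybrid\n"
    "P2\tP1\tin vitro\n"   # same pair as row 1, reported in the other direction
    "P4\tP4\tin vitro\n"   # self-interaction
)


def load_interactions(handle):
    """Return (partners per protein, experiment types per unordered pair)."""
    partners = defaultdict(set)
    methods = defaultdict(set)
    for a, b, method in csv.reader(handle, delimiter="\t"):
        partners[a].add(b)
        partners[b].add(a)
        methods[frozenset((a, b))].add(method)
    return partners, methods


def fraction_with_more_than(partners, proteins, threshold):
    """Fraction of `proteins` that have more than `threshold` distinct partners."""
    return sum(len(partners.get(p, ())) > threshold for p in proteins) / len(proteins)


partners, methods = load_interactions(io.StringIO(SAMPLE))
all_proteins = ["P1", "P2", "P3", "P4", "P5"]  # P5 has no annotated PPI

at_least_one = fraction_with_more_than(partners, all_proteins, 0)
single_method = sum(len(m) == 1 for m in methods.values()) / len(methods)
print(f"proteins with >= 1 PPI: {at_least_one:.0%}")
print(f"PPIs with one method:   {single_method:.0%}")
```

Interactions are keyed on the unordered pair so that A-B and B-A rows count once, and the denominator for the per-protein fraction is the full protein list rather than only the proteins that appear in the interaction file.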

Milestones achieved and comparison with other publicly available databases:

• HPRD is currently one of the richest sources of PPI data among publicly available databases, as shown in a comparative study.

• This is the only completely manually curated database that assimilates PPIs, PTMs, subcellular localization, tissue expression, biological motifs and domains derived from a variety of experimental platforms.

• HPRD receives nearly 148,000 hits a year and about 400 visitors per day.

• To date, it has been cited 1,827 times in the scientific literature.

• To the best of our knowledge, data from HPRD, Human Proteinpedia and RAPID databases are the only datasets from India that have been incorporated into NCBI databases such as Entrez Gene and RefSeq.

Human Proteinpedia (http://www.humanproteinpedia.org/):

Human Proteinpedia was developed as a community portal for sharing and integrating human proteomic data over the World Wide Web. Through this portal, research labs all over the world can contribute and upload their experimental data. This initiative is an effort to bring together the entire biomedical community and will enable dissemination of valuable proteomic data. It will empower scientists to take advantage of information that is at present confined to particular research labs. Such a concerted effort will help enrich this database and minimize the redundancy inherent in most other publicly available databases.

Data pertaining to post-translational modifications, protein-protein interactions, tissue expression, expression in cell lines, subcellular localization and enzyme-substrate relationships can be submitted to Human Proteinpedia. It also gives proteomic investigators an effective means of sharing unpublished data.

Human Proteinpedia currently contains over 4.8 million MS/MS spectra and ~2 million peptides and is an important resource for cataloging proteotypic peptides (which serve as a unique identifier of a given protein or isoform in tandem MS experiments) that can be used for biomarker analysis using MRM (Multiple Reaction Monitoring).
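
The notion of a proteotypic peptide as a unique identifier can be made concrete with an in-silico digest. The sketch below is an illustration only, not the procedure used by Human Proteinpedia: it applies the common trypsin rule (cleave after K or R unless the next residue is P), ignores missed cleavages and observability in a mass spectrometer, and uses made-up sequences, an arbitrary minimum length of six residues and hypothetical function names.

```python
import re
from collections import defaultdict


def tryptic_digest(sequence):
    """Split after K or R, except when the next residue is P (no missed cleavages)."""
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", sequence) if pep]


def proteotypic_peptides(proteins, min_length=6):
    """Map each protein name to the peptides that occur in no other protein."""
    owners = defaultdict(set)
    for name, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            if len(peptide) >= min_length:
                owners[peptide].add(name)
    unique = defaultdict(list)
    for peptide, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].append(peptide)
    return {name: sorted(peps) for name, peps in unique.items()}


# Made-up sequences: two isoforms sharing an N-terminus, plus an unrelated protein.
proteins = {
    "isoform_1": "MKTAYIAKQRLSPEEK",
    "isoform_2": "MKTAYIAKQRGGWPEEK",
    "other_protein": "MSLSPEEKAAR",
}

for name, peptides in proteotypic_peptides(proteins).items():
    print(name, peptides)
```

The shared peptide TAYIAK is discarded because it cannot distinguish the two isoforms, which is the property that makes the remaining peptides candidates for MRM assays.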

Human Proteinpedia also provides a list of phosphopeptides identified in mass spectrometry-based phosphoproteomic studies, and the phosphorylation or dephosphorylation data curated from the literature have been mapped to the corresponding site and residue of sequences in HPRD. This is useful to investigators in the development of phospho-specific antibodies and peptide arrays.
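
Mapping a reported site onto a reference sequence usually involves checking that the annotated residue is actually present at that position and then extracting the flanking sequence, for example as a candidate peptide for an array or antibody design. The following is a minimal sketch of that step; the sequence, the function name, the restriction to S/T/Y and the '-' padding at the termini are assumptions made for illustration and do not describe how HPRD stores its data.

```python
PHOSPHO_RESIDUES = {"S", "T", "Y"}  # simplification: only the common phospho-acceptors


def site_window(sequence: str, position: int, residue: str, flank: int = 7) -> str:
    """Return the window around a 1-based site, with the site in lower case
    and '-' padding where the window runs past either terminus."""
    if residue not in PHOSPHO_RESIDUES:
        raise ValueError(f"unsupported residue: {residue}")
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    index = position - 1
    if sequence[index] != residue:
        raise ValueError(
            f"expected {residue} at position {position}, found {sequence[index]}"
        )
    left = sequence[max(0, index - flank):index]
    right = sequence[index + 1:index + 1 + flank]
    return left.rjust(flank, "-") + residue.lower() + right.ljust(flank, "-")


sequence = "MASPTRRSSLEYK"  # made-up sequence

print(site_window(sequence, 8, "S", flank=3))
print(site_window(sequence, 3, "S", flank=3))
print(site_window(sequence, 12, "Y", flank=3))
```

Raising an error on a residue mismatch, rather than silently shifting the position, makes it easier to spot sites that were reported against a different isoform or sequence version.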

Protein annotations in Human Proteinpedia are derived from a number of technology platforms, such as co-immunoprecipitation, fluorescence-based, western blotting or mass spectrometry-based experiments, immunohistochemical analysis, yeast two-hybrid, and protein and peptide microarrays.

Statistics to date:

Annotations	HPRD	Human Proteinpedia
Protein entries	30047	15231
Mode of Data entry	Manual curation by experts from literature	Experimentally verified data over the web
Number of contributing labs	2 to 6	75
Protein-protein interactions	39194	34624
PTMs	93710	17410
Protein Expression	112158	150368
Subcellular Localization	22490	2906
Domains	470	NA
PubMed Links	453521	NA
MS/MS Spectra	NA	4855122
Number of experiments	NA	2710

Plasma Proteome Database (http://www.plasmaproteomedatabase.org/):

The Plasma Proteome Database (PPD), the first of its kind, is a comprehensive resource for all human plasma proteins. The database includes information pertaining to isoform-specific expression, disease, localization, post-translational modifications and single nucleotide polymorphisms. The information in this database was obtained through manual annotation based on exhaustive mining of the published literature.

Statistics to date:
Unique Genes	9,297
Proteins & Isoforms	15,747
PTMs	40,997
PubMed Links	24,838

Other Databases Developed at IOB:

Resource of Asian Primary Immunodeficiency Diseases (http://rapid.rcai.riken.jp/RAPID):

Resource of Asian Primary Immunodeficiency Diseases (RAPID) is a web-based compendium of molecular alterations in primary immunodeficiency diseases (PIDs). Detailed information about genes and proteins that are affected in primary immunodeficiency diseases is presented along with other pertinent information about protein-protein interactions, microarray gene expression profiles in various organs and cells of the immune system, and mouse studies. RAPID also hosts a tool, the mutation viewer, to predict deleterious and novel mutations and to visualize the mutation positions on the DNA sequence, protein sequence and three-dimensional structure for PID genes. The information in this database should be useful to researchers as well as clinicians.
RAPID is the result of a collaboration between the Institute of Bioinformatics and the Immunogenomics research group at the RIKEN Research Center for Allergy and Immunology in Yokohama, Japan.

India Cancer Research Database (http://www.incredb.org/):

India Cancer Research Database (ICRD) provides details of scientists and physicians involved in cancer research in India, along with information about their areas of expertise, research publications and funded grants. The main goal of the database is to foster collaborations among researchers and to provide a snapshot of ongoing research initiatives and activities in India.

TBnet (http://tbnetindia.ibioinformatics.org/):

TBNet India was developed by IOB as an initiative of the Department of Biotechnology, Government of India, with active collaboration from 13 institutions across India. This resource places special focus on Indian contributions to research and issues related to tuberculosis. M. tuberculosis is an acid-fast bacterium (it stains only weakly Gram-positive) that causes tuberculosis, a leading cause of infectious disease mortality. The M. tuberculosis genome was sequenced in 1998. About 1.5 million people die from tuberculosis each year, and it is thought that as many as 2 billion people (one third of the human population) may be infected with M. tuberculosis. It is estimated that 80% of the Asian and African population tests positive in tuberculin tests, while only 5-10% of the United States population tests positive. People with compromised immunity, largely due to high rates of HIV infection, have a higher chance of developing the disease. This problem is compounded by the appearance of drug-resistant TB strains, including strains with multiple drug resistance (MDR) and, more recently, strains with extensive drug resistance (XDR), which are much more difficult to treat and pose a significant public health threat. Tuberculosis has an estimated mortality rate of ~49 per 100,000 people per year in India. TBNet India endeavors to gather clinical, epidemiological and molecular data and make them available to the biomedical community.