Selecting a Data Repository
The choice of data repository depends largely on the kind of data to be shared with the scientific community. For the data to be fully usable, both the data and the accompanying metadata must be made accessible. Data format standards, controlled vocabularies and MI (Minimum Information) reporting guidelines are important here because they allow the data to be annotated with all the required information.
For sharing raw and minimally processed XML data, Proteome Commons Tranche could be the repository of choice. For processed data, on the other hand, these requirements are best satisfied by centralized, standards-compliant data repositories. In proteomics, the three most prominent databases are the Global Proteome Machine Database (GPMDB) (Craig et al., 2004), the PRoteomics IDEntification database (PRIDE) (Vizcaino et al., 2010), and PeptideAtlas (Deutsch et al., 2010). A brief description of each repository is given below to help decide which one is most suitable for a given dataset.
The MASCP recommendation is to submit data to PRIDE, because this database is compliant with the standards developed by the Proteomics Standards Initiative (PSI) (Taylor et al., 2007) and because it can hold virtually any kind of processed proteome data. Details on how to submit data to PRIDE can be found in the instructions that follow.
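The guidance above can be summarised as a small decision rule. The following Python sketch is purely illustrative: the `DataKind` categories and the `suggest_repository` function are hypothetical names introduced here, and the rule encodes only the recommendation stated in this section, not any official tool.

```python
from enum import Enum


class DataKind(Enum):
    """Hypothetical categories for the data to be shared."""
    RAW = "raw"                                  # unprocessed instrument output
    MINIMALLY_PROCESSED = "minimally_processed"  # e.g. minimally processed XML
    PROCESSED = "processed"                      # identifications plus evidence


def suggest_repository(kind: DataKind) -> str:
    """Return the repository suggested in this section for a kind of data."""
    if kind in (DataKind.RAW, DataKind.MINIMALLY_PROCESSED):
        # Raw and minimally processed XML data: Tranche could be the choice.
        return "Proteome Commons Tranche"
    # Processed data: the MASCP recommendation is to submit to PRIDE.
    return "PRIDE"


if __name__ == "__main__":
    for kind in DataKind:
        print(kind.value, "->", suggest_repository(kind))
```

GPMDB and PeptideAtlas are deliberately not returned by this sketch, because the text describes them as alternatives whose suitability depends on the dataset rather than giving a fixed rule for them.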
Data Repositories
PRoteomics IDEntification database - PRIDE
PRIDE is a centralized, standards-compliant, open-source, public repository for proteomics data that allows users to query, submit and retrieve data. It was developed to provide the proteomics community with a public repository for protein and peptide identifications together with the evidence supporting these identifications. PRIDE does not provide an analysis pipeline; instead, it accepts the actual experimental results obtained by the researcher as result files in PRIDE XML format. PRIDE XML files can be created from a variety of commonly used data formats (e.g. Mascot DAT files, Sequest result files, X!Tandem, mzXML, OMSSA, PeptideProphet/ProteinProphet) using the PRIDE Converter tool. PRIDE therefore accepts data produced by various search algorithms or by in-house data analysis workflows. PRIDE is now also starting to support some quantitative proteomics approaches.
PRIDE is accompanied by the PRIDE BioMart, which allows public PRIDE data to be retrieved from a query-optimised data warehouse that is synchronised with the main PRIDE database. The BioMart interface supports simple or complex queries, with full control over both how the data are filtered and which attributes are included in the results. In addition, the BioMart interface allows PRIDE data to be integrated with other widely used resources such as Ensembl, UniProt or InterPro, among many others, through the BioMart Central Portal.
PeptideAtlas
PeptideAtlas is a multi-organism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments. Mass spectrometer output files are loaded into the PeptideAtlas data repository and then analyzed through the Trans-Proteomic Pipeline (TPP), using either Sequest or X!Tandem as the database-dependent search algorithm. PeptideProphet and ProteinProphet are then applied to derive a probability of correct identification for all results in a uniform manner, which ensures a high-quality database, and false discovery rates are reported at the whole-atlas level. Results may be queried and browsed at the PeptideAtlas web site. The raw data, search results, and full builds can also be downloaded for other uses.
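To illustrate how per-identification probabilities relate to a false discovery rate, the sketch below uses one common estimate: among the identifications accepted at a probability threshold, the expected fraction of incorrect ones is the mean of (1 - p). This is an assumption made for illustration and is not necessarily how PeptideAtlas computes its atlas-level rates; the function name `estimated_fdr` and the example probabilities are hypothetical.

```python
from typing import Iterable


def estimated_fdr(probabilities: Iterable[float], threshold: float) -> float:
    """Estimate the FDR among identifications with probability >= threshold.

    Each probability is read as the chance that the identification is correct,
    so (1 - p) is the expected contribution of that identification to the
    number of false positives.
    """
    accepted = [p for p in probabilities if p >= threshold]
    if not accepted:
        return 0.0  # nothing accepted, so no false discoveries to count
    expected_false = sum(1.0 - p for p in accepted)
    return expected_false / len(accepted)


if __name__ == "__main__":
    # Made-up probabilities, for illustration only.
    probs = [0.99, 0.95, 0.90, 0.60, 0.20]
    print(round(estimated_fdr(probs, threshold=0.9), 4))
```

Raising the threshold accepts fewer identifications and lowers the estimated rate, which is the trade-off behind any probability cutoff.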
Global Proteome Machine Database - GPMDB
The Global Proteome Machine Organization was set up so that scientists working in proteomics with tandem mass spectrometry could use such data to analyze proteomes. The GPMDB was constructed to use the information obtained by GPM servers to help validate peptide MS/MS spectra as well as protein coverage patterns. The GPMDB has been integrated into the GPM server pages, allowing users to compare their experimental results with results previously observed by other users of the machine. To use GPMDB, a local installation can be created; all the scripts necessary to install and operate it are available on the project's FTP site. The data format of the analysis results enables upload to the GPMDB, which integrates the results with all the information already present in the database.