Standards

From Mascp

The use of proteomic technologies in Arabidopsis research has become common practice in recent years. Advances in these technologies have led to the adoption of large-scale analysis procedures, protein and ion profiling and quantitation methods, and protein modification analysis techniques. Such techniques require complicated experimental procedures and produce vast amounts of data, which causes further difficulties in analysis and interpretation.

The subcommittee intends to create an information portal for Arabidopsis researchers providing information on practical techniques as well as guidelines on experimental methods and data analysis. It is hoped that the site will become a guide for Arabidopsis researchers working with proteomic technologies and provide an interaction point for issues in the field.

The information provided on this page is not intended to duplicate ongoing work from organisations such as the HUPO Proteomics Standards Initiative (PSI) or to re-invent journal publication guidelines. Instead it attempts to provide a distillation of ongoing proteomics initiatives from within the Arabidopsis community and other sources.

Minimum Information About a Proteomics Experiment (MIAPE)

by Joshua Heazlewood

The Human Proteome Organisation (HUPO) instigated the development of a standards initiative in 2002 (Science: 296 May 2002). The Proteomics Standards Initiative (PSI) has already produced several documents containing standards for the facilitation of data comparison, exchange and verification. These documents are not intended to dictate an approach or analysis method but outline the sort of information that researchers should provide to readily allow assessment and interpretation of proteomic experiments.

A series of documents is currently completed or in various draft states:

  • MIAPE (Parent document describing the modules)
  • MIAPE - Mass Spectrometry (module completed)
  • MIAPE - Gel Electrophoresis (module completed)
  • MIAPE - Mass Spectrometry Informatics (module completed)
  • MIMIx - Molecular Interaction Experiments (module completed)
  • MIAPE - Gel Image Informatics (draft)
  • MIAPE - Column Chromatography (draft)
  • MIAPE - Capillary Electrophoresis (draft)
  • PSI-MOD - Standard for representation of protein modification data

Completed MIAPE documents are available through the community consultation portal at Nature Biotechnology with draft modules available through PSI.

Publication Guidelines

by Joshua Heazlewood

Publication guidelines have evolved for a variety of reasons but generally aim to ensure that published data is of high quality and that reviewers have the necessary information to assess submitted experimental data. Other outcomes have been the development of standardized data formats (see below) and the initiation of data storage programs for future reference and analysis.

Thus far no plant journal has outlined or developed specific guidelines for the publication of proteomic data. Nonetheless it is reasonable to assume that either the requirements outlined by proteomic journals will be adopted, or cut-down versions will be developed, to address the many issues of data assessment, quality and experimental design that currently face the community.

Currently only two scientific journals have specific requirements or guidelines for the publication of proteomic data:

MCP was the first journal to respond to general concerns about data inconsistencies and in 2004 released an editorial outlining a set of guidelines for authors (Mol. Cell. Proteomics, Jun 2004; 3: 531 - 533). Further input and discussion were sought from invited parties, with the support of ASBMB, at a two-day workshop in Paris, France in 2005, with the result being a single document based on the original guidelines (Mol. Cell. Proteomics, Sep 2005; 4: 1223 - 1225). These draft guidelines were made available for community comment and were finalized in early 2006 (Mol. Cell. Proteomics 5:787-788, 2006). This document now represents the publication requirements for proteomic data submitted to MCP. The guidelines are available as a single document through the MCP homepage and are often referred to as the Paris Report.

MCP Guidelines Update: A recent meeting of the American Society for Mass Spectrometry in Philadelphia (May 2009) identified two topics for improvement of the MCP guidelines.

1. Quantitative proteomics: update to reflect the use of non-isotopic labeling techniques

2. Online repositories: encourage the use of such resources e.g. Tranche at Proteome Commons.

For similar reasons to those outlined above, in late 2006 the journal Proteomics developed and released a set of publication guidelines for authors submitting proteomic data to the journal (Proteomics. 2006 Sep;6(18):4887-9).

Recently JBC have added a specific set of instructions to their Instructions to Authors section. These requirements outline very basic information that should be included with proteomics data submitted to JBC and are based on those found in MCP.

Proteomic Data Formats

by Katja Baerenfaller

The amount of data being generated by proteomics laboratories has increased exponentially over the last few years. It is therefore increasingly important that scientists are able to exchange, compare and retrieve datasets and to disseminate their data to the scientific community. This is all the more important because many journals now request the underlying data supporting proteomics results. To achieve this, there is an obvious need for data format standardisation given the multitude of instrument providers, instrument types and software components presently used.

The Human Proteome Organisation (HUPO) was formed in 2001 to consolidate national and regional proteome organizations. The Proteomics Standards Initiative (PSI) was established by HUPO in April 2002 with the aim of enabling public domain databases in which all proteomics data can be deposited, exchanged between databases, and accessed and utilised by laboratory workers. According to its mission statement, HUPO-PSI defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification. These community standards apply not only to human proteomics, but to the whole proteomics field.

The HUPO-PSI workgroups are developing the MIAPE (minimum information about a proteomics experiment) guidelines and working on data exchange formats for proteomics, which facilitate data management and exchange. Furthermore, they develop controlled vocabularies, coordinate and promote the implementation of PSI standards in tools and databases, and work towards making more proteomics data easily accessible in the public domain.

The HUPO PSI MS workgroup had two major issues to tackle: first, standardising the output from a wide range of commercial hardware in a format that could be read by any search engine or stored in any compatible database; and second, developing a spectral analysis output format that unifies results from different search engines.

In 2004 the mzData interchange format was presented. Peak lists currently come in many different data formats, for example .asc (Finnigan), .pkl (Micromass), .dta (Sequest), .wiff (QSTAR) and .mgf (MatrixScience). The purpose of mzData is to unify all these different formats into one. The mzData format allows the storage of proteomics-related mass spectral data, ranging from basic details about the sample, instrument details and data processing steps through to the actual spectral lists of mass-to-charge values and intensities. The base format of mzData, and of all other current PSI standards, is the widely accepted XML (Extensible Markup Language). XML is itself a standard for describing and interchanging data in a structured way. The general grammatical structure of an XML document can be described in the form of an XML schema definition (XSD). XSDs are used to check that the grammatical structure of an XML document conforms to an agreed standard. This approach is also used for the mzData standard, the XSD document being called mzData.xsd. To date, the mzData format has been implemented by a number of manufacturers.
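To illustrate why an XML representation is convenient for exchanging spectral lists, the following minimal sketch reads m/z and intensity pairs with Python's standard library. It is not an mzData reader: the element and attribute names (spectrumList, spectrum, peak, mz, intensity) and all values are hypothetical and were chosen for readability only. The real structure is defined by mzData.xsd and is considerably richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified spectrum document. The element and attribute
# names are illustrative only and do NOT follow the real mzData.xsd.
TOY_XML = """
<spectrumList>
  <spectrum id="1" msLevel="2" precursorMz="523.28">
    <peak mz="175.119" intensity="1200.0"/>
    <peak mz="262.151" intensity="830.5"/>
    <peak mz="375.235" intensity="2110.0"/>
  </spectrum>
</spectrumList>
"""


def read_peaks(xml_text):
    """Return {spectrum id: [(m/z, intensity), ...]} from the toy document."""
    root = ET.fromstring(xml_text.strip())
    spectra = {}
    for spectrum in root.iter("spectrum"):
        # Each <peak> carries one mass-to-charge value and one intensity.
        peaks = [(float(p.get("mz")), float(p.get("intensity")))
                 for p in spectrum.iter("peak")]
        spectra[spectrum.get("id")] = peaks
    return spectra


peaks_by_spectrum = read_peaks(TOY_XML)
```

Because the structure is explicit in the markup, any XML-aware tool can read such a file without vendor-specific code, which is the point of a common interchange format. Checking a document against an XSD requires a schema-aware XML library and is not shown here.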

In 2004 a second open, generic XML representation of MS data, mzXML, was published by the Institute for Systems Biology. mzXML data is produced from the instrument raw data file by converters that use the vendors' proprietary software. However, once the raw data has been converted to mzXML, no vendor-specific software is needed. Converters exist for a number of mass spectrometers. Software using mzXML is also available: the Trans-Proteomic Pipeline, the Insilicos viewer, the visualization and analysis tool MSight, and the commercial search engine Phenyx.

In 2006 the designers of mzData and mzXML, together with representatives of instrument vendors, analysis software developers and end users, joined forces under the auspices of the PSI to develop a single format intended to replace the previous two. This new format is named mzML and consists of a much improved XML schema that contains the best aspects of the two original formats so that it may be widely adopted. A number of converters have already been made available to produce data in this format, such as ReAdW for XCalibur (Thermo) .raw files, massWolf for MassLynx (Waters) .raw directories, mzWiff for Analyst (ABI, Agilent) .wiff files and Trapper for MassHunter (Agilent) .d directories. These are available for use at HUPO-PSI.

Shotgun Proteomics Data Interpretation

by Jonas Grossmann and Sacha Baginsky

In common proteomics workflows, tandem mass spectrometry data are searched against a protein database of the organism under investigation. In this procedure, experimentally measured tandem mass spectra are scored against in silico generated tandem mass spectra from the underlying protein database. In high-throughput experiments several hundred or even thousands of proteins can be identified from complex LC-MS/MS runs. The enormous amount of data that can be generated with such an approach makes manual data analysis impossible. Therefore software tools are used for the interpretation of high-throughput MS/MS data. To attach estimates of identification reliability, such as 'false discovery rates', to such a 'black box' approach, additional statistical analyses are necessary.

One common approach is the target/decoy strategy, in which the dataset is searched against a concatenated database consisting of the target organism database and a decoy database, the latter being a reversed form of the target database. The number of hits in the decoy database (multiplied by 2, because the same rate of false assignments can be expected in the target database as well) compared to the number of hits in the target database determines the false discovery rate of a specific search strategy. The target/decoy search strategy has the advantage that the decoy search space has the same size and the same elemental composition as the target database, thus providing a realistic estimate of false discovery rates (Balgley et al., 2007). Using the target/decoy search strategy it was clearly shown that single hit protein identifications (proteins identified on the basis of a single peptide only) are enriched for false identifications. Because single hit identifications can make up to 30 percent of all identified proteins for complex protein mixtures, this group of identifications is especially critical, but also potentially important because low abundance proteins are usually detected with only one or two peptides at most. It is therefore highly desirable to distinguish single hit identifications that are correct from those that are incorrect. Additional validation is thus required to accept single hit proteins. Possibilities are manual inspection of the underlying tandem mass spectra or the use of alternative search algorithms.
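The target/decoy arithmetic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by any particular search engine: the function names, the REV_ prefix for decoy accessions, and the example accession and sequence are all hypothetical. The denominator is taken as all accepted hits (target plus decoy), so that 0.5% decoy hits corresponds to an estimated false discovery rate of 1%; other denominators are also in use.

```python
def make_decoy(proteins, prefix="REV_"):
    """Build decoy entries by reversing each target sequence.

    Reversal keeps database size and amino acid composition unchanged.
    """
    return {prefix + accession: sequence[::-1]
            for accession, sequence in proteins.items()}


def estimate_fdr(hit_accessions, prefix="REV_"):
    """Estimate FDR as 2 * decoy hits / all accepted hits.

    The factor 2 assumes that false assignments fall into the target
    half of the concatenated database as often as into the decoy half.
    """
    n_total = len(hit_accessions)
    if n_total == 0:
        return 0.0
    n_decoy = sum(1 for acc in hit_accessions if acc.startswith(prefix))
    return 2.0 * n_decoy / n_total


# Hypothetical one-entry target database and its reversed decoy.
targets = {"PROT_A": "MKTAYIAK"}
decoys = make_decoy(targets)
concatenated = {**targets, **decoys}

# Hypothetical result list: 995 target hits and 5 decoy hits.
hits = ["PROT_A"] * 995 + ["REV_PROT_A"] * 5
fdr = estimate_fdr(hits)
```

In practice the score threshold of a search is adjusted until this estimate reaches the desired level, for example 1%.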

A comprehensive survey of data from recent large-scale proteome analyses revealed that false discovery rates can be dramatically decreased if peptides are accepted only when identified by multiple search engines. In our study five search engines were used: Sequest, Mascot, X!Tandem, OMSSA and PepSplice, of which the latter three are open source. High-confidence hit lists were assembled together with single hit identifications at a false discovery rate of 1% for each of the aforementioned search engines (0.5% of reverse hits were accepted, which corresponds to a false discovery rate of 1%; see above). The combined results obtained with all five search engines showed that about 50% of all identifications were single hit protein identifications made by only one of the search engines. About 50% of these single hit identifications were identifications in the decoy database. This indicates that combining data acquired from different tools, each at a false discovery rate of 1%, can translate into a 15-30% false discovery rate at the protein level if single hit identifications are accepted. The reason is that the FDR is usually determined at the spectrum level: several spectra usually identify one peptide and several peptides usually identify one protein (if 100,000 spectra are identified at a false discovery rate of 1%, 1,000 of them are potentially wrong). Using the criterion that single hit proteins must be identified by at least two different search engines, up to 90% of the single hit proteins for an individual search engine could be tagged with a higher reliability. Applying this criterion, all proteins identified by only one search engine with only one peptide need to be discarded, resulting in a much lower false discovery rate (reduced by a factor of 10).
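The acceptance criterion for single hit proteins can be sketched as a simple filter over the combined results. The data layout (a dictionary per search engine that maps each protein to the set of peptides identifying it), the function name and the example identifiers are assumptions made for illustration and do not reproduce the software used in the study.

```python
from collections import defaultdict


def filter_single_hits(results):
    """Discard proteins found by only one engine with only one peptide.

    results: {engine name: {protein accession: set of peptide sequences}}
    Returns the sorted list of accepted protein accessions.
    """
    engines_for = defaultdict(set)
    peptides_for = defaultdict(set)
    for engine, proteins in results.items():
        for protein, peptides in proteins.items():
            if peptides:
                engines_for[protein].add(engine)
                peptides_for[protein].update(peptides)
    return sorted(
        protein for protein in engines_for
        if not (len(engines_for[protein]) == 1
                and len(peptides_for[protein]) == 1)
    )


# Hypothetical identifications from three of the search engines.
results = {
    "Sequest": {"P1": {"PEPTIDEA", "PEPTIDEB"},
                "P2": {"PEPTIDEC"},
                "P3": {"PEPTIDED"}},
    "Mascot": {"P2": {"PEPTIDEC"}},
    "X!Tandem": {"P4": {"PEPTIDEE"}},
}
accepted = filter_single_hits(results)
```

In this example P1 is kept because it is supported by two peptides, P2 is kept because two engines agree on its single peptide, and P3 and P4 are discarded as single hits reported by one engine only.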

Quantitative Proteomics

Coming soon

2-DE (Two-Dimensional Gel Electrophoresis)

Coming soon