
Mapping the human proteome

28 May 2014

by ecancer reporter Clare Sansom

Since the human genome sequence was first published about a decade ago, genomics has transformed both basic and clinical medical research, and this is beginning to bear fruit in clinical practice, not least in oncology.

Genes, however, only hold and replicate information: they are transcribed into messenger RNA molecules and the information held in these is translated into the sequences of the functional molecules, the proteins.

The proteome is defined as the entire complement of proteins encoded by a genome, but while the genome of an individual remains more or less unchanged during that individual’s life, the set of proteins synthesised will differ between cell types and even over time.

Proteins also undergo chemical modification after translation, so one gene sequence may give rise to several distinct proteins, and proteins vary greatly in abundance. The task of obtaining a summary of the complete complement of proteins in a healthy human – “the” human proteome – has therefore proved even less tractable than that of obtaining the genome sequence.

Now, however, two independent research groups have completed draft maps of the human proteome that are published back-to-back in the 29 May issue of Nature.

Both groups have made their catalogues of proteins available online. A large international group of researchers led by Akhilesh Pandey of Johns Hopkins University School of Medicine, Baltimore, MD, USA, with the Institute of Bioinformatics in Bangalore, India, as a second lead institution, used high-resolution Fourier transform mass spectrometry to generate their map of the human proteome [1].

They analysed a total of 30 histologically normal tissue samples, comprising 17 different adult tissues, seven foetal tissues and six purified samples of different types of primary haematopoietic cells.

Mass spectra were generated from these samples for proteins encoded by 17,294 genes, representing approximately 84% of those in the known human genome. Most of the cells and tissues in this set had not previously been analysed so comprehensively using mass spectrometry, and half the peptides in the dataset obtained were not present in either of two well characterised and widely used human peptide databases, PeptideAtlas and GPMDB.

Furthermore, protein products were identified for about two-thirds of the 3,844 human proteins currently characterised as “missing” or “hypothetical”.

The protein products of 2,350 genes were detected in all the tissue samples studied; these proteins are “housekeeping” proteins that are necessary to maintain life and are generally expressed at high abundance.
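The idea of a protein being "detected in all the tissue samples" can be sketched as a set intersection. The example below is a minimal illustration only: the function name `detected_in_all` and the three small sample sets are hypothetical toy data, not taken from the study, and the real analysis worked from mass spectrometry identifications across 30 samples.

```python
def detected_in_all(samples: dict[str, set[str]]) -> set[str]:
    """Return the genes whose protein products appear in every sample."""
    if not samples:
        return set()
    # Intersect the per-sample gene sets; only genes present in all survive.
    return set.intersection(*samples.values())


# Hypothetical toy data: gene symbols detected in each tissue sample.
toy_samples = {
    "liver": {"ACTB", "GAPDH", "ALB"},
    "pancreas": {"ACTB", "GAPDH", "INS"},
    "heart": {"ACTB", "GAPDH", "MYH7"},
}

housekeeping_candidates = detected_in_all(toy_samples)
print(sorted(housekeeping_candidates))  # ['ACTB', 'GAPDH']
```

In this toy case the genes seen in only one tissue drop out, which mirrors the distinction the article draws between housekeeping proteins and those restricted to one specialist cell type.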

Many other proteins were found to be expressed in only one specialist cell type, and others were expressed in foetal but not in adult tissues; some of those identified in foetal tissue are known to be biomarkers for adult cancer types.

Interestingly, a comprehensive bioinformatics analysis of this data identified novel proteins coded by regions of the genome that had not been predicted to code for proteins.

These included some pseudogenes – DNA sequences that were once protein-coding genes but are thought to have been degraded and therefore are no longer expressed – and sequences predicted to be transcribed into non-coding RNA molecules.

This comprehensive catalogue of human proteins has been made freely available.

A second independent draft of the human proteome has been produced by Bernhard Kuster of Technische Universität München, Freising, Germany and his co-workers [2].

Kuster and his co-workers assembled their draft proteome from 16,857 tandem mass spectrometry experiments with human tissues, cell lines and body fluids, and from studies of post-translational modifications.

The data was assembled into a database, ProteomicsDB, which had been designed for the real-time analysis of so-called “big data”.

ProteomicsDB is publicly accessible at https://www.proteomicsdb.org.

This database contains physical evidence for 18,097 human genes, which represents 92% of the 19,629 that are listed in the SwissProt database of fully annotated protein sequences.
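The coverage figure quoted above can be checked with a short calculation. This is only a sanity check of the arithmetic using the two numbers given in the article; the helper name `coverage_percent` is an illustrative assumption.

```python
def coverage_percent(observed: int, reference: int) -> float:
    """Return `observed` as a percentage of `reference`."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return 100.0 * observed / reference


# Figures quoted in the article: genes with evidence in ProteomicsDB
# versus genes listed in SwissProt.
proteomicsdb_coverage = coverage_percent(18_097, 19_629)
print(f"{proteomicsdb_coverage:.1f}%")  # 92.2%
```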

All chromosomes except chromosomes 21 and Y are evenly represented in this dataset.

Like the Pandey dataset, it was unexpectedly found to contain protein sequences derived from parts of the human genome that are not expected to code for protein.

ProteomicsDB currently contains data on 430 of these sequences, derived from 409 different long intergenic non-coding RNAs (lincRNAs) and so-called transcripts of unknown coding potential (TUCPs).

Comparing datasets from different tissue samples suggested that there is a core set of 10,000-12,000 human proteins that are expressed in almost all tissues and that can be assumed to be involved in the general control and maintenance of cells.

Kuster and his co-workers generated functional profiles of the proteins present in 27 different tissues and body fluids by combining their quantitative data with information in public bioinformatics databases.

To demonstrate the utility of this data in clinical decision making, they cross-referenced it with drug-sensitivity data from the Cancer Cell Line Encyclopedia to identify proteins expressed in cancer cell lines that correlate with either sensitivity or resistance to 24 cancer drugs. This example highlights one of the many uses that cancer researchers and clinical oncologists are bound to make of these and their successor proteomics datasets in the years to come.
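One simple way to picture the cross-referencing described above is to correlate a protein's abundance across cell lines with a measure of how those cell lines respond to a drug. The sketch below uses a Pearson correlation purely as an illustration; the article does not state which statistic the authors used, and the function name and toy values are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one protein's abundance in four cell lines and
# a drug-response score for the same four cell lines.
abundance = [1.0, 2.0, 3.0, 4.0]
response = [2.0, 4.1, 5.9, 8.0]

r = pearson(abundance, response)
print(f"r = {r:.3f}")
```

A coefficient near +1 or -1 would flag the protein as a candidate marker of sensitivity or resistance, depending on how the response score is defined; any such candidate would still need experimental validation.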

References

[1]: Kim, M-S., Pinto, S.M., Getnet, D. and 69 others (2014). A draft map of the human proteome. Nature 509, 575-581.

[2]: Wilhelm, M., Schlegl, J., Hahne, H. and 17 others (2014). Mass-spectrometry-based draft of the human proteome. Nature 509, 582-587.