Mining the Structural Genomics Pipeline: Identification of Protein Properties that Affect High-throughput Experimental Analysis

Chern Sing Goh, Ning Lan, Shawn M. Douglas, Baolin Wu, Nathaniel Echols, Andrew Smith, Duncan Milburn, Gaetano T. Montelione, Hongyu Zhao, Mark Gerstein

Research output: Contribution to journal › Article › peer-review

111 Scopus citations


Structural genomics projects represent major undertakings that will change our understanding of proteins. They generate unique datasets that, for the first time, present a standardized view of proteins in terms of their physical and chemical properties. By analyzing these datasets here, we are able to discover correlations between a protein's characteristics and its progress through each stage of the structural genomics pipeline, from cloning through expression and purification to, ultimately, structural determination. First, we use tree-based analyses (decision trees and random forest algorithms) to discover the most significant protein features that influence a protein's amenability to high-throughput experimentation. Based on these results, we identify potential bottlenecks in various stages of the structural genomics process through specialized "pipeline schematics". We find that the properties of a protein that are most significant are: (i) whether it is conserved across many organisms; (ii) the percentage composition of charged residues; (iii) the occurrence of hydrophobic patches; (iv) the number of binding partners it has; and (v) its length. Conversely, a number of other properties that might have been thought to be important, such as nuclear localization signals, are not significant. Thus, using our tree-based analyses, we are able to identify combinations of features that best differentiate the small group of proteins for which a structure has been determined from all the currently selected targets. This information may prove useful in optimizing high-throughput experimentation.
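As a rough illustration of how a tree-based analysis ranks protein features, the sketch below scores each feature by the largest drop in Gini impurity that a single threshold split can achieve. This is the criterion commonly applied at each node of a decision tree, not the authors' actual pipeline, and it omits the recursive tree growth and the bootstrap ensembles of a random forest. The feature names (`length`, `pct_charged`, `has_nls`) and all six records are hypothetical values invented for the example, not data from the study.

```python
# Toy illustration only: the records below are invented, not data from the paper.
from typing import Dict, List, Tuple

Record = Dict[str, float]


def gini(labels: List[int]) -> float:
    """Gini impurity of binary labels (1 = structure solved, 0 = not solved)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(records: List[Record], labels: List[int], feature: str) -> Tuple[float, float]:
    """Return (threshold, impurity decrease) for the best single split on `feature`."""
    parent = gini(labels)
    n = len(labels)
    values = sorted({r[feature] for r in records})
    best = (values[0], 0.0)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [y for r, y in zip(records, labels) if r[feature] <= t]
        right = [y for r, y in zip(records, labels) if r[feature] > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best[1]:
            best = (t, gain)
    return best


def rank_features(records: List[Record], labels: List[int]) -> List[Tuple[str, float, float]]:
    """Sort features by impurity decrease, most informative first."""
    scored = [(f, *best_split(records, labels, f)) for f in records[0]]
    return sorted(scored, key=lambda s: s[2], reverse=True)


# Hypothetical targets: sequence length, % charged residues, NLS present (0/1).
records = [
    {"length": 120, "pct_charged": 30, "has_nls": 0},
    {"length": 150, "pct_charged": 28, "has_nls": 1},
    {"length": 200, "pct_charged": 20, "has_nls": 0},
    {"length": 450, "pct_charged": 27, "has_nls": 1},
    {"length": 600, "pct_charged": 18, "has_nls": 0},
    {"length": 800, "pct_charged": 15, "has_nls": 1},
]
labels = [1, 1, 1, 0, 0, 0]  # invented outcomes: 1 = structure determined

ranking = rank_features(records, labels)
for name, threshold, gain in ranking:
    print(f"{name}: split at {threshold}, impurity decrease {gain:.3f}")
```

In this invented dataset, length separates the two groups cleanly, so it ranks first, while the nuclear localization flag contributes little. A random forest would repeat this kind of scoring over many resampled trees and average the results, which makes the ranking less sensitive to any single split.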

Original language: English (US)
Pages (from-to): 115-130
Number of pages: 16
Journal: Journal of Molecular Biology
Issue number: 1
State: Published - Feb 6 2004

All Science Journal Classification (ASJC) codes

  • Biophysics
  • Structural Biology
  • Molecular Biology


Keywords

  • COGs
  • Charged residues
  • Decision trees
  • Hydrophobicity
  • Structural genomics
