TY - JOUR
T1 - Mining the Structural Genomics Pipeline
T2 - Identification of Protein Properties that Affect High-throughput Experimental Analysis
AU - Goh, Chern Sing
AU - Lan, Ning
AU - Douglas, Shawn M.
AU - Wu, Baolin
AU - Echols, Nathaniel
AU - Smith, Andrew
AU - Milburn, Duncan
AU - Montelione, Gaetano T.
AU - Zhao, Hongyu
AU - Gerstein, Mark
N1 - Funding Information:
This work was supported, in part, by grant 5P50GM062413-03 from the Protein Structure Initiative of the National Institute of General Medical Sciences, National Institutes of Health and grant DMS-0241160 (to H.Y.Z.) from the NSF. We thank Tom Acton for helpful discussions.
PY - 2004/2/6
Y1 - 2004/2/6
N2 - Structural genomics projects represent major undertakings that will change our understanding of proteins. They generate unique datasets that, for the first time, present a standardized view of proteins in terms of their physical and chemical properties. By analyzing these datasets here, we are able to discover correlations between a protein's characteristics and its progress through each stage of the structural genomics pipeline, from cloning, expression, purification, and ultimately to structural determination. First, we use tree-based analyses (decision trees and random forest algorithms) to discover the most significant protein features that influence a protein's amenability to high-throughput experimentation. Based on this, we identify potential bottlenecks in various stages of the structural genomics process through specialized "pipeline schematics". We find that the properties of a protein that are most significant are: (i) whether it is conserved across many organisms; (ii) the percentage composition of charged residues; (iii) the occurrence of hydrophobic patches; (iv) the number of binding partners it has; and (v) its length. Conversely, a number of other properties that might have been thought to be important, such as nuclear localization signals, are not significant. Thus, using our tree-based analyses, we are able to identify combinations of features that best differentiate the small group of proteins for which a structure has been determined from all the currently selected targets. This information may prove useful in optimizing high-throughput experimentation.
KW - COGs
KW - Charged residues
KW - Decision trees
KW - Hydrophobicity
KW - Structural genomics
UR - http://www.scopus.com/inward/record.url?scp=9144261138&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=9144261138&partnerID=8YFLogxK
U2 - 10.1016/j.jmb.2003.11.053
DO - 10.1016/j.jmb.2003.11.053
M3 - Article
C2 - 14741208
AN - SCOPUS:9144261138
SN - 0022-2836
VL - 336
SP - 115
EP - 130
JO - Journal of Molecular Biology
JF - Journal of Molecular Biology
IS - 1
ER -