Automatic target selection for structural genomics on eukaryotes

Jinfeng Liu, Hedi Hegyi, Thomas B. Acton, Gaetano Montelione, Burkhard Rost

Research output: Contribution to journal › Article

57 Citations (Scopus)

Abstract

A central goal of structural genomics is to experimentally determine representative structures for all protein families. At least 14 structural genomics pilot projects are currently investigating the feasibility of high-throughput structure determination; the National Institutes of Health funded nine of these in the United States. Initiatives differ in the particular subset of "all families" on which they focus. At the NorthEast Structural Genomics consortium (NESG), we target eukaryotic protein domain families. The automatic target selection procedure has three aims: 1) identify all protein domain families from the five eukaryotic target organisms that are currently entirely sequenced, based on their sequence homology, 2) discard those families that can be modeled on the basis of structural information already present in the PDB, and 3) target representatives of the remaining families for structure determination. To guarantee that all members of one family share a common fold-like region, we had to begin by dissecting proteins into structural domain-like regions before clustering. Our hierarchical approach, CHOP, which uses homology to PrISM, Pfam-A, and SWISS-PROT, chopped the 103,796 eukaryotic proteins/ORFs into 247,222 fragments. Of these fragments, 122,999 appeared to be suitable targets and were grouped into >27,000 singletons and >18,000 multifragment clusters. Thus, our results suggested that it might be necessary to determine >40,000 structures to minimally cover the subset of five eukaryotic proteomes.
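
The three aims in the abstract can be illustrated with a toy sketch. This is not the CHOP procedure or the authors' clustering method: the fragment names, the `similar_pairs` and `modelable` inputs, and the functions `cluster_fragments` and `select_targets` are all hypothetical, and plain single-linkage clustering is used here only as a stand-in for homology-based grouping. Discarding modelable fragments before clustering is likewise a simplifying assumption.

```python
from collections import defaultdict


def cluster_fragments(fragments, similar_pairs):
    """Group fragments by single linkage using a union-find structure.

    Assumption: two fragments belong to the same family whenever a chain
    of pairwise similarities connects them.
    """
    parent = {f: f for f in fragments}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for f in fragments:
        clusters[find(f)].append(f)
    # Sort members and clusters so the output is deterministic.
    return sorted(sorted(members) for members in clusters.values())


def select_targets(fragments, similar_pairs, modelable):
    """Toy version of the three aims: cluster, discard, pick targets."""
    # Aim 2 (simplified): drop fragments already modelable from known structures.
    candidates = [f for f in fragments if f not in modelable]
    pairs = [(a, b) for a, b in similar_pairs
             if a not in modelable and b not in modelable]
    # Aim 1: group the remaining domain-like fragments into families.
    clusters = cluster_fragments(candidates, pairs)
    singletons = [c for c in clusters if len(c) == 1]
    multi = [c for c in clusters if len(c) > 1]
    # Aim 3: one representative per family (here simply the first member).
    targets = [c[0] for c in clusters]
    return singletons, multi, targets


# Hypothetical fragments: protein A chopped into A_1 and A_2, and so on.
fragments = ["A_1", "A_2", "B_1", "C_1", "C_2", "D_1"]
similar_pairs = [("A_1", "B_1"), ("B_1", "C_2"), ("A_2", "D_1")]
modelable = {"D_1"}

singletons, multi, targets = select_targets(fragments, similar_pairs, modelable)
print(len(singletons), len(multi), len(targets))
```

In this sketch the number of targets equals the number of singletons plus the number of multifragment clusters. Applied to the counts in the abstract, one representative per group gives more than 27,000 + 18,000 = 45,000 targets, which is consistent with the stated lower bound of >40,000 structures.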

Original language: English (US)
Pages (from-to): 188-200
Number of pages: 13
Journal: Proteins: Structure, Function and Genetics
Volume: 56
Issue number: 2
DOI: 10.1002/prot.20012
State: Published - Aug 1 2004

Fingerprint

Genomics
Eukaryota
Proteins
Protein Databases
National Institutes of Health (U.S.)
Proteome
Sequence Homology
Open Reading Frames
Cluster Analysis
Throughput
Health
Protein Domains

All Science Journal Classification (ASJC) codes

  • Structural Biology
  • Biochemistry
  • Molecular Biology

Keywords

  • Cluster
  • Domains
  • Protein structure family
  • Proteome analysis
  • Structural genomics
  • Target selection

Cite this

Liu, Jinfeng ; Hegyi, Hedi ; Acton, Thomas B. ; Montelione, Gaetano ; Rost, Burkhard. / Automatic target selection for structural genomics on eukaryotes. In: Proteins: Structure, Function and Genetics. 2004 ; Vol. 56, No. 2. pp. 188-200.
@article{f89b2714154e4d26b4986fea88df46c7,
title = "Automatic target selection for structural genomics on eukaryotes",
keywords = "Cluster, Domains, Protein structure family, Proteome analysis, Structural genomics, Target selection",
author = "Jinfeng Liu and Hedi Hegyi and Acton, {Thomas B.} and Gaetano Montelione and Burkhard Rost",
year = "2004",
month = "8",
day = "1",
doi = "10.1002/prot.20012",
language = "English (US)",
volume = "56",
pages = "188--200",
journal = "Proteins: Structure, Function and Genetics",
issn = "0887-3585",
publisher = "Wiley-Liss Inc.",
number = "2",

}

TY - JOUR
T1 - Automatic target selection for structural genomics on eukaryotes
AU - Liu, Jinfeng
AU - Hegyi, Hedi
AU - Acton, Thomas B.
AU - Montelione, Gaetano
AU - Rost, Burkhard
PY - 2004/8/1
Y1 - 2004/8/1
KW - Cluster
KW - Domains
KW - Protein structure family
KW - Proteome analysis
KW - Structural genomics
KW - Target selection
UR - http://www.scopus.com/inward/record.url?scp=3042726394&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=3042726394&partnerID=8YFLogxK
U2 - 10.1002/prot.20012
DO - 10.1002/prot.20012
M3 - Article
C2 - 15211504
AN - SCOPUS:3042726394
VL - 56
SP - 188
EP - 200
JO - Proteins: Structure, Function and Genetics
JF - Proteins: Structure, Function and Genetics
SN - 0887-3585
IS - 2
ER -