TY - JOUR
T1 - Ortholog clustering on a multipartite graph
AU - Vashist, Akshay
AU - Kulikowski, Casimir A.
AU - Muchnik, Llya
N1 - Funding Information:
Akshay Vashist was, in part, supported by DIMACS awards. Ilya Muchnik was supported by US National Science Foundation grant CCF-0325398.
PY - 2007/1
Y1 - 2007/1
N2 - We present a method for automatically extracting groups of orthologous genes from a large set of genomes by a new clustering algorithm on a weighted multipartite graph. The method assigns a score to an arbitrary subset of genes from multiple genomes to assess the orthologous relationships between genes in the subset. This score is computed using sequence similarities between the member genes and the phylogenetic relationship between the corresponding genomes. An ortholog cluster is found as the subset with the highest score, so ortholog clustering is formulated as a combinatorial optimization problem. The algorithm for finding an ortholog cluster runs in time O(|E| + |V| log |V|), where V and E are the sets of vertices and edges, respectively, in the graph. However, if we discretize the similarity scores into a constant number of bins, the runtime improves to O(|E| + |V\). The proposed method was applied to seven complete eukaryote genomes on which the manually curated database of eukaryotic ortholog clusters, KOG, is constructed. A comparison of our results with the manually curated ortholog clusters shows that our clusters are well correlated with the existing clusters.
AB - We present a method for automatically extracting groups of orthologous genes from a large set of genomes by a new clustering algorithm on a weighted multipartite graph. The method assigns a score to an arbitrary subset of genes from multiple genomes to assess the orthologous relationships between genes in the subset. This score is computed using sequence similarities between the member genes and the phylogenetic relationship between the corresponding genomes. An ortholog cluster is found as the subset with the highest score, so ortholog clustering is formulated as a combinatorial optimization problem. The algorithm for finding an ortholog cluster runs in time O(|E| + |V| log |V|), where V and E are the sets of vertices and edges, respectively, in the graph. However, if we discretize the similarity scores into a constant number of bins, the runtime improves to O(|E| + |V\). The proposed method was applied to seven complete eukaryote genomes on which the manually curated database of eukaryotic ortholog clusters, KOG, is constructed. A comparison of our results with the manually curated ortholog clusters shows that our clusters are well correlated with the existing clusters.
KW - Biology
KW - Clustering algorithms
KW - Genetics
KW - Graph-theoretic methods
UR - http://www.scopus.com/inward/record.url?scp=33847751065&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33847751065&partnerID=8YFLogxK
U2 - 10.1109/TCBB.2007.1004
DO - 10.1109/TCBB.2007.1004
M3 - Article
C2 - 17277410
AN - SCOPUS:33847751065
VL - 4
SP - 17
EP - 27
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
SN - 1545-5963
IS - 1
ER -