A fast, large-scale learning method for protein sequence classification

Pavel Kuksa, Pai Hsi Huang, Vladimir Pavlovic

Research output: Contribution to conference, Paper, peer-review

Abstract

Motivation: Establishing structural and functional relationships between sequences when only the primary sequence information is available is a key task in biological sequence analysis. This ability can be critical for tasks such as inferring the structural class of unannotated proteins when no secondary or tertiary structure is available. Recent computational methods based on profile and mismatch neighborhood kernels have significantly improved one's ability to elucidate such relationships. However, the need for additional reduction in computational complexity and improvement in predictive accuracy hinders the widespread use of these powerful computational tools.

Results: We present a new general approach for sequence analysis based on a class of efficient string-based kernels, sparse spatial sample kernels (SSSK). The approach offers state-of-the-art accuracy for sequence classification and low computational cost, and it scales well with the size of sequence databases in both supervised and semi-supervised learning settings. Application of the proposed methods to remote homology detection and fold recognition problems yields performance equal to or better than existing state-of-the-art algorithms. We also demonstrate the benefit of the spatial information and multi-resolution sampling for achieving this accuracy and for discriminative sequence motif discovery. Because of their low computational complexity, the proposed methods can be applied to very large partially-labeled databases of protein sequences, and they show substantial improvements in computing time over the existing methods.
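The abstract does not define the SSSK features, so the following is only an illustrative sketch of the general idea of a spatial sample kernel, not the authors' implementation. It assumes a "double sample" feature made of two k-mers separated by a gap of 0 to `max_gap` residues, and takes the kernel value as the inner product of the two sequences' feature counts. The function names, the parameters `k` and `max_gap`, and the exact gap convention are all assumptions made for illustration.

```python
from collections import Counter


def double_sample_counts(seq, k=1, max_gap=2):
    """Count (kmer, gap, kmer) features in seq.

    Assumed feature definition (hypothetical, for illustration only):
    two k-mers separated by `gap` residues, with 0 <= gap <= max_gap.
    """
    counts = Counter()
    n = len(seq)
    for i in range(n - k + 1):
        first = seq[i:i + k]
        for gap in range(max_gap + 1):
            j = i + k + gap          # start of the second k-mer
            if j + k > n:            # second k-mer would run past the end
                break
            counts[(first, gap, seq[j:j + k])] += 1
    return counts


def spatial_sample_kernel(x, y, k=1, max_gap=2):
    """Inner product of the feature-count vectors of x and y."""
    cx = double_sample_counts(x, k, max_gap)
    cy = double_sample_counts(y, k, max_gap)
    if len(cx) > len(cy):            # iterate over the smaller map
        cx, cy = cy, cx
    return sum(c * cy[f] for f, c in cx.items())


if __name__ == "__main__":
    print(spatial_sample_kernel("ACA", "ACC", k=1, max_gap=1))
```

Under these assumptions, each sequence is scanned once and only the features that actually occur are stored, which is one plausible reading of why such kernels can be cheap to evaluate on large databases.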

Original language: English (US)
Pages: 29-37
Number of pages: 9
State: Published - 2008
Event: 8th International Workshop on Data Mining in Bioinformatics, BIOKDD 2008 - Held in conjunction with 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2008 - Las Vegas, United States
Duration: Aug 24 2008 to Aug 24 2008

All Science Journal Classification (ASJC) codes

  • Computer Graphics and Computer-Aided Design
  • Software

Keywords

  • Large-scale semi-supervised learning
  • Sequence classification
  • String kernels
