Abstract
Motivation: Establishing structural and functional relationships between sequences when only primary sequence information is available is a key task in biological sequence analysis. This ability can be critical for tasks such as inferring the structural class of unannotated proteins when no secondary or tertiary structure is available. Recent computational methods based on profile and mismatch neighborhood kernels have significantly improved one's ability to elucidate such relationships. However, the need for additional reduction in computational complexity and improvement in predictive accuracy hinders the widespread use of these powerful computational tools.

Results: We present a new general approach for sequence analysis based on a class of efficient string-based kernels, sparse spatial sample kernels (SSSK). The approach offers state-of-the-art accuracy for sequence classification and low computational cost, and it scales well with the size of sequence databases in both supervised and semi-supervised learning settings. Application of the proposed methods to remote homology detection and fold recognition problems yields performance equal to or better than existing state-of-the-art algorithms. We also demonstrate the benefit of spatial information and multi-resolution sampling for achieving this accuracy and for discriminative sequence motif discovery. Because of their low computational complexity, the proposed methods can be applied to very large partially-labeled databases of protein sequences, and they show substantial improvements in computing time over existing methods.
Original language | English (US) |
---|---|
Pages | 29-37 |
Number of pages | 9 |
State | Published - 2008 |
Event | 8th International Workshop on Data Mining in Bioinformatics, BIOKDD 2008 - Held in conjunction with 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2008 - Las Vegas, United States; Duration: Aug 24 2008 → Aug 24 2008 |
Conference
Conference | 8th International Workshop on Data Mining in Bioinformatics, BIOKDD 2008 - Held in conjunction with 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2008 |
---|---|
Country/Territory | United States |
City | Las Vegas |
Period | 8/24/08 → 8/24/08 |
All Science Journal Classification (ASJC) codes
- Computer Graphics and Computer-Aided Design
- Software
Keywords
- Large-scale semi-supervised learning
- Sequence classification
- String kernels