Jump to ContentJump to Main Navigation
Pattern Discovery in Biomolecular DataTools, Techniques, and Applications$
Users without a subscription are not able to see the full content.

Jason T. L. Wang, Bruce A. Shapiro, and Dennis Shasha

Print publication date: 1999

Print ISBN-13: 9780195119404

Published to Oxford Scholarship Online: November 2020

DOI: 10.1093/oso/9780195119404.001.0001

Show Summary Details
Page of

PRINTED FROM OXFORD SCHOLARSHIP ONLINE (oxford.universitypressscholarship.com). (c) Copyright Oxford University Press, 2021. All Rights Reserved. An individual user may print out a PDF of a single chapter of a monograph in OSO for personal use. date: 25 October 2021

Motif Discovery in Protein Structure Databases

Motif Discovery in Protein Structure Databases

(p.77) Chapter 5 Motif Discovery in Protein Structure Databases
Pattern Discovery in Biomolecular Data

Janice Glasgow

Evan Steeg

Oxford University Press

The field of knowledge discovery is concerned with the theory and processes involved in the representation and extraction of patterns or motifs from large databases. Discovered patterns can be used to group data into meaningful classes, to summarize data, or to reveal deviant entries. Motifs stored in a database can be brought to bear on difficult instances of structure prediction or determination from X-ray crystallography or nuclear magnetic resonance (NMR) experiments. Automated discovery techniques are central to understanding and analyzing the rapidly expanding repositories of protein sequence and structure data. This chapter deals with the discovery of protein structure motifs. A motif is an abstraction over a set of recurring patterns observed in a dataset; it captures the essential features shared by a set of similar or related objects. In many domains, such as computer vision and speech recognition, there exist special regularities that permit such motif abstraction. In the protein science domain, the regularities derive from evolutionary and biophysical constraints on amino acid sequences and structures. The identification of a known pattern in a new protein sequence or structure permits the immediate retrieval and application of knowledge obtained from the analysis of other proteins. The discovery and manipulation of motifs—in DNA, RNA, and protein sequences and structures—is thus an important component of computational molecular biology and genome informatics. In particular, identifying protein structure classifications at varying levels of abstraction allows us to organize and increase our understanding of the rapidly growing protein structure datasets. Discovered motifs are also useful for improving the efficiency and effectiveness of X-ray crystallographic studies of proteins, for drug design, for understanding protein evolution, and ultimately for predicting the structure of proteins from sequence data. Motifs may be designed by hand, based on expert knowledge. For example, the Chou-Fasman protein secondary structure prediction program (Chou and Fasman, 1978), which dominated the field for many years, depended on the recognition of predefined, user-encoded sequence motifs for α-helices and β-sheets. Several hundred sequence motifs have been cataloged in PROSITE (Bairoch, 1992); the identification of one of these motifs in a novel protein often allows for immediate function interpretation.

Keywords:   AutoClass, Chou-Fasman program, Knowledge discovery, Machine learning, Probability, Statistics

Oxford Scholarship Online requires a subscription or purchase to access the full text of books within the service. Public users can however freely search the site and view the abstracts and keywords for each book and chapter.

Please, subscribe or login to access full text content.

If you think you should have access to this title, please contact your librarian.

To troubleshoot, please check our FAQs , and if you can't find the answer there, please contact us .