Jump to ContentJump to Main Navigation
Pattern Discovery in Biomolecular DataTools, Techniques, and Applications$
Users without a subscription are not able to see the full content.

Jason T. L. Wang, Bruce A. Shapiro, and Dennis Shasha

Print publication date: 1999

Print ISBN-13: 9780195119404

Published to Oxford Scholarship Online: November 2020

DOI: 10.1093/oso/9780195119404.001.0001

Show Summary Details
Page of

PRINTED FROM OXFORD SCHOLARSHIP ONLINE (oxford.universitypressscholarship.com). (c) Copyright Oxford University Press, 2021. All Rights Reserved. An individual user may print out a PDF of a single chapter of a monograph in OSO for personal use. date: 19 October 2021

Pattern Discovery and Classification in Biosequences

Pattern Discovery and Classification in Biosequences

Chapter:
(p.55) Chapter 4 Pattern Discovery and Classification in Biosequences
Source:
Pattern Discovery in Biomolecular Data
Author(s):

Jason T. L. Wang

Thomas G. Marr

Publisher:
Oxford University Press
DOI:10.1093/oso/9780195119404.003.0009

With the significant growth of the amount of biosequence data, it becomes increasingly important to develop new techniques for finding “knowledge” from the data. Pattern discovery is a fundamental operation in such applications. It attempts to find patterns in biosequences that can help scientists to analyze the property of a sequence or predict the function of a new entity. The discovered patterns may also help to classify an unknown sequence, that is, assign the sequence to an existing family. In this chapter, we show how to discover active patterns in a set of protein sequences and classify an unlabeled DNA sequence. We use protein sequences as an example to illustrate our discovery algorithm, though the algorithm applies to sequences of any sort, including both protein and DNA. The patterns we wish to discover within a set of sequences are regular expressions of the form *X1 * X2 * ... . The X1,X2,... are segments of a sequence, that is, subsequences made up of consecutive letters, and * represents a variable length don’t care (VLDC). In matching the expression *X1 * X2 * ... with a sequence S, the VLDCs may substitute for zero or more letters in S at zero cost. The dissimilarity measure used in comparing two sequences is the edit distance, that is, the minimum cost of edit operations used to transform one subsequence to the other after an optimal and zero-cost substitution for the VLDCs, where the edit operations include insertion, deletion, and change of one letter to another (Wagner and Fischer, 1974; K. Zhang et al., 1994). That is, we find a one-to-one mapping from each VLDC to a subsequence of the data sequence and from each pattern subsequence to a subsequence of the data sequence such that the following two conditions are satisfied, (i) The mapping preserves the left-to-right ordering (if a VLDC at position i in the pattern maps to a subsequence starting at position i1 and ending at position i2 in the data sequence, and a VLDC at position j in the pattern maps to a subsequence starting at position j1 and ending at position j2 in the data sequence, and i < j, then i2 < j2).

Keywords:   Alu sequence, Classification, Fasta, Hashing, Probability, Sampling

Oxford Scholarship Online requires a subscription or purchase to access the full text of books within the service. Public users can however freely search the site and view the abstracts and keywords for each book and chapter.

Please, subscribe or login to access full text content.

If you think you should have access to this title, please contact your librarian.

To troubleshoot, please check our FAQs , and if you can't find the answer there, please contact us .