Computational Text Analysis: For Functional Genomics and Bioinformatics

Soumya Raychaudhuri

Print publication date: 2006

Print ISBN-13: 9780198567400

Published to Oxford Scholarship Online: November 2020

DOI: 10.1093/oso/9780198567400.001.0001

PRINTED FROM OXFORD SCHOLARSHIP ONLINE (oxford.universitypressscholarship.com). (c) Copyright Oxford University Press, 2021. All Rights Reserved. An individual user may print out a PDF of a single chapter of a monograph in OSO for personal use. date: 22 June 2021

Using text in Sequence Analysis

Chapter 4: Using text in Sequence Analysis

Oxford University Press

Text about genes can be effectively leveraged to enhance sequence analysis (MacCallum, Kelley et al. 2000; Chang, Raychaudhuri et al. 2001; McCallum and Ganesh 2003; Eskin and Agichtein 2004; Tu, Tang et al. 2004). Most of the emerging methods use textual representations similar to the one introduced in the previous chapter. To analyze a sequence, a numeric vector containing the counts of different words in references about that sequence can be used in conjunction with the sequence itself.

Experienced biologists understand the value of scientific text during sequence searches, and commonly use text and annotations to guide their intuition. For example, after a quick BLAST search, a trained expert might look over the hits and their associated annotations and literature references to assess the validity of the hits. The apparently valid hits can then be used to draw conclusions about the query sequence by transferring information from them.

In most cases, the text serves as a proxy for structured functional information. High-quality functional annotations that succinctly and thoroughly describe the function of a protein are often unavailable. Defining appropriate keywords for a protein requires considerable effort and expertise, and in most cases the results are incomplete because the body of knowledge about proteins is ever-growing. One option, therefore, is to use text to compare the biological function of different sequences instead.

Functional information in text can be used in sequence analysis in different ways. One possibility is to first run a sequence analysis algorithm and then use text profiles to summarize or organize the results: functional keywords can be assigned to the whole group of hit sequences, or a series of sequences can be grouped according to like function. In either case, a quick assessment of the text associated with the sequences offers insight into exactly what we are seeing. These approaches are particularly useful when querying a large database of sequences with a novel sequence about which we have very little information.
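
As a rough illustration of the word-count representation and the keyword summary of sequence hits described above, the following Python sketch may help. It is not the chapter's method: the hit identifiers, the sample texts, and the helper names (`word_counts`, `cosine`, `summarize_hits`) are all hypothetical, and the plain document-frequency ranking is a simplifying assumption standing in for whatever weighting scheme the full text uses.

```python
from collections import Counter
import math
import re

# Hypothetical toy data: sequence hit IDs mapped to text from their references.
HIT_TEXT = {
    "hitA": "protein kinase phosphorylation signaling kinase",
    "hitB": "serine kinase phosphorylation of substrate protein",
    "hitC": "membrane transporter glucose uptake",
}

def word_counts(text):
    """Return a sparse word-count vector (a Counter) for one text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def summarize_hits(vectors, top_n=3):
    """Pick the words that occur in the most hits as crude group keywords."""
    doc_freq = Counter()
    for vec in vectors.values():
        doc_freq.update(vec.keys())  # count each word once per hit
    return [word for word, _ in doc_freq.most_common(top_n)]

# One word-count vector per hit, built from its associated text.
vectors = {hit: word_counts(text) for hit, text in HIT_TEXT.items()}

# Keywords describing the group of hits as a whole.
keywords = summarize_hits(vectors)

# Text similarity as a proxy for similarity of function.
sim_ab = cosine(vectors["hitA"], vectors["hitB"])
sim_ac = cosine(vectors["hitA"], vectors["hitC"])
```

In this toy data, the two kinase-related hits share vocabulary and so score a higher text similarity than the kinase and transporter pair, which share no words. Grouping sequences "according to like function" could then be approximated by clustering on such similarities, though a realistic pipeline would need stop-word handling and term weighting.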

Keywords: accession number (AC), SWISS-PROT, functional assignment, gold standards, homologous sequences, profile drift, sequence hits description by keywords, twilight zone sequences, use to summarize sequence hits
