Textual Profiles of Genes
Using algorithms to analyze natural language text is a challenging task. Recent advances in algorithms, together with the increased availability of computational power and online text, have resulted in incremental progress in text analysis (Rosenfeld 2000). For certain specific applications, natural language processing algorithms can rival human performance. Even the simplest algorithms and approaches can glean information from text, and they do so much faster than humans can. In functional genomics, where an individual assay might include thousands of genes and tens of thousands of documents pertinent to those genes, the speed of text mining approaches offers a great advantage to investigators trying to understand the data. In this chapter, we focus on techniques that convert text into simple numerical vectors to facilitate computation. We then discuss how these vectors can be combined into textual profiles for genes; these profiles offer additional biologically meaningful information that can complement available genomics data sets.

The previous chapter introduced methods to analyze gene expression data and sequence data. Many of those analytical methods focused on comparing and grouping genes by similarity. Some sequence analysis methods, such as dynamic programming and BLAST, compare two sequences, while multiple sequence alignment and weight matrices provide a means to compare families of sequences. Similarly, gene expression array analysis approaches mostly depend on distance metrics that compare gene expression profiles to each other; clustering and classification algorithms provide a means to group similar genes. The primary goal of applying these methods was to transfer knowledge between similar genes. We can think of the scientific literature as yet another data type and define document similarity metrics for it.
Algorithms that tap the knowledge locked in the scientific literature require sophisticated natural language processing approaches. Assessing document similarity, on the other hand, is a comparatively easy task. A measure of document similarity that corresponds to semantic similarity between documents can nonetheless be powerful. For example, we might conclude that two genes are related if the documents that refer to them are semantically similar.
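The idea of turning documents into numerical vectors, combining them into a textual profile per gene, and comparing profiles can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the chapter's own method: it assumes raw word counts as the vector representation, a simple average of document vectors as the gene profile, and cosine similarity as the similarity measure. The function names (`word_vector`, `gene_profile`, `cosine_similarity`) and the example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Represent a document as a bag-of-words vector of lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def gene_profile(documents):
    """Average the word vectors of all documents that refer to one gene.

    Assumption: a plain average; other weighting schemes are possible.
    """
    total = Counter()
    for doc in documents:
        total.update(word_vector(doc))
    n = len(documents)
    return {word: count / n for word, count in total.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * b[word] for word, value in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents attached to three hypothetical genes.
profile_a = gene_profile(["kinase phosphorylates substrate", "kinase signaling"])
profile_b = gene_profile(["kinase signaling pathway"])
profile_c = gene_profile(["ribosome translation"])

print(cosine_similarity(profile_a, profile_b))  # shared vocabulary, above zero
print(cosine_similarity(profile_a, profile_c))  # no shared words
```

Under these assumptions, two genes whose documents share vocabulary receive a higher score than two genes whose documents do not, which is the intuition behind treating the literature as another data type for comparing genes.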
Keywords: as function of latent dimension, collection frequency, dice coefficient, feature selection, gene annotation, hierarchical clustering, keyword queries, latent semantic indexing (LSI), matching coefficient, negations