Analyzing Groups of Genes
The analysis of large-scale genomic data (such as sequences or expression patterns) frequently involves grouping genes based on common experimental features. The goal of manual or automated analysis of genomics data is to define groups of genes that share features within the data and also have a common biological basis that can account for those commonalities. When using algorithms that define groups of genes based on patterns in data, it is critical to be able to assess whether the groups also share a common biological function. In practice, this goal is met by relying on biologists with an extensive understanding of diverse genes, who decipher the biology that accounts for genes with correlated patterns and identify the relevant functions that account for experimental results. For example, experts routinely scan large numbers of gene expression clusters to see if any of the clusters are explained by a known biological function. Efficient definition and interpretation of these groups of genes is challenging because the number and diversity of genes exceed the ability of any single investigator to master. Here, we argue that computational methods can utilize the scientific literature to effectively assess groups of genes. Such methods can then be used to analyze groups of genes created by other bioinformatics algorithms, or to assist in the definition of gene groups. In this chapter we explore statistical methods that score the “coherence” of a gene group using only the scientific literature about the genes; that is, whether or not the genes in the group share a common function. We propose and evaluate such a method, and compare it to some other possible methods. In the subsequent chapter, we apply these concepts to gene expression analysis. The major concepts of this chapter are described in the frame box. We begin by introducing the concept of functional coherence.
We describe four different strategies to assess the functional coherence of a group of genes. The final part of the chapter emphasizes the most effective of these methods, the neighbor divergence per gene. We discuss its general performance properties and its robustness given imperfect groups. Finally, we present an example of an application to gene expression array data.
Keywords: best article score (BAS), carbohydrate metabolism genes, diversity, genomics literature, empirical distribution, article scores, functional coherence, gene groups, inverse document frequency weighted word vectors, metabolism genes, reference indices