Jump to ContentJump to Main Navigation
Codon EvolutionMechanisms and Models$

Gina M. Cannarozzi and Adrian Schneider

Print publication date: 2012

Print ISBN-13: 9780199601165

Published to Oxford Scholarship Online: May 2015

DOI: 10.1093/acprof:osobl/9780199601165.001.0001

Show Summary Details
Page of

PRINTED FROM OXFORD SCHOLARSHIP ONLINE (oxford.universitypressscholarship.com). (c) Copyright Oxford University Press, 2020. All Rights Reserved. An individual user may print out a PDF of a single chapter of a monograph in OSO for personal use.  Subscriber: null; date: 13 August 2020

Robust estimation of natural selection using parametric codon models

Robust estimation of natural selection using parametric codon models

Chapter:
(p.111) Chapter 8 Robust estimation of natural selection using parametric codon models
Source:
Codon Evolution
Author(s):

Gavin A. Huttley

Von Bing Yap

Publisher:
Oxford University Press
DOI:10.1093/acprof:osobl/9780199601165.003.0008

Abstract and Keywords

This chapter focuses on the use of parametric codon models to test the mode of natural selection. It describes the nature of the different models, demonstrating the importance of model assumptions to reliably estimating selective neutrality. It examines the relationship between context-dependent processes on non-overlapping units and those relevant to codon evolutionary processes. To illustrate theoretical properties, the chapter presents both simple simulations and analyses of neutrally evolving DNA sequences from primates. It shows that estimates of natural selection using the commonly employed codon models of Muse and Gaut (1994), and of Goldman and Yang (1994), can both be strongly biased by sequence composition. It further illustrates that the Goldman and Yang model can have counterintuitive behaviour in hierarchical hypothesis-testing contexts.

Keywords:   natural selection, parametric codon models, context-dependent process, DNA sequence, primates

8.1 Introduction

By formally defining the dynamics of selectively neutral genetic variation, the neutral theory of molecular evolution (Kimura, 1983) provided the essential null hypothesis for identifying the operation of natural selection. This work, along with technological advances in DNA sequencing, stimulated development of statistical tests of selective neutrality for population genetic and molecular evolutionary data (Hudson et al., 1987; Hughes and Nei, 1988; McDonald and Kreitman, 1991). As evidenced by this book, analyses of the DNA units that encode amino acids (codons) have proven a popular target for these methods.

The appeal of codon based approaches for measuring the influence of natural selection derives from a built-in presumed neutral control. Because the number of DNA trinucleotides (64) used to encode amino acids outnumbers the number of amino acids (20), genetic codes are necessarily redundant. This redundancy has the effect that not all nucleotide point mutations cause a change in the encoded amino acid. If natural selection only operated on the encoded protein, then synonymous changes, which do not change the encoded amino acid, would evolve in a selectively neutral manner. From this intuitive expectation has come the notion that the ratio of nonsynonymous to synonymous substitution (conventionally termed ω‎) can be used to indicate the mode of natural selection: when ω‎ = 1, the rates of nonsynonymous and synonymous substitution are the same and a sequence is considered to be evolving in a selectively neutral manner; ω‎ < 1 indicates suppression of amino acid changing DNA variation, signifying the influence of purifying natural selection; ω‎ > 1 indicates the amino acid substitution rate is faster than the neutral rate, signifying the operation of positive natural selection.

Of the many approaches developed to test the mode of natural selection, the parametric codon models provide substantial flexibility. Maximumlikelihood techniques (Felsenstein, 1981) can be applied to identify evolutionary epochs during which individual genes exhibited positive selection (Yang, 1998), to identify individual residues within genes that have consistently been subjected to positive selection (Nielsen and Yang, 1998), or both (Yang and Nielsen, 2002). These and other analyses are possible because parametric models allow formal testing of the null hypothesis of neutrality, ω‎ = 1.

As should be evident, the robustness of the parametric codon modelling approach in identifying the mode of natural selection hinges on a number of critical assumptions. The assumption that synonymous substitutions are selectively neutral is likely to be incorrect, at least in some circumstances. While a full review of how natural selection can operate on synonymous sites is beyond the scope of this chapter, striking evidence has been presented from across diverse lineages (Hershberg and Petrov, 2008; The Chimpanzee Sequencing and Analysis Consortium, 2005).

Parametric models are also specified with constraints introduced in the form of assumptions that are typically introduced in the interests of mathematical and/or computational tractability. In addition to these assumptions, we will also demonstrate, in this chapter, that the form of the assumed neutral process embodied by the parametric models (p.112) is critical in accommodating the complexity of biological sequences.

In back-to-back articles in Molecular Biology and Evolution (Muse and Gaut, 1994; Goldman and Yang, 1994), two seminal parametric models were introduced. They both assume that codons within a gene are independently evolving as a continuoustime Markov process, and, furthermore, that only one nucleotide in a codon may be substituted at a time. Since the nucleotide substitution rates depend on the others in the same codon, these models fall into a broader class that we will refer to as context-dependent substitution models. Examples are RNA base-pairs (Schoninger and von Haeseler, 1994) and (Tillier, 1994), and dinucleotides and trinucleotides (Siepel and Haussler, 2004). Others extend the dependence across codon boundaries (Jensen and Pedersen, 2000); they will not be discussed any further. We will examine the relationship between context-dependent processes on nonoverlapping units and those relevant to codon evolutionary processes.

Below, we will detail the nature of the different models, demonstrating the importance of model assumptions to reliably estimating selective neutrality. To illustrate theoretical properties, we present both simple simulations and analyses of neutrally evolving DNA sequences from primates. We will show that estimates of natural selection using the commonly employed codon models of Muse and Gaut (1994), and of Goldman and Yang (1994), can both be strongly biased by sequence composition. We further show the Goldman and Yang model can have counterintuitive behaviour in hierarchical hypothesis-testing contexts. All software models described in this chapter are available in PyCogent (Knight et al., 2007) and all scripts and data used in this chapter are available on request from the authors. The definitions for terms used throughout thus chapter are presented in Box 8.1.

8.2 Context-dependent substitution models

Suppose that in a DNA sequence, the nucleotides independently undergo substitution according to a reversible Markov chain with rate matrix Q, also known as GTR (general time-reversible) (Yang, 1994). For distinct nucleotides x and y, Q(x, y) = S(x, y)π‎(y) for some symmetric S and equilibrium distribution π‎. In a dinucleotide substitution process, we assume that non-overlapping dinucleotides independently follow a reversible Markov chain with rate matrix Q. Here, we restrict attention to processes with the property that each substitution involves only one nucleotide, so that Q(a, b) = 0, whenever the dinucleotides a = a1a2 and b = b1b2 differ in both positions. Following Q, a natural parametrization is as follows: for distinct dinucleotides a and b, Q(a, b) = S(a, b)π‎(b), where (p.113) S is symmetric and π‎ is the equilibirum distribution. This will be called TF (tuple frequency), in reference to the appearance of the dinucleotide frequencies π‎ in Q.

Viewing the GTR at the dinucleotide level, we have:

(8.1)
Q(a,b)=Q*(a1,b1)=S*(a1,b1)π*(b1)

if a1b1 and a2 = b2, and a similar relationship for a1 = b1 and a2b2.

We call the set of all Q’s of this form NF (nucleotide frequency), which consists of precisely all GTR processes. To see how NF is represented in TF, note that π‎(b) = π‎(b1)π‎(b2): π‎ is homogeneous multiplicative (HM), which allows us to solve Q(a, b) = S(a, b)π‎(b) to get:

(8.2)
S(a,b)={S*(a1,b1)π*(b2)a1b1,a2=b2S*(a1,b1)π*(b1)a1=b1,a2b2.

Thus the GTR processes are parametrized very simply in NF, but in TF it is more complicated.

The substitution of a1 by b1 is context-dependent if the rate depends on the second nucleotide, i.e. Q(a, b) depends on the state of a2 = b2. Otherwise it is contextfree. Similarly, a substitution at the second nucleotide can be context-dependent or context-free. The GTR processes are context-free, as can be seen clearly from the NF parametrization; this is less clear from the TF parametrization. By definition, NF processes are all context-free, while the TF processes are in general context-dependent, hence more flexible. The good features of both can be brought out with another parametrization CNF (conditional nucleotide frequency):

(8.3)
Q(a,b)=S(a,b){π1|b2(b1)a1b1,a2=b2π2|b1(b2)a1=b1,a2b2,

where π‎1|b2(b1) is the conditional probability that the first nucleotide is b1 given that the second nucleotide is b2.

The processes specified by CNF are identical to TF (Lindsay et al., 2008). But unlike TF, CNF represents NF very nicely:

(8.4)
π1|b2(b1)=π*(b1),π1|b1(b2)=π*(b2),S(a,b)=S*(a1,b1)

if a1b1, a2 = b2, and

(8.5)
R(a,b)=R*(a2,b2)

if a1 = b1, a2b2.

It is natural to reduce parameters for TF and CNF by replacing R with R, as in NF. We use TFGTR and CNFGTR to denote the constrained models. For consistent notation, we will also use NFGTR instead of the equivalent NF. It follows that NFGTR are precisely the GTR processes, and that these are exactly those processes in CNFGTR with the constraint that π‎ is HM. However, the GTR processes in TFGTR must have uniform nucleotide frequencies. To see this, suppose there is a symmetric R and HM π‎ that yields a GTR process in TFGTR. Since Q(TT, CT) = Q(TA, CA), we must have S(T, C)π‎(C)π‎(T) = S(T, C)π‎(C)π‎(A), or π‎(T) = π‎(A), etc. Thus, π‎ must be uniformly distributed on the nucleotides. In particular, even the special case S≡ 1 is in general not a GTR process. Indeed, this process has context effects: in general Q(TT, CT) ≠ Q(TA, CA). In summary, most GTR processes are not in TFGTR, and some of the simplest processes in TFGTR are not GTR processes. As described in the next paragraph, the disparity between CNFGTR and TFGTR regarding the GTR processes, the basic context-free processes, is of fundamental importance. In general, TFGTR and CNFGTR are context-dependent because π‎ is not HM. If S is constrained to be of HKY form (Hasegawa et al., 1985), then the models are denoted: TFHKY, NFHKY, and CNFHKY.

The processes of mutagenesis affecting real sequences are far more complicated than GTR processes. A substantial body of evidence indicates that point mutation processes (affecting a single nucleotide position) are influenced by sequence neighbourhoods (Cooper and Krawczak, 1990, Cooper and Youssoufian, 1988; Gojobori et al., 1982; Krawczak et al., 1998, Morton et al., 1997). This influence manifests as non-HM frequencies of di-and trinucleotides; for instance, the frequency of a dinucleotide is not the product of the component nucleotide frequencies. The implication for evolutionary modelling should be clear. The putatively dominant context-dependent substitution is the well-known elevation of C to T substitutions when within CpG dinucleotides in many lineages (Cooper and Krawczak, 1990; Cooper and (p.114) Youssoufian, 1988, Coulondre et al., 1978). This context effect can be incorporated into a GTR process by multiplying a factor r to Q(CG, TG), and also to Q(TG, CG) to preserve reversibility. Thus r measures the effect of the neighbouring G relative to any other neighbouring nucleotide. However, this one-parameter extension NFGTR,r is not expected to fit real data, where π‎ is not HM. Analogous extensions TFGTR,r and CNFGTR,r, where π‎ is completely general, should fit better, at the cost of interpretability of r: it represents the residual effect of G that is not accounted for by π‎, rather than the net effect.

Consider fitting these models to data generated from a GTR process. Because the process is always in NFGTR and CNFGTR, NFGTR,r and CNFGTR,r are consistent: in the limit of infinite data, r^, the maximum-likelihood estimate (MLE) of r, will be around 1. However, since the GTR process is in general not in TFGTR, it is likely that based on TFGTR,r, r^ will be away from 1, falsely indicating a context effect. On the other hand, take a process in NFGTR,r with r ≠ 1, say. Now CNFGTR,r and NFGTR,r will still be consistent, but TFGTR,r may think r is around 1, a false negative. To conclude: if the underlying process is in NFGTR,r, CNFGTR,r is consistent for r, but not TFGTR,r. This conclusion will be illustrated by simulations. The erroneous assumption that TFGTR contains most context-free processes cannot be detected by internal scrutiny: simulations within TFGTR,r, where processes with r = 1 are assumed to include the GTR processes, will not reveal the exclusion of most of them from TFGTR.

The conclusion in the last paragraph largely carries over to codon models. TFGTR, NFGTR, and CNFGTR can be generalized directly to trinucleotides. For the sense codons, some modifications are required because of the exclusion of stop codons. We also rename TF as CF (C for codon). In both CFGTR and CNFGTR, π‎ should sum to 1 over sense codons. In NFGTR, the equilibrium frequency of a sense codon is:

(8.6)
π(a)=π*(a1)π*(a2)π*(a3)/k,

where κ‎ is the sum of sense codon frequencies.

In CNFGTR, there are rate terms where the conditional nucleotide frequencies need to be computed with some care because of stop codons. The extensions CFGTR,ω‎, NFGTR,ω‎, and CNFGTR,ω‎ have ω‎ multiplied to the rate of each nonsynonymous substitution, so it indicates positive, neutral, or negative selection. In fact, CFHKY,ω‎ is essentially the popular model of (Goldman and Yang, 1994) and NFHKY,ω‎ was introduced by (Muse and Gaut, 1994). Because of the exclusion of the stop codons, the GTR processes are not contained in any of the codon models. Unlike dinucleotides and trinucleotides, a process in CNFGTR with a given S and an HM π‎ is in general distinct from the process in NFGTR with the same S and the corresponding π‎. This is shown by computing Q(TAT, TAC)/S(T, C) for both models. Since TAA and TAG are stop codons, for CNFGTR, the ratio is:

(8.7)
π3|TA(C)=π(TAC)π(TAT)+π(TAC)=π*(Cπ*(T)+π*(C)

but it is π‎(C) for NFGTR. Thus, the lessons from dinucleotides are modified as followed: if the underlying process is NFGTR,ω‎, CNFGTR,ω‎ is slightly inconsistent for ω‎, but CFGTR,ω‎ is quite inconsistent. This mathematical statement can be illustrated by simulations, but its relevance to codon evolution needs clarification. The fact that the GTR processes are not part of any model makes it necessary to distinguish neutral and non-neutral codon-substitution processes by some other means. We propose two ways around the difficulty. First, it seems plausible that NFGTR processes are approximately neutral, as assumed by (Muse and Gaut, 1994). It follows that it is not a good idea to believe that CFGTR processes are approximately neutral. The second approach is empirical: suppose that intron sequences evolve neutrally with respect to the genetic code. This can be used to compare the models, and it turns out the clear winner is CNFGTR,ω‎. So CNFGTR may be defined as a class of neutral codon-substitution processes with the same richness as CFGTR. These considerations for choosing an appropriate model are critical, for, as explained in the last paragraph, the deficiency of a certain model will not be discovered by simulations within the model.

Detection of a context effect in neutral DNA or a selection effect in coding DNA is typically accomplished by testing the hypothesis that the process belongs to a certain class of models M0, which are nested in a bigger class M1. The natural approach (p.115) is to compute L0 and L1, the logarithm of the maximum likelihood in the respective classes. Always L0L1, but a larger difference is stronger evidence for rejecting the null hypothesis that the process is in M0. In many cases, standard statistical theory suggests that 2(L1L0) should be compared to the χ‎2 distribution of ν‎ degrees of freedom, where ν‎ is the difference in the numbers of parameters needed to specify M1 and M0. For example, let M0 = CNFGTR and M1 = CNFGTR,ω‎. Since ω‎ is the extra parameter, ν‎ = 1. The P value obtained is more reliable as the sequences become longer.

8.3 Evaluating properties of dinucleotide models

8.3.1 Analysis of simulated data

We first demonstrate that when sequences evolve under a simple nucleotide process, an analysis based on the dinucleotide TF model can lead to the incorrect conclusion that context affects substitution. We simulated AT-rich sequence alignments using the F81 nucleotide substitution model (Felsenstein, 1981) with π‎(A) = π‎(T) = 0.3 and π‎(C) = π‎(G) = 0.2. Each alignment consisted of two 10Kb sequences, and the branch length between them was 0.04. (The latter is of the rough order of branch lengths separating humans and chimpanzees.) We fit dinucleotide models with a single-context parameter r for substitutions between CG and its six neighbouring dinucleotides TG, AG, GG, CT, CC, and CA. Model fitting was done using the Powell numerical optimizer with exit tolerance 1e − 6, a maximum of 5 restarts and maximum evaluations set to 100k. All model optimizations were checked to see whether the exit was due to exceeding the maximum evaluation number, which would indicate the functions were not correctly fit. This condition never arose. All simulations and analyses reported here were conducted using PyCogent version 1.5.0.dev (Knight et al., 2007).

Since the data were generated under a contextfree process, the theory predicts that the MLE of r should be near 1 for CNFGTR,r and NFGTR,r, but not for TFGTR,r. The results (Figure 8.1) confirm the

Robust estimation of natural selection using parametric codon models

Figure 8.1 Distribution of MLEs of the context parameter r for six changes from CG from the three models. Sequences were simulated under the F81 process, a special case of the GTR. The dotted vertical line indicates the true value of r is 1.

(p.116) theory outlined above. In particular, the similarity of CNF and NF is borne out by the tight overlap between their MLEs. Estimates of r based on TF are biased upwards, which can be explained using a simplified model. The rate Q(CX, TX) = 0.3 for any X, but under TF they are different:

(8.8)
π(TA)=π(TT)=0.3×0.3=0.09

and

(8.9)
π(TC)=π(TG)=0.3×0.2=0.06

similarly in the reverse substitution, Q(TX, CX) = 0.2 for any X, but under TF:

(8.10)
π(CA)=π(CT)=0.2×0.3=0.06

and

(8.11)
π(CC)=π(CG)=0.2×0.2=0.04

To equalize the rates, r^ tends to be larger than 1.

8.3.2 Analysis of primate introns

The fact that real dinucleotide frequencies are typically not uniform or even HM, has important implications for the NF and TF model forms. As the NF model assumes that the dinucleotide frequencies π‎ are HM, the estimated π‎ will not match those observed in real sequences well. Therefore, estimates of other parameters are expected to be biased. The non-uniform frequencies in most real datasets indicate that parameter estimates from the TF model will also be affected. Our theoretical analyses establish that estimates with the TF model will only be the same as those with the CNF and NF models under special restrictions on the frequencies of the sequence states that are unlikely to occur in nature. Thus we expect analogous parameters estimated by the models to be different from each other.

We evaluated the consistency in parameter estimates from the different models by analysis of primate intron sequences with specific focus on comparing estimates of CpG substitution processes. We chose CpG substitution processes due to the well-characterized hypermutability of this dinucleotide context in primates. The CpG context is the characteristic motif at which DNA methylases operate to add a methyl group to cytosine, producing 5-methyl-cytosine (5mC) (Gruenbaum et al., 1982). This modified base spontaneously deaminates to T at a much greater rate than the deamination of unmethylated C to U. The U-G mismatch resulting from the latter lesion is also repaired more efficiently than the T-G mismatch, leading to an ~ 10-fold increased mutation rate of 5mC (Krawczak et al., 1998). Studies from human mutation data suggest an excess of both transitions and transversions (Sommer et al., 2001). An excess of transitions is expected based on biochemistry (Coulondre et al., 1978; Duncan and Miller, 1980) but questions remain about whether the transversion rate should be elevated or not. Our previous work on dinucleotide models indicated that the TF supported the excess transversion rate whereas the NF model did not (Lindsay et al., 2008).

Here we implemented analogous parametrizaions in the three different models to measure each of the six distinct CpG substitution processes. We followed the approach of (Lindsay et al., 2008) in defining comparable models. The baseline model is the GTR, with S(G, T) set to 1 for calibration. We included the 6 CpG context parameters r(CG ↔ TG), r(CG ↔ AG), r(CG ↔ GG), r(CG ↔ CT), r(CG ↔ CC), r(CG ↔ CA). The models were fit to primate datasets (described below) using the same numerical optimization procedure as that employed for the simulated datasets.

We used the same dataset of primate alignments as that used by (Lindsay et al., 2008) and briefly recapitulate the sampling process here. Using Ensembl release 50, all human proteincoding genes were sampled and multiple sequence alignments of human, chimpanzee, and macaque sequences obtained. Sequence regions that did not evolve by a pointmutation process (e.g. low complexity sequence or gaps) were masked and alignment columns containing a masked character were removed in a manner that preserved ‘real’ dinucleotides. To facilitate comparability of parameter estimates between alignments, alignments were truncated to exactly same length of 50,000 nucleotides. There were 470 such alignments.

Figure 8.2 shows that MLEs of the 6 CpG-context parameter are systematically different among the (p.117)

Robust estimation of natural selection using parametric codon models

Figure 8.2 MLEs of CpG-context parameters from primate introns. All multiple alignments of chimpanzee, human, and macaque were exactly 50,000 nucleotides long after eliminating sequences unlikely to evolve by a point mutation process. There were 470 alignments. MLEs of all reversible CpG substitutions are shown for each indicated model.

three models. We observed that the empirical dinucleotide frequencies are generally not HM in both frames (data not shown). By forcing the fitted frequencies to be MH, NF misses the context effects in π‎ and detects little of the residual effects represented by the r terms. Both TF and CNF account for the context effects both in π‎ and r, with r^ generally smaller and less spread out by CNF than by TF, but both much larger than by NF. Of particular significance here is that the transversion context parameters CG ↔ AG, CG ↔ GG, CG ↔ CT, and CG ↔ CC estimated under NF were distributed around 1, a result that led (Lindsay et al., 2008) to conclude that CpG transversion rates were not elevated. The comparable CNF distributions, on the other hand, seem to support an elevated rate of transversions, with the caveat that π‎ and r should be interpreted jointly to give a sensible idea of context effects. We prefer the CNF for reasons given above; this is supported by the fact that for ~ 92% of the alignments, the likelihood for CNF exceeds that for TF.

8.4 Evaluating properties of codon models

Differing handling of the omission of stop codons leads to different behaviour of the substitution models. We illustrate these differences using both simulated and naturally occurring biological sequences. The work reported in this section is substantially that described in (Yap et al., 2010) but updated to account for new insights into the models.

(p.118) 8.4.1 Analysis of simulated data

We use data generated under known codon evolution processes to further establish the relationship among parameters estimated from the different models. We specifically considered two different types of equilibrium codon frequencies: homogeneous multiplicative (HM), where the equilibrium codon frequencies are defined as the product of their nucleotide frequencies with normalization to account for the omission of stop codons; non-HM. In both conditions there is a choice between two different models.

Given R and HM π‎, CNF, and NF, specify the same process for dinucleotides or trinucleotides, but not for codons, because of the omission of stop codons. Whichever model is chosen for generating neutral data, ω̂‎, the MLE of ω‎, based on the same model will be around 1, while those based on the other model will be slightly shifted. We chose the NF model for simulation here in part to show that CNF MLEs will differ from NF even for this case. A similar but larger effect is seen when comparing CNF and TF. For this particular case, we chose the CNF model for generating data because of its clear relationship to the independent process.

We previously assessed the influence of changes to sequence composition on ω̂‎ by sampling codon frequency sets with different GC% (Yap et al., 2010). GC% is a crude measure of sequence composition, reflecting only the frequency of different nucleotides, but it has been widely employed in molecular evolutionary analyses for decades (Cuny et al., 1981) and is known to vary considerably between and within genomes (Karlin et al., 1998). Microbial species in particular exhibit pronounced differences in nucleotide frequencies (Kanehisa, 2002).

Proteincoding genes from three different clades were arbitrarily sampled to span the range of GC% observed in nature: (1) a gene (K01873) from the AT-rich (~ 33% GC) Borrelia burgdorferi, B. afzelli, and B. garinii; (2) a gene (ENSG00000143520) from the primates Homo sapiens, Pan troglodytes, and Macaca mulatta with approximately equal AT and GC; and (3) a gene (ML0101/Rv3800c) from the GC-rich (~ 64%) genomes of Mycobacterium tuberculosis and Mycobacterium leprae. The codon frecodon frequencies in these genes were used to determine an HM π‎, using the NF model, or the averaged codon frequencies estimated across all sequences was used for the CNF model. Data were simulated under the neutral model (ω‎ = 1) of both NF and CNF. We simulated long (90 Kbp) alignments to reduce the influence of sampling error. The results of the analyses are shown in Figure 8.3 (Yap et al., 2010).

From data simulated under the NF model, with HM π‎, ω^ from NF and CNF were typically very similar. The distributions of ω^ became increasingly similar as the GC% increased. We argued that this trend reflects the AT-richness of stop codons in the standard genetic code. The nucleotide composition of stop codons are expected to lead to differences between the models, since the NF codon model does not normalize the weighting of substitutions for the absence of these states from the model. In contrast, ω^ from TF exhibited a systematic shift with increasing GC% from negative to positive. This result confirms that, relative to the standard defined by the NF and CNF models for HM π‎, ω^ from the TF model will be biased.

For the non-HM π‎ used for the (Yap et al., 2010) simulation study, all models showed marked differences in the distribution of ω^. Particularly striking was that the differences did not systematically change with increasing GC%, unlike the HM case. Also striking was the substantial discrepancy between CNF and NF. This result is consistent with the results from the dinucleotide analysis (Figure 8.2) where non-HM π‎ was associated with r^ differing substantially between NF and CNF.

These analyses of simulated data confirm the insights from examining the analytical relationships between the models. Namely, for the typically restricted codon model parametrizations in use, all the models give different ω^ distributions. The magnitude of these differences will depend on how close the codon frequencies are to HM and the actual frequencies of the codons. This analysis does not resolve, however, which is the correct model that should be used for the measurement of natural selection. ω^ from a wrong model will tend to differ from 1. The magnitude of the difference is affected by the relationship between the fitting and generating models. The latter is of particular (p.119)

Robust estimation of natural selection using parametric codon models

Figure 8.3 Nucleotide composition effects on estimates of ω‎ from simulated neutrally evolving genes. Sequence simulations were based on an AT-rich gene sampled from Borrelia species, a primate gene with AT% ≈ GC%, and a GC-rich gene sampled from Mycobacterium species. Average GC% of the simulated alignments is shown. x-axis shows ω^, y-axis shows an estimate of density. (a) Data generated from a NFGTR model resulting in multiplicative codon frequencies. (b) Data generated from CNFGTR model with observed (non-HM) codon frequencies from the sampled genes. The dashed vertical line is the expected neutral value.

relevance to the NF model, which assumes that the equilibrium state frequencies are multiplicative. The TF and CNF models, however, make identical assumptions regarding the equilibrium distribution and differ only in their weighting of rates which in turn affects how these models nest the independent process. One approach to identifying which of these is the most suitable model is to apply them to real biological sequences known to have evolved in a selectively neutral manner, which is the objective of the next section.

8.4.2 Analysis of primate introns

The essential benchmark for evaluating the utility of the different models is an analysis of neutrally evolving real DNA sequences. While pseudogenes are a candidate for such an analysis, identifying multiple members of this sequence class that meet the necessary criteria of selective neutrality, and also span the spectrum of GC%, is a significant challenge. Moreover, such sequences typically contain frameshift mutations or stop codons that invalidate application of codon models. An alternative approach is to consider any sequence that does not encode proteins as selectively neutral with respect to protein coding content (Yap et al., 2010). The evolution of such sequences can be modelled using a trinucleotide process.

A modified post-sampling process was applied to the same introns used for the dinucleotide analyses. Part of the original sampling process of these introns involved masking sequence regions that could evolve by a non-point mutation process. (p.120) Examples include low-complexity sequence and insertion deletion (or gap) regions. Nucleotides within a sequence that fell into such a span were replaced by a special masking character. The alignment was then split into non-overlapping trinucleotides and aligned columns that contained a mask character in any sequence were eliminated. Alignments were then truncated to exactly the same length (~50kbp). This resulted in 470 alignments. This sampling process ensures that only naturally occurring trinucleotides are sampled and that different estimates are derived from equivalent amounts of data, homogenizing MLE variances.

We compared three pairs of nested models, CFHKY,ω‎ ⊂ CFGTR,ω‎, NFHKY,ω‎ ⊂ NFGTR,ω‎, and CNFHKY,ω‎ ⊂ CNFGTR,ω‎. The results identified CNFGTR,ω‎ as the model that produced MLEs of ω‎ that are most consistent with 1. The likelihood ratio test statistic from testing the null hypothesis of ω‎ = 1 against the alternative ω‎ ≠ 1 was very close to the theoretical expectation of the χ12 distribution for CNF (Figure 8.4). NFGTR,ω‎ was the next closest while CFHKY,ω‎ was the best-performing TF model and clearly worse than either of the other two models.

While we will consider the implications of the superiority of the GTR over HKY variant in more detail in the Conclusions, we expand here on the high similarity between the NF and CNF models. This similarity is due partly to the trinucleotide nature of the models, which means that the CNF and NF models are nested. Due to their different treatment of stop codon omission the codon model variants are not nested. As a conse

Robust estimation of natural selection using parametric codon models

Figure 8.4 Comparison of the effects of variation in real neutral processes on NF, CF, and CNF models. (a) Alignment GC% (x-axis) against ω^ (y-axis) from the ‘trinucleotide’ models CFHKY,ω‎, NFGTR,ω‎, and CNFGTR,ω‎. The estimate and significance of Kendall’s τ‎ measure of association between GC% and ω^ are shown for each panel. The dashed horizontal line is ω‎ = 1. (b) A quantile–quantile plot using quantiles from χ12 (the expected null distribution) against quantiles of the LRs from testing the alternative ω‎ ≠ 1 against the null ω‎ = 1 hypotheses (Yap et al., 2010).

(p.121) quence, we expect ω‎ to be more biased under NF codon model variants.

8.5 Impact of model definitions on statistical power

Since the original publication of the codon models (Goldman and Yang, 1994; Muse and Gaut, 1994), most development has concentrated on refining the measurement of natural selection. These include methods to detect differences in molecular adaptation: between lineages (Yang, 1998); between residues (Nielsen and Yang, 1998); and a combination of these two, examining lineage specific molecular adaptation affecting individual residues (Nielsen and Yang, 1998; Zhang et al., 2005). In each of these cases, an alternate hypothesis of molecular adaptation is defined as lineages/sites, where ω‎ > 1. As ω‎ under TF, CNF, and NF are not the same, then the critical benchmark used to classify the mode of selection (ω‎ = 1) will differ between the models.

Statistical power is defined as the probability of rejecting the null hypothesis when it is false. As the null model in codon evolutionary modelling cases includes the ω‎ = 1 constraint, the probability of making erroneous conclusions must be affected for models where estimates of ω‎ are biased by sequence composition. This statement was confirmed by (Yap et al., 2010), who showed for a single example that the quantiles of the likelihood ratio test (LRT) statistic from a test of a simple siteclass model were seriously in error for the TF model. The correct expected quantile distribution was not shown in that work as the calculation of the correct degrees-of-freedom for the LRT is complicated by the non-standard parameter space. It is most critical to realize, however, that the results of that analysis indicate that other conditions will be realized in which it is LRT from the NF model that will depart more from the expected theoretical distribution. In other words, departures from the expected quantiles will depend very much on the precise properties of the data being analysed and the statistical hypotheses being contrasted.

We further illustrate the challenge of understanding the impact of these model properties on statistical power here. We used exactly the same simulated dataset as that generated by Yap et al., (2010) for the examination of the error in tests for multiple siteclasses. Briefly, sequences from human, chimpanzee, and macaque that were one-to-one orthologs to the human gene with Ensembl ID ENSG00000143520 were used to fit a CNFGTR model. The fitted model was then used to simulate 250 alignments that were each 90,000 nucleotide long. This gene has ~ 50%GC.

We specifically compare hypothesis concerning equivalence of ω‎ between lineages. Departures from the neutral theory—purifying natural selection plus no natural selection—can also occur if natural selection changes between lineages. This can be examined, for instance, by specifying a null where ω‎ is the same across all lineages and an alternate where ω‎ is allowed to be different on one or more of the lineages. The degrees-of-freedom for such a test, all other parameters being unmodified, is just the number of ω‎ values less 1. The consistency of LRT statistics with that expected can be established used a quantile-quantile plot. For the current purpose, we test the null of a single ω‎ against the alternate of a separate ω‎ for the human lineage.

Prior to discussing the results, it is worth considering our prior expectations. First, note that we are not concerned with whether ω‎ = 1, only whether ω‎ is the same for all lineages. At first glance then, it seems then that all three models should exhibit the expected quantile distribution. That is, while estimates of ω‎ may be biased they can be expected to be consistently biased across lineages. Thus, we expect tests for differences in ω‎ between lineages to be consistent with our null distribution. As it turns out, this is correct only for TF and CNF.

The results in Figure 8.5 show that the NF model is prone to false positives for this hypothesis test. The LRT statistic quantiles from CNF and TF substantially overlap and lie approximately on the expected diagonal. This is not the case for the LRT quantiles from the NF model, which are typically too large. We conjecture the NF result derives from the contribution of the additional context term (human-specific ω‎) to compensate for the assumed HM frequencies. (p.122)

Robust estimation of natural selection using parametric codon models

Figure 8.5 Quantiles from the LRT statistic. H0 all lineages have the same ω‎, which need not be 1; H1 one lineage has a different ω‎. Data analysed were the same simulated data alignments from (Yap et al., 2010).

8.6 Conclusion

We have examined in detail the properties of three classes of context-dependent substitution processes, and demonstrated that analogous parameters have different interpretations. We previously demonstrated that the different model properties cause substantial differences in ω‎ estimates from real biological data. Analyses of proteincoding genes from diverse lineages revealed discordances in parameter estimates between the models as high as 30% (Yap et al., 2010). Sequence composition and non-HM codon frequencies substantially contribute to these discrepancies. The choice of model thus has a profound impact on the interpretation of the mode of natural selection.

The additional analyses undertaken here indicate that the constraint of HM frequencies can affect inference in ways that make the NF specifically unreliable for analyses of between-lineage changes in natural selection. We had previously demonstrated that testing for ω‎ = 1 is strikingly violated by the CF model, where the alternative hypothesis assumed a single ω‎ ≠ 1 for all lineages. Here we also considered the effect on error rates of a different, but also popular alternative hypothesis, where the null was a single ω‎ ≠ 1 and the alternate was different ω‎ between lineages. In this case, the precise value of ω‎ is not critical but that the value be equivalent between the lineages. For this analysis, both the CF and CNF models showed the expected Type 1 error rate, whereas NF showed an excess. This latter difference highlights the effect of assuming HM frequencies can affect the NF model in non-obvious ways. In this instance, the relative parametric sparseness of the NF model relative to the data-generating process makes the model prone to error because any additional parameters compensate for the poor fit between the model and the data.

Model fit is one essential element in choosing the best model for the analysis of contextual influences on DNA sequence evolution, but it is not sufficient by itself. The problem of model selection is a common one and systematic approaches to identifying the best model revolve around information theoretic measures (Burnham and Anderson, 2002). We compared the likelihood for the restricted (p.123) dinucleotide substitution models (Figure 8.2), where the CNF and CF models had analogous parametrizations and thus an identical number of free parameters. In this scenario, the CNF likelihood was higher than the CF likelihood for around 92% of the alignments. We note that under the most general possible stationary and reversible CNF and CF models, the likelihoods will be exactly the same. Yet because the restricted models are not the same, the r parameters have different interpretations. As these models are being applied to derive meaning about the influence of sequence neighbourhoods on the relative rates of substitution, we are principally concerned with the properties of r and thus likelihood alone (in this case) cannot guide our choice.

An additional essential step in model choice requires identifying the appropriate null hypothesis and the relationships of the models to it. We have argued that independence or the absence of context effects is the correct null hypothesis. While parametrizations can be constructed for all the models that are equivalent to independence, only the NF and CNF models do so without requiring additional parameters. In contrast, the largest possible number of parameters is required by the CF model in order to guarantee nesting of independence. Although NF does not suffer this flaw, its inability to specify nonHM frequencies affects estimates of both π‎ and r. Finally, the ultimate arbiter must be assessment of naturally occurring data known (as best as possible) to evolve in a manner consistent with the null hypothesis. Our analyses of trinucleotide data served this requirement and showed that our expectation regarding the behaviour of the model and the r terms was only reasonably met for CNF model.

As a result of these deliberations, we suggest the theoretical properties and analyses of real data confirm CNF as the most suitable foundation for building special context dependent models. An illustration of this is the extension of codon-substitution models to measuring the influence of CpG substitution processes on the evolution of protein coding genes (Huttley, 2004). We further suggest that the CNF codon model should be considered as reasonable models of neutral codon evolution. The practical implications of these results are discussed in Box 8.2.

We conclude by emphasizing that a number of observations indicate that important critical problems affecting reliable estimation of parameters remain to be addressed. The fact that only quantiles from the GTR (not HKY) variant of CNF were consistent with theoretical expectations (Yap et al., 2010) (p.124) indicates that parametrizations sufficiently

general to account for the variability in the neutral processes are essential to correctly control error rates. This means that in cases where evolutionary events are being examined that span changes in the underlying neutral process through time, those changes may need to be specifically represented in the models in order for estimates of ω‎ remain consistent with the null hypothesis. Evidence indicates that such changes to the neutral process may be quite common (Singh et al., 2009) and thus extending context dependent models to address the case of non-reversible and nonstationary processes is an important avenue for future work.

References

Bibliography references:

Burnham, K.P. and Anderson, D.R. (2002). Model selection and multimodel inference: a practical information-theoretic approach. Springer-Verlag.

Cooper, D.N. and Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease: Patterns and predictions. Human Genetics 85(1): 55-74.

Cooper, D.N. and Youssoufian, H. (1988). The CpG dinucleotide and human genetic disease. Human Genetics 78(2):151-155.

Coulondre, C., Miller, J.H., Farabaugh, P.J., and Gilbert, W. (1978). Molecular basis f base substitution hotspots in escherichia coli. Nature 274(5673): 775-780.

Cuny, G., Soriano, P., Macaya, G., and Bernardi, G. (1981). The major components of the mouse and human genomes. 1. Preparation, basic properties and compositional heterogeneity. European Journal of Biochemistry/FEBS 115(2): 227-233.

Duncan, B.K. and Miller, J.H. (1980). Mutagenic deamination of cytosine residues in DNA. Nature 287(5782):560-561.

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17(6): 368-376.

Gojobori, T., Li, W.-H., and Graur, D. (1982). Patterns of nucleotide substitution in pseudogenes and functional genes. Journal of Molecular Evolution 18(5): 360-369.

Goldman, N. and Yang, Z. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution 11(5): 725-736.

Gruenbaum, Y., Cedar, H., and Razin, A. (1982). Substrate and sequence specificity of a eukaryotic DNA methylase. Nature 295(5850): 620-622.

Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22: 160-174.

Hershberg, R. and Petrov, D.A. (2008). Selection on codon bias. Annual Review of Genetics 42: 287-299.

Hudson, R.R., Kreitman, M., and Aguade, M. (1987). A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159.

Huelsenbeck, J.P. and Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8): 754-5.

Hughes, A.L. and Nei, M. (1988). Pattern of nucleotide substitution at MHC class I loci reveals overdominant selection. Nature 335: 167-170.

Huttley, G.A. (2004). Modeling the impact of DNA methylation on the evolution of BRCA1 in mammals. Molecular Biology and Evolution 21(9): 1760-1768.

Jensen, J.L. and Pedersen, A.-M.K. (2000). Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Advances in Applied Probability 32(2): 499-517.

Kanehisa, M. (2002). The KEGG database. Novartis Foundation Symposium 247: 91-101; discussion 101-103, 119-128, 244-252.

Karlin, S., Campbell, A.M., and Mrázek, J. (1998). Comparative DNA analysis across diverse genomes. Annual Review of Genetics 32(0066-4197): 185-225.

Kimura, M. (1983). In The neutral theory of molecular evolution. Cambridge: Cambridge University Press.

Knight, R., Maxwell, P., Birmingham, A., Carnes, J., Caporaso, J., Easton, B. et al. (2007). PyCogent: a toolkit for making sense from sequence. Genome Biology 8(8): R171.

Krawczak, M., Ball, E.V., and Cooper, D.N. (1998). Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes. American Journal of Human Genetic s 63(2): 474-488.

Lanave, C., Preparata, G., Saccone, C., and Serio, G. (1984). A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution, 20(1): 86-93.

Lindsay, H., Yap, V.B., Ying, H., and Huttley, G.A. (2008). Pitfalls of the most commonly used models of context dependent substitution. Biology Direct 3: 52.

McDonald, J.H. and Kreitman, M. (1991). Adaptive protein evolution at the Adh locus in Drosophila’. Nature 351(6328): 652-654.

Morton, B.R., Oberholzer, V.M., and Clegg, M.T. (1997). The influence of specific neighboring bases on substitution bias in noncoding regions of the plant chloroplast genome. Journal of Molecular Evolution 45(3): 227-231.

(p.125) Muse, S.V. and Gaut, B.S. (1994). A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Molecular Biology and Evolution 11(5): 715-724.

The Chimpanzee Sequencing and Analysis Consortium (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437: 69-87.

Nielsen, R. and Yang, Z. (1998). Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148(3): 929-936.

Pond, S.L.K., Frost, S.D.W., and Muse, S.V. (2005). HyPhy: Hypothesis testing using phylogenies. Bioinformatics, 21(5): 676-679.

Schoninger, M. and von Haeseler, A. (1994). A stochastic model for the evolution of autocorrelated DNA sequences. Molecular Phylogenetics and Evolution 3: 240-247.

Siepel, A. and Haussler, D. (2004). Phylogenetic estimation of context-dependent substitution rates. Molecular Biology and Evolution 21: 468-488.

Singh, N.D., Arndt, P.F., Clark, A.G., and Aquadro, C.F. (2009). Strong evidence for lineage and sequence specificity of substitution rates and patterns in Drosophila. Molecular Biology and Evolution 26(7): 1591-1605.

Sommer, S.S., Scaringe, W.A., and Hill, K.A. (2001). Human germline mutation in the factor IX gene. Mutations Research 487: 1-17.

Tavare, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences, 17: 57-86.

Tillier, E.R.M. (1994). Maximum likelihood with multiparameter models of substitution. Journal of Molecular Evolution 39: 409-417.

Yang, Z. (1994). Estimating the pattern of nucleotide substitution. Journal of Molecular Evolution 39: 105-111.

Yang, Z. (1998). Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Molecular Biology and Evolution 15: 568-573.

Yang, Z. (2007). PAML 4: Phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24(8):1586-91.

Yang, Z. and Nielsen, R. (2002). Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Molecular Biology and Evolution 19(6): 908-917.

Yap, V.B., Lindsay, H., Easteal, S., and Huttley, G.A. (2010). Estimates of the effect of natural selection on protein-coding content. Molecular Biology and Evolution 27(3): 726-734.

Zhang, J., Nielsen, R., and Yang, Z. (2005). Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Molecular Biology and Evolution 22: 2472-2479.