# Robust estimation of natural selection using parametric codon models

# Robust estimation of natural selection using parametric codon models

# Abstract and Keywords

This chapter focuses on the use of parametric codon models to test the mode of natural selection. It describes the nature of the different models, demonstrating the importance of model assumptions to reliably estimating selective neutrality. It examines the relationship between context-dependent processes on non-overlapping units and those relevant to codon evolutionary processes. To illustrate theoretical properties, the chapter presents both simple simulations and analyses of neutrally evolving DNA sequences from primates. It shows that estimates of natural selection using the commonly employed codon models of Muse and Gaut (1994), and of Goldman and Yang (1994), can both be strongly biased by sequence composition. It further illustrates that the Goldman and Yang model can have counterintuitive behaviour in hierarchical hypothesis-testing contexts.

*Keywords:*
natural selection, parametric codon models, context-dependent process, DNA sequence, primates

# 8.1 Introduction

By formally defining the dynamics of selectively neutral genetic variation, the neutral theory of molecular evolution (Kimura, 1983) provided the essential null hypothesis for identifying the operation of natural selection. This work, along with technological advances in DNA sequencing, stimulated development of statistical tests of selective neutrality for population genetic and molecular evolutionary data (Hudson *et al*., 1987; Hughes and Nei, 1988; McDonald and Kreitman, 1991). As evidenced by this book, analyses of the DNA units that encode amino acids (codons) have proven a popular target for these methods.

The appeal of codon based approaches for measuring the influence of natural selection derives from a built-in presumed neutral control. Because the number of DNA trinucleotides (64) used to encode amino acids outnumbers the number of amino acids (20), genetic codes are necessarily redundant. This redundancy has the effect that not all nucleotide point mutations cause a change in the encoded amino acid. If natural selection only operated on the encoded protein, then synonymous changes, which do not change the encoded amino acid, would evolve in a selectively neutral manner. From this intuitive expectation has come the notion that the ratio of nonsynonymous to synonymous substitution (conventionally termed *ω*) can be used to indicate the mode of natural selection: when *ω* = 1, the rates of nonsynonymous and synonymous substitution are the same and a sequence is considered to be evolving in a selectively neutral manner; *ω* < 1 indicates suppression of amino acid changing DNA variation, signifying the influence of purifying natural selection; *ω* > 1 indicates the amino acid substitution rate is faster than the neutral rate, signifying the operation of positive natural selection.

Of the many approaches developed to test the mode of natural selection, the parametric codon models provide substantial flexibility. Maximumlikelihood techniques (Felsenstein, 1981) can be applied to identify evolutionary epochs during which individual genes exhibited positive selection (Yang, 1998), to identify individual residues within genes that have consistently been subjected to positive selection (Nielsen and Yang, 1998), or both (Yang and Nielsen, 2002). These and other analyses are possible because parametric models allow formal testing of the null hypothesis of neutrality, *ω* = 1.

As should be evident, the robustness of the parametric codon modelling approach in identifying the mode of natural selection hinges on a number of critical assumptions. The assumption that synonymous substitutions are selectively neutral is likely to be incorrect, at least in some circumstances. While a full review of how natural selection can operate on synonymous sites is beyond the scope of this chapter, striking evidence has been presented from across diverse lineages (Hershberg and Petrov, 2008; The Chimpanzee Sequencing and Analysis Consortium, 2005).

Parametric models are also specified with constraints introduced in the form of assumptions that are typically introduced in the interests of mathematical and/or computational tractability. In addition to these assumptions, we will also demonstrate, in this chapter, that the form of the assumed neutral process embodied by the parametric models (p.112) is critical in accommodating the complexity of biological sequences.

In back-to-back articles in *Molecular Biology and Evolution* (Muse and Gaut, 1994; Goldman and Yang, 1994), two seminal parametric models were introduced. They both assume that codons within a gene are independently evolving as a continuoustime Markov process, and, furthermore, that only one nucleotide in a codon may be substituted at a time. Since the nucleotide substitution rates depend on the others in the same codon, these models fall into a broader class that we will refer to as context-dependent substitution models. Examples are RNA base-pairs (Schoninger and von Haeseler, 1994) and (Tillier, 1994), and dinucleotides and trinucleotides (Siepel and Haussler, 2004). Others extend the dependence across codon boundaries (Jensen and Pedersen, 2000); they will not be discussed any further. We will examine the relationship between context-dependent processes on nonoverlapping units and those relevant to codon evolutionary processes.

Below, we will detail the nature of the different models, demonstrating the importance of model assumptions to reliably estimating selective neutrality. To illustrate theoretical properties, we present both simple simulations and analyses of neutrally evolving DNA sequences from primates. We will show that estimates of natural selection using the commonly employed codon models of Muse and Gaut (1994), and of Goldman and Yang (1994), can both be strongly biased by sequence composition. We further show the Goldman and Yang model can have counterintuitive behaviour in hierarchical hypothesis-testing contexts. All software models described in this chapter are available in PyCogent (Knight *et al*., 2007) and all scripts and data used in this chapter are available on request from the authors. The definitions for terms used throughout thus chapter are presented in Box 8.1.

# 8.2 Context-dependent substitution models

Suppose that in a DNA sequence, the nucleotides independently undergo substitution according to a reversible Markov chain with rate matrix *Q*^{∗}, also known as GTR (general time-reversible) (Yang, 1994). For distinct nucleotides *x* and *y, Q*^{∗}(*x, y*) = *S*^{∗}(*x, y*)*π*^{∗}(*y*) for some symmetric *S*^{∗} and equilibrium distribution *π*^{∗}. In a dinucleotide substitution process, we assume that non-overlapping dinucleotides independently follow a reversible Markov chain with rate matrix *Q*. Here, we restrict attention to processes with the property that each substitution involves only one nucleotide, so that *Q*(*a, b*) = 0, whenever the dinucleotides *a* = *a*_{1}*a*_{2} and *b* = *b*_{1}*b*_{2} differ in both positions. Following *Q*^{∗}, a natural parametrization is as follows: for distinct dinucleotides *a* and *b, Q*(*a, b*) = *S*(*a, b*)*π*(*b*), where
(p.113)
*S* is symmetric and *π* is the equilibirum distribution. This will be called TF (tuple frequency), in reference to the appearance of the dinucleotide frequencies *π* in *Q*.

Viewing the GTR at the dinucleotide level, we have:

if *a*_{1} ≠ *b*_{1} and *a*_{2} = *b*_{2}, and a similar relationship for *a*_{1} = *b*_{1} and *a*_{2} ≠ *b*_{2}.

We call the set of all *Q*’s of this form NF (nucleotide frequency), which consists of precisely all GTR processes. To see how NF is represented in TF, note that *π*(*b*) = *π*^{∗}(*b*_{1})*π*^{∗}(*b*_{2}): *π* is homogeneous multiplicative (HM), which allows us to solve *Q*(*a, b*) = *S*(*a, b*)*π*(*b*) to get:

Thus the GTR processes are parametrized very simply in NF, but in TF it is more complicated.

The substitution of *a*_{1} by *b*_{1} is context-dependent if the rate depends on the second nucleotide, i.e. *Q*(*a, b*) depends on the state of *a*_{2} = *b*_{2}. Otherwise it is contextfree. Similarly, a substitution at the second nucleotide can be context-dependent or context-free. The GTR processes are context-free, as can be seen clearly from the NF parametrization; this is less clear from the TF parametrization. By definition, NF processes are all context-free, while the TF processes are in general context-dependent, hence more flexible. The good features of both can be brought out with another parametrization CNF (conditional nucleotide frequency):

where *π*_{1|b2}(*b*_{1}) is the conditional probability that the first nucleotide is *b*_{1} given that the second nucleotide is *b*_{2}.

The processes specified by CNF are identical to TF (Lindsay *et al*., 2008). But unlike TF, CNF represents NF very nicely:

if *a*_{1} ≠ *b*_{1}, *a*_{2} = *b*_{2}, and

if *a*_{1} = *b*_{1}, *a*_{2} ≠ *b*_{2}.

It is natural to reduce parameters for TF and CNF by replacing *R* with *R*^{∗}, as in NF. We use TF_{GTR} and CNF_{GTR} to denote the constrained models. For consistent notation, we will also use NF_{GTR} instead of the equivalent NF. It follows that NF_{GTR} are precisely the GTR processes, and that these are exactly those processes in CNF_{GTR} with the constraint that *π* is HM. However, the GTR processes in TF_{GTR} must have uniform nucleotide frequencies. To see this, suppose there is a symmetric *R*^{∗} and HM *π* that yields a GTR process in TF_{GTR}. Since *Q*(TT, CT) = *Q*(TA, CA), we must have *S*^{∗}(T, C)*π*^{∗}(C)*π*^{∗}(T) = *S*^{∗}(T, C)*π*^{∗}(C)*π*^{∗}(A), or *π*^{∗}(T) = *π*^{∗}(A), etc. Thus, *π*^{∗} must be uniformly distributed on the nucleotides. In particular, even the special case *S*^{∗}≡ 1 is in general not a GTR process. Indeed, this process has context effects: in general *Q*(TT, CT) ≠ *Q*(TA, CA). In summary, most GTR processes are not in TF_{GTR}, and some of the simplest processes in TF_{GTR} are not GTR processes. As described in the next paragraph, the disparity between CNF_{GTR} and TF_{GTR} regarding the GTR processes, the basic context-free processes, is of fundamental importance. In general, TF_{GTR} and CNF_{GTR} are context-dependent because *π* is not HM. If *S*^{∗} is constrained to be of HKY form (Hasegawa *et al*., 1985), then the models are denoted: TF_{HKY}, NF_{HKY}, and CNF_{HKY}.

The processes of mutagenesis affecting real sequences are far more complicated than GTR processes. A substantial body of evidence indicates that point mutation processes (affecting a single nucleotide position) are influenced by sequence neighbourhoods (Cooper and Krawczak, 1990, Cooper and Youssoufian, 1988; Gojobori *et al*., 1982; Krawczak *et al*., 1998, Morton *et al*., 1997). This influence manifests as non-HM frequencies of di-and trinucleotides; for instance, the frequency of a dinucleotide is not the product of the component nucleotide frequencies. The implication for evolutionary modelling should be clear. The putatively dominant context-dependent substitution is the well-known elevation of C to T substitutions when within CpG dinucleotides in many lineages (Cooper and Krawczak, 1990; Cooper and
(p.114)
Youssoufian, 1988, Coulondre *et al*., 1978). This context effect can be incorporated into a GTR process by multiplying a factor *r* to *Q*(CG, TG), and also to *Q*(TG, CG) to preserve reversibility. Thus *r* measures the effect of the neighbouring G relative to any other neighbouring nucleotide. However, this one-parameter extension NF_{GTR,r} is not expected to fit real data, where *π* is not HM. Analogous extensions TF_{GTR,r} and CNF_{GTR,r}, where *π* is completely general, should fit better, at the cost of interpretability of *r*: it represents the residual effect of G that is not accounted for by *π*, rather than the net effect.

Consider fitting these models to data generated from a GTR process. Because the process is always in NF_{GTR} and CNF_{GTR}, NF_{GTR,r} and CNF_{GTR,r} are consistent: in the limit of infinite data, $\widehat{r}$, the maximum-likelihood estimate (MLE) of *r*, will be around 1. However, since the GTR process is in general not in TF_{GTR}, it is likely that based on TF_{GTR,r}, $\widehat{r}$ will be away from 1, falsely indicating a context effect. On the other hand, take a process in NF_{GTR,r} with *r* ≠ 1, say. Now CNF_{GTR,r} and NF_{GTR,r} will still be consistent, but TF_{GTR,r} may think *r* is around 1, a false negative. To conclude: if the underlying process is in NF_{GTR,r}, CNF_{GTR,r} is consistent for *r*, but not TF_{GTR,r}. This conclusion will be illustrated by simulations. The erroneous assumption that TF_{GTR} contains most context-free processes cannot be detected by internal scrutiny: simulations within TF_{GTR,r}, where processes with *r* = 1 are assumed to include the GTR processes, will not reveal the exclusion of most of them from TF_{GTR}.

The conclusion in the last paragraph largely carries over to codon models. TF_{GTR}, NF_{GTR}, and CNF_{GTR} can be generalized directly to trinucleotides. For the sense codons, some modifications are required because of the exclusion of stop codons. We also rename TF as CF (C for codon). In both CF_{GTR} and CNF_{GTR}, *π* should sum to 1 over sense codons. In NF_{GTR}, the equilibrium frequency of a sense codon is:

where *κ* is the sum of sense codon frequencies.

In CNF_{GTR}, there are rate terms where the conditional nucleotide frequencies need to be computed with some care because of stop codons. The extensions CF_{GTR,ω}, NF_{GTR,ω}, and CNF_{GTR,ω} have *ω* multiplied to the rate of each nonsynonymous substitution, so it indicates positive, neutral, or negative selection. In fact, CF_{HKY,ω} is essentially the popular model of (Goldman and Yang, 1994) and NF_{HKY,ω} was introduced by (Muse and Gaut, 1994). Because of the exclusion of the stop codons, the GTR processes are not contained in any of the codon models. Unlike dinucleotides and trinucleotides, a process in CNF_{GTR} with a given *S*^{∗} and an HM *π* is in general distinct from the process in NF_{GTR} with the same *S*^{∗} and the corresponding *π*^{∗}. This is shown by computing *Q*(TAT, TAC)/*S*^{∗}(T, C) for both models. Since TAA and TAG are stop codons, for CNF_{GTR}, the ratio is:

but it is *π*^{∗}(C) for NF_{GTR}. Thus, the lessons from dinucleotides are modified as followed: if the underlying process is NF_{GTR,ω}, CNF_{GTR,ω} is slightly inconsistent for *ω*, but CF_{GTR,ω} is quite inconsistent. This mathematical statement can be illustrated by simulations, but its relevance to codon evolution needs clarification. The fact that the GTR processes are not part of any model makes it necessary to distinguish neutral and non-neutral codon-substitution processes by some other means. We propose two ways around the difficulty. First, it seems plausible that NF_{GTR} processes are approximately neutral, as assumed by (Muse and Gaut, 1994). It follows that it is not a good idea to believe that CF_{GTR} processes are approximately neutral. The second approach is empirical: suppose that intron sequences evolve neutrally with respect to the genetic code. This can be used to compare the models, and it turns out the clear winner is CNF_{GTR,ω}. So CNF_{GTR} may be *defined* as a class of neutral codon-substitution processes with the same richness as CF_{GTR}. These considerations for choosing an appropriate model are critical, for, as explained in the last paragraph, the deficiency of a certain model will not be discovered by simulations within the model.

Detection of a context effect in neutral DNA or a selection effect in coding DNA is typically accomplished by testing the hypothesis that the process belongs to a certain class of models *M*_{0}, which are nested in a bigger class *M*_{1}. The natural approach
(p.115)
is to compute *L*_{0} and *L*_{1}, the logarithm of the maximum likelihood in the respective classes. Always *L*_{0} ≤ *L*_{1}, but a larger difference is stronger evidence for rejecting the null hypothesis that the process is in *M*_{0}. In many cases, standard statistical theory suggests that 2(*L*_{1} − *L*_{0}) should be compared to the *χ*^{2} distribution of *ν* degrees of freedom, where *ν* is the difference in the numbers of parameters needed to specify *M*_{1} and *M*_{0}. For example, let *M*_{0} = CNF_{GTR} and *M*_{1} = CNF_{GTR,ω}. Since *ω* is the extra parameter, *ν* = 1. The *P* value obtained is more reliable as the sequences become longer.

# 8.3 Evaluating properties of dinucleotide models

## 8.3.1 Analysis of simulated data

We first demonstrate that when sequences evolve under a simple nucleotide process, an analysis based on the dinucleotide TF model can lead to the incorrect conclusion that context affects substitution. We simulated AT-rich sequence alignments using the F81 nucleotide substitution model (Felsenstein, 1981) with *π*^{∗}(A) = *π*^{∗}(T) = 0.3 and *π*^{∗}(C) = *π*^{∗}(G) = 0.2. Each alignment consisted of two 10Kb sequences, and the branch length between them was 0.04. (The latter is of the rough order of branch lengths separating humans and chimpanzees.) We fit dinucleotide models with a single-context parameter *r* for substitutions between CG and its six neighbouring dinucleotides TG, AG, GG, CT, CC, and CA. Model fitting was done using the Powell numerical optimizer with exit tolerance 1e − 6, a maximum of 5 restarts and maximum evaluations set to 100k. All model optimizations were checked to see whether the exit was due to exceeding the maximum evaluation number, which would indicate the functions were not correctly fit. This condition never arose. All simulations and analyses reported here were conducted using PyCogent version 1.5.0.dev (Knight *et al*., 2007).

Since the data were generated under a contextfree process, the theory predicts that the MLE of *r* should be near 1 for CNF_{GTR,r} and NF_{GTR,r}, but not for TF_{GTR,r}. The results (Figure 8.1) confirm the

*r*based on TF are biased upwards, which can be explained using a simplified model. The rate

*Q*(CX, TX) = 0.3 for any

*X*, but under TF they are different:

and

similarly in the reverse substitution, *Q*(TX, CX) = 0.2 for any *X*, but under TF:

and

To equalize the rates, $\widehat{r}$ tends to be larger than 1.

## 8.3.2 Analysis of primate introns

The fact that real dinucleotide frequencies are typically not uniform or even HM, has important implications for the NF and TF model forms. As the NF model assumes that the dinucleotide frequencies *π* are HM, the estimated *π* will not match those observed in real sequences well. Therefore, estimates of other parameters are expected to be biased. The non-uniform frequencies in most real datasets indicate that parameter estimates from the TF model will also be affected. Our theoretical analyses establish that estimates with the TF model will only be the same as those with the CNF and NF models under special restrictions on the frequencies of the sequence states that are unlikely to occur in nature. Thus we expect analogous parameters estimated by the models to be different from each other.

We evaluated the consistency in parameter estimates from the different models by analysis of primate intron sequences with specific focus on comparing estimates of CpG substitution processes. We chose CpG substitution processes due to the well-characterized hypermutability of this dinucleotide context in primates. The CpG context is the characteristic motif at which DNA methylases operate to add a methyl group to cytosine, producing 5-methyl-cytosine (5mC) (Gruenbaum *et al*., 1982). This modified base spontaneously deaminates to T at a much greater rate than the deamination of unmethylated C to U. The U-G mismatch resulting from the latter lesion is also repaired more efficiently than the T-G mismatch, leading to an ~ 10-fold increased mutation rate of 5mC (Krawczak *et al*., 1998). Studies from human mutation data suggest an excess of both transitions and transversions (Sommer *et al*., 2001). An excess of transitions is expected based on biochemistry (Coulondre *et al*., 1978; Duncan and Miller, 1980) but questions remain about whether the transversion rate should be elevated or not. Our previous work on dinucleotide models indicated that the TF supported the excess transversion rate whereas the NF model did not (Lindsay *et al*., 2008).

Here we implemented analogous parametrizaions in the three different models to measure each of the six distinct CpG substitution processes. We followed the approach of (Lindsay *et al*., 2008) in defining comparable models. The baseline model is the GTR, with *S*(G, T) set to 1 for calibration. We included the 6 CpG context parameters *r*(CG ↔ TG), *r*(CG ↔ AG), *r*(CG ↔ GG), *r*(CG ↔ CT), *r*(CG ↔ CC), *r*(CG ↔ CA). The models were fit to primate datasets (described below) using the same numerical optimization procedure as that employed for the simulated datasets.

We used the same dataset of primate alignments as that used by (Lindsay *et al*., 2008) and briefly recapitulate the sampling process here. Using Ensembl release 50, all human proteincoding genes were sampled and multiple sequence alignments of human, chimpanzee, and macaque sequences obtained. Sequence regions that did not evolve by a pointmutation process (e.g. low complexity sequence or gaps) were masked and alignment columns containing a masked character were removed in a manner that preserved ‘real’ dinucleotides. To facilitate comparability of parameter estimates between alignments, alignments were truncated to exactly same length of 50,000 nucleotides. There were 470 such alignments.

Figure 8.2 shows that MLEs of the 6 CpG-context parameter are systematically different among the (p.117)

*π*and detects little of the residual effects represented by the

*r*terms. Both TF and CNF account for the context effects both in

*π*and

*r*, with $\widehat{r}$ generally smaller and less spread out by CNF than by TF, but both much larger than by NF. Of particular significance here is that the transversion context parameters CG ↔ AG, CG ↔ GG, CG ↔ CT, and CG ↔ CC estimated under NF were distributed around 1, a result that led (Lindsay

*et al*., 2008) to conclude that CpG transversion rates were not elevated. The comparable CNF distributions, on the other hand, seem to support an elevated rate of transversions, with the caveat that

*π*and

*r*should be interpreted jointly to give a sensible idea of context effects. We prefer the CNF for reasons given above; this is supported by the fact that for ~ 92% of the alignments, the likelihood for CNF exceeds that for TF.

# 8.4 Evaluating properties of codon models

Differing handling of the omission of stop codons leads to different behaviour of the substitution models. We illustrate these differences using both simulated and naturally occurring biological sequences. The work reported in this section is substantially that described in (Yap *et al*., 2010) but updated to account for new insights into the models.

## (p.118) 8.4.1 Analysis of simulated data

We use data generated under known codon evolution processes to further establish the relationship among parameters estimated from the different models. We specifically considered two different types of equilibrium codon frequencies: homogeneous multiplicative (HM), where the equilibrium codon frequencies are defined as the product of their nucleotide frequencies with normalization to account for the omission of stop codons; non-HM. In both conditions there is a choice between two different models.

Given *R*^{∗} and HM *π*, CNF, and NF, specify the same process for dinucleotides or trinucleotides, but not for codons, because of the omission of stop codons. Whichever model is chosen for generating neutral data, *ω̂*, the MLE of *ω*, based on the same model will be around 1, while those based on the other model will be slightly shifted. We chose the NF model for simulation here in part to show that CNF MLEs will differ from NF even for this case. A similar but larger effect is seen when comparing CNF and TF. For this particular case, we chose the CNF model for generating data because of its clear relationship to the independent process.

We previously assessed the influence of changes to sequence composition on *ω̂* by sampling codon frequency sets with different GC% (Yap *et al*., 2010). GC% is a crude measure of sequence composition, reflecting only the frequency of different nucleotides, but it has been widely employed in molecular evolutionary analyses for decades (Cuny *et al*., 1981) and is known to vary considerably between and within genomes (Karlin *et al*., 1998). Microbial species in particular exhibit pronounced differences in nucleotide frequencies (Kanehisa, 2002).

Proteincoding genes from three different clades were arbitrarily sampled to span the range of GC% observed in nature: (1) a gene (K01873) from the AT-rich (~ 33% GC) *Borrelia burgdorferi, B. afzelli*, and *B. garinii*; (2) a gene (ENSG00000143520) from the primates *Homo sapiens, Pan troglodytes*, and *Macaca mulatta* with approximately equal AT and GC; and (3) a gene (ML0101/Rv3800c) from the GC-rich (~ 64%) genomes of *Mycobacterium tuberculosis* and *Mycobacterium leprae*. The codon frecodon frequencies in these genes were used to determine an HM *π*, using the NF model, or the averaged codon frequencies estimated across all sequences was used for the CNF model. Data were simulated under the neutral model (*ω* = 1) of both NF and CNF. We simulated long (90 Kbp) alignments to reduce the influence of sampling error. The results of the analyses are shown in Figure 8.3 (Yap *et al*., 2010).

From data simulated under the NF model, with HM *π*, $\widehat{\omega}$ from NF and CNF were typically very similar. The distributions of $\widehat{\omega}$ became increasingly similar as the GC% increased. We argued that this trend reflects the AT-richness of stop codons in the standard genetic code. The nucleotide composition of stop codons are expected to lead to differences between the models, since the NF codon model does not normalize the weighting of substitutions for the absence of these states from the model. In contrast, $\widehat{\omega}$ from TF exhibited a systematic shift with increasing GC% from negative to positive. This result confirms that, relative to the standard defined by the NF and CNF models for HM *π*, $\widehat{\omega}$ from the TF model will be biased.

For the non-HM *π* used for the (Yap *et al*., 2010) simulation study, all models showed marked differences in the distribution of $\widehat{\omega}$. Particularly striking was that the differences did not systematically change with increasing GC%, unlike the HM case. Also striking was the substantial discrepancy between CNF and NF. This result is consistent with the results from the dinucleotide analysis (Figure 8.2) where non-HM *π* was associated with $\widehat{r}$ differing substantially between NF and CNF.

These analyses of simulated data confirm the insights from examining the analytical relationships between the models. Namely, for the typically restricted codon model parametrizations in use, all the models give different $\widehat{\omega}$ distributions. The magnitude of these differences will depend on how close the codon frequencies are to HM and the actual frequencies of the codons. This analysis does not resolve, however, which is the correct model that should be used for the measurement of natural selection. $\widehat{\omega}$ from a wrong model will tend to differ from 1. The magnitude of the difference is affected by the relationship between the fitting and generating models. The latter is of particular (p.119)

## 8.4.2 Analysis of primate introns

The essential benchmark for evaluating the utility of the different models is an analysis of neutrally evolving real DNA sequences. While pseudogenes are a candidate for such an analysis, identifying multiple members of this sequence class that meet the necessary criteria of selective neutrality, and also span the spectrum of GC%, is a significant challenge. Moreover, such sequences typically contain frameshift mutations or stop codons that invalidate application of codon models. An alternative approach is to consider any sequence that does not encode proteins as selectively neutral with respect to protein coding content (Yap *et al*., 2010). The evolution of such sequences can be modelled using a trinucleotide process.

A modified post-sampling process was applied to the same introns used for the dinucleotide analyses. Part of the original sampling process of these introns involved masking sequence regions that could evolve by a non-point mutation process. (p.120) Examples include low-complexity sequence and insertion deletion (or gap) regions. Nucleotides within a sequence that fell into such a span were replaced by a special masking character. The alignment was then split into non-overlapping trinucleotides and aligned columns that contained a mask character in any sequence were eliminated. Alignments were then truncated to exactly the same length (~50kbp). This resulted in 470 alignments. This sampling process ensures that only naturally occurring trinucleotides are sampled and that different estimates are derived from equivalent amounts of data, homogenizing MLE variances.

We compared three pairs of nested models, CF_{HKY,ω} ⊂ CF_{GTR,ω}, NF_{HKY,ω} ⊂ NF_{GTR,ω}, and CNF_{HKY,ω} ⊂ CNF_{GTR,ω}. The results identified CNF_{GTR,ω} as the model that produced MLEs of *ω* that are most consistent with 1. The likelihood ratio test statistic from testing the null hypothesis of *ω* = 1 against the alternative *ω* ≠ 1 was very close to the theoretical expectation of the ${\chi}_{1}^{2}$ distribution for CNF (Figure 8.4). NF_{GTR,ω} was the next closest while CF_{HKY,ω} was the best-performing TF model and clearly worse than either of the other two models.

While we will consider the implications of the superiority of the GTR over HKY variant in more detail in the Conclusions, we expand here on the high similarity between the NF and CNF models. This similarity is due partly to the trinucleotide nature of the models, which means that the CNF and NF models are nested. Due to their different treatment of stop codon omission the codon model variants are not nested. As a conse

*ω*to be more biased under NF codon model variants.

# 8.5 Impact of model definitions on statistical power

Since the original publication of the codon models (Goldman and Yang, 1994; Muse and Gaut, 1994), most development has concentrated on refining the measurement of natural selection. These include methods to detect differences in molecular adaptation: between lineages (Yang, 1998); between residues (Nielsen and Yang, 1998); and a combination of these two, examining lineage specific molecular adaptation affecting individual residues (Nielsen and Yang, 1998; Zhang *et al*., 2005). In each of these cases, an alternate hypothesis of molecular adaptation is defined as lineages/sites, where *ω* > 1. As *ω* under TF, CNF, and NF are not the same, then the critical benchmark used to classify the mode of selection (*ω* = 1) will differ between the models.

Statistical power is defined as the probability of rejecting the null hypothesis when it is false. As the null model in codon evolutionary modelling cases includes the *ω* = 1 constraint, the probability of making erroneous conclusions must be affected for models where estimates of *ω* are biased by sequence composition. This statement was confirmed by (Yap *et al*., 2010), who showed for a single example that the quantiles of the likelihood ratio test (LRT) statistic from a test of a simple siteclass model were seriously in error for the TF model. The correct expected quantile distribution was not shown in that work as the calculation of the correct degrees-of-freedom for the LRT is complicated by the non-standard parameter space. It is most critical to realize, however, that the results of that analysis indicate that other conditions will be realized in which it is LRT from the NF model that will depart more from the expected theoretical distribution. In other words, departures from the expected quantiles will depend very much on the precise properties of the data being analysed and the statistical hypotheses being contrasted.

We further illustrate the challenge of understanding the impact of these model properties on statistical power here. We used exactly the same simulated dataset as that generated by Yap *et al*., (2010) for the examination of the error in tests for multiple siteclasses. Briefly, sequences from human, chimpanzee, and macaque that were one-to-one orthologs to the human gene with Ensembl ID ENSG00000143520 were used to fit a CNF_{GTR} model. The fitted model was then used to simulate 250 alignments that were each 90,000 nucleotide long. This gene has ~ 50%GC.

We specifically compare hypothesis concerning equivalence of *ω* between lineages. Departures from the neutral theory—purifying natural selection plus no natural selection—can also occur if natural selection changes between lineages. This can be examined, for instance, by specifying a null where *ω* is the same across all lineages and an alternate where *ω* is allowed to be different on one or more of the lineages. The degrees-of-freedom for such a test, all other parameters being unmodified, is just the number of *ω* values less 1. The consistency of LRT statistics with that expected can be established used a quantile-quantile plot. For the current purpose, we test the null of a single *ω* against the alternate of a separate *ω* for the human lineage.

Prior to discussing the results, it is worth considering our prior expectations. First, note that we are not concerned with whether *ω* = 1, only whether *ω* is the same for all lineages. At first glance then, it seems then that all three models should exhibit the expected quantile distribution. That is, while estimates of *ω* may be biased they can be expected to be consistently biased across lineages. Thus, we expect tests for differences in *ω* between lineages to be consistent with our null distribution. As it turns out, this is correct only for TF and CNF.

The results in Figure 8.5 show that the NF model is prone to false positives for this hypothesis test. The LRT statistic quantiles from CNF and TF substantially overlap and lie approximately on the expected diagonal. This is not the case for the LRT quantiles from the NF model, which are typically too large. We conjecture the NF result derives from the contribution of the additional context term (human-specific *ω*) to compensate for the assumed HM frequencies.
(p.122)

# 8.6 Conclusion

We have examined in detail the properties of three classes of context-dependent substitution processes, and demonstrated that analogous parameters have different interpretations. We previously demonstrated that the different model properties cause substantial differences in *ω* estimates from real biological data. Analyses of proteincoding genes from diverse lineages revealed discordances in parameter estimates between the models as high as 30% (Yap *et al*., 2010). Sequence composition and non-HM codon frequencies substantially contribute to these discrepancies. The choice of model thus has a profound impact on the interpretation of the mode of natural selection.

The additional analyses undertaken here indicate that the constraint of HM frequencies can affect inference in ways that make the NF specifically unreliable for analyses of between-lineage changes in natural selection. We had previously demonstrated that testing for *ω* = 1 is strikingly violated by the CF model, where the alternative hypothesis assumed a single *ω* ≠ 1 for all lineages. Here we also considered the effect on error rates of a different, but also popular alternative hypothesis, where the null was a single *ω* ≠ 1 and the alternate was different *ω* between lineages. In this case, the precise value of *ω* is not critical but that the value be equivalent between the lineages. For this analysis, both the CF and CNF models showed the expected Type 1 error rate, whereas NF showed an excess. This latter difference highlights the effect of assuming HM frequencies can affect the NF model in non-obvious ways. In this instance, the relative parametric sparseness of the NF model relative to the data-generating process makes the model prone to error because any additional parameters compensate for the poor fit between the model and the data.

Model fit is one essential element in choosing the best model for the analysis of contextual influences on DNA sequence evolution, but it is not sufficient by itself. The problem of model selection is a common one and systematic approaches to identifying the best model revolve around information theoretic measures (Burnham and Anderson, 2002). We compared the likelihood for the restricted
(p.123)
dinucleotide substitution models (Figure 8.2), where the CNF and CF models had analogous parametrizations and thus an identical number of free parameters. In this scenario, the CNF likelihood was higher than the CF likelihood for around 92% of the alignments. We note that under the most general possible stationary and reversible CNF and CF models, the likelihoods will be exactly the same. Yet because the restricted models are not the same, the *r* parameters have different interpretations. As these models are being applied to derive meaning about the influence of sequence neighbourhoods on the relative rates of substitution, we are principally concerned with the properties of *r* and thus likelihood alone (in this case) cannot guide our choice.

An additional essential step in model choice requires identifying the appropriate null hypothesis and the relationships of the models to it. We have argued that independence or the absence of context effects is the correct null hypothesis. While parametrizations can be constructed for all the models that are equivalent to independence, only the NF and CNF models do so without requiring additional parameters. In contrast, the largest possible number of parameters is required by the CF model in order to guarantee nesting of independence. Although NF does not suffer this flaw, its inability to specify nonHM frequencies affects estimates of both *π* and *r*. Finally, the ultimate arbiter must be assessment of naturally occurring data known (as best as possible) to evolve in a manner consistent with the null hypothesis. Our analyses of trinucleotide data served this requirement and showed that our expectation regarding the behaviour of the model and the *r* terms was only reasonably met for CNF model.

As a result of these deliberations, we suggest the theoretical properties and analyses of real data confirm CNF as the most suitable foundation for building special context dependent models. An illustration of this is the extension of codon-substitution models to measuring the influence of CpG substitution processes on the evolution of protein coding genes (Huttley, 2004). We further suggest that the CNF codon model should be considered as reasonable models of neutral codon evolution. The practical implications of these results are discussed in Box 8.2.

We conclude by emphasizing that a number of observations indicate that important critical problems affecting reliable estimation of parameters remain to be addressed. The fact that only quantiles from the GTR (not HKY) variant of CNF were consistent with theoretical expectations (Yap *et al*., 2010)
(p.124)
indicates that parametrizations sufficiently

general to account for the variability in the neutral processes are essential to correctly control error rates. This means that in cases where evolutionary events are being examined that span changes in the underlying neutral process through time, those changes may need to be specifically represented in the models in order for estimates of *ω* remain consistent with the null hypothesis. Evidence indicates that such changes to the neutral process may be quite common (Singh *et al*., 2009) and thus extending context dependent models to address the case of non-reversible and nonstationary processes is an important avenue for future work.

References

Bibliography references:

Burnham, K.P. and Anderson, D.R. (2002). *Model selection and multimodel inference: a practical information-theoretic approach*. Springer-Verlag.

Cooper, D.N. and Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease: Patterns and predictions. *Human Genetics* 85(1): 55-74.

Cooper, D.N. and Youssoufian, H. (1988). The CpG dinucleotide and human genetic disease. *Human Genetics* 78(2):151-155.

Coulondre, C., Miller, J.H., Farabaugh, P.J., and Gilbert, W. (1978). Molecular basis f base substitution hotspots in escherichia coli. *Nature* 274(5673): 775-780.

Cuny, G., Soriano, P., Macaya, G., and Bernardi, G. (1981). The major components of the mouse and human genomes. 1. Preparation, basic properties and compositional heterogeneity. *European Journal of Biochemistry/FEBS* 115(2): 227-233.

Duncan, B.K. and Miller, J.H. (1980). Mutagenic deamination of cytosine residues in DNA. *Nature* 287(5782):560-561.

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. *Journal of Molecular Evolution* 17(6): 368-376.

Gojobori, T., Li, W.-H., and Graur, D. (1982). Patterns of nucleotide substitution in pseudogenes and functional genes. *Journal of Molecular Evolution* 18(5): 360-369.

Goldman, N. and Yang, Z. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. *Molecular Biology and Evolution* 11(5): 725-736.

Gruenbaum, Y., Cedar, H., and Razin, A. (1982). Substrate and sequence specificity of a eukaryotic DNA methylase. *Nature* 295(5850): 620-622.

Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating the human-ape splitting by a molecular clock of mitochondrial DNA. *Journal of Molecular Evolution* 22: 160-174.

Hershberg, R. and Petrov, D.A. (2008). Selection on codon bias. *Annual Review of Genetics* 42: 287-299.

Hudson, R.R., Kreitman, M., and Aguade, M. (1987). A test of neutral molecular evolution based on nucleotide data. *Genetics* 116: 153-159.

Huelsenbeck, J.P. and Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. *Bioinformatics*, 17(8): 754-5.

Hughes, A.L. and Nei, M. (1988). Pattern of nucleotide substitution at MHC class I loci reveals overdominant selection. *Nature* 335: 167-170.

Huttley, G.A. (2004). Modeling the impact of DNA methylation on the evolution of BRCA1 in mammals. *Molecular Biology and Evolution* 21(9): 1760-1768.

Jensen, J.L. and Pedersen, A.-M.K. (2000). Probabilistic models of DNA sequence evolution with context dependent rates of substitution. *Advances in Applied Probability* 32(2): 499-517.

Kanehisa, M. (2002). The KEGG database. *Novartis Foundation Symposium* 247: 91-101; discussion 101-103, 119-128, 244-252.

Karlin, S., Campbell, A.M., and Mrázek, J. (1998). Comparative DNA analysis across diverse genomes. *Annual Review of Genetics* 32(0066-4197): 185-225.

Kimura, M. (1983). In *The neutral theory of molecular evolution*. Cambridge: Cambridge University Press.

Knight, R., Maxwell, P., Birmingham, A., Carnes, J., Caporaso, J., Easton, B. *et al*. (2007). PyCogent: a toolkit for making sense from sequence. *Genome Biology* 8(8): R171.

Krawczak, M., Ball, E.V., and Cooper, D.N. (1998). Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes. *American Journal of Human Genetic* s 63(2): 474-488.

Lanave, C., Preparata, G., Saccone, C., and Serio, G. (1984). A new method for calculating evolutionary substitution rates. *Journal of Molecular Evolution*, 20(1): 86-93.

Lindsay, H., Yap, V.B., Ying, H., and Huttley, G.A. (2008). Pitfalls of the most commonly used models of context dependent substitution. *Biology Direct* 3: 52.

McDonald, J.H. and Kreitman, M. (1991). Adaptive protein evolution at the Adh locus in Drosophila’. *Nature* 351(6328): 652-654.

Morton, B.R., Oberholzer, V.M., and Clegg, M.T. (1997). The influence of specific neighboring bases on substitution bias in noncoding regions of the plant chloroplast genome. *Journal of Molecular Evolution* 45(3): 227-231.

(p.125)
Muse, S.V. and Gaut, B.S. (1994). A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. *Molecular Biology and Evolution* 11(5): 715-724.

The Chimpanzee Sequencing and Analysis Consortium (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. *Nature* 437: 69-87.

Nielsen, R. and Yang, Z. (1998). Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. *Genetics* 148(3): 929-936.

Pond, S.L.K., Frost, S.D.W., and Muse, S.V. (2005). HyPhy: Hypothesis testing using phylogenies. *Bioinformatics*, 21(5): 676-679.

Schoninger, M. and von Haeseler, A. (1994). A stochastic model for the evolution of autocorrelated DNA sequences. *Molecular Phylogenetics and Evolution* 3: 240-247.

Siepel, A. and Haussler, D. (2004). Phylogenetic estimation of context-dependent substitution rates. *Molecular Biology and Evolution* 21: 468-488.

Singh, N.D., Arndt, P.F., Clark, A.G., and Aquadro, C.F. (2009). Strong evidence for lineage and sequence specificity of substitution rates and patterns in *Drosophila*. *Molecular Biology and Evolution* 26(7): 1591-1605.

Sommer, S.S., Scaringe, W.A., and Hill, K.A. (2001). Human germline mutation in the factor IX gene. *Mutations Research* 487: 1-17.

Tavare, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. *Lectures on Mathematics in the Life Sciences*, 17: 57-86.

Tillier, E.R.M. (1994). Maximum likelihood with multiparameter models of substitution. *Journal of Molecular Evolution* 39: 409-417.

Yang, Z. (1994). Estimating the pattern of nucleotide substitution. *Journal of Molecular Evolution* 39: 105-111.

Yang, Z. (1998). Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. *Molecular Biology and Evolution* 15: 568-573.

Yang, Z. (2007). PAML 4: Phylogenetic analysis by maximum likelihood. *Molecular Biology and Evolution* 24(8):1586-91.

Yang, Z. and Nielsen, R. (2002). Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. *Molecular Biology and Evolution* 19(6): 908-917.

Yap, V.B., Lindsay, H., Easteal, S., and Huttley, G.A. (2010). Estimates of the effect of natural selection on protein-coding content. *Molecular Biology and Evolution* 27(3): 726-734.

Zhang, J., Nielsen, R., and Yang, Z. (2005). Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. *Molecular Biology and Evolution* 22: 2472-2479.