04 Models of Amino Acid and Codon Substitution

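Nei and Gojobori (1986)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA TAA
            M   V   T   H   L   I   N   R   I   *
Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT TGA
            M   V   T   Q   L   M   T   R   I   *

When two codons differ at two positions, both shortest pathways are considered and their counts are averaged:

AAT vs. ACG
  Pathway I:  AAT(N) -> ACT(T) -> ACG(T)
  Pathway II: AAT(N) -> AAG(K) -> ACG(T)
TTA vs. CTC
  Pathway I:  TTA(L) -> CTA(L) -> CTC(L)
  Pathway II: TTA(L) -> TTC(F) -> CTC(L)

Number of synonymous substitutions: 5.5
Number of nonsynonymous substitutions: 4.5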
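The pathway averaging above can be sketched in Python. This is a minimal illustration, not the implementation of any particular package: the names `CODE` and `count_differences` are hypothetical, the standard genetic code is assumed, and pathways through a stop codon are skipped (none occur in this example). The terminal stop codons are left out of the comparison.

```python
from itertools import permutations

# Standard genetic code, with bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def count_differences(c1, c2):
    """Return (synonymous, nonsynonymous) differences between two codons,
    averaged over all shortest pathways that avoid stop codons."""
    positions = [i for i in range(3) if c1[i] != c2[i]]
    totals = []
    for order in permutations(positions):
        syn = nonsyn = 0
        current, valid = c1, True
        for i in order:
            nxt = current[:i] + c2[i] + current[i + 1:]
            if CODE[nxt] == "*":          # pathway passes through a stop codon
                valid = False
                break
            if CODE[current] == CODE[nxt]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
        if valid:
            totals.append((syn, nonsyn))
    if not totals:
        return 0.0, 0.0
    n = len(totals)
    return sum(s for s, _ in totals) / n, sum(v for _, v in totals) / n


seq1 = "ATG GTC ACT CAT TTA ATA AAT CGG ATA".split()
seq2 = "ATG GTT ACG CAA CTC ATG ACG AGG ATT".split()
pairs = [count_differences(a, b) for a, b in zip(seq1, seq2)]
Sd = sum(s for s, _ in pairs)   # synonymous substitutions
Nd = sum(v for _, v in pairs)   # nonsynonymous substitutions
print(Sd, Nd)
```

For the two sequences on the slide this gives 5.5 synonymous and 4.5 nonsynonymous substitutions.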

Nei and Gojobori (1986)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA TAA
            M   V   T   H   L   I   N   R   I   *
Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT TGA
            M   V   T   Q   L   M   T   R   I   *

Example: the nine single-nucleotide neighbours of codon ATA (I). Only ATT and ATC are synonymous.

  TTA ACA ATT
  CTA AAA ATC
  GTA AGA ATG

Number of synonymous sites: 2/9 * 3 = 2/3
Number of nonsynonymous sites: 7/9 * 3 = 7/3

Nei and Gojobori (1986)

Sequence 1:  ATG  GTC  ACT  CAT  TTA  ATA  AAT  CGG  ATA  TAA
             M    V    T    H    L    I    N    R    I    *
Syn sites:   0    1    1    1/3  2/3  2/3  1/3  4/3  2/3  *
Nonsyn:      3    2    2    8/3  7/3  7/3  8/3  5/3  7/3  *

Sequence 2:  ATG  GTT  ACG  CAA  CTC  ATG  ACG  AGG  ATT  TGA
             M    V    T    Q    L    M    T    R    I    *
Syn sites:   0    1    1    1/3  1    0    1    2/3  2/3  *
Nonsyn:      3    2    2    8/3  2    3    2    7/3  7/3  *

Number of synonymous sites: (18/3 + 17/3) / 2 = 35/6
Number of nonsynonymous sites: (63/3 + 64/3) / 2 = 127/6
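The per-codon site counts can be reproduced with a short sketch. It assumes the standard genetic code and, as in the table above, treats changes to a stop codon as nonsynonymous. The function name `synonymous_sites` is hypothetical, and exact fractions are used so the totals can be compared with 35/6 and 127/6 directly.

```python
from fractions import Fraction

# Standard genetic code, with bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def synonymous_sites(codon):
    """Unweighted number of synonymous sites in one codon: at each position,
    the fraction of the three possible changes that keep the amino acid."""
    s = Fraction(0)
    for i in range(3):
        for base in BASES:
            if base == codon[i]:
                continue
            mutant = codon[:i] + base + codon[i + 1:]
            if CODE[mutant] == CODE[codon]:
                s += Fraction(1, 3)
    return s


seq1 = "ATG GTC ACT CAT TTA ATA AAT CGG ATA".split()
seq2 = "ATG GTT ACG CAA CTC ATG ACG AGG ATT".split()
S1 = sum(synonymous_sites(c) for c in seq1)   # 18/3
S2 = sum(synonymous_sites(c) for c in seq2)   # 17/3
S = (S1 + S2) / 2                             # average synonymous sites
N = 3 * len(seq1) - S                         # remaining sites are nonsynonymous
print(S, N)
```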

Nei and Gojobori (1986)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA TAA
            M   V   T   H   L   I   N   R   I   *
Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT TGA
            M   V   T   Q   L   M   T   R   I   *

Number of synonymous substitutions: 5.5
Number of nonsynonymous substitutions: 4.5
Number of synonymous sites: (18/3 + 17/3) / 2 = 35/6
Number of nonsynonymous sites: (63/3 + 64/3) / 2 = 127/6
p-distance (synonymous): 5.5 / (35/6) = 0.942857
p-distance (nonsynonymous): 4.5 / (127/6) = 0.212598

Jukes and Cantor's (1969) one-parameter model

[Figure: an ancestral sequence diverging over time t into Sequence 1 and Sequence 2.]

At time t, the probability that the nucleotide in both sequences is the same:
$$I(t) = \frac{1}{4} + \frac{3}{4} e^{-8\alpha t}$$
The probability that the two sequences are different at a site at time t is:
$$D = 1 - I(t) = \frac{3}{4}\left(1 - e^{-8\alpha t}\right) \quad\Rightarrow\quad 8\alpha t = -\ln\left(1 - \frac{4}{3}D\right)$$
The actual number of substitutions per site since the divergence between the two sequences, $K = 2(3\alpha t)$, is:
$$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$$

Nei and Gojobori (1986)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA TAA
            M   V   T   H   L   I   N   R   I   *
Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT TGA
            M   V   T   Q   L   M   T   R   I   *

p-distance (synonymous): 5.5 / (35/6) = 0.942857
p-distance (nonsynonymous): 4.5 / (127/6) = 0.212598
$$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$$
JC69 one-parameter distance (synonymous): = ?
JC69 one-parameter distance (nonsynonymous): = 0.249996
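The JC69 correction can be checked numerically with a few lines of Python. The function name `jc69_distance` is hypothetical. The sketch returns `None` when D is 3/4 or larger, because the logarithm's argument is then zero or negative; that is the situation for the synonymous p-distance here.

```python
import math


def jc69_distance(D):
    """JC69 distance K = -(3/4) ln(1 - 4D/3).
    Returns None when D >= 3/4, where the formula is undefined."""
    if D >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * D / 3.0)


p_syn = 5.5 / (35 / 6)     # 0.942857
p_non = 4.5 / (127 / 6)    # 0.212598

print(jc69_distance(p_syn))             # None: D exceeds 3/4
print(round(jc69_distance(p_non), 6))
```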

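Modified Nei and Gojobori (1986)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA TAA
            M   V   T   H   L   I   N   R   I   *
Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT TGA
            M   V   T   Q   L   M   T   R   I   *

Single-nucleotide neighbours of ATA under two transition/transversion weightings.

Ts/Tv = 1/2 (α/β = 1): each of the nine neighbours counts once.
  TTA ACA ATT
  CTA AAA ATC
  GTA AGA ATG

Ts/Tv = 1 (α/β = 2): each transition neighbour (GTA, ACA, ATG) counts twice, giving twelve weighted changes.
  TTA ACA ATT
  CTA ACA ATC
  GTA AGA ATG
  GTA AAA ATG

With α/β = 2:
Number of synonymous sites: 2/12 * 3 = 0.5
Number of nonsynonymous sites: 10/12 * 3 = 2.5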

Modified Nei and Gojobori (1986)

Sequence 1:  ATG  GTC  ACT  CAT  TTA  ATA  AAT  CGG  ATA  TAA
             M    V    T    H    L    I    N    R    I    *
Syn sites:   0    1    1    1/2  1    1/2  1/2  5/4  1/2  *
Nonsyn:      3    2    2    5/2  2    5/2  5/2  7/4  5/2  *

Sequence 2:  ATG  GTT  ACG  CAA  CTC  ATG  ACG  AGG  ATT  TGA
             M    V    T    Q    L    M    T    R    I    *
Syn sites:   0    1    1    1/2  1    0    1    3/4  3/4  *
Nonsyn:      3    2    2    5/2  2    3    2    9/4  9/4  *

Number of synonymous sites: (25/4 + 6) / 2 = 49/8
Number of nonsynonymous sites: (83/4 + 21) / 2 = 167/8
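The weighted site counts can be sketched as follows. The parameter `kappa` stands for α/β (2 in this example) and must be an integer or a `Fraction`. The names `weighted_synonymous_sites` and `is_transition` are hypothetical, and changes to stop codons are treated as nonsynonymous, as in the table above.

```python
from fractions import Fraction

# Standard genetic code, with bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
PURINES = set("AG")


def is_transition(x, y):
    """True for A<->G or C<->T (x and y are assumed to differ)."""
    return (x in PURINES) == (y in PURINES)


def weighted_synonymous_sites(codon, kappa=2):
    """Synonymous sites in one codon when each transition is weighted by
    kappa = alpha/beta and each transversion by 1."""
    total = kappa + 2          # one transition and two transversions per position
    s = Fraction(0)
    for i in range(3):
        for base in BASES:
            if base == codon[i]:
                continue
            mutant = codon[:i] + base + codon[i + 1:]
            if CODE[mutant] == CODE[codon]:
                weight = kappa if is_transition(codon[i], base) else 1
                s += Fraction(weight, total)
    return s


seq1 = "ATG GTC ACT CAT TTA ATA AAT CGG ATA".split()
seq2 = "ATG GTT ACG CAA CTC ATG ACG AGG ATT".split()
S1 = sum(weighted_synonymous_sites(c) for c in seq1)   # 25/4
S2 = sum(weighted_synonymous_sites(c) for c in seq2)   # 6
S = (S1 + S2) / 2
N = 3 * len(seq1) - S
print(S, N)
```

Setting `kappa=1` weights all changes equally and reduces to the unweighted Nei and Gojobori count.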

Modified Nei and Gojobori (1986)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA TAA
            M   V   T   H   L   I   N   R   I   *
Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT TGA
            M   V   T   Q   L   M   T   R   I   *

Number of synonymous substitutions: 5.5
Number of nonsynonymous substitutions: 4.5
Number of synonymous sites: (25/4 + 6) / 2 = 49/8
Number of nonsynonymous sites: (83/4 + 21) / 2 = 167/8
p-distance (synonymous): 5.5 / (49/8) = 0.897959
p-distance (nonsynonymous): 4.5 / (167/8) = 0.215569

Modified Nei and Gojobori (1986)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA TAA
            M   V   T   H   L   I   N   R   I   *
Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT TGA
            M   V   T   Q   L   M   T   R   I   *

p-distance (synonymous): 5.5 / (49/8) = 0.897959
p-distance (nonsynonymous): 4.5 / (167/8) = 0.215569
$$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$$
JC69 one-parameter distance (synonymous): = ?
JC69 one-parameter distance (nonsynonymous): = 0.254153

Li, Wu and Luo (1985)

Nondegenerate: all possible changes at the site are nonsynonymous.
Twofold degenerate: one of the three possible changes is synonymous.
Fourfold degenerate: all possible changes at the site are synonymous.

First, count the numbers of the three types of sites in each of the two sequences compared, and then compute the averages, denoting them by $L_0$ (nondegenerate), $L_2$ (twofold), and $L_4$ (fourfold), respectively.

Li, Wu and Luo (1985)

[Figure: protein-coding sequences, with nondegenerate, twofold degenerate, and fourfold degenerate sites labelled.]

Li, Wu and Luo (1985)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA TAA
            M   V   T   H   L   I   N   R   I   *
Degeneracy:                                 002 *

Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT TGA
            M   V   T   Q   L   M   T   R   I   *
Degeneracy:                                 002 *

To be counted: $L_0$ (nondegenerate), $L_2$ (twofold), $L_4$ (fourfold).

Li, Wu and Luo (1985)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA TAA
            M   V   T   H   L   I   N   R   I   *
Degeneracy: 000 004 004 002 202 002 002 204 002 *

Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT TGA
            M   V   T   Q   L   M   T   R   I   *
Degeneracy: 000 004 004 002 004 000 004 202 002 *

$L_0$ (nondegenerate) = 36/2 = 18
$L_2$ (twofold) = 11/2 = 5.5
$L_4$ (fourfold) = 7/2 = 3.5
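The degeneracy labels can be derived from the genetic code with a short sketch. It assumes the standard code and the simplification visible on the slide: a position where one or two of the three changes are synonymous is labelled twofold (so the third position of ATA is "2"). The function name `degeneracy` is hypothetical.

```python
# Standard genetic code, with bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def degeneracy(codon):
    """Return a three-character string of 0/2/4 labels, one per codon position."""
    labels = ""
    for i in range(3):
        n_syn = sum(1 for base in BASES
                    if base != codon[i]
                    and CODE[codon[:i] + base + codon[i + 1:]] == CODE[codon])
        if n_syn == 0:
            labels += "0"      # nondegenerate
        elif n_syn == 3:
            labels += "4"      # fourfold degenerate
        else:
            labels += "2"      # one or two synonymous changes: treated as twofold
    return labels


seq1 = "ATG GTC ACT CAT TTA ATA AAT CGG ATA".split()
seq2 = "ATG GTT ACG CAA CTC ATG ACG AGG ATT".split()
labels1 = [degeneracy(c) for c in seq1]
labels2 = [degeneracy(c) for c in seq2]

# Average the site counts over the two sequences.
all_labels = "".join(labels1) + "".join(labels2)
L0 = all_labels.count("0") / 2
L2 = all_labels.count("2") / 2
L4 = all_labels.count("4") / 2
print(labels1, labels2, L0, L2, L4)
```

For the nine sense codons this gives 36/2 = 18 nondegenerate, 11/2 = 5.5 twofold and 7/2 = 3.5 fourfold sites.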

Li, Wu and Luo (1985)

The stop codon is replaced by ten CGG codons in each sequence.

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA CGG*10
            M   V   T   H   L   I   N   R   I   R
Degeneracy: 000 004 004 002 202 002 002 204 002 204

Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT CGG*10
            M   V   T   Q   L   M   T   R   I   R
Degeneracy: 000 004 004 002 004 000 004 202 002 204

$L_0$ (nondegenerate) = 36/2 + 10 = 28
$L_2$ (twofold) = 11/2 + 10 = 15.5
$L_4$ (fourfold) = 7/2 + 10 = 13.5

Li, Wu and Luo (1985)

First, count the numbers of the three types of sites in each of the two sequences compared, and then compute the averages, denoting them by $L_0$ (nondegenerate), $L_2$ (twofold), and $L_4$ (fourfold), respectively.

The nucleotide differences in each class are further classified into transitional ($S_i$) and transversional ($V_i$) differences (i = 0, 2, 4).

Synonymous: $S_2$, $S_4$, $V_4$
Nonsynonymous: $S_0$, $V_0$, $V_2$

Li, Wu and Luo (1985)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA CGG*10
            M   V   T   H   L   I   N   R   I   R
Degeneracy: 000 004 004 002 202 002 002 204 002 204
Class:          S4

Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT CGG*10
            M   V   T   Q   L   M   T   R   I   R
Degeneracy: 000 004 004 002 004 000 004 202 002 204
Class:          S4

$L_0$ (nondegenerate) = 28, $L_2$ (twofold) = 15.5, $L_4$ (fourfold) = 13.5

To be counted: $S_0$, $S_2$, $S_4$, $V_0$, $V_2$, $V_4$.

Li, Wu and Luo (1985)

[Figure: protein-coding sequences with nondegenerate, twofold degenerate, and fourfold degenerate sites; changes at twofold sites are labelled $S_2$ or $V_2$.]

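Li, Wu and Luo (1985)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA CGG*10
            M   V   T   H   L   I   N   R   I   R
Degeneracy: 000 004 004 002 202 002 002 204 002 204
Class:          S4              V2

Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT CGG*10
            M   V   T   Q   L   M   T   R   I   R
Degeneracy: 000 004 004 002 004 000 004 202 002 204
Class:          S4              S0

$L_0$ (nondegenerate) = 28, $L_2$ (twofold) = 15.5, $L_4$ (fourfold) = 13.5

The difference between ATA and ATG is classified separately in each sequence and contributes one half to each class: $S_0$ = 0.5 and $V_2$ = 0.5. The remaining classes are still to be counted.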

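Li, Wu and Luo (1985)

Sequence 1: ATG GTC ACT CAT TTA ATA AAT CGG ATA CGG*10
            M   V   T   H   L   I   N   R   I   R
Degeneracy: 000 004 004 002 202 002 002 204 002 204

Sequence 2: ATG GTT ACG CAA CTC ATG ACG AGG ATT CGG*10
            M   V   T   Q   L   M   T   R   I   R
Degeneracy: 000 004 004 002 004 000 004 202 002 204

Classification of each difference, by codon:

Codon  Seq 1  Seq 2  Class in seq 1  Class in seq 2
2      GTC    GTT    S4              S4
3      ACT    ACG    V4              V4
4      CAT    CAA    V2              V2
5      TTA    CTC    S2, V2          S0, V4
6      ATA    ATG    V2              S0
7      AAT    ACG    V0, V2          V0, V4
8      CGG    AGG    S2              S2
9      ATA    ATT    S2              S2

At twofold sites, synonymous changes are counted in $S_2$ and nonsynonymous changes in $V_2$. The synonymous transversions in codons 8 and 9 therefore fall under $S_2$, and the nonsynonymous transition ATA -> ATG falls under $V_2$.

$L_0$ (nondegenerate) = 28, $L_2$ (twofold) = 15.5, $L_4$ (fourfold) = 13.5

$S_0$ = 2/2 = 1
$S_2$ = 5/2 = 2.5
$S_4$ = 2/2 = 1
$V_0$ = 2/2 = 1
$V_2$ = 5/2 = 2.5
$V_4$ = 4/2 = 2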

Li, Wu and Luo (1985)

$L_0$ (nondegenerate) = 28, $L_2$ (twofold) = 15.5, $L_4$ (fourfold) = 13.5
$S_0$ = 2/2 = 1, $S_2$ = 5/2 = 2.5, $S_4$ = 2/2 = 1
$V_0$ = 2/2 = 1, $V_2$ = 5/2 = 2.5, $V_4$ = 4/2 = 2

The proportions of transitional and transversional differences in each class are to be computed as:
$$ts_i = \frac{S_i}{L_i}, \qquad tv_i = \frac{V_i}{L_i} \qquad (i = 0, 2, 4)$$

Li, Wu and Luo (1985)

$L_0$ (nondegenerate) = 28, $L_2$ (twofold) = 15.5, $L_4$ (fourfold) = 13.5
$S_0$ = 2/2 = 1, $S_2$ = 5/2 = 2.5, $S_4$ = 2/2 = 1
$V_0$ = 2/2 = 1, $V_2$ = 5/2 = 2.5, $V_4$ = 4/2 = 2

$ts_0 = S_0 / L_0$ = 0.0357
$ts_2 = S_2 / L_2$ = 0.1613
$ts_4 = S_4 / L_4$ = 0.0741
$tv_0 = V_0 / L_0$ = 0.0357
$tv_2 = V_2 / L_2$ = 0.1613
$tv_4 = V_4 / L_4$ = 0.1481

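Kimura's (1980) two-parameter model

[Figure: an ancestral sequence diverging over time t into Sequence 1 and Sequence 2.]

At time t, the probability that the nucleotide in both sequences is the same:
$$I(t) = \frac{1}{4} + \frac{1}{4}e^{-8\beta t} + \frac{1}{2}e^{-4(\alpha+\beta)t}$$
The probabilities of a transitional and a transversional difference at a site are:
$$ts(t) = \frac{1}{4} + \frac{1}{4}e^{-8\beta t} - \frac{1}{2}e^{-4(\alpha+\beta)t}, \qquad tv(t) = \frac{1}{2} - \frac{1}{2}e^{-8\beta t}$$
Solving for the exponents:
$$8\beta t = -\ln(1 - 2tv), \qquad 4(\alpha+\beta)t = -\ln(1 - 2ts - tv)$$
$$A = 2\alpha t = \frac{1}{2}\ln\frac{1}{1 - 2ts - tv} - \frac{1}{4}\ln\frac{1}{1 - 2tv}, \qquad B = 4\beta t = \frac{1}{2}\ln\frac{1}{1 - 2tv}$$
$$K = A + B = 2(\alpha + 2\beta)t = \frac{1}{2}\ln\frac{1}{1 - 2ts - tv} + \frac{1}{4}\ln\frac{1}{1 - 2tv}$$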
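As a numerical check, the two-parameter formulas can be applied to each degeneracy class of the worked example ($L_0$ = 28, $L_2$ = 15.5, $L_4$ = 13.5, with the $S_i$ and $V_i$ counted earlier). The function name `k2p_components` and the dictionary layout are hypothetical choices for this sketch.

```python
import math


def k2p_components(ts, tv):
    """Return (A, B, K) for Kimura's two-parameter model, where
    A = 2*alpha*t (transitional), B = 4*beta*t (transversional), K = A + B."""
    a = 1.0 / (1.0 - 2.0 * ts - tv)
    b = 1.0 / (1.0 - 2.0 * tv)
    A = 0.5 * math.log(a) - 0.25 * math.log(b)
    B = 0.5 * math.log(b)
    return A, B, A + B


# Site counts and difference counts for the three degeneracy classes.
L = {0: 28.0, 2: 15.5, 4: 13.5}
S = {0: 1.0, 2: 2.5, 4: 1.0}     # transitional differences
V = {0: 1.0, 2: 2.5, 4: 2.0}     # transversional differences

results = {}
for i in (0, 2, 4):
    ts, tv = S[i] / L[i], V[i] / L[i]
    results[i] = k2p_components(ts, tv)
    A, B, K = results[i]
    print(f"class {i}: ts={ts:.4f} tv={tv:.4f} A={A:.4f} B={B:.4f} K={K:.4f}")
```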

Li, Wu and Luo (1985)

$L_0$ (nondegenerate) = 28, $L_2$ (twofold) = 15.5, $L_4$ (fourfold) = 13.5
$S_0$ = 2/2 = 1, $S_2$ = 5/2 = 2.5, $S_4$ = 2/2 = 1
$V_0$ = 2/2 = 1, $V_2$ = 5/2 = 2.5, $V_4$ = 4/2 = 2

$A_0$ = 0.0381, $A_2$ = 0.2333, $A_4$ = 0.0878
$B_0$ = 0.0371, $B_2$ = 0.1947, $B_4$ = 0.1757
$K_0$ = 0.0752, $K_2 = A_2 + B_2$ = 0.4281, $K_4$ = 0.2635

$$K_S = \frac{L_2 A_2 + L_4 K_4}{\frac{L_2}{3} + L_4}, \qquad K_A = \frac{L_2 B_2 + L_0 K_0}{\frac{2L_2}{3} + L_0}$$

Li, Wu and Luo (1985)

$L_0$ (nondegenerate) = 28, $L_2$ (twofold) = 15.5, $L_4$ (fourfold) = 13.5
$S_0$ = 2/2 = 1, $S_2$ = 5/2 = 2.5, $S_4$ = 2/2 = 1
$V_0$ = 2/2 = 1, $V_2$ = 5/2 = 2.5, $V_4$ = 4/2 = 2

$A_0$ = 0.0381, $A_2$ = 0.2333, $A_4$ = 0.0878
$B_0$ = 0.0371, $B_2$ = 0.1947, $B_4$ = 0.1757
$K_0$ = 0.0752, $K_2 = A_2 + B_2$ = 0.4281, $K_4$ = 0.2635

$$K_S = \frac{L_2 A_2 + L_4 K_4}{\frac{L_2}{3} + L_4} = 0.3844, \qquad K_A = \frac{L_2 B_2 + L_0 K_0}{\frac{2L_2}{3} + L_0} = 0.1337$$
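The Li, Wu and Luo (1985) estimates can be recomputed from the counts in a few lines. This sketch starts from the site and difference counts on the slide rather than from the sequences; the helper name `k2p_components` and the variable names are hypothetical.

```python
import math


def k2p_components(ts, tv):
    """Kimura two-parameter components: A (transitional), B (transversional), K = A + B."""
    A = 0.5 * math.log(1.0 / (1.0 - 2.0 * ts - tv)) - 0.25 * math.log(1.0 / (1.0 - 2.0 * tv))
    B = 0.5 * math.log(1.0 / (1.0 - 2.0 * tv))
    return A, B, A + B


L0, L2, L4 = 28.0, 15.5, 13.5
A0, B0, K0 = k2p_components(1.0 / L0, 1.0 / L0)    # S0 = 1,   V0 = 1
A2, B2, K2 = k2p_components(2.5 / L2, 2.5 / L2)    # S2 = 2.5, V2 = 2.5
A4, B4, K4 = k2p_components(1.0 / L4, 2.0 / L4)    # S4 = 1,   V4 = 2

# LWL85: a twofold site counts as 1/3 synonymous and 2/3 nonsynonymous.
Ks_lwl85 = (L2 * A2 + L4 * K4) / (L2 / 3.0 + L4)
Ka_lwl85 = (L2 * B2 + L0 * K0) / (2.0 * L2 / 3.0 + L0)
print(round(Ks_lwl85, 4), round(Ka_lwl85, 4))
```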

Li (1993) & Pamilo and Bianchi (1993)

$L_0$ (nondegenerate) = 28, $L_2$ (twofold) = 15.5, $L_4$ (fourfold) = 13.5
$S_0$ = 2/2 = 1, $S_2$ = 5/2 = 2.5, $S_4$ = 2/2 = 1
$V_0$ = 2/2 = 1, $V_2$ = 5/2 = 2.5, $V_4$ = 4/2 = 2

$A_0$ = 0.0381, $A_2$ = 0.2333, $A_4$ = 0.0878
$B_0$ = 0.0371, $B_2$ = 0.1947, $B_4$ = 0.1757
$K_0$ = 0.0752, $K_2 = A_2 + B_2$ = 0.4281, $K_4$ = 0.2635

$$K_S = \frac{L_2 A_2 + L_4 A_4}{L_2 + L_4} + B_4, \qquad K_A = A_0 + \frac{L_0 B_0 + L_2 B_2}{L_0 + L_2}$$

Li (1993) & Pamilo and Bianchi (1993)

$L_0$ (nondegenerate) = 28, $L_2$ (twofold) = 15.5, $L_4$ (fourfold) = 13.5
$S_0$ = 2/2 = 1, $S_2$ = 5/2 = 2.5, $S_4$ = 2/2 = 1
$V_0$ = 2/2 = 1, $V_2$ = 5/2 = 2.5, $V_4$ = 4/2 = 2

$A_0$ = 0.0381, $A_2$ = 0.2333, $A_4$ = 0.0878
$B_0$ = 0.0371, $B_2$ = 0.1947, $B_4$ = 0.1757
$K_0$ = 0.0752, $K_2 = A_2 + B_2$ = 0.4281, $K_4$ = 0.2635

$$K_S = \frac{L_2 A_2 + L_4 A_4}{L_2 + L_4} + B_4 = 0.3413, \qquad K_A = A_0 + \frac{L_0 B_0 + L_2 B_2}{L_0 + L_2} = 0.1314$$
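The Li (1993) and Pamilo and Bianchi (1993) estimates follow from the same components. As before, this is a sketch that starts from the counts on the slide, with hypothetical function and variable names.

```python
import math


def k2p_components(ts, tv):
    """Kimura two-parameter components: A (transitional), B (transversional), K = A + B."""
    A = 0.5 * math.log(1.0 / (1.0 - 2.0 * ts - tv)) - 0.25 * math.log(1.0 / (1.0 - 2.0 * tv))
    B = 0.5 * math.log(1.0 / (1.0 - 2.0 * tv))
    return A, B, A + B


L0, L2, L4 = 28.0, 15.5, 13.5
A0, B0, _ = k2p_components(1.0 / L0, 1.0 / L0)    # S0 = 1,   V0 = 1
A2, B2, _ = k2p_components(2.5 / L2, 2.5 / L2)    # S2 = 2.5, V2 = 2.5
A4, B4, _ = k2p_components(1.0 / L4, 2.0 / L4)    # S4 = 1,   V4 = 2

# Synonymous rate: transitions averaged over twofold and fourfold sites,
# plus transversions at fourfold sites.
Ks_li93 = (L2 * A2 + L4 * A4) / (L2 + L4) + B4
# Nonsynonymous rate: transitions at nondegenerate sites, plus transversions
# averaged over nondegenerate and twofold sites.
Ka_li93 = A0 + (L0 * B0 + L2 * B2) / (L0 + L2)
print(round(Ks_li93, 4), round(Ka_li93, 4))
```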

Homework: Calculate $S_0$, $S_2$, $S_4$, $V_0$, $V_2$, and $V_4$ between human HLA-A and HLA-B genes for the first 240 nucleotides.

Estimation of distance between two protein sequences

* * * * * * * * * * * * * * *
M T S L A V T P L I K H S V A S Y T L P C N T

The actual number of substitutions per site, K, is 15/23 in this case.

We subdivide t into n pieces, each of length t/n. The number of substitutions per site per unit time is
$$p = \frac{K}{n}$$
For each site, the probability that we find x substitutions is a binomial probability:
$$b(x; n, p) = \frac{n!}{x!\,(n-x)!}\,p^x (1-p)^{n-x}$$

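National Chiao Tung University, Institute of Bioinformatics and Systems Biology, Prof. Yeong-Shin Lin (林勇欣)

For each site, the probability that we find x substitutions is a binomial probability:
$$b(x; n, p) = \frac{n!}{x!\,(n-x)!}\,p^x (1-p)^{n-x} = \frac{n(n-1)\cdots(n-x+1)}{x!}\left(\frac{K}{n}\right)^x \left(1-\frac{K}{n}\right)^{n-x}$$
$$= \frac{K^x}{x!}\left(1-\frac{1}{n}\right)\left(1-\frac{2}{n}\right)\cdots\left(1-\frac{x-1}{n}\right)\left(1-\frac{K}{n}\right)^{n}\left(1-\frac{K}{n}\right)^{-x}$$
When $n \to \infty$, $\left(1-\frac{K}{n}\right)^{n} \to e^{-K}$ and the remaining factors tend to 1, so
$$b(x; n, p) \to \frac{e^{-K}K^x}{x!}$$
which is expressed as $p(x; K)$, the Poisson probability.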
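The limit can also be seen numerically. The sketch below compares the binomial probability with p = K/n against the Poisson probability for increasing n, using K = 15/23 from the example; the function names are hypothetical.

```python
import math


def binomial_pmf(x, n, p):
    """Probability of x substitutions in n intervals, each with probability p."""
    return math.comb(n, x) * p**x * (1.0 - p)**(n - x)


def poisson_pmf(x, K):
    """Poisson probability of x substitutions when the expected number is K."""
    return math.exp(-K) * K**x / math.factorial(x)


K = 15 / 23
sizes = (10, 100, 10_000, 1_000_000)
gaps = []
for n in sizes:
    # Largest absolute difference between the two probabilities for x = 0..3.
    gap = max(abs(binomial_pmf(x, n, K / n) - poisson_pmf(x, K)) for x in range(4))
    gaps.append(gap)
    print(n, gap)
```

The gap shrinks as n grows, which is the behaviour the derivation predicts.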

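* * * * * * * * * * * * * * *
M T S L A V T P L I K H S V A S Y T L P C N T

For each site, the Poisson probability that we find x substitutions is
$$p(x; K) = \frac{e^{-K}K^x}{x!}$$
The probability that a site is identical (no substitution) is
$$I = p(0; K) = \frac{e^{-K}K^0}{0!} = e^{-K}$$
so $D = 1 - I = 1 - e^{-K}$, which gives the Poisson distance:
$$K = -\ln(1 - D)$$
The JC69 one-parameter model for nucleotide sequences:
$$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$$
The corresponding formula for amino acid sequences (20 states):
$$K = -\frac{19}{20}\ln\left(1 - \frac{20}{19}D\right)$$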
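The three corrections differ only in the number of character states, which a small sketch makes explicit. The function names are hypothetical, and D = 0.3 is an arbitrary illustrative value rather than one taken from the slides.

```python
import math


def poisson_distance(D):
    """Poisson distance K = -ln(1 - D)."""
    return -math.log(1.0 - D)


def jc_like_distance(D, states):
    """Equal-rate distance for `states` character states:
    K = -b ln(1 - D/b) with b = (states - 1) / states.
    states=4 gives the nucleotide JC69 formula, states=20 the amino acid form."""
    b = (states - 1) / states
    return -b * math.log(1.0 - D / b)


D = 0.3   # hypothetical proportion of different sites
print(poisson_distance(D))        # no correction for back substitution
print(jc_like_distance(D, 20))    # amino acids
print(jc_like_distance(D, 4))     # nucleotides
```

For the same D, the estimate increases as the number of states decreases, since back and parallel substitutions are more likely with fewer states.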

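The substitution rate λ varies among sites according to the gamma distribution:
$$g(\lambda) = \frac{b^a}{\Gamma(a)}\,\lambda^{a-1} e^{-b\lambda}$$
$$a = \frac{\bar{\lambda}^2}{V}, \qquad b = \frac{\bar{\lambda}}{V}, \qquad \Gamma(a) = \int_0^\infty e^{-t}\,t^{a-1}\,dt$$
where $\bar{\lambda}$ is the mean and V the variance of λ; a is the shape parameter and b is the scale parameter.

[Figure: g(λ) plotted against λ (0 to 0.5) for a = 0.5, a = 1.0, and a = 2.0.]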
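A minimal sketch of the density, parameterised by the mean and variance of λ as above, is given here. The function name `gamma_density` and the numerical values in the usage lines are hypothetical.

```python
import math


def gamma_density(lam, mean, variance):
    """Gamma density g(lambda) with a = mean^2 / V and b = mean / V."""
    a = mean**2 / variance
    b = mean / variance
    return b**a / math.gamma(a) * lam**(a - 1.0) * math.exp(-b * lam)


# With mean 0.1, a variance of 0.02 gives a = 0.5, 0.01 gives a = 1, 0.005 gives a = 2.
for variance in (0.02, 0.01, 0.005):
    print(variance, gamma_density(0.05, 0.1, variance))
```

When a = 1 the density reduces to the exponential distribution $b e^{-b\lambda}$; smaller a means stronger rate variation among sites.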

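Nonuniform rates

With gamma-distributed rates among sites, the amino acid JC-type distance becomes
$$K = \frac{19}{20}\,a\left[\left(1 - \frac{20}{19}D\right)^{-1/a} - 1\right]$$
and the Poisson distance becomes
$$K = a\left[(1 - D)^{-1/a} - 1\right]$$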
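The gamma-corrected distances can be sketched in the same way. The function names are hypothetical, and the values of D and a below are illustrative only.

```python
import math


def gamma_poisson_distance(D, a):
    """Poisson distance with gamma-distributed rates: K = a[(1 - D)^(-1/a) - 1]."""
    return a * ((1.0 - D) ** (-1.0 / a) - 1.0)


def gamma_jc_like_distance(D, a, states=20):
    """Equal-rate distance with gamma-distributed rates:
    K = b*a[(1 - D/b)^(-1/a) - 1], with b = (states - 1) / states."""
    b = (states - 1) / states
    return b * a * ((1.0 - D / b) ** (-1.0 / a) - 1.0)


D = 0.5   # hypothetical proportion of different sites
for a in (0.5, 1.0, 2.0, 1e6):
    print(a, gamma_poisson_distance(D, a), gamma_jc_like_distance(D, a))
```

As a grows, the rates become uniform and the values approach the uncorrected formulas $-\ln(1-D)$ and $-\frac{19}{20}\ln(1-\frac{20}{19}D)$; small a gives larger distances for the same D.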

Li (1997) Molecular Evolution

PAM matrix (for point-accepted mutations) Yang (2006) Computational Molecular Evolution

PAM matrix (for point-accepted mutations)

PAM matrices are based on global alignments of closely related proteins.
1. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
2. Other PAM matrices are extrapolated from PAM1.
3. This kind of empirical amino acid substitution matrix is also used in the alignment of multiple protein sequences.

PAM100 for t = 1 substitution per site

http://www.ncbi.nlm.nih.gov/education/blastinfo/scoring2.html

BLOSUM matrix (blocks substitution matrix)

BLOSUM matrices are based on local alignments.
1. BLOSUM 62 is a matrix calculated from comparisons of sequences with no more than 62% similarity.
2. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
3. BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

http://www.ncbi.nlm.nih.gov/education/blastinfo/scoring2.html

BLOSUM matrix (blocks substitution matrix) Generally speaking... The Blosum matrices are best for detecting local alignments. The Blosum62 matrix is the best for detecting the majority of weak protein similarities. The Blosum45 matrix is the best for detecting long and weak alignments. http://www.ebi.ac.uk/help/matrix.html

Differences between PAM and BLOSUM

PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree), whereas the BLOSUM matrices are based on an implicit model of evolution.

The PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps.

http://en.wikipedia.org/wiki/substitution_matrix

Differences between PAM and BLOSUM

The method used to count the replacements is different: unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.

Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than BLOSUM50.

http://en.wikipedia.org/wiki/substitution_matrix

Equivalent PAM and Blosum matrices

The following matrices are roughly equivalent...
PAM100 ==> Blosum90
PAM120 ==> Blosum80
PAM160 ==> Blosum60
PAM200 ==> Blosum52
PAM250 ==> Blosum45

http://www.ebi.ac.uk/help/matrix.html
http://www.ncbi.nlm.nih.gov/education/blastinfo/scoring2.html

Homework: Use MEGA to calculate different genetic distances (different models including models for nucleotide, synonymous-nonsynonymous, and amino acids, etc.) for your mitochondrial cytochrome b sequences alignment file. Compare these distances.