7.91 / 7.36 / BE.490, Lecture #4, Mar. 4, 2004
Markov & Hidden Markov Models for DNA Sequence Analysis
Chris Burge

Organization of Topics

Lecture   Object     Model                             Dependence Structure
3/2       (figure)   Weight Matrix Model               Independence
3/4       (figure)   Hidden Markov Model               Local Dependence
3/9       (figure)   Energy Model, Covariation Model   Non-local Dependence

Markov & Hidden Markov Models for DNA

• Markov Models for splice sites
• Hidden Markov Models - looking under the hood
• The Viterbi Algorithm
• Real World HMMs

See Ch. 4 of Mount

Review of DNA Motif Modeling & Discovery

• WMMs for splice sites
• Information Content of a Motif
• The Motif Finding/Discovery Problem
• The Gibbs Sampler
  - The Gibbs Sampling Algorithm Multimedia Experience
• Motif Modeling - Beyond Weight Matrices

See Ch. 4 of Mount

Information Content of a DNA/RNA Motif

(figure: motif positions -3 -2 -1 1 2 3 4 5 6)

$f_k$ = freq. of nt $k$ at position

Shannon Entropy:
$H(\vec{f}) = -\sum_k f_k \log_2(f_k)$ (bits)

Information/position:
$I(\vec{f}) = 2 - H(\vec{f}) = 2 + \sum_k f_k \log_2(f_k) = \sum_k f_k \log_2\left(\frac{f_k}{1/4}\right)$ (bits)

Motif containing $m$ bits of info. will occur approximately once per $2^m$ bases of random sequence

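The entropy and information formulas on this slide can be checked numerically. The sketch below is illustrative only: the three-position motif is made up, and summing the per-position information assumes the positions are independent, as in a weight matrix model.

```python
import math

def information_content(freqs):
    """Information at one motif position, I = 2 - H, in bits."""
    # Terms with f = 0 contribute nothing to the entropy (0 * log 0 := 0).
    entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return 2.0 - entropy

# Hypothetical motif (not from the lecture): one column of frequencies per position.
motif = [
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},      # invariant G
    {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0},      # A or G, equally likely
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative
]

per_position = [information_content(col) for col in motif]
total_bits = sum(per_position)        # assumes independent positions
expected_spacing = 2 ** total_bits    # approx. bases of random sequence per chance match

print(per_position, total_bits, expected_spacing)
```

With these made-up frequencies the motif carries 3 bits, so a chance match is expected roughly once per 8 bases of random sequence.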
Variables Affecting Motif Finding

L = avg. sequence length
N = no. of sequences
I = info. content of motif
W = motif width

gcggaagagggcactagcccatgtgagagggcaaggacca
atctttctcttaaaaataacataattcagggccaggatgt
gtcacgagctttatcctacagatgatgaatgcaaatcagc
taaaagataatatcgaccctagcgtggcgggcaaggtgct
gtagattcgggtaccgttcataaaagtacgggaatttcgg
tatacttttaggtcgttatgttaggcgagggcaaaagtca
ctctgccgattcggcgagtgatcgaagagggcaatgcctc
aggatggggaaaatatgagaccaggggagggccacactgc
acacgtctagggctgtgaaatctctgccgggctaacagac
gtgtcgatgttgagaacgtaggcgccgaggccaacgctga
atgcaccgccattagtccggttccaagagggcaactttgt
ctgcgggcggcccagtgcgcaacgcacagggcaaggttta
tgtgttgggcggttctgaccacatgcgagggcaacctccc
gtcgcctaccctggcaattgtaaaacgacggcaatgttcg
cgtattaatgataaagaggggggtaggaggtcaactcttc
aatgcttataacataggagtagagtagtgggtaaactacg
tctgaaccttctttatgcgaagacgcgagggcaatcggga
tgcatgtctgacaacttgtccaggaggaggtcaacgactc
cgtgtcatagaattccatccgccacgcggggtaatttgga
tcccgtcaaagtgccaacttgtgccggggggctagcagct
acagcccgggaatatagacgcgtttggagtgcaaacatac
acgggaagatacgagttcgatttcaagagttcaaaacgtg
cccgataggactaataaggacgaaacgagggcgatcaatg
ttagtacaaacccgctcacccgaaaggagggcaaatacct
agcaaggttcagatatacagccaggggagacctataactc
gtccacgtgcgtatgtactaattgtggagagcaaatcatt
…

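One rough way to see how L, N, I and W interact is to count how many motif-like sites would turn up by chance alone. This is a back-of-envelope sketch, not a method from the lecture. It assumes random sequence and uses the rule of thumb that a motif with I bits matches about once per 2^I positions. The function name and the example values of W and I are hypothetical.

```python
def expected_chance_matches(n_seqs, avg_len, width, info_bits):
    """Approximate number of background windows that look like the motif."""
    # Each sequence offers (L - W + 1) windows of width W.
    windows = n_seqs * max(avg_len - width + 1, 0)
    # Assumption: each window matches by chance with probability ~ 2^-I.
    return windows / (2 ** info_bits)

# Illustrative values: 26 sequences of length 40, motif of width 8 with 10 bits.
chance = expected_chance_matches(n_seqs=26, avg_len=40, width=8, info_bits=10)
print(f"expected chance matches: {chance:.2f}")
```

Under these assumptions, longer or more numerous sequences raise the background, while higher information content lowers it, which is why low-information motifs are harder to find.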
How is the 5’ss recognized?

U1 snRNA  3’ …  CCAUUCAUAG-5’
                 ||||||
Pre-mRNA  5’ …UUCGUGAGU… 3’

RNA Energetics I

   3’ …CCAUUCAUAG-5’
        ||||||
   5’ …CGUGAGU… 3’

Free energy of helix formation derives from:
• base pairing: G-C > A-U > G-U
• base stacking:

     5' --> 3'          5' G p A 3'
       U X                 |   |
       A Y              3' C p U 5'
     3' <-- 5'

Doug Turner’s Energy Rules (kcal/mol), for the stack 5' UX 3' / 3' AY 5':

X \ Y     A       C       G       U
A         .       .       .     -1.30
C         .       .     -2.40     .
G         .     -2.10     .     -1.00
U       -0.90     .     -1.30     .

RNA Energetics II

A)  N p N p N p N p N p N p N
    x   |   |   |   |   x   x
    N p N p N p N p N p N p N
    Lots of consecutive base pairs - good

B)  N p N p N p N p N p N p N
    x   |   |   x   |   |   x
    N p N p N p N p N p N p N
    Internal loop - bad

C)  N p N p N p N p N p N p N
    x   |   |   |   x   |   x
    N p N p N p N p N p N p N
    Terminal base pair not stable - bad

Generally A will be more stable than B or C

Conditional Frequencies in 5’ss Sequences

(figure: 5’ss positions -1 1 2 3 4 5 6)

5’ss which have G at +5 (%)

Pos   -1   +3   +4   +6
A      9   44   75   14
C      4    3    4   18
G     78   51   13   19
T      9    3    9   49

5’ss which lack G at +5 (%)

Pos   -1   +3   +4   +6
A      2   81   51   22
C      1    3   28   20
G     97   15    9   30
T      0    2   12   28

Data from Burge, 1998, “Computational Methods in Molecular Biology”
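If 5’ss positions were independent, as a weight matrix model assumes, the two tables above would look alike. One simple way to see how much they differ is to compare the two conditional distributions at each position. The sketch below uses total variation distance, which is a choice made here for illustration and not a measure from the lecture. The frequencies are copied from the tables, and each column is renormalized because rounding leaves some columns summing to 101.

```python
POSITIONS = ["-1", "+3", "+4", "+6"]

# Percent frequencies from the slide, one row per nucleotide.
have_g5 = {"A": [9, 44, 75, 14], "C": [4, 3, 4, 18],
           "G": [78, 51, 13, 19], "T": [9, 3, 9, 49]}
lack_g5 = {"A": [2, 81, 51, 22], "C": [1, 3, 28, 20],
           "G": [97, 15, 9, 30], "T": [0, 2, 12, 28]}

def column(table, i):
    """Return column i of a table as a normalized distribution."""
    col = {nt: row[i] for nt, row in table.items()}
    total = sum(col.values())
    return {nt: value / total for nt, value in col.items()}

def total_variation(p, q):
    """Total variation distance: 0 for identical, 1 for disjoint distributions."""
    return 0.5 * sum(abs(p[nt] - q[nt]) for nt in p)

tv = {pos: total_variation(column(have_g5, i), column(lack_g5, i))
      for i, pos in enumerate(POSITIONS)}

for pos, dist in tv.items():
    print(f"position {pos}: distance {dist:.3f}")
```

All four distances are well above zero, which suggests that the nucleotide at +5 is informative about the other positions. A single weight matrix cannot capture that kind of dependence.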