7.91 / 7.36 / BE.490 Lecture #3, Mar. 2, 2004
DNA Motif Modeling & Discovery
Chris Burge
Review of DNA Seq. Comparison/Alignment
• Target frequencies and mismatch penalties
• Eukaryotic gene structure
• Comparative genomics applications:
  - Pipmaker (2 species comparison)
  - Phylogenetic Shadowing (many species)
• Intro to DNA sequence motifs
Organization of Topics
Lecture | Model                           | Dependence Structure
3/2     | Weight Matrix Model             | Independence
3/4     | Hidden Markov Model             | Local Dependence
3/9     | Energy Model, Covariation Model | Non-local Dependence
[Figure column "Object": an example object for each model; images not reproduced]
DNA Motif Modeling & Discovery
• Review - WMMs for splice sites
• Information Content of a Motif
• The Motif Finding/Discovery Problem
• The Gibbs Sampler
  - The Gibbs Sampling Algorithm Multimedia Experience
• Motif Modeling - Beyond Weight Matrices
See Ch. 4 of Mount
Splicing Model I
[Figure: sequence motifs at the 5’ splice site, branch site, and 3’ splice site]
Weight Matrix Models II
5’ splice signal
Con   C    A    G    …   G    T
Pos   -3   -2   -1   …   +5   +6
A     0.3  0.6  0.1  …   0.1  0.1
C     0.4  0.1  0.0  …   0.1  0.2
G     0.2  0.2  0.8  …   0.8  0.2
T     0.1  0.1  0.1  …   0.0  0.5

Background
Pos   Generic
A     0.25
C     0.25
G     0.25
T     0.25

$$S = S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9$$
Odds Ratio:
$$R = \frac{P(S \mid +)}{P(S \mid -)} = \frac{P_{-3}(S_1)\,P_{-2}(S_2)\,P_{-1}(S_3) \cdots P_{5}(S_8)\,P_{6}(S_9)}{P_{bg}(S_1)\,P_{bg}(S_2)\,P_{bg}(S_3) \cdots P_{bg}(S_8)\,P_{bg}(S_9)}$$
Background model homogeneous, assumes independence
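The odds ratio can be sketched in a few lines of Python. This is only an illustration, assuming a 3-column toy matrix built from the three exon-side columns (-3, -2, -1) listed on the slide; the columns elided on the slide are not filled in, and the names `WMM`, `BACKGROUND`, and `odds_ratio` are hypothetical.

```python
# Toy weight matrix: only the -3, -2, -1 columns listed on the slide.
# Assumption: a 3-column matrix stands in for the full 9-column one.
WMM = [
    {"A": 0.3, "C": 0.4, "G": 0.2, "T": 0.1},  # position -3
    {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},  # position -2
    {"A": 0.1, "C": 0.0, "G": 0.8, "T": 0.1},  # position -1
]
# Homogeneous background: the same frequencies at every position.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def odds_ratio(site, wmm=WMM, background=BACKGROUND):
    """Return R = P(S|+) / P(S|-), treating positions as independent."""
    if len(site) != len(wmm):
        raise ValueError("site length must equal the number of matrix columns")
    ratio = 1.0
    for column, base in zip(wmm, site.upper()):
        # Each position contributes one signal/background factor.
        ratio *= column[base] / background[base]
    return ratio


print(odds_ratio("CAG"))  # consensus bases at -3, -2, -1
print(odds_ratio("TTT"))  # a poor match to the signal
```

Under this toy matrix the consensus triplet CAG gets an odds ratio well above 1 and TTT gets one well below 1, which is the behavior the ratio is meant to capture.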
Weight Matrix Models III
$$S = S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9$$
Odds Ratio:
$$R = \frac{P(S \mid +)}{P(S \mid -)} = \frac{P_{-3}(S_1)\,P_{-2}(S_2)\,P_{-1}(S_3) \cdots P_{5}(S_8)\,P_{6}(S_9)}{P_{bg}(S_1)\,P_{bg}(S_2)\,P_{bg}(S_3) \cdots P_{bg}(S_8)\,P_{bg}(S_9)} = \prod_{k=1}^{9} \frac{P_{-4+k}(S_k)}{P_{bg}(S_k)}$$
(Here the subscript $-4+k$ denotes the $k$-th column of the matrix; the site labels run $-3, -2, -1, +1, \dots, +6$, with no position 0.)
Score:
$$s = \log_2 R = \sum_{k=1}^{9} \log_2 \frac{P_{-4+k}(S_k)}{P_{bg}(S_k)}$$
Neyman-Pearson Lemma: Optimal decision rules are of the form $R > C$.
Equiv.: $\log_2(R) > C'$, because log is a monotone function
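The log-odds score and the cutoff rule might look like this in Python. As a sketch, it assumes a 3-column toy matrix taken from the -3, -2, -1 columns given on the earlier slide (the elided columns are left out), and it treats a zero matrix entry as a score of negative infinity, which is one possible convention rather than anything the slides prescribe. The names `log_odds_score` and `is_predicted_site` are hypothetical.

```python
import math

# Toy matrix: the -3, -2, -1 columns from the slide (assumption: 3 columns only).
WMM = [
    {"A": 0.3, "C": 0.4, "G": 0.2, "T": 0.1},
    {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.0, "G": 0.8, "T": 0.1},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_score(site, wmm=WMM, background=BACKGROUND):
    """Return s = log2(R) in bits as a sum of per-position log ratios."""
    score = 0.0
    for column, base in zip(wmm, site.upper()):
        p = column[base]
        if p == 0.0:
            # log2(0) is undefined; this sketch reports -infinity instead.
            return float("-inf")
        score += math.log2(p / background[base])
    return score


def is_predicted_site(site, cutoff_bits):
    """Decision rule log2(R) > C', equivalent to R > C with C = 2**C'."""
    return log_odds_score(site) > cutoff_bits


print(log_odds_score("CAG"))
print(is_predicted_site("CAG", cutoff_bits=3.0))
print(is_predicted_site("TTT", cutoff_bits=3.0))
```

Working in log space turns the product into a sum, so each position's contribution can be stored once as a table of bit scores and simply added up.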
Weight Matrix Models IV
Slide WMM along sequence:
ttgacctagatgagatgtcgttcacttttactgagctacagaaaa ……
Assign score to each 9 base window.
Use score cutoff to predict potential 5’ splice sites
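The sliding-window scan can be sketched as follows. For illustration it assumes a 3-base window, because only the -3, -2, -1 columns of the matrix are given in full on the earlier slide; a real 5’ splice site scan would use all 9 columns. The cutoff of 3.0 bits and the names `score_window` and `scan` are hypothetical.

```python
import math

# Toy matrix: the -3, -2, -1 columns from the slide (assumption: 3 columns only).
WMM = [
    {"A": 0.3, "C": 0.4, "G": 0.2, "T": 0.1},
    {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.0, "G": 0.8, "T": 0.1},
]
BG = 0.25  # uniform background frequency for every base


def score_window(window):
    """Log-odds score (bits) of one window; -inf if any factor is zero."""
    total = 0.0
    for column, base in zip(WMM, window):
        p = column[base]
        if p == 0.0:
            return float("-inf")
        total += math.log2(p / BG)
    return total


def scan(sequence, cutoff_bits):
    """Score every window and return (start, window, score) above the cutoff."""
    seq = sequence.upper()
    width = len(WMM)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = score_window(window)
        if score > cutoff_bits:
            hits.append((start, window, score))
    return hits


# Sequence fragment shown on the slide.
SEQ = "ttgacctagatgagatgtcgttcacttttactgagctacagaaaa"
for start, window, score in scan(SEQ, cutoff_bits=3.0):
    print(start, window, round(score, 2))
```

A sequence of length n yields n - w + 1 windows of width w, so the scan is linear in sequence length for a fixed matrix.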
Histogram of 5’ss Scores
[Figure: histograms of scores for true 5’ splice sites and “decoy” 5’ splice sites; x-axis: Score (1/10 bit units)]
Measuring Accuracy:
Sensitivity = % of true sites w/ score > cutoff
Specificity = % of sites w/ score > cutoff that are true sites
Sn: 20% 50% 90%
Sp: 50% 32% 7%
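The two accuracy measures can be computed from lists of scores as in the sketch below. The score values are made up for illustration and are not data from the lecture, and the function name `accuracy_at_cutoff` is hypothetical. Note that "specificity" as defined on this slide is the fraction of predicted sites that are true, a quantity often called positive predictive value or precision elsewhere.

```python
def accuracy_at_cutoff(true_scores, decoy_scores, cutoff):
    """Return (sensitivity, specificity) using the slide's definitions."""
    true_pos = sum(1 for s in true_scores if s > cutoff)    # true sites called
    false_pos = sum(1 for s in decoy_scores if s > cutoff)  # decoys called
    sensitivity = true_pos / len(true_scores)
    predicted = true_pos + false_pos
    # Undefined when nothing passes the cutoff; report NaN in that case.
    specificity = true_pos / predicted if predicted else float("nan")
    return sensitivity, specificity


# Hypothetical scores, for illustration only.
true_scores = [9.0, 8.0, 7.0, 3.0]
decoy_scores = [8.0, 7.5, 2.0, 1.0, 0.5]

for cutoff in (6.0, 8.5):
    sn, sp = accuracy_at_cutoff(true_scores, decoy_scores, cutoff)
    print(f"cutoff={cutoff}: Sn={sn:.2f} Sp={sp:.2f}")
```

Raising the cutoff in this toy example lowers sensitivity and raises specificity, the same trade-off seen in the Sn/Sp values on the slide.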
What does this result tell us?
A) Splicing machinery also uses other information besides 5’ss motif to identify splice sites; OR
B) WMM model does not accurately capture some aspects of the 5’ss that are used in recognition
(or both)
This is a pretty common situation in biology