7. 91-Lecture #4 Michael Yaffe Database Searching Molecular Phylogenetics ABCD ABc ((A,B)c)D)
7.91 – Lecture #4 Database Searching & Molecular Phylogenetics A A B B C D D (((A,B)C)D) C Michael Yaffe
Outline FASTA, Blast searching, Smith-Waterman ·Psi- Blast Review of genomic dNA structure Substitution patterns and mutation rates Synonymous and non -synonymous substitutions · Jukes- Cantor model Kimura's Two-Parameter Model Molecular clocks Phylogenetic Trees-rooted and unrooted Distance matrix Methods Neighbor-Joining Method and Related Neighbor Methods · Maximum likelihood
Outline • FASTA, Blast searching, Smith-Waterman • Psi-Blast • Review of Genomic DNA structure • Substitution patterns and mutation rates • Synonymous and non-Synonymous substitutions • Jukes-Cantor Model • Kimura’s Two-Parameter Model • Molecular Clocks • Phylogenetic Trees – rooted and unrooted • Distance Matrix Methods • Neighbor-Joining Method and Related Neighbor Methods • Maximum Likelihood
Outline( cont) · Parsimony Branch and bound Heuristic Seaching · Consensus trees Software(PHYLIP, PAUP The tree of life Reading: Mount,p.237-280,283286,291-308
Outline (cont) • Parsimony Branch and Bound Heuristic Seaching • Consensus Trees • Software (PHYLIP, PAUP) • The Tree of Life Reading: Mount, p. 237-280, 283-286, 291-308
Database searching Problem is simple: I want to find homologues to my protein in the database How do i do it Do the obvious - compare my protein against every other protein in the database and look for local alignments by dynamic programming Uh oh! n For k sequences in the 12345678 Database 12345678. this becomes an o(mnk) 12345678 problem 12345678 12345678 12345678 essentially an o(mn) problem
Database Searching Problem is simple: I want to find homologues to my protein in the database How do I do it? Do the obvious – compare my protein against every other protein in the database and look for local alignments by dynamic programming Uh Oh! 1 n For k sequences in the 1 12345678…. Database 12345678…. this becomes an O(mnk) 12345678…. problem! 12345678…. 12345678…. m 12345678…. ….essentially an O(mn) problem
Database Searching Still, this can be done -- 50x slower than blast/FASTA, Smith-Waterman algorithm SSEARCH (tp. virginia. edu/pub/fasta)-do it locally But in the old days, needed a faster method 2 approaches- Blast, FASTA -both heuristic l.e. tried and true)-almost always finds related Proteins but cannot guarantee optimal solution FASTA: Basic Idea Search for matching sequence patterns or words Called k-tuples, which are exact matches of"k"characters between the two sequences i.e. RW= 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE
Database Searching Still, this can be done - ~ 50x slower than Blast/FASTA, Smith-Waterman algorithm… SSEARCH (ftp.virginia.edu/pub/fasta) – do it locally! But in the old days, needed a faster method… 2 approaches – Blast, FASTA – both heuristic (i.e. tried and true) – almost always finds related Proteins but cannot guarantee optimal solution FASTA: Basic Idea 1- Search for matching sequence patterns or words Called k-tuples, which are exact matches of “k” characters between the two sequences i.e. RW = 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE
Database searching FASTA: Basic Idea 2- Repeat for all possible k-tuples i.e. cV=2-tuple Seq 1: AHFYRWNKLCV Seg 2: DRWNLFCVATYWE 3-Make a Hash Table Hashing) that has the position of each k-tuple in each sequence 2-tuple pos in Seg1 pos in Seg2 offset (pos1-p0s2 RW 5 2 CV 10 AH
Database Searching FASTA: Basic Idea 2- Repeat for all possible k-tuples i.e. CV = 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE 3- Make a Hash Table (Hashing) that has the position of each k-tuple in each sequence 2-tuple pos. in Seq1 RW 5 CV 10 AH 1 i.e. pos in Seq 2 Offset (pos1-pos2) 2 3 7 3 ---- ----
Database searching Seq 1: AHFYRWNKLEV Seg 2: DRNLFCVATYWE 3.Make a Hash Table HAshing)that has the position of each k-tuple heach sequence e 2-tuple pos in Seq1 pos in Seg ofset(pos1-pos2 RW CV 10 3 AH 4-Look for words(k-tuples) with same offset These are in-phase and reveal a region of alignment between the two sequences 5-Build a local alignment based on these, extend it outwards Seg 1: AHFYRWNKLCV Seg 2: DRWNLFCVATYWE
Database Searching Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE 3- Make a Hash Table i.e. (Hashing) that has the position of each k-tuple in each sequence 2-tuple pos. in Seq1 pos in Seq 2 Offset (pos1-pos2) 3 3 RW 5 2 CV 10 7 AH 1 ---- ---- 4- Look for words (k-tuples) with same offset These are in-phase and reveal a region of alignment between the two sequences. 5- Build a local alignment based on these, extend it outwards Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE
Database searching With hashing, number of comparisons is proportional To the average sequence length i.e. an o(n) problem) Not an o(mn)problem as in dynamic programming Proteins-ktup=1-2 Nucleotides, ktup=4-6 One big problem- low complexity regions Seq 1: AHFYPPPPPPPPFSER Seq 2: DVATPPPPPPPPPPPNLFK
Database Searching With hashing, number of comparisons is proportional To the average sequence length (i.e. an O(n) problem), Not an O(mn) problem as in dynamic programming. Proteins – ktup = 1-2, Nucleotides, ktup=4-6 One big problem – low complexity regions. Seq 1: AHFYPPPPPPPPFSER Seq 2: DVATPPPPPPPPPPPNLFK
Database searching BLAST Same basic idea as fasta but faster and more sensitivel How? BLAST searches for common words or k-tuples, but limits the search for k-tuples that are most significant by using the log-odds values in the Blosum62 amino acid substitution matrix . e. look for WHK and might accept WHR but not HFK as a possible match(note 8000 possibilities) Repeat for all 3-tuples in the query Search the database for a match to the top 50 3-tuples that match the first query position in the sequence, the second query position, etc Use any match to seed an ungapped alignment (old BLasT)
Database Searching BLAST Same basic idea as FASTA, but faster and more sensitive! How? BLAST searches for common words or k-tuples, but limits the search for k-tuples that are most significant, by using the log-odds values in the Blosum62 amino acid substitution matrix i.e. look for WHK and might accept WHR but not HFK as a possible match (note 8000 possibilities) Repeat for all 3-tuples in the query Search the database for a match to the top 50 3-tuples that match the first query position in the sequence, the second query position, etc. Use any match to seed an ungapped alignment (old BLAST)
Database searching Word length is fixed: 3-tuple for proteins 11-tuple for nucleotides By default, filters out low complexity regions Determine if the alignment is statistically significant calculates the probability of observing a score greater than or equal to your alignment based on extreme value distribution Calculates an E-value expectation value This is the probability of finding an unrelated sequence that shows this good an alignment just by chance. Remember if p=.0001 and my database has 500,000 sequences, I will have an E=50!(normal starting E=10
Database Searching Word length is fixed: 3-tuple for proteins 11-tuple for nucleotides By default, filters out low complexity regions. Determine if the alignment is statistically significant. calculates the probability of observing a score greater than or equal to your alignment based on extreme value distribution. Calculates an E-value = expectation value: This is the probability of finding an unrelated sequence that shows this good an alignment just by chance. Remember if p=.0001 and my database has 500,000 sequences, I will have an E=50! (normal starting E=10)