Bioinformatics (《生物信息学》), 2nd edition, edited by Fan Longjiang (樊龙江), 2021. Companion slides 3-1
3. Analysis and alignment of sequences
• 3.1 Compositional bias in biological sequences
• 3.2 Alignment of pairs of sequences
• 3.3 Database searching for similar sequences
• 3.4 Multiple sequence alignment and domain finding
[Slide background: a long example stretch of genomic DNA sequence, including an extended run of a repeated TAT trinucleotide and a GC-rich region]
3.1 Compositional bias in biological sequences
• An obvious first summary of a DNA sequence is the distribution of the four base types.
• Almost all empirical studies show an unequal distribution of the four bases.
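The base distribution described above can be tallied in a few lines. The following is a minimal sketch, not a method taken from the slides: the function name `base_composition` and the toy sequence are illustrative assumptions, and characters other than A, C, G and T are simply ignored.

```python
from collections import Counter


def base_composition(seq):
    """Return the fraction of A, C, G and T among the unambiguous bases of seq."""
    # Assumption: anything other than A/C/G/T (e.g. N) is skipped, not counted.
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return {base: counts[base] / total for base in "ACGT"}


# Hypothetical toy sequence, not taken from the slides.
example = "ATTATTATTAGGCGGCGGCA"
composition = base_composition(example)
gc_content = composition["G"] + composition["C"]
print(composition, gc_content)
```

Under a uniform distribution every base would have a fraction of 0.25, so any deviation from that value in the returned dictionary is a first indication of compositional bias.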
Promoter sequences
Base content as a function of cDNA position, relative to the transcription start site (TSS), averaged over all cDNAs with a 10-bp sliding window.
[Figure: Rice. x-axis: cDNA coordinate (×100 bp), from -4 to 1, with the TSS marked; y-axis: base content, 0 to 0.7; curves: GC, A, T, G, C]
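A position-wise base-content profile of this kind can be sketched as follows. This is an assumption-laden illustration rather than the procedure used for the figure: it assumes the sequences have already been cut to the same length and aligned so that each index corresponds to the same offset from the TSS, and the name `sliding_base_content` and its parameters are hypothetical.

```python
def sliding_base_content(seqs, bases="GC", window=10, step=1):
    """Average fraction of `bases` in each window, pooled over all sequences.

    Assumption: every sequence in `seqs` is aligned to the same reference
    point (e.g. the TSS), so window k covers the same positions in each one.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = min(len(s) for s in seqs)  # only use positions present in all
    profile = []
    for start in range(0, length - window + 1, step):
        hits = 0
        for s in seqs:
            chunk = s[start:start + window].upper()
            hits += sum(chunk.count(b) for b in bases)
        profile.append(hits / (window * len(seqs)))
    return profile


# Hypothetical toy input: two aligned 20-bp sequences.
toy = ["AAAAAAAAAAGGGGGGGGGG", "TTTTTTTTTTCCCCCCCCCC"]
gc_profile = sliding_base_content(toy, bases="GC", window=10, step=1)
print(gc_profile)
```

Calling the function once per base (`bases="A"`, `"T"`, `"G"`, `"C"`) and once with `bases="GC"` gives the five curves of such a plot.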
[Figure: Arabidopsis. Base content in a 10-bp window; x-axis ticks from -4 to 1; y-axis: 0 to 0.5; curves: GC, A, T, G, C]
[Figure: Human. Base content in a 10-bp window; x-axis ticks from -4 to 1; y-axis: 0 to 0.7; curves: GC, A, T, G, C]
Three patterns of base contents
[Figure: the Rice, Arabidopsis and Human base-content panels shown together, with the TSS marked]
Neighboring bases are not independent
Example: dinucleotide frequencies in some vertebrate sequences. Based on 166 vertebrate sequences, totaling 136,731 bases (Nussinov, 1984).

Pair    Observed/Expected
TG      1.29
CT      1.26
CC      1.18
AG      1.16
AA      1.15
CA      1.15
GG      1.14
TT      1.07
GA      1.04
TC      1.00
GC      0.99
AT      0.85
AC      0.84
GT      0.82
TA      0.65
CG      0.42

P_uv ≠ P_u P_v: the frequency of a dinucleotide uv differs from the product of the frequencies of its two bases.
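The observed/expected ratio in the table can be computed as P_uv / (P_u P_v). The sketch below is one straightforward estimator and is not necessarily the exact procedure of Nussinov (1984): it counts overlapping dinucleotides on a single strand, assumes the input contains only A, C, G and T, and uses the hypothetical name `dinucleotide_odds`.

```python
from collections import Counter


def dinucleotide_odds(seq):
    """Return observed/expected ratios P_uv / (P_u * P_v) for each pair uv.

    Assumptions: seq contains only A, C, G, T; pairs are counted with
    overlap on one strand; pairs whose expected frequency is 0 are omitted.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    base_counts = Counter(seq)
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for u in "ACGT":
        for v in "ACGT":
            expected = (base_counts[u] / n) * (base_counts[v] / n)
            if expected > 0:
                observed = pair_counts[u + v] / (n - 1)
                ratios[u + v] = observed / expected
    return ratios


# Hypothetical toy sequence, far too short for a meaningful estimate.
ratios = dinucleotide_odds("TGTGCACATGCAAGCT")
for pair, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(pair, round(ratio, 2))
```

A ratio near 1 means the pair occurs about as often as independence predicts; values well below 1, such as CG in the table, indicate depletion.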
Observed/expected frequency* of adjacent base pairs (dinucleotides) in human and rice

Pair    Human   Rice
CC      1.27    1.05
GG      1.22    1.03
CA      1.20    1.11
TG      1.19    1.11
AG      1.18    0.99
CT      1.15    0.99
TT      1.13    1.13
AA      1.13    1.11
GC      1.02    1.11
GA      0.99    1.05
TC      0.96    1.00
AT      0.88    1.02
GT      0.84    0.84
AC      0.83    0.86
TA      0.75    0.77
CG      0.26    0.83

* Data are from the DNA sequences of all currently annotated genes of the two species, with total lengths of 168,717,208 and 1,506,657,427 bases (Qiu Jie (邱杰), 2016).
3.2 Alignment of pairs of sequences
• The most basic sequence analysis task is to ask whether two sequences are related.
• This is usually done by first aligning the sequences (or parts of them) and then deciding whether that alignment is more likely to have occurred because the sequences are related, or just by chance.
• Sequence alignment is the procedure of comparing two (pairwise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that occur in the same order in the sequences.
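To make the idea of pairwise alignment concrete, here is a minimal sketch of a standard dynamic-programming global alignment (Needleman-Wunsch style). The slides do not prescribe this algorithm or these scores at this point; the function name `global_align` and the values match = +1, mismatch = -1 and gap = -2 are assumptions chosen only for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b).

    Assumption: a simple linear gap penalty and a match/mismatch score,
    not a biologically calibrated substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical toy sequences.
best, row_a, row_b = global_align("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

The alignment score alone does not say whether two sequences are related. One common way to address the "by chance" question is to compare the score with scores obtained after shuffling one of the sequences, which is left out here for brevity.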
[Figure: screenshot of the NCBI BLAST home page. Web BLAST: Nucleotide BLAST, Protein BLAST, blastx, tblastn; BLAST Genomes; Standalone and API BLAST (download BLAST, use the BLAST API); Specialized searches: SmartBLAST, Primer-BLAST, Global Align, CD-search, VecScreen, Multiple Alignment, MOLE-BLAST and others]