《生物信息学》(第二版)(樊龙江主编,2021)配套PPT3-3 3.4 Multiple sequence alignment and domain finding (1) Multiple sequence alignment and progressive global alignment(Clustal W) (2) Find and model local multiple alignment (3)How to evaluate the quality of a PSSM?
• (1) Multiple sequence alignment and progressive global alignment (ClustalW) • (2) Find and model local multiple alignment • (3) How to evaluate the quality of a PSSM? 3.4 Multiple sequence alignment and domain finding 《生物信息学》(第二版)(樊龙江主编,2021)配套PPT3-3
数据库 DNA/RNA 保守功能位点/元件 (Rfam、Dfam等) 联配多条序列 信息量/熵 概型 蛋白质 功能域 数据库 HMM ( PROSITE、Pfam等) 正则表达式
(1)Multiple sequence alignment and progressive global alignment(ClustalW) Why produce a multiple sequence alignment? Conserved regions/domains are likely to represent regions that are essential for structure and function core of proteins A multiple sequence alignment is a starting point for an evolutionary(phylogenetic)analysis Using more than two sequences results in a more convincing alignment by revealing conserved regions in all of the sequences
(1) Multiple sequence alignment and progressive global alignment (ClustalW) • Conserved regions/domains are likely to represent regions that are essential for structure and function - core of proteins • A multiple sequence alignment is a starting point for an evolutionary (phylogenetic) analysis • Using more than two sequences results in a more convincing alignment by revealing conserved regions in all of the sequences Why produce a multiple sequence alignment?
Types of multiple sequence alignment Global alignment in which entire sequences are aligned at the same time using extension of dy ynamic programming Local alignment in which conserved local regions derived by removing stretches of global alignment found by statistical methods
Types of multiple sequence alignment • Global alignment in which entire sequences are aligned at the same time using extension of dynamic programming • Local alignment in which conserved local regions • derived by removing stretches of global alignment • found by statistical methods
EXample of local msa Rest of Rest of proteins do not align we∥ot proteins do align well Domain that aligns well To find an identifiable common AGGCTT usually longest, pattern with AAGCTA 2 some degree of variability. No agactt 3 gaps are shown in this example AAACTA/ 4 but they can be accomodated
Example of local msa AGGCTT AAGCTA AGACTT AAACTA 1 2 3 4 Domain that aligns well Rest of proteins do not align well Rest of proteins do not align well To find an identifiable common, usually longest, pattern with some degree of variability. No gaps are shown in this example but they can be accomodated
Challenges Finding an optimal alignment of more than two sequences that includes matches mismatches, and gaps, and that takes into account the degree of variation in all of the sequences at the same time poses a very difficult challenge A second computational challenge is identifying a reasonable method of obtaining a cumulative score for the substitutions in the column ot an msa
• Finding an optimal alignment of more than two sequences that includes matches, mismatches, and gaps, and that takes into account the degree of variation in all of the sequences at the same time poses a very difficult challenge. • A second computational challenge is identifying a reasonable method of obtaining a cumulative score for the substitutions in the column of an msa. Challenges
Multiple sequence alignment is computational complex Suppose one tries to align three sequences by extending the method of aligning 2 sequences to a 3 dimensional scoring matrix Sequence 1 ■■■■■■■■■■■■ Problems . Time and space needed is length of seq raised to power of no of sequences 8c8 ? Optimal score in 3 Can do for three sequences dimensions but not more than three
Multiple sequence alignment is computational complex Suppose one tries to align three sequences by extending the method of aligning 2 sequences to a 3 dimensional scoring matrix. Sequence 1 Sequence 2 Y W W ? Optimal score in 3 dimensions Problems: •Time and space needed is length of seq. raised to power of no. of sequences •Can do for three sequences but not more than three
Alignment of three sequences by dynamic programming For three protein sequences each 300 amino acids in length and excluding gaps, the number of comparisons to be made by dynamic programming is equal to 3003=2.7 X 107, whereas only 3002=9 X104 is required for two sequences of this length (The number of steps and memory required for N M-amino-acid sequences: Mv Carrillo and Lipman(1988) found a way(the sum of pairs, sP method, the msa program)to reduce the number of comparisons that must be made without compromising the attempt to find an optimal alignment
Alignment of three sequences by dynamic programming • For three protein sequences each 300 amino acids in length and excluding gaps, the number of comparisons to be made by dynamic programming is equal to 3003 = 2.7 ×107 , whereas only 3002 = 9 ×104 is required for two sequences of this length. (The number of steps and memory required for N M-amino-acid sequences: MN) • Carrillo and Lipman (1988) found a way (the sum of pairs, SP method, the MSA program) to reduce the number of comparisons that must be made without compromising the attempt to find an optimal alignment
Basic idea of msa program sum of pairs(SP)method B A-C sequence A
Basic idea of MSA program: sum of pairs (SP) method
Thus, approximate methods are used, including (1)a progressive global alignment of the sequences starting with an alignment of the most alike sequences and then building an alignment by adding more sequences; (2)iterative methods that make an initial alignment of groups of sequences and then revise the alignment to achieve a more reasonable result (3)alignments based on locally conserved patterns found in the same order in the sequences (4)use of statistical methods and probabilistic models of the sequences
Thus, approximate methods are used, including: (1) a progressive global alignment of the sequences starting with an alignment of the most alike sequences and then building an alignment by adding more sequences; (2) iterative methods that make an initial alignment of groups of sequences and then revise the alignment to achieve a more reasonable result; (3) alignments based on locally conserved patterns found in the same order in the sequences; (4) use of statistical methods and probabilistic models of the sequences