www.cab.zju.edu.cn/cab/xueyuanxiashubumen/nx/bioinplant.htm
Notes on Bioinformatics (《生物信息学札记》), Fan Longjiang (樊龙江)

Appendix: Major English Bioinformatics Terms and Definitions

Abstract Syntax Notation (ASN.1)(NCBI发展的许多程序,如显示蛋白质三维结构的Cn3D等所使用的内部格式)
A language that is used to describe structured data types formally. Within bioinformatics, it has been used by the National Center for Biotechnology Information to encode sequences, maps, taxonomic information, molecular structures, and bibliographic information in such a way that it can be easily accessed and exchanged by computer software.

Accession number(记录号)
A unique identifier that is assigned to a single database entry for a DNA or protein sequence.

Affine gap penalty(一种设置空位罚分策略)
A gap penalty score that is a linear function of gap length, consisting of a gap opening penalty and a gap extension penalty multiplied by the length of the gap. Using this penalty scheme greatly enhances the performance of dynamic programming methods for sequence alignment. See also Gap penalty.

Algorithm(算法)
A systematic procedure for solving a problem in a finite number of steps, typically involving a repetition of operations. Once specified, an algorithm can be written in a computer language and run as a program.

Alignment(联配/比对/联配)
Refers to the procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Of the two types of alignment, local and global, a local alignment is generally the more useful. See also Local and Global alignments.

Alignment score(联配/比对/联配值)
An algorithmically computed score based on the number of matches, substitutions, insertions, and deletions (gaps) within an alignment. Scores for matches and substitutions are derived from a scoring matrix such as the BLOSUM and PAM matrices for proteins, and affine gap penalties suitable for the matrix are chosen. Alignment scores are in log odds units, often bit units (log to the base 2). Higher scores denote better alignments. See also Similarity score, Distance in sequence analysis.

Alphabet(字母表)
The total number of symbols in a sequence: 4 for DNA sequences and 20 for protein sequences.

Annotation(注释)
The prediction of genes in a genome, including the location of protein-encoding genes, the sequence of the encoded proteins, any significant
matches to other proteins of known function, and the location of RNA-encoding genes. Predictions are based on gene models; e.g., hidden Markov models of introns and exons in protein-encoding genes, and models of secondary structure in RNA.

Anonymous FTP(匿名FTP)
When an FTP service allows anyone to log in, it is said to provide anonymous FTP service. A user can log in to an anonymous FTP server by typing anonymous as the user name and an e-mail address as the password. Most Web browsers now negotiate anonymous FTP logon without asking the user for a user name and password. See also FTP.

ASCII
The American Standard Code for Information Interchange (ASCII) encodes unaccented letters a-z, A-Z, the numbers 0-9, most punctuation marks, space, and a set of control characters such as carriage return and tab. ASCII specifies 128 characters that are mapped to the values 0-127. ASCII files are commonly called plain text, meaning that they only encode text without extra markup.

BAC clone(细菌人工染色体克隆)
Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100-200 kb. Most of the large-insert clones sequenced in the project were BAC clones.

Back-propagation(反向传输)
When training feed-forward neural networks, a back-propagation algorithm can be used to modify the network weights. After each training input pattern is fed through the network, the network's output is compared with the desired output and the amount of error is calculated. This error is back-propagated through the network by using an error function to correct the network weights. See also Feed-forward neural network.

Baum-Welch algorithm(Baum-Welch算法)
An expectation maximization algorithm that is used to train hidden Markov models.

Bayes' rule(贝叶斯法则)
Forms the basis of conditional probability by calculating the likelihood of an event occurring based on the history of the event and relevant background information. In terms of two parameters A and B, the theorem is stated in an equation: the conditional probability of A given B, P(A|B), is equal to the probability of A, P(A), times the conditional probability of B given A, P(B|A), divided by the probability of B, P(B). That is, P(A|B) = P(A) P(B|A) / P(B). P(A) is the historical or prior distribution value of A, P(B|A) is a new prediction for B for a particular value of A, and P(B) is the sum of the newly predicted values for B. P(A|B) is a posterior probability, representing a new prediction for A given the prior knowledge of A and the newly discovered relationships between A and B.

Bayesian analysis(贝叶斯分析)
A statistical procedure used to estimate parameters of an underlying
distribution based on an observed distribution. See also Bayes' rule.

Biochips(生物芯片)
Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.

Bioinformatics(生物信息学)
The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. Also, the discipline of obtaining information about genomic or protein sequence data. This may involve similarity searches of databases, comparing your unidentified sequence to the sequences in a database, or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publicly available through the Internet, or locally at your institution.

Bit score(二进制值/Bit值)
A value S' derived from the raw alignment score S, in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.

Bit units
From information theory, a bit denotes the amount of information required to distinguish between two equally likely possibilities. The number of bits of information, N, required to convey a message that has M possibilities is log2 M = N bits.

BLAST(基本局部联配搜索工具,一种主要数据库搜索程序)
Basic Local Alignment Search Tool. A set of programs used to perform fast similarity searches. Nucleotide sequences can be compared with nucleotide sequences in a database using BLASTN, for example. Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or related to, the query sequence. The BLASTP program is used to search a protein database for a match against a query protein sequence. There are several other flavours of BLAST. BLAST2 is a newer release of BLAST that allows for insertions or deletions in the sequences being aligned. Gapped alignments may be more biologically significant.

Block(蛋白质家族中保守区域的组块)
Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.

BLOSUM matrices(模块替换矩阵,一种主要替换矩阵)
An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments.

Boltzmann distribution(Boltzmann分布)
Describes the number of molecules that have energies above a certain level, based on the Boltzmann gas constant and the absolute temperature.
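
As a rough illustration of the Boltzmann distribution entry, the sketch below computes the Boltzmann weight exp(-E / (R T)) of a state and normalizes the weights of several states into probabilities. It assumes energies are given per mole in J/mol and uses the molar gas constant R; the function names and the example energies are hypothetical and are not taken from the glossary.

```python
import math

# Molar gas constant in J/(mol*K). Using per-mole energies is an assumption
# made for this sketch.
GAS_CONSTANT = 8.314462618


def boltzmann_weight(energy_j_per_mol, temperature_k):
    """Relative weight exp(-E / (R * T)) of a state with energy E."""
    return math.exp(-energy_j_per_mol / (GAS_CONSTANT * temperature_k))


def boltzmann_probabilities(energies_j_per_mol, temperature_k):
    """Normalize the Boltzmann weights of several states into probabilities."""
    weights = [boltzmann_weight(e, temperature_k) for e in energies_j_per_mol]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Three hypothetical states at 310.15 K (about 37 degrees Celsius).
    for p in boltzmann_probabilities([0.0, 2000.0, 5000.0], 310.15):
        print(round(p, 4))
```

Lower-energy states receive larger weights, and raising the temperature flattens the distribution toward equal probabilities.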
Boltzmann probability function(Boltzmann概率函数)
See Boltzmann distribution.

Bootstrap analysis
A method for testing how well a particular data set fits a model. For example, the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments. The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of an analysis.

Branch length(分支长度)
In sequence analysis, the number of sequence changes along a particular branch of a phylogenetic tree.

CDS or cds(编码序列)
Coding sequence.

Chebyshev's inequality
The probability that a random variable deviates from its mean by at least k standard deviations is less than or equal to 1/k^2, the square of 1 over the number of standard deviations from the mean.

Clone(克隆)
Population of identical cells or molecules (e.g., DNA), derived from a single ancestor.

Cloning vector(克隆载体)
A molecule that carries a foreign gene into a host, and allows or facilitates the multiplication of that gene in a host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are example types of cloning vectors.

Cluster analysis(聚类分析)
A method for grouping together a set of objects that are most similar from a larger group of related objects. The relationships are based on some criterion of similarity or difference. For sequences, a similarity or distance score or a statistical evaluation of those scores is used.

Cobbler
A single sequence that represents the most conserved regions in a multiple sequence alignment. The BLOCKS server uses the cobbler sequence to perform a database similarity search as a way to reach sequences that are more divergent than would be found using the single sequences in the alignment for searches.

Coding system (neural networks)
Regarding neural networks, a coding system needs to be designed for representing input and output. The level of success found when training the model will be partially dependent on the quality of the coding system chosen.

Codon usage
Analysis of the codons used in a particular gene or organism.

COG(直系同源簇)
Clusters of orthologous groups: a set of groups of related sequences in microorganisms and yeast (S. cerevisiae). These groups are found by whole proteome comparisons and include orthologs and paralogs. See also Orthologs and Paralogs.

Comparative genomics(比较基因组学)
A comparison of gene numbers, gene locations, and biological functions of genes in the genomes of diverse organisms, one objective being to identify groups of genes that play a unique biological role in a particular organism.

Complexity (of an algorithm)(算法的复杂性)
Describes the number of steps required by the algorithm to solve a problem as a function of the amount of data; for example, the length of sequences to be aligned.

Conditional probability(条件概率)
The probability of a particular result (or of a particular value of a variable) given one or more events or conditions (or values of other variables).

Conservation(保守)
Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

Consensus(一致序列)
A single sequence that represents, at each successive position, the variation found within corresponding columns of a multiple sequence alignment.

Context-free grammars
A recursive set of production rules for generating patterns of strings. These consist of a set of terminal characters that are used to create strings, a set of nonterminal symbols that correspond to rules and act as placeholders for patterns that can be generated using terminal characters, a set of rules for replacing nonterminal symbols with terminal characters, and a start symbol.

Contig(序列重叠群/拼接序列)
A set of clones that can be assembled into a linear order. A DNA sequence that overlaps with another contig. The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay. Important in genetic mapping at the molecular level.

CORBA(国际对象管理协作组制定的使OOP对象与网络接口统一起来的一套跨计算机、操作系统、程序语言和网络的共同标准)
The Common Object Request Broker Architecture (CORBA) is an open industry standard for working with distributed objects, developed by the Object Management Group. CORBA allows the interconnection of objects and applications regardless of computer language, machine architecture, or geographic location of the computers.

Correlation coefficient(相关系数)
A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value away from zero indicates the strength of the relationship. A value near zero indicates no linear relationship between the variables.

Covariation (in sequences)(共变)
Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.

Coverage (or depth)(覆盖率/厚度)
The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a "high-quality base" is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).

Database(数据库)
A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.

Dendrogram
A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.

Depth(厚度)
See Coverage.

Dirichlet mixtures
Defined as the conjugate prior of a multinomial distribution. One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).

Distance in sequence analysis(序列距离)
The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.

DNA sequencing(DNA测序)
The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its own advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes, the sequences may be generated more quickly than they can be characterised.

Domain(功能域)
A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
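
To make the Coverage (or depth) entry above concrete, the following sketch converts between a PHRED quality score Q and the per-base error probability using the standard relation Q = -10 log10(P_error), so that Q = 20 corresponds to 99% accuracy as the entry states. It also computes mean coverage as high-quality bases divided by target length. The function names and the example numbers are hypothetical.

```python
import math


def phred_to_error_probability(q):
    """Per-base error probability for a PHRED score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p_error):
    """PHRED score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)


def mean_coverage(high_quality_bases, target_length):
    """Average number of high-quality bases covering each target position."""
    return high_quality_bases / target_length


if __name__ == "__main__":
    accuracy = 1 - phred_to_error_probability(20)
    print(f"Accuracy at Q20: {accuracy:.2%}")
    # Hypothetical example: 500,000 high-quality bases over a 100 kb clone.
    print(f"Mean coverage: {mean_coverage(500_000, 100_000):.1f}x")
```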
Dot matrix (点标矩阵图)
Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. A dot is placed in the graph wherever the same letter appears in both sequences. A series of diagonal lines in the graph indicates regions of alignment. The matrix may be filtered to reveal the most-alike regions by requiring a minimal threshold number of matches within a sequence window.

Draft genome sequence (基因组序列草图)
The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.

DUST (一种低复杂性区段过滤程序)
A program for filtering low-complexity regions from nucleic acid sequences.

Dynamic programming (动态规划法)
A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. The method is used for producing sequence alignments, given a scoring system for sequence comparisons.

EMBL (欧洲分子生物学实验室,EMBL数据库是主要公共核酸序列数据库之一)
European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public sequence databases.

EMBnet (欧洲分子生物学网络)
European Molecular Biology Network (http://www.embnet.org/). Established in 1988, it provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including ExPASy.

Entropy (熵)
From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.
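As an illustration of the Entropy entry, the following is a minimal sketch that computes the Shannon entropy, in bits, of the symbols in a sequence or alignment column. The function name `shannon_entropy` is hypothetical, and the glossary itself does not prescribe this particular formula; it is the standard information-theory definition.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the Shannon entropy, in bits, of the symbols in `symbols`.

    Hypothetical helper for illustration: H = sum over symbols of p * log2(1/p),
    where p is the observed frequency of each distinct symbol.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty set carries no uncertainty
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# A column with one residue has no variation; a column in which all four
# nucleotides are equally frequent has the maximum entropy for DNA.
print(shannon_entropy("AAAA"))
print(shannon_entropy("ACGT"))
```

A set with a single repeated symbol gives zero entropy, and entropy rises as the symbols become more evenly distributed, which matches the statement that more variation means higher entropy.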
Erdős and Rényi law
In a series of tosses of a fair coin, the longest run of heads that can be expected is approximately the logarithm to the base 2 of the number of tosses. The law may be generalized to more than two equally likely outcomes by changing the base of the logarithm to the number of outcomes. This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.

EST (表达序列标签的缩写)
See Expressed Sequence Tag.

Expect value (E) (E值)
E value. The number of different alignments with scores equivalent to or better than a given score S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, E is the number of alignments scoring at least as well as the one found between the query sequence and a database sequence that would be expected in as many comparisons between random sequences as were made to find the matching sequence. In other types of sequence analysis, E has a similar meaning.

Expectation maximization (sequence analysis)
An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment. This pattern is then matched to each sequence, and the scoring matrix values are updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.

Exon (外显子)
Coding region of DNA. See CDS.

Expressed Sequence Tag (EST) (表达序列标签)
A randomly selected, partial cDNA sequence that represents its corresponding mRNA. dbEST is a large database of ESTs at GenBank, NCBI.

FASTA (一种主要数据库搜索程序)
The first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later, the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and are controlled by the "k-tup" variable, which specifies the size of a "word". (Pearson and Lipman)

Extreme value distribution (极值分布)
Some measurements are found to follow a distribution that has a long tail, which decays at high values much more slowly than the tail of a normal distribution. This slow-falling type is called the extreme value distribution. The alignment scores between unrelated or random sequences are an example. These scores can reach very high values, particularly when a large number of comparisons are made, as in a database similarity search. The probability of a particular score may be accurately predicted by the extreme value distribution, which follows a double negative exponential function described by Gumbel.

False negative (假阴性)
A negative data point in a data set that was reported incorrectly because of a failure of the test. If the test had correctly measured the data point, the data would have been recorded as positive.
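As an illustration of the Expect value and Extreme value distribution entries above, the following sketch uses the Karlin-Altschul form E = K·m·n·e^(−λS), which is the usual way the extreme value distribution is applied to local alignment scores but is not given in this glossary. The function names are hypothetical, and the values of `lam` and `k` below are illustrative assumptions only; real values depend on the scoring matrix and gap penalties in use.

```python
import math


def expect_value(score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least `score`.

    Assumes the Karlin-Altschul form E = K * m * n * exp(-lambda * S),
    where m and n are the query and database lengths.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance alignment, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Illustrative parameters only (assumed, not taken from the glossary).
LAM, K = 0.318, 0.134
for s in (40, 60, 80):
    e = expect_value(s, query_len=250, db_len=10**8, lam=LAM, k=K)
    print(s, e, p_value(e))
```

Under these assumptions E falls exponentially as the score rises, and for small E the probability P is nearly equal to E, which is why the two are often quoted interchangeably for significant hits.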
False positive (假阳性)
A positive data point in a data set that was reported incorrectly because of a failure of the test. If the test had correctly measured the data point, the data would have been recorded as negative.

Feed-forward neural network (反向传输神经网络)
Organizes nodes into sequential layers in which the nodes in each layer are fully connected with the nodes in the next layer, except for the final output layer. Input is fed from the input layer through the layers in sequence in a "feed-forward" direction, resulting in output at the final layer. See also Neural network.

Filtering (window size)
During pair-wise sequence alignment using the dot matrix method, random matches can be filtered out by using a sliding window to compare the two sequences. Rather than comparing a single sequence position at a time, a window of adjacent positions in the two sequences is compared, and a dot, indicating a match, is generated only if a certain minimal number of matches occurs.

Filtering (过滤)
Also known as masking. The process of hiding regions of a nucleic acid or amino acid sequence that have characteristics frequently leading to spurious high scores. See SEG and DUST.

Finished sequence (完成序列)
Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.

Fourier analysis
Studies the approximation and decomposition of functions using trigonometric polynomials.

Format (file) (格式)
Different programs require that information be specified to them in a formal manner, using particular keywords and ordering. This specification is a file format.

Forward-backward algorithm
Used to train a hidden Markov model by aligning the model with training sequences. The algorithm then refines the model, using a gradient descent approach, to reduce the error when the model is fitted to the given data.
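As an illustration of the Filtering (window size) entry above, the following is a minimal sketch of a dot matrix comparison with a sliding window. The function name `dot_matrix` and its parameters are hypothetical, and placing the dot at the start position of each window is an assumption; some programs place it at the window center instead.

```python
def dot_matrix(seq_a, seq_b, window=1, min_matches=1):
    """Return the set of (i, j) positions that receive a dot.

    A dot is recorded at (i, j) when the window of `window` adjacent
    positions starting at i in seq_a and at j in seq_b contains at least
    `min_matches` identical characters. With window=1 and min_matches=1
    this reduces to the unfiltered dot matrix.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


# Unfiltered comparison versus a window of 3 requiring 3 matches.
print(sorted(dot_matrix("ACGT", "TACG")))
print(sorted(dot_matrix("ACGT", "TACG", window=3, min_matches=3)))
```

Raising `window` and `min_matches` removes isolated chance matches, so only dots that belong to longer diagonal runs remain.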
FTP (File Transfer Protocol) (文件传输协议)
Allows a person to transfer files from one computer to another across a network using an FTP-capable client program. The FTP client program can only communicate with machines that run an FTP server. The server, in turn, will make a specific portion of its file system available for FTP access, provided that the client is able to supply a recognized user name and password to the server.

Full shotgun clone (鸟枪法克隆)
A large-insert clone for which full shotgun sequence has been produced.
Functional genomics (功能基因组学)
Assessment of the function of genes identified by between-genome comparisons. The function of a newly identified gene is tested by introducing mutations into the gene and then examining the resultant mutant organism for an altered phenotype.

Gap (空位/间隙/缺口)
A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.

Gap penalty (空位罚分)
A numeric score used in sequence alignment programs to penalize the presence of gaps within an alignment. The value of a gap penalty affects how often gaps appear in alignments produced by the algorithm. Most alignment programs suggest gap penalties that are appropriate for particular scoring matrices.

Genetic algorithm (遗传算法)
A kind of search algorithm inspired by the principles of evolution. A population of initial solutions is encoded, and the algorithm searches through these by applying a pre-defined fitness measurement to each solution and selecting those with the highest fitness for reproduction. New solutions can be generated during this phase by crossover and mutation operations defined on the encoded solutions.

Genetic map (遗传图谱)
A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is the centimorgan (cM), which denotes a 1% chance of recombination.

Genome (基因组)
The genetic material of an organism, contained in one haploid set of chromosomes.

Gibbs sampling method
An algorithm for finding conserved patterns within a set of related sequences. A guessed alignment of all but one sequence is made and used to generate a scoring matrix that represents the alignment. The matrix is then matched to the left-out sequence, and a probable location of the corresponding pattern is found. This prediction is then input into a new alignment, and another scoring matrix is produced and tested on a new left-out sequence. The process is repeated until there is no further improvement in the matrix.

Global alignment (整体联配)
Attempts to match as many characters as possible, from end to end, in a set of two or