Introduction
- Extraction note: PDF 文本存在部分单词粘连与换行断词;译文已按页渲染与上下文修正可读性。
Extracted source
Multiple Sequence Alignments
Fabian Sievers, Geoffrey J. Barton, and Desmond G. Higgins
Introduction
A multiple sequence alignment (MSA) is an arrangement of more than two amino acid or nucleotide sequences which are aligned so as to make the residues from the different sequences line up in vertical columns in some appropriate manner. These are used in a great variety of analyses and pipelines in proteome and genome analysis and are an essential initial step in most phylogenetic comparisons. They are widely used to help search for common features in sequences and can be used to help predict two- and three-dimensional structures of proteins and nucleic acids. An excellent review of MSA methods, uses, and abuses is provided by Chatzou et al. (2016).
Usually, one should only attempt to align sequences which are phylogenetically related and, therefore, homologous. In this case, the ideal alignment will have homologous residues aligned in the columns. An example of a multiple protein sequence alignment is shown in Figure 8.1. Here, one column is highlighted. If this column is well aligned, one can infer that the residues in that column have been derived from the same residue in the common ancestor of these sequences. That residue could have been a valine (V) or an isoleucine (I) or some other residue, but the key thing is that all of the amino acids in that column derive from that one position in the common ancestor. This is the phylogenetic perspective that underlies the construction of these alignments. In principle, one could also attempt to align the sequences so as to maximize the structural, functional, or physicochemical similarity of the residues in each column.
In simple cases, if the sequences are homologous, a good phylogenetic alignment will also maximize structural similarity. If the sequences are not homologous or so highly divergent that similarity is not clear, then a functional alignment may be very difficult to achieve. One common example of this kind of difficulty involves promoter sequences that share short functional motifs, such as binding sites for regulatory proteins. Most MSA packages struggle to correctly align such motifs and these are best searched for using special motif-finding packages or by comparison with sets of known motifs. A second example is where protein sequences share a common fold but no sequence similarity, perhaps because of convergent evolution of their three-dimensional structures or because of extreme divergence of the sequences. Again, such alignments are best carried out using special sequence–structure matching packages. In this chapter, we focus specifically on cases where we wish to align sequences that are clearly homologous and phylogenetically related.
When constructing an MSA, one must also take into account insertions and deletions that have taken place in the time during which the sequences under consideration have diverged from one another, after gene duplication or divergence of the host species. This means that MSA packages have to be able to find an arrangement of null characters or “gaps” that will somehow maximize the alignment of homologous residues in a fashion similar to that done for pairwise sequence alignments, as discussed in Chapter 3. These gaps are frequently represented by hyphens, as shown in Figure 8.1. Given a scoring scheme for residue matches (e.g. BLOSUM62; Henikoff and Henikoff 1992) and scores for gaps, one can attempt to find an MSA that produces the best overall score (and, thereby, the best overall alignment). In principle, this can be done using extensions of dynamic programming sequence alignment methods (Needleman and Wunsch 1970) to many sequences. This would then guarantee the best-scoring MSA. In practice, such extensions require time and memory that involve an exponential function of the number of sequences (written O(L^N), for N sequences of length L) and are limited to tiny numbers of sequences. Therefore, all of the methods that are widely used rely on heuristics to make the MSAs. The use of heuristics makes very large alignments possible but comes at the expense of a lack of guarantees about alignment scores or quality.
The most widely used MSA heuristic was called “progressive alignment” by Feng and Doolittle (1987); this method also belongs to a family of methods that were described by different groups in the 1980s (see, for example, Hogeweg and Hesper 1984). The earliest automatic MSA method that we are aware of was described by David Sankoff in 1973 (Sankoff et al. 1973) for aligning 5S rRNA sequences and is essentially a form of progressive alignment. All of these methods work by starting with alignments of pairs of sequences and merging these with new sequences or alignments to build up the MSA progressively. The order in which these alignments are performed is usually done according to some form of clustering of these sequences, generated by an all-against-all comparison, referred to as a “guide tree” in Higgins et al. (1992). A generic outline of this process is illustrated in Figure 8.2.
Figure 8.1
An example multiple sequence alignment of seven globin protein sequences. One position is highlighted.
Figure 8.2
An outline of the simple progressive multiple alignment process. There are variations for all of these steps and some of them are iterated in well-known packages such as MAFFT and MUSCLE.
第 8 章 多序列比对 / 引言
多序列比对(multiple sequence alignment,MSA)是指将两条以上的氨基酸序列或核苷酸序列排列在一起,使来自不同序列的残基以某种合理方式在垂直列中对齐。MSA 广泛用于蛋白质组和基因组分析中的各类分析流程,也是大多数系统发育比较的关键起点。研究者常用它寻找序列中的共同特征,并辅助预测蛋白质和核酸的二维、三维结构。Chatzou 等(2016)对 MSA 方法、用途及其误用作了很好的综述。
通常,只有在序列之间存在系统发育相关性、因而同源时,才应尝试对它们进行比对。在这种情况下,理想的比对应当把同源残基放在同一列中。图 8.1 给出了一个蛋白质多序列比对示例,其中突出显示了一列。如果这一列比对正确,就可以推断该列中的残基来自这些序列共同祖先中的同一个残基位置。这个祖先残基可能是缬氨酸(valine,V),也可能是异亮氨酸(isoleucine,I)或其他残基;关键在于,这一列中的所有氨基酸都源自共同祖先中的同一位置。这就是构建这类比对背后的系统发育视角。原则上,也可以尝试让序列比对最大化每一列残基在结构、功能或理化性质上的相似性。
在简单情形下,如果序列同源,一个良好的系统发育比对通常也会最大化结构相似性。如果序列并不同源,或者分化程度极高、相似性并不清楚,那么要得到有意义的功能性比对就会非常困难。一个常见例子是启动子序列:它们可能共享较短的功能基序,例如调控蛋白结合位点。多数 MSA 软件包很难正确比对这类基序;更合适的做法通常是使用专门的 motif-finding 软件包,或与已知基序集合进行比较。另一个例子是蛋白质序列具有共同折叠,但缺乏序列相似性;这可能源于三维结构的趋同进化,也可能源于序列极端分化。在这种情况下,也最好使用专门的序列—结构匹配软件包。本章将专门讨论这样一类情形:我们希望比对的序列明确同源,并且具有系统发育相关性。
构建 MSA 时,还必须考虑插入和缺失。在基因复制之后,或在宿主物种分化之后,待比较序列在彼此分化的过程中会发生插入和缺失。因此,MSA 软件包必须能够寻找一种空字符或“gap”(空位)的排列方式,使同源残基尽可能对齐;这一思路与第 3 章讨论的双序列比对类似。如图 8.1 所示,gap 通常用连字符表示。给定残基匹配打分方案(例如 BLOSUM62;Henikoff and Henikoff 1992)和 gap 打分之后,就可以尝试寻找一个总体得分最高、也就是总体上最优的 MSA。原则上,这可以通过把动态规划序列比对方法(Needleman and Wunsch 1970)扩展到多条序列来实现,并由此保证得到最高得分的 MSA。实践中,这类扩展需要的时间和内存随序列数量呈指数增长(可写作 O(L^N),其中 N 为序列条数,L 为序列长度),因此只能用于极少量序列。所以,所有广泛使用的方法都依赖启发式策略来构建 MSA。启发式方法使很大规模的比对成为可能,但代价是无法保证比对得分或比对质量一定最优。
最常用的 MSA 启发式方法由 Feng 和 Doolittle(1987)称为“progressive alignment”(渐进式比对);这一方法也属于 20 世纪 80 年代不同研究组提出的一类方法(例如 Hogeweg and Hesper 1984)。据作者所知,最早的自动 MSA 方法由 David Sankoff 于 1973 年提出,用于比对 5S rRNA 序列(Sankoff et al. 1973),本质上也是一种渐进式比对。所有这些方法都从序列两两比对开始,再逐步把新序列或已有比对合并进来,从而构建完整的 MSA。比对执行的顺序通常由某种序列聚类结果决定;这种聚类由全对全比较生成,Higgins 等(1992)将其称为“guide tree”(引导树)。图 8.2 概括展示了这一过程。
图 8.1
七条球蛋白蛋白质序列的多序列比对示例。图中突出显示了一个位置。
图 8.2
简单渐进式多序列比对过程示意。这个流程中的每一步都有不同变体,其中一些步骤会在 MAFFT 和 MUSCLE 等知名软件包中迭代执行。
术语表(10 条)
| English | 中文 |
|---|---|
| multiple sequence alignment (MSA) | 多序列比对(MSA) |
| homologous residues | 同源残基 |
| phylogenetic perspective | 系统发育视角 |
| functional motif | 功能基序 |
| motif-finding package | motif-finding 软件包 / 基序查找软件包 |
| sequence–structure matching package | 序列—结构匹配软件包 |
| gap | gap(空位) |
| progressive alignment | progressive alignment(渐进式比对) |
| guide tree | guide tree(引导树) |
| heuristic | 启发式策略 |

















