Chapter 8

Multiple Sequence Alignments

9 sections

056

Introduction

PDF pages 247-248; Figure 8.2 extends onto PDF page 249 and is assigned to this section; printed pages 227-229


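Extraction note: the PDF text contains some run-together words and words broken across lines; the translation has been corrected for readability against the rendered pages and the context.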

Extracted source

Multiple Sequence Alignments

Fabian Sievers, Geoffrey J. Barton, and Desmond G. Higgins

Introduction

A multiple sequence alignment (MSA) is an arrangement of more than two amino acid or nucleotide sequences which are aligned so as to make the residues from the different sequences line up in vertical columns in some appropriate manner. These are used in a great variety of analyses and pipelines in proteome and genome analysis and are an essential initial step in most phylogenetic comparisons. They are widely used to help search for common features in sequences and can be used to help predict two- and three-dimensional structures of proteins and nucleic acids. An excellent review of MSA methods, uses, and abuses is provided by Chatzou et al. (2016).

Usually, one should only attempt to align sequences which are phylogenetically related and, therefore, homologous. In this case, the ideal alignment will have homologous residues aligned in the columns. An example of a multiple protein sequence alignment is shown in Figure 8.1. Here, one column is highlighted. If this column is well aligned, one can infer that the residues in that column have been derived from the same residue in the common ancestor of these sequences. That residue could have been a valine (V) or an isoleucine (I) or some other residue, but the key thing is that all of the amino acids in that column derive from that one position in the common ancestor. This is the phylogenetic perspective that underlies the construction of these alignments. In principle, one could also attempt to align the sequences so as to maximize the structural, functional, or physicochemical similarity of the residues in each column.

In simple cases, if the sequences are homologous, a good phylogenetic alignment will also maximize structural similarity. If the sequences are not homologous or so highly divergent that similarity is not clear, then a functional alignment may be very difficult to achieve. One common example of this kind of difficulty involves promoter sequences that share short functional motifs, such as binding sites for regulatory proteins. Most MSA packages struggle to correctly align such motifs and these are best searched for using special motif-finding packages or by comparison with sets of known motifs. A second example is where protein sequences share a common fold but no sequence similarity, perhaps because of convergent evolution of their three-dimensional structures or because of extreme divergence of the sequences. Again, such alignments are best carried out using special sequence–structure matching packages. In this chapter, we focus specifically on cases where we wish to align sequences that are clearly homologous and phylogenetically related.

When constructing an MSA, one must also take into account insertions and deletions that have taken place in the time during which the sequences under consideration have diverged from one another, after gene duplication or divergence of the host species. This means that MSA packages have to be able to find an arrangement of null characters or “gaps” that will somehow maximize the alignment of homologous residues in a fashion similar to that done for pairwise sequence alignments, as discussed in Chapter 3. These gaps are frequently represented by hyphens, as shown in Figure 8.1. Given a scoring scheme for residue matches (e.g. BLOSUM62; Henikoff and Henikoff 1992) and scores for gaps, one can attempt to find an MSA that produces the best overall score (and, thereby, the best overall alignment). In principle, this can be done using extensions of dynamic programming sequence alignment methods (Needleman and Wunsch 1970) to many sequences. This would then guarantee the best-scoring MSA. In practice, such extensions require time and memory that involve an exponential function of the number of sequences (written O(L^N), for N sequences of length L) and are limited to tiny numbers of sequences. Therefore, all of the methods that are widely used rely on heuristics to make the MSAs. The use of heuristics makes very large alignments possible but comes at the expense of a lack of guarantees about alignment scores or quality.
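To make the O(L^N) growth concrete, the short sketch below counts the cells of a full dynamic-programming lattice for N sequences of length L. The function name `dp_cells` and the example length of 300 residues are illustrative assumptions, not values from the chapter.

```python
def dp_cells(length: int, n_sequences: int) -> int:
    """Cells in a full N-dimensional dynamic-programming lattice, taken as L^N."""
    return length ** n_sequences


# Illustrative only: sequences of 300 residues (an assumed, hypothetical length).
for n in (2, 3, 5, 10):
    print(f"N = {n:2d}: {dp_cells(300, n):.3e} cells")
```

Going from two to three sequences already multiplies the work by the sequence length, which is why exact methods are limited to very small N.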

The most widely used MSA heuristic was called “progressive alignment” by Feng and Doolittle (1987); this method also belongs to a family of methods that were described by different groups in the 1980s (see, for example, Hogeweg and Hesper 1984). The earliest automatic MSA method that we are aware of was described by David Sankoff in 1973 (Sankoff et al. 1973) for aligning 5S rRNA sequences and is essentially a form of progressive alignment. All of these methods work by starting with alignments of pairs of sequences and merging these with new sequences or alignments to build up the MSA progressively. The order in which these alignments are performed is usually done according to some form of clustering of these sequences, generated by an all-against-all comparison, referred to as a “guide tree” in Higgins et al. (1992). A generic outline of this process is illustrated in Figure 8.2.
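The first two stages of the progressive approach described above (all-against-all comparison, then clustering into a guide tree) can be sketched as follows. This is a minimal illustration under stated assumptions: the distance is one minus the Jaccard similarity of 2-mer sets, the clustering is plain average linkage, and the toy sequences are invented. Real packages use their own distance measures and tree-building methods.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """1 minus the Jaccard similarity of the two k-mer sets (illustrative choice)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0


def guide_tree(seqs, k=2):
    """Average-linkage clustering; returns nested tuples giving the merge order."""
    names = list(seqs)
    dist = {frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
            for pair in combinations(names, 2)}
    clusters = [(name, [name]) for name in names]  # (subtree, member names)

    def average(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1[1] for y in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Merge the two closest clusters, as a progressive aligner would align them next.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical toy sequences: a and b are near-identical, d is unrelated.
toy = {"a": "MKVLAAGIV", "b": "MKVLAAGLV", "c": "MRTLSSGIV", "d": "QQWPPHHNN"}
tree = guide_tree(toy)
print(tree)
```

The nested tuples give the order in which a progressive aligner would merge sequences and subalignments: the most similar pair first, the most distant sequence last.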

Figure 8.1

An example multiple sequence alignment of seven globin protein sequences. One position is highlighted.

Figure 8.2

An outline of the simple progressive multiple alignment process. There are variations for all of these steps and some of them are iterated in well-known packages such as MAFFT and MUSCLE.

中文译文


多序列比对（multiple sequence alignment，MSA）是指将两条以上的氨基酸序列或核苷酸序列排列在一起，使来自不同序列的残基以某种合理方式在垂直列中对齐。MSA 广泛用于蛋白质组和基因组分析中的各类分析流程，也是大多数系统发育比较的关键起点。研究者常用它寻找序列中的共同特征，并辅助预测蛋白质和核酸的二维、三维结构。Chatzou 等（2016）对 MSA 方法、用途及其误用作了很好的综述。

通常，只有在序列之间存在系统发育相关性、因而同源时，才应尝试对它们进行比对。在这种情况下，理想的比对应当把同源残基放在同一列中。图 8.1 给出了一个蛋白质多序列比对示例，其中突出显示了一列。如果这一列比对正确，就可以推断该列中的残基来自这些序列共同祖先中的同一个残基位置。这个祖先残基可能是缬氨酸（valine，V），也可能是异亮氨酸（isoleucine，I）或其他残基；关键在于，这一列中的所有氨基酸都源自共同祖先中的同一位置。这就是构建这类比对背后的系统发育视角。原则上，也可以尝试让序列比对最大化每一列残基在结构、功能或理化性质上的相似性。

在简单情形下，如果序列同源，一个良好的系统发育比对通常也会最大化结构相似性。如果序列并不同源，或者分化程度极高、相似性并不清楚，那么要得到有意义的功能性比对就会非常困难。一个常见例子是启动子序列：它们可能共享较短的功能基序，例如调控蛋白结合位点。多数 MSA 软件包很难正确比对这类基序；更合适的做法通常是使用专门的 motif-finding 软件包，或与已知基序集合进行比较。另一个例子是蛋白质序列具有共同折叠，但缺乏序列相似性；这可能源于三维结构的趋同进化，也可能源于序列极端分化。在这种情况下，也最好使用专门的序列—结构匹配软件包。本章将专门讨论这样一类情形：我们希望比对的序列明确同源，并且具有系统发育相关性。

构建 MSA 时，还必须考虑插入和缺失。在基因复制之后，或在宿主物种分化之后，待比较序列在彼此分化的过程中会发生插入和缺失。因此，MSA 软件包必须能够寻找一种空字符或“gap”（空位）的排列方式，使同源残基尽可能对齐；这一思路与第 3 章讨论的双序列比对类似。如图 8.1 所示，gap 通常用连字符表示。给定残基匹配打分方案（例如 BLOSUM62；Henikoff and Henikoff 1992）和 gap 打分之后，就可以尝试寻找一个总体得分最高、也就是总体上最优的 MSA。原则上，这可以通过把动态规划序列比对方法（Needleman and Wunsch 1970）扩展到多条序列来实现，并由此保证得到最高得分的 MSA。实践中，这类扩展需要的时间和内存随序列数量呈指数增长（可写作 O(L^N)，其中 N 为序列条数，L 为序列长度），因此只能用于极少量序列。所以，所有广泛使用的方法都依赖启发式策略来构建 MSA。启发式方法使很大规模的比对成为可能，但代价是无法保证比对得分或比对质量一定最优。

最常用的 MSA 启发式方法由 Feng 和 Doolittle（1987）称为“progressive alignment”（渐进式比对）；这一方法也属于 20 世纪 80 年代不同研究组提出的一类方法（例如 Hogeweg and Hesper 1984）。据作者所知，最早的自动 MSA 方法由 David Sankoff 于 1973 年提出，用于比对 5S rRNA 序列（Sankoff et al. 1973），本质上也是一种渐进式比对。所有这些方法都从序列两两比对开始，再逐步把新序列或已有比对合并进来，从而构建完整的 MSA。比对执行的顺序通常由某种序列聚类结果决定；这种聚类由全对全比较生成，Higgins 等（1992）将其称为“guide tree”（引导树）。图 8.2 概括展示了这一过程。

图 8.1

七条球蛋白蛋白质序列的多序列比对示例。图中突出显示了一个位置。

图 8.2

简单渐进式多序列比对过程示意。这个流程中的每一步都有不同变体，其中一些步骤会在 MAFFT 和 MUSCLE 等知名软件包中迭代执行。

术语表（10 条）

English	中文
multiple sequence alignment (MSA)	多序列比对（MSA）
homologous residues	同源残基
phylogenetic perspective	系统发育视角
functional motif	功能基序
motif-finding package	motif-finding 软件包 / 基序查找软件包
sequence–structure matching package	序列—结构匹配软件包
gap	gap（空位）
progressive alignment	progressive alignment（渐进式比对）
guide tree	guide tree（引导树）
heuristic	启发式策略


057

Measuring Multiple Alignment Quality

From the bottom of PDF page 248 to PDF page 251, ending before the actual `Making an Alignment: Practical Issues` heading; printed pages 228-231


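Extraction note: the PDF text contains some run-together words and words broken across lines; common breaks and spacing problems have been repaired in this file for readability.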

Extracted source

Measuring Multiple Alignment Quality

There are literally hundreds of different MSA packages and each uses different combinations of parameter settings and heuristic algorithms to make the alignments. How can we tell which package works best or is best-suited to which kinds of data? One standard approach is to compare alignments produced by different packages with a set of established “gold standard” reference alignments. Such sets are used as benchmarks and have been invaluable for developers of MSA packages in order to test and compare MSAs. For proteins, the most widely used MSA benchmarks have tended to rely on comparisons of protein sequences with known structures. This is due to the observation that protein sequences with very similar structures can actually have very highly divergent sequences. Therefore, this approach is very much based on a structural perspective. In turn, phylogenetic benchmarks have tended to use simulated alignments and/or sets of sequences with known phylogeny and do not necessarily give the same results as one would obtain based on structure (Iantorno et al. 2014). The use of structures in a benchmark entails aligning the structures automatically or manually, then using the corresponding sequence alignment to test various MSA packages. Early structural aligners included SSAP (Taylor and Orengo 1989) and STAMP (Russell and Barton 1992); a more recent program is MUSTANG (Konagurthu et al. 2006). While this process leads to a structural superposition of extant sections of the sequences to be aligned, it may not always be easy to align individual residues. Therefore, creating a reliable reference alignment may require some manual intervention, something that is not always straightforward (Edgar 2010; Iantorno et al. 2014).

The earliest large-scale MSA benchmark is BAliBASE. The original version (Thompson et al. 1999) contained over 140 reference alignments divided into five hierarchical reference sets, in an attempt to cover many different alignment scenarios. These include equidistant sequences of similar length (the BB11/12 reference set), families containing orphan sequences (BB2), equidistant divergent families (BB3), N/C-terminal extensions (BB4), and alignments with insertions (BB5). For categories BB11, BB12, BB2, and BB3, different sequence lengths from less than 100 to more than 400 residues are covered. For category BB11/12, alignments with low to high sequence identity are used. While the current version of BAliBASE is 4.0, we will use BAliBASE 3.0 (Thompson et al. 2005) for the purposes of this discussion. Version 3 comprises the same five categories as version 1; however, the number of reference alignments has been increased to 218. The number of sequences in the reference alignments ranges from 4 to 142, with a median of 21. The BAliBASE benchmark contains a scoring program that assesses how well a generated (test) protein MSA resembles the reference alignment. Similarity between test and reference alignments in BAliBASE is expressed by two numbers: the sum-of-pairs (SP) score and the total column (TC) score. The scoring program measures SP and TC scores only for regions that are reliably aligned in the reference; these regions are the so-called “core columns.” OXBench (Raghava et al. 2003) and SABmark (Van Walle et al. 2005) are based on similar principles to BAliBASE. SABmark comprises 1268 alignments, ranging from three to 50 (median eight) sequences. OXBench comprises 672 families of between two and 122 (median three) sequences. In this chapter, we give SP and TC scores for various MSA packages, as measured using the BAliBASE benchmark.

The SP score measures the proportion of correctly aligned residue pairs, while the TC score measures the proportion of reference alignment columns that are perfectly retrieved in the generated MSA. Both scores can vary between 0 (that is, no residue pair or column retrieved) and 1 (that is, the generated MSA and the reference alignment are identical). For a pairwise alignment, the SP score and TC score are the same. For an MSA aligning three or more sequences, the TC score can never exceed the SP score. The SP and TC scores give a measure of the sensitivity of the aligner, measuring the fraction of correctly aligned residues and columns (the number of true positives). They do not, however, penalize incorrectly aligned residues, which would be a measure of the specificity of the aligner (the number of true negatives). The specificity and sensitivity (see Box 5.4) of alignments in a benchmark test can be quantified by the Cline shift score (Cline et al. 2002) and the QModeller score (Sauder et al. 2000), which take incorrectly aligned residues into account.
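The two scores can be illustrated with a small sketch. It is not the BAliBASE scoring program: it treats every reference column as a core column, counts a column as retrieved only if the test alignment contains exactly the same column (gaps included), and uses invented toy alignments. The function names are hypothetical.

```python
from itertools import combinations


def columns(alignment):
    """Per column, a tuple holding each sequence's residue index (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        entry = []
        for s, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[s])
                counters[s] += 1
        cols.append(tuple(entry))
    return cols


def residue_pairs(cols):
    """All pairs of residues that share a column."""
    pairs = set()
    for col in cols:
        present = [(s, r) for s, r in enumerate(col) if r is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sp_tc(reference, test):
    """Return (SP, TC) of a test alignment against a reference alignment."""
    ref_cols, test_cols = columns(reference), columns(test)
    ref_pairs = residue_pairs(ref_cols)
    sp = len(ref_pairs & residue_pairs(test_cols)) / len(ref_pairs)
    # Only reference columns with at least two residues can be "retrieved".
    scored = [c for c in ref_cols if sum(r is not None for r in c) >= 2]
    test_set = set(test_cols)
    tc = sum(c in test_set for c in scored) / len(scored)
    return sp, tc


reference = ["ACGT", "ACGT", "AC-T"]
test = ["ACGT", "ACGT", "A-CT"]   # the gap in the third sequence is misplaced
sp, tc = sp_tc(reference, test)
print(f"SP = {sp:.2f}, TC = {tc:.2f}")
```

In this toy case a single misplaced gap spoils two of the four columns but only two of the ten residue pairs, so TC falls further than SP, consistent with TC never exceeding SP for three or more sequences.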

The maximum number of sequences in benchmarks such as BAliBASE 3.0, SABmark, or OXBench is of the order of 100. None of these benchmarks can explore the performance of MSA software if thousands or even millions of sequences have to be aligned. One way to increase the number of sequences that can be aligned is to mix a set of sequences for which a reliable alignment is known with sequences for which no reliable alignment is known. This was done with OXBench to give the “extended dataset,” which had datasets of over 1000 sequences for some families. PREFAB (Edgar 2004) was designed from the outset with this principle in mind. PREFAB comprises 1682 reference alignments of two sequences, to which between 0 and 48 (median 48, mean 45.2) non-reference sequences are added. The software performs the alignment of the full set of (up to 50) sequences. However, the quality of the alignment can only be evaluated based on the alignment of the two reference sequences. A general purpose scoring program called qscore is available from the same web site that distributes PREFAB and MUSCLE.
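The PREFAB idea of aligning the full set but scoring only the two reference sequences amounts to projecting the large alignment onto that pair. A minimal sketch follows; the function name, the dictionary layout, and the toy sequences are assumptions made for illustration and are unrelated to the qscore program.

```python
def induced_pairwise(msa, name_a, name_b):
    """Project an MSA (name -> aligned string) onto two sequences,
    dropping columns in which both of them have a gap."""
    kept = [(x, y) for x, y in zip(msa[name_a], msa[name_b])
            if not (x == "-" and y == "-")]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)


# Hypothetical alignment: two reference sequences plus one non-reference sequence.
msa = {"ref1": "AC--GT", "ref2": "A-C-GT", "extra": "ACCTGT"}
pair = induced_pairwise(msa, "ref1", "ref2")
print(pair)
```

The resulting pairwise alignment can then be compared with the structure-based reference pair, while the non-reference sequences influence the result only through the way they shaped the full alignment.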

A benchmark that extends the number of sequences into the tens of thousands is HomFam (Blackshields et al. 2010; Sievers et al. 2013). It is based on similar principles to PREFAB in that it mixes a small number of sequences for which a reliable alignment is known with a large number of homologous sequences for which no reliable alignment is known. The reference alignments come from the Homstrad structure alignment database (Mizuguchi et al. 1998) and the bulk of the sequences come from Pfam (Finn et al. 2014). The reference alignments comprise between five and 41 sequences, while the number of Pfam sequences varies between approximately 100 and 100000. The 2013 HomFam dataset contains 95 families.

Recently, a new class of benchmarks has been devised that can test an aligner with arbitrarily large numbers of sequences, relies on a small number of references, and assesses the alignment of all sequences in the alignment (including non-reference sequences). The first such benchmark is ContTest (Fox et al. 2016). In ContTest, the MSA is used to detect the co-evolution of alignment columns and produce a contact map prediction (Marks et al. 2011). This contact map prediction is then compared with the observed contact map of an embedded reference sequence. The accuracy with which the predicted and the observed contact maps agree serves as a proxy for the alignment quality. Co-evolution can only be detected if the information content of the alignment is large enough; that is, there should be at least as many sequences in the alignment as there are residues in the reference sequences. In practice, the number of sequences should be five times as large, so, for a typical protein domain, ContTest will not work well for fewer than 1000 sequences.

Another such benchmark is QuanTest (Le et al. 2017). Here, the MSA is used to predict secondary structure (Drozdetskiy et al. 2015), and then this predicted secondary structure is compared with the true secondary structure of one or more of the embedded reference sequences. In general, secondary structure prediction accuracy increases with the number of aligned sequences, but useful predictions can already be made for 200 sequences. Therefore, QuanTest is more applicable to smaller alignments than ContTest.
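The comparison step can be pictured as a per-residue agreement between two strings of secondary structure states. The sketch below is a simplification under stated assumptions: the three-letter state alphabet (H, E, C), the function name, and the example strings are illustrative, and the actual QuanTest scoring may differ in detail.

```python
def ss_agreement(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical states for a 10-residue reference sequence.
predicted = "HHHHEEEECC"
observed = "HHHEEEEECC"
print(ss_agreement(predicted, observed))
```

A better alignment should, on average, yield predictions that agree more closely with the observed states of the embedded reference sequences.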

中文译文


不同的 MSA 软件包可谓数以百计，每个软件包都会使用不同组合的参数设置和启发式算法来生成比对。那么，如何判断哪个软件包效果最好，或者最适合哪一类数据？一种标准做法，是把不同软件包生成的比对与一组公认的“gold standard”（金标准）参考比对进行比较。这类集合可作为 benchmark（基准测试集），对 MSA 软件包开发者测试和比较 MSA 极其重要。对于蛋白质而言，最常用的 MSA benchmark 往往依赖已知结构蛋白质序列之间的比较。这是因为，结构非常相似的蛋白质序列实际上可能已经高度分化。因此，这种方法明显建立在结构视角之上。相对地，系统发育 benchmark 往往使用模拟比对，和/或使用系统发育关系已知的序列集合；它们得到的结果不一定与基于结构的结果一致（Iantorno et al. 2014）。

在 benchmark 中使用结构，意味着需要先以自动或人工方式对结构进行比对，然后用对应的序列比对来测试各种 MSA 软件包。早期的结构比对程序包括 SSAP（Taylor and Orengo 1989）和 STAMP（Russell and Barton 1992）；较新的程序包括 MUSTANG（Konagurthu et al. 2006）。虽然这一过程会对待比对序列中仍然存在的片段进行结构叠合，但逐个残基的对齐并不总是容易。因此，创建可靠的参考比对可能需要一定人工干预，而这并不总是直截了当（Edgar 2010; Iantorno et al. 2014）。

最早的大规模 MSA benchmark 是 BAliBASE。其原始版本（Thompson et al. 1999）包含 140 多个参考比对，并被划分为五个层级化参考集合，目的是覆盖多种不同的比对场景。这些场景包括：长度相近且距离相等的序列（BB11/12 reference set）、包含孤立序列的家族（BB2）、距离相等但分化较大的家族（BB3）、N 端或 C 端延伸（BB4），以及含插入的比对（BB5）。对于 BB11、BB12、BB2 和 BB3 类别，覆盖的序列长度从小于 100 个残基到大于 400 个残基不等。对于 BB11/12 类别，则使用从低序列同一性到高序列同一性的比对。虽然 BAliBASE 当前版本是 4.0，但本章讨论将使用 BAliBASE 3.0（Thompson et al. 2005）。第 3 版包含与第 1 版相同的五个类别，但参考比对数量增加到 218 个。每个参考比对中的序列数从 4 到 142 条不等，中位数为 21。BAliBASE benchmark 包含一个打分程序，用于评估生成的（测试）蛋白质 MSA 与参考比对的相似程度。

在 BAliBASE 中，测试比对与参考比对之间的相似性用两个数值表示：sum-of-pairs（SP，成对求和）score 和 total column（TC，总列）score。该打分程序只在参考比对中可靠对齐的区域测量 SP 和 TC score；这些区域称为“core columns”（核心列）。OXBench（Raghava et al. 2003）和 SABmark（Van Walle et al. 2005）基于与 BAliBASE 类似的原则。SABmark 包含 1268 个比对，序列数从 3 到 50 条不等（中位数为 8）。OXBench 包含 672 个家族，每个家族有 2 到 122 条序列（中位数为 3）。本章将给出多个 MSA 软件包在 BAliBASE benchmark 下测得的 SP 和 TC score。

SP score 衡量正确对齐的残基对所占比例；TC score 衡量参考比对中的列在生成的 MSA 中被完整恢复的比例。两个分数都可以在 0 到 1 之间变化：0 表示没有恢复任何残基对或列，1 表示生成的 MSA 与参考比对完全相同。对于双序列比对，SP score 与 TC score 相同。对于包含三条或更多序列的 MSA，TC score 永远不会超过 SP score。SP 和 TC score 衡量的是 aligner（比对程序）的 sensitivity（灵敏度），也就是正确对齐的残基和列所占比例，即 true positives（真阳性）数量。可是，它们不会惩罚错误对齐的残基；这类错误本应反映 aligner 的 specificity（特异度），也就是真阴性数量。benchmark 测试中比对的 specificity 与 sensitivity（见 Box 5.4）可以用 Cline shift score（Cline et al. 2002）和 QModeller score（Sauder et al. 2000）来量化，因为这些指标会把错误对齐的残基纳入考虑。

BAliBASE 3.0、SABmark 或 OXBench 这类 benchmark 中的最大序列数大约在 100 条量级。如果需要比对成千上万甚至数百万条序列，这些 benchmark 都无法考察 MSA 软件的性能。增加可比对序列数量的一种方法，是把一组已有可靠比对的序列，与一些没有可靠比对的序列混合。OXBench 的“extended dataset”（扩展数据集）采用了这种做法，其中某些家族的数据集包含 1000 条以上序列。PREFAB（Edgar 2004）从一开始就是按这一原则设计的。PREFAB 包含 1682 个由两条序列构成的参考比对；在每个参考比对中，会额外加入 0 到 48 条非参考序列（中位数 48，平均数 45.2）。软件会对完整序列集合（最多 50 条序列）进行比对。不过，比对质量只能根据两条参考序列之间的比对来评估。一个名为 qscore 的通用打分程序，可从发布 PREFAB 和 MUSCLE 的同一网站获得。

HomFam（Blackshields et al. 2010; Sievers et al. 2013）是一种把序列数量扩展到数万条级别的 benchmark。它基于与 PREFAB 类似的原则：把少量已有可靠比对的序列，与大量尚无可靠比对但同源的序列混合在一起。参考比对来自 Homstrad 结构比对数据库（Mizuguchi et al. 1998），而大部分序列来自 Pfam（Finn et al. 2014）。参考比对包含 5 到 41 条序列；Pfam 序列数则在约 100 到 100000 条之间变化。2013 年版 HomFam 数据集包含 95 个家族。

近年来，研究者设计出一类新的 benchmark。它们可以测试面对任意大规模序列数量的 aligner，只依赖少量参考序列，并且评估比对中所有序列的对齐情况，包括非参考序列。第一个这类 benchmark 是 ContTest（Fox et al. 2016）。在 ContTest 中，MSA 被用于检测比对列之间的 co-evolution（共进化），并据此生成 contact map（接触图）预测（Marks et al. 2011）。随后，将预测接触图与嵌入其中的参考序列的观测接触图进行比较。预测图与观测图的一致程度，可作为比对质量的 proxy（代理指标）。只有当比对的信息量足够大时，才可能检测到共进化；也就是说，比对中的序列数至少应与参考序列中的残基数相当。实践中，序列数最好达到残基数的五倍。因此，对于典型蛋白结构域，如果少于 1000 条序列，ContTest 的效果通常不会很好。

另一个同类 benchmark 是 QuanTest（Le et al. 2017）。在这种方法中，MSA 被用于预测二级结构（Drozdetskiy et al. 2015），随后将预测得到的二级结构与一个或多个嵌入参考序列的真实二级结构进行比较。一般来说，二级结构预测准确度会随着已比对序列数量的增加而提高；但在 200 条序列时，已经可以得到有用的预测。因此，相比 ContTest，QuanTest 更适用于规模较小的比对。

术语表（14 条）

English	中文
benchmark	benchmark（基准测试集）
gold standard reference alignment	金标准参考比对
structural perspective	结构视角
phylogenetic benchmark	系统发育 benchmark
reference alignment	参考比对
sum-of-pairs (SP) score	sum-of-pairs（SP，成对求和）score
total column (TC) score	total column（TC，总列）score
core columns	core columns（核心列）
sensitivity	sensitivity（灵敏度）
specificity	specificity（特异度）
true positives / true negatives	true positives（真阳性）/ true negatives（真阴性）
contact map	contact map（接触图）
proxy	proxy（代理指标）
co-evolution	co-evolution（共进化）


058

Making an Alignment: Practical Issues

From the actual heading on PDF page 251 to just before the `Commonly Used Alignment Packages` heading on PDF page 252; printed pages 231-232


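Extraction note: the PDF text contains some run-together words and words broken across lines; common breaks and spacing problems have been repaired in this file for readability.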

Extracted source

Making an Alignment: Practical Issues

Most automatic alignment programs such as the ones described in the next section will give good quality alignments for sequences that are similar. However, building good multiple alignments for sequences that are highly divergent is an expert task even with the best available alignment tools. In this section we give an overview of some of the steps to go through in order to make alignments that are good for structure/function prediction. This is not a universal recipe, as each set of sequences presents its own problems and only experience can guide the creation of high-quality alignments.

The key steps in building a multiple alignment are as follows:

1. Find the sequences to align by database searching or other means.
2. Locate the region(s) of each sequence to include in the alignment. Do not try to multiply align sequences that are very different in length. Most multiple alignment programs are designed to align sequences that are similar over their entire length, so first edit the sequences down to those regions that the sequence database search suggests are similar. Some database search tools can be of assistance in identifying such regions (e.g. PSI-BLAST; Altschul et al. 1997).
3. Run the multiple alignment program.
4. Inspect the alignment for problems. Take particular care over regions that appear to be speckled with gaps. Use an alignment visualization tool (e.g. Jalview or SeaView; see Viewing a Multiple Alignment) to identify positions in the alignment that conserve physicochemical properties across the complete alignment. If there are no such regions, then look at subsets of the sequences.
5. Remove sequences that appear to seriously disrupt the alignment and then realign the subset that is left.
6. After identifying key residues in the set of sequences that are straightforward to align, attempt to add the remaining sequences to the alignment so as to preserve the key features of the family.
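As a rough aid to the inspection step, regions "speckled with gaps" can be flagged by computing the fraction of gaps in each column. This is only a sketch: the 0.5 threshold, the function names, and the toy alignment are assumptions, and it does not replace visual inspection in a viewer such as Jalview or SeaView.

```python
def gap_fractions(alignment):
    """Fraction of gap characters ('-') in each column of an alignment."""
    n = len(alignment)
    return [col.count("-") / n for col in zip(*alignment)]


def gappy_columns(alignment, threshold=0.5):
    """Zero-based indices of columns whose gap fraction reaches the threshold."""
    return [i for i, f in enumerate(gap_fractions(alignment)) if f >= threshold]


# Hypothetical four-sequence alignment with a gappy stretch in the middle.
alignment = ["MKV-LA", "MK--LA", "MKVTLA", "MR--LA"]
print(gap_fractions(alignment))
print(gappy_columns(alignment))
```

Columns flagged this way are candidates for closer inspection, and sequences that introduce many such columns are candidates for removal and later re-addition.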

With the exception of the first step (database search), all of the above steps can be managed within the Jalview program (see Viewing a Multiple Alignment), software that combines powerful alignment editing and subsetting functions with integrated access to eight multiple alignment algorithms. Alternatively, many of the programs described below can be run from websites where the user pastes a set of sequences into a window or uploads a file with sequences in a standard file format. This works well for occasional use, and the use of many of these websites is relatively self-explanatory. In particular, we recommend the tools server at the European Bioinformatics Institute (EBI), which allows for online usage of the most widely used MSA packages. Some servers have limits on the number of sequences that can be aligned at one time or the user may need to make hundreds of alignments. In these cases, the user can run these alignment programs locally on a server or a desktop computer. Familiarity with the basic use of the Linux operating system then becomes important. All of the commonly used alignment packages can be run using so-called “command-line input,” where the user enters the name of the program (e.g. clustalo) at a prompt in a terminal window, followed by instructions for input and output. Basic usage for Linux command-line operation is given below for most of the commonly used multiple alignment packages.

中文译文


下一节将介绍的多数自动比对程序，在处理彼此相似的序列时通常能够生成质量较好的比对。然而，即使用上最好的比对工具，要为高度分化的序列构建良好的多序列比对，仍然是一项需要专业判断的工作。本节概述为了得到适合结构/功能预测的比对，通常需要经历的一些步骤。这不是一个通用配方，因为每一组序列都有自身的问题，只有经验才能指导高质量比对的构建。

构建多序列比对的关键步骤如下：

通过数据库搜索或其他方式找到需要比对的序列。
确定每条序列中应纳入比对的区域。不要尝试对长度差异很大的序列直接进行多序列比对。多数多序列比对程序是为“整条序列范围内彼此相似”的序列设计的，因此，应先根据序列数据库搜索提示的相似区域，对序列进行裁剪，只保留这些区域。一些数据库搜索工具可以帮助识别这类区域（例如 PSI-BLAST；Altschul et al. 1997）。
运行多序列比对程序。
检查比对中是否存在问题。尤其要注意那些看起来被 gap 零散打断的区域。使用 alignment visualization tool（比对可视化工具，例如 Jalview 或 SeaView；见“Viewing a Multiple Alignment”）来识别在整个比对中保持理化性质保守的位置。如果找不到这类区域，就需要查看序列的不同子集。
移除那些明显严重扰乱比对的序列，然后对剩余子集重新比对。
在容易比对的序列集合中识别出关键残基之后，再尝试把其余序列加入比对，同时尽量保留该家族的关键特征。

除第一步数据库搜索以外，上述所有步骤都可以在 Jalview 程序中完成（见“Viewing a Multiple Alignment”）。Jalview 把强大的比对编辑和子集选择功能，与八种多序列比对算法的集成访问结合在一起。另一种做法是，使用下文介绍的许多程序所提供的网站：用户可以把一组序列粘贴到网页窗口中，或上传含有标准文件格式序列的文件。对于偶尔使用而言，这种方式很方便，而且许多网站的使用方式相对直观。特别推荐 European Bioinformatics Institute（EBI）的工具服务器，它允许在线使用最常用的 MSA 软件包。

有些服务器会限制一次能够比对的序列数量；也有可能用户需要生成数百个比对。在这些情况下，用户可以在服务器或台式机上本地运行这些比对程序。此时，熟悉 Linux 操作系统的基本使用就变得很重要。所有常用比对软件包都可以通过所谓的“command-line input”（命令行输入）运行：用户在终端窗口的提示符后输入程序名称（例如 clustalo），再跟上输入和输出相关指令。下文将给出多数常用多序列比对软件包在 Linux 命令行操作中的基本用法。

术语表（9 条）

English	中文
alignment tool	比对工具
highly divergent sequences	高度分化的序列
structure/function prediction	结构/功能预测
alignment visualization tool	alignment visualization tool（比对可视化工具）
gap	gap
subsetting	子集选择
command-line input	command-line input（命令行输入）
prompt	提示符
`clustalo`	命令名保持原样


059

Commonly Used Alignment Packages — Part 1: Clustal Omega

PDF pages 252-256; printed pages 232-236; ends before the ClustalW2 heading


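Source text not yet available.

Translation

This section describes how to build multiple sequence alignments with a range of commonly used packages. For a summary of where to download source code or use the tools online, see Internet Resources in this chapter.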

Clustal Omega

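Clustal Omega (Sievers et al. 2011) is the latest member of the Clustal suite of MSA software and can be used for both amino acid and nucleotide sequences. It is an almost complete rewrite of its predecessor, ClustalW2 (Larkin et al. 2007). The main improvements over ClustalW2 are that Clustal Omega can align far more sequences in less time, that it usually produces more accurate alignments as measured by benchmarks based on crystal structures, and that it can incorporate prior knowledge about the overall structure of the final alignment. Clustal Omega is a command-line-driven program that has been compiled successfully on Linux, Mac, and Windows platforms. Unlike its predecessor, Clustal Omega has no graphical user interface (GUI); this is compensated for by many excellent alignment visualization programs (such as SeaView, Gouy et al. 2010, and Jalview, Waterhouse et al. 2009) and by online servers such as the European Molecular Biology Laboratory (EMBL)-EBI bioinformatic web and programmatic tools framework, the Max Planck Bioinformatics Toolkit, and the Galaxy server of the Pasteur Institute.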

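Clustal Omega is a progressive aligner. It uses a guide tree to direct the multiple alignment, and this guide tree is calculated from a matrix of pairwise distances between the sequences. For N sequences, this requires N × N sequence comparisons and the storage of an N × N distance matrix. Historically, this step was the bottleneck that prevented conventional aligners from aligning large numbers of sequences; the practical limit was around 10000 sequences or fewer. By default, however, Clustal Omega does not calculate the all-against-all distance matrix but uses the mBed algorithm (Blackshields et al. 2010) instead. mBed calculates a matrix of distances from every sequence to a small number of randomly chosen "seed" sequences. The computational requirements of mBed therefore do not grow quadratically with N but scale as N × log(N).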
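The difference between quadratic and N × log(N) growth is easy to underestimate, so the sketch below tabulates both for a few values of N. The base of the logarithm (2 here) and the function names are assumptions; only the orders of growth come from the text, and constant factors are ignored.

```python
import math


def full_matrix_cost(n: int) -> int:
    """Distance calculations for an all-against-all N x N matrix."""
    return n * n


def mbed_cost(n: int) -> float:
    """Distance calculations growing as N * log2(N) (base 2 is an assumption)."""
    return n * math.log2(n)


for n in (1_000, 10_000, 100_000):
    ratio = full_matrix_cost(n) / mbed_cost(n)
    print(f"N = {n:>7}: full matrix needs about {ratio:,.0f} times more comparisons")
```

The gap widens as N grows, which is why the full matrix becomes the limiting step long before the alignment stage does.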

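Clustal Omega uses the mBed distance matrix to perform k-means clustering of the sequences. By default, each cluster is capped at 100 sequences. The program builds small guide trees for the individual clusters and an overarching guide tree for the clusters themselves. The default cap of 100 was chosen because typical alignments at the time rarely exceeded 10000 sequences, which gives at most 100 clusters of size 100; for larger alignments, the cluster size can be changed with the --cluster-size flag. Although the smaller distance matrix might appear to carry less information, alignments built from mBed guide trees are usually of comparable or even higher quality than alignments based on all-against-all distance matrices. If a full distance matrix calculation is required, mBed mode can be switched off with the --full flag.

During the main alignment step of the progressive alignment heuristic, individual sequences are first aligned to form subalignments, and smaller subalignments are then aligned with one another to form progressively larger subalignments. In Clustal Omega, these pairwise alignments are carried out by hhalign (Söding 2005), which converts individual sequences and small subalignments into hidden Markov models (HMMs) and then aligns these HMMs in a pairwise fashion.

Clustal Omega's file input/output uses Sean Eddy's squid library, so it can read and write several widely used sequence formats, such as a2m/FASTA, Clustal, msf, PHYLIP, selex, Stockholm, and Vienna. The default output format is FASTA. A minimal Clustal Omega command line is: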

clustalo -i <infile> -o <outfile>

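Here, <infile> is a placeholder for the file containing the sequences to be aligned, which must be in one of the recognized formats, and <outfile> is a placeholder for the output file in which the aligned sequences are stored in FASTA format.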

Iteration

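Clustal Omega can refine an alignment iteratively. During the initial alignment, distances are based on k-mers from the unaligned sequences. During iterative refinement, distances are instead based on the full alignment. The expectation is that full-alignment distances reflect the similarity between sequences better and will therefore produce a "better" guide tree and, in turn, a better alignment. Clustal Omega also converts the initial alignment into an HMM and then aligns this HMM in the background with the individual sequences and subprofiles, which lets Clustal Omega "anticipate" how and where the other sequences will align. In practice, this "anticipation" consists of transferring pseudocount information from the HMM of the initial alignment to the sequences and subalignments that are to be realigned; the process is described in more detail by Sievers et al. (2011).

Sequence alignments are particularly prone to misalignment during the early stages of progressive alignment, so the pseudocount information transferred to single sequences and small subalignments can be large. As the subalignments grow during the later stages, enough "real" information should have accumulated that the pseudocount transfer can be scaled down accordingly. For subprofiles of 100 or more sequences, there is effectively no pseudocount transfer. In principle, an alignment can be refined an arbitrary number of times; experience shows, however, that one or two iterations usually improve alignment quality noticeably. More than two iterations are rarely helpful and should be decided on a case-by-case basis. The minimal command for an iterated alignment is: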

clustalo -i infile.fa -o outfile1.fa --iter=1

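Here, infile.fa and outfile1.fa are the names of the FASTA-format input and output files, respectively.

Note that iteration comes at a performance cost. Each round of iteration requires three additional alignments: the first and second subalignments must each be aligned with the background HMM, and the two subalignments, now augmented with pseudocount background information, must then be aligned with each other. An alignment with one iteration takes roughly four times as long as the initial alignment, and an alignment with two iterations roughly seven times as long.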
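The factors of four and seven follow from counting one unit of work for the initial alignment plus three for every iteration. The sketch below encodes that arithmetic; the function name is hypothetical, and the extrapolation beyond two iterations is an assumption based on the same pattern rather than a measured result.

```python
def relative_run_time(iterations: int) -> int:
    """Approximate run time relative to the initial alignment: 1 + 3 per iteration."""
    return 1 + 3 * iterations


for k in range(4):
    print(f"--iter={k}: about {relative_run_time(k)}x the initial alignment")
```

With a cost that grows this quickly and a benefit that levels off after one or two rounds, additional iterations quickly stop being worthwhile.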

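During iteration, a preliminary alignment is converted into an HMM, and this HMM is then used to produce a higher-quality alignment. HMM information can also be generated externally. If the type of sequences to be aligned is known, a precalculated HMM may already exist. Pfam (Finn et al. 2016), for example, contains a large collection of protein families, alignments, and their HMMs. If the sequences to be aligned are known to be homologous to a family in Pfam, the corresponding HMM can be downloaded from Pfam and supplied as an additional command-line argument: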

clustalo -i infile.fa -o outfile4.fa --hmm-in=pfam.hmm

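Here, pfam.hmm is an HMM downloaded from Pfam that contains alignment information for a protein family homologous to the sequences in infile.fa. Alternatively, HMMER (Finn et al. 2011) can be used to build an HMM from a locally generated alignment.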
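When many alignments have to be made, it can be convenient to assemble these command lines in a script. The sketch below only builds the argument list from the flags mentioned in this section (-i, -o, --iter, --hmm-in, --full, --cluster-size); the helper name `clustalo_command` is hypothetical, and the `--cluster-size=<n>` spelling with an equals sign is an assumption modeled on `--iter=1`. Nothing is executed here; running the command would require a local clustalo installation.

```python
import shlex


def clustalo_command(infile, outfile, iterations=0, hmm_in=None,
                     full=False, cluster_size=None):
    """Build a Clustal Omega argument list from the options discussed above."""
    cmd = ["clustalo", "-i", infile, "-o", outfile]
    if iterations:
        cmd.append(f"--iter={iterations}")
    if hmm_in:
        cmd.append(f"--hmm-in={hmm_in}")
    if full:
        cmd.append("--full")  # switch off mBed and use the full distance matrix
    if cluster_size:
        cmd.append(f"--cluster-size={cluster_size}")  # assumed spelling
    return cmd


cmd = clustalo_command("infile.fa", "outfile1.fa", iterations=1)
print(shlex.join(cmd))
# To run it for real (requires clustalo on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.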

Benchmarking Clustal Omega

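Several questions have to be considered when evaluating the performance of a multiple alignment program. Can the software handle the number of input sequences? How long does the alignment take? Does the approach scale to more sequences, or to longer ones? How accurate are the alignments when compared with reference alignments of sequences with known three-dimensional structures? Aligners differ in all of these respects. Some are very fast on small sets of sequences but need impractical run times once the number of sequences exceeds a few hundred; some of these slower aligners can nevertheless be very accurate on benchmarks. Conversely, some aligners can handle extremely large datasets at the cost of some accuracy. In this section, Clustal Omega is compared with a number of commonly used alignment packages in terms of computation time, alignment accuracy, and the ability to handle long sequences or large numbers of sequences. These packages and their use are described in detail in later sections.

Figure 8.3 and Table 8.1 show results obtained with the well-established BAliBASE3 benchmark (Thompson et al. 2005). Here, accuracy is measured as the proportion of correctly aligned columns across the 218 benchmark alignments, reported in the table as the TC score. Clustal Omega is neither the fastest nor the most accurate of the alignment packages, but it is more accurate than every faster aligner. The only aligner to achieve a higher TC score is L-INS-i from the MAFFT package (Katoh et al. 2005a,b) (Figure 8.3). Figure 8.3 gives the total run time and overall accuracy score for BAliBASE3. BAliBASE3 is divided into subcategories of alignment types, and the results for each are given in Table 8.1.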

图 8.3

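A comparison of aligner accuracy and total single-threaded run time using the BAliBASE3 benchmark. Times are summed over all 218 test alignments, and total column (TC) scores are averaged. The x-axis (time) is logarithmic and the y-axis (TC score) is linear. Data points correspond to the aligners' default settings. Additional data points are shown for Clustal Omega (i1: a more accurate mode), MUSCLE (i2: a fast mode), and PASTA (m: MUSCLE as subaligner; w: ClustalW2 as subaligner). The data points correspond to columns 8 and 9 of Table 8.1.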

Table 8.1 Aligner performance on the BAliBASE3 benchmark

Aligner	BB11	BB12	BB2	BB3	BB4	BB5	all	Time	RSS	rss
ClustalO	0.36	0.79	0.45	0.58	0.58	0.53	0.55	00h:04m:25s	959060	55961
ClustalO-i1	0.36	0.79	0.45	0.59	0.59	0.55	0.56	00h:24m:53s	3442156	106888
ClustalW2	0.22	0.71	0.22	0.27	0.40	0.31	0.37	00h:09m:58s	8032	3852
DIALIGN	0.27	0.70	0.29	0.31	0.44	0.43	0.42	00h:47m:28s	56912	7350
Kalign	0.37	0.79	0.36	0.48	0.50	0.44	0.50	00h:00m:24s	7260	2776
L-INS-i	0.40	0.84	0.46	0.59	0.60	0.59	0.58	00h:30m:01s	703524	43695
MAFFT	0.29	0.77	0.33	0.42	0.49	0.50	0.47	00h:00m:50s	461668	35950
PartTree	0.28	0.76	0.30	0.40	0.45	0.50	0.45	00h:00m:57s	448524	19421
MUSCLE	0.32	0.80	0.35	0.41	0.45	0.46	0.48	00h:07m:48s	78608	15892
MUSCLE-i2	0.27	0.76	0.33	0.38	0.43	0.43	0.45	00h:01m:47s	78780	15860
PASTA(w)	0.24	0.71	0.23	0.23	0.37	0.34	0.37	01h:08m:49s	317112	58703
PASTA	0.35	0.78	0.45	0.50	0.51	0.52	0.53	01h:45m:08s	664336	65448
PASTA(m)	0.30	0.78	0.31	0.35	0.44	0.39	0.44	01h:10m:43s	323936	62038
PRANK	0.24	0.68	0.25	0.35	0.36	0.39	0.39	35h:55m:53s	468692	36742
T-Coffee	0.41	0.86	0.40	0.47	0.55	0.59	0.55	05h:48m:46s	1870536	192504
Number of test alignments	38	44	41	30	49	16	218

Columns 2–7 (BB11–BB5) give the average total column (TC) score for each hierarchical reference set; column 8 (all) gives the average TC score over all 218 test alignments. Column 9 (Time) gives the total single-threaded run time for all 218 test alignments. Column 10 (RSS) gives the maximum memory requirement and column 11 (rss) the average memory requirement. Columns 8 and 9 (all/Time) correspond to Figure 8.3. The last row gives the number of test alignments in each hierarchical set.
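The statement that Clustal Omega is more accurate than every faster aligner can be checked directly against the "all" and "Time" columns of Table 8.1. The sketch below copies those two columns from the table; the variable and function names are illustrative.

```python
# Average TC score ("all") and total run time ("Time"), copied from Table 8.1.
TABLE_8_1 = {
    "ClustalO": (0.55, "00h:04m:25s"), "ClustalO-i1": (0.56, "00h:24m:53s"),
    "ClustalW2": (0.37, "00h:09m:58s"), "DIALIGN": (0.42, "00h:47m:28s"),
    "Kalign": (0.50, "00h:00m:24s"), "L-INS-i": (0.58, "00h:30m:01s"),
    "MAFFT": (0.47, "00h:00m:50s"), "PartTree": (0.45, "00h:00m:57s"),
    "MUSCLE": (0.48, "00h:07m:48s"), "MUSCLE-i2": (0.45, "00h:01m:47s"),
    "PASTA(w)": (0.37, "01h:08m:49s"), "PASTA": (0.53, "01h:45m:08s"),
    "PASTA(m)": (0.44, "01h:10m:43s"), "PRANK": (0.39, "35h:55m:53s"),
    "T-Coffee": (0.55, "05h:48m:46s"),
}


def to_seconds(stamp: str) -> int:
    """Convert a string such as '00h:04m:25s' to seconds."""
    hours, minutes, seconds = (int(part[:-1]) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


ref_tc, ref_stamp = TABLE_8_1["ClustalO"]
ref_time = to_seconds(ref_stamp)

# Aligners that finish sooner than default Clustal Omega, and those that score higher.
faster = sorted(n for n, (tc, t) in TABLE_8_1.items() if to_seconds(t) < ref_time)
more_accurate = sorted(n for n, (tc, t) in TABLE_8_1.items() if tc > ref_tc)

print("faster than ClustalO:", faster)
print("higher TC than ClustalO:", more_accurate)
```

Every faster aligner has a lower average TC score than default Clustal Omega. Apart from Clustal Omega's own iterated mode (ClustalO-i1), only L-INS-i scores higher, and T-Coffee ties at 0.55 with a much longer run time.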

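The performance measures in Table 8.1 are based on datasets of fixed size. Figure 8.4, in contrast, plots the run times of several MSA algorithms as the number of sequences to be aligned increases, using three sets of sequences of different lengths from Pfam (Finn et al. 2014). The bars correspond to a very short protein domain (zf-CCHH, average length 23 amino acids), a medium-length domain (rvp, average sequence length 93), and a long protein domain (RuBisCO_large, length 248). Figure 8.4 presents the results as a double-logarithmic plot. Shallow curves indicate good scalability: as the number of sequences increases, the computation time grows only moderately. Steep curves indicate poor scalability: computation time rises quickly as the sequence sets get larger. The Clustal Omega results are shown as red bars (zf-CCHH at the bottom, RuBisCO_large at the top) and circles (rvp). For datasets of 20–1000 sequences, Clustal Omega is slower than Kalign (magenta circles), default MAFFT (dark blue circles), or fast MUSCLE (green squares). Because it scales better, Clustal Omega overtakes fast MUSCLE and Kalign at N = 2000 and default MAFFT at N = 20000. MAFFT PartTree (dark blue squares) is consistently faster than Clustal Omega on all datasets.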
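On a double-logarithmic plot, the slope of a curve is the exponent with which run time grows with the number of sequences, which is why shallow curves mean good scalability. The sketch below estimates that exponent from two points; the timings are invented for illustration and are not read from Figure 8.4.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Slope of the line through (n1, t1) and (n2, t2) on a log-log plot."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Hypothetical timings in seconds for two imaginary aligners.
quadratic = scaling_exponent(1_000, 10.0, 10_000, 1_000.0)  # 100x slower for 10x more data
near_linear = scaling_exponent(1_000, 10.0, 10_000, 130.0)  # 13x slower for 10x more data
print(f"steep curve: exponent {quadratic:.2f}")
print(f"shallow curve: exponent {near_linear:.2f}")
```

An aligner with a smaller exponent can start out slower on small datasets and still overtake its competitors as N grows, which is the crossover pattern described for Clustal Omega above.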

图 8.4

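Total single-threaded execution time (y-axis) for different aligners as the number of sequences (x-axis) increases. Both axes are logarithmic. The whiskers show the range of times from short sequences (lower whisker: zf-CCHH/PF00096, length 23–34 residues) to long sequences (upper whisker: RuBisCO_large/PF00016, length 295–329 residues). Solid lines connect the times for medium-length sequences (rvp/PF00077, length 94–124).

In Clustal Omega, both main stages of the progressive alignment heuristic, namely the distance calculation and the pairwise alignments, have been parallelized. A single alignment can be distributed across the cores of one computer but not across different computers. Calculating the distance matrix is an easily parallelized task; the pairwise alignment stage, in contrast, is hard to parallelize effectively. As Figure 8.5 shows, Clustal Omega achieves a good speed-up with two, three, or four threads, but more threads are useful only when the number of sequences is very large. Clustal Omega's parallelization is "thread-safe": an alignment produced with one thread is guaranteed to be identical to one produced with several threads.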

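Figure 8.5

Ratio of total run time to single-threaded run time (y-axis) for different numbers of threads (x-axis): (a) 100 sequences; (b) 500 sequences; (c) 1000 sequences; (d) 10000 sequences. Def denotes the program's default settings. L-INS-i (t) denotes that program's non-threaded setting.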

术语表（12 条）

English	中文
Commonly Used Alignment Packages	常用比对软件包
Clustal Omega	保留英文
progressive aligner	progressive aligner（渐进式比对程序）
guide tree	guide tree（引导树）
mBed algorithm	mBed 算法
seed sequence	seed（种子）序列
cluster / k-means clustering	cluster（簇）/ k-means 聚类
subalignment	subalignment（子比对）
hidden Markov model (HMM)	hidden Markov model（HMM，隐马尔可夫模型）
Iteration	Iteration / iterative refinement（迭代优化）
pseudocount	pseudocount（伪计数）
thread-safe	thread-safe（线程安全）

060

Commonly Used Alignment Packages — Part 2: ClustalW2 / DIALIGN / Kalign / MAFFT

PDF pages 256-259; print pages 236-239; from ClustalW2 up to, but not including, the MUSCLE heading

ClustalW2

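ClustalW2 (Larkin et al. 2007) is the predecessor of Clustal Omega and derives from a series of programs dating back to the 1980s. It is usually slower than Clustal Omega, can align fewer sequences, and tends to produce lower-quality alignments. It also cannot be sped up by using multiple threads. Its code base has been frozen since 2010, and ClustalW2 is no longer under active development. ClustalW2 is still available as an online tool on the Pasteur Galaxy server, but the EBI and the Max Planck Bioinformatics Toolkit no longer offer it. ClustalW2 is part of several Linux distributions (e.g. Ubuntu); code and executables are also available from the Clustal web site. ClustalW2 is still covered here because it remains widely used and because its GUI makes it a very easy and intuitive program to use. Unlike Clustal Omega, it can be run interactively in a terminal or through a GUI called ClustalX. This section, however, describes only how to use ClustalW2 from the command line.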

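ClustalW2 is also a progressive aligner, and it always computes the full N × N distance matrix, where N is the number of sequences to be aligned. In practice this limits the number of sequences that ClustalW2 can align in a reasonable time; no attempt was made here to align more than 5000 sequences. The "W" in ClustalW comes from the weighting scheme used to down-weight over-represented sequences.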
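To see why a full N × N matrix becomes the limiting factor, it helps to put numbers on it. The sketch below assumes 8 bytes per matrix entry, which is an illustrative choice and not a statement about how ClustalW2 actually stores its distances.

```python
def full_matrix_entries(n_sequences):
    """Number of entries in a full N x N distance matrix."""
    return n_sequences * n_sequences

def full_matrix_bytes(n_sequences, bytes_per_entry=8):
    """Memory for the matrix, assuming a fixed size per entry (hypothetical: 8 bytes)."""
    return full_matrix_entries(n_sequences) * bytes_per_entry

for n in (500, 5000, 50000):
    print(n, full_matrix_entries(n), full_matrix_bytes(n) / 1e6, "MB")
```

Under this assumption, doubling the number of sequences quadruples both the number of pairwise comparisons and the storage needed.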

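ClustalW2 automatically recognizes seven input sequence file formats: NBRF-PIR, EMBL-SWISSPROT, Pearson (FASTA), Clustal, GCG-MSF, GCG9-RSF, and GDE. Alignment output is in Clustal format by default, but GCG, NBRF-PIR, PHYLIP, GDE, NEXUS, or FASTA can be selected instead. A minimal ClustalW2 command line is: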

clustalw2 -INFILE=infile.fa

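This command reads the sequences in infile.fa, detects the file format, guesses whether the sequences are nucleotide or protein, aligns them, and writes the alignment in Clustal format to infile.aln. The stem of the input file name (here, infile) is kept, the file extension (here, .fa) is removed, and the extension .aln is appended. By default, ClustalW2 also writes the guide tree in Newick format to a file ending in .dnd. A progress report is printed to standard output, including the distances between the unaligned sequences and the intermediate subalignment scores. For large numbers of sequences this can cost time and memory, so it can be suppressed by setting the -QUIET flag. To write the alignment to a file other than the default, set the -OUTFILE flag. The output format is specified with the -OUTPUT flag, as follows: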

clustalw2 -INFILE=infile.fa -OUTFILE=output.a2m -OUTPUT=fasta

Here, the alignment of the unaligned sequences in infile.fa is written in FASTA format to the file output.a2m.
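The default naming rule described above (keep the stem, drop the extension, append .aln) can be sketched in Python, for example to predict where a pipeline should look for results. The helper name is hypothetical, and the assumption that the guide tree file uses the same stem as the alignment file goes slightly beyond what the text states, which only says that the file ends in .dnd.

```python
from pathlib import Path

def default_clustalw2_outputs(infile):
    """Predict default output names for a ClustalW2 run on `infile`.

    Assumption: the .dnd guide tree file shares the stem of the input file.
    """
    path = Path(infile)
    alignment = path.with_suffix(".aln")   # stem kept, extension replaced
    guide_tree = path.with_suffix(".dnd")
    return str(alignment), str(guide_tree)

print(default_clustalw2_outputs("infile.fa"))
```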

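On the standard protein benchmark BAliBASE3, ClustalW2 is of medium speed: slower than Clustal Omega, default MAFFT, and Kalign. Its execution time is about the same as that of default MUSCLE, and it is faster than PRANK, T-Coffee, and PASTA. In terms of accuracy as measured by the TC score, however, ClustalW2 is the worst of all the aligners considered here, as can be seen in Figure 8.3.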
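The TC score used in these comparisons is the fraction of reference alignment columns that the test alignment reproduces exactly. A simplified Python sketch is given below. It is not the benchmark's own scoring program: as a stand-in for the benchmark's core columns it scores only gap-free reference columns, and it assumes both alignments list the same sequences in the same order.

```python
def columns(alignment):
    """Describe each column as a tuple of residue indices (None for a gap)."""
    next_index = [0] * len(alignment)
    result = []
    for position in range(len(alignment[0])):
        column = []
        for row_number, row in enumerate(alignment):
            if row[position] == "-":
                column.append(None)
            else:
                column.append(next_index[row_number])
                next_index[row_number] += 1
        result.append(tuple(column))
    return result

def tc_score(reference, test):
    """Fraction of gap-free reference columns found unchanged in the test alignment."""
    scored = [c for c in columns(reference) if None not in c]
    test_columns = set(columns(test))
    return sum(1 for c in scored if c in test_columns) / len(scored)

reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
print(tc_score(reference, test))
```

In this toy case three of the four gap-free reference columns are recovered, so the score is 0.75.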

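For small numbers of sequences, ClustalW2 is among the most memory-efficient aligners in this comparison. However, its time and memory requirements grow with the square of the number of sequences. On the benchmark machine with 8 GB of RAM, the range of sequence numbers therefore could not be extended beyond 5000, as shown by the orange bullets in Figure 8.4.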

DIALIGN

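Progressive alignment algorithms are appropriate when the sequences to be aligned are clearly alignable over their full lengths. If the sequences share only local similarities and are otherwise unrelated, such algorithms may not be suitable. For example, if several sequences share just one short protein domain and are completely unrelated elsewhere, they are hard to align with a standard progressive aligner such as Clustal Omega. DIALIGN (Morgenstern et al. 1998) does not attempt to match individual residues; instead it matches segments of residues. These segments contain no gaps and have the same length in all the sequences being aligned. Mismatches are allowed within a segment even though gaps are not. Segments of different lengths are considered, but a lower threshold of 10 is typically used. Multiple segments are aligned only if consistency can be maintained: no segment may be aligned with more than one segment in another sequence, and all segments must keep the same order in all sequences. This consistency scheme predates the one implemented in T-Coffee (Notredame et al. 2000). A typical DIALIGN command line is: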

dialign2 -fa input.in

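Here, input.in is the file containing the unaligned sequences. The aligned output is written to a file with the same name as the input file, but with the extension .fa appended.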
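The consistency condition on segments can be illustrated for the simplest case of two sequences. The sketch below is not DIALIGN's algorithm; it only checks, for a hypothetical list of gap-free segment pairs written as (start in sequence A, start in sequence B, length), that no two segments overlap and that their order is the same in both sequences.

```python
def is_consistent(segment_pairs):
    """Check a set of gap-free segment pairs between two sequences.

    Each pair is (start_a, start_b, length). The set is consistent when,
    going along sequence A, every segment starts after the previous one
    has ended in both sequences (no overlap, no crossing).
    """
    ordered = sorted(segment_pairs)
    for (a1, b1, len1), (a2, b2, _len2) in zip(ordered, ordered[1:]):
        if a2 < a1 + len1 or b2 < b1 + len1:
            return False
    return True

print(is_consistent([(0, 2, 10), (12, 15, 10)]))   # same order in both sequences
print(is_consistent([(0, 20, 10), (12, 5, 10)]))   # segments cross
```

Extending the check to many sequences requires the same ordering to hold for every pair of sequences at once, which is what makes the multiple-sequence case harder.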

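On the BAliBASE3 benchmark dataset, DIALIGN is faster than T-Coffee, PASTA, and PRANK, but slower than all the other aligners. DIALIGN's TC scores are relatively low, but better than those of ClustalW2 and PRANK. DIALIGN's run-time requirements are the highest of all the aligners (Figure 8.4). Its memory requirements are low initially but appear to grow quadratically with the number of sequences (Figure 8.5). A parallelized version of DIALIGN is available (Schmollinger et al. 2004).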

Kalign

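Kalign2 (Lassmann and Sonnhammer 2005) is a progressive MSA program. It uses the Muth–Manber string-matching algorithm (Muth and Manber 1996) to establish the distances needed to build the guide tree. This appears to be the fastest distance calculation algorithm among the programs considered here. The distance matrix calculation in Kalign2 nevertheless scales quadratically with the number of sequences. Kalign2 supports the Clustal, PileUp, MSF, Stockholm, UniProt, Swiss-Prot, and Macsim alignment formats.

A minimal Kalign2 command line is: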

kalign -in input.fa -out output.fa

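This command writes the alignment of the unaligned sequences in input.fa to output.fa in the default FASTA format. A progress report is also written to standard output.

On the BAliBASE3 benchmark, Kalign2 is the fastest of the programs considered here. As measured by the TC score, it is more accurate than the default versions of MAFFT, MUSCLE, and ClustalW2, but less accurate than L-INS-i, Clustal Omega, or T-Coffee (Figure 8.3). However, BAliBASE3 is relatively small, with between 4 and 142 sequences per alignment (median 21). For larger numbers of sequences, Kalign's scaling outweighs the benefit of its efficient implementation: it is overtaken in speed by MAFFT (1000 sequences), MUSCLE in fast mode, Clustal Omega (2000 sequences), and PASTA (20000 sequences).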

MAFFT

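MAFFT (Katoh et al. 2005a,b) is a collection of different executables managed by a script, which selects from a range of multiple alignment programs according to the number of sequences, the desired accuracy, and the available computing power. Three are highlighted here: (i) FFT-NS-i, the general-purpose default MAFFT aligner for medium to large datasets; (ii) L-INS-i, which is more accurate but slower and suits small datasets of a few hundred sequences; and (iii) PartTree, which can handle very large numbers of sequences.

When MAFFT is run without specifying a particular aligner, it runs in default mode. In default mode, MAFFT recodes the amino acid sequences as sequences of tuples containing residue volume and polarity. Using a fast Fourier transform (FFT), the correlation of volume and polarity between two sequences can be computed efficiently. In this way, homologous segments of the sequences are identified. These segments are then aligned using conventional dynamic programming. This algorithm is called FFT-NS-1. In default mode, MAFFT repeats this procedure once more (FFT-NS-2) and then performs iterative refinement, which together constitute FFT-NS-i. The MSA produced during FFT-NS-2 is progressively improved by repeated pairwise alignment of randomly partitioned groups of the sequences. L-INS-i uses iterative refinement together with alignment consistency (Notredame et al. 2000), a technique that measures the agreement between the multiple alignment and the pairwise alignments. This approach can be very accurate, but it generally scales cubically with the number of sequences and is therefore mainly suitable for smaller problems. PartTree, on the other hand, is a fast method that builds a guide tree quickly and can therefore handle datasets of many thousands of sequences.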
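The idea of locating homologous segments from a correlation of residue properties can be illustrated with a toy example. The sketch below is deliberately simplified and is not MAFFT's implementation: it uses a single made-up property scale instead of real volume and polarity values, and it evaluates the correlation by direct summation over every offset rather than with an FFT (the FFT computes the same quantity faster).

```python
# Hypothetical, centered property values for a handful of residues.
TOY_PROPERTY = {"A": -2, "G": -3, "L": 2, "W": 4, "K": 1, "D": -1}

def encode(sequence):
    """Recode a sequence as a list of numeric property values."""
    return [TOY_PROPERTY[residue] for residue in sequence]

def best_lag(seq_a, seq_b):
    """Return (lag, score) maximizing sum of a[i] * b[i + lag] over the overlap."""
    a, b = encode(seq_a), encode(seq_b)
    best = None
    for lag in range(-(len(a) - 1), len(b)):
        score = sum(a[i] * b[i + lag]
                    for i in range(len(a)) if 0 <= i + lag < len(b))
        if best is None or score > best[1]:
            best = (lag, score)
    return best

# The second sequence is the first one shifted right by two residues.
print(best_lag("WLKAG", "DAWLKAG"))
```

The correlation peaks at a lag of 2, the offset at which the shared segment lines up; in an aligner, dynamic programming would then be applied around such peaks.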

Default MAFFT

A minimal default MAFFT command line is:

mafft input.fa > output.fa

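MAFFT does not accept non-standard amino acid symbols such as ambiguity codes. If the sequence data contain such symbols, the --anysymbol flag should be set. Diagnostic output written to standard error can be suppressed by setting the --quiet flag.

On BAliBASE3, default MAFFT is the second-fastest aligner after Kalign2, with TC scores slightly lower than those of Kalign2, comparable to default MUSCLE, and much higher than ClustalW2. Its memory consumption is consistently high. All MAFFT strategies have been parallelized and show good speed-up with up to four threads. Beyond that, a useful speed-up is obtained only for very large numbers of sequences. Default MAFFT is thread-safe; that is, an alignment generated with one thread is guaranteed to be identical to one generated with multiple threads. Alignments produced in multi-threaded mode are therefore reproducible.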

L-INS-i

L-INS-i is the high-accuracy MAFFT program and consequently has lower throughput than the default version. A minimal MAFFT L-INS-i command line can be written in either of two ways:

linsi input.fa > output.fa

or:

mafft --localpair input.fa > output.fa

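Of all the programs considered here, MAFFT L-INS-i attains the highest TC scores on the BAliBASE3 benchmark (Figure 8.3). It runs more slowly than MUSCLE and Clustal Omega, about as fast as Clustal Omega with one iteration, and faster than T-Coffee and PASTA. MAFFT L-INS-i shows the best multi-threaded speed-up of all the programs. However, MAFFT L-INS-i is not thread-safe: results can differ between runs that use different numbers of threads, and even between runs that use the same number of threads.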

PartTree

A minimal MAFFT PartTree command line is:

mafft --parttree input.fa > output.fa

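PartTree is the high-throughput MAFFT program and is not expected to do well on a small benchmark such as BAliBASE3. There, it is slower and less accurate than the MAFFT default version. The data in Figure 8.4 show that PartTree is consistently the fastest aligner once the number of sequences exceeds 200. Clustal Omega scales similarly (Figure 8.4) but has a higher overhead. For datasets of more than 2000 sequences, PartTree is also the most memory-efficient algorithm. In all MAFFT versions, the guide tree can be written out by setting the --treeout flag. In PartTree, however, the sequence identifiers are replaced by integer indices giving each sequence's position in the input file. PartTree guide trees may also contain multifurcations. Like all MAFFT versions, PartTree can read in an external guide tree, but in a MAFFT-specific file format: the input must first be generated from a standard-format guide tree by a utility called newick2mafft.rb, which is part of the MAFFT distribution. PartTree is thread-safe, although using more than one thread gives no useful speed-up.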
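Because PartTree guide trees carry integer indices instead of sequence identifiers, it can be convenient to map the indices back before viewing the tree. The Python sketch below does this for a Newick string. It rests on two assumptions that should be checked against real output: that the indices are 1-based, and that the tree string contains no whitespace or quoted labels. The helper name is hypothetical.

```python
import re

def relabel_leaves(newick, identifiers):
    """Replace integer leaf labels with the identifiers they index.

    Assumption: index 1 refers to the first sequence in the input file.
    """
    def swap(match):
        return identifiers[int(match.group(0)) - 1]

    # A leaf label follows '(' or ',' and precedes ':', ',' or ')'.
    # Branch lengths follow ':' and are therefore left untouched.
    return re.sub(r"(?<=[(,])\d+(?=[:,)])", swap, newick)

names = ["seqA", "seqB", "seqC"]
print(relabel_leaves("((1:0.1,2:0.2):0.05,3:0.3);", names))
print(relabel_leaves("(1,2,3);", names))   # a multifurcating tree
```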

术语表（19 条）

English	中文
ClustalW2 / ClustalX	保留英文
progressive aligner	progressive aligner（渐进式比对程序）
weighting scheme	weighting scheme（加权方案）
guide tree	guide tree（引导树）
Newick format	Newick 格式
standard output / standard error	standard output（标准输出）/ standard error（标准错误）
DIALIGN	保留英文
segment / residue	片段 / 残基
mismatch	mismatch（错配）
consistency scheme	consistency scheme（一致性方案）
Kalign2	保留英文
Muth–Manber string-matching algorithm	Muth–Manber string-matching algorithm（字符串匹配算法）
MAFFT / FFT-NS-i / L-INS-i / PartTree	保留英文
executable	executable（可执行程序）
dynamic programming	dynamic programming（动态规划）
fast Fourier transform (FFT)	fast Fourier transform（FFT，快速傅里叶变换）
throughput	throughput（吞吐量）
thread-safe	thread-safe（线程安全）
multifurcation	multifurcation（多分叉）

061

Commonly Used Alignment Packages — Part 3: MUSCLE / PASTA / PRANK / T-Coffee

PDF pages 260-262; print pages 240-242; from MUSCLE up to, but not including, the Viewing a Multiple Alignment heading

MUSCLE

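MUSCLE (Edgar 2004) is a progressive MSA program. In the first stage, it computes a distance matrix for the unaligned sequences based on a fast k-tuple vector comparison. These distances are then clustered using UPGMA cluster analysis (Sokal and Michener 1958). This stage produces an initial alignment, which can then be improved in a second, iterative step. The second step is similar to the first, except that alignment-based distances (Kimura 1983) are used instead of the k-tuple vector comparison. In a subsequent round of iterative refinement, the second-stage alignment can be improved by cutting the second-stage guide tree into two parts, realigning the sequences in each subtree, and then aligning the two subprofiles (a procedure called tree-dependent restricted partitioning). The new alignment is accepted if it improves the alignment score. By default, these refinements are carried out 14 times, giving 16 rounds of alignment in total.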
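A k-tuple comparison estimates how similar two sequences are without aligning them, by counting the short words they share. The sketch below shows the general idea with a simple shared-word fraction; it is an illustration only and not the exact distance measure that MUSCLE uses.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def kmer_distance(seq_a, seq_b, k=3):
    """1 minus the fraction of k-mers shared, relative to the shorter sequence."""
    possible = min(len(seq_a), len(seq_b)) - k + 1
    if possible <= 0:
        raise ValueError("sequences must be at least k residues long")
    # Counter intersection keeps the minimum count of each shared word.
    shared = sum((kmer_counts(seq_a, k) & kmer_counts(seq_b, k)).values())
    return 1.0 - shared / possible

print(kmer_distance("MKVLAAGI", "MKVLAAGI"))   # identical sequences
print(kmer_distance("MKVLAAGI", "MKVLSAGI"))   # one substitution
```

A single substitution removes every word that overlaps it, which is why the second comparison already gives a distance of 0.5 with k = 3.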

A minimal MUSCLE command line is:

muscle -in input.fa -out output.fa

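This command performs the two initial rounds of alignment (based on k-tuple and alignment distances, respectively), followed by 14 rounds of iterative refinement. If the number of sequences is large, the iterative refinement can be skipped by adding an extra term to the command that specifies the maximum number of iterations: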

muscle -in input.fa -out output.fa -maxiters 2

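On the BAliBASE3 benchmark, the accuracy of default MUSCLE (as measured by the TC score) is comparable to that of default MAFFT; it is slightly faster than ClustalW2 and slightly slower than Clustal Omega. Fast MUSCLE performs only the first two alignment stages and is roughly an order of magnitude faster than the default version. On BAliBASE3 it is faster than Clustal Omega, but still not as fast as default MAFFT or Kalign2. Its accuracy, however, is lower than that of the default version. In the large-scale tests of Figure 8.4, MUSCLE exceeded the memory available on the test platform at 5000 sequences in default mode and at 20000 sequences in fast mode. The run time of fast MUSCLE starts out faster than Clustal Omega and slower than Kalign2; it then overtakes Kalign2 and is in turn overtaken by Clustal Omega at 2000 sequences. Because iterative refinement repartitions the guide tree but does not regenerate it, the guide trees of the default and fast versions are always the same. There is no parallel version of MUSCLE.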

PASTA

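PASTA (Practical Alignments using SATé and TrAnsitivity; Mirarab et al. 2015) is a Python script that calls existing software packages, such as SATé (Liu et al. 2009), MAFFT, MUSCLE, ClustalW, HMMER (Eddy 2009), OPAL (Wheeler and Kececioglu 2007), and FastTree-2 (Price et al. 2010), and combines their results. In a first step, a small number of sequences are selected at random from the input dataset and aligned. PASTA's default aligner is MAFFT L-INS-i. This initial alignment is called the "backbone" and is converted into an HMM using HMMER. The remaining sequences are then aligned to this HMM. Next, FastTree is used to build an initial maximum likelihood (ML) tree from this alignment. The sequences are then clustered according to this tree so that the clusters remain small. The clusters are aligned with the default aligner to form subalignments. Subalignments that are "adjacent" in the overall spanning tree are aligned using OPAL to form subalignment pairs. Finally, the different subalignment pairs are merged to produce the overall alignment.

By default, PASTA expects nucleotide sequences as input. For protein sequences, a minimal PASTA command line is: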

python run_pasta.py --input=input.fa --datatype=Protein

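On the BAliBASE3 benchmark, default PASTA is faster than T-Coffee and PRANK, but slower than all the other aligners. The accuracy of PASTA is closely tied to the accuracy of the underlying subalignment software. This aligner can be changed by specifying a parameter, for example --aligner=muscle or --aligner=clustalw2. PASTA alignments are most accurate with a more accurate aligner such as L-INS-i (the default), of intermediate quality with an aligner such as MUSCLE, and worst with ClustalW2. On BAliBASE3, however, PASTA alignments never actually match or exceed the quality of the underlying subalignment software on its own. This can be seen in Figure 8.3: the PASTA data points lie to the right of (slower) and below (less accurate) the data points of the corresponding subaligners. This is not surprising, since it has been shown, using alignments of small numbers of protein sequences, that ML phylogenetic trees are not necessarily good guide trees and are often decidedly bad ones (Sievers et al. 2014).

PASTA, however, was not designed for aligning small numbers of sequences. On the large-scale benchmark data, it starts out (at 20 sequences) as the second-slowest aligner after PRANK; but because its run time scales well, it overtakes L-INS-i at 500 sequences, default MUSCLE and ClustalW2 at 5000 sequences, and Kalign2 at 20000 sequences. Its memory consumption scales in a similar way.

PASTA is parallelized. By default, it tries to use all available threads. The number of threads can be changed by passing an argument to the --num_cpus flag. PASTA shows a good speed-up as the number of threads increases, and the effect becomes more pronounced as the number of sequences grows, as shown in Figure 8.5. However, the version of PASTA considered here is not thread-safe: alignments differ depending on the number of threads used. Perhaps more disturbingly, results cannot be reproduced when more than one thread is used. PASTA's default mode uses L-INS-i, which is not thread-safe; but the same problem occurs even when MUSCLE, which can only run single-threaded, is used as the subaligner. In one specific example of the latter case, aligning the same set of 100 rvp sequences (average length 106.5, longest sequence 124) ten times with three threads gave alignment lengths varying between 159 and 183. In this example, the TC scores for the core columns of six rvp reference sequences varied between 0.433 and 0.556. To make results reproducible, --num_cpus=1 should therefore always be set.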
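Whether two runs of an aligner gave the same result can be checked directly from their FASTA output. The sketch below is a minimal, assumed workflow rather than part of any aligner: it parses two FASTA-format alignments held as strings, reports the alignment length, and tests whether every sequence is aligned identically in both runs, regardless of the order of the records.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping identifier to (aligned) sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def alignment_length(alignment):
    """Number of columns; raises if the rows have different lengths."""
    lengths = {len(sequence) for sequence in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("rows of an alignment must have equal length")
    return lengths.pop()

def same_alignment(run_a, run_b):
    """True when both runs contain the same identifiers with identical rows."""
    return run_a == run_b

run1 = read_fasta(">s1\nAC-GT\n>s2\nACTGT\n")
run2 = read_fasta(">s2\nACTGT\n>s1\nAC-GT\n")   # same alignment, records reordered
run3 = read_fasta(">s1\nACG-T\n>s2\nACTGT\n")   # gap placed differently
print(alignment_length(run1), same_alignment(run1, run2), same_alignment(run1, run3))
```

In practice the strings would be read from the output files of repeated runs made with the same input and settings.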

PRANK

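In a pairwise alignment of two single sequences, it is impossible to tell whether a gap in one sequence was caused by a deletion in that sequence or by an insertion in the other. In an MSA, however, this distinction can become important, especially for phylogenetic analysis. Most progressive aligners underestimate the true number of insertion events and can produce artificially short alignments. PRANK (Löytynoja and Goldman 2005) tries to address this by performing phylogeny-aware gap placement. PRANK can therefore be useful when careful estimation of all gap positions matters to the researcher. Structure-based benchmarks of the kind described in this section cannot test PRANK appropriately; its low scores on them do not mean that it is not useful in other settings.

A minimal PRANK command line is: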

prank -d=infile.fa -o=outfile -f=fasta

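On the BAliBASE3 benchmark, PRANK is the slowest aligner and, apart from ClustalW2, attains the lowest TC scores (Figure 8.3). This is not surprising, because traditional structure-based benchmarks reward compact alignments and may not sufficiently penalize over-alignment. Note that PRANK reads only the standard IUPAC codes (one unique letter per amino acid or base) and replaces all non-IUPAC characters (e.g. ambiguity codes) with N or X. Discrepancies can therefore arise when the alignment is compared with the unaligned data or with a reference alignment.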
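One way to anticipate such discrepancies is to apply the same kind of masking to the unaligned or reference sequences before comparing. The sketch below mimics the behavior described above under stated assumptions: the accepted alphabets (20 standard amino acids; A, C, G, T, U) are this example's reading of "one unique letter per amino acid or base" and are not taken from PRANK's source code.

```python
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")   # assumed accepted amino acid codes
NUCLEOTIDE_LETTERS = set("ACGTU")               # assumed accepted base codes

def mask_nonstandard(sequence, protein=True):
    """Replace characters outside the accepted alphabet with X (protein) or N."""
    allowed = PROTEIN_LETTERS if protein else NUCLEOTIDE_LETTERS
    placeholder = "X" if protein else "N"
    return "".join(
        char if char.upper() in allowed or char == "-" else placeholder
        for char in sequence
    )

def changed_positions(sequence, protein=True):
    """Positions at which masking alters the sequence."""
    masked = mask_nonstandard(sequence, protein)
    return [i for i, (old, new) in enumerate(zip(sequence, masked)) if old != new]

print(mask_nonstandard("ACDBZ"))                  # B and Z are ambiguity codes
print(mask_nonstandard("ACGTRY", protein=False))  # R and Y are ambiguity codes
print(changed_positions("ACDBZ"))
```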

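The scalability benchmark shows PRANK to be one of the slower aligners for small numbers of sequences. Its time complexity, however, is among the lowest of all the aligners: PRANK overtakes T-Coffee beyond 100 sequences and MAFFT L-INS-i beyond 1000 sequences (Figure 8.4). Its memory requirements follow a similar trend and are expected to exceed the memory available on the test platform beyond 5000 sequences.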

T-Coffee

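T-Coffee was originally a progressive alignment heuristic for optimizing the Coffee objective function for MSA (Notredame et al. 1998). This function seeks the MSA that maximizes the sum of weighted pairwise matches between residues from different sequences. These pairwise matches can come from pairwise alignments, existing MSAs, corresponding residues in protein structure superpositions, or aligned residues in RNA structure alignments. T-Coffee can therefore merge information from unaligned sequences, different MSA packages, structure alignments, or a mixture of these sources. Notredame et al. (2000) gave the first description of MSA consistency: a pairwise residue match between two sequences is given a higher weight if it agrees with the pairwise matches involving other sequences. This helps to get around the inherently greedy nature of progressive alignment and has been shown to produce very accurate alignments. Consistency was later incorporated into the Probcons (Do et al. 2005) and MAFFT (Katoh et al. 2005a,b) packages. It increases the computational complexity of the alignment and is mainly suitable for alignments of fewer than 1000 sequences, but it improves alignment accuracy considerably.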
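The consistency idea can be made concrete with a small, unweighted sketch. It is a simplification for illustration, not T-Coffee's actual weighting scheme: each pairwise alignment is reduced to a set of matched residue positions, and a match between sequences a and b gains one unit of support for every third sequence c through which the same two residues are also connected. The library contents below are invented for the example.

```python
def matches(library, x, y):
    """Matched (position in x, position in y) pairs, whichever way they are stored."""
    if (x, y) in library:
        return library[(x, y)]
    if (y, x) in library:
        return {(q, p) for p, q in library[(y, x)]}
    return set()

def extended_weight(library, names, a, b, i, j):
    """Direct support for the match (i, j) plus support routed through third sequences."""
    weight = 1 if (i, j) in matches(library, a, b) else 0
    for c in names:
        if c in (a, b):
            continue
        a_to_c = matches(library, a, c)
        c_to_b = matches(library, c, b)
        # Support exists if residue i of a and residue j of b meet at some residue k of c.
        if any((i, k) in a_to_c and (k, j) in c_to_b for _, k in a_to_c):
            weight += 1
    return weight

# Hypothetical pairwise matches for three sequences A, B, and C.
library = {
    ("A", "B"): {(0, 0), (1, 1)},
    ("A", "C"): {(0, 0), (1, 2)},
    ("B", "C"): {(0, 0), (1, 1)},
}
names = ["A", "B", "C"]
print(extended_weight(library, names, "A", "B", 0, 0))   # supported through C
print(extended_weight(library, names, "A", "B", 1, 1))   # not supported through C
```

Here the first match is confirmed by the path through C and receives weight 2, while the second is contradicted by it and keeps weight 1, so a progressive alignment guided by these weights would favor the first.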

A minimal T-Coffee command line is:

t_coffee -in infile.fa -output fasta

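This command produces a FASTA-format alignment file named infile.fasta_aln.

On the BAliBASE3 benchmark, T-Coffee is the second-slowest aligner after PRANK. As Figure 8.3 shows, however, its average TC score is among the highest, better than those of PASTA, Kalign, and MUSCLE. T-Coffee has the highest average memory consumption. Because T-Coffee is based on the consistency principle, its time complexity with respect to the number of sequences is expected to be high. The range of sequence numbers could not be extended beyond 1000, because T-Coffee exhausted the available 8 GB of RAM. As far as parallelization is concerned, T-Coffee is fully thread-safe: alignments do not depend on the number of processors, which can be set with the -n_core flag, and they are reproducible. T-Coffee is therefore the aligner with the best parallel speed-up among those that remain thread-safe.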

术语表（14 条）

English	中文
MUSCLE / PASTA / PRANK / T-Coffee	保留英文
progressive MSA program	progressive MSA 程序
k-tuple vector comparison	k-tuple vector comparison（k 元组向量比较）
UPGMA cluster analysis	UPGMA cluster analysis（聚类分析）
tree-dependent restricted partitioning	tree-dependent restricted partitioning（依赖树的受限划分）
backbone	backbone（骨架）
maximum likelihood (ML) tree	maximum likelihood（ML，最大似然）tree
spanning tree	spanning tree（生成树）
subaligner	subaligner（子比对程序）
phylogeny-aware gap placement	phylogeny-aware gap placement（系统发育感知的 gap 放置）
compact alignment / over-alignment	compact alignment（紧凑比对）/ over-alignment（过度比对）
Coffee objective function	Coffee objective function（目标函数）
MSA consistency	MSA consistency（一致性）
thread-safe	thread-safe（线程安全）

063

Viewing a Multiple Alignment

PDF pages 262-266; print pages 242-246

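An MSA is very hard to view without visualization software that highlights particular features of the alignment. For example, different fonts, colors, or shading can be used to emphasize conserved columns or motifs. An alignment can also be annotated by displaying structural or functional features in different regions. There are dedicated alignment viewing packages, as well as packages that include very good viewing facilities; some commonly used ones are described below (see Internet Resources). Two of these packages (SeaView and Jalview) also include very powerful MSA editing capabilities.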

Clustal X

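Clustal X (Thompson et al. 1997) was created by adding a GUI to the existing Clustal W package (Thompson et al. 1994), with a GUI that was portable across all widely used operating systems. The two packages share the same alignment engine and were subsequently developed and maintained in parallel. Unaligned or aligned sequences are displayed in a scrollable window, and the default color scheme highlights highly conserved residues in each column. Clustal X includes tools for adjusting the alignment display, such as user-adjustable color schemes and font sizes, and options for highlighting poorly conserved blocks, columns, or sequences. Alignments can also be exported as high-quality PostScript files suitable for publication. These coloring features work best for amino acid sequences, but nucleotide sequences can be viewed as well. Clustal X is no longer under active development, but because of its portability, robustness, and ease of use it remains freely available and widely used. It runs as a desktop application on all widely used operating systems.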

Jalview

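Jalview is an open-source MSA editor and analysis workbench that runs on Windows, Mac, and Linux platforms (Waterhouse et al. 2009). Jalview focuses on multiple sequence alignment and functional analysis at the level of gene, protein, or RNA families rather than whole genomes. In addition to sophisticated interactive multiple alignment editing for DNA, RNA, and protein sequences (including "undo," multiple "views," and the ability to subset and "hide" sequences and columns of an alignment), Jalview provides linked views of trees, DNA and protein sequences, protein three-dimensional structures displayed via Jmol or Chimera (Pettersen et al. 2004), and RNA secondary structures displayed via VARNA (Darty et al. 2009). Figure 8.6 shows two examples: a protein alignment linked to a protein structure display, and an RNA alignment linked to an RNA secondary structure display.

Jalview connects to the major public databases of sequences, alignments, and three-dimensional structures, making it easy to retrieve these resources along with sequence annotations (e.g. active site descriptions). Jalview supports a wide range of annotation methods, which can apply to individual sequences or be calculated from alignment columns and are displayed above or below the alignment. It also includes a split-screen DNA/RNA/protein view that links a DNA alignment to the alignment of the corresponding protein sequences so that they can be edited and analyzed together; Figure 8.7 shows an example. This view also allows population variation data, single-nucleotide polymorphisms (SNPs), and other genomic features such as gene exons to be mapped onto protein sequences and three-dimensional structures. For example, a Jalview user can look up proteins in UniProt, cross-reference back to the full genes and transcripts in Ensembl to see any known SNPs in the alignment, and then, with a few mouse clicks, view the three-dimensional protein structure (if available) and the SNP positions.

To generate alignments, Jalview provides direct access to eight popular multiple sequence alignment algorithms and allows users to modify the parameters of each method (Troshin et al. 2011). Users can thus align and realign interactively and compare the alignments produced by different methods and parameter combinations. Jalview also provides direct access to the JPred protein secondary structure prediction algorithm (Drozdetskiy et al. 2015), which predicts protein secondary structure and solvent accessibility from a single sequence or an MSA. Jalview includes four protein disorder prediction algorithms as well as the RNAalifold program (Bernhart et al. 2008), which predicts RNA secondary structure from RNA multiple alignments, via JABAWS2.2. For conservation analysis, Jalview offers 17 different amino acid conservation scoring methods through the AACon package, together with the SMERFS functional site prediction algorithm. The Jalview web site includes training materials and manuals, and the online training YouTube channel offers more than 20 short video tutorials on basic and advanced Jalview features.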

SeaView

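SeaView (Galtier et al. 1996) is an MSA editor that is particularly well suited to linking alignment views with MSA and phylogenetic packages. It handles both nucleotide and amino acid alignments. SeaView can read and write a wide range of MSA file formats and can call MUSCLE or Clustal Omega directly to create an MSA. The user can then edit the alignment and call the Gblocks filter program to remove poorly aligned regions. The package can compute phylogenetic trees using several methods, including maximum parsimony (using Protpars from the PHYLIP package; Felsenstein 1981), neighbor joining (Saitou and Nei 1987), or ML analysis using Phyml (Guindon and Gascuel 2003). SeaView offers a very direct and robust way of going from unaligned sequences to a complete phylogenetic analysis within a single framework.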

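Figure 8.6

Protein and RNA multiple sequence alignments visualized with Jalview. The left-hand panes show a protein multiple sequence alignment with different feature coloring, a tree, and a Jmol molecular structure view. All windows are linked, so clicking on a residue or sequence in one window highlights the corresponding residue or sequence in all the others. The right-hand side shows an RNA multiple sequence alignment with the corresponding secondary structure information displayed in VARNA.

Figure 8.7

Linked coding sequence (CDS), protein, and three-dimensional structure views in Jalview, showing the positions of known single-nucleotide polymorphisms (SNPs). A text search in Jalview found a set of related protein sequences in UniProt. Jalview then cross-referenced these sequences to the CDS data in Ensembl. The protein sequences were multiply aligned with Clustal Omega. Finally, the three-dimensional structure of one of the proteins is displayed in the linked Chimera application. Red and green positions in the alignment highlight the locations of known SNPs retrieved from Ensembl.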

ProViz

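ProViz (Jehl et al. 2016) is a recently re-released package for viewing pre-computed protein sequence alignments overlaid with feature annotation, especially functional domains. The alignments and the links to databases of functional information are pre-computed, and the viewer displays sequence information from many sources in an integrated way. ProViz can be run online or downloaded and run locally. The simplest way in is to enter the ID, name, or a keyword for a protein or gene of interest; the viewer then displays the alignment containing that protein. Users can also enter their own protein sequences or multiple sequence alignments. The data sources used by ProViz are listed in Internet Resources.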

术语表（10 条）

English	中文
Viewing a Multiple Alignment	查看多序列比对
MSA editor and analysis workbench	MSA 编辑器与分析工作台
alignment engine	alignment engine（比对引擎）
conserved columns / motifs	conserved columns（保守列）/ motifs（基序）
linked views	linked views（联动视图）
single-nucleotide polymorphisms (SNPs)	single-nucleotide polymorphisms（SNPs，单核苷酸多态性）
maximum parsimony	maximum parsimony（最大简约法）
neighbor joining	neighbor joining（邻接法）
feature annotation	feature annotation（特征注释）
functional domains	functional domains（功能结构域）

064

Summary + Internet Resources + References

PDF pages 266-270; print pages 246-250

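Summary

Even fairly large datasets of thousands of sequences can be aligned quickly using online services or on Linux-based laptops and desktops. MSAs feed into a great many downstream analyses and appear in almost all phylogenetic analyses, many structural analyses, and studies of sequence similarity.

Many packages are currently available, but none can be said to give the "best" alignment in all cases; all of them use computational shortcuts of various kinds to make the calculation tractable. The packages have different strengths and weaknesses, so it is good practice to inspect the alignments directly with an alignment viewer and to try different programs. Some web sites and alignment-viewing packages support several of the most widely used programs behind a consistent interface, which makes trying and comparing them easier.

The most important consideration is always the nature and quality of the input sequences themselves. The input sequences must be similar enough to be aligned reliably. Keep in mind, too, that the more fragmentary or outlier sequences are included, the more fragmented the alignment will be. Clean datasets generally give clean alignments, which are easy both to view by eye and to analyze further.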

Internet Resources

Multiple sequence alignment software versions

Software	Version	URL	Online availability
Clustal Omega	v1.2.3	`www.clustal.org/omega`	EMP
ClustalW2	v2.1	`www.clustal.org/clustal2`	--P
DIALIGN	v2.2.2	`dialign.gobics.de`	---
Kalign	v2.04	`msa.sbc.su.se/cgi-bin/msa.cgi`	---
MAFFT	v7.309	`mafft.cbrc.jp/alignment/software`	EMP
MUSCLE	v3.8.31	`www.drive5.com/muscle`	EMP
PASTA	v1.6.4	`github.com/smirarab/pasta`	---
PRANK	v.150803	`wasabiapp.org/software/prank`	E--
T-Coffee	11.00.8cbe486	`www.tcoffee.org/Projects/tcoffee/index.html`	EMP

Key to online availability: the three sites are the EBI (E, `www.ebi.ac.uk/services`), the MPI for Genetics in Tübingen (M, `toolkit.tuebingen.mpg.de`), and the Pasteur Institute Galaxy server (P, `galaxy.pasteur.fr`).

Multiple sequence alignment visualization packages

Software	Description	URL
ClustalX	Desktop MSA version of ClustalW	`www.clustal.org`
Jalview	Alignment editor and viewer	`www.jalview.org`
SeaView	Alignment editor and viewer	`doua.prabi.fr/software/seaview`
ProViz	Alignment and annotation viewer	`proviz.ucd.ie`

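Data sources used by ProViz for protein alignment visualization

Multiple sequence alignments

Resource	Description	URL
GeneTree	Alignments of homologs, paralogs, and orthologs, with gene duplication information	`www.ensembl.org`
GOPHER	Ortholog alignments obtained by reciprocal best hit	`bioware.ucd.ie`
Quest for Orthologs	Datasets of homologous genes	`questfororthologs.org`

Protein modularity

Resource	Description	URL
ELM	Manually curated linear motifs	`elm.eu.org`
Pfam	Functional regions and binding domains	`pfam.xfam.org`
Phospho.ELM	Experimentally validated phosphorylation sites	`phospho.elm.eu.org`

Structural information

Resource	Description	URL
DSSP	Secondary structure derived from PDB tertiary structures	`swift.cmbi.ru.nl/gv/dssp`
Homology models / SWISS-MODEL	Tertiary structure assigned by sequence similarity to solved structures	`swissmodel.expasy.org`
Protein Data Bank (PDB)	Experimentally solved protein tertiary structures	`www.rcsb.org`

Genomic data

Resource	Description	URL
1000 Genomes	Single-nucleotide polymorphisms	`www.1000genomes.org`
dbSNP	Single-nucleotide polymorphisms, with disease associations and genotype information	`www.ncbi.nlm.nih.gov/SNP`
Isoforms	Alternative splicing	`www.uniprot.org`

Other manually curated data

Resource	Description	URL
Mutagenesis	Experimentally validated point mutations and their effects	`www.uniprot.org`
Regions of interest	Experimentally validated functional regions	`www.uniprot.org`
Switches.ELM	Experimentally validated motif-based molecular switches	`switches.elm.eu.org`

Predictions

Resource	Description	URL
Anchor	Binding sites in disordered regions	`anchor.enzim.hu`
Conservation	Conservation of residues in the alignment	`bioware.ucd.ie`
ELM	Linear motifs identified by regular expressions	`elm.eu.org`
IUPred	Intrinsically disordered regions	`iupred.enzim.hu`
MobiDB	Collection of disorder prediction methods	`mobidb.bio.unipd.it`
PsiPred	Secondary structure of human proteins	`bioinf.cs.ucl.ac.uk/psipred`

References

The reference entries below are kept in the original English of the book: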

Altschul, S.F., Madden, T.L., Schäffer, A.A. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389–3402.

Bernhart, S.H., Hofacker, I.L., Will, S. et al. (2008). RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinf. 9:474.

Blackshields, G., Sievers, F., Shi, W. et al. (2010). Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol. Biol. 5:21. https://doi.org/10.1186/1748-7188-5-21.

Chatzou, M., Magis, C., Chang, J.M. et al. (2016). Multiple sequence alignment modeling: methods and applications. Brief. Bioinform. 17(6):1009–1023.

Cline, M., Hughey, R., and Karplus, K. (2002). Predicting reliable regions in protein sequence alignments. Bioinformatics. 18(2):306–314.

Darty, K., Denise, A., and Ponty, Y. (2009). VARNA: interactive drawing and editing of the RNA secondary structure. Bioinformatics 25(15):1974–1975.

Do, C.B., Mahabhashyam, M.S., Brudno, M., and Batzoglou, S. (2005). ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 15(2):330–340.

Drozdetskiy, A., Cole, C., Procter, J., and Barton, G.J. (2015). JPred4: a protein secondary structure prediction server. Nucleic Acids Res. 43(W1):W389–W394. https://doi.org/10.1093/nar/gkv332.

Eddy, S.R. (2009). A new generation of homology search tools based on probabilistic inference. Genome Inf. 23(1):205–211.

Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Edgar, R.C. (2010). Quality measures for protein alignment benchmarks. Nucleic Acids Res. 38(7):2145–2153. https://doi.org/10.1093/nar/gkp1196.

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17(6):368–376.

Feng, D.F. and Doolittle, R.F. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25(4):351–360.

Finn, R.D., Clements, J., and Eddy, S.R. (2011). HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39(Web Server issue):W29–W37. https://doi.org/10.1093/nar/gkr367.

Finn, R.D., Bateman, A., Clements, J. et al. (2014). Pfam: the protein families database. Nucleic Acids Res. 42(Database issue):D222–D230. https://doi.org/10.1093/nar/gkt1223.

Finn, R.D., Coggill, P., Eberhardt, R.Y. et al. (2016). The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44(D1):D279–D285. https://doi.org/10.1093/nar/gkv1344.

Fox, G., Sievers, F., and Higgins, D.G. (2016). Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments. Bioinformatics. 32(6):814–820. https://doi.org/10.1093/bioinformatics/btv592.

Galtier, N., Gouy, M., and Gautier, C. (1996). SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny. Comput. Appl. Biosci. 12(6):543–548.

Gouy, M., Guindon, S., and Gascuel, O. (2010). SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol. Biol. Evol. 27(2):221–224. https://doi.org/10.1093/molbev/msp259.

Guindon, S. and Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52(5):696–704.

Henikoff, S. and Henikoff, J.G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA. 89(22):10915–10919.

Higgins, D.G., Bleasby, A.J., and Fuchs, R. (1992). CLUSTAL V: improved software for multiple sequence alignment. Comput. Appl. Biosci. 8(2):189–191.

Hogeweg, P. and Hesper, B. (1984). The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J. Mol. Evol. 20(2):175–186.

Iantorno, S., Gori, K., Goldman, N. et al. (2014). Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. Methods Mol. Biol. 1079:59–73. https://doi.org/10.1007/978-1-62703-646-7_4.

Jehl, P., Manguy, J., Shields, D.C. et al. (2016). ProViz-a web-based visualization tool to investigate the functional and evolutionary features of protein sequences. Nucleic Acids Res. 44(W1):W11–W15. https://doi.org/10.1093/nar/gkw265.

Katoh, K., Kuma, K., Miyata, T., and Toh, H. (2005a). Improvement in the accuracy of multiple sequence alignment program MAFFT. Genome Inf. 16(1):22–33.

Katoh, K., Kuma, K., Toh, H., and Miyata, T. (2005b). MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33(2):511–518.

Kimura, M. (1983). The Neutral Theory of Molecular Evolution, 75. Cambridge, UK: Cambridge University Press.

Konagurthu, A.S., Whisstock, J.C., Stuckey, P.J., and Lesk, A.M. (2006). MUSTANG: a multiple structural alignment algorithm. Proteins. 64(3):559–574.

Larkin, M.A., Blackshields, G., Brown, N.P. et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics. 23(21):2947–2948.

Lassmann, T. and Sonnhammer, E.L. (2005). Kalign–an accurate and fast multiple sequence alignment algorithm. BMC Bioinf. 6:298.

Le, Q., Sievers, F., and Higgins, D.G. (2017). Protein multiple sequence alignment benchmarking through secondary structure prediction. Bioinformatics. 33(9):1331–1337. https://doi.org/10.1093/bioinformatics/btw840.

Liu, K., Raghavan, S., Nelesen, S. et al. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324(5934):1561–1564. https://doi.org/10.1126/science.1171243.

Löytynoja, A. and Goldman, N. (2005). An algorithm for progressive multiple alignment of sequences with insertions. Proc. Natl. Acad. Sci. USA. 102(30):10557–10562.

Marks, D.S., Colwell, L.J., Sheridan, R. et al. (2011). Protein 3D structure computed from evolutionary sequence variation. PLoS One. 6(12):e28766. https://doi.org/10.1371/journal.pone.0028766.

Mirarab, S., Nguyen, N., Guo, S. et al. (2015). PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biol. 22(5):377–386. https://doi.org/10.1089/cmb.2014.0156.

Mizuguchi, K., Deane, C.M., Blundell, T.L., and Overington, J.P. (1998). HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Morgenstern, B., Frech, K., Dress, A., and Werner, T. (1998). DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics. 14(3):290–294.

Muth, R. and Manber, U. (1996). Approximate multiple string search. In: Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, Laguna Beach, CA (10–12 June 1996), vol. 1075, 75–86. Berlin, Germany: Springer.

Needleman, S.B. and Wunsch, C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3):443–453.

Notredame, C., Higgins, D.G., and Heringa, J. (2000). T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1):205–217.

Notredame, C., Holm, L., and Higgins, D.G. (1998). COFFEE: an objective function for multiple sequence alignments. Bioinformatics. 14(5):407–422.

Pettersen, E.F., Goddard, T.D., Huang, C.C. et al. (2004). UCSF Chimera: a visualization system for exploratory research and analysis. J. Comput. Chem. 25(13):1605–1612.

Price, M.N., Dehal, P.S., and Arkin, A.P. (2010). FastTree2–approximately maximum-likelihood trees for large alignments. PLoS One. 5(3):e9490. https://doi.org/10.1371/journal.pone.0009490.

Raghava, G.P., Searle, S.M., Audley, P.C. et al. (2003). OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinf. 4:47.

Russell, R.B. and Barton, G.J. (1992). Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins. 14(2):309–323.

Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4):406–425.

Sankoff, D., Morel, C., and Cedergren, R.J. (1973). Evolution of 5S rRNA and the non-randomness of base replacement. Nature. 245:232–234.

Sauder, J.M., Arthur, J.W., and Dunbrack, R.L. Jr. (2000). Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins. 40(1):6–22.

Schmollinger, M., Nieselt, K., Kaufmann, M., and Morgenstern, B. (2004). DIALIGNP: fast pair-wise and multiple sequence alignment using parallel processors. BMC Bioinf. 5:128.

Sievers, F., Wilm, A., Dineen, D. et al. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7:539. https://doi.org/10.1038/msb.2011.75.

Sievers, F., Dineen, D., Wilm, A., and Higgins, D.G. (2013). Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics. 29(8):989–995. https://doi.org/10.1093/bioinformatics/btt093.

Sievers, F., Hughes, G.M., and Higgins, D.G. (2014). Systematic exploration of guide-tree topology effects for small protein alignments. BMC Bioinf. 15:338. https://doi.org/10.1186/1471-2105-15-338.

Söding, J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics. 21(7):951–960.

Sokal, R. and Michener, C. (1958). A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 38:1409–1438.

Taylor, W.R. and Orengo, C.A. (1989). Protein structure alignment. J. Mol. Biol. 208(1):1–22.

Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994). CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22):4673–4680.

Thompson, J.D., Gibson, T.J., Plewniak, F. et al. (1997). The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25(24):4876–4882.

Thompson, J.D., Plewniak, F., and Poch, O. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27(13):2682–2690.

Thompson, J.D., Koehl, P., Ripp, R., and Poch, O. (2005). BAliBASE3.0: latest developments of the multiple sequence alignment benchmark. Proteins. 61(1):127–136.

Troshin, P.V., Procter, J.B., and Barton, G.J. (2011). Java bioinformatics analysis web services for multiple sequence alignment–JABAWS:MSA. Bioinformatics. 27(14):2001–2002.

Van Walle, I., Lasters, I., and Wyns, L. (2005). SABmark–a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics. 21(7):1267–1268.

Waterhouse, A.M., Procter, J.B., Martin, D.M. et al. (2009). Jalview version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics. 25(9):1189–1191. https://doi.org/10.1093/bioinformatics/btp033.

Wheeler, T.J. and Kececioglu, J.D. (2007). Multiple alignment by aligning alignments. Bioinformatics. 23(13):i559–i568.
