Chapter 3

Assessing Pairwise Sequence Similarity: BLAST and FASTA

11 sections

017

Introduction

PDF page 65 to top of PDF page 66; print pages 45-46

English Source (PDF extracted)

Assessing Pairwise Sequence Similarity: BLAST and FASTA

Andreas D. Baxevanis

Introduction

One of the cornerstones of bioinformatics is the process of comparing nucleotide or protein sequences in order to deduce how the sequences are related to one another. Through this type of comparative analysis, one can draw inferences regarding whether two proteins have similar function, contain similar structural motifs, or have a discernible evolutionary relationship. This chapter focuses on pairwise alignments, where two sequences are directly compared, position by position, to deduce these relationships. Another approach, multiple sequence alignment, is used to identify important features common to three or more sequences; this approach, which is often used to predict secondary structure and functional motifs and to identify conserved positions and residues important to both structure and function, is discussed in Chapter 8.

Before entering into any discussion of how relatedness between nucleotide or protein sequences is assessed, two important terms need to be defined: similarity and homology. These terms tend to be used interchangeably when, in fact, they mean quite different things and imply quite different biological relationships.

Similarity is a quantitative measure of how related two sequences are to one another. Similarity is always based on an observable – usually pairwise alignment of two sequences. When two sequences are aligned, one can simply count how many residues line up with one another, and this raw count can then be converted to the most commonly used measure of similarity: percent identity. Measures of similarity are used to quantify changes that occur as two sequences diverge over evolutionary time, considering the effect of substitutions, insertions, or deletions. They can also be used to identify residues that are crucial for maintaining a protein’s structure or function. In short, a high percentage of sequence similarity may imply a common evolutionary history or a possible commonality in biological function.
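As a minimal sketch of the percent identity calculation described above, the following Python function counts identical aligned residues and divides by the alignment length. The function name, the `-` gap character, and the choice of denominator (all alignment columns, gaps included) are assumptions made for illustration; different tools define the denominator differently.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two already-aligned sequences of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count columns where both sequences carry the same residue (gaps never match).
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    # Assumption: the denominator is the full alignment length, gaps included.
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Hypothetical alignment: 4 identical columns out of 6.
    print(round(percent_identity("ACGT-A", "ACCTGA"), 1))
```

Note that the result is a measure of similarity only; deciding whether the two sequences are homologous is a separate inference.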

In contrast, homology implies an evolutionary relationship and is the putative conclusion reached based on examining the optimal alignment between two sequences and assessing their similarity. Genes (and their protein products) either are or are not homologous – homology is not measured in degrees or percentages. The concept of homology and the term homolog may apply to two different types of relationships, as follows.

• If genes are separated by the event of speciation, they are termed orthologous. Orthologs are direct descendants of a sequence in a common ancestor, and they may have similar domain structure, three-dimensional structure, and biological function. Put simply, orthologs can be thought of as the same gene (or protein) in different species.

• If genes within the same species are separated by a genetic duplication event, they are termed paralogous. The examination of paralogs provides insight into how pre-existing genes may have been adapted or co-opted toward providing a new or modified function within a given species.

Bioinformatics, Fourth Edition. Edited by Andreas D. Baxevanis, Gary D. Bader, and David S. Wishart.

Companion Website: www.wiley.com/go/baxevanis/Bioinformatics_4e


The concepts of homology, orthology, and paralogy and methods for determining the evolutionary relationships between sequences are covered in much greater detail in Chapter 9.

中文译文

评估双序列相似性：BLAST 和 FASTA

> 来源：Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 4th ed.

> 作者：Andreas D. Baxevanis

> 范围：PDF page 65 - PDF page 66 顶部；印刷页码 45-46。

> 用途：

引言

生物信息学的基石之一，是比较核苷酸序列或蛋白质序列，并据此推断这些序列之间的关系。通过这种比较分析，研究者可以推断两个蛋白质是否具有相似功能、是否包含相似的结构基序，或者是否存在可识别的进化关系。本章关注双序列比对（pairwise alignments）：即直接将两条序列逐位比较，以推断这些关系。另一种方法是多序列比对（multiple sequence alignment），用于识别三条或更多序列之间共有的重要特征；这种方法常用于预测二级结构和功能基序，并识别对结构和功能都很重要的保守位置与保守残基，第 8 章将对此进行讨论。

在讨论如何评估核苷酸序列或蛋白质序列之间的相关性之前，需要先定义两个重要术语：相似性（similarity）和同源性（homology）。这两个术语常被混用，但事实上，它们含义明显不同，所隐含的生物学关系也非常不同。

相似性是衡量两条序列彼此相关程度的定量指标。相似性始终基于可观察的结果，通常就是两条序列的双序列比对。当两条序列完成比对后，可以直接统计有多少残基彼此对齐；这个原始计数随后可以转换为最常用的相似性度量，即序列一致性百分比（percent identity）。相似性指标可用于量化两条序列在进化时间中逐渐分化时发生的变化，包括替换、插入和缺失的影响。它们也可用于识别那些对维持蛋白质结构或功能至关重要的残基。简言之，较高的序列相似性百分比可能提示共同的进化历史，或者提示生物学功能上可能存在共性。

相比之下，同源性意味着一种进化关系；它是在考察两条序列之间的最优比对并评估其相似性之后，提出的一种推断性结论。基因（及其蛋白质产物）要么同源，要么不同源——同源性不能用程度或百分比来衡量。同源性这一概念以及 homolog 这一术语，可适用于两类不同关系：

如果基因是由物种分化事件分隔开的，则称为直系同源（orthologous）。直系同源基因是共同祖先中某一序列的直接后代，可能具有相似的结构域组成、三维结构和生物学功能。简单来说，直系同源基因可以理解为不同物种中的同一个基因（或蛋白质）。
如果同一物种内的基因是由基因复制事件分隔开的，则称为旁系同源（paralogous）。研究旁系同源基因，有助于理解既有基因如何被适应性改造或被共同利用，从而在某一物种内提供新的或经过修饰的功能。

关于同源性、直系同源和旁系同源的概念，以及判断序列之间进化关系的方法，第 9 章将进行更详细的讨论。

术语表（9 条）

English	中文
pairwise alignment	双序列比对
multiple sequence alignment	多序列比对
similarity	相似性
homology	同源性
percent identity	序列一致性百分比
orthologous / ortholog	直系同源 / 直系同源基因
paralogous / paralog	旁系同源 / 旁系同源基因
structural motifs	结构基序
conserved positions / residues	保守位置 / 保守残基

018

Global Versus Local Sequence Alignments

PDF page 66; print page 46

English Source (PDF extracted)

Global Versus Local Sequence Alignments

The methods used to assess similarity (and, in turn, infer homology) can be grouped into two types: global sequence alignment and local sequence alignment. Global sequence alignment methods take two sequences and try to come up with the best alignment of the two sequences across their entire length. In general, global sequence alignment methods are most applicable to highly similar sequences of approximately the same length. Although these methods can be applied to any two sequences, as the degree of sequence similarity declines, they will tend to miss important biological relationships between sequences that may not be apparent when considering the sequences in their entirety.

Most biologists instead depend on the second class of alignment algorithm – local sequence alignments. In these methods, the sequence comparison is intended to find the most similar regions within the two sequences being aligned, rather than finding (or forcing) an alignment over the entire length of the two sequences being compared. As such, and by focusing on subsequences of high similarity that are more easily alignable, determining putative biological relationships between the two sequences being compared becomes a much easier proposition. This makes local alignment methods one of the approaches of choice for biological discovery. Oftentimes, these methods will return more than one result for the two sequences being compared, as there may be more than one domain or subsequence common to the sequences being analyzed. Local sequence alignment methods are best for sequences that share some degree of similarity or for sequences of different lengths, and the ensuing discussion will focus mostly on these methods.
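The contrast between the two classes can be illustrated with a small dynamic-programming sketch. The text does not name specific algorithms; the code below uses the classic Needleman-Wunsch (global) and Smith-Waterman (local) recurrences as stand-ins, with a simple match/mismatch score and a fixed per-position gap cost. The function name and all scoring values are hypothetical choices for illustration.

```python
def align_score(a: str, b: str, local: bool,
                match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score of a and b; global if local is False, else local."""
    cols = len(b) + 1
    # Global alignment pays for leading gaps; local alignment may start anywhere.
    prev = [0] * cols if local else [j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                        prev[j] + gap,     # gap in b
                        curr[j - 1] + gap)  # gap in a
            if local:
                # A local alignment can restart whenever the score drops below 0.
                score = max(score, 0)
                best = max(best, score)
            curr[j] = score
        prev = curr
    # Global: score of aligning both full sequences. Local: best-scoring region.
    return best if local else prev[-1]


if __name__ == "__main__":
    # Two hypothetical sequences that share only the region WYVW.
    print(align_score("AAAAWYVW", "WYVWCCCC", local=True))
    print(align_score("AAAAWYVW", "WYVWCCCC", local=False))
```

In this toy case the local score reflects the shared region, while the global score is dragged down by the unrelated flanking residues, which is the behavior the paragraph above describes.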

中文译文

全局序列比对与局部序列比对

> 来源：Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 4th ed.

> 范围：PDF page 66；印刷页码 46。

> 用途：

用于评估相似性，并进一步推断同源性的序列比对方法，可以分为两类：全局序列比对（global sequence alignment）和局部序列比对（local sequence alignment）。全局序列比对方法会取两条序列，并尝试在两条序列的全长范围内给出最佳比对。一般来说，全局序列比对方法最适用于长度大致相同、且相似性很高的序列。虽然这类方法可以应用于任意两条序列，但随着序列相似性降低，它们往往会漏掉一些重要的生物学关系；这些关系在把序列作为整体来考察时，可能并不明显。

相比之下，大多数生物学家更依赖第二类比对算法，即局部序列比对。在这类方法中，序列比较的目标是在被比对的两条序列中找到最相似的区域，而不是在两条序列的全长范围内寻找（或强行建立）一个比对。因此，局部比对聚焦于更容易比对的高相似性子序列，使得判断两条序列之间可能存在的生物学关系变得容易得多。这也使局部比对方法成为生物学发现中常用的首选方法之一。

很多时候，这类方法会为被比较的两条序列返回不止一个结果，因为所分析的序列之间可能存在不止一个共同结构域或共同子序列。局部序列比对方法最适合用于具有一定相似性的序列，或长度不同的序列；接下来的讨论也将主要围绕这类方法展开。

术语表（6 条）

English	中文
global sequence alignment	全局序列比对
local sequence alignment	局部序列比对
subsequence	子序列
domain	结构域
similarity	相似性
homology	同源性

019

Scoring Matrices

PDF page 66 to upper half of PDF page 72; print pages 46-52

English Source (PDF extracted)

Scoring Matrices

Whether one uses a global or local alignment method, once the two sequences under consideration are aligned, how does one actually measure how good the alignment is between “sequence A” and “sequence B”? The first step toward answering that question involves numerical methods that consider not just the position-by-position overlap between two sequences but also the nature and characteristics of the residues or nucleotides being aligned.

Much effort has been devoted to the development of constructs called scoring matrices. These matrices are empirical weighting schemes that appear in all analyses involving the comparison of two or more sequences, so it is important to understand how these matrices are constructed and how to choose between matrices. The choice of matrix can (and does) strongly influence the results obtained with most sequence comparison methods.

The most commonly used protein scoring matrices consider the following three major biological factors.

Conservation. The matrices need to consider absolute conservation between protein sequences and also need to provide a way to assess conservative amino acid substitutions. The numbers within the scoring matrix provide a way of representing what amino acid residues are capable of substituting for other residues while not adversely affecting the function of the native protein. From a physicochemical standpoint, characteristics such as residue charge, size, and hydrophobicity (among others) need to be similar.

Frequency. In the same way that amino acid residues cannot freely substitute for one another, the matrices also need to reflect how often particular residues occur among the entire constellation of proteins. Residues that are rare are given more weight than residues that are more common.

Evolution. By design, scoring matrices implicitly represent evolutionary patterns, and matrices can be adjusted to favor the detection of closely related or more distantly related proteins. The choice of matrices for different evolutionary distances is discussed below.

Figure 3.1 The BLOSUM62 scoring matrix (Henikoff and Henikoff 1992). BLOSUM62 is the most widely used scoring matrix for protein analysis and provides best coverage for general-use cases. Standard single-letter codes to the left of each row and at the top of each column specify each of the 20 amino acids. The ambiguity codes B (for asparagine or aspartic acid; Asx) and Z (for glutamine or glutamic acid; Glx) also appear, as well as an X (denoting any amino acid). Note that the matrix is a mirror image of itself with respect to the diagonal. See text for details.

There are also subtle nuances that go into constructing a scoring matrix, and these are described in an excellent review by Henikoff and Henikoff (2000).

How these various factors are actually represented within a scoring matrix can be best demonstrated by deconstructing the most commonly used scoring matrix, called BLOSUM62 (Figure 3.1). Each of the 20 amino acids (as well as the standard ambiguity codes) is shown along the top and down the side of a matrix. The scores in the matrix actually represent the logarithm of an odds ratio (Box 3.1) that considers how often a particular residue is observed, in nature, to replace another residue. The odds ratio also considers how often a particular residue would be replaced by another if replacements occurred in a random fashion (purely by chance). Given this, a positive score indicates two residues that are seen to replace each other more often than by chance, and a negative score indicates two residues that are seen to replace each other less frequently than would be expected by chance. Put more simply, frequently observed substitutions have positive scores and infrequently observed substitutions have negative scores.


Box 3.1 Scoring Matrices and the Log Odds Ratio

Protein scoring matrices are derived from the observed replacement frequencies of amino acids for one another. Based on these probabilities, the scoring matrices are generated by applying the following equation:

$$S_{i,j} = \log\left[\frac{q_{i,j}}{p_i\,p_j}\right]$$

where $p_i$ is the probability with which residue $i$ occurs among all proteins and $p_j$ is the probability with which residue $j$ occurs among all proteins. The quantity $q_{i,j}$ represents how often the two amino acids $i$ and $j$ are seen to align with one another in multiple sequence alignments of protein families or in sequences that are known to have a biological relationship. Therefore, the log odds ratio $S_{i,j}$ (or “lod score”) represents the logarithm of the ratio of observed vs. random frequency for the substitution of residue $i$ by residue $j$. For commonly observed substitutions, $S_{i,j}$ will be greater than zero. For substitutions that occur less frequently than would be expected by chance, $S_{i,j}$ will be less than zero. If the observed frequency and the random frequency are the same, $S_{i,j}$ will be zero.
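The three cases in Box 3.1 (score above, below, or equal to zero) can be checked numerically with a short sketch. The function name, the base-2 logarithm, and the frequencies used are assumptions for illustration only; the box does not specify a logarithm base, and published matrices additionally scale and round the values.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, base: float = 2.0) -> float:
    """Log odds score: log of observed pair frequency over chance expectation."""
    if min(q_ij, p_i, p_j) <= 0:
        raise ValueError("frequencies must be positive")
    # p_i * p_j is how often the pair would align if residues paired at random.
    return math.log(q_ij / (p_i * p_j), base)


if __name__ == "__main__":
    # Hypothetical frequencies: both residues occur 10% of the time,
    # so the chance expectation for the pair is 0.01.
    print(log_odds(0.02, 0.1, 0.1))   # observed twice as often as chance: positive
    print(log_odds(0.005, 0.1, 0.1))  # observed half as often as chance: negative
    print(log_odds(0.01, 0.1, 0.1))   # observed exactly as often as chance: about 0
```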

To explain the meaning of the numbers in the matrix more fully, imagine that two sequences have been aligned with one another, and it is now necessary to assess how well a residue in sequence A matches to a residue in sequence B at any given position of the alignment. Using the scoring matrix in Figure 3.1 as our starting point,

• The values on the diagonal represent the score that would be conferred for an exact match at a given position, and these numbers are always positive. So, if a tryptophan residue (W) in sequence A is aligned with a tryptophan residue in sequence B, this match would be conferred 11 points, the value where the row marked W intersects the column marked W. Also notice that 11 is the highest value on the diagonal, so the high number of points assigned to a W:W alignment reflects not only the exact match but also the fact that tryptophan is the rarest of amino acids found in proteins. Put otherwise, the W:W alignment is much less likely to occur in general and, in turn, is more likely to be correct.

• Moving off the diagonal, consider the case of a conservative substitution: a tyrosine (Y) for a tryptophan. The intersection of the row marked Y with the column marked W yields a value of 2. The positive value implies that the substitution is observed to occur more often in an alignment than it would by chance, but the replacement is not as good as if the tryptophan residue had been preserved (2 < 11) or if the tyrosine residue had been preserved (2 < 7).

• Finally, consider the case of a non-conservative substitution: a valine (V) for a tryptophan. The intersection of the row marked V with the column marked W yields a value of −3. The negative value implies that the substitution is not observed to occur frequently and may arise more often than not by chance.
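The three lookups above can be reproduced with a small sketch that stores only a fragment of BLOSUM62 and sums the per-position scores of an ungapped alignment. The W:W, Y:W, Y:Y, and V:W values are those quoted in the text; the V:V and V:Y entries are added from the standard BLOSUM62 matrix so that the fragment is complete for three residues. The dictionary and function names are hypothetical.

```python
# Fragment of BLOSUM62 covering only W, Y, and V (one triangle of the
# symmetric matrix is stored; the lookup handles either residue order).
BLOSUM62_SUBSET = {
    ("W", "W"): 11, ("Y", "Y"): 7, ("V", "V"): 4,
    ("W", "Y"): 2, ("V", "W"): -3, ("V", "Y"): -1,
}


def pair_score(a: str, b: str) -> int:
    """Score for aligning residue a with residue b."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]


def ungapped_score(seq_a: str, seq_b: str) -> int:
    """Sum of per-position scores for two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    # W:W (exact match), Y:W (conservative), V:W (non-conservative).
    print(pair_score("W", "W"), pair_score("Y", "W"), pair_score("V", "W"))
    print(ungapped_score("WYV", "WWW"))
```

Storing one triangle works because the matrix is a mirror image of itself about the diagonal, as the caption of Figure 3.1 notes.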

Although the meaning of the numbers and relationships within the scoring matrices seems straightforward enough, some value judgments need to be made as to what actually constitutes a conservative or non-conservative substitution and how to assess the frequency of either of those events in nature. This is the major factor that differentiates scoring matrices from one another. To help the reader make an intelligent choice, a discussion of the approach, advantages, and disadvantages of the various available matrices is in order.

PAM Matrices

The first useful matrices for protein sequence analysis were developed by Dayhoff et al. (1978). The basis for these matrices was the examination of substitution patterns in a group of proteins that shared more than 85% sequence identity. The analysis yielded 1572 changes in the 71 groups of closely related proteins that were examined. Using these results, tables were constructed that indicated the frequency of a given amino acid substituting for another amino acid at a given position.

As the sequences examined shared such a high degree of similarity, the resulting frequencies represent what would be expected over short evolutionary distances. Further, given the close evolutionary relationship between these proteins, one would expect that the observed mutations would not significantly change the function of the protein. This is termed acceptance: changes that can be accommodated through natural selection and result in a protein with the same or similar function as the original. As individual point mutations were considered, the unit of measure resulting from this analysis is the point accepted mutation or PAM unit. One PAM unit corresponds to one amino acid change per 100 residues, or roughly 1% divergence.

Several assumptions went into the construction of the PAM matrices. One of the most important assumptions was that the replacement of an amino acid is independent of previous mutations at the same position. Based on this assumption, the original matrix was extrapolated to come up with predicted substitution frequencies at longer evolutionary distances. For example, the PAM1 matrix could be raised to the 100th power (that is, multiplied by itself repeatedly) to yield the PAM100 matrix, which would represent what one would expect if there were 100 amino acid changes per 100 residues. (This does not imply that each of the 100 residues has changed, only that there were 100 total changes; some positions could conceivably change and then change back to the original residue.) As the matrices representing longer evolutionary distances are an extrapolation of the original matrix derived from the 1572 observed changes described above, it is important to remember that these matrices are, indeed, predictions and are not based on direct observation. Any errors in the original matrix would be exaggerated in the extrapolated matrices, as the mere act of multiplication would magnify these errors significantly.
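The extrapolation step can be sketched with a toy example. The matrix below is not the Dayhoff PAM1 matrix; it is a hypothetical two-letter alphabet in which each letter has a 1% chance of being replaced per PAM unit, chosen only to show what repeated multiplication does. All names and values are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Return m raised to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1" for a two-letter alphabet: entry [i][j] is the
# probability that letter i is replaced by letter j over one PAM unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

if __name__ == "__main__":
    toy_pam100 = mat_power(TOY_PAM1, 100)
    # The diagonal is the probability that a position shows its original letter.
    print(round(toy_pam100[0][0], 4))
```

Even after 100 units, a position in this toy model has better than even odds of showing its original letter, which mirrors the parenthetical point above that 100 changes per 100 residues does not mean every residue has changed.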

There are additional assumptions that the reader should be aware of regarding the construction of these PAM matrices. All sites have been assumed to be equally mutable, replacement has been assumed to be independent of surrounding residues, and there is no consideration of conserved blocks or motifs. The sequences being compared here are of average composition based on the small number of protein sequences available in 1978, so there is a bias toward small, globular proteins, even though efforts have been made to bring in additional sequence data over time (Gonnet et al. 1992; Jones et al. 1992). Finally, there is an implicit assumption that the forces responsible for sequence evolution over shorter time spans are the same as those for longer evolutionary time spans. Although there are significant drawbacks to the PAM matrices, it is important to remember that, given the information available in 1978, the development of these matrices marked an important advance in our ability to quantify the relationships between sequences. As these matrices are still available for use with numerous bioinformatic tools, the reader should keep these potential drawbacks in mind and use them judiciously.

BLOSUM Matrices

In 1992, Steve and Jorja Henikoff took a slightly different approach to the one described above, one that addressed many of the drawbacks of the PAM matrices. The groundwork for the development of new matrices was a study aimed at identifying conserved motifs within families of proteins (Henikoff and Henikoff 1991, 1992). This study led to the creation of the BLOCKS database, which used the concept of a block to identify a family of proteins. The idea of a block is derived from the more familiar notion of a motif, which usually refers to a conserved stretch of amino acids that confers a specific function or structure to a protein. When these individual motifs from proteins in the same family can be aligned without introducing a gap, the result is a block, with the term block referring to the alignment, not the individual sequences themselves. Obviously, any given protein can contain one or more blocks, corresponding to each of its structural or functional motifs. With these protein blocks in hand, it was then possible to look for substitution patterns only in the most conserved regions of a protein, the regions that (presumably) were least prone to change. Two thousand blocks representing more than 500 groups of related proteins were examined and, based on the substitution patterns in those conserved blocks, blocks substitution matrices (or BLOSUMs, for short) were generated.

Given the pace of scientific discovery, many more protein sequences were available in 1992 than in 1978, providing for a more robust base set of data from which to derive these new matrices. However, the most important distinction between the BLOSUM and PAM matrices is that the BLOSUM matrices are directly calculated across varying evolutionary distances and are not extrapolated, providing a more accurate view of substitution patterns (and, in turn, evolutionary forces) at those various distances. The fact that the BLOSUM matrices are calculated directly based only on conserved regions makes these matrices more sensitive to detecting structural or functional substitutions; therefore, the BLOSUM matrices perform demonstrably better than the PAM matrices for local similarity searches (Henikoff and Henikoff 1993).

Returning to the point of directly deriving the various matrices, each BLOSUM matrix is assigned a number (BLOSUMn), and that number represents the conservation level of the sequences that were used to derive that particular matrix. For example, the BLOSUM62 matrix is calculated from sequences sharing no more than 62% identity; sequences with more than 62% identity are clustered, and each cluster's combined contribution is weighted to 1, as if it were a single sequence. The clustering reduces the contribution of closely related sequences, meaning that there is less bias toward substitutions that occur (and may be over-represented) in the most closely related members of a family. Reducing the value of n yields matrices suited to more distantly related sequences.
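The clustering and weighting idea can be sketched as follows. This is a simplified illustration, not the procedure used to build the published matrices: the single-linkage grouping, the function names, and the toy segments are all assumptions. Each segment in a cluster of size k receives weight 1/k, so the cluster as a whole contributes a total weight of 1.

```python
def identity_fraction(a: str, b: str) -> float:
    """Fraction of identical positions in two equal-length, ungapped segments."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_weights(segments, threshold=0.62):
    """Weight each segment by 1 / (size of its cluster).

    Assumption: segments are grouped by single linkage, so two segments
    share a cluster if a chain of pairs above the threshold connects them.
    """
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if identity_fraction(segments[i], segments[j]) > threshold:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(segments)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1.0 / sizes[find(i)] for i in range(len(segments))]


if __name__ == "__main__":
    # Hypothetical block: the first two segments are 90% identical.
    print(cluster_weights(["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]))
```

With these weights, the two near-identical segments together count as much as the single divergent one, which is how clustering reduces bias toward the closest family members.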

Which Matrices Should be Used When?

Although most bioinformatic software will provide users with a default choice of a scoring matrix, the default may not necessarily be the most appropriate choice for the biological question being asked. Table 3.1 is intended to provide some guidance as to the proper selection of scoring matrix, based on studies that have examined the effectiveness of these matrices to detect known biological relationships (Altschul 1991; Henikoff and Henikoff 1993; Wheeler 2003). Note that the numbering schemes for the two matrix families move in opposite directions: more divergent sequences are found using higher numbered PAM matrices and lower numbered BLOSUM matrices. The following equivalencies are useful in relating PAM matrices to BLOSUM matrices (Wheeler 2003):

• PAM250 is equivalent to BLOSUM45

• PAM160 is equivalent to BLOSUM62

• PAM120 is equivalent to BLOSUM80.

In addition to the protein matrices discussed here, there are numerous specialized matrices that are either specific to a particular species, concentrate on particular classes of proteins (e.g. transmembrane proteins), focus on structural substitutions, or use hydrophobicity measures in attempting to assess similarity (see Wheeler 2003). Given this landscape, the most important take-home message for the reader is that no single matrix is the complete answer for all sequence comparisons. A thorough understanding of what each matrix represents is critical to performing proper sequence-based analyses.

Table 3.1 Selecting an appropriate scoring matrix.

Matrix	Best use	Similarity
PAM40	Short alignments that are highly similar	70–90%
PAM160	Detecting members of a protein family	50–60%
PAM250	Longer alignments of more divergent sequences	∼30%
BLOSUM90	Short alignments that are highly similar	70–90%
BLOSUM80	Detecting members of a protein family	50–60%
BLOSUM62	Most effective in finding all potential similarities	30–40%
BLOSUM30	Longer alignments of more divergent sequences	<30%

The Similarity column gives the range of similarities that the matrix is able to best detect (Wheeler 2003).


Nucleotide Scoring Matrices

Scope: PDF page 71; print page 51.

Boundary: from the “Nucleotide Scoring Matrices” heading up to, but not including, the “Gaps and Gap Penalties” heading.

Nucleotide Scoring Matrices

At the nucleotide level, the scoring landscape is much simpler. More often than not, the matrices used here simply count matches and mismatches. These matrices also assume that each of the possible four nucleotide bases occurs with equal frequency (25% of the time). In some cases, ambiguities or chemical similarities between the bases are also considered; this type of matrix is shown in Figure 3.2. The basic differences in the construction of nucleotide and protein scoring matrices should make obvious the fact that protein-based searches are always more powerful than nucleotide-based searches of coding DNA sequences in determining similarity and inferring homology, given the inherently higher information content of the 20-letter amino acid alphabet versus the four-letter nucleotide alphabet.
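Both points in this paragraph lend themselves to a brief sketch: a match/mismatch scorer for nucleotides, and a rough comparison of the information carried per position by each alphabet. The function names and the +1/−1 scores are hypothetical, and the bits-per-symbol figure assumes every symbol is equally likely, which the text states for nucleotides but which is only an upper bound for amino acids.

```python
import math


def match_mismatch_score(seq_a: str, seq_b: str,
                         match: int = 1, mismatch: int = -1) -> int:
    """Score two equal-length nucleotide sequences by counting matches and mismatches."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(seq_a, seq_b))


def bits_per_symbol(alphabet_size: int) -> float:
    """Maximum information per position, assuming all symbols are equally likely."""
    return math.log2(alphabet_size)


if __name__ == "__main__":
    print(match_mismatch_score("ACGT", "ACGA"))  # three matches, one mismatch
    print(bits_per_symbol(4))                    # nucleotide alphabet
    print(round(bits_per_symbol(20), 2))         # amino acid alphabet
```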

Gaps and Gap Penalties

Scope: PDF page 71 to upper half of PDF page 72; print pages 51-52.

Boundary: from the “Gaps and Gap Penalties” heading up to, but not including, the “BLAST” heading.

Gaps and Gap Penalties

Oftentimes, gaps are introduced to improve the alignment between two nucleotide or protein sequences. These gaps compensate for insertions and deletions between the sequences being studied so, in essence, these gaps represent biological events. As such, the number of gaps introduced into a pairwise sequence alignment needs to be kept to a reasonable number so as to not yield a biologically implausible scenario.

The scoring of gaps in pairwise sequence alignments is different from scoring approaches discussed to this point, as no comparison between characters is possible – one sequence has a residue at some position and the other sequence has nothing. The most widely used method for scoring gaps involves a quantity known as the affine gap penalty. Here, a fixed deduction is made for introducing the gap; an additional deduction is made that is proportional to the length of the gap. The formula for the affine gap penalty is G + Ln, where G is the gap-opening penalty (the cost of creating the gap), L is the gap-extension penalty, and n is the length of the gap, with G > L. This last condition is important: given that the gap-opening penalty is larger than the gap-extension penalty, lengthening existing gaps would be favored over creating new ones. The values of G and L can be adjusted manually in most programs to make the insertion of gaps either more or less permissive, but most methods automatically adjust both G and L to the most appropriate values for the scoring matrix being used.

Figure 3.2 A nucleotide scoring table. The scoring for the four nucleotide bases is shown in the upper left of the figure, with the remaining one-letter codes specifying the IUPAC/IUBMB codes for ambiguities or chemical similarities. Note that the matrix is a mirror image of itself with respect to the diagonal.

The other major type of gap penalty used is a non-affine (or linear) gap penalty. Here, there is no cost for opening the gap; a simple, fixed mismatch penalty is assessed for each position in the gap. It is thought that affine penalties better represent the biology underlying the sequence alignments, as affine gap penalties take into account the fact that most conserved regions are ungapped and that a single mutational event could insert or delete many more than just one residue. In practice, use of the affine gap penalty better enables the detection of more distant homologs.
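The difference between the two penalty schemes can be made concrete with a short sketch of the formula G + Ln and its linear counterpart. The function names and the numeric values (G = 11, L = 1, and a per-position penalty of 2) are illustrative assumptions, not values taken from the text.

```python
def affine_gap_penalty(n: int, gap_open: int, gap_extend: int) -> int:
    """Affine penalty G + L*n for one gap of length n (0 if there is no gap)."""
    if n <= 0:
        return 0
    return gap_open + gap_extend * n


def linear_gap_penalty(n: int, per_position: int) -> int:
    """Linear penalty: a fixed cost per gap position, with no opening cost."""
    return per_position * max(n, 0)


if __name__ == "__main__":
    G, L = 11, 1  # illustrative values with G > L
    one_long_gap = affine_gap_penalty(4, G, L)
    four_short_gaps = 4 * affine_gap_penalty(1, G, L)
    # Under the affine scheme, one gap of length 4 is far cheaper than four
    # separate gaps of length 1, so extending a gap beats opening new ones.
    print(one_long_gap, four_short_gaps)
    # Under the linear scheme, both arrangements cost the same.
    print(linear_gap_penalty(4, 2), 4 * linear_gap_penalty(1, 2))
```

The comparison shows why the condition G > L matters: the same four gap positions are penalized much less when they result from a single insertion or deletion event.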

中文译文

评分矩阵

> 来源：Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 4th ed.

> 范围：PDF page 66 - PDF page 71 上半；印刷页码 46-51。

> 用途：

无论使用全局比对方法还是局部比对方法，一旦两条待比较序列完成比对，接下来的问题就是：怎样实际衡量“序列 A”和“序列 B”之间的比对有多好？回答这个问题的第一步，是使用数值方法。这些方法不仅考虑两条序列逐位重叠的情况，还考虑被比对的残基或核苷酸本身的性质与特征。

为此，研究者投入了大量精力来开发一类称为评分矩阵（scoring matrices）的工具。评分矩阵是经验性的加权方案，出现在所有涉及两条或多条序列比较的分析中。因此，理解这些矩阵是如何构建的，以及如何在不同矩阵之间做出选择，非常重要。矩阵的选择能够——而且确实会——显著影响大多数序列比较方法得到的结果。

最常用的蛋白质评分矩阵会考虑以下三个主要生物学因素。

保守性（conservation）。矩阵需要考虑蛋白质序列之间的绝对保守性，也需要提供一种方法来评估保守性氨基酸替换。评分矩阵中的数值，用来表示哪些氨基酸残基可以替换其他残基，同时不对天然蛋白质的功能造成不利影响。从物理化学角度看，残基的电荷、大小和疏水性等特征需要相似。

图 3.1 BLOSUM62 评分矩阵（Henikoff and Henikoff 1992）。BLOSUM62 是蛋白质分析中使用最广泛的评分矩阵，并且在通用场景中具有最佳覆盖度。矩阵每一行左侧和每一列顶部的标准单字母代码表示 20 种氨基酸。图中还包括歧义代码 B（表示天冬酰胺或天冬氨酸；Asx）、Z（表示谷氨酰胺或谷氨酸；Glx），以及 X（表示任意氨基酸）。注意，该矩阵相对于对角线呈镜像对称。详见正文。

频率（frequency）。正如氨基酸残基不能任意相互替换一样，矩阵也需要反映特定残基在整个蛋白质集合中出现的频率。稀有残基会比常见残基获得更高权重。

进化（evolution）。从设计上看，评分矩阵隐含地代表了进化模式；矩阵也可以被调整，以偏向检测亲缘关系较近或较远的蛋白质。针对不同进化距离选择何种矩阵，将在下文讨论。

构建评分矩阵还涉及一些细微问题，Henikoff 和 Henikoff（2000）的一篇优秀综述对此有详细说明。

这些因素在评分矩阵中究竟如何体现，最好的说明方式是拆解最常用的评分矩阵 BLOSUM62（图 3.1）。20 种氨基酸以及标准歧义代码分别列在矩阵顶部和侧边。矩阵中的分数实际上表示一个优势比（odds ratio）的对数（Box 3.1），该优势比考虑的是：在自然界中，某一残基被观察到替换另一残基的频率。这个优势比还会考虑：如果替换以随机方式发生（纯粹出于偶然），某一残基被另一残基替换的频率应当是多少。因此，正分表示两种残基彼此替换的观察频率高于随机预期；负分则表示两种残基彼此替换的观察频率低于随机预期。更简单地说，常见替换得到正分，不常见替换得到负分。

Box 3.1 评分矩阵与对数优势比

蛋白质评分矩阵来源于氨基酸彼此替换的观察频率。基于这些概率，可以使用下式生成评分矩阵：

S_i,j = log[(q_i,j) / (p_i p_j)]

其中，p_i 表示残基 i 在所有蛋白质中出现的概率，p_j 表示残基 j 在所有蛋白质中出现的概率。q_i,j 表示在蛋白质家族的多序列比对中，或在已知具有生物学关系的序列中，氨基酸 i 和 j 被观察到彼此对齐的频率。因此，对数优势比 S_i,j（或称 “lod score”，即 lod 分数）表示残基 i 被残基 j 替换时，观察频率与随机频率之间的比值。对于常见替换，S_i,j 大于 0；对于发生频率低于随机预期的替换，S_i,j 小于 0；如果观察频率与随机频率相同，则 S_i,j 等于 0。

为了更充分地说明矩阵中数字的含义，可以设想两条序列已经完成比对，现在需要评估在比对的某一给定位置上，序列 A 中的一个残基与序列 B 中的一个残基匹配得有多好。以图 3.1 中的评分矩阵为例：

对角线上的数值表示某一位置发生精确匹配时得到的分数，这些数值总是正数。例如，如果序列 A 中的色氨酸残基（W）与序列 B 中的色氨酸残基对齐，那么这一匹配会得到 11 分，即 W 行与 W 列交叉处的数值。还应注意，11 是对角线上的最高值。因此，W:W 比对获得如此高的分数，不仅反映了这是一次精确匹配，也反映了色氨酸是蛋白质中最稀有的氨基酸。换句话说，W:W 比对总体上更不容易偶然发生，因此也更可能是正确的。
离开对角线后，可以考虑一个保守性替换的例子：用酪氨酸（Y）替换色氨酸。Y 行与 W 列的交叉处数值为 2。这个正值意味着，这种替换在比对中出现的观察频率高于随机预期；但它不如保留色氨酸残基好（2 < 11），也不如保留酪氨酸残基好（2 < 7）。
最后，考虑一个非保守性替换的例子：用缬氨酸（V）替换色氨酸。V 行与 W 列交叉处的数值为 −3。这个负值意味着，这种替换并不常被观察到，其出现更多可能是偶然结果。

尽管评分矩阵中数字及其相互关系的含义看起来相当直接，但在实际构建矩阵时，仍必须对什么才算保守性替换或非保守性替换，以及如何评估这些事件在自然界中的频率，做出一些判断。这正是不同评分矩阵彼此区分的主要因素。为了帮助读者做出明智选择，有必要讨论目前可用矩阵的构建思路、优点和缺点。

PAM 矩阵

最早可用于蛋白质序列分析的实用矩阵由 Dayhoff 等人（1978）开发。这些矩阵的基础，是考察一组序列一致性超过 85% 的蛋白质中的替换模式。该分析在 71 组亲缘关系较近的蛋白质中识别出 1572 个变化。研究者据此构建表格，用来表示在某一给定位置上，一个特定氨基酸替换另一个氨基酸的频率。

由于被考察的序列具有如此高的相似性，所得频率代表的是较短进化距离上预期会出现的情况。此外，由于这些蛋白质之间进化关系接近，可以预期观察到的突变不会显著改变蛋白质功能。这称为接受（acceptance）：即那些能够通过自然选择被容纳，并产生与原始蛋白质具有相同或相似功能蛋白质的变化。由于该分析考察的是单个点突变，因此由此得到的计量单位称为可接受点突变（point accepted mutation），即 PAM 单位。1 个 PAM 单位对应每 100 个残基中发生 1 个氨基酸变化，约等于 1% 分化。

PAM 矩阵的构建包含若干假设。其中最重要的假设之一是：某一位置上的氨基酸替换独立于该位置此前发生过的突变。基于这一假设，原始矩阵被外推，用来预测更长进化距离上的替换频率。例如，PAM1 矩阵可以与自身相乘 100 次，得到 PAM100 矩阵；PAM100 代表的是每 100 个残基发生 100 次氨基酸变化时的预期情况。（这并不意味着 100 个残基中的每一个都发生了变化，而只是说总共发生了 100 次变化；某些位置可能先发生变化，随后又变回原来的残基。）由于代表较长进化距离的矩阵是从上述 1572 个观察变化所构建的原始矩阵外推而来，因此必须记住，这些矩阵确实是预测结果，并非基于直接观察。原始矩阵中的任何误差都会在外推矩阵中被放大，因为单纯的矩阵相乘会显著放大这些误差。

读者还应了解 PAM 矩阵构建中的其他假设。所有位点都被假定为同等可变；替换被假定为独立于周围残基；同时，PAM 矩阵不考虑保守区块或基序。这里比较的序列具有平均组成特征，而这一“平均”是基于 1978 年可获得的少量蛋白质序列得出的，因此偏向小型球状蛋白；尽管后来已有努力将更多序列数据纳入其中（Gonnet et al. 1992; Jones et al. 1992）。最后，这里还隐含了一个假设：负责较短时间尺度上序列进化的力量，与较长进化时间尺度上的力量相同。虽然 PAM 矩阵存在显著缺点，但也应记住，在 1978 年可获得的信息条件下，这些矩阵的开发标志着人们量化序列关系能力的一项重要进展。由于许多生物信息学工具仍可使用这些矩阵，读者应牢记这些潜在缺陷，并审慎使用。

BLOSUM 矩阵

1992 年，Steve 和 Jorja Henikoff 采用了一种与上述方法略有不同的思路，并解决了 PAM 矩阵的许多缺点。新矩阵开发的基础，是一项旨在识别蛋白质家族中保守基序的研究（Henikoff and Henikoff 1991, 1992）。该研究促成了 BLOCKS 数据库的建立；这一数据库使用 block 的概念来识别蛋白质家族。block 的概念源自更熟悉的 motif 概念，后者通常指一段保守的氨基酸序列，并为蛋白质赋予特定功能或结构。当同一家族蛋白质中的这些单个基序可以在不引入缺口的情况下完成比对时，得到的结果就是一个 block；这里的 block 指的是比对本身，而不是单条序列。显然，任意一个给定蛋白质都可以包含一个或多个 block，分别对应其结构基序或功能基序。有了这些蛋白质 block，就可以只在蛋白质中最保守、也就是推测最不容易发生变化的区域中寻找替换模式。研究者考察了代表 500 多组相关蛋白质的 2000 个 block，并基于这些保守 block 中的替换模式，生成了区块替换矩阵（blocks substitution matrices），简称 BLOSUM。

鉴于科学发现的速度，到 1992 年，可用蛋白质序列数量远多于 1978 年，因此这些新矩阵可以从更稳健的基础数据集中推导出来。然而，BLOSUM 矩阵与 PAM 矩阵之间最重要的区别在于：BLOSUM 矩阵是在不同进化距离上直接计算得到的，而不是外推得到的，因此能够更准确地反映这些距离上的替换模式，并进一步反映相应的进化力量。BLOSUM 矩阵仅基于保守区域直接计算，这一事实使其对结构性或功能性替换的检测更加敏感；因此，在局部相似性搜索中，BLOSUM 矩阵的表现明显优于 PAM 矩阵（Henikoff and Henikoff 1993）。

回到直接推导各类矩阵这一点，每个 BLOSUM 矩阵都会被赋予一个编号（BLOSUMn），该编号表示用于推导该矩阵的序列的保守水平。例如，BLOSUM62 矩阵是由序列一致性不超过 62% 的序列计算得到的；序列一致性超过 62% 的序列会被聚类，并且它们的贡献被加权为 1。聚类会降低亲缘关系很近的序列的贡献，也就是说，来自同一家族中最接近成员的替换不会被过度代表，从而减少偏倚。降低 n 的值，会得到用于更远亲缘关系序列的矩阵。

应在何时使用哪种矩阵？

虽然大多数生物信息学软件都会为用户提供默认评分矩阵，但默认矩阵未必一定最适合当前提出的生物学问题。表 3.1 旨在根据已有研究提供一些指导，帮助选择合适的评分矩阵；这些研究考察了不同矩阵检测已知生物学关系的有效性（Altschul 1991; Henikoff and Henikoff 1993; Wheeler 2003）。需要注意的是，这两个矩阵家族的编号方向相反：分化程度更高的序列，需要使用编号较高的 PAM 矩阵和编号较低的 BLOSUM 矩阵来识别。以下等价关系有助于将 PAM 矩阵与 BLOSUM 矩阵对应起来（Wheeler 2003）：

PAM250 约等于 BLOSUM45
PAM160 约等于 BLOSUM62
PAM120 约等于 BLOSUM80

除这里讨论的蛋白质矩阵外，还有许多专门矩阵：有些特异于某一物种，有些关注特定蛋白质类别（如跨膜蛋白），有些关注结构性替换，还有一些尝试利用疏水性指标来评估相似性（见 Wheeler 2003）。面对这样的选择格局，读者最需要记住的是：没有任何单一矩阵能够回答所有序列比较问题。要正确开展基于序列的分析，必须充分理解每一种矩阵究竟代表什么。

表 3.1 选择合适的评分矩阵

矩阵	最适合的用途	相似性
PAM40	短而高度相似的比对	70–90%
PAM160	检测蛋白质家族成员	50–60%
PAM250	分化程度更高序列的较长比对	∼30%
BLOSUM90	短而高度相似的比对	70–90%
BLOSUM80	检测蛋白质家族成员	50–60%
BLOSUM62	最有效地发现所有潜在相似性	30–40%
BLOSUM30	分化程度更高序列的较长比对	<30%

“相似性”列给出的是该矩阵最适合检测的相似性范围（Wheeler 2003）。

---

核苷酸评分矩阵

> 范围：PDF page 71；印刷页码 51。

在核苷酸层面，评分问题要简单得多。这里使用的矩阵通常只是简单统计匹配和错配。这类矩阵还假设四种可能的核苷酸碱基出现频率相同，即各占 25%。在某些情况下，矩阵也会考虑碱基之间的歧义或化学相似性；图 3.2 展示了这类矩阵的一个例子。

核苷酸评分矩阵与蛋白质评分矩阵在构建方式上的基本差异，应当清楚表明：在判定相似性和推断同源性时，对于编码 DNA 序列，基于蛋白质的搜索总是比基于核苷酸的搜索更有力。这是因为 20 个字母组成的氨基酸字母表，相比 4 个字母组成的核苷酸字母表，天然包含更高的信息量。

---

缺口与缺口罚分

> 范围：PDF page 71 - PDF page 72 上半；印刷页码 51-52。

在比较两条核苷酸序列或蛋白质序列时，常常需要引入缺口（gaps），以改善两条序列之间的比对。这些缺口用于补偿所研究序列之间发生的插入和缺失。因此，从本质上说，这些缺口代表了生物学事件。也正因为如此，在双序列比对中引入的缺口数量必须控制在合理范围内，避免得到生物学上不可信的情形。

在双序列比对中，对缺口进行评分的方法不同于前面讨论过的评分方式，因为这里无法比较两个字符：一条序列在某个位置有残基，而另一条序列在该位置没有任何字符。最常用的缺口评分方法涉及一个称为仿射缺口罚分（affine gap penalty）的量。在这种方法中，引入缺口会产生一个固定扣分；此外，还会根据缺口长度产生一个成比例的额外扣分。仿射缺口罚分的公式为：

G + L n

其中，G 是缺口开启罚分（gap-opening penalty，即产生缺口的代价），L 是缺口延伸罚分（gap-extension penalty），n 是缺口长度，并且 G > L。最后这一条件很重要：由于缺口开启罚分大于缺口延伸罚分，延长已有缺口会比创建新缺口更受偏好。在大多数程序中，G 和 L 的值可以手动调整，使缺口插入更宽松或更严格；不过，多数方法会根据所使用的评分矩阵，自动将 G 和 L 调整到最合适的值。

图 3.2 核苷酸评分表。图左上角显示四种核苷酸碱基的评分，其余单字母代码表示 IUPAC/IUBMB 关于歧义或化学相似性的代码。注意，该矩阵相对于对角线呈镜像对称。

另一种常用的主要缺口罚分类型是非仿射（或线性）缺口罚分（non-affine or linear gap penalty）。在这种方法中，开启缺口本身没有代价；对于缺口中的每一个位置，只施加一个简单、固定的错配罚分。通常认为，仿射罚分更能代表序列比对背后的生物学原因，因为仿射缺口罚分考虑到这样一个事实：大多数保守区域没有缺口，而且单个突变事件可能插入或删除的不止一个残基。在实践中，使用仿射缺口罚分更有助于检测亲缘关系更远的同源序列。

术语表（15 条）

English	中文
scoring matrix / scoring matrices	评分矩阵
conservation	保守性
frequency	频率
evolution	进化
conservative amino acid substitution	保守性氨基酸替换
odds ratio	优势比
log odds ratio	对数优势比
lod score	lod 分数
PAM matrix	PAM 矩阵
point accepted mutation	可接受点突变
PAM unit	PAM 单位
BLOSUM matrix	BLOSUM 矩阵
block	block / 区块
blocks substitution matrices	区块替换矩阵
motif	基序

020

BLAST

PDF page 72 - PDF page 81 (middle); printed pages 52-61

English source (extracted from PDF)

BLAST

By far the most widely used technique for detecting similarity between sequences of interest is the Basic Local Alignment Search Tool, or BLAST (Altschul et al. 1991). The widespread adoption of BLAST as a cornerstone technique in sequence analysis lies in its ability to detect similarities between nucleotide and protein sequences accurately and quickly, without sacrificing sensitivity. The original, standard family of BLAST programs is shown in Table 3.2, but in the time since its introduction many variations of the original BLAST program have been developed to address specific needs in the realm of pairwise sequence comparison, several of which will be discussed in this chapter.

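The Algorithm

Range: PDF page 72 - PDF page 74; printed pages 52-54.

Boundary: from the “The Algorithm” heading up to, but not including, the “Performing a BLAST Search” heading.

Figures and tables: includes Table 3.2, Figure 3.3, and Figure 3.4. Table 3.2 appears immediately after the start of this subsection in the PDF text extraction order; Figures 3.3 and 3.4 illustrate the algorithm.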

BLAST is a local alignment method that is capable of detecting not only the best region of local alignment between a query sequence and its target, but also whether there are other plausible alignments between the query and the target. To find these regions of local alignment in a computationally efficient fashion, the method begins by seeding the search with a small subset of letters from the query sequence, known as the query word. Using the example shown in Figure 3.3, consider a search where the query word of default length 3 is RDQ. (In practice, all words of length 3 are considered, so, using the sequence in Figure 3.3, the first query word would be TLS, followed by LSH, and so on across the sequence.) BLAST now needs to find not only the word RDQ in all of the sequences in the target database but also related words where conservative substitutions have been introduced, as those matches may also be biologically informative and relevant. To determine which words are related to RDQ, scoring matrices are used to develop what is called the neighborhood. The center panel of Figure 3.3 shows the collection of words that are related to the original query word, in descending score order; the scores here are calculated using a BLOSUM62 scoring matrix (Figure 3.1). Obviously, some cut-off must be applied so that further consideration is only given to words that are indeed closely related to the original query word. The parameter that controls this cut-off is the neighborhood score threshold (T). The value of T is determined automatically by the BLAST program but can be adjusted by the user. Increasing T would push the search toward more exact matches and would speed up the search, but could lead to overlooking possibly interesting biological relationships. Decreasing T allows for the detection of more distant relationships between sequences. Here, only words with T ≥ 11 move to the next step.
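The neighborhood step can be sketched in a few lines of Python. This is a minimal illustration rather than BLAST's actual implementation: to stay short it embeds only the handful of BLOSUM62 entries needed to score candidates against R, D, and Q, and it enumerates candidate words over a restricted six-letter alphabet (an assumption made for brevity; a real search considers all 20 amino acids). The names `BLOSUM62_SUBSET`, `word_score`, and `neighborhood` are hypothetical.

```python
from itertools import product

# BLOSUM62 scores for the query residues R, D, and Q against a restricted
# alphabet. Only this subset of the matrix is embedded, for brevity.
BLOSUM62_SUBSET = {
    "R": {"R": 5, "K": 2, "D": -2, "E": 0, "Q": 1, "N": 0},
    "D": {"R": -2, "K": -1, "D": 6, "E": 2, "Q": 0, "N": 1},
    "Q": {"R": 1, "K": 1, "D": 0, "E": 2, "Q": 5, "N": 0},
}
ALPHABET = "RKDEQN"  # restricted candidate alphabet (assumption for the sketch)


def word_score(query_word, candidate):
    """Sum the position-by-position matrix scores for two equal-length words."""
    return sum(BLOSUM62_SUBSET[q][c] for q, c in zip(query_word, candidate))


def neighborhood(query_word, threshold=11):
    """Return (word, score) pairs scoring at least `threshold`, best first."""
    hits = []
    for letters in product(ALPHABET, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits.append((candidate, score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


for word, score in neighborhood("RDQ", threshold=11):
    print(word, score)
```

Raising `threshold` shrinks the list toward the exact word RDQ, while lowering it admits more distant words, which mirrors the trade-off between speed and sensitivity described for T.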

Table 3.2 BLAST algorithms.

Program	Query	Database
BLASTN	Nucleotide	Nucleotide
BLASTP	Protein	Protein
BLASTX	Nucleotide, six-frame translation	Protein
TBLASTN	Protein	Nucleotide, six-frame translation
TBLASTX	Nucleotide, six-frame translation	Nucleotide, six-frame translation

[Figure 3.3 panels: Query word (W = 3); Establish neighborhood; Extension using neighborhood words greater than neighborhood score threshold (T = 11)]

Figure 3.3 The initiation of a BLAST search. The search begins with query words of a given length (here,

three amino acids) being compared against a scoring matrix to determine additional three-letter words

“in the neighborhood” of the original query word. Any occurrences of these neighborhood words in

sequences within the target database are then investigated. See text for details.

Focusing now on the lower panel of Figure 3.3, the original query word (RDQ) has been aligned with another word from the neighborhood whose score is more than the score threshold of T ≥ 11 (REQ). The BLAST algorithm now attempts to extend this alignment in both directions, tallying a cumulative score resulting from matches, mismatches, and gaps, until it constructs a local alignment of maximal length. Determining what the maximal length actually is can be best explained by considering the graph in Figure 3.4. Here, the number of residues that have been aligned is plotted against the cumulative score resulting from the alignment. The left-most point on the graph represents the alignment of the original query word with one of the words from the neighborhood, again having a value of T = 11 or greater. As the extension proceeds, as long as exact matches and conservative substitutions outweigh mismatches and gaps, the cumulative score will increase. As soon as the cumulative score breaks the score threshold S, the alignment is reported in the BLAST output. Simply clearing S does not automatically mean that the alignment is biologically significant, a very important point that will be addressed later in this discussion.

[Figure 3.4 plot: axes labeled Length of extension and Cumulative score; labels T, S, X, and Length of HSP]

Figure 3.4 BLAST search extension. Length of extension represents the number of characters that have been aligned in a pairwise sequence comparison. Cumulative score represents the sum of the position-by-position scores, as determined by the scoring matrix used for the search. T represents the neighborhood score threshold, S is the minimum score required to return a hit in the BLAST output, and X is the significance decay. See text for details.

As the extension continues, at some point, mismatches and gaps will begin to outweigh the exact matches and conservative substitutions, accruing negative scores from the scoring matrix. As soon as the curve begins to turn downward, BLAST measures whether the drop-off exceeds a threshold called X. If the curve decays more than is allowed by the value of X, the extension is terminated and the alignment is trimmed back to the length corresponding to the preceding maximum in the curve. The resulting alignment is called a high-scoring segment pair, or HSP. Given that the BLAST algorithm systematically marches across the query sequence using all possible query words, it is possible that more than one HSP may be found for any given sequence pair.
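The stopping rule based on X can be illustrated with a simplified, one-directional sketch. Several simplifications are assumed here: the extension is ungapped and runs only to the right of the seed, and scoring uses a hypothetical match/mismatch scheme (+2/−3) instead of a substitution matrix. The function names are invented for the example and do not correspond to any BLAST API.

```python
def simple_score(a, b, match=2, mismatch=-3):
    """Hypothetical match/mismatch scoring used only for this sketch."""
    return match if a == b else mismatch


def extend_right(query, subject, q_start, s_start, seed_len, x_drop,
                 score_fn=simple_score):
    """Extend a seed to the right without gaps until the score drops too far.

    Returns (best_length, best_score): the alignment is trimmed back to the
    length at which the cumulative score reached its maximum.
    """
    # Score the seed itself.
    score = sum(score_fn(query[q_start + i], subject[s_start + i])
                for i in range(seed_len))
    best_score, best_len = score, seed_len

    i = seed_len
    while q_start + i < len(query) and s_start + i < len(subject):
        score += score_fn(query[q_start + i], subject[s_start + i])
        if score > best_score:
            # New maximum of the cumulative score curve.
            best_score, best_len = score, i + 1
        elif best_score - score > x_drop:
            # The curve has decayed by more than X: stop extending.
            break
        i += 1
    return best_len, best_score


query = "ACGTACGTTTTT"
subject = "ACGTACGTAAAA"
print(extend_right(query, subject, 0, 0, seed_len=3, x_drop=5))  # (8, 16)
```

In this example the score climbs to 16 over the eight matching positions, falls to 13 and then 10 over the next two mismatches, and the extension stops once the drop from the maximum exceeds `x_drop`. The reported segment is the eight-position alignment at the preceding maximum.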

After an HSP is identified, it is important to determine whether the resulting alignment is actually significant. Using the cumulative score from the alignment, along with a number of other parameters, a new value called E (for “expect”) is calculated (Box 3.2). For each hit, E gives the number of expected HSPs having a score of S or more that BLAST would find purely by chance. Put another way, the value of E provides a measure of whether the reported HSP is a false positive (see Box 5.4). Lower E values imply greater biological significance.

Box 3.2 The Karlin–Altschul Equation

As one might imagine, assessing the putative biological significance of any given BLAST hit based simply on raw scores is difficult, since the scores are dependent on the composition of the query and target sequences, the length of the sequences, the scoring matrix used to compute the raw scores, and numerous other factors. In one of the most important papers on the theory of local sequence alignment statistics, Karlin and Altschul (1990) presented a formula which directly addresses this problem. The formula, which has come to be known as the Karlin–Altschul equation, uses search-specific parameters to calculate an expectation value (E). This value represents the number of HSPs that would be expected purely by chance. The equation and the parameters used to calculate E are as follows:

E = k m N e^(-λS)

where k is a minor constant, m is the number of letters in the query, N is the total number of letters in the target database, λ is a constant used to normalize the raw score of the high-scoring segment pair, with the value of λ varying depending on the scoring matrix used; and S is the score of the high-scoring segment pair.
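The equation in Box 3.2 translates directly into code. In the sketch below, the default values of `k` and `lam` are placeholders chosen for illustration (assumptions, not values given in the text); in practice they depend on the scoring matrix and gap penalties in use. The function name `expect_value` is likewise hypothetical.

```python
import math


def expect_value(raw_score, query_len, db_len, k=0.041, lam=0.267):
    """Karlin-Altschul expectation value E = k * m * N * exp(-lambda * S).

    raw_score : S, the raw score of the high-scoring segment pair
    query_len : m, number of letters in the query
    db_len    : N, total number of letters in the target database
    k, lam    : illustrative placeholder constants (assumptions)
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


m = 1_500            # hypothetical query length
N = 50_000_000       # hypothetical database size in letters

for s in (40, 80, 160):
    print(s, expect_value(s, m, N))
```

Two properties of E follow from the formula and are visible when the sketch is run: E falls exponentially as the score S rises, and it grows in direct proportion to the size of the search space, so the same alignment is less surprising in a larger database.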

Ch3 Performing a BLAST Search (extracted source text)

> Actual PDF subsection titles: Performing a BLAST Search / Understanding the BLAST Output

> Range: PDF page 74 (lower half) - PDF page 81 (top); printed pages 54-61

> Boundary: from the “Performing a BLAST Search” heading up to the next subsection heading.

Performing a BLAST Search

While many BLAST servers are available throughout the world, the most widely used portal

for these searches is the BLAST home page at the National Center for Biotechnology Infor-

mation (NCBI; Figure 3.5). The top part of the page provides access to the most frequently

performed types of BLAST searches, summarized in Table 3.2, while the lower part of the page

is devoted to specialized types of BLAST searches. To illustrate the relative ease with which

one can perform a BLAST search, a protein-based search using BLASTP is discussed. Click-

ing on the Protein BLAST box brings users to the BLASTP search page, a portion of which is

shown in Figure 3.6. Obviously, a query sequence that will be used as the basis for comparison

is required. Harking back to the Entrez discussion in Chapter 2, the sequence of the netrin

receptor from Homo sapiens (NP_005206.2) has been pasted into the query sequence box.

Immediately to the right, the user can use the query subrange boxes to specify whether only a

portion of this sequence is to be used; if the whole sequence is to be used, these fields should

be left blank.

Figure 3.5 The National Center for Biotechnology Information (NCBI) BLAST landing page. Examples of the most commonly used queries

that can be performed using the BLAST interface are discussed in the text.

Moving to the Choose Search Set section of the page, the database to be searched can be

selected using the Database pull-down menu; clicking on the question mark next to the

Database pull-down provides a brief description of each of the available target databases.

Here, the search will be performed against the RefSeq database (see Box 1.2). Directly below,

the Organism box can be used to limit the search results to sequences from individual

organisms or taxa. While not part of this worked example, if the user wanted to limit the

returned results to those from just mouse and rat, using the same type of syntax used in

issuing Entrez searches (see Table 2.1), the user would type Mus musculus [ORGN] AND

Rattus norvegicus [ORGN] in this field; if the user wanted all results except those

from mouse and rat, they would also need to check the Exclude box. As this search will be

performed against RefSeq, one can exclude predicted proteins from the search results by

clicking the “Models (XM/XP)” checkbox. Finally, in the Program Selection section, BLASTP

is selected by default.

Figure 3.6 The upper portion of the BLASTP query page. The first section in the window is used to specify the sequence of interest, whether only a portion of that sequence should be used in performing the search (query subrange), which database should be searched, and which protein-based BLAST algorithm should be used to execute the query. See text for details.

If the user wishes to use the default settings for all algorithm parameters, the search can

be submitted by simply clicking on the blue BLAST button. However, the user can exert finer

control over how the search is performed by changing the items found in the Algorithm param-

eters section. To access these settings, the user must first click on the plus sign next to the words

“Algorithm parameters” to expand this section of the web page, producing the view shown in

Figure 3.7. This part of the query page is where the theory underlying a BLAST search dis-

cussed earlier in this chapter comes into play. In the General Parameters section, the expect

threshold limits returned results to those having an E value lower than the specified value, with

smaller values providing a more stringent cut-off. The word size setting changes the size of the

query word used to initiate the BLAST search, with longer word sizes initiating the search with

longer ungapped alignments. A word size of 3 is recommended for protein searches, as shorter

words increase sensitivity; however, if searching for near-exact matches, a longer word size

can be used, also yielding faster search times.

Figure 3.7 The lower portion of the BLASTP query page, showing algorithm parameters that the user can adjust to fine-tune the search. Values that have been changed for the search discussed in the text are highlighted in yellow and marked with a diamond. See text for details.

In the Scoring Parameters section, the user can select an appropriate scoring matrix (with

the default being BLOSUM62). Changing the matrix automatically changes the gap penalties to

values appropriate for that scoring matrix. As described in the discussion of affine gap penalties

above, the user may change these values manually; increasing the gap costs would result in

pairwise alignments with fewer gaps, where decreasing the values would make the insertion

of gaps more permissive.

In the Filters and Masking section, one should filter to remove low-complexity regions.

Low-complexity regions are defined simply as regions of biased composition (Wootton

and Federhen 1993). These may include homopolymeric runs, short-period repeats, or the

subtle over-representation of several residues in a sequence. The biological role of these

low-complexity regions is not understood; it is thought that they may represent the results of

either DNA replication errors or unequal crossing-over events. It is important to determine

whether sequences of interest contain low-complexity regions; they tend to prove problematic

when performing sequence alignments and can lead to false-positive results, as they are

generally similar across unrelated proteins. Finally, before issuing the query, be sure to check

the box marked “Show results in a new window.” This leaves the original query window (or

tab) in place, making it easier to go back and refine or change search parameters, as needed.

Understanding the BLAST Output

The first part of the BLASTP results for the query described above is shown in Figure 3.8. The top part of the figure shows the position of conserved protein domains found by comparing the query sequence with data found within NCBI’s Conserved Domain Database (CDD). This is followed by a graphical overview of the BLASTP results, providing a sense of how many sequences were found to have similarity to the query and how they scored against the query. Details of the various graphical display features are given in the legend to Figure 3.8. The actual list of sequences found as a result of this particular BLASTP search, the “hit list,” is shown, in part, in Figure 3.9. The information included for each hit includes the definition line from the hit’s source database entry, the score value that is, in turn, used to calculate the E value for the best HSP alignment, the percent identity for that best HSP alignment, and the hyperlinked accession number, allowing for direct access to the source database record for that hit. The table is sorted by E value from lowest to highest, by default; recall that lower values of E represent better alignments. In the E value column, notice that many of the entries have E values of 0.0. This represents a vanishingly low E value that has been rounded down to zero and implies statistical significance. Note that each entry in the hit list is preceded by a check box; checking one or more of these boxes lights up the grayed-out options shown in Figure 3.9, allowing the user to download the selected sequences, view the selected hits graphically, generate a dendrogram, or construct a multiple sequence alignment on the fly.

Figure 3.8 Graphical display of BLASTP results. The query sequence is represented by the thick cyan bar labeled “Query,” with the tick marks indicating residue positions within the query. The thinner bars below the query represent each of the matches (“hits”) detected by the BLAST algorithm. The colors represent the relative scores for each hit, with the color key for the scores appearing at the top of the box. The length of each line, as well as its position, represents the region of similarity with the query. Hits connected by a thin line indicate more than one high-scoring segment pair (HSP) within the same sequence; similarly, a thin vertical bar crossing one of the hits indicates a break in the overall alignment. Moving the mouse over any of the lines produces a pop-up that shows the identity of that hit. Clicking on any of the lines takes the user directly to detailed information about that hit (see Figure 3.10).

Figure 3.9 The BLASTP “hit list.” For each sequence found, the user is presented with the definition line from the hit’s source database entry, the score value for the best high-scoring segment pair (HSP) alignment, the total of all scores across all HSP alignments, the percentage of the query covered by the HSPs, and the E value and percent identity for the best HSP alignment. The hyperlinked accession number allows for direct access to the source database record for that hit. In the E value column, vanishingly low E values are rounded down to zero. For non-zero E values, exponential notation is used; using the first non-zero value in the figure, 2e-159 should be read as 2 × 10^-159.

Clicking on the name of any of the proteins in the hit list moves the user down the page to the portion of the output showing the pairwise alignment(s) for that hit (Figure 3.10). The header provides the complete definition line for this particular hit, and each identified HSP is then shown below the header. In most cases, the user will only see one alignment, but in the case shown in Figure 3.10 there are two, with the hit having the better score and E value shown first. The statistics given for each hit include the E value, the number of identities (exact matches), the number of “positives” (exact matches and conservative substitutions), and the number of residues that fell into a gapped region. Within the alignments, gaps are indicated by dashes, while low-complexity regions are indicated by grayed-out lower case letters.

Figure 3.10 Detailed information on a representative BLASTP hit. The header provides the identity of the hit, as well as the score and E value. The percent identity indicates exact matches, whereas the percent “positives” considers both exact matches and conservative substitutions. The gap figures show how many residues remain unaligned owing to the introduction of gaps. Gaps are indicated by dashes and low-complexity regions are indicated by grayed-out lower case letters. Note that there is no header preceding the second alignment; this indicates that this is a second high-scoring segment pair (HSP) within the same database entry.

Ch3 Suggested BLAST Cut-Offs (extracted source text)

> Actual PDF subsection title: Suggested BLAST Cut-Offs

> Range: PDF page 81 (middle); printed page 61

> Boundary: from the “Suggested BLAST Cut-Offs” heading up to the next subsection heading.

Suggested BLAST Cut-Offs

As was previously alluded to, the listing of a hit in a BLAST report does not automatically mean

that the hit is biologically significant. Over time, and based on both the methodical testing

and the personal experience of many investigators, many guidelines have been put forward as

being appropriate for establishing a boundary that separates meaningful hits from the rest. For

nucleotide-based searches, one should look for E values of 10^-6 or less and sequence identities

of 70% or more. For protein-based searches, one should look for hits with E values of 10^-3

or less and sequence identities of 25% or more. Using less-stringent cut-offs risks entry into

what is called the “twilight zone,” the low-identity region where any conclusions regarding

the relationship between two sequences may be questionable at best (Doolittle 1981, 1989;

Vogt et al. 1995; Rost 1999).

The reader is cautioned not to use these cut-offs (or any other set of suggested cut-offs)

blindly, particularly in the region right around the dividing line. Users should always keep

in mind whether the correct scoring matrix was used. Likewise, they should manually inspect

the pairwise alignments and investigate the biology behind any putative homology by read-

ing the literature to convince themselves whether hits on either side of the suggested cut-offs

actually make good biological sense.
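As a rough illustration, the suggested cut-offs above can be written as a small filter. The function name and the dictionary layout are hypothetical; the thresholds are those quoted in the text. In keeping with the caution just given, a filter like this is only a first pass and does not replace manual inspection of the alignments.

```python
# Suggested cut-offs from the text: (maximum E value, minimum percent identity).
SUGGESTED_CUTOFFS = {
    "nucleotide": (1e-6, 70.0),
    "protein": (1e-3, 25.0),
}


def passes_suggested_cutoff(e_value, percent_identity, search_type):
    """Return True if a hit meets the suggested E value and identity cut-offs."""
    max_e, min_identity = SUGGESTED_CUTOFFS[search_type]
    return e_value <= max_e and percent_identity >= min_identity


# Hypothetical hits: (label, E value, percent identity)
hits = [
    ("hit A", 2e-159, 98.0),
    ("hit B", 5e-4, 31.0),
    ("hit C", 0.02, 27.0),   # E value too high
    ("hit D", 1e-10, 19.0),  # identity in the "twilight zone"
]

for label, e_value, identity in hits:
    print(label, passes_suggested_cutoff(e_value, identity, "protein"))
```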

中文译文

BLAST

> 来源：Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 4th ed.

> 范围：PDF page 72；印刷页码 52。

> 用途：

到目前为止，用于检测目标序列之间相似性的最常用技术，是基本局部比对搜索工具（Basic Local Alignment Search Tool），即 BLAST（Altschul 等，1991）。BLAST 能够成为序列分析中的基石技术，主要原因在于它可以准确而快速地检测核苷酸序列和蛋白质序列之间的相似性，同时不牺牲敏感性。原始的标准 BLAST 程序家族见表 3.2；不过，自 BLAST 问世以来，已经发展出许多原始程序的变体，用于满足双序列比较领域中的特定需求。本章后面将讨论其中若干变体。

---

算法

> 范围：PDF page 72 - PDF page 74；印刷页码 52-54。

BLAST 是一种局部比对方法。它不仅能够检测查询序列与目标序列之间最佳的局部比对区域，还能够判断查询序列和目标序列之间是否存在其他可能的比对。为了以计算上高效的方式找到这些局部比对区域，该方法首先用查询序列中的一小段字母作为种子来启动搜索，这段字母称为查询词（query word）。以图 3.3 所示示例为例，设默认长度为 3 的查询词为 RDQ。（实际运行时，会考虑所有长度为 3 的词；因此，使用图 3.3 中的序列时，第一个查询词是 TLS，随后是 LSH，依此沿序列继续。）

BLAST 此时不仅需要在目标数据库的所有序列中寻找 RDQ 这个词，还需要寻找引入了保守替换的相关词，因为这些匹配也可能具有生物学信息和生物学相关性。为了判断哪些词与 RDQ 相关，算法会使用评分矩阵来构建所谓的邻域（neighborhood）。图 3.3 中间面板显示了与原始查询词相关的一组词，并按得分从高到低排列；这些得分使用 BLOSUM62 评分矩阵（图 3.1）计算。显然，必须设置某种截断值，使后续分析只考虑确实与原始查询词密切相关的词。控制这一截断值的参数称为邻域得分阈值（neighborhood score threshold，T）。T 的值由 BLAST 程序自动确定，但用户也可以调整。增大 T 会使搜索更偏向精确匹配，并加快搜索速度，但也可能遗漏有趣的生物学关系。降低 T 则允许检测序列之间更远的关系。在这里，只有 T ≥ 11 的词会进入下一步。

表 3.2 BLAST 算法。

程序	查询序列	数据库
BLASTN	核苷酸	核苷酸
BLASTP	蛋白质	蛋白质
BLASTX	核苷酸，六框翻译	蛋白质
TBLASTN	蛋白质	核苷酸，六框翻译
TBLASTX	核苷酸，六框翻译	核苷酸，六框翻译

图 3.3 BLAST 搜索的启动。搜索从给定长度的查询词开始（此处为三个氨基酸），将其与评分矩阵进行比较，以确定处在原始查询词“邻域”中的其他三字母词。随后，会检查目标数据库序列中是否出现这些邻域词。详见正文。

现在看图 3.3 的下方面板，原始查询词 RDQ 已经与邻域中的另一个词 REQ 对齐，而后者的得分超过了 T ≥ 11 的得分阈值。接下来，BLAST 算法会尝试向两个方向延伸这个比对，并累计由匹配、错配和缺口产生的得分，直到构建出最大长度的局部比对。要解释这个最大长度究竟如何确定，最清楚的方式是观察图 3.4 中的图。这里，已经比对的残基数量被绘制在横轴上，而由该比对产生的累计得分被绘制在纵轴上。图中最左侧的点表示原始查询词与邻域中的某个词之间的比对，该词同样具有 T = 11 或更高的值。随着延伸继续进行，只要精确匹配和保守替换所带来的得分超过错配和缺口带来的扣分，累计得分就会升高。一旦累计得分超过得分阈值 S，该比对就会被报告在 BLAST 输出中。需要特别注意，单纯超过 S 并不自动意味着该比对具有生物学显著性；这是一个非常重要的问题，后文还会讨论。

图 3.4 BLAST 搜索延伸。延伸长度表示在双序列比较中已经对齐的字符数。累计得分表示逐位置得分的总和，这些逐位置得分由搜索所使用的评分矩阵决定。T 表示邻域得分阈值，S 是在 BLAST 输出中返回一个命中所需的最低得分，X 是显著性衰减阈值。详见正文。

随着延伸继续进行，在某个时刻，错配和缺口将开始超过精确匹配和保守替换，并从评分矩阵中累积负分。一旦曲线开始向下，BLAST 就会判断得分下降是否超过一个称为 X 的阈值。如果曲线衰减超过了 X 值所允许的范围，延伸就会终止，并且比对会被修剪回曲线先前达到最大值时对应的长度。所得比对称为高得分片段对（high-scoring segment pair，HSP）。由于 BLAST 算法会使用所有可能的查询词系统地沿查询序列推进，因此对于任意给定的一对序列，可能会发现不止一个 HSP。

识别出 HSP 之后，重要的是判断所得比对是否真正显著。利用该比对的累计得分以及若干其他参数，可以计算出一个称为 E 的新值，其中 E 表示 “expect”（见 Box 3.2）。对于每一个命中，E 给出 BLAST 纯粹由随机机会发现得分为 S 或更高的 HSP 的预期数量。换句话说，E 值提供了一个衡量指标，用于判断所报告的 HSP 是否为假阳性（见 Box 5.4）。较低的 E 值意味着更高的生物学显著性。

Box 3.2 Karlin–Altschul 方程

可以想见，仅仅根据原始得分来评估任意给定 BLAST 命中的推定生物学显著性是困难的，因为得分依赖于查询序列和目标序列的组成、序列长度、用于计算原始得分的评分矩阵，以及许多其他因素。在关于局部序列比对统计理论的最重要论文之一中，Karlin 和 Altschul（1990）提出了一个直接处理这一问题的公式。该公式后来被称为 Karlin–Altschul 方程，它使用与搜索相关的参数来计算一个期望值（expectation value，E）。该值表示纯粹由随机机会预期出现的 HSP 数量。用于计算 E 的方程及其参数如下：

E = k m N e^(-λS)

其中，k 是一个较小的常数，m 是查询序列中的字母数，N 是目标数据库中的字母总数，λ 是用于标准化高得分片段对原始得分的常数，且 λ 的取值会随所用评分矩阵而变化；S 是高得分片段对的得分。

---

执行 BLAST 搜索

> 范围：PDF page 74 下半 - PDF page 81 顶部；印刷页码 54-61。

> 实际 PDF 小节名：Performing a BLAST Search / Understanding the BLAST Output。

尽管世界各地都有许多 BLAST 服务器可供使用，但进行这类搜索最常用的入口，是美国国立生物技术信息中心（National Center for Biotechnology Information，NCBI）的 BLAST 主页（图 3.5）。页面上半部分提供最常用 BLAST 搜索类型的入口，这些类型概括于表 3.2；页面下半部分则用于进入各种专门类型的 BLAST 搜索。为了说明执行 BLAST 搜索相对容易，本节以使用 BLASTP 进行蛋白质搜索为例。点击 Protein BLAST 方框后，用户会进入 BLASTP 搜索页面，其部分界面如图 3.6 所示。显然，必须提供一条作为比较基础的查询序列。回到第 2 章对 Entrez 的讨论，本例将来自 Homo sapiens 的 netrin 受体序列（NP_005206.2）粘贴到查询序列框中。紧邻其右侧，用户可以使用 query subrange 框指定是否只使用该序列的一部分；如果要使用整条序列，则应将这些字段留空。

图 3.5 美国国立生物技术信息中心（NCBI）的 BLAST 起始页面。正文讨论了可通过 BLAST 界面执行的若干最常用查询示例。

进入页面的 Choose Search Set 部分后，可以通过 Database 下拉菜单选择要搜索的数据库；点击 Database 下拉菜单旁边的问号，可以查看每个可用目标数据库的简短说明。这里，搜索将在 RefSeq 数据库中执行（见 Box 1.2）。其下方的 Organism 框可用于将搜索结果限定为来自某个具体生物体或分类单元的序列。虽然这不是本例演示的一部分，但如果用户想把返回结果限制为只来自小鼠和大鼠，可以使用与 Entrez 搜索相同的语法（见表 2.1），在该字段中输入 Mus musculus [ORGN] AND Rattus norvegicus [ORGN]；如果用户想要除小鼠和大鼠之外的所有结果，还需要勾选 Exclude 框。由于本搜索将在 RefSeq 中执行，可以点击 “Models (XM/XP)” 复选框，从搜索结果中排除预测蛋白。最后，在 Program Selection 部分，BLASTP 默认处于选中状态。

图 3.6 BLASTP 查询页面的上半部分。窗口中的第一部分用于指定感兴趣序列、是否只使用该序列的一部分执行搜索（query subrange）、要搜索哪个数据库，以及使用哪一种基于蛋白质的 BLAST 算法执行查询。详见正文。

如果用户希望所有算法参数都使用默认设置，只需点击蓝色 BLAST 按钮即可提交搜索。不过，用户也可以通过修改 Algorithm parameters 部分中的选项，对搜索执行方式进行更精细的控制。要访问这些设置，用户必须先点击 “Algorithm parameters” 字样旁边的加号，展开网页中的这一部分，得到图 3.7 所示的界面。查询页面的这一部分，正是本章前文讨论的 BLAST 搜索理论开始发挥作用的地方。在 General Parameters 部分，expect threshold 会把返回结果限制为 E 值低于指定值的条目；数值越小，截断标准越严格。word size 设置会改变用于启动 BLAST 搜索的查询词长度；较长的 word size 会以较长的无缺口比对启动搜索。对于蛋白质搜索，推荐 word size 为 3，因为较短的词会提高灵敏度；不过，如果要搜索近乎完全相同的匹配，也可以使用较长的 word size，这样还能获得更快的搜索速度。

图 3.7 BLASTP 查询页面的下半部分，显示用户可调整以微调搜索的算法参数。正文所讨论搜索中发生改变的参数值以黄色高亮显示，并用菱形标出。详见正文。

在 Scoring Parameters 部分，用户可以选择合适的打分矩阵（默认矩阵为 BLOSUM62）。更换矩阵会自动把 gap penalties 改为适合该打分矩阵的数值。正如前文关于 affine gap penalties 的讨论所述，用户也可以手动修改这些数值；提高缺口代价会使成对比对包含更少缺口，而降低这些数值则会使插入缺口更加宽容。

在 Filters and Masking 部分，应当进行过滤以去除低复杂度区域。低复杂度区域可以简单定义为组成偏倚的区域（Wootton and Federhen 1993）。这类区域可能包括同聚物连续片段、短周期重复，或序列中若干残基的轻微过度表示。低复杂度区域的生物学作用尚不清楚；一般认为，它们可能代表 DNA 复制错误或不等交换事件的结果。判断感兴趣序列是否包含低复杂度区域非常重要；这些区域在执行序列比对时往往会造成问题，并可能导致假阳性结果，因为它们通常会在彼此无关的蛋白质之间表现出相似性。最后，在提交查询之前，务必勾选 “Show results in a new window” 框。这样可以保留原始查询窗口（或标签页），便于根据需要返回并调整或改变搜索参数。

理解 BLAST 输出

上述查询所得 BLASTP 结果的第一部分如图 3.8 所示。图的上半部分显示了通过将查询序列与 NCBI Conserved Domain Database（CDD）中的数据进行比较而找到的保守蛋白结构域位置。随后是 BLASTP 结果的图形概览，使用户能够大致了解有多少序列与查询序列具有相似性，以及这些序列相对于查询序列的得分情况。图形显示中各项特征的细节见图 3.8 图注。由这次特定 BLASTP 搜索找到的实际序列列表，即 “hit list”，部分显示于图 3.9。每个命中项包含的信息包括来自该命中来源数据库条目的定义行、用于计算最佳 HSP 比对 E 值的 score 值、该最佳 HSP 比对的 percent identity，以及带有超链接的 accession number；通过该 accession number，用户可以直接访问该命中的来源数据库记录。表格默认按 E 值从低到高排序；请记住，E 值越低，表示比对越好。在 E value 列中可以看到，许多条目的 E value 为 0.0。这表示一个极低的 E 值被向下舍入为零，并意味着统计学显著性。还要注意，hit list 中每个条目前都有一个复选框；勾选其中一个或多个复选框后，图 3.9 中灰显的选项会被激活，允许用户下载所选序列、以图形方式查看所选命中、生成树状图，或即时构建多序列比对。

图 3.8 BLASTP 结果的图形显示。查询序列用标有 “Query” 的粗青色条表示，其刻度线标示查询序列中的残基位置。查询序列下方较细的条代表 BLAST 算法检测到的每个匹配（“hits”）。颜色表示每个命中的相对得分，得分颜色键显示在框的顶部。每条线的长度及其位置表示与查询序列相似的区域。由细线连接的命中表示同一序列中存在多个高评分片段对（HSP）；类似地，穿过某个命中的细垂直条表示整体比对中存在断裂。将鼠标移到任一线条上，会弹出显示该命中身份的信息框。点击任一线条，可直接跳转到该命中的详细信息（见图 3.10）。

图 3.9 BLASTP 的 “hit list”。对于找到的每条序列，用户会看到该命中来源数据库条目的定义行、最佳高评分片段对（HSP）比对的 score 值、所有 HSP 比对得分的总和、HSP 覆盖查询序列的百分比，以及最佳 HSP 比对的 E 值和 percent identity。带超链接的 accession number 允许直接访问该命中的来源数据库记录。在 E value 列中，极低的 E 值会被向下舍入为零。对于非零 E 值，使用指数记数法；以图中第一个非零值为例，2e-159 应读作 2 × 10^-159。

点击 hit list 中任一蛋白名称后，用户会移动到页面下方，来到显示该命中的成对比对结果的输出部分（图 3.10）。标题行提供该特定命中的完整定义行，随后在标题行下方显示每个已识别的 HSP。在多数情况下，用户只会看到一个比对；但在图 3.10 所示的案例中有两个比对，其中得分和 E 值更好的命中显示在前。每个命中给出的统计量包括 E 值、identities（完全匹配）的数量、“positives”（完全匹配加保守替换）的数量，以及落入缺口区域的残基数量。在比对内部，缺口用短横线表示，而低复杂度区域用灰色小写字母表示。

图 3.10 一个代表性 BLASTP 命中的详细信息。标题行给出该命中的身份，以及 score 和 E value。percent identity 表示完全匹配，而 percent “positives” 同时考虑完全匹配和保守替换。gap 数值显示由于引入缺口而未能比对的残基数量。缺口用短横线表示，低复杂度区域用灰色小写字母表示。请注意，第二个比对前没有标题行；这表示它是同一数据库条目中的第二个高评分片段对（HSP）。

---

建议的 BLAST 截断标准

> 范围：PDF page 81 中部；印刷页码 61。

> 实际 PDF 小节名：Suggested BLAST Cut-Offs。

如前文所指出的，某个命中出现在 BLAST 报告中，并不自动意味着该命中具有生物学显著性。随着时间推移，基于许多研究者的系统测试和个人经验，已有多种指南被提出，用于建立一条边界，以区分有意义的命中和其余结果。对于基于核苷酸的搜索，应寻找 E 值不高于 10^-6、且序列一致性不低于 70% 的结果。对于基于蛋白质的搜索，应寻找 E 值不高于 10^-3、且序列一致性不低于 25% 的命中。使用更宽松的截断标准，会使分析有进入所谓 “twilight zone”（暮光区）的风险；这是一个低一致性区域，在该区域中，关于两条序列之间关系的任何结论，充其量都可能是可疑的（Doolittle 1981, 1989; Vogt et al. 1995; Rost 1999）。

需要提醒读者，不要盲目使用这些截断标准，或任何其他建议截断标准，尤其是在接近分界线的区域。用户应始终考虑所使用的打分矩阵是否正确。同样，用户也应手动检查成对比对结果，并通过阅读文献考察任何推定同源关系背后的生物学依据，从而说服自己：无论某个命中位于建议截断标准的哪一侧，它是否真正具有合理的生物学意义。

术语表（5 条）

English	中文
BLAST	BLAST / 基本局部比对搜索工具
Basic Local Alignment Search Tool	基本局部比对搜索工具
sequence analysis	序列分析
sensitivity	敏感性
pairwise sequence comparison	双序列比较

021

BLAST 2 Sequences

PDF page 81 (lower half) - PDF page 83 (figure caption continued at top of page); printed pages 61-63

English source (extracted from PDF)

BLAST 2 Sequences

A variation of BLAST called BLAST 2 Sequences can be used to find local alignments between

any two protein or nucleotide sequences of interest (Tatusova and Madden 1999). Although

the BLAST engine is used to find the best local alignment between the two sequences,

no database search is performed. Rather, the two sequences to be compared are specified

in advance by the user. The method is particularly useful for comparing sequences that

have been determined to be homologous through experimental methods or for making

comparisons between sequences from different species. Returning to the Protein BLAST

(BLASTP) search page shown in Figure 3.6, checking the box marked “Align two or more

sequences” will change the structure of the page, now allowing for the user to enter both

the query and subject sequences that will be compared with one another (Figure 3.11). As

with any BLAST search, the user can adjust the standard array of BLAST-related options,

including the selection of scoring matrix and gap penalties. A sample of the results produced

by the BLAST 2 Sequences method is shown in Figure 3.12, comparing the transcription

factor SOX-1 from H. sapiens and the ctenophore Mnemiopsis leidyi, the earliest branching

animal species dating back at least 500 million years in evolutionary time (Ryan et al. 2013;

Schnitzler et al. 2014). The major difference between this output and the typical BLAST

output is the inclusion of a dot matrix view of the alignment, or “dotplot.” Dotplots are

intended to provide a graphical representation of the degree of similarity between the two

sequences being compared, allowing for the quick identification of regions of local alignment,

direct or inverted repeats, insertions, deletions, and low-complexity regions. The dotplot in

Figure 3.12 indicates two regions of alignment, and additional information on those two

regions of alignment is provided in the Alignments section at the bottom of the figure. As with

all BLAST searches, the Alignments section provides the user with the usual set of scores, the

E value, and percentages for identities, positives, and any gaps that may have been introduced.

Figure 3.11 Performing a BLAST 2 Sequences alignment. Clicking the check box at the bottom of the Enter Query Sequence section expands the search page, generating a new Enter Subject Sequence section. Here, sequences for the transcription factor SOX-1 from human and the ctenophore Mnemiopsis leidyi have been used as the query and subject, respectively (Schnitzler et al. 2014). Here, only the BLASTP algorithm is available in the Program Selection section, as a one-to-one alignment has already been specified. The usual set of algorithm parameters is available, allowing the user to fine-tune the alignment as needed.

===== PDF page 83 (Figure 3.12 caption continuation) =====

Figure 3.12 Typical output from a BLAST 2 Sequences alignment, based on the query issued in Figure 3.11. The standard graphical view is shown at the top of the figure, here indicating two high-scoring segment pairs (HSPs) for the alignment of the sequences for the transcription factor SOX-1 from human and the ctenophore Mnemiopsis leidyi. The dot matrix view is an alternative view of the alignment, with the query sequence represented on the horizontal axis and the subject sequence represented by the vertical axis; the diagonal indicates the regions of alignment captured within the two HSPs. The detailed alignments are shown at the bottom of the figure, along with the E values and alignment statistics for each HSP.
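The idea behind a dot matrix view can be sketched with a few lines of Python. This is a generic, simplified dotplot that marks exact word matches between two sequences; it is an assumption-laden illustration and not the method NCBI uses to draw the view in Figure 3.12. The function name `dotplot` and the `window` parameter are hypothetical. As in the figure, the query runs along the horizontal axis and the subject along the vertical axis.

```python
def dotplot(query, subject, window=3):
    """Build a text dotplot marking exact matches of `window`-letter words.

    Each returned row corresponds to a start position in the subject
    (vertical axis); each column corresponds to a start position in the
    query (horizontal axis). '*' marks identical words, '.' everything else.
    """
    rows = []
    for j in range(len(subject) - window + 1):
        row = ""
        for i in range(len(query) - window + 1):
            same = query[i:i + window] == subject[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# Toy sequences: the subject carries a two-letter insertion ("GG").
query = "MKRPHDLEQ"
subject = "MKRPGGHDLEQ"

for row in dotplot(query, subject, window=3):
    print(row)
```

Regions of local similarity appear as runs of marks along a diagonal, and an insertion or deletion shows up as a shift from one diagonal to another, which is why the view makes such features quick to spot.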

中文译文

BLAST 2 Sequences

> 来源：Bioinformatics: A Practical Guide to the Analysis of Sequences and Genomes, 4th ed.

> 范围：PDF page 81 下半 - PDF page 83 顶部跨页图注；印刷页码 61-63。

> 实际 PDF 小节名：BLAST 2 Sequences。

> 用途：

BLAST 的一个变体称为 BLAST 2 Sequences，可用于在任意两条感兴趣的蛋白质序列或核苷酸序列之间寻找局部比对（Tatusova and Madden 1999）。虽然该方法使用 BLAST 引擎来寻找两条序列之间的最佳局部比对，但它并不执行数据库搜索。相反，待比较的两条序列由用户预先指定。该方法特别适合用于比较已经通过实验方法确定为同源的序列，或用于比较来自不同物种的序列。

回到图 3.6 所示的 Protein BLAST（BLASTP）搜索页面，勾选标为 “Align two or more sequences” 的复选框后，页面结构会发生变化，用户现在可以同时输入将要相互比较的 query sequence 和 subject sequence（图 3.11）。与任何 BLAST 搜索一样，用户可以调整一组标准的 BLAST 相关选项，包括选择打分矩阵和 gap penalties。

图 3.12 显示了 BLAST 2 Sequences 方法产生的一组结果示例，其中比较的是来自 H. sapiens 的转录因子 SOX-1 和来自栉水母 Mnemiopsis leidyi 的 SOX-1；栉水母是至少可追溯到 5 亿年前演化时间的最早分支动物物种（Ryan et al. 2013; Schnitzler et al. 2014）。这种输出与典型 BLAST 输出的主要差异在于，它包含比对的 dot matrix view，即 “dotplot”。Dotplot 旨在以图形方式表示两条被比较序列之间的相似程度，使用户能够快速识别局部比对区域、正向或反向重复、插入、缺失和低复杂度区域。图 3.12 中的 dotplot 指示出两个比对区域；图底部 Alignments 部分则提供了关于这两个比对区域的更多信息。与所有 BLAST 搜索一样，Alignments 部分向用户提供常规的一组分数、E 值，以及 identities、positives 和任何引入缺口的百分比。

图 3.11 执行 BLAST 2 Sequences 比对。点击 Enter Query Sequence 部分底部的复选框，会展开搜索页面，并生成新的 Enter Subject Sequence 部分。这里分别使用来自人类和栉水母 Mnemiopsis leidyi 的转录因子 SOX-1 序列作为 query 和 subject（Schnitzler et al. 2014）。由于已经指定一对一比对，因此 Program Selection 部分中只有 BLASTP 算法可用。常规的一组算法参数仍然可用，使用户能够根据需要微调比对。

图 3.12 BLAST 2 Sequences 比对的典型输出，基于图 3.11 中提交的查询。图上方显示标准图形视图；这里显示在人类和栉水母 Mnemiopsis leidyi 转录因子 SOX-1 序列比对中存在两个高评分片段对（HSP）。Dot matrix view 是比对的另一种视图，其中 query sequence 表示在横轴上，subject sequence 表示在纵轴上；对角线表示两个 HSP 中捕获的比对区域。详细比对结果显示在图底部，并列出每个 HSP 的 E 值和比对统计量。

022

MegaBLAST

PDF page 82 下半 - PDF page 84 上部；印刷页码 62-64

English SourcePDF extracted

MegaBLAST

MegaBLAST is a variation of the BLASTN algorithm that has been optimized specifically for

use in aligning either long or highly similar (>95%) nucleotide sequences and is a method

of choice when looking for exact matches in nucleotide databases. The use of a greedy

gapped alignment routine (Zhang et al. 2000) allows MegaBLAST to handle longer nucleotide

sequences approximately 10 times faster than BLASTN would. MegaBLAST is particularly

well suited to finding whether a sequence is part of a larger contig, detecting potential

sequencing errors, and for comparing large, similar datasets against each other. The run

speeds that are achieved using MegaBLAST come from changing two aspects of the traditional

BLASTN routine. First, longer default word lengths are used; in BLASTN, the default word

length is 11, whereas MegaBLAST uses a default word length of 28. Second, MegaBLAST uses

a non-affine gap penalty scheme, meaning that there is no penalty for opening the gap; there

is only a penalty for extending the gap, with a constant charge for each position in the gap.

MegaBLAST is capable of accepting batch queries by simply pasting multiple sequences in

FASTA format or a list of accession numbers into the query window.
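To make the difference between the two gap schemes concrete, the following minimal Python sketch compares an affine gap cost with the non-affine cost described above. The function names and the penalty values used in the demonstration (5 for opening, 2 for extending) are illustrative assumptions, not MegaBLAST defaults.

```python
def affine_gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Affine scheme: a one-time opening charge plus a charge per gap position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def non_affine_gap_cost(length: int, gap_extend: int) -> int:
    """Non-affine scheme: no opening charge, a constant charge per gap position."""
    return max(length, 0) * gap_extend


if __name__ == "__main__":
    # Hypothetical penalty values, chosen only to show how the two schemes differ.
    for n in (1, 5, 20):
        print(n, affine_gap_cost(n, gap_open=5, gap_extend=2),
              non_affine_gap_cost(n, gap_extend=2))
```

Under the non-affine scheme the cost grows strictly in proportion to gap length, so a single long gap and several short gaps of the same total length are charged identically.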

There is also a variation of MegaBLAST called discontiguous MegaBLAST. This version

has been designed for comparing divergent sequences from different organisms, sequences

where one would expect there to be low sequence identity. This method uses a discontigu-

ous word approach that is quite different from those used by the rest of the programs in the

BLAST suite. Here, rather than looking for query words of a certain length to seed the search,

non-consecutive positions are examined over longer sequence segments (Ma et al. 2002). The

approach has been shown to find statistically significant alignments even when the degree of

similarity between sequences is very low.
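The idea of seeding on non-consecutive positions can be sketched in a few lines of Python. This is a simplified illustration, not the discontiguous MegaBLAST implementation: the template string `110110110` (where `1` marks a position that must match and `0` a position that is ignored) and the function names are assumptions chosen for the example.

```python
def spaced_word(seq: str, start: int, template: str) -> str:
    """Return the bases of seq that fall under the '1' positions of template."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("window runs past the end of the sequence")
    return "".join(base for base, flag in zip(window, template) if flag == "1")


def spaced_seed_hits(query: str, subject: str, template: str) -> list[tuple[int, int]]:
    """List (query_pos, subject_pos) pairs whose spaced words are identical."""
    span = len(template)
    # Index every spaced word in the subject by its start position.
    index: dict[str, list[int]] = {}
    for j in range(len(subject) - span + 1):
        index.setdefault(spaced_word(subject, j, template), []).append(j)
    # Look up every spaced word of the query in that index.
    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(spaced_word(query, i, template), []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    query, subject = "ATGGCCAAG", "ATAGCTAAA"  # differ at every third base
    print(spaced_seed_hits(query, subject, "110110110"))  # spaced seed
    print(spaced_seed_hits(query, subject, "111111111"))  # contiguous 9-mer
```

In this toy case the two sequences differ at every third position, so no contiguous 9-base word is shared, yet the spaced template still produces a seed because the differing positions are the ones it ignores.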

中文译文

MegaBLAST

> 来源：Bioinformatics: A Practical Guide to the Analysis of Sequences and Genomes, 4th ed.

> 范围：PDF page 82 下半 - PDF page 84 上部；印刷页码 62-64。

> 实际 PDF 小节名：MegaBLAST。

> 用途：

MegaBLAST 是 BLASTN 算法的一个变体，经过专门优化，用于比对较长的核苷酸序列，或高度相似（>95%）的核苷酸序列；在核苷酸数据库中寻找精确匹配时，它是一种首选方法。由于使用 greedy gapped alignment routine（贪婪式带缺口比对流程）（Zhang et al. 2000），MegaBLAST 处理较长核苷酸序列的速度大约可达到 BLASTN 的 10 倍。MegaBLAST 特别适合判断某条序列是否属于较大 contig 的一部分、检测潜在测序错误，以及在大型相似数据集之间进行比较。

MegaBLAST 之所以能够达到这样的运行速度，来自对传统 BLASTN 流程中两个方面的改变。第一，它使用更长的默认 word length；在 BLASTN 中，默认 word length 为 11，而 MegaBLAST 使用的默认 word length 为 28。第二，MegaBLAST 使用 non-affine gap penalty scheme，这意味着打开 gap 不会受到罚分；只有延长 gap 时才会受到罚分，并且 gap 中每个位置的罚分是恒定的。MegaBLAST 能够接受批量查询：用户只需将 FASTA 格式的多条序列，或一组 accession numbers，粘贴到 query window 中即可。

MegaBLAST 还有一个变体，称为 discontiguous MegaBLAST。该版本被设计用于比较来自不同生物体的差异较大的序列，也就是那些预期 sequence identity 较低的序列。该方法使用 discontiguous word approach，这与 BLAST suite 中其他程序采用的方法有很大不同。在这里，程序并不是寻找某一长度的连续 query words 来作为搜索种子，而是在较长的序列片段上检查非连续位置（Ma et al. 2002）。已有研究表明，即使序列之间的相似程度很低，这种方法也能够找到具有统计显著性的比对。

023

PSI-BLAST

PDF page 84 中部 - PDF page 89 跨页图注；印刷页码 64-69

English SourcePDF extracted

PSI-BLAST

The variation of the BLAST algorithm known as PSI-BLAST (for position-specific iterated

BLAST) is particularly well suited for identifying distantly related proteins – proteins that

may not have been found using the traditional BLASTP method (Altschul et al. 1997; Altschul

and Koonin 1998). PSI-BLAST relies on the use of position-specific scoring matrices (PSSMs),

which are also often called hidden Markov models or profiles (Schneider et al. 1986; Gribskov

et al. 1987; Staden 1988; Tatusov et al. 1994; Bücher et al. 1996). PSSMs are, quite simply, a

numerical representation of a multiple sequence alignment, much like the multiple sequence

alignments that will be discussed in Chapter 8. Embedded within a multiple sequence align-

ment is intrinsic sequence information that represents the common characteristics of that

particular collection of sequences, frequently a protein family. By using a PSSM, one is able

to use these embedded, common characteristics to find similarities between sequences with

little or no absolute sequence identity, allowing for the identification and analysis of distantly

related proteins. PSSMs are constructed by taking a multiple sequence alignment representing

a protein family and then asking a series of questions, as follows.

• What residues are seen at each position of the alignment?

• How often does a particular residue appear at each position of the alignment?

• Are there positions that show absolute conservation?

• Can gaps be introduced anywhere in the alignment?

As soon as those questions are answered, the PSSM is constructed, and the numbers in the

table now represent the multiple sequence alignment (Figure 3.13). The numbers within the

PSSM reflect the probability of any given amino acid occurring at each position. The PSSM

numbers also reflect the effect of a conservative or non-conservative substitution at each posi-

tion in the alignment, much like the PAM or BLOSUM matrices do. This PSSM now can be

used for comparison against single sequences, or in an iterative approach where newly found

sequences can be incorporated into the original PSSM to find additional sequences that may

be of interest.
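As a rough illustration of how the questions above turn an alignment into numbers, the following Python sketch builds a log-odds PSSM from a small ungapped alignment. It is a simplified stand-in rather than the PSI-BLAST procedure: the uniform background frequencies, the fixed pseudocount, and all function names are assumptions, and the scores are log2 odds ratios, so they are not on the same scale as the values shown in Figure 3.13.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0, background=None):
    """Build a log-odds PSSM (one dict per column) from equal-length aligned sequences."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    if background is None:
        # Assumption: uniform background frequencies, purely for illustration.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

    pssm = []
    for col in range(length):
        # Count residues in this column; gap characters are simply skipped.
        counts = Counter(seq[col] for seq in alignment if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        row = {}
        for aa in background:
            freq = (counts[aa] + pseudocount) / total
            row[aa] = math.log2(freq / background[aa])
        pssm.append(row)
    return pssm


def consensus(pssm):
    """Return the highest-scoring residue at each position."""
    return "".join(max(row, key=row.get) for row in pssm)


def score_sequence(pssm, seq):
    """Sum the per-position scores for an ungapped sequence of matching length."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must match the PSSM length")
    return sum(row[aa] for row, aa in zip(pssm, seq))


if __name__ == "__main__":
    toy_alignment = ["ATPG", "STPG", "ATAG", "ATPG"]  # hypothetical family
    matrix = build_pssm(toy_alignment)
    print(consensus(matrix))
    print(round(matrix[1]["T"], 2), round(matrix[2]["P"], 2))
```

As in Figure 3.13, an absolutely conserved column (the threonine in the toy alignment) receives a higher score than a column where the most common residue is only mostly conserved (the proline).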

The Method

Starting with a query sequence of interest, the PSI-BLAST process operates by taking a query

protein sequence and performing a standard BLASTP search, as described above. This search

produces a number of hits having E values better than a certain set threshold. These hits, along

with the initial, single-query sequence, are used to construct a PSSM in an automated fashion.

As soon as the PSSM is constructed, the PSSM then serves as the query for doing a new search

against the target database, using the collective characteristics of the identified sequences to

find new, related sequences. The process continues, round by round, either until the search

converges (meaning that no new sequences were found in the last round) or until the limit on

the number of iterations is reached.
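The round-by-round logic described above can be sketched as a short Python loop. This is only a control-flow outline under stated assumptions: `search` and `build_profile` are hypothetical callables supplied by the caller (they stand in for the database search and for PSSM construction), and the default threshold and iteration limit are illustrative.

```python
from typing import Callable, Iterable


def iterate_search(
    query: str,
    search: Callable[[object], Iterable[tuple[str, float]]],
    build_profile: Callable[[str, set[str]], object],
    e_threshold: float = 0.001,
    max_iterations: int = 5,
) -> tuple[set[str], int]:
    """Repeat search rounds until no new hits appear or the round limit is reached.

    Returns the set of included hit names and the number of rounds that were run.
    """
    included: set[str] = set()
    profile: object = query  # round 1 searches with the single query sequence
    for round_number in range(1, max_iterations + 1):
        # Keep only hits whose E value passes the inclusion threshold.
        hits = {name for name, e_value in search(profile) if e_value <= e_threshold}
        new_hits = hits - included
        if not new_hits:
            return included, round_number  # converged: nothing new this round
        included |= new_hits
        # Rebuild the profile from the query plus everything included so far.
        profile = build_profile(query, included)
    return included, max_iterations
```

Passing the search and profile-building steps in as functions keeps the sketch independent of any particular BLAST installation; in the web interface, the user plays the role of the loop by reviewing the hits and clicking through each round.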

Performing a PSI-BLAST Search

PSI-BLAST searches can be initiated by following the Protein BLAST link on the BLAST landing page (Figure 3.5). The search page shown in Figure 3.14 is identical to the one shown in the BLASTP example discussed earlier in this chapter. Here, the sequence of the human sex-determining protein SRY from UniProtKB/Swiss-Prot (Q05066) will be used as the query, using UniProtKB/Swiss-Prot as the target database and limiting returned results to human sequences. PSI-BLAST is selected in the Program Selection section and, as before, selected changes will be made to the default parameters (Figure 3.15). The maximum number of target sequences has been raised from 500 to 1000, as a safeguard in case a large number of sequences in UniProtKB/Swiss-Prot match the query. In addition, both the E value threshold and the PSI-BLAST threshold have been changed to 0.001, and filtering of low-complexity regions has been enabled. The query can now be issued as before by clicking on the blue “BLAST” button at the bottom of the page.

Figure 3.13 Constructing a position-specific scoring matrix (PSSM). In the upper portion of the figure is a multiple sequence alignment of length 10. Using the criteria described in the text, the PSSM corresponding to this multiple sequence alignment is shown in the lower portion of the figure. Each row of the PSSM corresponds to a column in the multiple sequence alignment. Note that position 8 of the alignment always contains a threonine residue (T), whereas position 10 always contains a glycine (G). Looking at the corresponding scores in the matrix, in row 8, the threonine scores 150 points; in row 10, the glycine also scores 150 points. These are the highest values in the row, corresponding to the fact that the multiple sequence alignment shows absolute conservation at those positions. Now, consider position 9, where most of the sequences have a proline (P) at that position. In row 9 of the PSSM, the proline scores 89 points – still the highest value in the row, but not as high a score as would have been conferred if the proline residue were absolutely conserved across all sequences. The first column of the PSSM provides the deduced consensus sequence.

The results of the first round of the search are shown in Figure 3.16, with 31 sequences found in the first round (at the time of this writing). The structure of the hit list table is exactly as before, now containing two additional columns that are specific to PSI-BLAST. The first shows a column of check boxes that are all selected; this instructs the algorithm to use all the sequences to construct the first PSSM for this particular search. Keeping in mind that the first round of any PSI-BLAST search is simply a BLASTP search and that no PSSM has yet been constructed, the second column is blank. To run the next iteration of PSI-BLAST, simply click the “Go” button at the bottom of this section. At this point, the first PSSM is constructed based on a multiple sequence alignment of the sequences selected for inclusion, and the matrix is now used as the query against Swiss-Prot. The results of this second round are shown in Figure 3.17, with the final two columns indicating which sequences are to be used in constructing the new PSSM for the next round of searches, as well as which sequences were used to build the PSSM for the current round. Also note that a good number of the sequences are highlighted in yellow; here, 26 additional sequences that scored below the PSI-BLAST threshold in the first round have now been pulled into the search results. This provides an excellent example of how PSSMs can be used to discover new relationships during each PSI-BLAST iteration, thereby making it possible to identify additional homologs that may not have been found using the standard BLASTP approach. Of course, the user should always check the E values and percent identities for all returned results before passing them through to the next round, unchecking inclusion boxes as needed. There may also be cases where prior knowledge would argue for removing some of the found sequences based on the descriptors. As with all computational methods, it is always important to keep biology in mind when reviewing the results.

Figure 3.14 Performing a PSI-BLAST search. See text for details.

Figure 3.15 Selecting algorithm parameters for a PSI-BLAST search. See text for details.

Figure 3.16 Results of the first round of a PSI-BLAST search. For each sequence found, the user is presented with the definition line from the corresponding UniProtKB/Swiss-Prot entry, the score value for the best high-scoring segment pair (HSP) alignment, the total of all scores across all HSP alignments, the percentage of the query covered by the HSPs, and the E value and percent identity for the best HSP alignment. The hyperlinked accession number allows for direct access to the source database record for that hit. Sequences whose “Select for PSI blast” box is checked will be used to calculate a position-specific scoring matrix (PSSM), and that PSSM then serves as the new “query” for the next round, the results of which are shown in Figure 3.17.

Figure 3.17 Results of the second round of a PSI-BLAST search. New sequences identified through the use of the position-specific scoring matrix (PSSM) calculated based on the results shown in Figure 3.16 are highlighted in yellow. Check marks in the right-most column indicate which sequences were used to build the PSSM producing these results.

中文译文

PSI-BLAST

> 来源：Bioinformatics: A Practical Guide to the Analysis of Sequences and Genomes, 4th ed.

> 范围：PDF page 84 中部 - PDF page 89 跨页图注；印刷页码 64-69。

> 实际 PDF 小节名：PSI-BLAST。

> 用途：

PSI-BLAST 是 BLAST 算法的一个变体，全称为 position-specific iterated BLAST。它特别适合识别远缘相关蛋白，也就是那些可能无法用传统 BLASTP 方法找到的蛋白（Altschul et al. 1997; Altschul and Koonin 1998）。PSI-BLAST 依赖 position-specific scoring matrices（PSSMs，位置特异性打分矩阵），这类矩阵也常被称为 hidden Markov models 或 profiles（Schneider et al. 1986; Gribskov et al. 1987; Staden 1988; Tatusov et al. 1994; Bücher et al. 1996）。简单地说，PSSM 是 multiple sequence alignment 的数值化表示；multiple sequence alignment 将在第 8 章讨论。Multiple sequence alignment 内嵌有序列信息，这些信息代表该组序列的共同特征，而这组序列通常对应一个蛋白家族。通过使用 PSSM，可以利用这些内嵌的共同特征，在几乎没有或完全没有绝对 sequence identity 的序列之间寻找相似性，从而识别和分析远缘相关蛋白。

PSSM 的构建方式是：取一个代表某个蛋白家族的 multiple sequence alignment，然后提出以下一系列问题。

在比对的每一个位置上可以看到哪些残基？
某一种特定残基在比对的每一个位置上出现的频率是多少？
是否存在显示绝对保守性的位置？
是否可以在比对中的任何位置引入 gaps？

一旦这些问题得到回答，PSSM 就被构建出来；表中的数字此时就代表该 multiple sequence alignment（图 3.13）。PSSM 中的数字反映任意给定氨基酸出现在每一个位置上的概率。PSSM 的数字还反映在比对中每一个位置发生保守替换或非保守替换的影响，这与 PAM 或 BLOSUM 矩阵的作用很相似。随后，这个 PSSM 可以用于与单条序列比较；也可以用于迭代方法，在该方法中新发现的序列可被并入原始 PSSM，以寻找更多可能感兴趣的序列。

The Method

以一条感兴趣的 query sequence 为起点，PSI-BLAST 的流程首先取一条 query protein sequence，并按前文所述执行一次标准 BLASTP 搜索。该搜索会产生一批 E values 优于某个设定阈值的 hits。这些 hits 连同最初的单条 query sequence 一起，被自动用于构建一个 PSSM。PSSM 构建完成后，它随即作为 query，对目标数据库执行新的搜索；这一次搜索利用已识别序列的集合特征来寻找新的相关序列。该过程逐轮继续，直到搜索收敛，或达到迭代次数上限为止。这里的收敛是指上一轮中没有发现新的序列。

Performing a PSI-BLAST Search

PSI-BLAST 搜索可以从 BLAST landing page（图 3.5）上的 Protein BLAST 链接启动。图 3.14 所示的搜索页面与本章前面 BLASTP 示例中展示的页面相同。这里将使用 UniProtKB/Swiss-Prot 中的人类 sex-determining protein SRY（Q05066）作为 query，使用 UniProtKB/Swiss-Prot 作为目标数据库，并将返回结果限制为人类序列。在 Program Selection 部分选择 PSI-BLAST，并且像前面一样，对默认参数作若干选择性修改（图 3.15）。Maximum number of target sequences 已从 500 提高到 1000，这是为了防止 UniProtKB/Swiss-Prot 中有大量序列与 query 匹配。同时，E value threshold 和 PSI-BLAST threshold 都被改为 0.001，并启用了对低复杂度区域的过滤。此时，用户可以像前面一样，点击页面底部的蓝色 “BLAST” 按钮提交 query。

第一轮搜索结果如图 3.16 所示；在写作本书时，第一轮找到了 31 条序列。Hit list table 的结构与前面完全相同，但现在包含两个 PSI-BLAST 特有的附加列。第一列显示一列复选框，并且这些复选框全部被选中；这会指示算法使用所有这些序列，为本次特定搜索构建第一个 PSSM。需要记住，任何 PSI-BLAST 搜索的第一轮本质上都只是一次 BLASTP 搜索，而且此时还没有构建 PSSM；因此，第二个附加列为空。若要运行 PSI-BLAST 的下一轮迭代，只需点击该部分底部的 “Go” 按钮。此时，第一个 PSSM 会根据被选中纳入的序列所形成的 multiple sequence alignment 构建出来，并且该矩阵现在被用作 query 来搜索 Swiss-Prot。第二轮结果如图 3.17 所示；最后两列显示哪些序列将用于构建下一轮搜索的新 PSSM，以及哪些序列曾用于构建当前这一轮的 PSSM。还应注意，许多序列以黄色高亮显示；这里有 26 条在第一轮中低于 PSI-BLAST threshold 的新增序列已经被纳入搜索结果。这很好地展示了 PSSM 如何在 PSI-BLAST 的每次迭代中发现新的关系，从而使研究者能够识别出使用标准 BLASTP 方法可能无法找到的其他同源物。当然，在把所有返回结果传递到下一轮之前，用户应始终检查这些结果的 E values 和 percent identities，并根据需要取消勾选 inclusion boxes。也可能存在这样的情况：根据已有生物学知识，某些已找到序列应当因其 descriptors 而被移除。与所有计算方法一样，在审查结果时始终把生物学放在心里是非常重要的。

图 3.13 构建 position-specific scoring matrix（PSSM）。图的上半部分是一段长度为 10 的 multiple sequence alignment。按照正文中描述的标准，与该 multiple sequence alignment 对应的 PSSM 显示在图的下半部分。PSSM 的每一行对应 multiple sequence alignment 中的一列。注意，比对的第 8 位始终含有一个 threonine residue（T），而第 10 位始终含有一个 glycine（G）。查看矩阵中的对应分数可见，在第 8 行中 threonine 得 150 分；在第 10 行中 glycine 也得 150 分。这些是所在行中的最高值，对应于 multiple sequence alignment 在这些位置显示绝对保守性这一事实。现在再看第 9 位，该位置上多数序列为 proline（P）。在 PSSM 第 9 行中，proline 得 89 分，仍然是该行中的最高值，但低于如果所有序列中 proline residue 都绝对保守时所会得到的分数。PSSM 的第一列给出推断出的 consensus sequence。

图 3.14 执行 PSI-BLAST 搜索。详见正文说明。

图 3.15 选择 PSI-BLAST 搜索的算法参数。详见正文说明。

图 3.16 PSI-BLAST 搜索第一轮的结果。对于每一条找到的序列，用户会看到来自相应 UniProtKB/Swiss-Prot 条目的 definition line、最佳 high-scoring segment pair（HSP）比对的 score value、所有 HSP 比对分数的总和、HSP 覆盖 query 的百分比，以及最佳 HSP 比对的 E value 和 percent identity。带超链接的 accession number 允许用户直接访问该 hit 在源数据库中的记录。那些 “Select for PSI blast” 框被勾选的序列，将用于计算 position-specific scoring matrix（PSSM）；随后该 PSSM 会作为下一轮的新 “query”，其结果见图 3.17。

图 3.17 PSI-BLAST 搜索第二轮的结果。通过使用基于图 3.16 所示结果计算出的 position-specific scoring matrix（PSSM）而识别出的新序列，以黄色高亮显示。最右侧列中的勾号表示哪些序列被用于构建产生这些结果的 PSSM。

024

BLAT

PDF page 86 下部 - PDF page 91 顶部跨页图注；印刷页码 66-71

English SourcePDF extracted

BLAT

In response to the assembly needs of the Human Genome Project, a new nucleotide sequence

alignment program called BLAT (for BLAST-Like Alignment Tool) was introduced (Kent

2002). BLAT is most similar to the MegaBLAST version of BLAST in that it is designed

to rapidly align longer nucleotide sequences having more than 95% similarity. However,

the BLAT algorithm uses a slightly different strategy than BLAST to achieve faster speeds.

Before any searches are performed, the target databases are pre-indexed, keeping track of

all non-overlapping 11-mers; this index is then used to find regions similar to the query

sequence. BLAT is often used to find the position of a sequence of interest within a genome

or to perform cross-species analyses.
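The pre-indexing strategy can be illustrated with a small Python sketch. It is a toy model rather than BLAT itself: the function names are hypothetical, and scanning the query with every overlapping k-mer is an assumption of this sketch, made so that a match can be found regardless of how the query is offset relative to the indexed, non-overlapping target words.

```python
from collections import defaultdict


def build_index(target: str, k: int = 11) -> dict[str, list[int]]:
    """Index the non-overlapping k-mers of target, i.e. those at positions 0, k, 2k, ..."""
    index: dict[str, list[int]] = defaultdict(list)
    for pos in range(0, len(target) - k + 1, k):
        index[target[pos:pos + k]].append(pos)
    return dict(index)


def find_seeds(query: str, index: dict[str, list[int]], k: int = 11) -> list[tuple[int, int]]:
    """Look up every overlapping k-mer of the query; return (query_pos, target_pos) pairs."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], []):
            seeds.append((qpos, tpos))
    return seeds


if __name__ == "__main__":
    # k = 4 keeps the toy example readable; the text describes 11-mers.
    target = "AAAACCCCGGGGTTTT"
    idx = build_index(target, k=4)
    print(idx)
    print(find_seeds("TCCCCGG", idx, k=4))
```

Because only every k-th word of the target is stored, the index is roughly k times smaller than one built from all overlapping words, and it is built once and reused for every query.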

As an example, consider a case where an investigator wishes to map a cDNA clone coming from the Cancer Genome Anatomy Project (CGAP) to the rat genome. The BLAT query page is shown in Figure 3.18, and the sequence of the clone of interest has been pasted into the sequence box. Above the sequence box are several pull-down menus that can be used to specify which genome should be searched (organism), which assembly should be used (usually, the most recent), and the query type (DNA, protein, translated DNA, or translated RNA). Once the appropriate choices have been made, the search is commenced by pressing the “Submit” button. The results of the query are shown in the upper panel of Figure 3.19; here, the hit with the highest score is shown at the top of the list, a match having 98.1% identity with the query sequence. More details on this hit can be found by clicking the “details” hyperlink, to the left of the entry. A long web page is then returned, providing information on the original query, the genomic sequence, and an alignment of the query against the found genomic sequence (Figure 3.19, bottom panel). The genomic sequence here is labeled chr5, meaning that the query corresponds to a region of rat chromosome 5. Matching bases in the cDNA and genomic sequences are colored in dark blue and are capitalized. Lighter blue uppercase bases mark the boundaries of aligned regions and often signify splice sites. Gaps and unaligned regions are indicated by lowercase black type. In the Side by Side Alignment, exact matches are indicated by the vertical line between the two sequences. Clicking on the “browser” hyperlink in the upper panel of Figure 3.19 would take the user to the UCSC Genome Browser, where detailed information about the genomic assembly in this region of rat chromosome 5 (specifically, at 5q31) can be obtained (cf. Chapter 4).

Figure 3.18 Submitting a BLAT query. A rat clone from the Cancer Genome Anatomy Project Tumor Gene Index (CB312815) is the query. The pull-down menus at the top of the page can be used to specify which genome should be searched (organism), which assembly should be used (usually, the most recent), and the query type (DNA, protein, translated DNA, or translated RNA). The “I’m feeling lucky” button returns only the highest scoring alignment and provides a direct path to the UCSC Genome Browser.

Figure 3.19 Results of a BLAT query. Based on the query submitted in Figure 3.18, the highest scoring hit is to a sequence on chromosome 5 of the rat genome having 98.1% sequence identity. Clicking on the “details” hyperlink brings the user to additional information on the found sequence, shown in the lower panel. Matching bases in the cDNA and genomic sequences are colored in dark blue and are capitalized. Lighter blue uppercase bases mark the boundaries of aligned regions and often signify splice sites. Gaps are indicated by lowercase black type. In the side-by-side alignment, exact matches are indicated by the vertical line between the sequences.

中文译文

BLAT

> 来源：Bioinformatics: A Practical Guide to the Analysis of Sequences and Genomes, 4th ed.

> 范围：PDF page 86 下部 - PDF page 91 顶部跨页图注；印刷页码 66-71。

> 实际 PDF 小节名：BLAT。

> 用途：

BLAT 是为满足 Human Genome Project 的组装需求而引入的一种新的核苷酸序列比对程序（BLAST-Like Alignment Tool）（Kent 2002）。BLAT 与 BLAST 的 MegaBLAST 版本最为相似，因为它的设计目标是快速比对长度更长、相似度超过 95% 的核苷酸序列。不过，BLAT 算法采用了一种与 BLAST 略有不同的策略来实现更快的速度。在执行任何搜索之前，目标数据库都会先完成预索引，记录所有互不重叠的 11-mers；随后利用这个索引来寻找与 query sequence 相似的区域。BLAT 常用于确定某条感兴趣序列在基因组中的位置，或进行跨物种分析。

例如，假设某研究者希望将来自 Cancer Genome Anatomy Project（CGAP）的一个 cDNA clone 映射到大鼠基因组。图 3.18 显示了 BLAT query page，感兴趣的 clone 序列已粘贴到 sequence box 中。在 sequence box 上方，有若干 pull-down menus，可用于指定要搜索的是哪个基因组（organism）、使用哪个 assembly（通常是最新版本），以及 query type（DNA、protein、translated DNA 或 translated RNA）。完成相应选择后，点击 “Submit” 按钮即可开始搜索。查询结果显示在图 3.19 的上方面板中；这里，得分最高的 hit 排在列表顶部，是一个与 query sequence 具有 98.1% identity 的匹配。若要获取该 hit 的更多信息，可以点击该条目左侧的 “details” hyperlink。随后会返回一个较长的网页，其中提供原始 query、基因组序列，以及 query 与所找到的基因组序列之间的 alignment。

图 3.18 提交 BLAT 查询。这里的 query 是来自 Cancer Genome Anatomy Project Tumor Gene Index 的一个大鼠克隆（CB312815）。页面顶部的 pull-down menus 可用于指定要搜索的基因组（organism）、使用的 assembly（通常是最新版本），以及 query type（DNA、protein、translated DNA 或 translated RNA）。“I’m feeling lucky” 按钮只返回得分最高的 alignment，并直接进入 UCSC Genome Browser。

图 3.19 BLAT 查询的结果。根据图 3.18 中提交的 query，最高分的 hit 是大鼠基因组染色体 5 上的一段序列，其 sequence identity 为 98.1%。点击 “details” hyperlink 后，用户会看到关于该序列的更多信息，如下方面板所示。cDNA 与基因组序列中的匹配碱基以深蓝色显示，并以大写字母表示；较浅的蓝色大写字母标记了比对区域的边界，并且常常表示 splice sites。缺口用小写黑字表示。在 side-by-side alignment 中，精确匹配由两条序列之间的竖线表示。

025

FASTA

PDF page 90 下部 - PDF page 95；印刷页码 70-75

English SourcePDF extracted

FASTA

While the most commonly used technique for detecting similarity between sequences is

BLAST, it is not the only heuristic method that can be used to rapidly and accurately compare

sequences with one another. In fact, the first widely used program designed for database sim-

ilarity searching was FASTA (Lipman and Pearson 1985; Pearson and Lipman 1988; Pearson

2000). Like BLAST, FASTA enables the user to rapidly compare a query sequence against large

databases, and various versions of the program are available (Table 3.3). In addition to the main

implementations, a variety of specialized FASTA versions are available, described in detail

in Pearson (2016). An interesting historical note is that the FASTA format for representing

nucleotide and protein sequences originated with the development of the FASTA algorithm.

Table 3.3 Main FASTA algorithms.

Program | Query | Database | Corresponding BLAST Program
FASTA | Nucleotide | Nucleotide | BLASTN
FASTA | Protein | Protein | BLASTP
FASTX/FASTY | DNA | Protein | BLASTX
TFASTX/TFASTY | Protein | Translated DNA | TBLASTN

The Method

The FASTA algorithm can be divided into four major steps. In the first step, FASTA deter-

mines all overlapping words of a certain length both in the query sequence and in each of

the sequences in the target database, creating two lists in the process. Here, the word length

parameter is called ktup, which is the equivalent of W in BLAST. These lists of overlapping

words are compared with one another in order to identify any words that are common to the

two lists. The method then looks for word matches that are in close proximity to one another

and connects them to each other (intervening sequence included), without introducing any

gaps. This can be represented using a dotplot format (Figure 3.20a). Once this initial round of

connections are made, an initial score (init1) is calculated for each of the regions of similarity.
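The word-matching part of this first step can be sketched in Python as follows. The sketch finds words of length ktup shared by two sequences and groups them by dotplot diagonal; the function names are hypothetical, and ranking diagonals by the number of word hits is a simplification standing in for the init1 score, which is not reproduced here.

```python
from collections import defaultdict


def word_positions(seq: str, ktup: int) -> dict[str, list[int]]:
    """Map every overlapping word of length ktup to its start positions in seq."""
    table: dict[str, list[int]] = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i)
    return dict(table)


def diagonal_hits(query: str, subject: str, ktup: int = 2) -> dict[int, list[int]]:
    """Group shared words by diagonal (subject_pos - query_pos).

    Each value is the sorted list of query start positions of the words on that diagonal.
    """
    subject_words = word_positions(subject, ktup)
    diagonals: dict[int, list[int]] = defaultdict(list)
    for word, q_positions in word_positions(query, ktup).items():
        for j in subject_words.get(word, []):
            for i in q_positions:
                diagonals[j - i].append(i)
    return {d: sorted(hits) for d, hits in diagonals.items()}


def rank_diagonals(query: str, subject: str, ktup: int = 2) -> list[int]:
    """Order diagonals by number of word hits (most first), ties broken by diagonal index."""
    hits = diagonal_hits(query, subject, ktup)
    return sorted(hits, key=lambda d: (-len(hits[d]), d))


if __name__ == "__main__":
    print(diagonal_hits("HEAGAWGHEE", "PAWHEAE", ktup=2))
    print(rank_diagonals("HEAGAWGHEE", "PAWHEAE", ktup=2))
```

Word hits that share a diagonal value lie on the same line of the dotplot, which is why nearby hits on one diagonal can be connected without introducing gaps.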

In step 2, only the 10 best regions for a given pairwise alignment are considered for further

analysis (Figure 3.20b). FASTA now tries to join together regions of similarity that are close to

each other in the dotplot but that do not lie on the same diagonal, with the goal of extending

the overall length of the alignment (Figure 3.20c). This means that insertions and deletions are

now allowed, but there is a joining penalty for each of the diagonals that are connected. The

net score for any two diagonals that have been connected is the sum of the score of the original

diagonals, less the joining penalty. This new score is referred to as initn.
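The initn arithmetic can be written out as a tiny Python function. The text defines the net score for two connected diagonals; extending it to longer chains by charging the penalty once per join is an assumption of this sketch, as are the function name and the example values.

```python
def initn_score(region_scores: list[int], joining_penalty: int) -> int:
    """Sum the scores of the joined regions, charging the joining penalty once per join."""
    if not region_scores:
        return 0
    joins = len(region_scores) - 1  # two regions need one join, three need two, ...
    return sum(region_scores) - joining_penalty * joins


if __name__ == "__main__":
    # Hypothetical scores: two regions scoring 50 and 30, joined with a penalty of 20.
    print(initn_score([50, 30], joining_penalty=20))
```

A join is only worthwhile when the added region scores more than the penalty; otherwise the single best region already gives the higher score.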

In step 3, FASTA ranks all of the resulting diagonals, and then further considers only the

“best” diagonals in the list. For each of the best diagonals, FASTA uses a modification of the

Smith–Waterman algorithm (1981) to come up with the optimal pairwise alignment between

the two sequences being considered. A final, optimal score (opt) is calculated on this pairwise

alignment.

Figure 3.20 The FASTA search strategy. (a) Once FASTA determines words of length ktup common to the query sequence and the target sequence, it connects words that are close to each other, and these are represented by the diagonals. (b) After an initial round of scoring, the top 10 diagonals are selected for further analysis. (c) The Smith–Waterman algorithm is applied to yield the optimal pairwise alignment between the two sequences being considered. See text for details.

In the fourth and final step, FASTA assesses the significance of the alignments by estimat-

ing what the anticipated distribution of scores would be for randomly generated sequences

having the same overall composition (i.e. sequence length and distribution of amino acids

or nucleotides). Based on this randomization procedure and on the results from the original

query, FASTA calculates an expectation value E (similar to the BLAST E value), which, as

before, represents the number of hits with an equal or better score expected to occur purely by chance.

Running a FASTA Search

The University of Virginia provides a web front-end for issuing FASTA queries. Various pro-

tein and nucleotide databases are available, and up to two databases can be selected for use in

a single run. From this page, the user can also specify the scoring matrix to be used, gap and

extension penalties, and the value for ktup. The default values for ktup are 2 for protein-based

searches and 6 for nucleotide-based searches; lowering the value of ktup increases the sensitiv-

ity of the run, at the expense of speed. The user can also limit the results returned to particular

E values.

The results returned by a FASTA query are in a significantly different format than those

returned by BLAST. Consider a FASTA search using the sequence of histone H2B.3 from the

highly regenerative cnidarian Hydractinia, one of four novel H2B variants used in place of

protamines to compact sperm DNA (KX622131.1; Török et al. 2016), as the query. The first

part of the FASTA output resulting from a search using BLOSUM62 as the scoring matrix and

Swiss-Prot as the target database is shown in Figure 3.21, summarizing the results as a his-

togram. The histogram is intended to convey the distribution of all similarity scores computed

in the course of this particular search. The first column represents bins of similarity scores,

with the scores increasing as one moves down the page. The second column gives the actual

number of sequences observed to fall into each one of these bins. This count is also represented

by the length of each of the lines in the histogram, with each of the equals signs representing

a certain number of sequences; in the figure, each equals sign corresponds to 130 sequences

from UniProtKB/Swiss-Prot. The third column of numbers represents how many sequences

would be expected to fall into each one of the bins; this is indicated by the asterisks in the

histogram. The hit list would immediately follow, and a portion of the hit list for this search

is shown in Figure 3.22. Here, the accession number and partial definition line for each hit is

given, along with its optimal similarity score (opt), a normalized score (bit), the expectation

value (E), percent identity and similarity figures, and the aligned length. Not shown here are

the individual alignments of each hit to the original query sequence, which would be found by

further scrolling down in the output. In the pairwise alignments, exact matches are indicated

by a colon, while conservative substitutions are indicated by a dot.

Statistical Significance of Results

As before, the E value from a FASTA search represents the number of hits with an equal or better score that would be expected to occur purely by chance. Pearson (2016) puts forth the following guidelines for inferring homology from protein-based searches, which are slightly different from those previously described for BLAST: an E value < 10⁻⁶ almost certainly implies homology. When E < 10⁻³, the query and found sequences are almost always homologous, but the user should confirm that the highest scoring unrelated sequence has an E value near 1.

Comparing FASTA and BLAST

Figure 3.21 Search summary from a protein–protein FASTA search, using the sequence of histone H2B.3 from Hydractinia echinata (KX622131.1; Török et al. 2016) as the query and BLOSUM62 as the scoring matrix. The header indicates that the query is against the Swiss-Prot database. The histogram indicates the distribution of all similarity scores computed for this search. The left-most column provides a normalized similarity score, and the column marked opt gives the number of sequences with that score. The column marked E() gives the number of sequences expected to achieve the score in the first column. In this case, each equals sign in the histogram represents 130 sequences in Swiss-Prot. The asterisks in each row indicate the expected, random distribution of hits. The inset is a magnified version of the histogram in that region.

Figure 3.22 Hit list for the protein–protein FASTA search described in Figure 3.21. Only the first 18 hits are shown. For each hit, the accession number and partial definition line for the hit is provided. The column marked opt gives the raw similarity score, the column marked bits gives a normalized bit score (a measure of similarity between the two sequences), and the column marked E gives the expectation value. The percentage columns indicate percent identity and percent similarity, respectively. The alen column gives the total aligned length for each hit. The +- characters shown at the beginning of some lines indicate that more than one alignment was found between the query and subject; in the case of the first hit (Q7Z5P9), four alignments were returned. The align link at the end of each row takes the user to the alignment for that hit (not shown).

Since both FASTA and BLAST employ rigorous algorithms to find sequences that are statistically (and hopefully biologically) relevant, it is logical to ask which one of the methods is the better choice. There actually is no good answer to the question, since both of the methods bring significant strengths to the table. Summarized below are some of the fine points that distinguish the two methods from one another.

• FASTA begins the search by looking for exact matches of words, while BLAST allows for

conservative substitutions in the first step.

• BLAST allows for automatic masking of sequences, while FASTA does not.

• FASTA will return one and only one alignment for a sequence in the hit list, while BLAST

can return multiple results for the same sequence, each result representing a distinct HSP.

• Since FASTA uses a version of the more rigorous Smith–Waterman alignment method, it

generally produces better final alignments and is more apt to find distantly related sequences

than BLAST. For highly similar sequences, their performance is fairly similar.

• When comparing translated DNA sequences with protein sequences or vice versa, FASTA

(specifically, FASTX/FASTY for translated DNA →protein and TFASTX/TFASTY for pro-

tein →translated DNA) allows for frameshifts.

• BLAST runs faster than FASTA, since FASTA is more computationally intensive.

Several studies have attempted to answer the “which method is better” question by performing systematic analyses with test datasets (Pearson 1995; Agarawal and States 1998; Chen 2003). In one such study, Brenner et al. (1998) performed tests using a dataset derived from already known homologies documented in the Structural Classification of Proteins database (SCOP; Chapter 12). They found that FASTA performed better than BLAST in finding relationships between proteins having >30% sequence identity, and that the performance of all methods declines below 30%. Importantly, while the statistical values reported by BLAST slightly underestimated the true extent of errors when looking for known relationships, they found that BLAST and FASTA (with ktup = 2) were both able to detect most known relationships, calling them both “appropriate for rapid initial searches.”

Translation

FASTA

> Source: Bioinformatics: A Practical Guide to the Analysis of Sequences and Genomes, 4th ed.

> Range: lower part of PDF page 90 through PDF page 95; printed pages 70-75.

> Actual PDF section name: FASTA.

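Although BLAST is the most commonly used technique for detecting sequence similarity, it is not the only heuristic method that can compare sequences rapidly and accurately. In fact, the first program widely used for database similarity searching was FASTA (Lipman and Pearson 1985; Pearson and Lipman 1988; Pearson 2000). Like BLAST, FASTA lets the user rapidly compare a query sequence against large databases, and several versions of the program are available (Table 3.3). Beyond the main implementations, a variety of specialized FASTA versions exist; these are described in detail in Pearson (2016). An interesting historical note is that the FASTA format used to represent nucleotide and protein sequences arose alongside the development of the FASTA algorithm.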

Table 3.3 Main FASTA algorithms.

Program	Query	Database	Corresponding BLAST Program
FASTA	Nucleotide	Nucleotide	BLASTN
FASTA	Protein	Protein	BLASTP
FASTX/FASTY	DNA	Protein	BLASTX
TFASTX/TFASTY	Protein	Translated DNA	TBLASTN

The Method

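The FASTA algorithm can be divided into four major steps. In the first step, FASTA determines all overlapping words of a certain length in both the query sequence and each sequence in the target database, creating two lists in the process. The word length parameter here is called ktup, and it is the equivalent of W in BLAST. These lists of overlapping words are then compared with each other to identify words common to both lists. The method then looks for word matches that lie close to each other and connects them, including the intervening sequence but without introducing any gaps. This can be represented in dotplot format (Figure 3.20a). Once this initial round of connections is made, an initial score (init1) is calculated for each region of similarity.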
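The word-matching step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not FASTA's actual implementation: the function names `ktup_index` and `diagonal_hits` are hypothetical, and a plain count of exact word hits per diagonal stands in for the matrix-based init1 score.

```python
from collections import defaultdict


def ktup_index(seq, ktup):
    """Map every overlapping word of length ktup to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def diagonal_hits(query, subject, ktup=2):
    """Count exact word matches on each diagonal.

    A diagonal is identified by (subject position - query position), so
    word matches that line up without gaps share the same diagonal.
    """
    q_index = ktup_index(query, ktup)
    hits = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        word = subject[j:j + ktup]
        # .get() avoids adding new keys to the query index
        for i in q_index.get(word, ()):
            hits[j - i] += 1
    return dict(hits)
```

For example, `diagonal_hits("ACDEFG", "XXACDEFG", ktup=2)` returns `{2: 5}`: all five shared two-letter words fall on the diagonal offset by two positions, which is the kind of dense diagonal that would be carried forward for scoring.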

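In the second step, only the 10 best regions for a given pairwise alignment are carried forward for further analysis (Figure 3.20b). FASTA now tries to join regions of similarity that are close to each other in the dotplot but do not lie on the same diagonal, with the goal of extending the overall length of the alignment (Figure 3.20c). This means that insertions and deletions are now allowed, but a joining penalty is incurred for each diagonal that is connected. The net score for any two joined diagonals is the sum of the scores of the original diagonals minus the joining penalty. This new score is referred to as initn.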
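The arithmetic behind the joined score can be made concrete with a short sketch. The function names and the example penalty of 20 are assumptions chosen for illustration; the sketch charges one penalty per join and ignores how FASTA decides which regions are compatible.

```python
def initn_score(region_scores, joining_penalty):
    """Sum of the region scores minus one joining penalty per join."""
    if not region_scores:
        return 0
    joins = len(region_scores) - 1
    return sum(region_scores) - joining_penalty * joins


def worth_joining(score_a, score_b, joining_penalty):
    """Joining pays off only if it beats the better region on its own."""
    joined = initn_score([score_a, score_b], joining_penalty)
    return joined > max(score_a, score_b)
```

With a penalty of 20, regions scoring 50 and 40 join to give 70, which improves on either region alone, whereas joining a region scoring 50 to one scoring 15 gives 45 and is not worthwhile.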

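In the third step, FASTA ranks all of the resulting diagonals and then considers further only the "best" diagonals in the list. For each of the best diagonals, FASTA uses a modified form of the Smith-Waterman algorithm (1981) to arrive at an optimal pairwise alignment between the two sequences under consideration. A final, optimal score (opt) is then calculated on this pairwise alignment.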
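For readers unfamiliar with the Smith-Waterman recurrence, the following is a minimal sketch of the local alignment score. It is not the modified form used by FASTA: it fills the full dynamic programming matrix rather than restricting itself to the region around the best diagonals, and it assumes simple match and mismatch scores with a linear gap penalty instead of a scoring matrix such as BLOSUM62. The function name and default values are illustrative.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + pair, # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Because every cell is floored at zero, unrelated flanking sequence does not drag the score down: `smith_waterman_score("XXACGTXX", "ACGT")` gives the same score as aligning `"ACGT"` with itself.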

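In the fourth and final step, FASTA assesses the significance of the alignments by estimating the expected distribution of scores for randomly generated sequences having the same overall composition (that is, the same sequence length and distribution of amino acids or nucleotides). Based on this randomization procedure and on the results from the original query, FASTA calculates an expectation value E (similar to the BLAST E value) which, as before, represents the probability that a reported hit arose purely by chance.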
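The idea of comparing an observed score with scores from randomized sequences of the same length and composition can be illustrated with a shuffling experiment. This is only a sketch of the principle; FASTA's own statistical estimation is more sophisticated. The names `shared_word_score` and `shuffle_p_value`, the toy scoring function, and the defaults of 200 shuffles and seed 0 are all assumptions made for the example.

```python
import random


def shared_word_score(a, b, ktup=2):
    """Toy similarity score: number of distinct ktup words found in both."""
    words_a = {a[i:i + ktup] for i in range(len(a) - ktup + 1)}
    words_b = {b[i:i + ktup] for i in range(len(b) - ktup + 1)}
    return len(words_a & words_b)


def shuffle_p_value(query, subject, score_fn, n_shuffles=200, seed=0):
    """Estimate how often a shuffled subject scores as well as the real one.

    Shuffling preserves the subject's length and residue composition
    while destroying the order of its residues.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = score_fn(query, subject)
    residues = list(subject)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if score_fn(query, "".join(residues)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (n_shuffles + 1)
```

A call such as `shuffle_p_value("MKTAYIAKQR", "MKTAYIAKQR", shared_word_score)` returns the observed score together with the fraction of shuffles that matched or beat it; a small fraction suggests the observed similarity is unlikely to reflect composition alone.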

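Figure 3.20 The FASTA search strategy. (a) Once FASTA has determined the words of length ktup common to the query sequence and the target sequence, it connects words that are close to each other; these connections are represented by the diagonals. (b) After an initial round of scoring, the top 10 diagonals are selected for further analysis. (c) The Smith-Waterman algorithm is applied to yield the optimal pairwise alignment between the two sequences under consideration. See text for details.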

Running a FASTA Search

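The University of Virginia provides a web front-end for submitting FASTA queries. The page offers a variety of protein and nucleotide databases, and up to two databases can be selected for a single run. From this page, the user can also specify the scoring matrix to be used, the gap and extension penalties, and the value of ktup. The default value of ktup is 2 for protein-based searches and 6 for nucleotide-based searches. Lowering the value of ktup increases the sensitivity of the run, at the expense of speed. The user can also limit the results returned to a particular range of E values.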

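The results returned by a FASTA query are in a markedly different format from those returned by BLAST. Consider the sequence of histone H2B.3 from the highly regenerative cnidarian Hydractinia, one of four novel H2B variants used in place of protamines to compact sperm DNA (KX622131.1; Török et al. 2016). With this sequence as the query, BLOSUM62 as the scoring matrix, and Swiss-Prot as the target database, the first part of the resulting FASTA output is shown in Figure 3.21; it summarizes the results as a histogram. The histogram is intended to convey the distribution of all similarity scores computed in the course of this particular search. The first column gives the bins of similarity scores, with scores increasing as one moves down the page. The second column gives the number of sequences actually observed to fall into each bin. This count is also represented by the length of each line in the histogram, where each equals sign represents a certain number of sequences; in this figure, each equals sign corresponds to 130 sequences in UniProtKB/Swiss-Prot. The third column of numbers gives the number of sequences expected to fall into each bin; in the histogram, this is indicated by asterisks.

The hit list immediately follows; a portion of the hit list for this search is shown in Figure 3.22. Here, the accession number and partial definition line are given for each hit, along with the optimal similarity score (opt), a normalized score (bit), the expectation value (E), percent identity and percent similarity figures, and the aligned length. The individual alignments of each hit to the original query sequence are not shown here; they are found by scrolling further down in the output. In the pairwise alignments, exact matches are indicated by a colon, while conservative substitutions are indicated by a dot.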

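Figure 3.21 Search summary from a protein–protein FASTA search, using the sequence of histone H2B.3 from Hydractinia echinata (KX622131.1; Török et al. 2016) as the query and BLOSUM62 as the scoring matrix. The header indicates that the query is against the Swiss-Prot database. The histogram indicates the distribution of all similarity scores computed for this search. The left-most column provides a normalized similarity score, and the column marked opt gives the number of sequences with that score. The column marked E() gives the number of sequences expected to achieve the score in the first column. In this case, each equals sign in the histogram represents 130 sequences in Swiss-Prot. The asterisks in each row indicate the expected, random distribution of hits. The inset is a magnified version of the histogram in that region.

Figure 3.22 Hit list for the protein–protein FASTA search described in Figure 3.21. Only the first 18 hits are shown. For each hit, the accession number and partial definition line are provided. The column marked opt gives the raw similarity score, the column marked bits gives a normalized bit score (a measure of similarity between the two sequences), and the column marked E gives the expectation value. The percentage columns indicate percent identity and percent similarity, respectively. The alen column gives the total aligned length for each hit. The +- characters shown at the beginning of some lines indicate that more than one alignment was found between the query and subject; in the case of the first hit (Q7Z5P9), four alignments were returned. The align link at the end of each row takes the user to the alignment for that hit (not shown).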

Statistical Significance of Results

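As before, the E values from a FASTA search represent the probability that a hit arose purely by chance. Pearson (2016) puts forth the following guidelines for inferring homology from protein-based searches, which differ slightly from those described earlier for BLAST: an E value < 10^-6 almost certainly implies homology. When E < 10^-3, the query and the sequences found are almost always homologous, but the user should verify that the highest-scoring unrelated sequence has an E value near 1.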
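These thresholds can be expressed as a small helper for triaging a hit list. The function name and the wording of the returned labels are assumptions made for illustration; only the two cutoffs, 10^-6 and 10^-3, come from the guidelines above, and the final label is simply a catch-all for everything else.

```python
def interpret_e_value(e_value):
    """Label a protein-search hit using the two cutoffs quoted above."""
    if e_value < 1e-6:
        return "homology almost certain"
    if e_value < 1e-3:
        # The guideline adds a sanity check on the rest of the hit list.
        return "likely homologous; confirm top unrelated hit has E near 1"
    return "E value alone does not support inferring homology"
```

For instance, `interpret_e_value(2e-9)` falls in the first category, while `interpret_e_value(5e-4)` falls in the second and calls for checking the highest-scoring unrelated sequence.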

Comparing FASTA and BLAST

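Since both FASTA and BLAST employ rigorous algorithms to find sequences that are statistically (and, one hopes, biologically) relevant, it is natural to ask which of the two methods is the better choice. There is actually no good answer to this question, since both methods bring significant strengths to the table. Summarized below are some of the fine points that distinguish the two methods from one another.

• FASTA begins the search by looking for exact matches of words, while BLAST allows for conservative substitutions in the first step.
• BLAST allows for automatic masking of sequences, while FASTA does not.
• FASTA will return one and only one alignment for a sequence in the hit list, while BLAST can return multiple results for the same sequence, each result representing a distinct HSP.
• Since FASTA uses a version of the more rigorous Smith-Waterman alignment method, it generally produces better final alignments and is more apt to find distantly related sequences than BLAST. For highly similar sequences, their performance is fairly similar.
• When comparing translated DNA sequences with protein sequences or vice versa, FASTA allows for frameshifts; specifically, FASTX/FASTY are used for translated DNA -> protein and TFASTX/TFASTY for protein -> translated DNA.
• BLAST runs faster than FASTA, since FASTA is more computationally intensive.

Several studies have attempted to answer the "which method is better" question by performing systematic analyses with test datasets (Pearson 1995; Agarawal and States 1998; Chen 2003). In one such study, Brenner et al. (1998) performed tests using a dataset derived from already known homologies documented in the Structural Classification of Proteins database (SCOP; Chapter 12). They found that FASTA performed better than BLAST in finding relationships between proteins having >30% sequence identity, and that the performance of all methods declines below 30%. Importantly, while the statistical values reported by BLAST slightly underestimated the true extent of errors when looking for known relationships, they found that BLAST and FASTA (ktup = 2) were both able to detect most known relationships, calling them both "appropriate for rapid initial searches."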

Summary

The ability to perform pairwise sequence alignments and interpret the results from such analyses has become commonplace for nearly all biologists, no longer being a technique employed solely by bioinformaticians. With time, these methods have undergone a continual evolution, keeping pace with the types and scale of data that are being generated both in individual laboratories and by systematic, organismal sequencing projects. As with all computational techniques, the reader should have a firm grasp of the underlying algorithm, always keeping in mind the algorithm’s capabilities and limitations. Intelligent use of the tools presented in this chapter can lead to powerful and interesting biological discoveries, but there have also been many cases documented where improper use of the tools has led to incorrect biological conclusions. By understanding the methods, users can optimally use them and end up with a better set of results than if these methods were treated simply as a “black box.” As biology is increasingly undertaken in a sequence-based fashion, using sequence data to underpin the design and interpretation of experiments, it becomes increasingly important that computational results, such as those generated using BLAST and FASTA, are cross-checked in the laboratory, against the literature, and with additional computational analyses to ensure that any conclusions drawn not only make biological sense but also are actually correct.

Internet Resources

Resource	URL
BLAST: European Bioinformatics Institute (EBI)	www.ebi.ac.uk/blastall
BLAST: National Center for Biotechnology Information (NCBI)	blast.ncbi.nlm.nih.gov
BLAST-Like Alignment Tool (BLAT)	genome.ucsc.edu/cgi-bin/hgBlat
NCBI Conserved Domain Database (CDD)	ncbi.nlm.nih.gov/cdd
Cancer Genome Anatomy Project (CGAP)	ocg.cancer.gov/programs/cgap
FASTA: EBI	www.ebi.ac.uk/Tools/sss/fasta
FASTA: University of Virginia	fasta.bioch.virginia.edu
RefSeq	ncbi.nlm.nih.gov/refseq
Structural Classification of Proteins (SCOP)	scop.berkeley.edu
Swiss-Prot	www.uniprot.org

Translation

Summary

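The ability to perform pairwise sequence alignments and interpret the results of such analyses has become a routine skill for nearly all biologists and is no longer a technique used only by bioinformaticians. Over time, these methods have continued to evolve, keeping pace with the types and scale of data being generated both in individual laboratories and by systematic, organismal sequencing projects.

As with all computational techniques, the reader should have a firm grasp of the underlying algorithm, always keeping in mind the algorithm's capabilities and limitations. Intelligent use of the tools presented in this chapter can lead to powerful and interesting biological discoveries, but many cases have also been documented in which improper use of the tools led to incorrect biological conclusions. By understanding the methods, users can apply them optimally and end up with better results than if the methods were treated simply as a "black box."

As biology is increasingly undertaken in a sequence-based fashion, with sequence data underpinning the design and interpretation of experiments, it becomes increasingly important to cross-check computational results. Results such as those generated using BLAST and FASTA should be checked in the laboratory, against the literature, and with additional computational analyses to ensure that any conclusions drawn not only make biological sense but are actually correct.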

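Further Reading

Henikoff, S. and Henikoff, J.G. (2000). Amino acid substitution matrices. Adv. Protein Chem. 54: 73–97. A comprehensive review covering the factors critical to the construction of protein scoring matrices.

Koonin, E. (2005). Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 39: 309–338. An in-depth explanation of orthologs, paralogs, and their subtypes, with a discussion of their evolutionary origins and strategies for their detection.

Pearson, W.R. (2016). Finding protein and nucleotide similarities with FASTA. Curr. Protoc. Bioinf. 53: 3.9.1–3.9.23. An in-depth discussion of the FASTA algorithm, including worked examples and additional information on run options and use cases.

Wheeler, D.G. (2003). Selecting the right protein scoring matrix. Curr. Protoc. Bioinf. 1: 3.5.1–3.5.6. A discussion of PAM, BLOSUM, and specialized scoring matrices, with guidance on choosing the proper matrix for particular types of protein analyses.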

---

References

Agarawal, P. and States, D.J. (1998). Comparative accuracy of methods for protein similarity search. Bioinformatics. 14: 40–47.

Altschul, S.F. (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219: 555–565.

Altschul, S.F. and Koonin, E.V. (1998). Iterated profile searches with PSI-BLAST: a tool for discovery in protein databases. Trends Biochem. Sci. 23: 444–447.

Altschul, S.F., Gish, W., Miller, W. et al. (1990). Basic local alignment search tool. J. Mol. Biol. 215: 403–410.

Altschul, S.F., Madden, T.L., Schäffer, A.A. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.

Brenner, S.E., Chothia, C., and Hubbard, T.J.P. (1998). Assessing sequence comparison methods with reliable structurally identified evolutionary relationships. Proc. Natl. Acad. Sci. USA. 95: 6073–6078.

Bücher, P., Karplus, K., Moeri, N., and Hofmann, K. (1996). A flexible motif search technique based on generalized profiles. Comput. Chem. 20: 3–23.

Chen, Z. (2003). Assessing sequence comparison methods with the average precision criterion. Bioinformatics. 19: 2456–2460.

Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. (1978). A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure, vol. 5 (ed. M.O. Dayhoff), 345–352. Washington, DC: National Biomedical Research Foundation.

Doolittle, R.F. (1981). Similar amino acid sequences: chance or common ancestry. Science 214: 149–159.

Doolittle, R.F. (1989). Similar amino acid sequences revisited. Trends Biochem. Sci. 14: 244–245.

Gonnet, G.H., Cohen, M.A., and Benner, S.A. (1992). Exhaustive matching of the entire protein sequence database. Science. 256: 1443–1445.

Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987). Profile analysis: detection of distantly-related proteins. Proc. Natl. Acad. Sci. USA. 84: 4355–4358.

Henikoff, S. and Henikoff, J.G. (1991). Automated assembly of protein blocks for database searching. Nucleic Acids Res. 19: 6565–6572.

Henikoff, S. and Henikoff, J.G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA. 89: 10915–10919.

Henikoff, S. and Henikoff, J.G. (1993). Performance evaluation of amino acid substitution matrices. Proteins Struct. Funct. Genet. 17: 49–61.

Henikoff, S. and Henikoff, J.G. (2000). Amino acid substitution matrices. Adv. Protein Chem. 54: 73–97.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992). The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8: 275–282.

Karlin, S. and Altschul, S.F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA. 87: 2264–2268.

Kent, W.J. (2002). BLAT: the BLAST-like alignment tool. Genome Res. 12: 656–664.

Lipman, D.J. and Pearson, W.R. (1985). Rapid and sensitive protein similarity searches. Science. 227: 1435–1441.

Ma, B., Tromp, J., and Li, M. (2002). PatternHunter: faster and more sensitive homology search. Bioinformatics. 18: 440–445.

Pearson, W.R. (1995). Comparison of methods for searching protein sequence databases. Protein Sci. 4: 1145–1160.

Pearson, W.R. (2000). Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132: 185–219.

Pearson, W.R. (2016). Finding protein and nucleotide similarities with FASTA. Curr. Protoc. Bioinf. 53: 3.9.1–3.9.23.

Pearson, W.R. and Lipman, D.J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA. 85: 2444–2448.

Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng. 12: 85–94.

Ryan, J.F., Pang, K., Schnitzler, C.E. et al., and NISC Comparative Sequencing Program. (2013). The genome of the ctenophore Mnemiopsis leidyi. Science. 346: 436–439.

Schneider, T.D., Stormo, G.D., Gold, L., and Ehrenfeucht, A. (1986). Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188: 415–431.

Schnitzler, C.E., Simmons, D.K., Pang, K. et al. (2014). Expression of multiple Sox genes through embryonic development in the ctenophore Mnemiopsis leidyi is spatially restricted to zones of cell proliferation. J. Exp. Zool. (Mol. Dev. Evol.) 322B: 423–433.

Smith, T.F. and Waterman, M.S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147: 195–197.

Staden, R. (1988). Methods to define and locate patterns of motifs in sequences. Comput. Appl. Biosci. 4: 53–60.

Tatusov, R.L., Altschul, S.F., and Koonin, E.V. (1994). Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc. Natl. Acad. Sci. USA. 91: 12091–12095.

Tatusova, T.A. and Madden, T.L. (1999). BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 174: 247–250.

Török, A., Schiffer, P.H., Schnitzler, C.E. et al. (2016). The cnidarian Hydractinia echinata employs canonical and highly adapted histones to pack its DNA. Epigenet. Chromatin. 9: 36.

Vogt, G., Etzold, T., and Argos, P. (1995). An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol. 249: 816–831.

Wheeler, D.G. (2003). Selecting the right protein scoring matrix. Curr. Protoc. Bioinf. 1: 3.5.1–3.5.6.

Wootton, J.C. and Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17: 149–163.

Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000). A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7: 203–214.

---

Author's Statement

This chapter was written by Dr. Andreas D. Baxevanis in his private capacity. No official support or endorsement by the National Institutes of Health or the United States Department of Health and Human Services is intended or should be inferred.