Chapter 5

Genome Annotation

David S. Wishart

Introduction

Thanks to rapid advances in DNA sequencing technology and DNA analysis software, genome projects that used to take years and cost millions of dollars to finish can now be completed in just weeks at a cost of a few thousand dollars. The typical workflow for a modern genome sequencing project involves performing whole genome DNA sequencing of selected organisms using a next generation DNA sequencer, running a variety of programs to assemble a reference genome, and using software to locate and identify all of the protein-coding, ribosomal RNA (rRNA), and transfer RNA (tRNA) genes within the genomic sequence. This last process is called genome annotation and it is the primary subject of this chapter. Strictly speaking, genome annotation is not genome prediction. Gene or genome prediction is a subfield of genome annotation. In particular, gene prediction uses mathematical or probabilistic models to analyze DNA sequences and to identify gene boundaries and gene structures. On the other hand, genome annotation uses gene (and genome) prediction results along with other lines of evidence such as gene expression data, protein expression data, sequence homology to other annotated genomes, and even literature assessments to generate a set of genome annotations. These annotations include not only the location of the genes on each chromosome but also their names (based on homology), calculated properties (such as sequence length, amino acid composition, and molecular weight), expression levels (if available), and probable functions.

Depending on the type of organism that has been sequenced, the task of genome annotation can be either quite easy or quite difficult. Prokaryotes (including bacteria and archaea) have relatively small genomes, typically no more than 5 million base pairs, consisting of one or two circular chromosomes and perhaps one or two small plasmids. The gene structure for prokaryotes is very simple, with each gene being a contiguous open reading frame (ORF). Furthermore, the coding density for prokaryotes is very high, with at least 85–90% of their DNA coding for proteins, tRNAs, and rRNAs (Hou and Lin 2009). This makes the identification of genes in prokaryotes relatively simple. On the other hand, eukaryotic gene identification is often quite difficult. This is because eukaryotes have very large genomes (often billions of base pairs), with very low (often <2%) coding densities (Hou and Lin 2009). Eukaryotic gene structure is also much more complex than prokaryotic gene structure. In particular, eukaryotic genes are split into exons and introns, and most eukaryotic genes are separated by very large stretches of non-coding DNA (called intergenic regions).

While the cellular machinery in eukaryotic cells is able to recognize and process gene signals with remarkable accuracy and precision, our understanding of the molecular mechanisms by which eukaryotic sequence signals are recognized and processed remains incomplete. As a result, currently available eukaryotic gene prediction methods are not very accurate. Therefore, in the absence of additional experimental or extrinsic information (e.g. gene expression data), one should assume that eukaryotic gene predictions are only approximate. Even with considerable experimental data at hand, it is still quite difficult to fully annotate the best-studied eukaryotic genomes. For instance, the DNA sequence for the human genome has been known since 2001, but the actual number of genes encoded within our own genome has still not been fully determined (Pennisi 2003; Ezkurdia et al. 2014).

Bioinformatics, Fourth Edition. Edited by Andreas D. Baxevanis, Gary D. Bader, and David S. Wishart.

Companion Website: www.wiley.com/go/baxevanis/Bioinformatics_4e

This chapter briefly reviews some of the computational methods and algorithms underlying computational gene prediction for both prokaryotes and eukaryotes. It also describes how experimental evidence and database comparisons can be integrated into these gene prediction tools to improve gene prediction performance and to ensure more complete genome annotation. Methods for assessing the performance of computational gene finders are also described. Finally, a number of genome annotation pipelines are highlighted, along with several tools for visualizing the resulting annotations.

Gene Prediction Methods

Different methods for gene prediction have been developed separately for prokaryotes and eukaryotes because of important differences in their overall gene organization. Gene-finding programs, whether for prokaryotes or eukaryotes, fall into two general categories: intrinsic (or ab initio) gene predictors and extrinsic (or evidence based) gene finders (Borodovsky et al. 1994).

Ab initio gene prediction approaches attempt to predict and annotate genes solely using DNA sequence data as input and without direct comparison with other sequences or sequence databases. Ab initio approaches involve searching for sequence signals that are potentially involved in gene specification and/or looking for regions that show compositional bias that has been correlated with coding regions. This combined approach to gene finding is called searching by signal and searching by content. GeneMark (Borodovsky and McIninch 1993), GLIMMER (Delcher et al. 1999, 2007), EasyGene (Larsen and Krogh 2003), and GENSCAN (Burge and Karlin 1997) are well-known examples of intrinsic or ab initio gene-finding programs. In contrast, extrinsic gene-finding methods involve both homology-based and comparative approaches, in which the gene structure is determined through comparison with other sequences whose characteristics are already known. BLASTX is an example of an extrinsic gene-finding program that has been frequently applied for gene identification in prokaryotic genomes (Borodovsky et al. 1994). Extrinsic gene prediction methods depend on having experimental evidence (such as messenger RNA (mRNA) or RNA-seq data) and/or a large body of pre-existing experimental sequencing data to perform sequence comparisons and gene identifications. We will discuss these extrinsic methods and their role in genome annotation a little later in this chapter. To begin with, we will focus on the intrinsic or ab initio gene prediction methods.



Ab Initio Gene Prediction in Prokaryotic Genomes

A prokaryotic gene typically begins with a start codon (e.g. ATG), ends with one of three stop codons (TAG, TAA, or TGA), and is usually at least 100 bases long (Figure 5.1). These protein-coding genes are called ORFs. Most of the genes in prokaryotic genomes are organized into operons, which are gene clusters consisting of more than one ORF that are under the control of a shared set of regulatory sequences. These regulatory sequences can include enhancers, silencers, terminators, operators, or promoters. Regulatory sequences typically constitute the 10–15% of the prokaryotic genome that is not coding for protein sequences. A prokaryotic gene promoter is a small segment of DNA that initiates transcription of a particular gene. Promoters are located near the transcription start sites (TSSs) of genes, on the same strand and upstream of the gene or ORF. In prokaryotes, the promoter contains two short sequence elements approximately 10 and 35 bases upstream from the TSS. The element located 10 bases upstream is called the TATA box in archaea or the Pribnow (TATAAT) box in bacteria (Pribnow 1975). These abbreviations or letters actually indicate the consensus DNA sequences seen for these regions. In addition to the TSS, almost all prokaryotic genes have a ribosome binding site (RBS) that is 8–10 bases upstream of the start (ATG) codon. The start codon is also called the translation initiation site (TIS). The RBS exhibits a specific nucleotide pattern (AGGAGG) called a Shine–Dalgarno (SD) consensus sequence (Shine and Dalgarno 1975). The SD sequence enables interactions between mRNA and the cell’s translational machinery. In bacteria and archaea, translation initiation is generally thought to occur through the base-pairing interaction between the 3′ tail of the 16S rRNA of the 30S ribosomal subunit and the site in the 5′ untranslated region (UTR) of an mRNA that carries the SD consensus.

Figure 5.1 A simplified depiction of a prokaryotic gene or open reading frame (ORF) including the start codon (or translation initiation site), the stop codon (TAG), and the TATA or Pribnow box. [Diagram labels: TATA box; start codon; ORF (ATGACAGATTACAGA......TGCAGTTACAGGATAG); stop codon.]

Consensus sequences, while providing a useful reminder or mnemonic, are never really used in modern gene signal or gene site (i.e. TIS, RBS, TSS, and terminator) identification. Instead, most gene signals can be identified by using positional weight matrices (PWMs) or position-specific scoring matrices (PSSMs; see also Chapter 3). These scoring matrices are calculated from carefully aligning a set of known functional signals and determining the adjusted frequency with which specific bases may appear in certain positions. An example of how to calculate a PSSM is given in Box 5.1. Once calculated for a given signal, signal-specific PSSMs can be used to rapidly compute, along the length of a sequence of interest, the position and likelihood of the selected gene signals. A simplified gene prediction protocol for prokaryotes involves the following steps.

• Start at the beginning of the genome sequence at the 5′ end of one DNA strand and find an ATG start codon that makes the longest ORF (minimum 150 bases), then move to the next ATG downstream of the previously identified ORF and repeat the process for the rest of the genome sequence.

• Repeat the above process for the opposite DNA strand.

• For all identified ORFs, score the quality of the TSS and RBS signals using site-specific PSSMs to refine the ORF predictions and produce a final list of genes.
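The first two steps of this protocol can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by any published gene finder: the function names, the `min_len` and `starts` parameters, and the choice to scan each of the three reading frames independently are all assumptions made for the example. The third step (scoring TSS and RBS signals with PSSMs) is not included.

```python
# Illustrative sketch only; all names and parameters here are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the opposite DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=150, starts=("ATG",)):
    """Find the longest ORF ending at each in-frame stop codon.

    Returns (start, end, strand) tuples. Coordinates are 0-based,
    end-exclusive, and relative to the strand that was scanned.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            first_start = None  # most upstream start codon since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if first_start is None and codon in starts:
                    first_start = i
                elif codon in STOP_CODONS:
                    if first_start is not None and i + 3 - first_start >= min_len:
                        orfs.append((first_start, i + 3, strand))
                    first_start = None
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "AAA" * 4 + "TAG" + "GG"
    print(find_orfs(toy, min_len=18))
```

Passing a different tuple for `starts` would let the same loop consider alternative start codons, at the cost of more false positives.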

Box 5.1 Position-Specific Scoring Matrices

Position-specific scoring matrices (PSSMs), which are also called positional weight matrices (PWMs) or positional specific weight matrices (PSWMs), are usually derived from a set of aligned sequences that are believed to be functionally related. In this example, five different DNA sequences consisting of 10 bases each, which are believed to be functionally related (as promoter regions), are aligned.

A T T T A G T A T C

G T T C T G T A A C

A T T T T G T A G C

A A G C T G T A A C

C A T T T G T A C A

From this alignment, a simple positional frequency matrix (PFM) can be generated. In this matrix the frequency of the As, Cs, Gs, and Ts is tabulated (based on the above alignment) for each of the 10 base positions. So in the first position there are three As, one C, one G, and no Ts (see column 1). The PFM for the above alignment is:

A 3 2 0 0 1 0 0 5 2 1

C 1 0 0 2 0 0 0 0 1 4

G 1 0 1 0 0 5 0 0 1 0

T 0 3 4 3 4 0 5 0 1 0

The PFM can now be converted to a positional probability matrix (PPM). A PPM is a matrix consisting of a set of decimal values based on the percentage or frequency of occurrences of each base in each position in the sequence alignment. In other words, we must normalize the frequencies by dividing the nucleotide count at each position by the number of sequences in the alignment. So if there are five sequences in the alignment and three As in the first position, then the positional probability for A in the first position is 3/5 = 0.6. Likewise, if there is one C in the first position, its positional probability is 1/5 = 0.2. One G corresponds to a positional probability of 0.2, and no Ts corresponds to a positional probability of 0 (see column 1). Performing this same calculation across all 10 positions of the alignment, the full PPM would appear as follows:

A .6 .4 0 0 .2 0 0 1 .4 .2

C .2 0 0 .4 0 0 0 0 .2 .8

G .2 0 .2 0 0 1 0 0 .2 0

T 0 .6 .8 .6 .8 0 1 0 .2 0

The probabilities in the above PPM can be multiplied together to calculate the probability that a given DNA sequence is closely related to the original five sequences. For instance, if we wanted to know if the new sequence ATTTTGTATA is closely related, we could multiply the values for each sequence position to calculate that sequence’s probability:

p = 0.6 × 0.6 × 0.8 × 0.6 × 0.8 × 1 × 1 × 1 × 0.2 × 0.2 = 0.0055
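The calculation so far (alignment to PFM, PFM to PPM, PPM to a sequence probability) can be reproduced with a short Python sketch. The function and variable names are hypothetical and chosen only for this illustration; the alignment and query are the ones used in this box.

```python
# Illustrative sketch of the Box 5.1 calculation; all names are hypothetical.
ALIGNMENT = [
    "ATTTAGTATC",
    "GTTCTGTAAC",
    "ATTTTGTAGC",
    "AAGCTGTAAC",
    "CATTTGTACA",
]
BASES = "ACGT"


def frequency_matrix(seqs):
    """Count each base at each alignment column (the PFM)."""
    length = len(seqs[0])
    return {b: [sum(s[i] == b for s in seqs) for i in range(length)] for b in BASES}


def probability_matrix(pfm, n_seqs):
    """Divide every count by the number of aligned sequences (the PPM)."""
    return {b: [count / n_seqs for count in row] for b, row in pfm.items()}


def sequence_probability(ppm, query):
    """Multiply the positional probabilities of the bases in the query."""
    p = 1.0
    for i, base in enumerate(query):
        p *= ppm[base][i]
    return p


pfm = frequency_matrix(ALIGNMENT)
ppm = probability_matrix(pfm, len(ALIGNMENT))
p_query = sequence_probability(ppm, "ATTTTGTATA")

if __name__ == "__main__":
    for base in BASES:
        print(base, pfm[base])
    print(round(p_query, 4))
```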

Note that if we had performed this same calculation on an almost identical sequence such as ACTTTGTATA (which differs by only one base) we would get p = 0. We get a 0 probability because C was not observed in the second position of our training set. Building a PPM with only five sequences means you are very likely to underestimate (or overestimate) the true fractional frequencies of each base, leading to problems in calculating probabilities similar to what we just saw. To account for the small size of our multiple sequence alignment (MSA) we should introduce pseudocounts. Pseudocounts are used to avoid issues that result from matrix entries having a value of 0. Pseudocounting is equivalent to multiplying each column of the PPM by a Dirichlet distribution, thereby allowing the probability to be calculated for the “unseen” or unused sequences. A simple way of doing this is to normalize the data to match the overall base composition of the genome(s) being considered and to add a correction factor that scales as the square root of the number of sequences in the MSA. Hence, the following formula can be used to rescore each base position in the PPM:

score(Xi) = (Qx + Px) / (N + B)

where Qx is the number of counts of base type X at position i, Px is the number of pseudocounts of base type X, which is equal to B × the frequency of base type X, N is the total number of sequences in the MSA, and B is the total number of pseudocounts (assumed to be √N). For the genome or genomes of interest the frequency of As is 0.32, Ts is 0.32, Cs is 0.18, and Gs is 0.18. Using this information the value for A in the first position is (3 + (√5 × 0.32))/(5 + √5) = 0.51. The value for C in the first position is (1 + (√5 × 0.18))/(5 + √5) = 0.19, and so on. The pseudocount corrected PPM is now:

A .51 .38 .09 .09 .24 .09 .09 .79 .38 .24

C .19 .06 .06 .33 .06 .06 .06 .06 .19 .61

G .19 .06 .19 .06 .06 .75 .06 .06 .19 .06

T .09 .51 .65 .51 .65 .09 .79 .09 .24 .09
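As a check on the arithmetic, the pseudocount formula can be applied to the whole PFM with a few lines of Python. This is a minimal sketch under the assumptions stated in this box (B = √N and the quoted background base frequencies); the names `PFM`, `BACKGROUND`, and `pseudocount_ppm` are hypothetical.

```python
import math

# Counts copied from the PFM in Box 5.1.
PFM = {
    "A": [3, 2, 0, 0, 1, 0, 0, 5, 2, 1],
    "C": [1, 0, 0, 2, 0, 0, 0, 0, 1, 4],
    "G": [1, 0, 1, 0, 0, 5, 0, 0, 1, 0],
    "T": [0, 3, 4, 3, 4, 0, 5, 0, 1, 0],
}
# Background base frequencies quoted in the text.
BACKGROUND = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}


def pseudocount_ppm(pfm, background, n_seqs):
    """Apply score = (Qx + Px) / (N + B) with B = sqrt(N) and Px = B * freq(X)."""
    b_total = math.sqrt(n_seqs)
    return {
        base: [(q + b_total * background[base]) / (n_seqs + b_total) for q in row]
        for base, row in pfm.items()
    }


corrected = pseudocount_ppm(PFM, BACKGROUND, n_seqs=5)

if __name__ == "__main__":
    for base, row in corrected.items():
        print(base, [round(p, 2) for p in row])
```

Because the background frequencies sum to 1, every column of the unrounded result sums to exactly 1, and no entry is zero.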

Ideally each of the columns should sum to 1 but, because of rounding, the sums in this example are sometimes slightly above or below 1. With this rescored matrix, you will notice that there are now no zero entries. However, the calculation of probabilities through multiplication is tedious (given the number of significant digits) and difficult. A simpler way is to convert the PPM to a different type of matrix by taking the negative log10 of each number in the PPM. This converts two-digit decimals to single-digit decimals and it also allows one to add rather than multiply to calculate probabilities. If we take the −log10 of the above PPM, we get:

A 0.3 0.4 1.0 1.0 0.6 1.0 1.0 0.1 0.4 0.6

C 0.7 1.2 1.2 0.5 1.2 1.2 1.2 1.2 0.7 0.2

G 0.7 1.2 0.7 1.2 1.2 0.1 1.2 1.2 0.7 1.2

T 1.0 0.3 0.2 0.3 0.2 1.0 0.1 1.0 0.6 1.0

This modified matrix is called a log likelihood scoring matrix or a PSSM. Using the above PSSM, we can now calculate the score (the negative log likelihood) for the query sequence ATTTTGTATA: 0.3 + 0.3 + 0.2 + 0.3 + 0.2 + 0.1 + 0.1 + 0.1 + 0.6 + 0.6 = 2.8. The sequence score gives an indication of how different the sequence is from a random sequence. Because each entry is the negative logarithm of a probability, a lower score corresponds to a higher probability: the lower the score, the more likely the sequence is a promoter/functional site and not a random sequence. A score of 2.8 is close to the lowest score this matrix can produce (2.2, obtained by choosing the most frequent base at every position), so the query is a strong match. The sequence score can also be interpreted in terms of the binding energy for that sequence.
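The additive scoring step can be sketched in Python as follows. The matrix values are the two-decimal entries of the pseudocount-corrected PPM printed in this box; the function names and the sliding-window `scan` helper are hypothetical additions for illustration. Note that with a −log10 transform, a smaller sum corresponds to a larger product of probabilities.

```python
import math

# Pseudocount-corrected PPM as printed in Box 5.1 (two-decimal values).
PPM = {
    "A": [.51, .38, .09, .09, .24, .09, .09, .79, .38, .24],
    "C": [.19, .06, .06, .33, .06, .06, .06, .06, .19, .61],
    "G": [.19, .06, .19, .06, .06, .75, .06, .06, .19, .06],
    "T": [.09, .51, .65, .51, .65, .09, .79, .09, .24, .09],
}


def to_neg_log10(ppm):
    """Convert probabilities to -log10 values so that scores can be summed."""
    return {base: [-math.log10(p) for p in row] for base, row in ppm.items()}


def score(pssm, site):
    """Sum the matrix entries selected by each base of the site."""
    return sum(pssm[base][i] for i, base in enumerate(site))


def scan(pssm, seq, width=10):
    """Score every window of the given width along a longer sequence."""
    return [(i, score(pssm, seq[i:i + width])) for i in range(len(seq) - width + 1)]


PSSM = to_neg_log10(PPM)

if __name__ == "__main__":
    print(round(score(PSSM, "ATTTTGTATA"), 1))
    for position, value in scan(PSSM, "CCCCATTTTGTATAGGGG"):
        print(position, round(value, 2))
```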

However, the simplified gene prediction protocol outlined before Box 5.1 would likely be only 75–80% correct (Besemer et al. 2001). This is because prokaryotic genes are not always so simple to identify. For instance, the ATG start codon is not always used for all bacterial genes. Among the 4284 genes identified in Escherichia coli, 83% use ATG, 14% use GTG and 3% use TTG start codons (Blattner et al. 1997). Likewise, using a simple rule to identify only long ORFs may miss many short ORFs or misidentify ORFs that have an unusual codon bias (indicating they are unlikely to code for a gene). Indeed, the length distributions of ORFs known to code for proteins compared with ORFs that occur by chance differ quite significantly. More specifically, coding ORFs have a length distribution that resembles the gamma distribution (see Glossary), while non-coding ORFs have a length distribution that resembles a simple exponential function (Lukashin and Borodovsky 1998). In addition to these complications, it has recently been found that certain prokaryotic genes have very unusual gene start signals because of a phenomenon called leaderless transcription (Slupska et al. 2001). In leaderless transcription, RNA transcripts have very short 5′ UTRs, with a length < 6 bases. These regions are so short that they are unable to host the RBS. This places the TSS at or very near to the TIS. In these cases, the promoter signal has to be used for more accurate TIS identification.

Given the variations in the length and character of many prokaryotic gene signals, PSSMs are not the most effective signal recognition tools available. More advanced methods of gene signal recognition exist, such as Markov models (Box 5.2), hidden Markov models or HMMs (Box 5.3), artificial neural networks, and support vector machines. These machine learning methods do a far better job of handling variable lengths and conditional sequence dependencies that cannot be captured with simple PSSMs.

Box 5.2 Markov Models

A Markov chain, model, or process refers to a series of observations in which the probability of an observation depends on a number of previous observations. The number of previous observations defines the “order” of the chain. For example, in a first-order Markov model, the probability of an observation depends only on the previous observation. In a Markov chain of order 5, the probability of an observation depends on the five preceding observations. A DNA sequence can be considered to be an example of a Markov model because the likelihood of observing a particular base at a given position may depend on the bases preceding it. In particular, in coding regions, it is well known that the probability of a given base depends on the five preceding bases, reflecting observed codon biases and dependencies between adjacent codons. In non-coding regions, such dependence is not observed. When scanning an anonymous genomic region, one can compute how well the local nucleotide sequence conforms to the fifth-order dependencies observed in coding regions and assign appropriate coding likelihood scores.
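A coding likelihood score of this kind can be sketched in Python as a log-odds ratio between two fixed-order Markov models, one trained on coding sequence and one on non-coding sequence. This is a simplified illustration, not the model used by any particular gene finder: it ignores codon position, uses add-one smoothing (an assumption), and all function names are hypothetical.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=5, pseudo=1.0):
    """Estimate P(base | preceding `order` bases) from training sequences.

    Returns a function prob(context, base). Add-`pseudo` smoothing keeps
    unseen context/base pairs from receiving a probability of zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudo
        return (seen.get(base, 0.0) + pseudo) / total

    return prob


def coding_log_odds(seq, p_coding, p_noncoding, order=5):
    """Sum log2 ratios of coding vs. non-coding probabilities along seq."""
    total = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        total += math.log2(p_coding(context, base) / p_noncoding(context, base))
    return total


if __name__ == "__main__":
    # Toy training data, made up for the example.
    coding = train_markov(["ATGGCTGCTGCTGCTGCT" * 3])
    noncoding = train_markov(["ATATATATATATATAT" * 3])
    print(coding_log_odds("GCTGCTGCTGCT", coding, noncoding))
    print(coding_log_odds("ATATATATATAT", coding, noncoding))
```

A positive total suggests the window looks more like the coding training set; a negative total suggests the opposite.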

Box 5.3 Hidden Markov Models in Gene Prediction

Hidden Markov models (HMMs) are used to provide a statistical representation of real biological processes. They have found widespread use in many areas of bioinformatics, including multiple sequence alignment, the characterization and classification of protein families, the comparison of protein structures, and the prediction of gene structure.

In this chapter, all of the gene-finding methods that are described have two things in common: they use a raw nucleotide sequence as their input and, for each position in the sequence, they attempt to predict whether a given base is most likely found in an intron, an exon, or within an intergenic region. In making these predictions, the algorithm applied (HMM or otherwise) must take into account what is known about the structure of a gene, shown in a simplified fashion in Figure 5.2.

Working from the 5′ to 3′ end of the gene, the method must take into account the unique characteristics of promoter regions, transcription start sites, 5′ UTRs, start codons, exons, splice donors, introns, splice acceptors, stop codons, 3′ UTRs, and polyA tails. In addition to any conserved sequences or compositional bias that may characterize each of these regions (Box 5.1), the method also needs to take into account that each of these elements appears with a controlled syntax; for example, the promoter (and its TATA box) must appear before the start codon, an initial exon must follow the start codon, introns must follow exons, introns can only be followed by internal or terminal exons, stop codons cannot interrupt the coding region, and polyA signals must appear after the stop codon. Finally, an ORF must be maintained throughout to produce a protein once all is said and done.

Each of the elements – exons, introns, and so forth – is referred to as a state. The sequence characteristics and syntactical constraints described above allow a transition probability to be assigned, indicating how likely a change of state is as one moves through

Figure 5.2 A simplified depiction of a eukaryotic gene illustrating the multi-intron/exon structure, the location of the start and stop codons, the untranslated regions (UTRs), and the intergenic regions that surround the transcribed gene. [Diagram labels: upstream intergenic region; transcribed region; 5′ UTR; start codon; exon 1; intron 1; exon 2; intron 2; exon 3; stop codon; 3′ UTR; downstream intergenic region.]


Ab Initio Gene Prediction in Eukaryotic Genomes

A diagram of how eukaryotic genes are organized is shown in Figure 5.2. As can be seen from this figure, eukaryotic genes are somewhat more complex than prokaryotic genes. In particular, the density of protein-coding regions for eukaryotic genomes (and especially vertebrate genomes) is 90–100 times lower than it is for prokaryotic genomes. These sparse protein-coding regions are separated by long stretches of intergenic DNA while their coding sequences (the exons) are interrupted by large, non-coding introns. Genes are recognized and transcribed by eukaryotic RNA polymerases, and the resulting long RNA transcripts are then cut by various small nuclear ribonucleoproteins (snRNPs) to remove the introns (Will and Lührmann 2011). The remaining exons are then spliced together to form the much smaller protein-coding transcript. The snRNPs recognize specific cut sites at the exon/intron junctions to ensure that the splicing is always performed precisely.

In the human genome, just 1.1% of the genome is composed of exons, 24% is composed of introns, and 75% of the genome constitutes intergenic DNA. On average there are 5.48 exons per gene with each exon encoding a peptide fragment of 30–36 amino acids (Sakharkar et al. 2002). The longest exon in the human genome is 11 555 bases, while the shortest is just two bases long (Sakharkar et al. 2002). Not only are exons “rare,” they vary tremendously in length. What is more, they can be alternatively spliced to produce very different combinations of final gene (transcript) products. This makes gene prediction significantly more difficult for eukaryotes than for prokaryotes.

Computational gene prediction for eukaryotes essentially involves mimicking the biological transcriptional and splicing process. In the biological process, various proteins and protein complexes within the cell scan through the DNA sequence, recognize and bind to specific DNA sites, transcribe the gene, and then cut and splice the transcript to form a final gene product. In the computational process, the proteins are replaced with various algorithms that:

• identify and score suitable splice sites and start and stop signals along the query sequence

• determine the location of the candidate exons, as deduced through the detection of these signals

• score and identify the best exons as a function of both the signals used to detect the exons as well as the coding statistics computed on the putative exon sequence itself

• assemble (or “splice”) a subset of these exon candidates into a predicted gene structure. The assembly is produced in a way that maximizes a particular scoring function that is dependent on the score of each of the individual exon candidates.
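The final assembly step can be illustrated with a small dynamic programming sketch in Python. It makes strong simplifying assumptions that real gene finders do not: the scoring function is just the sum of the individual exon scores, the only constraint is that chosen exons must not overlap, and reading-frame compatibility and exon type are ignored. The function name and the tuple layout are hypothetical.

```python
from bisect import bisect_right


def assemble_exons(candidates):
    """Pick the non-overlapping subset of exon candidates with the highest total score.

    candidates: list of (start, end, score) with end exclusive.
    Returns (best_total_score, chosen_exons_in_genomic_order).
    """
    exons = sorted(candidates, key=lambda e: e[1])  # order by end coordinate
    ends = [e[1] for e in exons]
    # best[k] holds the optimum using only the first k exons in this order.
    best = [(0.0, [])]
    for k, (start, end, score) in enumerate(exons):
        # Number of earlier exons that end at or before this exon's start.
        j = bisect_right(ends, start, 0, k)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[k]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


if __name__ == "__main__":
    # Toy candidates, made up for the example.
    toy = [(0, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0), (210, 300, 3.0)]
    print(assemble_exons(toy))
```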

The way in which each of these tasks is actually implemented varies from program to program. Rather than discuss each program in detail, we will describe the three major processes common to almost all ab initio eukaryotic gene prediction programs: predicting exon-defining signals, predicting and scoring exons, and, finally, exon assembly.

Predicting Exon-Defining Signals

Just as prokaryotic genes have DNA signals, eukaryotic genes have distinct DNA signals as well. Some of these elements are similar to those in prokaryotes, while others are quite different (Figure 5.3). For instance, many eukaryotic genes have promoter elements that share some sequence similarity with those of prokaryotic genes. The most extensively studied core promoter element in eukaryotes is known as the TATA box or the Goldberg–Hogness box (Lifton et al. 1978), found 25–30 base pairs upstream from the TSS. The TATA box is also found in archaea and bacteria and it appears to be a very ancient DNA signal. The TATA box in eukaryotes has the consensus sequence TATA(A/T)A(A/T) and is often coupled to another regulatory sequence called the CCAAT box (consensus: GGCCAATCT), located ∼150 base pairs upstream of the TATA box. Only about 25–35% of mammalian genes contain TATA boxes, while the rest contain other kinds of core promoter elements. Eukaryotic genes also contain regulatory sequences beyond the core promoter, including enhancers, silencers, and insulators. These regulatory sequences can be spread over a large genomic distance, often hundreds of kilobases from the core promoters. In addition to having a wide variety of promoter or enhancer signals, eukaryotic genes also have very specific DNA signals to define the location of exons and introns.

More specifically, there are four basic DNA signals involved in defining exons: the TIS, the 5′ (or donor) splice site, the 3′ (or acceptor) splice site, and the translational stop codon. In eukaryotes, the TIS is defined by the Kozak consensus sequence, often given as ACCATGG (Kozak 1987), where the central ATG is the start codon. The 5′ donor splice site is typically defined by a consensus sequence given as AG/GT, while the 3′ acceptor splice site has a consensus sequence of CAG/G, where the slash indicates the cut sites for splicing (Figure 5.4). The translational stop codons include the usual TAG, TAA, or TGA.
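A crude "search by signal" for splice sites can be sketched in Python by locating the invariant GT and AG dinucleotides that sit on the intron side of the two cut sites. This is only a first-pass illustration: real programs score the surrounding bases as well, and the function name and the coordinate convention (0-based positions of the cut) are assumptions made for the example.

```python
import re


def find_splice_signals(seq):
    """Locate candidate donor and acceptor cut positions in a DNA string.

    donors:    index of the G in each GT (the intron would start here).
    acceptors: index just after each AG (the next exon would start here).
    Lookahead patterns are used so that overlapping matches are not skipped.
    """
    donors = [m.start() for m in re.finditer(r"(?=GT)", seq)]
    acceptors = [m.start() + 2 for m in re.finditer(r"(?=AG)", seq)]
    return donors, acceptors


if __name__ == "__main__":
    donors, acceptors = find_splice_signals("CAGGTAAGTCCCAGG")
    print(donors)
    print(acceptors)
```

Because GT and AG occur by chance roughly once every 16 bases, nearly all hits from a scan like this are false positives, which is why the signal scores and coding statistics described in this section are needed.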

Figure 5.3 A schematic illustration of the upstream regions of a eukaryotic gene with the GC box located ∼200 bp upstream, the CCAAT box located ∼100 bp upstream, and the TATA box located ∼30 bp upstream of the transcription start site.

Figure 5.4 A schematic illustration of the splice site regions around exons and introns including the 5′ and 3′ splice sites and their consensus sequences. [Diagram labels: Exon 1; Intron 1; Exon 2; Intron 2; branchpoint site; 5′ site (AG/GT); 3′ site (CAG/NT).]

The first methods used to identify exon-defining signals were simple PWMs or PSSMs. These proved to be rather poor at identifying short DNA signals, such as splice sites. As a result, these simple models have since given way to much more advanced pattern recognition techniques such as HMMs (Box 5.3). These powerful pattern recognition approaches allow very complex sequence patterns to be “learned” from large datasets consisting of well-known or well-annotated exon-defining signals. An HMM is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states. HMMs are commonly used in many real-life applications such as speech, handwriting, and gesture recognition. The application of HMMs in bioinformatics began in the early 1990s (Krogh et al. 1994) and led to a significant advance in gene prediction accuracy. HMMs make it possible to define highly complex patterns of variable lengths, including many exon-defining signals such as protein-coding regions (discussed below), donor, acceptor, and lariat sites, as well as translational start and end sites.
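To make the idea of hidden states concrete, the following Python sketch decodes the most probable state path of a toy two-state HMM with the Viterbi algorithm. Everything about the model is an assumption made for illustration: the two states ("E" for exon-like, "I" for intron or intergenic), the GC-rich versus AT-rich emission probabilities, and the transition probabilities are invented, not taken from any gene finder.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observed sequence."""
    # Log probabilities of the best path ending in each state at position 0.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        column, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in state s.
            prev = max(states, key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            column[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(column)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Hypothetical toy parameters.
STATES = "EI"
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

if __name__ == "__main__":
    print(viterbi("ATATATGCGCGCGCATATAT", STATES, START, TRANS, EMIT))
```

Gene-finding HMMs use many more states (one or more per gene element) and emission models far richer than single-base frequencies, but the decoding principle is the same.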

Predicting and Scoring Exons

In addition to the identification of exon-defining signals, the accurate prediction of exons also

depends on content-based features. Exons can be divided into three basic types:

• initial exons: ORFs delimited by a start site and a donor site

• internal exons: ORFs delimited by a 5′ (donor) site and a 3′ (acceptor) site

• terminal exons: ORFs delimited by a 3′ (acceptor) site and a stop codon.

Most transcribed genes are composed of one initial exon, multiple internal exons, and a

single terminal exon. Zhang (2002) provides a more comprehensive discussion of these types

of eukaryotic exons.

Exons, by definition, are protein-coding regions. Protein-coding regions are known to

exhibit characteristic compositional bias when compared with non-coding regions. These

include somewhat richer GC content and a distinctly non-random codon (triplet) frequency

preference. The observed codon bias results from the uneven distribution of amino acids

in proteins, the uneven use of synonymous codons, and natural selection for translational

optimization in coding regions. To discriminate protein-coding regions from non-coding

regions, a number of DNA content-based measures were developed in the 1990s (Fickett and

Tung 1992; Gelfand 1995; Guigó 1999). These content measures, which are also referred to

as coding statistics, reflect the likelihood that a given DNA sequence codes for a protein or

protein fragment. Many methods for the computation of content-based measures have been

published over the years. Some of the first methods measured patterns seen in codon triplet

frequencies. However, more information was found in the frequencies of pairs of triplets (i.e.

hexamers). As a result, hexamer frequencies, usually in the form of codon position-dependent

fifth-order Markov models (Box 5.2; Borodovsky and McIninch 1993), seem to offer the best

discriminative power to identify protein-coding regions in exons. Currently, these hexamer

frequencies lie at the core of all modern eukaryotic gene predictors.
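A hexamer-based coding statistic can be sketched as a log-odds score that compares hexamer frequencies in known coding and non-coding training sequences. The version below is deliberately simplified: it ignores codon position (so it is not the position-dependent fifth-order Markov model described above), the training sequences are hypothetical, and the add-one pseudocount is an assumption chosen to avoid zero frequencies.

```python
import math
from collections import Counter


def hexamer_counts(seqs):
    """Count all overlapping hexamers in a collection of training sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    return counts


def coding_score(seq, coding_counts, noncoding_counts, pseudo=1.0):
    """Sum of log-odds (coding vs non-coding) over all overlapping hexamers in seq."""
    seq = seq.upper()
    n_hexamers = 4 ** 6  # number of possible hexamers, used to normalize the pseudocounts
    c_total = sum(coding_counts.values()) + pseudo * n_hexamers
    n_total = sum(noncoding_counts.values()) + pseudo * n_hexamers
    score = 0.0
    for i in range(len(seq) - 5):
        h = seq[i:i + 6]
        p_coding = (coding_counts[h] + pseudo) / c_total
        p_noncoding = (noncoding_counts[h] + pseudo) / n_total
        score += math.log(p_coding / p_noncoding)
    return score


if __name__ == "__main__":
    coding = hexamer_counts(["ATGGCCGCCGCCGCCGCC"])   # hypothetical coding training set
    noncoding = hexamer_counts(["ATATATATATATAT"])    # hypothetical non-coding training set
    print(coding_score("GCCGCCGCC", coding, noncoding))
    print(coding_score("ATATATATA", coding, noncoding))
```

A positive score suggests that the query resembles the coding training set, and a negative score suggests that it resembles the non-coding set.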

Exon Assembly

Once the exons are predicted (using a combination of hexamer frequencies and HMMs to identify key gene signals and exon/intron boundaries), they need to be assembled into a multi-exon gene structure. The main difficulty in exon assembly lies in simple combinatorics: the number of possible exon assemblies grows exponentially with the number of predicted exons for any given gene. To address this problem, a number of dynamic programming techniques have been developed. Dynamic programming is an optimization technique that solves a complex problem by breaking it down into a collection of simpler subproblems. Each of those subproblems is solved just once, and its solution is stored. The next time the same subproblem occurs, instead of recomputing its solution, one simply looks up the previously computed solution (Bellman 1957; see also Appendix 6.A for a detailed discussion). For the optimal exon assembly problem, dynamic programming has been shown to find the solution quite efficiently, without having to enumerate or consider each and every possible combination of exons (Gelfand and Roytberg 1993). Nearly all modern eukaryotic gene prediction tools now use some kind of dynamic programming method, most often the Viterbi algorithm used with HMMs, which belongs to the same family of dynamic programming methods as the Needleman–Wunsch algorithm used in sequence alignment. By combining HMM-based exon signal identification with different HMM-derived scores for exons and then using dynamic programming to assemble the exons, it is possible to generate robust eukaryotic gene predictions. Some early examples of HMM-based gene prediction methods that use dynamic programming include GENIE (Kulp et al. 1996) and HMMgene (Krogh 1997). Perhaps the most popular example of an HMM-based eukaryotic gene predictor is GENSCAN (Burge and Karlin 1997), an ab initio gene predictor that has been widely used to annotate hundreds of eukaryotic genomes.
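The dynamic programming idea behind exon assembly can be sketched as a chaining problem: given candidate exons with scores, pick the highest-scoring set of non-overlapping exons in left-to-right order. This is a simplified stand-in rather than the algorithm of any particular program; it ignores reading-frame compatibility and exon types, and the coordinates and scores are hypothetical.

```python
def assemble_exons(exons):
    """Return (best_score, chain) for non-overlapping exons given as (start, end, score).

    Coordinates are inclusive. Runs in O(n^2) time instead of enumerating
    every possible combination of exons.
    """
    if not exons:
        return 0.0, []
    ordered = sorted(exons, key=lambda e: (e[1], e[0]))
    n = len(ordered)
    best = [0.0] * n    # best[i]: score of the best chain that ends with exon i
    prev = [None] * n   # prev[i]: index of the exon preceding exon i in that chain
    for i, (start, _end, score) in enumerate(ordered):
        best[i] = score
        for j in range(i):
            # Exon j can precede exon i only if it ends before exon i starts
            if ordered[j][1] < start and best[j] + score > best[i]:
                best[i] = best[j] + score
                prev[i] = j
    i = max(range(n), key=lambda k: best[k])
    total = best[i]
    chain = []
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return total, chain[::-1]


if __name__ == "__main__":
    candidates = [(1, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0), (210, 300, 3.0)]
    print(assemble_exons(candidates))
```

Each stored `best[i]` is a solved subproblem that later exons reuse, which is what keeps the search from growing exponentially.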

Given the popularity of GENSCAN, it is worthwhile explaining how this program works in a bit more detail and providing an example of how it can be used. For any given query sequence, GENSCAN determines the most likely gene structure given an underlying HMM. To model donor splice sites, GENSCAN introduced a method called maximal dependence decomposition. In this method, a series of weight matrices (instead of just one) is used to capture dependencies between positions in these splice sites. In addition, GENSCAN uses parameters that account for many higher order properties of genomic sequences (e.g. typical gene density, typical number of exons per gene, and the distribution of exon sizes for different types of exons). Separate sets of gene model parameters can be used to adjust for the differences in gene density and G + C composition seen across genomes. Models have also been developed for use with maize and Arabidopsis sequences. When similarity to known proteins is also taken into account, exons exhibiting such similarity receive higher scores, whereas predicted exons having little to no similarity with known proteins receive decreased scores.

A typical GENSCAN output is shown in Figure 5.5, using the human uroporphyrinogen decarboxylase (URO-D) gene (U30787) as the query. Each exon in the prediction is shown on a separate line. The columns, going from left to right, represent the gene and exon number (Gn.Ex), the type of prediction (Type, either the exon type or an identified polyA signal), the strand on which the prediction was made (+ or –), the beginning and endpoints for the prediction, the length of the predicted exon, its reading frame, several scoring columns, and a probability value (P). GENSCAN exons having a very high probability value (P > 0.99) are 97.7% accurate, where accuracy means that the prediction matches a true, annotated exon. These high-probability predictions can be used in the rational design of polymerase chain reaction primers for complementary DNA (cDNA) amplification, or for other purposes where extremely high confidence is necessary. GENSCAN exons that have probabilities in the range of 0.50–0.99 are deemed to be correct most of the time. The best-case accuracy for P values higher than 0.90 is on the order of 88%. Any predictions having P < 0.50 should be deemed unreliable, and those data are not given in the data table. The predicted amino acid sequence is given below the gene predictions. In the example shown here, GENSCAN correctly predicted nine of the 10 exons in URO-D. Only the initial exon was missed.

Figure 5.5 Sample output from a GENSCAN analysis of the uroporphyrinogen decarboxylase gene. See the text for a more detailed description of the output.

中文译文

译文：Ch5 Genome Annotation / Gene Prediction Methods

章节：Ch5 Genome Annotation

Canonical 小节：Gene Prediction Methods

范围：PDF page 138 - PDF page 147 上部；印刷页码 118-127

---

原核基因组中的 Ab Initio 基因预测

原核基因通常以起始密码子（例如 ATG）开始，以三种终止密码子之一（例如 TAG、TAA 或 TGA）结束，并且通常至少有 100 个碱基长（图 5.1）。这些编码蛋白质的基因称为开放阅读框（open reading frame, ORF）。原核基因组中的大多数基因组织成操纵子（operon），即由多个 ORF 组成、并受一组共同调控序列控制的基因簇。这些调控序列可以包括增强子、沉默子、终止子、操纵基因或启动子。调控序列通常构成原核基因组中不编码蛋白质序列的 10%–15%。原核基因启动子是一小段 DNA，它启动某个特定基因的转录。启动子位于基因转录起始位点（transcription start site, TSS）附近，与基因或 ORF 位于同一条链上，并处于其上游。在原核生物中，启动子包含两个短序列元件，分别位于 TSS 上游约 10 个碱基和 35 个核苷酸处。位于上游 10 个碱基处的元件在古菌中称为 TATA box，在细菌中称为 Pribnow（TATAAT）box（Pribnow 1975）。这些缩写或字母实际上表示在这些区域中观察到的共有 DNA 序列。

ATGACAGATTACAGA......TGCAGTTACAGGATAG
TATA box
Start codon
Stop codon
ORF

图 5.1 原核基因或开放阅读框（ORF）的简化示意图，其中包括起始密码子（或翻译起始位点）、终止密码子（TAG），以及 TATA box 或 Pribnow box。

除 TSS 外，几乎所有原核基因都有一个核糖体结合位点（ribosome binding site, RBS），位于起始（ATG）密码子上游 8–10 个碱基处。起始密码子也称为翻译起始位点（translation initiation site, TIS）。RBS 呈现出一种特定的核苷酸模式（AGGAGG），称为 Shine–Dalgarno（SD）共有序列（Shine and Dalgarno 1975）。SD 序列使 mRNA 能够与细胞的翻译机器发生相互作用。在细菌和古菌中，通常认为翻译起始是通过 30S 核糖体亚基中 16S rRNA 的 3′ 末端，与携带 SD 共有序列的 mRNA 5′ 非翻译区（5′ untranslated region, UTR）中的位点发生碱基配对相互作用而完成的。

共有序列虽然可以作为有用的提示或记忆辅助，但在现代基因信号或基因位点（即 TIS、RBS、TSS 和终止子）识别中并不会真正直接使用。相反，大多数基因信号可以通过位置权重矩阵（positional weight matrix, PWM）或位置特异性评分矩阵（position-specific scoring matrix, PSSM；另见第 3 章）来识别。这些评分矩阵是通过仔细比对一组已知功能信号，并确定特定碱基在某些位置出现的校正频率而计算得到的。Box 5.1 给出了如何计算 PSSM 的示例。一旦针对某一给定信号完成计算，信号特异性的 PSSM 就可用于沿着目标序列快速计算所选基因信号的位置及其可能性。一个简化的原核生物基因预测流程包括以下步骤。

从某一条 DNA 链 5′ 端的基因组序列开头开始，寻找能够形成最长 ORF（最小 150 个碱基）的 ATG 起始密码子；然后移动到此前已识别 ORF 下游的下一个 ATG，并对基因组序列的其余部分重复这一过程。
对相反方向的 DNA 链重复上述过程。
对所有识别出的 ORF，使用位点特异性 PSSM 对 TSS 和 RBS 信号的质量进行评分，以细化 ORF 预测并生成最终基因列表。

Box 5.1 位置特异性评分矩阵

位置特异性评分矩阵（position-specific scoring matrix, PSSM）也称为位置权重矩阵（positional weight matrix, PWM）或位置特异性权重矩阵（positional specific weight matrix, PSWM），通常由一组被认为在功能上相关的比对序列推导而来。在本例中，将 5 条各由 10 个碱基组成、并被认为在功能上相关（作为启动子区域）的不同 DNA 序列进行比对。

A T T T A G T A T C
G T T C T G T A A C
A T T T T G T A G C
A A G C T G T A A C
C A T T T G T A C A

由这个比对可以生成一个简单的位置频率矩阵（positional frequency matrix, PFM）。在该矩阵中，A、C、G 和 T 的频率会根据上述比对，针对 10 个碱基位置中的每一个位置进行制表。因此，在第一个位置上有 3 个 A、1 个 C、1 个 G 和 0 个 T（见第 1 列）。上述比对对应的 PFM 如下：

A 3 2 0 0 1 0 0 5 2 1
C 1 0 0 2 0 0 0 0 1 4
G 1 0 1 0 0 5 0 0 1 0
T 0 3 4 3 4 0 5 0 1 0

PFM 现在可以转换为位置概率矩阵（positional probability matrix, PPM）。PPM 是由一组十进制数值构成的矩阵，这些数值基于序列比对中每个位置上各碱基出现的百分比或频率。换句话说，我们必须通过将每个位置上的核苷酸计数除以比对中的序列数量来归一化频率。因此，如果比对中有 5 条序列，并且第一个位置上有 3 个 A，那么第一个位置上 A 的位置概率就是 3/5 = 0.6。同样，如果第一个位置上有 1 个 C，则其位置概率为 1/5 = 0.2。1 个 G 对应的位置概率为 0.2，而没有 T 则对应位置概率为 0（见第 1 列）。对比对的全部 10 个位置执行同样计算后，完整的 PPM 如下：

A .6 .4 0 0 .2 0 0 1 .4 .2
C .2 0 0 .4 0 0 0 0 .2 .8
G .2 0 .2 0 0 1 0 0 .2 0
T 0 .6 .8 .6 .8 0 1 0 .2 0

可以将上述 PPM 中的概率相乘，以计算给定 DNA 序列与原始 5 条序列密切相关的概率。例如，如果我们想知道新序列 ATTTTGTATA 是否密切相关，就可以将每个序列位置对应的数值相乘来计算该序列的概率：

p = 0.6 × 0.6 × 0.8 × 0.6 × 0.8 × 1 × 1 × 1 × 0.2 × 0.2 = 0.0055

请注意，如果我们对一个几乎相同的序列（例如 ACTTTGTATA，仅相差一个碱基）执行相同计算，会得到 p = 0。之所以得到 0 概率，是因为在训练集中第二个位置没有观察到 C。仅用 5 条序列构建 PPM，意味着你很可能低估（或高估）每个碱基真实的分数频率，从而在计算概率时产生类似刚才看到的问题。为了解决多序列比对（multiple sequence alignment, MSA）规模较小的问题，我们应当引入伪计数（pseudocount）。伪计数用于避免矩阵项取值为 0 所导致的问题。使用伪计数等价于将 PPM 的每一列乘以一个 Dirichlet 分布，从而允许为“未观察到”或未使用过的序列计算概率。一种简单做法是把数据归一化，使其匹配所考虑基因组的总体碱基组成，并加入一个随 MSA 中序列数量平方根而变化的校正因子。因此，可以用以下公式对 PPM 中每个碱基位置重新评分：

score (Xi) = (Qx + Px)/(N + B)

其中，Qx 是位置 i 上 X 类型碱基的计数；Px 是 X 类型碱基的伪计数，等于 B × X 类型碱基的频率；N 是 MSA 中序列的总数；B 是伪计数的数量（假定为 √N）。对于目标基因组，A 的频率为 0.32，T 的频率为 0.32，C 的频率为 0.18，G 的频率为 0.18。利用这些信息，第一个位置上 A 的值为 (3 + (√5 × 0.32))/(5 + √5) = 0.51。第二个位置上 C 的值为 (1 + (√5 × 0.18))/(5 + √5) = 0.19，依此类推。经过伪计数校正后的 PPM 如下：

A .51 .38 .09 .09 .24 .09 .09 .79 .38 .24
C .19 .06 .06 .33 .06 .06 .06 .06 .19 .61
G .19 .06 .19 .06 .06 .75 .06 .06 .19 .06
T .09 .51 .65 .51 .65 .09 .79 .09 .24 .09

理想情况下，每一列的总和应为 1；但由于四舍五入，本例中的列和有时会略高于或略低于 1。使用这个重新评分后的矩阵，你会注意到现在已经没有零项。然而，通过乘法计算概率既繁琐（考虑到有效数字的数量），也较为困难。更简单的方法是通过取 PPM 中每个数值的负 log10，将 PPM 转换为另一种矩阵。这会把两位小数转换为一位小数，同时也允许通过加法而不是乘法来计算概率。如果对上述 PPM 取 −log10，可得到：

A 0.3 0.4 1.0 1.0 0.6 1.0 1.0 0.1 0.4 0.6
C 0.7 1.2 1.2 0.5 1.2 1.2 1.2 1.2 0.7 0.2
G 0.7 1.2 0.7 1.2 1.2 0.1 1.2 1.2 0.7 1.2
T 1.1 0.3 0.2 0.3 0.2 1.0 0.1 1.0 0.6 1.0

这个修改后的矩阵称为对数似然评分矩阵，或 PSSM。利用上述 PSSM，我们现在可以计算查询序列 ATTTTGTATA 的得分（或对数似然）：

0.3 + 0.3 + 0.2 + 0.3 + 0.2 + 0.1 + 0.1 + 0.1 + 0.6 + 0.6 = 2.8

序列得分提示该序列与随机序列有多大差异。得分越高，该序列越可能是启动子/功能位点，而不是随机序列。2.8 是一个很高的得分。序列得分也可以从该序列结合能的角度来解释。

然而，这样一个简化算法的正确率可能只有 75%–80%（Besemer et al. 2001）。这是因为原核基因并不总是那么容易识别。例如，并非所有细菌基因都使用 ATG 起始密码子。在大肠杆菌（Escherichia coli）中识别出的 4284 个基因里，83% 使用 ATG，14% 使用 GTG，3% 使用 TTG 起始密码子（Blattner et al. 1997）。同样，如果使用只识别长 ORF 的简单规则，可能会漏掉许多短 ORF，或错误识别具有异常密码子偏倚的 ORF（这提示它们不太可能编码基因）。事实上，已知编码蛋白质的 ORF 与偶然出现的 ORF 在长度分布上差异相当显著。更具体地说，编码 ORF 的长度分布类似 gamma 分布（见术语表），而非编码 ORF 的长度分布类似简单指数函数（Lukashin and Borodovsky 1998）。除这些复杂因素外，近来还发现某些原核基因由于一种称为无前导序列转录（leaderless transcription）的现象而具有非常异常的基因起始信号（Slupska et al. 2001）。在无前导序列转录中，RNA 转录本具有非常短的 5′ UTR，长度小于 6 个碱基。这些区域太短，无法容纳 RBS。这使 TSS 位于 TIS 处或非常接近 TIS。在这些情况下，必须使用启动子信号来更准确地识别 TIS。

鉴于许多原核基因信号在长度和特征上存在变化，PSSM 并不是最有效的信号识别工具。还有更高级的基因信号识别方法，例如 Markov 模型（Box 5.2）、隐 Markov 模型或 HMM（Box 5.3）、人工神经网络和支持向量机。这些机器学习方法在处理可变长度和条件性序列依赖关系方面表现好得多，而这些特征是简单 PSSM 无法捕捉的。

Box 5.2 Markov 模型

Markov 链、模型或过程指一系列观测，其中某一观测的概率取决于若干先前观测。观测数量定义了链的“阶数”。例如，在一阶 Markov 模型中，某一观测的概率只取决于前一个观测。在 5 阶 Markov 链中，某一观测的概率取决于前 5 个观测。DNA 序列可以被视为 Markov 模型的一个例子，因为在给定位置观察到某个特定碱基的可能性可能取决于它前面的碱基。特别是在编码区中，众所周知，某一给定碱基的概率取决于前 5 个碱基，这反映了观察到的密码子偏倚以及相邻密码子之间的依赖关系。在非编码区中，则观察不到这种依赖性。当扫描一个未知的基因组区域时，可以计算局部核苷酸序列在多大程度上符合编码区中观察到的 5 阶依赖关系，并赋予适当的编码可能性得分。

Box 5.3 基因预测中的隐 Markov 模型

隐 Markov 模型（hidden Markov model, HMM）用于为真实生物过程提供统计表示。它们已被广泛用于生物信息学的许多领域，包括多序列比对、蛋白质家族的表征和分类、蛋白质结构比较，以及基因结构预测。

在本章中，所描述的所有基因查找方法都有两个共同点：它们以原始核苷酸序列作为输入，并且对于序列中的每一个位置，尝试预测某个给定碱基最可能位于内含子、外显子还是基因间区。在进行这些预测时，所应用的算法（无论是否为 HMM）必须考虑基因结构的已知信息；图 5.2 以简化方式展示了这一结构。

从基因的 5′ 端到 3′ 端，该方法必须考虑启动子区域、转录起始位点、5′ UTR、起始密码子、外显子、剪接供体、内含子、剪接受体、终止密码子、3′ UTR 和 polyA 尾的独特特征。除每个区域可能具有的保守序列或组成偏倚外（Box 5.1），该方法还需要考虑这些元件均按受控语法出现；例如，启动子（及其 TATA box）必须出现在起始密码子之前，初始外显子必须跟在起始密码子之后，内含子必须跟在外显子之后，内含子之后只能是内部外显子或末端外显子，终止密码子不能打断编码区，而 polyA 信号必须出现在终止密码子之后。最后，在整个过程中必须维持一个 ORF，以便在一切完成后产生蛋白质。

这些元件——外显子、内含子等——称为状态（states）。上述序列特征和语法约束允许为其分配转移概率，用来表示沿着基因结构移动时发生状态变化的可能性。

Transcribed region
Exon 1    Exon 2    Exon 3
Intron 1  Intron 2
Start codon
5′ UTR
3′ UTR
Stop codon
Downstream intergenic region
Upstream intergenic region

图 5.2 真核基因的简化示意图，展示了多内含子/外显子结构、起始和终止密码子的位置、非翻译区（UTR），以及围绕转录基因的基因间区。

真核基因组中的 Ab Initio 基因预测

图 5.2 展示了真核基因的组织方式。从该图可以看出，真核基因比原核基因复杂一些。特别是，真核基因组（尤其是脊椎动物基因组）中编码蛋白质区域的密度比原核基因组低 90–100 倍。这些稀疏的蛋白质编码区域被很长的基因间 DNA 区段隔开，而它们的编码序列（即外显子）又被很大的非编码内含子打断。真核 RNA 聚合酶识别并转录基因，随后由多种小核核糖核蛋白（small ribonuclear proteins, snRNPs）切割产生的长 RNA 转录本，以去除内含子（Will and Lührmann 2011）。剩余的外显子随后被剪接在一起，形成小得多的蛋白质编码转录本。snRNPs 能够识别外显子/内含子连接处的特定切割位点，以确保剪接总是精确进行。

在人类基因组中，只有 1.1% 由外显子组成，24% 由内含子组成，而 75% 的基因组由基因间 DNA 构成。平均而言，每个基因有 5.48 个外显子，每个外显子编码 30–36 个氨基酸的肽段（Sakharkar et al. 2002）。人类基因组中最长的外显子有 11 555 个碱基，而最短的外显子只有 2 个碱基（Sakharkar et al. 2002）。外显子不仅“稀少”，其长度差异也极大。更重要的是，它们可以通过可变剪接产生非常不同的最终基因（转录本）产物。这使得真核生物中的基因预测明显比原核生物更加困难。

真核生物的计算基因预测，本质上是在模拟生物学中的转录和剪接过程。在生物学过程中，细胞内的多种蛋白质和蛋白质复合体扫描 DNA 序列，识别并结合特定 DNA 位点，转录基因，然后切割并剪接转录本，形成最终基因产物。在计算过程中，这些蛋白质被各种算法替代，这些算法会：

沿查询序列识别合适的剪接位点以及起始和终止信号，并对其评分；
通过检测这些信号推断候选外显子的位置；
根据用于检测外显子的信号，以及根据假定外显子序列本身计算出的编码统计量，对最佳外显子进行评分和识别；
将这些候选外显子的一个子集组装（或“剪接”）成预测的基因结构。该组装过程会以最大化某个特定评分函数的方式产生，而该评分函数依赖于每个候选外显子的得分。

这些任务的具体实现方式因程序而异。这里不逐一详细讨论每个程序，而是描述几乎所有 ab initio 真核基因预测程序共有的三大主要过程：预测外显子界定信号、预测并评分外显子，最后进行外显子组装。

预测外显子界定信号

正如原核基因具有 DNA 信号一样，真核基因也具有独特的 DNA 信号。其中一些元件与原核生物相似，另一些则差异很大（图 5.3）。例如，许多真核基因具有启动子元件，这些元件也与原核基因中的序列表现出一定相似性。真核生物中研究最充分的核心启动子元件称为 TATA box，或 Goldberg–Hogness box（Lifton et al. 1978），位于 TSS 上游 25–30 个碱基对处。TATA box 也存在于古菌和细菌中，似乎是一种非常古老的 DNA 信号。真核生物中的 TATA box 共有序列为 TATA(A/T)A(A/T)，并且常常与另一种称为 CCAAT box 的调控序列偶联；CCAAT box 的共有序列为 GGCCAATCT，位于 TATA box 上游约 150 个碱基对处。只有约 25%–35% 的哺乳动物基因含有 TATA box，其余基因则含有其他类型的核心启动子元件。真核基因还含有核心启动子之外的调控序列，包括增强子、沉默子和绝缘子。这些调控序列可以分布在很大的基因组距离范围内，常常距离核心启动子数百千碱基。除了具有多种启动子或增强子信号外，真核基因还具有非常特异的 DNA 信号，用于界定外显子和内含子的位置。

更具体地说，参与界定外显子的基本 DNA 信号有四类：TIS、5′（或供体）剪接位点、3′（或受体）剪接位点，以及翻译终止密码子。在真核生物中，TIS 由 Kozak 共有序列界定，常写作 ACCATGG（Kozak 1987），其中中央的 ATG 是起始密码子。5′ 供体剪接位点通常由 GG/GT 这一共有序列界定，而 3′ 受体剪接位点的共有序列为 CAG/G，其中斜线表示剪接切割位点（图 5.4）。翻译终止密码子包括通常的 TAG、TAA 或 TGA。

GC box        ~200 bp
CCAAT box     ~100 bp
TATA box      ~30 bp
Gene
Transcription start site
Exon
Exon
Intron

图 5.3 真核基因上游区域的示意图，其中 GC box 位于转录起始位点上游约 200 bp，CCAAT box 位于上游约 100 bp，TATA box 位于上游约 30 bp。

Exon 1
Exon 2
Intron 1
Intron 2
Branchpoint site
5′ site
3′ site
AG/GT
CAG/NT

图 5.4 外显子和内含子周围剪接位点区域的示意图，包括 5′ 和 3′ 剪接位点及其共有序列。

最早用于识别外显子界定信号的方法是简单的 PWM 或 PSSM。事实证明，这些方法在识别短 DNA 信号（如剪接位点）方面表现较差。因此，这些简单模型后来被更先进的模式识别技术所取代，例如 HMM（Box 5.3）。这些强大的模式识别方法能够从由已知或充分注释的外显子界定信号组成的大型数据集中“学习”非常复杂的序列模式。HMM 是一种统计 Markov 模型，在该模型中，被建模系统被假定为具有未观测（即隐藏）状态的 Markov 过程。HMM 广泛用于许多现实应用，例如语音识别、手写识别和手势识别。HMM 在生物信息学中的应用始于 20 世纪 90 年代初（Krogh et al. 1994），并显著提高了基因预测准确性。HMM 使定义长度可变的高度复杂模式成为可能，这些模式包括许多外显子界定信号，如蛋白质编码区（见下文）、供体位点、受体位点、套索位点，以及翻译起始和终止位点。

预测并评分外显子

除了识别外显子界定信号之外，准确预测外显子还依赖基于内容的特征。外显子可分为三种基本类型：

初始外显子：由起始位点和供体位点界定的 ORF；
内部外显子：由 5′（供体）位点和 3′（受体）位点界定的 ORF；
末端外显子：由 3′（受体）位点和终止密码子界定的 ORF。

大多数被转录的基因由一个初始外显子、多个内部外显子和一个末端外显子组成。Zhang（2002）对这些真核外显子类型作了更全面的讨论。

按定义，外显子是蛋白质编码区域。与非编码区域相比，已知蛋白质编码区域表现出特征性的组成偏倚。这些偏倚包括略高的 GC 含量，以及明显非随机的密码子（三联体）频率偏好。观察到的密码子偏倚源于蛋白质中氨基酸分布不均、同义密码子使用不均，以及编码区中针对翻译优化的自然选择。为了区分蛋白质编码区和非编码区，20 世纪 90 年代发展出了许多基于 DNA 内容的度量方法（Fickett and Tung 1992; Gelfand 1995; Guigó 1999）。这些内容度量也称为编码统计量，反映给定 DNA 序列编码某种蛋白质或蛋白质片段的可能性。多年来，已经发表了许多计算基于内容度量的方法。最早的一些方法测量密码子三联体频率中出现的模式。然而，人们发现三联体对（即六聚体）的频率中包含更多信息。因此，六聚体频率通常以依赖密码子位置的五阶 Markov 模型形式表示（Box 5.2；Borodovsky and McIninch 1993），似乎能为识别外显子中的蛋白质编码区提供最佳判别能力。目前，这些六聚体频率处于所有现代真核基因预测器的核心位置。

外显子组装

一旦外显子被预测出来（使用六聚体频率和 HMM 的组合来识别关键基因信号以及外显子/内含子边界），就需要将它们组装成某种多外显子基因结构。外显子组装的主要困难在于简单的组合数学：对于任何给定基因，可能的外显子组装数量会随着预测外显子数量呈指数增长。为了解决这个问题，人们发展出了多种动态规划技术。动态规划是一种优化技术，它允许把复杂问题拆解为一组较简单的子问题来求解。每个子问题只求解一次，并保存其解。下次遇到相同子问题时，不再重新计算解，而是直接查找此前计算过的解（Bellman 1957；详见附录 6.A）。

对于最优外显子组装问题，已有研究表明，动态规划能够相当高效地找到解，而不必枚举或考虑每一种可能的外显子组合（Gelfand and Roytberg 1993）。几乎所有现代真核基因预测工具现在都使用某种动态规划方法（Markov 模型研究者称之为 Viterbi 算法，而大多数从事序列比对的人则称之为 Needleman–Wunsch 算法）。通过将基于 HMM 的外显子信号识别、不同的 HMM 派生外显子得分以及用于组装外显子的动态规划结合起来，就可以生成稳健的真核基因预测结果。使用动态规划的早期 HMM 基因预测方法包括 GENIE（Kulp et al. 1996）和 HMMgene（Krogh 1997）。也许最流行的基于 HMM 的真核基因预测器是 GENSCAN（Burge and Karlin 1997），这是一种 ab initio 基因预测器，已被广泛用于注释数百个真核基因组。

鉴于 GENSCAN 的流行，较为详细地解释该程序的工作方式并提供一个使用示例是有价值的。对于任意给定查询序列，GENSCAN 会在底层 HMM 的基础上确定最可能的基因结构。为了建模供体剪接位点，GENSCAN 引入了一种称为最大依赖分解（maximal dependence decomposition）的方法。在这种方法中，使用一系列权重矩阵（而不是仅使用一个矩阵）来捕捉这些剪接位点中不同位置之间的依赖关系。此外，GENSCAN 还使用一些参数来解释基因组序列的许多高阶性质（例如典型基因密度、每个基因的典型外显子数量，以及不同类型外显子的大小分布）。可以使用不同的基因模型参数集来校正不同基因组之间在基因密度和 G + C 组成方面的差异。也已经开发出用于玉米和拟南芥序列的模型。这样会使与已知蛋白质相似的外显子获得更高得分，而使与已知蛋白质几乎没有或完全没有相似性的预测外显子得分降低。

图 5.5 展示了一个典型的 GENSCAN 输出，使用人尿卟啉原脱羧酶（uroporphyrinogen decarboxylase, URO-D）基因（U30787）作为查询。预测中的每个外显子各占一行。从左到右，各列分别表示基因和外显子编号（Gn.Ex）、预测类型（Type，即外显子类型或识别出的 polyA 信号）、作出预测的链（+ 或 –）、预测的起点和终点、预测外显子的长度、其阅读框、若干评分列，以及概率值（P）。如果 GENSCAN 外显子具有很高的概率值（p > 0.99），并且该预测与真实注释外显子相匹配，那么其准确率为 97.7%。这些高概率预测可用于聚合酶链式反应引物的合理设计，以扩增互补 DNA（complementary DNA, cDNA），也可用于其他需要极高置信度的目的。GENSCAN 中概率位于 0.50–0.99 范围内的外显子通常被认为在多数情况下是正确的。对于 p 值高于 0.90 的预测，其最佳情况下的准确率大约为 88%。任何 p < 0.50 的预测都应视为不可靠，这些数据不会出现在数据表中。预测的氨基酸序列列在基因预测结果下方。在此处所示示例中，GENSCAN 正确预测了 URO-D 中 10 个外显子中的 9 个；只有初始外显子被漏掉了。

图 5.5 对尿卟啉原脱羧酶基因进行 GENSCAN 分析的示例输出。关于该输出的更详细说明见正文。

术语表（19 条）

English	中文
ab initio	保留原文，不译为“从头”。
open reading frame (ORF)	开放阅读框（ORF）
operon	操纵子
regulatory sequence	调控序列
transcription start site (TSS)	转录起始位点（TSS）
ribosome binding site (RBS)	核糖体结合位点（RBS）
translation initiation site (TIS)	翻译起始位点（TIS）
Shine-Dalgarno consensus sequence	Shine-Dalgarno 共有序列
positional weight matrix (PWM)	位置权重矩阵（PWM）
position-specific scoring matrix (PSSM)	位置特异性评分矩阵（PSSM）
positional frequency matrix (PFM)	位置频率矩阵（PFM）
positional probability matrix (PPM)	位置概率矩阵（PPM）
pseudocount	伪计数
log likelihood scoring matrix	对数似然评分矩阵
leaderless transcription	无前导序列转录
Markov model	Markov 模型
hidden Markov model (HMM)	隐 Markov 模型（HMM）
state	状态
transition probability	转移概率

037

How Well Do Gene Predictors Work?

PDF page 147 (middle) - PDF page 153 (top); printed pages 127-133

English SourcePDF extracted

---

How Well Do Gene Predictors Work?

The accuracy of gene prediction programs is usually determined using controlled, well-defined datasets, where the actual gene structure has been determined experimentally. Accuracy can be computed at the nucleotide, exon, or gene level, and each provides different insights into the accuracy of a predictive method. In the field of prokaryotic gene prediction, the results are almost always reported at the gene level and given in terms of a percentage, that is, the number of correct gene predictions divided by the total number of known or validated genes in the test set. In some cases, the number or percentage of over-predicted genes (false positives) is also reported. In the field of eukaryotic gene prediction, performance reporting tends to be somewhat more convoluted. This is because the evaluation problem is more complex and the overall performance is often much worse. As a general rule, two basic measures are used: sensitivity (or Sn), defined as the proportion of coding nucleotides, exons, or genes that have been predicted correctly; and specificity (or Sp), defined as the proportion of non-coding nucleotides, exons, or genes that have been correctly predicted as non-coding. A more detailed explanation of sensitivity, specificity, and a number of other evaluation metrics used in gene (and protein structure) prediction is given in Box 5.4. Also introduced in this box are the concepts of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs).

Figure 5.6 Schematic representation of measures of gene prediction accuracy at the nucleotide level. The actual gene structure is illustrated at the top with confirmed exons identified with light blue bars and confirmed introns in black lines. The predicted gene structure is illustrated at the bottom with predicted exons identified with red bars and predicted introns in black lines. The four possible outcomes of a prediction are shown: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The equations for sensitivity and specificity are also shown using appropriate combinations of TP, TN, FP, and FN: Sn = TP/(TP + FN) and Sp = TN/(TN + FP).

An example of a eukaryotic gene prediction with the four possible outcomes is shown in

Figure 5.6. This figure schematically illustrates the differences between a gene prediction and

the known (or observed) gene structure. Neither sensitivity nor specificity alone provides a

perfect measure of global accuracy, as high sensitivity can be achieved with little specificity

and vice versa. An easier to understand measure that combines the sensitivity and specificity

values is called the Matthews correlation coefficient (MCC or just CC), which is described

more formally in Box 5.4. The MCC ranges from −1 to 1, where a value of 1 corresponds to

a perfect prediction; a value of −1 indicates that every coding region has been predicted as

non-coding, and vice versa. Other accuracy measures are sometimes used as well; however,

the above-mentioned ones have been most commonly employed in the large assessment

projects on eukaryotic genome prediction such as the human ENCODE Genome Annotation

Assessment Project (EGASP; Guigó and Reese 2005), the RNA-seq Genome Annotation

Assessment Project (RGASP; Steijger et al. 2013) and the Nematode Genome Annotation

Assessment Project (nGASP; Coghlan et al. 2008).
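At the nucleotide level, the four outcomes can be counted directly from the actual and predicted exon coordinates. The sketch below assumes 1-based inclusive coordinates on a single strand and a hypothetical 100 bp sequence; the function names are illustrative only.

```python
def positions(exons):
    """Expand (start, end) intervals (1-based, inclusive) into a set of positions."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def nucleotide_confusion(actual_exons, predicted_exons, seq_len):
    """Return (TP, FP, FN, TN) counted per nucleotide."""
    actual = positions(actual_exons)
    predicted = positions(predicted_exons)
    tp = len(actual & predicted)   # coding and predicted coding
    fp = len(predicted - actual)   # non-coding but predicted coding
    fn = len(actual - predicted)   # coding but predicted non-coding
    tn = seq_len - tp - fp - fn    # non-coding and predicted non-coding
    return tp, fp, fn, tn


def sensitivity_specificity(tp, fp, fn, tn):
    """Sn = TP/(TP + FN) and Sp = TN/(TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    counts = nucleotide_confusion([(11, 30), (51, 70)], [(21, 30), (51, 80)], 100)
    print(counts, sensitivity_specificity(*counts))
```

Expanding intervals into sets is convenient for short sequences; for whole genomes an interval-based comparison would be more memory efficient.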

Box 5.4 Evaluating Binary Classifications or Predictions in Bioinformatics

Many predictions in bioinformatics involve essentially binary or binomial (i.e. true/false) classification problems. For instance, prokaryotic gene prediction can be framed as a binary classification problem where one tries to distinguish open reading frames (ORFs) from non-ORFs. Similarly, eukaryotic gene prediction can be posited as a binary classification problem of predicting exons and non-exons (introns) or genes and intergenic regions. Protein membrane helix prediction (discussed in Chapter 7) can be put in a similar binary classification frame, where one distinguishes between membrane helices and non-helices (or non-membrane regions). Binary classification problems can also be found in medicine, where one tries to predict or diagnose sick patients versus healthy patients, or quality control tasks in high-throughput manufacturing (pass versus fail).

The evaluation of binary classifiers or predictors normally follows a very standard practice with a common set of metrics and definitions. Unfortunately, this practice is not always followed when bioinformaticians evaluate their own predictors or predictions. This is why we have included this very important information box, one that is referred to frequently throughout this book.

As shown in the diagram below, a binary classifier or predictor can have four combinations of outcomes: true positives (TP or correct positive assignments), true negatives (TN or correct negative assignments), false positives (FP or incorrect positive assignments), and false negatives (FN or incorrect negative assignments). In statistics, the false positives are called type I errors and the false negatives are called type II errors (see Chapter 18).

PREDICTION \ OBSERVATION	Observed state positive	Observed state negative
Predicted state positive	True positive (TP)	False positive (FP)
Predicted state negative	False negative (FN)	True negative (TN)

Once a binary classifier has been run on a set of data, it is possible to calculate specific numbers for each of these four outcomes using the above 2 × 2 contingency table. So a gene predictor that predicted 1000 genes in a genome that had only 900 genes may have 850 TPs, 200 TNs, 150 FPs, and 50 FNs. From this set of four outcomes it is possible to calculate eight ratios. These ratios can be obtained by dividing each of the four numbers (TP, TN, FP, FN) by the sum of its row or column in the 2 × 2 contingency table. The most important ratios and their names or abbreviations are listed (along with their formulae) below. Also included are several other binary classifier evaluation metrics that are used by certain subdisciplines in bioinformatics or statistics.

Name	Formula
Sensitivity (Sn), recall, true positive rate (TPR)	TP/(TP + FN)
Specificity (Sp), true negative rate (TNR)	TN/(TN + FP)
Precision, positive predictive value (PPV)	TP/(TP + FP)
False positive rate (FPR)	FP/(FP + TN)
False discovery rate (FDR)	FP/(FP + TP)
Negative predictive value (NPV)	TN/(TN + FN)
Accuracy (ACC), Q2	(TP + TN)/(TP + FP + TN + FN)
F1 score, F score, F measure	2TP/(2TP + FP + FN)
Matthews correlation coefficient (MCC)	(TP × TN − FP × FN)/√[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]

Sensitivity (Sn, recall, or TPR) measures the proportion of actual positives that are correctly identified as such, while specificity (Sp or TNR) measures the proportion of actual negatives that are correctly identified as such. Precision (PPV) is the proportion of positive results that are true positive results, while NPV is the proportion of negative results that are true negative results. FDR is the binary (not the multiple testing) measure of false positives divided by all positive predictions. Accuracy or ACC (for binary classification) is defined as the number of correct predictions made divided by the total number of predictions made. ACC is one of the best ways of assessing binary test or predictor accuracy. The F1 score is another measure of test accuracy and is defined as the harmonic mean of precision (PPV) and recall (Sn). MCC is a popular measure of test or predictor accuracy. It is closely related to the chi-squared statistic for a standard 2 × 2 contingency table (|MCC| = √(χ²/n), where n is the total number of observations). In effect, MCC is the correlation coefficient between the observed and predicted binary classifications.

Different fields of science have different preferences for different metrics owing to different traditions or different objectives. In medicine and most fields of biology (including bioinformatics), sensitivity and specificity are most often used to assess a binary classifier, while in machine learning and information retrieval, precision and recall are usually preferred. Likewise, different prediction tasks within bioinformatics tend to report performance with different measures. Gene predictors generally report Sn, Sp, and ACC, while protein structure predictors generally report ACC and MCC. The accuracy (ACC) score in protein secondary structure prediction has also been termed Qn, where n is the number of secondary structure classes (usually n = 3). In gene prediction, the ACC score is given as Q2 since only two classes are identified (either exon/intron or ORF/non-ORF). Each of the above ratios (except for MCC, which ranges from −1 to 1) can take on values from 0 to 1. For a perfect prediction, Sn = 1, Sp = 1, PPV = 1, NPV = 1, ACC = 1, F1 = 1, MCC = 1, FPR = 0, and FDR = 0, whereas a completely incorrect prediction would yield Sn = 0, Sp = 0, PPV = 0, NPV = 0, ACC = 0, F1 = 0, MCC = −1, FPR = 1, and FDR = 1.
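The ratios in this box follow mechanically from the four counts, as the following sketch shows. The counts used here (850 TPs, 200 TNs, 150 FPs, 50 FNs) are purely illustrative, the function name is hypothetical, and returning MCC = 0 when its denominator is zero is a common convention adopted here as an assumption. The sketch also assumes that every row and column sum of the contingency table is non-zero.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute the Box 5.4 evaluation metrics from the four outcome counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn),                   # sensitivity, recall, TPR
        "Sp": tn / (tn + fp),                   # specificity, TNR
        "PPV": tp / (tp + fp),                  # precision
        "FPR": fp / (fp + tn),
        "FDR": fp / (fp + tp),
        "NPV": tn / (tn + fn),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        # Assumed convention: MCC is reported as 0 when the denominator is 0
        "MCC": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


if __name__ == "__main__":
    for name, value in binary_metrics(tp=850, tn=200, fp=150, fn=50).items():
        print(f"{name}: {value:.3f}")
```

Note that FPR = 1 − Sp and FDR = 1 − PPV, so only some of these ratios carry independent information.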

The performance of any binary predictor has to be assessed based on the existing bias

in the numbers within different classes (i.e. uneven class distribution). For example, an

ACC of 0.95 may seem excellent, but if 95% of the dataset belongs to just one class, then

the same ACC score could be easily achieved by simply predicting everything to be in that

class. This is the situation with many mammalian genomes, which have large intergenic

regions. Therefore predicting every nucleotide as being “intergenic” would easily create

a nucleotide-based gene predictor that would be >95% accurate. Such a predictor would,

of course, be completely useless.
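The class-imbalance problem is easy to reproduce numerically. The sketch below assumes a hypothetical genome of 1 000 000 nucleotides in which 5% are coding, and a "predictor" that labels every nucleotide as intergenic; MCC is reported as 0 when its denominator is zero, which is an assumed convention.

```python
import math


def acc_sn_mcc(tp, tn, fp, fn):
    """Return (ACC, Sn, MCC); Sn and MCC fall back to 0.0 when undefined."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return acc, sn, mcc


def all_intergenic_counts(genome_length, coding_nucleotides):
    """Outcome counts for a predictor that calls every nucleotide intergenic."""
    tp, fp = 0, 0                               # nothing is ever predicted as coding
    fn = coding_nucleotides                     # every coding nucleotide is missed
    tn = genome_length - coding_nucleotides     # every non-coding nucleotide is "correct"
    return tp, tn, fp, fn


if __name__ == "__main__":
    counts = all_intergenic_counts(1_000_000, 50_000)
    print(acc_sn_mcc(*counts))
```

The accuracy looks excellent while the sensitivity and MCC are both zero, which is why a single metric should not be reported in isolation.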

Assessing the accuracy of gene prediction methods requires sets of reliably annotated genes

verified by experimental or computational evidence derived from complementary sources of

information. Experimental evidence can be provided by mass spectrometry-based proteomics

or by structural biology methods such as nuclear magnetic resonance spectroscopy or X-ray

crystallography (Chapter 12) that provide direct, visual confirmation of the protein sequences.

Computational evidence can appear in the form of similarity of the derived protein sequence

to the primary structures of proteins whose functions were verified experimentally. Extensive

gene prediction assessments have been done for both prokaryotic and eukaryotic organisms.

Assessing Prokaryotic Gene Predictors

The evaluation of prokaryotic gene predictors has been ongoing for many years, with each publication that describes a new program (or a new version of an existing program) providing a detailed performance assessment (Larsen and Krogh 2003; Delcher et al. 2007; Hyatt et al. 2010; Borodovsky and Lomsadze 2011). One of the most recent and comprehensive assessments of prokaryotic gene predictors was conducted by Hyatt et al. in 2010. In this paper the authors compared five different programs: Prodigal 1.20 (Hyatt et al. 2010), GeneMarkHMM 2.6 (Borodovsky and Lomsadze 2011), GLIMMER 3.02 (Delcher et al. 2007), EasyGene 1.2 (Larsen and Krogh 2003), and MED 2.0 (Zhu et al. 2007) on two different tasks. The first task involved predicting experimentally verified genes with experimentally verified TISs from 10 different bacterial and archaeal genomes. In this case, only 2443 (out of a possible 35 000+) genes were considered experimentally verified. Hyatt et al. found that all five of the programs were able to achieve a 98–99.8% level of accuracy (at the gene level) for the 3′ end of these verified genes and an 87–96.7% level of accuracy (at the gene level) for the complete genes (both the 5′ and 3′ ends being correctly predicted). The second task involved predicting GenBank (mostly manually) annotated genes from seven different bacterial genomes. In this case, a total of 23 648 genes were evaluated. All of the programs were able to achieve a 95–99% level of accuracy (at the gene level) for the 3′ end of these genes. However, their performance for the full gene prediction task (both 5′ and 3′ ends being correctly predicted) was much more variable, with accuracy values ranging from 69% to 91% correct. The overall prediction average for all five programs over all genes in this second task was about 80%. It is also notable that all five programs generally over-predicted the number of genes annotated in GenBank by about 4–5%, with some programs (MED 2.0) over-predicting by as much as 40%.
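The distinction between 3′-end accuracy and full-gene accuracy can be made concrete with a small sketch. Here each gene is represented as a hypothetical (five_prime, three_prime, strand) tuple; the coordinates are invented, and this is only one plausible way of scoring gene-level agreement, not the procedure used by Hyatt et al.

```python
def gene_level_report(reference, predicted):
    """Compare predicted genes with reference genes at the gene level.

    Genes are (five_prime, three_prime, strand) tuples. A 3' match requires the
    same strand and 3' coordinate; a full match requires the 5' end as well.
    """
    ref = set(reference)
    pred = set(predicted)
    predicted_three_prime = {(three, strand) for _five, three, strand in pred}
    three_prime_hits = sum(
        (three, strand) in predicted_three_prime for _five, three, strand in ref
    )
    full_hits = len(ref & pred)
    return {
        "three_prime_accuracy": three_prime_hits / len(ref),
        "full_gene_accuracy": full_hits / len(ref),
        # Fraction by which the number of predictions exceeds the reference count
        "over_prediction": (len(pred) - len(ref)) / len(ref),
    }


if __name__ == "__main__":
    reference = [(100, 400, "+"), (900, 500, "-"), (1000, 1300, "+"), (2000, 2600, "+")]
    predicted = [(100, 400, "+"), (850, 500, "-"), (1000, 1300, "+"),
                 (2100, 2600, "+"), (3000, 3300, "+")]
    print(gene_level_report(reference, predicted))
```

In this toy example every 3′ end is found but only half of the 5′ ends are placed correctly, mirroring the pattern that start-site prediction is the harder part of the task.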

Based on the data provided by Hyatt et al. (2010), the two best-performing prokaryotic gene

prediction programs were Prodigal and GeneMark, while the other three programs were only

marginally worse. Their results also show that the task of predicting the 3′ ends of prokaryotic

genes is essentially solved, while the challenge of predicting the 5′ ends of prokaryotic genes

needs more work. It is also evident that some prokaryotic genomes are harder to predict than

others, with full-gene prediction performance on the E. coli genome often hovering around

90%, while the performance for less-studied genomes (such as Halobacterium salinarum) is often

around 70%. These results reflect the fact that ab initio gene predictors (both prokaryotic and

eukaryotic) require very extensive training on a large number of high-quality gene models.

Once trained, these tools can perform very impressively, especially in well-studied genomes

for which ample training data are available. However, the level of training required to reach

very high accuracy is often hard to achieve for newly assembled bacterial genomes.

Assessing Eukaryotic Gene Predictors

The assessment of eukaryotic gene predictors has been going on for more than 20 years. In

the early days, most eukaryotic gene prediction evaluations were conducted on single genes

whose exon/intron structure had been well characterized. This reflected the fact that very few

(if any) eukaryotic genomes had been fully sequenced and only a small number of eukaryotic

genes had their exon/intron structure fully determined. It also made the gene prediction tasks

much simpler, as the coding (exon) density of such single-gene sequences (25–50%) is much higher than

that of an entire genome (which is often <2%). This also led to overly optimistic performance

ratings. More recently, the field has evolved to assessing gene prediction performance over

entire genomes.

Burset and Guigó (1996) published one of the first systematic evaluations of eukaryotic gene

predictors. Their study evaluated seven programs, using a set of 570 vertebrate single-gene

sequences. The average CC at the nucleotide level for these programs ranged from 0.65 to 0.80.

Later, Rogic et al. (2001) performed a similar analysis of seven gene prediction programs, using

a set of 195 single-gene sequences from human and rodent species. The programs tested in the

Rogic et al. study showed substantially higher accuracy than those evaluated in the Burset

and Guigó study, with the average CC at the nucleotide level ranging from 0.66 to 0.91. This

increase in the upper part of the range illustrates the significant advances that occurred in the

development of gene prediction methods over a relatively short period of time.

The early evaluations put forth by Burset and Guigó (1996), Rogic et al. (2001), and others all

suffered from the same limitation: the gene finders were tested using controlled datasets

comprising short genomic sequences, each encoding a single gene with a simple gene structure. These

datasets are obviously not representative of genomic sequences as a whole. Complete genome

sequences contain long stretches with low coding density, stretches coding for multiple or

incomplete genes (or both), and stretches having very complex or alternative gene structures.

As a result, two large-scale studies were conducted to assess the performance of ab initio

eukaryotic gene predictors on real-world mammalian genomic data. The first was based on an

analysis of human chromosome 22 (Parra et al. 2003) and the second was based on an analysis

of the human ENCODE regions (Guigó et al. 2006), covering about 1% of the human genome.

When human chromosome 22 was sequenced, it was subjected to very extensive manual

analyses, experimental confirmation, and detailed annotation by many experts (Dunham et al.

1999). This was done to provide a useful gold standard (at the time) for assessing genome

prediction and genome annotation tools. As a result, Parra et al. used the manually annotated

data for chromosome 22 to assess the performance of GENSCAN (Burge and Karlin 1997),

GenomeScan (Yeh et al. 2001), TBLASTX (Gish and States 1993), GeneID (Blanco et al. 2002),

and SGP-2 (Parra et al. 2003) at the nucleotide, exon, and whole gene/transcript level. The

results were quite disappointing. At the nucleotide level the programs had an average sensitivity/specificity

([Sp + Sn]/2) value ranging from 0.62 to 0.75 and a CC value ranging from 0.54 to

0.73. At the exon level, the programs had an average sensitivity/specificity value ranging from

0.54 to 0.62 and at the gene/transcript level the average sensitivity/specificity value ranged

from 0.05 to 0.11. The latter values are the numbers of greatest interest as they reflect the true

level of gene prediction performance. Interestingly, GENSCAN and GenomeScan performed

somewhat worse than GeneID and SGP-2. Indeed, SGP-2 consistently performed better than

all of the “pure” ab initio predictors as it also made use of comparative genomic data from

the mouse genome. The inclusion of experimental sequence data technically made SGP-2

an extrinsic gene finder rather than a pure ab initio gene predictor.
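
To make the exon-level numbers concrete, the sketch below scores a predicted exon set against an annotated one, counting an exon as correct only when both of its boundaries match exactly. It assumes the convention commonly used in gene-finder evaluations, in which exon-level "specificity" is the fraction of predicted exons that are correct (TP / (TP + FP)), because true negatives are not well defined for exons. The function name and the coordinates are hypothetical.

```python
def exon_level_scores(annotated, predicted):
    """Exon-level Sn, Sp, and their average, using exact boundary matches.

    Sp here is TP / (TP + FP), i.e. the fraction of predicted exons that
    are correct (an assumption of this sketch, stated in the text above).
    """
    annotated, predicted = set(annotated), set(predicted)
    tp = len(annotated & predicted)          # exons with both boundaries correct
    sn = tp / len(annotated) if annotated else 0.0
    sp = tp / len(predicted) if predicted else 0.0
    return sn, sp, (sn + sp) / 2

# Hypothetical exon coordinates (start, end) on one strand
annotated = [(100, 200), (300, 450), (600, 700), (900, 1000)]
predicted = [(100, 200), (300, 440), (600, 700), (1200, 1300), (1500, 1600)]

sn, sp, avg = exon_level_scores(annotated, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}, (Sn + Sp)/2 = {avg:.2f}")
```

A gene-level score is stricter still, since every exon of a gene has to be correct for the gene to count, which helps explain why the gene-level values above fall so far below the exon-level ones.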

A similar level of very high-quality manual annotation was achieved in 2005–2006 during

the first phase of the Encyclopedia of DNA Elements (ENCODE) project. The ENCODE project

is a long-term, multi-phase project that started in 2003 with the goal of identifying all of the

functional elements within the human genome sequence. During its pilot phase, a number of

regions from the human genome (approximately 1%) were selected for detailed investigation.

The availability of this “gold standard” dataset led to a second, much larger evaluation that

looked at the predictive performance of pure ab initio predictors, as well as gene finders that

used additional extrinsic data such as sequence homology and experimental sequencing data

(Guigó et al. 2006). For the Guigó et al. study, four ab initio predictors were tested: AUGUSTUS

(Hoff and Stanke 2013), GeneMark-A (Besemer and Borodovsky 2005), GeneMark-B (Besemer

and Borodovsky 2005), and GeneZilla (Allen et al. 2006). Once again, the results were quite disappointing.

At the nucleotide level, the programs had a CC value ranging from 0.53 to 0.76. At

the exon level the programs had an average sensitivity/specificity value ranging from 0.40 to

0.57, and at the gene or transcript level the average sensitivity/specificity value ranged from

0.05 to 0.14. Overall, AUGUSTUS performed significantly better than the other ab initio programs

but not at a level that would allow one to use it to automatically annotate a eukaryotic

genome. However, the most important finding from this study was that significant improvements

(up to two times better at the exon level and up to four times better at the gene level)

could be made to the quality of eukaryotic gene annotations if comparative genomic data or

other experimental/extrinsic evidence were employed in the prediction process.

These studies prompted a significant change in the gene prediction community.

In particular, the developers of gene predictors moved from reluctantly using experimental

or extrinsic data to wholeheartedly embracing such data. In other words, gene

prediction began to change to gene finding and genome prediction began to evolve toward

genome annotation. In doing so, genome analysis became a more holistic, evidence-based

process that combined ab initio gene prediction with extrinsic gene-finding methods. These

extrinsic gene-finding methods combined many other computational tools and other lines of

evidence including gene expression data, proteomic data, sequence homology to other annotated

genomes, and even literature-derived data.

Translation

Translation: Ch5 Genome Annotation / How Well Do Gene Predictors Work?

Chapter: Ch5 Genome Annotation

Canonical section: How Well Do Gene Predictors Work?

Range: PDF page 147 (middle) to PDF page 153 (top); print pages 127-133

---

Chapter 5: Genome Annotation

How Well Do Gene Predictors Work?

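The accuracy of gene predictors is usually measured on controlled, well-defined datasets in which the true gene structures have been determined experimentally. Accuracy can be calculated at the nucleotide, exon, or gene level, and each level offers a different view of how accurate a prediction method is. In the field of prokaryotic gene prediction, results are almost always reported at the gene level and expressed as a percentage, that is, the number of correctly predicted genes divided by the total number of known or verified genes in the test set. In some cases, the number or percentage of over-predicted genes (false positives) is also reported.

In the field of eukaryotic gene prediction, performance reporting tends to be somewhat more complicated. This is because the evaluation problem itself is more complex and overall performance is usually much poorer. In general, two basic measures are used: sensitivity (Sn), defined as the proportion of coding nucleotides, exons, or genes that have been predicted correctly; and specificity (Sp), defined as the proportion of coding and non-coding nucleotides, exons, or genes that have been predicted correctly (i.e. the overall fraction of the prediction that is correct). Box 5.4 describes sensitivity, specificity, and several other evaluation metrics used in gene prediction (and protein structure prediction) in more detail. The box also introduces the concepts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Figure 5.6 shows an example of a eukaryotic gene prediction that illustrates the four possible prediction outcomes. The figure schematically depicts the differences between a gene prediction and the known (or observed) gene structure. Neither sensitivity nor specificity alone is a perfect measure of global accuracy, because high sensitivity can be achieved at the cost of very low specificity, and vice versa. A more easily understood measure that combines the sensitivity and specificity values is the Matthews correlation coefficient (MCC, also abbreviated simply as CC), which is described more formally in Box 5.4. The MCC ranges from -1 to 1, where 1 corresponds to a perfect prediction and -1 indicates that every coding region has been predicted as non-coding and vice versa. Other accuracy measures are sometimes used; however, the metrics above are the ones most commonly used in large-scale evaluations of eukaryotic genome prediction, such as the human ENCODE Genome Annotation Assessment Project (EGASP; Guigó and Reese 2005), the RNA-seq Genome Annotation Assessment Project (RGASP; Steijger et al. 2013), and the Nematode Genome Annotation Assessment Project (nGASP; Coghlan et al. 2008).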

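Figure 5.6

Figure 5.6 Schematic of the measures of gene prediction accuracy at the nucleotide level. The true gene structure is shown at the top, with confirmed exons drawn as light blue bars and confirmed introns as black lines. The predicted gene structure is shown below, with predicted exons drawn as red bars and predicted introns as black lines. The four possible outcomes of a prediction are shown: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The figure also gives the formulas for sensitivity and specificity in terms of the corresponding combinations of TP, TN, FP, and FN: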

Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
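
As an illustration of how the four outcomes in Figure 5.6 are counted at the nucleotide level, the following Python sketch labels every position of a short sequence as coding or non-coding in both the true and the predicted structure, then tallies TP, TN, FP, and FN. The sequence length, exon coordinates, and function names are hypothetical and chosen only for the example.

```python
def coding_mask(length, exons):
    """Boolean list marking positions covered by exons (0-based, end-exclusive)."""
    mask = [False] * length
    for start, end in exons:
        for i in range(start, end):
            mask[i] = True
    return mask

def nucleotide_counts(length, true_exons, predicted_exons):
    """Count TP, TN, FP, FN by comparing the two masks position by position."""
    truth = coding_mask(length, true_exons)
    pred = coding_mask(length, predicted_exons)
    tp = sum(t and p for t, p in zip(truth, pred))
    tn = sum((not t) and (not p) for t, p in zip(truth, pred))
    fp = sum((not t) and p for t, p in zip(truth, pred))
    fn = sum(t and (not p) for t, p in zip(truth, pred))
    return tp, tn, fp, fn

tp, tn, fp, fn = nucleotide_counts(
    length=100,
    true_exons=[(10, 30), (50, 70)],
    predicted_exons=[(10, 25), (45, 70), (80, 90)],
)
sn = tp / (tp + fn)   # Sn = TP / (TP + FN)
sp = tn / (tn + fp)   # Sp = TN / (TN + FP)
print(tp, tn, fp, fn, round(sn, 3), round(sp, 3))
```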

Layout evidence: ../03_layout/page_148_render.png

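Box 5.4 Evaluating Binary Classifications or Binomial Predictions in Bioinformatics

Many predictions in bioinformatics are, in essence, binary or binomial (i.e. true/false) classification problems. For example, prokaryotic gene prediction can be framed as a binary classification problem in which one tries to distinguish open reading frames (ORFs) from non-ORFs. Likewise, eukaryotic gene prediction can be framed as a binary classification problem of predicting exons versus non-exons (introns), or genes versus intergenic regions. The prediction of protein membrane helices (discussed in Chapter 7) can be placed in a similar binary classification framework: distinguishing membrane helices from non-helices (or non-membrane regions). Binary classification problems also arise in medicine, for example in predicting or diagnosing sick versus healthy patients, and in quality control tasks in high-throughput manufacturing (pass versus fail).

The evaluation of a binary classifier or predictor usually follows a very standard set of practices and uses a common set of metrics and definitions. Unfortunately, these practices are not always followed when bioinformaticians evaluate their own predictors or predictions. This is why this very important information box has been included in the book; it will be referred to frequently in later chapters.

As shown in the schematic table below, a binary classifier or predictor can produce four combinations of outcomes: true positives (TP, or correct positive assignments), true negatives (TN, or correct negative assignments), false positives (FP, or incorrect positive assignments), and false negatives (FN, or incorrect negative assignments). In statistics, false positives are called type I errors and false negatives are called type II errors (see Chapter 18).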

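Predicted state / Observed state	Observed positive	Observed negative
Predicted positive	True positive (TP)	False positive (FP)
Predicted negative	False negative (FN)	True negative (TN)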

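Once a binary classifier has been run on a set of data, the 2 × 2 contingency table above can be used to calculate a specific value for each of the four outcome classes. For example, if a gene predictor predicted 1000 genes in a genome that actually has only 900 genes, it might yield 850 TPs, 200 TNs, 60 FPs, and 40 FNs. From these four outcome counts, eight ratios can be calculated. These ratios are obtained by dividing each of the four numbers (TP, TN, FP, FN) by the sum of its row or its column in the 2 × 2 contingency table. The most important ratios are listed below with their names or abbreviations (and formulas). A number of other binary classifier evaluation metrics used in certain subfields of bioinformatics or statistics are also listed.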

Name	Formula
Sensitivity (Sn); recall; true positive rate (TPR)	TP / (TP + FN)
Specificity (Sp); true negative rate (TNR)	TN / (TN + FP)
Precision; positive predictive value (PPV)	TP / (TP + FP)
False positive rate (FPR)	FP / (FP + TN)
False discovery rate (FDR)	FP / (FP + TP)
Negative predictive value (NPV)	TN / (TN + FN)
Accuracy (ACC), Q2	(TP + TN) / (TP + FP + TN + FN)
F1 score; F score; F measure	2TP / (2TP + FP + FN)
Matthews correlation coefficient (MCC)	(TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
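
The formulas in the table above translate directly into code. The short Python sketch below applies them to the counts from the worked example in the preceding paragraph (850 TP, 200 TN, 60 FP, 40 FN). The function name `binary_metrics` is an assumption made for illustration, and the sketch does not guard against zero denominators.

```python
from math import sqrt

def binary_metrics(tp, tn, fp, fn):
    """Ratios from a 2 x 2 contingency table, following the table above."""
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "FPR": fp / (fp + tn),
        "FDR": fp / (fp + tp),
        "NPV": tn / (tn + fn),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fp * fn)
               / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

m = binary_metrics(tp=850, tn=200, fp=60, fn=40)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Note that FPR is simply 1 - Sp and FDR is 1 - PPV, so only some of these ratios carry independent information.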

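Sensitivity (Sn, recall, or TPR) measures the proportion of actual positives that are correctly identified as positive, while specificity (Sp or TNR) measures the proportion of actual negatives that are correctly identified as negative. Precision (PPV) is the proportion of positive results that are truly positive, while the NPV is the proportion of negative results that are truly negative. The FDR here is the binary classification measure, i.e. false positives divided by all positive predictions; it is not the FDR used in multiple testing. Accuracy or ACC (for binary classification) is defined as the number of correct predictions divided by the total number of predictions. ACC is one of the best ways of assessing the accuracy of a binary test or predictor. The F1 score is another measure of test accuracy, defined as the harmonic mean of precision (PPV) and recall (Sn). The MCC is a popular measure of test or predictor accuracy. It is closely related to the chi-squared statistic for a standard 2 × 2 contingency table (|MCC| = sqrt(χ²/n), where n is the total number of observations). In effect, the MCC is the correlation coefficient between the observed and predicted binary classifications.

Because of differing traditions or goals, different scientific fields prefer different metrics. In medicine and most fields of biology (including bioinformatics), sensitivity and specificity are most often used to evaluate a binary classifier, whereas in machine learning and information retrieval, precision and recall are usually preferred. Likewise, different prediction tasks within bioinformatics tend to report performance using different measures. Gene predictors typically report Sn, Sp, and ACC, while protein structure predictors typically report ACC and MCC. Accuracy (ACC) in protein secondary structure prediction is also called Qn, where n is the number of secondary structure classes (usually n = 3). In gene prediction, the ACC score is denoted Q2 because only two classes are identified (exon/intron or ORF/non-ORF). With the exception of the MCC, each of the above ratios can take values from 0 to 1. For a perfect prediction, Sn = 1, Sp = 1, PPV = 1, NPV = 1, ACC = 1, F1 = 1, MCC = 1, and FPR = 0 or FDR = 0; a completely incorrect prediction would yield Sn = 0, Sp = 0, PPV = 0, NPV = 0, ACC = 0, F1 = 0, MCC = -1, and FPR = 1 or FDR = 1.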

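The performance of any binary predictor must be assessed in light of the existing numerical bias between the classes (i.e. the class imbalance). For example, an ACC of 0.95 may seem excellent; however, if 95% of the dataset belongs to a single class, the same ACC can easily be obtained by simply predicting everything as belonging to that class. This is exactly the situation in many mammalian genomes, which contain vast intergenic regions. Consequently, it is easy to obtain a gene predictor with a nucleotide-level accuracy of >95% by predicting every nucleotide as “intergenic.” Of course, such a predictor would be completely useless.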
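
The class-imbalance effect described above is easy to reproduce numerically. The sketch below assumes a hypothetical genome of 1 000 000 nucleotides, 5% of which are coding, and a "predictor" that labels every position as intergenic. The function name is illustrative, and reporting MCC as 0 when its denominator is zero is a convention adopted for this sketch rather than part of the definition.

```python
from math import sqrt

def acc_sn_mcc(tp, tn, fp, fn):
    """Accuracy, sensitivity, and MCC, with safe handling of empty classes."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # convention: 0 when undefined
    return acc, sn, mcc

# 1,000,000 nucleotides, 5% coding; everything is predicted "intergenic"
coding, noncoding = 50_000, 950_000
acc, sn, mcc = acc_sn_mcc(tp=0, tn=noncoding, fp=0, fn=coding)
print(f"ACC = {acc:.2f}, Sn = {sn:.2f}, MCC = {mcc:.2f}")
```

The high accuracy alongside zero sensitivity shows why ACC should not be read in isolation on imbalanced data.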

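Evaluating the accuracy of a gene prediction method requires a set of reliably annotated genes that have been validated by experimental or computational evidence from complementary sources of information. Experimental evidence can come from mass spectrometry-based proteomics or from structural biology approaches such as nuclear magnetic resonance spectroscopy and X-ray crystallography (Chapter 12), which can provide direct, visual confirmation of a protein sequence. Computational evidence can take the form of similarity between the protein sequence deduced from a predicted gene and the primary structure of a protein whose function has been experimentally verified. A great deal of work has been carried out on assessing gene predictors for both prokaryotes and eukaryotes.

Assessing Prokaryotic Gene Predictors

The evaluation of prokaryotic gene predictors has been ongoing for many years; each paper describing a new program (or a new version of an existing program) typically provides a detailed performance assessment (Larsen and Krogh 2003; Delcher et al. 2007; Hyatt et al. 2010; Borodovsky and Lomsadze 2011). One of the most recent and comprehensive assessments of prokaryotic gene predictors was conducted by Hyatt et al. in 2010. In this paper the authors compared five different programs on two different tasks: Prodigal 1.20 (Hyatt et al. 2010), GeneMarkHMM 2.6 (Borodovsky and Lomsadze 2011), GLIMMER 3.02 (Delcher et al. 2007), EasyGene 1.2 (Larsen and Krogh 2003), and MED 2.0 (Zhu et al. 2007).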

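The first task involved predicting experimentally verified genes with experimentally verified translation initiation sites (TISs) from 10 different bacterial and archaeal genomes. In this case, only 2443 genes (out of a possible 35 000+) were considered experimentally verified. Hyatt et al. found that all five programs were able to achieve a 98–99.8% level of accuracy (at the gene level) for the 3′ end of these verified genes and an 87–96.7% level of accuracy (at the gene level) for the complete genes (both the 5′ and 3′ ends being correctly predicted).

The second task involved predicting GenBank-annotated (mostly manually annotated) genes from seven different bacterial genomes. In this case, a total of 23 648 genes were evaluated. All of the programs were able to achieve a 95–99% level of accuracy (at the gene level) for the 3′ end of these genes. However, their performance on the full gene prediction task (both the 5′ and 3′ ends being correctly predicted) was much more variable, with accuracy values ranging from 69% to 91%. The overall prediction average for all five programs over all genes in this second task was about 80%. It is also notable that the five programs generally predicted about 4–5% more genes than were annotated in GenBank, with MED 2.0 over-predicting by as much as 40%.

Based on the data provided by Hyatt et al. (2010), the two best-performing prokaryotic gene prediction programs were Prodigal and GeneMark, while the other three programs were only marginally worse. Their results also show that the task of predicting the 3′ ends of prokaryotic genes is essentially solved, while the challenge of predicting the 5′ ends of prokaryotic genes needs more work. It is also evident that some prokaryotic genomes are harder to predict than others: full-gene prediction performance on the Escherichia coli (E. coli) genome often hovers around 90%, while the performance for less-studied genomes (such as Halobacterium salinarum) is often around 70%. These results reflect the fact that ab initio gene predictors (both prokaryotic and eukaryotic) require very extensive training on a large number of high-quality gene models. Once trained, these tools can perform very impressively, especially in well-studied genomes for which ample training data are available. However, the level of training required to reach very high accuracy is often hard to achieve for newly assembled bacterial genomes.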

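Assessing Eukaryotic Gene Predictors

The assessment of eukaryotic gene predictors has been going on for more than 20 years. In the early days, most eukaryotic gene prediction evaluations were conducted on single genes whose exon/intron structure had been well characterized. This reflected the fact that very few (if any) eukaryotic genomes had been fully sequenced and only a small number of eukaryotic genes had their exon/intron structure fully determined. It also made the gene prediction task much simpler, because the coding (exon) density of these test sequences was high (25–50%), far higher than the coding density of an entire genome (often <2%). This also led to overly optimistic performance ratings. More recently, the field has evolved to assessing gene prediction performance over entire genomes.

Burset and Guigó (1996) published one of the first systematic evaluations of eukaryotic gene predictors. Their study evaluated seven programs using a set of 570 vertebrate single-gene sequences. The average CC at the nucleotide level for these programs ranged from 0.65 to 0.80. Later, Rogic et al. (2001) performed a similar analysis of seven gene prediction programs using a set of 195 single-gene sequences from human and rodent species. The programs tested in the Rogic et al. study showed substantially higher accuracy than those evaluated in the Burset and Guigó study, with the average CC at the nucleotide level ranging from 0.66 to 0.91. This increase in the upper part of the range illustrates the significant advances that occurred in the development of gene prediction methods over a relatively short period of time.

The early evaluations by Burset and Guigó (1996), Rogic et al. (2001), and others all suffered from the same limitation: the gene finders were tested on controlled datasets made up of short genomic sequences, each encoding a single gene with a simple structure. These datasets are obviously not representative of genomic sequences as a whole. Complete genome sequences contain long stretches with low coding density, stretches coding for multiple or incomplete genes (or both), and stretches having very complex or alternative gene structures.

As a result, two large-scale studies were conducted to assess the performance of ab initio eukaryotic gene predictors on real-world mammalian genomic data. The first was based on an analysis of human chromosome 22 (Parra et al. 2003) and the second was based on an analysis of the human ENCODE regions (Guigó et al. 2006), covering about 1% of the human genome.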

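When human chromosome 22 was sequenced, it was subjected to very extensive manual analysis, experimental confirmation, and detailed annotation by many experts (Dunham et al. 1999). This was done to provide a useful “gold standard” (at the time) for assessing genome prediction and genome annotation tools. Parra et al. therefore used the manually annotated data for chromosome 22 to assess the performance of GENSCAN (Burge and Karlin 1997), GenomeScan (Yeh et al. 2001), TBLASTX (Gish and States 1993), GeneID (Blanco et al. 2002), and SGP-2 (Parra et al. 2003) at the nucleotide, exon, and whole gene/transcript level. The results were quite disappointing. At the nucleotide level, the programs had an average sensitivity/specificity ([Sp + Sn]/2) value ranging from 0.62 to 0.75 and a CC value ranging from 0.54 to 0.73. At the exon level, the programs had an average sensitivity/specificity value ranging from 0.54 to 0.62, and at the gene/transcript level the average sensitivity/specificity value ranged from 0.05 to 0.11. The last set of values is of greatest interest, as it reflects the true level of gene prediction performance. Interestingly, GENSCAN and GenomeScan performed somewhat worse than GeneID and SGP-2. Indeed, SGP-2 consistently performed better than all of the “pure” ab initio predictors, as it also made use of comparative genomic data from the mouse genome. The inclusion of experimental sequence data technically made SGP-2 an extrinsic gene finder rather than a pure ab initio gene predictor.

A similar level of high-quality manual annotation was achieved in 2005–2006 during the first phase of the Encyclopedia of DNA Elements (ENCODE) project. ENCODE is a long-term, multi-phase project that started in 2003 with the goal of identifying all of the functional elements within the human genome sequence. During its pilot phase, a number of regions from the human genome (approximately 1%) were selected for detailed investigation. The availability of this “gold standard” dataset led to a second, much larger evaluation that looked at the predictive performance of pure ab initio predictors, as well as gene finders that used additional extrinsic data such as sequence homology and experimental sequencing data (Guigó et al. 2006). For the Guigó et al. study, four ab initio predictors were tested: AUGUSTUS (Hoff and Stanke 2013), GeneMark-A (Besemer and Borodovsky 2005), GeneMark-B (Besemer and Borodovsky 2005), and GeneZilla (Allen et al. 2006). Once again, the results were quite disappointing. At the nucleotide level, the programs had a CC value ranging from 0.53 to 0.76. At the exon level, the programs had an average sensitivity/specificity value ranging from 0.40 to 0.57, and at the gene or transcript level the average sensitivity/specificity value ranged from 0.05 to 0.14. Overall, AUGUSTUS performed significantly better than the other ab initio programs, but not at a level that would allow one to use it to automatically annotate a eukaryotic genome. However, the most important finding from this study was that significant improvements (up to two times better at the exon level and up to four times better at the gene level) could be made to the quality of eukaryotic gene annotations if comparative genomic data or other experimental/extrinsic evidence were employed in the prediction process.

It was because of these studies that a significant change occurred in the gene prediction community. In particular, the developers of gene predictors moved from reluctantly using experimental or extrinsic data to wholeheartedly embracing experimental data. In other words, gene prediction began to change to gene finding and genome prediction began to evolve toward genome annotation. In doing so, genome analysis became a more holistic, evidence-based process that combined ab initio gene prediction with extrinsic gene-finding methods. These extrinsic gene-finding methods combined many other computational tools and other lines of evidence, including gene expression data, proteomic data, sequence homology to other annotated genomes, and even literature-derived data.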

Glossary (15 entries)

English	中文
sensitivity	敏感性
specificity	特异性
true positive	真阳性
true negative	真阴性
false positive	假阳性
false negative	假阴性
Matthews correlation coefficient	Matthews 相关系数
contingency table	列联表
precision	精确率
positive predictive value	阳性预测值
negative predictive value	阴性预测值
false discovery rate	错误发现率
translation initiation site	翻译起始位点
gold standard	金标准
extrinsic gene finder	外源性基因查找程序

038

Evidence Generation for Genome Annotation

PDF page 153 (middle) to PDF page 161 (upper part); print pages 133-141

English SourcePDF extracted

---

Evidence Generation for Genome Annotation

Genomic evidence is any information that can be used to identify or inform the structure

of a gene in an organism – be it a prokaryotic or a eukaryotic organism. Some of the most

useful evidence comes from experimental work such as transcriptional data (mRNA or

DNA data derived from RNA-seq experiments) or protein sequence data gathered about the

organism of interest or a closely related organism. Other kinds of evidence can be collected

through running various bioinformatic programs that identify genomic features such as

sequence repeats, tRNA and rRNA genes, pseudogenes, transcription factor binding sites,

retroviruses, prophages, and so on. In the following sections, we will briefly review some

of the evidence-generating approaches used for both extrinsic gene finding and genome

annotation.

Source: Ch5 Genome Annotation / Gene Annotation and Evidence Generation Using RNA-seq Data

PDF Pages: 153-154 | Print Pages: 133-134

Boundary: starts at RNA-seq subsection heading on PDF page 153; stops before the next subsection heading on PDF page 154.

---

Gene Annotation and Evidence Generation Using RNA-seq Data

RNA sequencing (RNA-seq) is a next-generation DNA sequencing (NGS) technique that

involves converting RNA (mRNA, tRNA, and rRNA) transcripts into double-stranded cDNA

fragments, then sequencing them using low-cost NGS sequencing methods (Wang et al. 2009).

Over the last decade, RNA-seq has helped to revolutionize genome annotation methods for

both eukaryotes and prokaryotes (Trapnell et al. 2009; Sallet et al. 2014). A typical RNA-seq

experiment generates thousands of short DNA sequence reads corresponding to gene-coding

regions (also known as coding sequence or CDS segments). These sequences can then be

aligned to the reference genome sequence using gapped, short-read aligners to determine

which genome regions were being transcribed. Some of the more popular gapped short-read

aligners include TopHat2 (Kim et al. 2013), Stampy (Lunter and Goodson 2011), and GSNAP

(Wu et al. 2016). These alignments can be further processed into putative transcripts using

tools such as Cufflinks (Trapnell et al. 2012), StringTie (Pertea et al. 2015), or Trinity (Grabherr

et al. 2011). In this way, RNA-seq provides experimental evidence (through DNA sequencing)

regarding the location of gene-coding regions.
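
The idea that aligned reads mark transcribed regions can be illustrated with a small, self-contained Python sketch. It is not how TopHat2, Cufflinks, or StringTie work internally. It simply piles up hypothetical read alignments (given as 0-based, end-exclusive intervals on one chromosome) into per-base coverage and reports contiguous regions whose coverage meets an arbitrary threshold. A spliced read is represented as several blocks, one per aligned segment.

```python
def coverage(length, reads):
    """Per-base read depth; each read is a list of aligned (start, end) blocks."""
    depth = [0] * length
    for blocks in reads:
        for start, end in blocks:
            for i in range(start, end):
                depth[i] += 1
    return depth

def transcribed_regions(depth, min_depth=2):
    """Contiguous (start, end) runs where depth >= min_depth."""
    regions, run_start = [], None
    for i, d in enumerate(depth):
        if d >= min_depth and run_start is None:
            run_start = i                      # a covered run begins
        elif d < min_depth and run_start is not None:
            regions.append((run_start, i))     # the run ended at i
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(depth)))
    return regions

reads = [
    [(10, 40)],
    [(20, 50)],
    [(30, 50), (80, 90)],    # spliced read: the 50-80 gap corresponds to an intron
    [(35, 50), (80, 100)],
    [(85, 110)],
]
depth = coverage(120, reads)
print(transcribed_regions(depth, min_depth=2))
```

The two regions reported for this toy input flank the gap shared by the spliced reads, which is the kind of signal a transcript assembler uses to propose an exon-intron-exon structure.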

The improvement in gene-finding performance and gene annotation quality when RNA-seq

data are used is quite substantial. In RGASP (Steijger et al. 2013), a diverse set of 14 genome

annotation approaches were compared (including intrinsic/ab initio methods, extrinsic methods,

and hybrid extrinsic/intrinsic methods). The gold standard for comparison was the reference

human genome annotations from the GENCODE project, consisting of computationally,

manually, and experimentally determined gene annotations (Harrow et al. 2012). It turned

out that the best-performing programs for the task of identifying protein-coding genes were

gene annotation tools that used RNA-seq data. Examples of gene annotation programs that

incorporate RNA-seq data into their gene finders include AUGUSTUS (Hoff and Stanke 2013),

mGene (Schweikert et al. 2009), Trembly, and Transomics (Sperisen et al. 2004).

As noted earlier, when processing RNA-seq data for genome annotation, one can either

splice-align the raw reads against the genome or first assemble transcript fragments

de novo and then align them to the genome via BLASTN. The former, “mapping-first”

approach was shown to lead to more accurate annotations in the RGASP assessment and so it

is highly recommended. Spliced alignment can be done with tools such as GSNAP (Wu et al.

2016), Stampy (Lunter and Goodson 2011), TopHat2 (Kim et al. 2013), or STAR (Dobin et al.

2013). Integrating coverage information from RNA-seq data into a gene annotation tool can

typically be done by increasing the score of candidate exons that are covered by RNA-seq by

a certain factor that depends on the local coverage of each covered exonic region. Rewarding

individual splice sites, supported by RNA-seq evidence, is relatively easy for HMMs. Some

gene annotation tools integrate evidence for complete introns (i.e. splice site pairs) from

RNA-seq data.
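
The coverage-dependent boost described above can be sketched as follows. This is a hypothetical scoring scheme written for illustration only: the logarithmic form of `rnaseq_boost`, the `strength` parameter, and the candidate exon scores are all assumptions, and the sketch does not reproduce the scheme used by AUGUSTUS or any other specific gene finder.

```python
import math

def rnaseq_boost(mean_coverage, strength=0.5):
    """Multiplicative bonus that grows slowly with coverage (hypothetical form)."""
    return 1.0 + strength * math.log10(1.0 + mean_coverage)

def rescore_exons(candidates, depth):
    """candidates: list of (start, end, base_score); depth: per-base RNA-seq coverage."""
    rescored = []
    for start, end, base_score in candidates:
        window = depth[start:end]
        mean_cov = sum(window) / len(window)   # local coverage of this candidate exon
        rescored.append((start, end, base_score * rnaseq_boost(mean_cov)))
    return rescored

depth = [0] * 100
for i in range(20, 60):
    depth[i] = 9                               # a well-covered region

candidates = [(20, 60, 2.0), (70, 90, 2.0)]    # identical ab initio scores
rescored = rescore_exons(candidates, depth)
for start, end, score in rescored:
    print(start, end, round(score, 2))
```

Two candidates with the same intrinsic score end up ranked differently once RNA-seq support is taken into account, while an unsupported exon keeps its original score rather than being discarded.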

Newer technologies that produce longer RNA-seq reads (10 000+ bases) have greatly

improved the ability to predict alternatively spliced transcripts in comparison with short reads

(100–400 bases), which mostly help to find local alternative splice variants. Long reads are

often near-complete transcripts and each spliced alignment gives the structure of a transcript,

albeit only approximately owing to the relatively high sequencing error rate. Gene finders

such as AUGUSTUS can integrate evidence from long-read alignments to further improve

their performance.

While RNA-seq has greatly improved the performance of many eukaryotic gene finders,

there is still a long way to go. According to the RGASP assessment (Steijger et al.

2013), the best-performing methods identified ∼59% of protein-coding transcripts from

the Caenorhabditis elegans genome (AUGUSTUS, mGene, and Transomics), 43% from the

Drosophila melanogaster genome (AUGUSTUS), and just 21% from the Homo sapiens genome

(Trembly). So, RNA-seq data have not (yet) been the key to “solving” the problem of accurate

and automatic eukaryotic genome annotation. Important issues still remain, including the

fact that a significant fraction of genes or splice forms may not be expressed in any RNA-seq

sample, that transcribed sequences may not be protein coding and, if they are, the correct

protein-coding ORFs remain to be identified, and that transcript assemblies and mapping

of the transcripts to the genome are notoriously error prone. These errors typically are seen

around exon boundaries, with the assemblies often extending into introns and, at times,

missing whole exons. Several programs have been developed to help address these mapping

problems, including Exonerate (Slater and Birney 2005) and GeneWise (Birney et al. 2004).

Both programs are “splice-aware” tools that can be used to polish BLAST alignments. These

polished alignments can then be used to improve the annotation of the exons, introns, splice

sites, and 5′ and 3′ UTRs.

Source: Ch5 Genome Annotation / Gene Annotation and Evidence Generation Using Protein Sequence Databases

PDF Pages: 154-155 | Print Pages: 134-135

Boundary: starts at protein sequence databases subsection heading on PDF page 154; includes the page 155 tail despite the running header; stops before the true comparative gene prediction heading.

---

Gene Annotation and Evidence Generation Using Protein Sequence Databases

Just as RNA-seq data can be used as evidence for the existence of genes, so too can sequence

homology be used to locate or identify new genes in newly sequenced organisms. In

homology-based gene finding, the DNA sequence of the newly sequenced organism is translated

into putative protein sequences and these putative sequences are then compared against

databases of known proteins. Homologous matches at the protein level can then be used to

annotate, identify, and locate the genes at the DNA level. A key advantage of homology-based

gene finding over ab initio gene prediction is that homology-based methods provide not only

the identification and location (as ab initio approaches do) but also the probable gene name

and probable gene function as inferred by the sequence similarity of the newly identified gene

to previously annotated proteins in the protein sequence databases.
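
The translation step that precedes a protein database search can be sketched as a six-frame translation. The Python code below assumes the standard genetic code and a DNA string containing only A, C, G, and T; the function names and the example sequence are illustrative, and the database comparison itself is not shown.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for codon positions 1, 2, and 3
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate from the first base; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return putative protein sequences for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

frames = six_frame_translation("ATGGCCTAAGGA")
for frame, protein in frames.items():
    print(frame, protein)
```

Each of the six putative translations would then be compared against a protein database; stop codons appear as `*` and naturally break the translations into candidate ORFs.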

Translated nucleotide searches such as BLASTX searches (Gish and States 1993) constitute

one of the simplest homology-based gene prediction approaches. These searches are particularly

useful when comparing ORFs in prokaryotic genomes. However, when dealing with

the split nature of eukaryotic genes, BLASTX-like searches do not resolve exon splice boundaries

particularly well. One useful approach is to combine the results of translated nucleotide

searches with those produced by ab initio methods. Examples of this

hybrid approach include programs such as GenomeScan (Yeh et al. 2001), GeneID (Blanco

et al. 2002), and AUGUSTUS (Hoff and Stanke 2013). GenomeScan is an extension of GENSCAN

that incorporates sequence similarity to known proteins using BLASTX.

A more sophisticated approach to eukaryotic gene prediction via sequence homology

involves aligning the genomic query against a protein target that is presumed to be homologous

to the protein encoded in the genomic sequence that is being annotated. In these

alignments, often referred to as spliced alignments, large gaps corresponding to introns in the

query sequence are only allowed at “legal” splice junctions. Examples of programs using this

approach include PROCRUSTES (Gelfand et al. 1996), GeneWise (Birney and Durbin 1997),

Exonerate (Slater and Birney 2005), BLAT (Kent 2002), and GenomeThreader (Gremme et al.

2005).

The spliced alignment approach does not exploit all the information typically available for

homology-based gene prediction. In fact, for any given protein, a whole family of related proteins

is often available. Such an ensemble of sequences carries more information than just a

single protein. For instance, a well-constructed multiple sequence alignment (MSA) shows which regions are well conserved

and which ones are prone to insertions or deletions. Using an MSA, it is possible to calculate

the probability that a certain amino acid occurs at a certain site. Using these data, it is possible

to calculate position weight matrices (PWMs) or position-specific scoring matrices (PSSMs) and to create what is called an MSA profile. While the task of

creating MSAs for prokaryotic genomes is relatively easy, the task of creating MSAs for eukaryotic

genomes is particularly challenging owing to the presence of repeats, as well as large-scale

genome rearrangements, duplications, and deletions.
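
How per-site amino acid probabilities and a simple PSSM follow from an MSA can be shown with a short Python sketch. The toy alignment, the pseudocount of 1, and the uniform background frequency of 1/20 are simplifying assumptions made for illustration; production profile tools use sequence weighting and empirically derived background frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probabilities(msa, pseudocount=1.0):
    """Per-column amino acid probabilities from equal-length aligned sequences.

    Gap characters ('-') are ignored; the pseudocount avoids zero probabilities.
    """
    n_cols = len(msa[0])
    profile = []
    for col in range(n_cols):
        counts = Counter(seq[col] for seq in msa if seq[col] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile

def pssm(profile, background=1 / 20):
    """Log-odds scores (base 2) against a uniform background (a simplification)."""
    return [{aa: math.log2(p / background) for aa, p in col.items()}
            for col in profile]

msa = ["MKVLA", "MKILA", "MRVLS", "MK-LA"]   # toy alignment, not real proteins
profile = column_probabilities(msa)
scores = pssm(profile)
print(round(profile[0]["M"], 3), round(scores[0]["M"], 2), round(scores[0]["W"], 2))
```

Residues seen in a column receive positive scores and unseen residues receive negative ones, so a candidate translation can be scored position by position against the profile.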

So, rather than trying to find genes or exons with a set of individual protein sequences,

one may use MSAs of aligned protein families to do the job. These MSAs can be found in

orthology databases such as OrthoDB (Waterhouse et al. 2013). Several excellent software tools

have been developed to search the gene structures of members of a protein family given an

MSA profile representation of that family. These include GeneWise (Birney and Durbin 1997)

and AUGUSTUS-PPX (Keller et al. 2011), where PPX stands for Protein Profile eXtension.

AUGUSTUS-PPX has been shown to improve gene prediction accuracy over spliced alignment

methods, especially when dealing with genes having large numbers of exons. However,

the MSA approach is limited by the availability of homologous families and by the degree of

sequence similarity. Therefore, MSA gene finding is best used in situations involving medium

to high sequence similarity.

More recently, this MSA concept has been extended to cover situations with more remote

sequence similarity through the development of BUSCO (Simão et al. 2015). BUSCO stands

for Benchmarking Universal Single-Copy Orthologs. These single-copy orthologs correspond

to a relatively small set of proteins that are highly conserved and that are found as single-copy

genes across many different phyla in the tree of life. The BUSCO dataset currently includes

3023 genes for vertebrates, 2675 for arthropods, 843 for metazoans, 1438 for fungi, 429 for

eukaryotes, and 40 universal marker genes for prokaryotes. Using HMMER (Eddy 2009), the

BUSCO gene set can be rapidly searched against any given query genome. The presence or

absence of these BUSCO genes in a given organism provides a good measure of the completeness

of a genome assembly. It also provides a good measure of the completeness of a given

genome annotation or a given genome prediction.
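
The completeness idea can be expressed as simple bookkeeping over a set of expected marker genes. The sketch below is not BUSCO itself and does not reproduce its search or scoring procedure: the marker IDs are made up, the hit counts are supplied by hand rather than produced by an HMMER search, and the three categories are a simplification chosen for the example.

```python
def completeness(expected_markers, hits_per_marker):
    """Summarise marker-gene recovery as fractions of the expected set.

    hits_per_marker maps a marker ID to the number of matches found in the
    query genome; markers that are absent may simply be missing from the mapping.
    """
    n = len(expected_markers)
    single = sum(1 for m in expected_markers if hits_per_marker.get(m, 0) == 1)
    duplicated = sum(1 for m in expected_markers if hits_per_marker.get(m, 0) > 1)
    missing = n - single - duplicated
    return {"single_copy": single / n, "duplicated": duplicated / n, "missing": missing / n}

expected = [f"marker_{i:03d}" for i in range(1, 41)]   # 40 hypothetical marker IDs
hits = {m: 1 for m in expected[:34]}                   # 34 markers found once
hits.update({m: 2 for m in expected[34:37]})           # 3 markers found twice
summary = completeness(expected, hits)
print({k: f"{v:.1%}" for k, v in summary.items()})
```

A high fraction of missing markers points to an incomplete assembly or annotation, whereas many duplicated markers may indicate redundancy in the assembly.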

Source: Ch5 Genome Annotation / Gene Annotation and Evidence Generation using Comparative Gene Prediction

PDF Pages: 155-156 | Print Pages: 135-136

Boundary: starts at true comparative gene prediction subsection heading on PDF page 155; stops before the next non-protein-coding/foreign genes section on PDF page 156.

---

Gene Annotation and Evidence Generation Using Comparative Gene Prediction

Another approach to homology-based gene prediction exploits the fact that there is a large and

growing number of completely sequenced and well-annotated genomes now available. This

has given rise to a technique called comparative gene prediction. The rationale behind comparative

gene prediction is that functional regions (the protein-coding regions) tend to be more

conserved than non-protein-coding regions. This observation provides the basis for identifying

protein-coding regions in newly sequenced genomes. Comparative gene prediction methods

exploit sequence homology but at a far more global scale than the protein sequence similarity

methods described above. In comparative gene prediction, the “known” and “unknown”

genomes are from different species, but the species are assumed to be so closely related that

their entire genomes can be aligned. Because these genomes are so long (millions to billions

of bases), the pairwise alignment or MSA is typically broken down into many local alignments

of syntenic (homologous) regions.
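
The rationale that coding regions are more conserved than non-coding ones can be illustrated by scanning a pairwise alignment of syntenic sequence in sliding windows. The Python sketch below is a deliberately crude stand-in for real comparative gene finders, which use probabilistic models rather than a fixed identity cut-off; the sequences, window size, and 80% threshold are all hypothetical.

```python
def window_identity(aln_a, aln_b, window=10, step=5):
    """Percent identity in sliding windows over a pairwise alignment.

    aln_a and aln_b are equal-length aligned strings; '-' marks a gap.
    A gapped column never counts as identical.
    """
    assert len(aln_a) == len(aln_b)
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        cols = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for a, b in cols if a == b and a != "-")
        results.append((start, start + window, 100.0 * matches / window))
    return results

def conserved_windows(aln_a, aln_b, min_identity=80.0, **kwargs):
    """Windows whose identity meets the threshold, as (start, end) pairs."""
    return [(s, e) for s, e, pid in window_identity(aln_a, aln_b, **kwargs)
            if pid >= min_identity]

# Toy alignment: first 20 columns conserved, last 20 diverged (made-up sequences)
human = "ATGGCCAAGCTGACCGAGGA" + "TTTTACGCATAGGCTTAACG"
mouse = "ATGGCTAAGCTGACCGAGGA" + "CAGT--GTCAATCGAAGTTC"
print(conserved_windows(human, mouse))
```

Only windows within the first 20 columns pass the threshold here; in a comparative gene finder such conserved stretches would be treated as evidence for candidate coding regions rather than as final predictions.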

Early methods for comparative gene finding, such as DOUBLESCAN (Meyer and Durbin 2002),

TWINSCAN (Korf et al. 2001), SLAM (Alexandersson et al. 2003), and SGP-2 (Parra et al. 2003),

typically used just two genomic sequences as input. SLAM is an HMM-based method

in which gene predictions and sequence alignments are performed simultaneously. TWINSCAN

and DOUBLESCAN are extensions of GENSCAN, whereas SGP-2 is an extension of

GeneID. Later on, comparative gene-finding methods were developed that could use more

than two genomic sequences to predict genes in a new genome, but they only did so for a single

target genome. These methods include programs such as N-SCAN (Gross and Brent 2006),

CONTRAST (Gross et al. 2007), and Mugsy-Annotator (Angiuoli et al. 2011). More recently, a

method called clade annotation has been developed and implemented in a version of AUGUSTUS

known as “comparative AUGUSTUS” (König et al. 2016). Clade annotation allows the

simultaneous alignment and annotation of multiple target genomes. For example, comparative

AUGUSTUS can be used to simultaneously annotate the genomes of multiple (up to 20)

different mouse strains.

Source: Ch5 Genome Annotation / Evidence Generation for Non-Protein-Coding, Non-Coding, or Foreign Genes

PDF Pages: 156 | Print Page: 136

Boundary: starts at non-protein-coding/non-coding/foreign genes section heading; stops before tRNA and rRNA Gene Finding subsection.

---

Evidence Generation for Non-Protein-Coding, Non-Coding, or Foreign Genes

One of the best ways of determining the location of a protein-coding gene is to determine where

it is not. In other words, knowing that a DNA segment cannot possibly code for a protein allows

one to exclude it from the gene/protein-finding process. Prokaryotic genomes contain many

genes that do not code for proteins. These include tRNA and rRNA genes, along with many

foreign prophage genes that may or may not code for real phage proteins. Likewise, eukaryotic

genomes are filled with repeat regions, pseudogenes, retrotransposons, and retroviral genes,

along with an assortment of tRNA and rRNA genes. These non-protein-coding or non-coding

elements can account for 20–30% of a given prokaryotic genome (Casjens 2003) and more than

90% of a eukaryotic genome (Li et al. 2004).

Ch5 Genome Annotation — tRNA and rRNA Gene Finding

PDF page 156 (lower part) to page 157; print pages 136-137

Both prokaryotes and eukaryotes have a significant portion of their genome occupied by tRNA and rRNA genes. Prokaryotes (including bacteria and archaea) typically contain 70–80 tRNA gene copies and between 3 and 45 rRNA gene copies. tRNA molecules are L-shaped adaptor RNA molecules, typically 76–90 nucleotides in length, that are essential for translation (Figure 5.7). In principle, a total of 61 tRNA genes are needed to permit the translation of all 61 coding (sense) codons. However, because of a phenomenon known as base wobble, a single tRNA can serve two or more codons in many organisms. As a result, most prokaryotes have 35–40 unique tRNA genes, with one or two copies of each.
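The figure of 61 sense codons follows directly from the genetic code: 64 possible triplets minus the stop codons. A minimal sketch, assuming the standard genetic code with stop codons TAA, TAG, and TGA (alternative codon tables are not considered, and the variable names are illustrative only):

```python
from itertools import product

# Stop codons of the standard genetic code (assumption: no alternative tables).
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Enumerate all 4^3 possible DNA triplets.
all_codons = ["".join(triplet) for triplet in product("ACGT", repeat=3)]

# Sense codons are the triplets that specify an amino acid.
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons), len(sense_codons))  # 64 61
```

Because of wobble pairing, the 35–40 unique tRNA genes mentioned above are enough to read all 61 of these codons.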

rRNA molecules are the principal constituents (>60% by mass) of the ribosome, the translational engine of all cells. In prokaryotes the ribosome consists of a small and a large subunit, which pair up to form the complete ribosome. The rRNAs present in prokaryotes are the 5S and 23S rRNAs in the large subunit and the 16S rRNA in the small subunit. Genes encoding these rRNAs are typically arranged into an operon (the rrn operon), with an internal transcribed spacer between the 16S and 23S rRNA genes. The number of rrn operons in prokaryotes ranges from 1 to 15 per genome.

tRNA and rRNA genes in eukaryotes share many similarities (in both structure and size) with those in prokaryotes. There are, however, some minor differences. For instance, eukaryotes typically have many more copies of tRNA genes than prokaryotes: there are 275 tRNA genes in Saccharomyces cerevisiae, 620 in C. elegans, and 497 in humans. All eukaryotes have 22 mitochondrial tRNA genes. As in prokaryotes, eukaryotic rRNA genes are divided according to whether their products reside in the large or the small subunit of the ribosome. However, instead of just two large subunit rRNAs, there are three rRNAs in the eukaryotic large subunit: 5S, 5.8S, and 28S. Just as with prokaryotes, eukaryotes have one rRNA gene (18S rRNA) for their small ribosomal subunit, but they also encode rRNA genes for their mitochondrial ribosomes (12S and 16S rRNA genes). Unlike prokaryotes, eukaryotes generally have many copies of the rRNA genes organized into tandem repeats. In humans, approximately 300–400 rRNA repeats are present in five clusters, located on five separate chromosomes. Also unlike prokaryotic tRNA genes, some eukaryotic tRNA genes are interrupted by introns.

The structure of tRNAs is highly conserved across all major kingdoms of life and there are a large number of tRNA sequences that are known for both prokaryotes and eukaryotes. As a result, most methods for identifying tRNA genes take advantage of common sequence motifs (recognizable via HMMs) and employ some kind of sequence homology or database comparison to identify tRNA genes. The best performing and most popular methods are RNAmmer (Lagesen et al. 2007), tRNAfinder (Kinouchi and Kogawa 2006), and tRNAscan-SE (Lowe and Eddy 1997). These programs are able to identify tRNA genes in both prokaryotes and eukaryotes with very high accuracy (>95%). In addition to these programs, there are several dedicated databases of tRNA sequences to assist with comparative tRNA identification approaches; these include tRNAdb and tRNA-DB-CE (Jühling et al. 2009; Abe et al. 2014). Currently the identification of tRNA genes is considered to be a "solved" problem.

Just like tRNA genes, rRNA genes also exhibit a very high level of sequence conservation, and there are a number of rRNA motifs that can be described by HMMs. These HMMs have been integrated into the program (and web server) called RNAmmer (Lagesen et al. 2007). RNAmmer is able to identify all rRNAs from prokaryotes and eukaryotes, with the exception of 5.8S rRNA. Owing to the complexity, size, and relatively poor annotation of rRNA genes, the performance for rRNA prediction is not yet at the same level as tRNA prediction. In addition to the RNAmmer predictor, there is also an RNA database called Rfam (Kalvari et al. 2018) that contains >2600 RNA families (including rRNA and tRNA sequence families). Each sequence family in Rfam is represented by an MSA, a consensus secondary structure, and a covariance model. Rfam can be used for rRNA (and other RNA gene) identification via sequence comparisons or MSAs. Regardless of their current shortcomings, the use of tRNA and rRNA gene identification tools invariably improves the accuracy of any protein-coding gene-finding or gene-prediction effort. It also enhances the quality of the overall genome annotation.

Prophage Finding in Prokaryotes

Prokaryotes are subject to constant attack by bacterial viruses called bacteriophages, which kill or disable susceptible bacteria. Bacteriophages are the most abundant biological entities on the planet and they play a major role in the bacterial ecosystem and in driving microbial genetic variation or genetic diversity. This genetic diversity is brought on through a distinctive part of the bacteriophage lifecycle called lysogeny. Lysogeny involves the integration of the phage genome (often consisting of 10–20 genes) into the host bacterial chromosome at well-defined insertion points. The genetically integrated phages are called prophages. In some cases, prophages can become permanently embedded into the bacterial genome, becoming cryptic prophages (Little 2005). These cryptic prophages often serve as genetic “fodder” for future evolutionary changes of the host microbe (Bobay et al. 2014). Furthermore, prophages and cryptic prophages tend to introduce pathogenic elements or pathogenicity islands that exhibit a very different base composition from that of the host genome. Prophages and cryptic prophages can account for up to 20% of the genetic material in some bacterial genomes (Casjens 2003), with some prophage genes coding for expressed proteins and others not.

Given their high abundance, the identification of these phage-specific genetic elements can be quite important, especially when it comes to annotating bacterial genomes. Prophage and cryptic prophage sequences exhibit certain sequence features (such as the presence of integrases and transposases, attachment sites, and an altered base composition) that can be used to distinguish them from “normal” bacterial genes. When these features are combined with HMMs to improve sequence feature recognition, it is possible to identify prophage and cryptic prophage sequences with relatively good accuracy. The accuracy can be improved further if comparative genome analyses against databases of known phage sequences are performed. Several bacterial prophage-finding programs have been developed and deployed over the past decade, including Phage_Finder (Fouts 2006) and ProphageFinder (Bose and Barber 2006). More recently, phage finding has moved from stand-alone programs to web servers. In particular, two new web servers have been released that provide somewhat greater speed and improved accuracy for prophage finding over existing tools. These are known as PHAST (Zhou et al. 2011) and PHASTER (Arndt et al. 2016). Both web servers are between 85% and 95% accurate (depending on the test being conducted) and both provide rich graphical output, as well as detailed annotations of the prophage sequences and the surrounding bacterial genomic sequences (Figure 5.8).
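One of the features mentioned above, altered base composition, lends itself to a simple illustration. The sketch below is not how Phage_Finder, PHAST, or PHASTER work internally; it is a hypothetical sliding-window scan (the function names, window size, step, and deviation threshold are all assumptions) that flags windows whose GC content departs from the genome-wide average.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_gc_windows(genome, window=5000, step=1000, max_deviation=0.05):
    """Return (start, end, gc) for every window whose GC fraction differs
    from the genome-wide GC fraction by more than max_deviation."""
    genome_gc = gc_fraction(genome)
    flagged = []
    last_start = max(len(genome) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = genome[start:start + window]
        gc = gc_fraction(chunk)
        if abs(gc - genome_gc) > max_deviation:
            flagged.append((start, start + len(chunk), gc))
    return flagged


# Toy genome: a 50% GC "host" with a 200 bp AT-only insert in the middle.
host = "ACGT" * 500
toy_genome = host + "AT" * 100 + host
print(atypical_gc_windows(toy_genome, window=200, step=200))
# [(2000, 2200, 0.0)]
```

A compositional scan like this can only nominate candidate regions; the integrase, attachment-site, and database evidence described above would still be needed to call a prophage.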

Regardless of the method chosen, the annotation of prophage and cryptic prophage genes certainly enhances the quality of a prokaryotic genome annotation and it usually improves the accuracy of any prokaryotic gene predictions.

Repetitive Sequence Finding/Masking in Eukaryotes

Unlike prokaryotic genomes, eukaryotic genomes contain considerable quantities of repetitive DNA. These repetitive sequences include retrotransposons and DNA transposons, both of which are referred to as dispersed repeats, as well as highly repetitive sequences, typically called tandem repeats. The most abundant repeat sequences in eukaryotes are retrotransposons. Retrotransposons are genetic elements that can amplify themselves using a “copy-and-paste” mechanism similar to that used by retroviruses. To replicate and amplify, they are transcribed into RNA, converted back into identical DNA sequences by reverse transcription, and then inserted into the genome at specific target sites. In contrast to retrotransposons, DNA transposons do their copying and pasting without an RNA intermediate, instead using a protein called transposase. Approximately 52% of the human genome is made up of retrotransposons, while DNA transposons account for another 3% (Lander et al. 2001; Wheeler et al. 2013). In plants, retrotransposons are much more abundant, accounting for between 60% and 90% of the DNA in any given plant genome (Li et al. 2004).

Within the retrotransposon family are two subfamilies: long terminal repeat retrotransposons (LTR retrotransposons) and non-LTR retrotransposons. LTR retrotransposons are retrovirus-like sequences that contain LTRs that range from ∼100 bp to over 5 kb in length. In fact, a retrovirus can be transformed into an LTR retrotransposon simply through inactivation or deletion of certain genes (such as the envelope protein) that enable cell-to-cell viral transmission. Most LTR retrotransposons are non-functional endogenous retroviruses that are also called proviruses. In this regard, eukaryotic LTR retrotransposons can be thought of as the equivalent of prokaryotic prophages or cryptic prophages. Human endogenous retroviral sequences, all of which appear to be defective or non-replicative, account for about 8% of the human genome (Taruscio and Mantovani 2004).

Non-LTR retrotransposons consist of two subtypes: long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs). LINEs are typically 7000 bp long and encode several genes to cover all the functions needed for retrotransposition. These include reverse transcriptase and endonuclease genes, as well as several genes needed to form a ribonucleoprotein particle. More than 850000 copies of LINEs exist in the human genome, covering 21% of all human DNA (Cordaux and Batzer 2009). However, more than 99% of LINEs are genetically “dead,” having lost their retrotransposition functions. In contrast to LINEs, SINEs are much smaller, typically consisting of DNA stretches spanning just 80–500 bp. SINEs are very abundant (in the millions of copies), accounting for about 10% of the DNA in the human genome. The most common SINEs in humans are the Alu repeats (Häsler and Strub 2006). Alu repeats are about 300 bp long. They are highly conserved in primates and are subject to frequent DNA methylation events.

In addition to transposable elements (or dispersed repeats), eukaryotes also contain large numbers of tandem repeats, including minisatellite DNA, microsatellite DNA (also known as short tandem repeats [STRs] or simple sequence repeats [SSRs]), and telomere repeats. Minisatellite DNA consists of repeats of 10–60 bp in length that stretch for about 2 kb and are scattered throughout the genome. Microsatellite DNA consists of repeats of 1–6 bp, extending for hundreds of kilobases, particularly around the centromeres. Telomere repeats consist of a highly conserved 6 bp sequence (TTAGGG) that is repeated 250–1000 times and found exclusively at the ends of eukaryotic chromosomes. Mini- and microsatellite DNA account for about 5% of the DNA in the human genome (Subramanian et al. 2003).
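A minimal sketch of how perfect short tandem repeats could be located with a regular expression. The unit length range (1–6 bp) and the telomere motif come from the text; the minimum of four copies, the function name, and the toy sequence are assumptions, and the pattern ignores the imperfect (interrupted or mutated) repeats that dedicated tools handle.

```python
import re

# A unit of 1-6 bp (shortest unit preferred, via the lazy quantifier),
# followed by at least three more exact copies: four or more copies in total.
STR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{3,}")


def find_short_tandem_repeats(seq):
    """Return (start, end, unit, copies) for each perfect tandem repeat found."""
    hits = []
    for match in STR_PATTERN.finditer(seq.upper()):
        unit = match.group(1)
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


# Toy sequence: unique flanks around five copies of the telomere motif TTAGGG.
toy = "GATTACA" + "TTAGGG" * 5 + "CCGTA"
print(find_short_tandem_repeats(toy))  # [(7, 37, 'TTAGGG', 5)]
```

The reported coordinates are 0-based and half-open, so they can be passed directly to a masking step.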

The fact that eukaryotes have so many repeat sequences, combined with the fact that these repeats account for such a large portion of their genomes (often >50%), has led to concerted efforts by genome annotators to identify, remove, or mask these sequences. This is because repeat sequences can seriously hinder gene identification and genome annotation activities. For instance, retrotransposons and DNA transposons can easily be mistaken for exons by ab initio gene predictors. Likewise, STRs can lead to spurious alignments when comparative genomics approaches are used in gene finding. STRs (which are also called low-complexity regions) can often be dealt with through two techniques: soft or hard masking. Soft masking is done by changing the case of the letters in the sequence file from uppercase to lowercase, while hard masking changes the offending sequence to Ns, thereby removing it completely from consideration. Soft masking prevents the masked region from seeding alignments but preserves sequence identity so that off-target alignments are minimized. Soft masking is routinely done by the programs SEG and DUST (Wootton and Federhen 1993), which are found in most versions of the BLAST sequence alignment suite.
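The difference between the two masking styles can be shown in a few lines. This is a sketch only: the function name `mask_sequence`, the 0-based half-open interval convention, and the toy sequence are assumptions, and it does not reproduce how SEG or DUST decide which regions to mask.

```python
def mask_sequence(seq, intervals, mode="soft"):
    """Mask 0-based, half-open (start, end) intervals of seq.

    mode="soft" lowercases the masked bases (sequence identity is preserved);
    mode="hard" replaces them with N (sequence identity is lost).
    """
    if mode not in ("soft", "hard"):
        raise ValueError("mode must be 'soft' or 'hard'")
    bases = list(seq.upper())
    for start, end in intervals:
        # Clip each interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower() if mode == "soft" else "N"
    return "".join(bases)


toy = "ACGTACACACACGTTA"      # positions 4-11 hold a CA-type repeat
repeat_region = [(4, 12)]

print(mask_sequence(toy, repeat_region, mode="soft"))  # ACGTacacacacGTTA
print(mask_sequence(toy, repeat_region, mode="hard"))  # ACGTNNNNNNNNGTTA
```

The soft-masked string can still be converted back to the original by uppercasing it, whereas the hard-masked string cannot, which is the practical reason aligners can extend through soft-masked regions but not hard-masked ones.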

While tandem repeats are relatively easy to deal with, repeat transposable elements (such as retrotransposons) are much harder to handle. This is because these sequences are far larger and far more complex. The Repbase database (Jurka et al. 2005) contains a comprehensive collection of repeat and transposable elements from a wide range of species. This resource is frequently used to identify repeat elements via comparative sequence analysis. However, if the transposon sequences are highly divergent from those found in Repbase, then other methods or other databases may need to be used. Dfam (Wheeler et al. 2013) is an example of a more advanced repetitive element database. In Dfam, the original Repbase sequences have been converted to HMMs. The use of these HMMs has allowed many more transposable elements to be identified (up to 54.5% of the human genome vs. 44%) with much improved accuracy (Wheeler et al. 2013).

In addition to Dfam (which is available as both a server and a downloadable resource), there are several stand-alone programs and web servers that have been developed to specifically identify retrotransposons, including RECON (Bao and Eddy 2002), RepeatScout (Price et al. 2005), RetroPred (Naik et al. 2008), LTR_FINDER (Xu and Wang 2007), LTRharvest (Ellinghaus et al. 2008), and MITE-Hunter (Han and Wessler 2010). These programs identify and label either LTR or non-LTR retrotransposons. While this information may be useful to some, many genome annotators are simply interested in removing retrotransposons from consideration. In this regard, RepeatMasker (Tarailo-Graovac and Chen 2009) has become the tool of choice as it simply hard masks (i.e. removes) all detectable retrotransposon sequences from the genome of interest.

As a general rule, hard masking of transposable elements is often the first step performed in a eukaryotic genome annotation. Hard masking with tools such as Dfam or programs such as RepeatMasker not only removes “uninteresting” genetic data, it also accelerates the gene identification process and improves annotation accuracy. Because coding exons tend not to overlap or to contain repetitive elements, ab initio gene prediction programs tend to predict fewer false-positive exons when using hard-masked sequences. For instance, when human chromosome 22 was analyzed using different ab initio gene predictors, masking led to a significant reduction in false-positive gene predictions (Parra et al. 2003). In particular, GENSCAN initially predicted 1128 protein-coding genes without using sequence masking, but when sequence masking was used the number of predicted genes dropped to 789. When GeneID was used, the number fell from 1119 to 730. The actual number of protein-coding genes in chromosome 22, according to the latest GENCODE annotation, is 489.
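The chromosome 22 figures quoted above can be put on a common scale with a little arithmetic. The sketch below uses only the numbers in the text; the function and variable names are illustrative, and the ratio to the GENCODE count is a rough indicator rather than a true false-positive rate, since a predicted gene can differ from an annotated one in ways a simple count does not capture.

```python
# Predicted protein-coding gene counts quoted in the text:
# (without masking, with masking).
PREDICTIONS = {"GENSCAN": (1128, 789), "GeneID": (1119, 730)}
GENCODE_GENES = 489  # annotated protein-coding genes quoted in the text


def masking_summary(unmasked, masked, reference):
    """Return (percent reduction after masking,
    masked predictions per annotated reference gene)."""
    reduction_pct = 100 * (unmasked - masked) / unmasked
    ratio_to_reference = masked / reference
    return round(reduction_pct, 1), round(ratio_to_reference, 2)


for tool, (unmasked, masked) in PREDICTIONS.items():
    print(tool, masking_summary(unmasked, masked, GENCODE_GENES))
# GENSCAN (30.1, 1.61)
# GeneID (34.8, 1.49)
```

On these numbers, masking removes roughly 30–35% of the predictions, yet both programs still predict about 1.5 to 1.6 times as many genes as GENCODE annotates.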

Finding and Removing Pseudogenes in Eukaryotes

A particular challenge with eukaryotic genome annotation is distinguishing predictions that identify “real” genes from those that correspond to non-functional pseudogenes. Database searches may not provide a clearer picture, as many pseudogenes are similar to functional, paralogous genes. The absence of an RNA transcript in an RNA-seq experiment cannot be used as a criterion either, because variations in tissue expression or developmental stage mean that transcripts are not always detected for actual genes. In general, intronless gene predictions for which multi-exon paralogous genes exist in the same genome are suspicious, as they may indicate sequences that have arisen through retrotransposition.

Multi-exon predictions, however, can also correspond to pseudogenes arising through a recent gene duplication event. If homologs in another organism exist, one solution is to compute the ratio of non-synonymous to synonymous substitution rates (Ka/Ks; Fay and Wu 2003). Ka/Ks values approaching 1 are indicative of neutral evolution, suggesting a pseudogene. Support for multi-exon gene predictions can come from assessing the conservation of the overall gene structure in close homologs. For instance, the prediction or identification of homologous genes in two modestly related organisms (e.g. mouse and human) most likely indicates that the gene is real and is not a pseudogene (Guigó et al. 2003).
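The final step of a Ka/Ks calculation can be sketched as follows. This is a simplified illustration, not the procedure of any particular program: it assumes that the numbers of synonymous and non-synonymous sites and differences have already been counted from a codon alignment (the hard part, omitted here), and it applies a Jukes-Cantor correction to each proportion, as is done in Nei-Gojobori-style estimates. The function names and the example counts are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (Ka, Ks, Ka/Ks) from pre-computed site and difference counts."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    if ks == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka, ks, ka / ks


# Hypothetical counts for two gene predictions compared with their homologs.
constrained = ka_ks(nonsyn_diffs=5, nonsyn_sites=700, syn_diffs=30, syn_sites=300)
unconstrained = ka_ks(nonsyn_diffs=70, nonsyn_sites=700, syn_diffs=30, syn_sites=300)

print(round(constrained[2], 2))    # well below 1: consistent with purifying selection
print(round(unconstrained[2], 2))  # 1.0: consistent with neutral evolution
```

In the second case both classes of site have diverged at the same rate, which is the pattern the text associates with a pseudogene; how close to 1 a ratio must be before a prediction is rejected is a judgment call that this sketch does not settle.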

Glossary (7 terms)

English	Chinese
genomic evidence	基因组证据
transcriptional data	转录数据
pseudogene	假基因
transcription factor binding site	转录因子结合位点
retrovirus	逆转录病毒
prophage	前噬菌体
extrinsic gene finding	外源性基因查找

Genome Annotation Pipelines

In the early days of genome annotation, when it would often take years just to sequence a single organism, teams of researchers and bioinformaticians would gather and work together for many months, or even years, to assemble the genome, perform the initial ab initio gene predictions, manually collate the experimental or literature-derived evidence, conduct comparative sequence analysis, and then synthesize the data into a consensus genome annotation. This was routinely done for both bacterial and eukaryotic genomes (Lander et al. 2001; Winsor et al. 2005; Riley et al. 2006). Indeed, it is still being done for the GENCODE project, which has been preparing and updating the reference human genome annotation since 2003 (Harrow et al. 2012). However, these efforts have required (and continue to require) enormous resources and time. With the appearance of very high-throughput NGS platforms and the ability to routinely sequence an entire genome in a few days, these manual approaches to genome annotation have become unsustainable. Now, most genome annotations are done through automated pipelines that help users to synthesize multiple pieces of evidence and data to generate a consensus genome annotation.

The choice of a pipeline tool depends on the type of organism (eukaryote vs. prokaryote), the computational resources available, the available evidence (RNA-seq or no RNA-seq data), and the similarity of the organism to previously annotated organisms. For instance, if one is annotating a genome with a closely related, previously annotated species, a simple comparative analysis or sequence projection should be sufficient. If the organism of interest has no closely related annotated species, a pipeline that uses RNA-seq or experimentally acquired protein sequence data will generate more accurate annotations. The most advanced genome annotation pipelines require many programs and perform complex analyses that need supercomputer-class resources such as large multi-core machines or massive computing clusters (maintained locally or available via the Cloud). For example, annotating the loblolly pine genome (which contains 22 billion bases, seven times more than the human genome) required 8640 central processing units (CPUs) running for 14.6 hours (Wegrzyn et al. 2014). In the following sections we will briefly describe some commonly used annotation pipelines for prokaryotes and eukaryotes.
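
To put the loblolly pine figure in perspective, the quoted CPU count and run time can be converted into total CPU-hours. The short Python sketch below does only that arithmetic. The function name `cpu_hours` and the quad-core desktop comparison (which assumes perfect scaling) are illustrative assumptions, not figures from Wegrzyn et al. (2014).

```python
def cpu_hours(n_cpus: int, wall_hours: float) -> float:
    """Total CPU-hours consumed by a job that keeps n_cpus busy for wall_hours."""
    return n_cpus * wall_hours


# Figures quoted in the text for the loblolly pine annotation.
pine = cpu_hours(8640, 14.6)

# Hypothetical comparison: the same work on a quad-core desktop,
# assuming (unrealistically) perfect scaling and no memory limits.
desktop_days = pine / 4 / 24

print(f"{pine:,.0f} CPU-hours, roughly {desktop_days:,.0f} days on 4 cores")
```

Under these assumptions the job amounts to well over 100,000 CPU-hours, which is several years of continuous running on a single desktop machine.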

Ch5 Genome Annotation — Prokaryotic Genome Annotation Pipelines

PDF page 162 中部 – page 163 图注；印刷页码 142-143

Annotation pipelines for prokaryotes typically do not require the same computational resources as for eukaryotes. Indeed, most bacterial genomes can be annotated in less than 30 minutes, whether on a web server or on a desktop computer. However, the recent shift toward metagenomics or community bacterial genomics is beginning to lead to significantly greater computational demands that will be discussed in more detail in Chapter 16. Some of the more popular publicly available prokaryotic genome annotation pipelines include Prokka (Seemann 2014), Rapid Annotation using Subsystem Technology (RAST; Overbeek et al. 2014), and the Bacterial Annotation System (BASys; Van Domselaar et al. 2005). Prokka is an open-source Perl program that runs with a command-line interface (on UNIX). Prokka can be used to annotate pre-assembled bacterial, archaeal, and viral sequences. With Prokka, a typical 4 million base pair bacterial genome can be fully annotated in less than 10 minutes on a quad-core computer. Prokka is also capable of producing standards-compliant output files for further analysis or viewing. Prokka’s appeal lies in its speed and ability to perform “private” annotations on local computers.
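
A local Prokka run of the kind described above might be scripted as follows. This is a minimal sketch: `--outdir`, `--prefix`, `--kingdom`, and `--cpus` are Prokka command-line options, but the helper name `build_prokka_command`, the input file `contigs.fasta`, the output directory, and the sample prefix are all hypothetical placeholders. The run itself is skipped unless Prokka is installed and the input file exists.

```python
import shutil
import subprocess
from pathlib import Path


def build_prokka_command(contigs: Path, outdir: Path, prefix: str,
                         cpus: int = 4, kingdom: str = "Bacteria") -> list[str]:
    """Assemble a Prokka command line as an argument list (nothing is run here)."""
    return [
        "prokka",
        "--outdir", str(outdir),
        "--prefix", prefix,
        "--kingdom", kingdom,
        "--cpus", str(cpus),
        str(contigs),
    ]


cmd = build_prokka_command(Path("contigs.fasta"), Path("annotation"), "sample1")
print(" ".join(cmd))

# Only attempt the run when Prokka is installed and the input file exists.
if shutil.which("prokka") and Path("contigs.fasta").exists():
    subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.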

In contrast to Prokka, RAST and BASys are genome annotation web servers. Web servers are generally easier to use but they do not offer the privacy of a locally installed program. RAST is a registration-based web server that accepts standard, pre-assembled DNA sequence files and then identifies protein-encoding, rRNA, and tRNA genes, assigns functions to the genes, and finally uses this information to reconstruct a metabolic network for the organism. In contrast to RAST, BASys is an open access web server. BASys accepts pre-assembled FASTA-formatted DNA or protein files from bacteria, archaea, and viruses and performs many of the same annotation functions as RAST. However, BASys provides a much greater depth of annotation (covering more than 50 calculable properties) and produces colorful, easily viewed genome maps (Figure 5.9) using a program called CGView (Stothard and Wishart 2005).

Figure 5.9 A screenshot of a BASys bacterial genome annotation output for the bacterium Salmonella enterica. The BASys image can be interactively zoomed-in to reveal rich annotations for all of the genes in the genome.

Ch5 Genome Annotation — Eukaryotic Genome Annotation Pipelines

PDF page 163 中部 – page 164；印刷页码 143-144

Given the complexity of eukaryotic genomes, their corresponding annotation pipelines must do somewhat more than those used for prokaryotic genomes. In particular, eukaryotic genome annotation pipelines must combine not only the ab initio gene predictions (or multiple gene predictions from multiple sources) but also many other pieces of evidence, including experimental data. As a result, almost all modern eukaryotic genome annotation pipelines use a technique called "evidence clustering" to identify gene regions and then use the aligned RNA (from RNA-seq) and protein evidence to improve the accuracy of the gene predictors. Some pipelines go even further and make use of a "combiner" algorithm to select the combination of exons that are best supported by the evidence. Two combiner programs in particular are very good at this: JIGSAW (Allen and Salzberg 2005) and EVidence Modeler, or EVM (Haas et al. 2008). These programs assess different types of evidence based on known error profiles and various kinds of user input and then choose the best combination of exons to minimize the error. In particular, EVM combines aligned protein and RNA transcript evidence with ab initio predictions into weighted consensus gene models, while JIGSAW uses non-linear models or weighted linear combiners to choose a single best consensus gene model.
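
The idea behind a combiner can be illustrated with a toy example. The sketch below is not the algorithm used by EVM or JIGSAW. It simply scores each candidate exon by the summed weight of the evidence sources that propose it, then uses weighted interval scheduling (a standard dynamic programming technique) to pick the non-overlapping set with the highest total support. The source names, weights, and coordinates are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

Exon = tuple[int, int]  # (start, end), 1-based inclusive coordinates


def combine_exons(evidence: dict[str, list[Exon]],
                  weights: dict[str, float]) -> list[Exon]:
    """Pick the non-overlapping exon set with the highest total weighted support."""
    # Score each distinct candidate exon by the summed weight of its supporters.
    support: dict[Exon, float] = defaultdict(float)
    for source, exons in evidence.items():
        for exon in set(exons):
            support[exon] += weights[source]

    # Weighted interval scheduling: sort by end, then dynamic programming.
    cands = sorted(support, key=lambda e: e[1])
    ends = [e[1] for e in cands]
    best: list[tuple[float, list[Exon]]] = [(0.0, [])]  # best[i] uses cands[:i]
    for i, (start, end) in enumerate(cands):
        # Number of earlier candidates that end strictly before this exon starts.
        j = bisect_right(ends, start - 1, 0, i)
        take = (best[j][0] + support[(start, end)], best[j][1] + [(start, end)])
        best.append(max(best[i], take, key=lambda t: t[0]))
    return best[-1][1]


evidence = {
    "ab_initio": [(100, 200), (300, 450), (600, 700)],
    "rna_seq": [(100, 200), (300, 400), (600, 700)],
    "protein": [(300, 400), (600, 700)],
}
weights = {"ab_initio": 1.0, "rna_seq": 5.0, "protein": 3.0}  # hypothetical weights
print(combine_exons(evidence, weights))
```

In this example the second exon proposed only by the ab initio predictor, (300, 450), loses to the (300, 400) version backed by both RNA-seq and protein alignments, which is the kind of outcome a weighted consensus is meant to produce.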

Among the most widely used eukaryotic genome annotation pipelines, all of which use some kind of combiner algorithm, are MAKER2 (Holt and Yandell 2010), Ensembl (Fernández-Suárez and Schuster 2010), the National Center for Biotechnology Information (NCBI) Eukaryotic Annotation Pipeline (Thibaud-Nissen et al. 2016), PASA (Haas et al. 2008), and BRAKER1 (Hoff et al. 2016). The MAKER2 annotation pipeline is a highly parallelizable, stand-alone program that aligns and polishes protein sequence and transcriptome (RNA-seq) data with BLAST; it also provides evidence-based hints to various gene predictors and it creates an evidence trail with various quality metrics for each annotation. Some of MAKER2's quality metrics include the number of splice sites confirmed by RNA-seq evidence, the number of exons confirmed by RNA-seq data, and the lengths of 5' and 3' UTRs. MAKER2 also uses a quality metric called the Annotation Edit Distance, or AED (Eilbeck et al. 2009). The AED value ranges between 0 and 1, with higher quality annotations being associated with lower AEDs. MAKER2 uses these AED values to choose the best gene predictions from which to build its final annotation. Like the MAKER2 pipeline, the Ensembl genome annotation pipeline builds its gene models from aligned and polished protein sequence- and RNA-seq-derived transcriptome data. To complete the annotation process, Ensembl merges identical transcripts, and a non-redundant set of transcripts is reported for each gene. Both MAKER2 and Ensembl supply hints to indicate intron/exon boundaries using protein and RNA-seq alignments to their internal gene predictors. This helps to generate gene models that better represent the aligned evidence. This approach also helps improve gene prediction accuracy for poorly (or insufficiently) trained gene finders. Like the Ensembl and MAKER2 pipelines, the NCBI Annotation Pipeline aligns and polishes protein and transcriptome data. 
It also generates gene predictions using the Gnomon gene-finding program (Souvorov et al. 2010). The NCBI system typically assigns higher weights to manually curated evidence over computationally derived models or computationally generated evidence. The PASA genome annotation pipeline is one of the oldest annotation pipelines and was one of the first to use a combiner or evidence-clustering algorithm (EVM). PASA aligns RNA transcripts to the reference genome using BLAT (Kent 2002) or GMAP (Wu et al. 2016). PASA is capable of generating annotations based on RNA transcriptome data, on pre-existing gene models, or on ab initio gene predictions. PASA, along with the MAKER2 and Ensembl annotation pipelines, is able to add UTRs to their genome annotations via RNA-seq data to further increase their accuracy. One of the latest additions to publicly available eukaryotic genome annotation pipelines is the BRAKER suite of programs (Hoff et al. 2016). BRAKER1 (and most recently BRAKER2) combines the strengths of GeneMark-ET with AUGUSTUS – both of which use RNA-seq data to improve their gene annotation accuracy. In the BRAKER pipeline, GeneMark-ET is used first to train and generate initial gene structures, then AUGUSTUS makes use of the initially predicted genes for further training and integrates RNA-seq data into the final gene predictions. BRAKER1 has been shown to be 10–20% more accurate than MAKER2 in terms of gene and exon sensitivity/specificity.
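
The AED metric mentioned above can be illustrated with a small calculation. The sketch below assumes the commonly cited nucleotide-level definition, AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases covered by the evidence. It is a simplified illustration rather than MAKER2's implementation; the function names and coordinates are hypothetical, and expanding intervals into sets of positions is only practical for small examples.

```python
Interval = tuple[int, int]  # (start, end), 1-based inclusive


def covered_bases(intervals: list[Interval]) -> set[int]:
    """Expand intervals into the set of genomic positions they cover."""
    return {pos for start, end in intervals for pos in range(start, end + 1)}


def annotation_edit_distance(annotation: list[Interval],
                             evidence: list[Interval]) -> float:
    """AED = 1 - (SN + SP) / 2, computed at the nucleotide level."""
    ann, evi = covered_bases(annotation), covered_bases(evidence)
    if not ann or not evi:
        return 1.0  # no overlap is possible, so treat as zero support
    shared = len(ann & evi)
    sensitivity = shared / len(evi)   # evidence bases recovered by the annotation
    specificity = shared / len(ann)   # annotation bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


gene = [(100, 199), (300, 399)]   # hypothetical annotated exons
rna = [(100, 199), (300, 349)]    # hypothetical aligned RNA-seq evidence
print(annotation_edit_distance(gene, gene))           # identical intervals give 0.0
print(round(annotation_edit_distance(gene, rna), 3))  # partial support gives 0.125
```

An annotation that matches its evidence exactly scores 0, and the score rises toward 1 as agreement falls, which matches the statement that lower AEDs indicate higher quality annotations.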

Even with an exon accuracy of >90% (rarely achieved by even the best eukaryotic genome annotation pipelines), most genes in a genome will have at least one incorrectly annotated exon. Incorrectly identified genes or mistaken gene annotations can have very serious consequences for experimentalists who are designing experiments to study gene functions. Indeed, many failed molecular biology or gene-cloning experiments can be traced back to incorrect gene annotations. Furthermore, incorrect annotations can propagate, leading to a cascade of errors that affect many other scientists. This happens when an incorrect annotation is innocently passed on to another genome project and then used as evidence in still more genome annotation efforts which eventually end up in public databases. To help prevent these errors or reduce the magnitude of these mistakes, most annotation pipelines include some kind of quality metrics which are attached to each and every gene annotation. Most of these metrics are based on a score that measures the agreement of a given gene annotation to an aligned RNA/protein sequence or on the basis of the homology and synteny of the gene to closely related species. Some pipelines use a simple star rating (ranging from zero to five). Zero stars correspond to an annotation where none of the exons is supported by aligned evidence, while a five-star rating corresponds to a situation where every exon is supported and every splice site is confirmed by a single full-length cDNA. Other pipelines use more sophisticated metrics, such as the AED score (mentioned above). Protein family domains can also be good indicators of annotation quality and annotation completeness. Certainly, any annotation that contains an identifiable protein domain is more likely to encode a functional protein than one that does not. 
Domain matching has been used to rescue a number of gene annotations that would have otherwise received a "failing" quality score owing to a poor sequence alignment. Both Ensembl and MAKER2 report the fraction of annotations containing a protein family domain as a quality measure. Interestingly, this fraction (0.69) appears to be quite constant across genomes; the closer a given genome is to this fraction, the more confidence one has in its quality. In addition to the domain-matching fraction, the presence or absence of BUSCO genes can also be used to provide a measure of the completeness of a genome annotation (Simão et al. 2015). Another excellent route to ensure good quality annotations is through manual inspection with genome visualization and editing software. This is discussed in more detail below.
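
The claim that a 90% exon accuracy still leaves most genes with at least one wrong exon follows from simple probability. The sketch below assumes, for illustration only, that errors in different exons are independent; the exon counts are hypothetical and are not taken from the chapter.

```python
def prob_gene_fully_correct(exon_accuracy: float, n_exons: int) -> float:
    """Probability that every exon of a gene is right, assuming independent errors."""
    return exon_accuracy ** n_exons


for n in (1, 4, 7, 10):
    p = prob_gene_fully_correct(0.90, n)
    print(f"{n:2d} exons: {p:.2f} chance of a fully correct gene model")
```

Under this assumption, a gene with seven or more exons has a less than even chance of being annotated without a single exon error, even when each individual exon is called correctly 90% of the time.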

Ch5 Genome Annotation — Visualization and Quality Control

PDF page 165 上部；印刷页码 145

While automated or semi-automated pipelines for genome annotation have become the norm, there is still a need for a human factor in annotating genomes and assessing their quality. Having a knowledgeable biologist or some kind of "domain expert" carefully look through a genome annotation is essential to ensure that the annotations make sense. This manual review process also allows one to catch and correct suspicious annotations or fill in missing annotations. However, to perform these manual reviews or curatorial tasks, it is necessary to visualize and interactively edit the annotations. Certainly two of the best-known genome browsers are the University of California Santa Cruz Genome Browser (Casper et al. 2018) and Ensembl's Genome Browser (Fernández-Suárez and Schuster 2010), both of which have been thoroughly reviewed in Chapter 4. While these tools are excellent for visualizing genome annotations, there are also a number of other tools that support both visualizing and editing genome annotations, including Web Apollo (Lee et al. 2013), GenomeView (Abeel et al. 2012), and Artemis (Carver et al. 2012).

Web Apollo is both a visualization tool and a genome editor. More specifically, it is a web-based plug-in for JBrowse (Westesson et al. 2013) that provides an editable, user-created annotation track. All edits in Web Apollo are visible in real time to all members of the annotation team. This feature is particularly helpful when undertaking a community annotation project or when many investigators are involved in a particular genome analysis. GenomeView is an open-source, stand-alone genome viewer and editor that allows users to dynamically browse large volumes of aligned short-read data. It supports dynamic navigation and semantic zooming, from the whole genome level to the single nucleotide level. GenomeView is particularly noted for its ability to visualize whole genome alignments of dozens of genomes relative to a reference sequence. It also supports the visualization of synteny and multi-alignment data. Artemis is a genome browser and annotation tool that allows users to easily visualize, browse, and interpret large NGS datasets. It supports multiple sequence read views and variant displays, along with a comprehensive set of read alignment views and read alignment filters. It also has the ability to simultaneously display multiple different views of the same dataset to its users. Artemis can read EMBL and GENBANK database entries, FASTA sequence formats (indexed or raw), and other features in EMBL and GENBANK formats.

When reviewing an annotated genome (regardless of whether it is from a prokaryote or a eukaryote), it is always useful to randomly select a specific region and to use the chosen visualization/editing tools to carefully analyze the annotations together with the evidence provided. This evidence may include the ab initio predicted genes, the spliced RNA-seq alignments, or any homologous protein alignments. While browsing through the selected region, one may notice certain genes or clusters of genes that seem to contradict the displayed evidence. For instance, the RNA-seq data may support additional or different splice forms. Alternatively, certain cross-species proteins may map to genomic regions where no gene has been previously predicted. Visual inspection can also reveal certain systematic problems with the annotation process, such as a tendency to miss genes with known database homologs or the appearance of repeats that overlap or mask many protein-coding genes. These problems may be addressed by changing parameter settings on the genome annotation pipeline, performing the necessary edits manually, or by choosing another tool. Multiple, iterative rounds of manual reviewing and manual editing followed by automated pipeline annotation are often necessary to complete a full and thorough genome annotation.
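
The spot-check procedure described above can be partly scripted before opening a genome browser. The sketch below picks a random window and lists the annotated genes in it that no evidence feature overlaps, as candidates for closer manual inspection. It is a minimal illustration: the feature tuples, gene names, and coordinates are hypothetical, and a real workflow would read them from the pipeline's annotation and alignment files.

```python
import random

Feature = tuple[str, int, int]  # (name, start, end), 1-based inclusive


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def pick_region(genome_length: int, window: int, rng: random.Random) -> tuple[int, int]:
    """Choose a random window of the requested size inside the genome."""
    start = rng.randint(1, genome_length - window + 1)
    return start, start + window - 1


def unsupported_genes(genes: list[Feature], evidence: list[Feature],
                      region: tuple[int, int]) -> list[str]:
    """Names of genes in the region that no evidence feature overlaps."""
    flagged = []
    for name, start, end in genes:
        if not overlaps(start, end, *region):
            continue
        if not any(overlaps(start, end, e_start, e_end) for _, e_start, e_end in evidence):
            flagged.append(name)
    return flagged


genes = [("geneA", 1_000, 2_500), ("geneB", 4_000, 5_200), ("geneC", 8_000, 9_000)]
evidence = [("rna1", 1_100, 2_400), ("prot1", 8_100, 8_900)]
region = pick_region(genome_length=10_000, window=5_000, rng=random.Random(0))
print(region, unsupported_genes(genes, evidence, region))
```

Passing a seeded `random.Random` instance keeps the chosen region reproducible, which helps when several reviewers need to examine the same window.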

中文译文

译文：Ch5 Genome Annotation / Genome Annotation Pipelines

章节：Ch5 Genome Annotation

Canonical 小节：Genome Annotation Pipelines

范围：PDF page 161 下部 - PDF page 165 上部；印刷页码 141-145

---

第五章基因组注释

Genome Annotation Pipelines（基因组注释流水线）

在基因组注释的早期，仅测序一个生物体通常就需要数年时间。研究人员和生物信息学人员会聚集在一起，协作数月甚至数年，完成基因组组装、初始的 ab initio gene predictions（从头基因预测）、手工汇总实验或文献来源的证据、进行比较序列分析，然后将这些数据综合为一个 consensus genome annotation（一致性基因组注释）。这种做法过去常规用于细菌和真核基因组（Lander et al. 2001; Winsor et al. 2005; Riley et al. 2006）。事实上，GENCODE 项目仍在这样做；该项目自 2003 年以来一直在制备并更新人类参考基因组注释（Harrow et al. 2012）。然而，这些工作过去需要、现在也仍然需要大量资源和时间。随着 very high-throughput NGSs（超高通量下一代测序技术）的出现，以及如今能够在几天内常规完成整个基因组测序，基因组注释的这些手工方法已经变得不可持续。现在，大多数基因组注释都通过 automated pipelines（自动化流水线）完成，这些流水线帮助用户综合多种证据和数据，生成一致性基因组注释。

pipeline 工具的选择取决于生物类型（真核生物还是原核生物）、可用计算资源、可用证据（是否有 RNA-seq 数据），以及该生物与已有注释生物之间的相似程度。例如，如果要注释的基因组有一个亲缘关系很近且已经注释过的物种，那么简单的比较分析或序列投射通常就足够了。如果目标生物没有近缘的已注释物种，那么使用 RNA-seq 或实验获得的蛋白质序列数据的 pipeline 会生成更准确的注释。最先进的基因组注释流水线需要调用许多程序并执行复杂分析，因此需要超级计算机级资源，例如大型多核机器或大规模计算集群（可本地维护，也可通过 Cloud 获得）。例如，注释火炬松（loblolly pine）基因组需要 8640 个 central processing units（CPUs）运行 14.6 小时；该基因组包含 220 亿个碱基，是人类基因组的 7 倍（Wegrzyn et al. 2014）。在以下几节中，我们将简要介绍一些常用的原核和真核生物注释流水线。

第五章基因组注释

Prokaryotic Genome Annotation Pipelines（原核基因组注释流水线）

原核生物的注释流水线通常不需要像真核生物注释那样多的计算资源。事实上，大多数细菌基因组无论是在 web server 上还是在台式计算机上，都可以在 30 分钟以内完成注释。不过，近来向 metagenomics（宏基因组学）或 community bacterial genomics（群落细菌基因组学）的转变，正在带来显著更高的计算需求；这一点将在第 16 章中更详细讨论。一些较受欢迎且公开可用的原核基因组注释流水线包括 Prokka（Seemann 2014）、Rapid Annotation using Subsystem Technology（RAST；Overbeek et al. 2014）和 Bacterial Annotation System（BASys；Van Domselaar et al. 2005）。Prokka 是一个开源 Perl 程序，在 UNIX 上以 command-line interface（命令行界面）运行。Prokka 可用于注释已经组装好的细菌、古菌和病毒序列。使用 Prokka 时，一个典型的 400 万碱基对细菌基因组可以在四核计算机上于 10 分钟内完成完整注释。Prokka 还能生成符合标准的输出文件，供后续分析或查看。Prokka 的吸引力在于速度快，并且能够在本地计算机上执行“私有”注释。

与 Prokka 不同，RAST 和 BASys 是基因组注释 web server。Web server 通常更易使用，但不能提供本地安装程序所具有的隐私性。RAST 是一个需要注册的 web server，它接受标准的、已组装的 DNA 序列文件，然后识别 protein-encoding genes（蛋白质编码基因）、rRNA 和 tRNA 基因，为这些基因分配功能，最后利用这些信息重建该生物的 metabolic network（代谢网络）。与 RAST 不同，BASys 是一个开放访问的 web server。BASys 接受来自细菌、古菌和病毒的、已组装的 FASTA 格式 DNA 或蛋白质文件，并执行许多与 RAST 相同的注释功能。不过，BASys 提供的注释深度要大得多（覆盖 50 多种可计算属性），并使用名为 CGView 的程序生成色彩丰富、易于查看的基因组图谱（图 5.9）（Stothard and Wishart 2005）。

图 5.9 BASys 对细菌 Salmonella enterica 进行基因组注释后输出结果的截图。BASys 图像可以交互式放大，从而显示基因组中所有基因的丰富注释信息。

第五章基因组注释

Eukaryotic Genome Annotation Pipelines（真核基因组注释流水线）

鉴于真核基因组的复杂性，其对应的注释流水线必须比原核生物使用的流水线做得更多。具体而言，真核基因组注释流水线不仅需要整合 ab initio gene predictions（从头基因预测）或来自多个来源的多个基因预测结果，还需要整合许多其他类型的证据，包括实验数据。因此，几乎所有现代真核基因组注释流水线都采用一种称为"evidence clustering"（证据聚类）的技术来识别基因区域，然后利用对齐的 RNA（来自 RNA-seq）和蛋白质证据来提高基因预测器的准确性。部分流水线更进一步，使用"combiner" algorithm（组合算法）来选择证据支持度最好的外显子组合。在这方面尤为出色的两个组合程序分别是 JIGSAW（Allen and Salzberg 2005）和 EVidence Modeler，即 EVM（Haas et al. 2008）。这些程序根据已知的错误特征和各种用户输入来评估不同类型的证据，然后选择最佳的外显子组合以使误差最小化。具体而言，EVM 将对齐的蛋白质和 RNA 转录本证据与从头预测整合为加权 consensus gene models（一致性基因模型），而 JIGSAW 则使用非线性模型或加权线性组合器来选择单一最佳一致性基因模型。

在最为广泛使用的真核基因组注释流水线中（它们都使用某种组合算法）包括 MAKER2（Holt and Yandell 2010）、Ensembl（Fernández-Suárez and Schuster 2010）、美国国家生物技术信息中心（NCBI）真核注释流水线（Thibaud-Nissen et al. 2016）、PASA（Haas et al. 2008）和 BRAKER1（Hoff et al. 2016）。MAKER2 注释流水线是一个高度可并行化的独立程序，使用 BLAST 来对齐和优化蛋白质序列及转录组（RNA-seq）数据；它还向各种基因预测器提供基于证据的提示，并为每个注释创建带有各种质量指标 evidence trail（证据链）。MAKER2 的一些质量指标包括：RNA-seq 证据确认的剪接位点数量、RNA-seq 数据确认的外显子数量，以及 5' 和 3' UTR 的长度。MAKER2 还使用一种称为 Annotation Edit Distance（AED，注释编辑距离）的质量指标（Eilbeck et al. 2009）。AED 值介于 0 和 1 之间，质量越高的注释对应的 AED 值越低。MAKER2 利用这些 AED 值来选择最佳基因预测结果，并以此构建最终注释。与 MAKER2 流水线类似，Ensembl 基因组注释流水线也是从对齐和优化的蛋白质序列及 RNA-seq 衍生的转录组数据来构建基因模型。注释过程完成后，Ensembl 会合并相同的转录本，并为每个基因报告一套非冗余的转录本。MAKER2 和 Ensembl 都利用蛋白质和 RNA-seq 比对结果向其内部基因预测器提供内含子/外显子边界提示。这有助于生成能更好代表比对证据的基因模型。这种方法也有助于提高对训练不足（或不充分训练）的基因寻找器的基因预测准确性。与 Ensembl 和 MAKER2 流水线类似，NCBI 注释流水线也对齐和优化蛋白质及转录组数据。它还使用 Gnomon 基因寻找程序来生成基因预测（Souvorov et al. 2010）。NCBI 系统通常会给人工审查证据分配比计算衍生模型或计算生成证据更高的权重。PASA 基因组注释流水线是最古老的注释流水线之一，也是最早使用组合算法或证据聚类算法（EVM）的工具之一。PASA 使用 BLAT（Kent 2002）或 GMAP（Wu et al. 2016）将 RNA 转录本比对到参考基因组。PASA 能够基于 RNA 转录组数据、已有基因模型或从头基因预测来生成注释。PASA 与 MAKER2 和 Ensembl 注释流水线一样，能够通过 RNA-seq 数据向基因组注释添加 UTR，以进一步提高准确性。在公开可用的真核基因组注释流水线中，最新加入的之一是 BRAKER 程序套件（Hoff et al. 2016）。BRAKER1（以及更新的 BRAKER2）将 GeneMark-ET 与 AUGUSTUS 的优势相结合——这两者都利用 RNA-seq 数据来提高基因注释准确性。在 BRAKER 流水线中，首先使用 GeneMark-ET 进行训练并生成初始基因结构，然后 AUGUSTUS 利用最初预测的基因进行进一步训练，并将 RNA-seq 数据整合到最终基因预测中。研究表明，就基因和外显子的敏感性/特异性而言，BRAKER1 比 MAKER2 准确率高 10–20%。

即便外显子准确率达到 >90%（这甚至连最好的真核基因组注释流水线也罕能达到），一个基因组中的大多数基因仍至少会有一个注释错误的外显子。错误识别的基因或错误的基因注释会对设计实验研究基因功能的实验人员产生非常严重的后果。事实上，许多失败的分子生物学或基因克隆实验都可以追溯到错误的基因注释。此外，错误的注释会传播，导致影响许多其他科学家的连锁错误。当一个错误的注释被无辜地传递给另一个基因组项目，然后被用作更多基因组注释工作的证据，最终进入公共数据库时，就会发生这种情况。为了帮助防止这些错误或减少其影响程度，大多数注释流水线都会为每个基因注释附带某种质量指标。这些指标大多基于一个分数，该分数衡量给定基因注释与对齐的 RNA/蛋白质序列的一致性程度，或基于该基因与近缘物种的同源性和共线性。一些流水线使用简单的星级评分（从零星到五星）。零星对应于没有任何外显子被比对证据支持的注释，而五星则对应于每个外显子都得到支持且每个剪接位点都由单个全长 cDNA 确认的情况。其他流水线使用更复杂的指标，如上述的 AED 分数。蛋白质家族结构域也可以作为注释质量和注释完整性的良好指标。可以肯定的是，包含可识别蛋白质结构域的注释比不包含的注释更可能编码有功能的蛋白质。结构域匹配已被用于拯救大量基因注释——这些注释原本因序列比对质量差而得到了“不合格”的质量分数。Ensembl 和 MAKER2 都将含有蛋白质家族结构域的注释比例作为质量度量指标。有趣的是，这一比例（0.69）在各基因组中相当恒定；给定基因组越接近这一比例，人们对其质量就越有信心。除了结构域匹配比例外，BUSCO 基因的存在与否也可用于提供基因组注释完整性的度量（Simão et al. 2015）。确保高质量注释的另一个极好途径是通过人工检查以及使用基因组可视化和编辑软件。这将在下文中详细讨论。

第五章基因组注释

Visualization and Quality Control（可视化与质量控制）

尽管 automated 或 semi-automated 的基因组注释流水线已经成为常态，但在基因组注释及其质量评估中，仍然需要人的参与。由一位知识扎实的生物学家或某种“domain expert”（领域专家）仔细审视一套基因组注释，对于确保这些注释在逻辑上讲得通至关重要。这样的人工审查过程也使人们能够发现并纠正可疑注释，或补全缺失注释。然而，要执行这些人工审查或 curatorial tasks（人工整理任务），就必须能够对注释进行可视化并进行交互式编辑。最知名的两个 genome browser 无疑是 University of California Santa Cruz Genome Browser（Casper et al. 2018）和 Ensembl Genome Browser（Fernández-Suárez and Schuster 2010）；这两者在第 4 章中都已经做过详细介绍。虽然这些工具非常适合用来可视化基因组注释，但也有不少其他工具同时支持基因组注释的可视化与编辑，包括 Web Apollo（Lee et al. 2013）、GenomeView（Abeel et al. 2012）和 Artemis（Carver et al. 2012）。

Web Apollo 既是可视化工具，也是基因组编辑器。更具体地说，它是 JBrowse（Westesson et al. 2013）的一个基于 Web 的 plug-in，能够提供用户创建且可编辑的注释轨道。Web Apollo 中的所有编辑都会实时对注释团队的所有成员可见。这一特性在开展 community annotation project（社区注释项目）或在某个特定基因组分析中有许多研究人员参与时尤其有帮助。GenomeView 是一个开源、独立运行的基因组查看器和编辑器，允许用户动态浏览大量已对齐的 short-read 数据。它支持从 whole genome level（全基因组层面）到 single nucleotide level（单核苷酸层面）的动态导航和 semantic zooming（语义缩放）。GenomeView 尤其以能够可视化相对于参考序列的数十个基因组的全基因组比对而著称。它还支持 synteny（共线性）和 multi-alignment data（多重比对数据）的可视化。Artemis 是一个 genome browser 和注释工具，允许用户轻松地可视化、浏览并解释大型 NGS 数据集。它支持多种序列读段视图和变异显示，以及一整套读段比对视图和读段比对过滤器。它还能够同时向用户展示同一数据集的多个不同视图。Artemis 可以读取 EMBL 和 GENBANK 数据库条目、FASTA 序列格式（索引版或原始版），以及 EMBL 和 GENBANK 格式中的其他特征信息。

在审查一个已注释基因组时（无论它来自原核生物还是真核生物），随机选择一个特定区域，并使用所选的可视化/编辑工具，将注释与所提供的证据结合起来仔细分析，始终是有用的。这些证据可能包括 ab initio 预测得到的基因、剪接后的 RNA-seq 比对结果，或任何同源蛋白质比对结果。在浏览所选区域时，人们可能会注意到某些基因或基因簇似乎与展示出来的证据相矛盾。例如，RNA-seq 数据可能支持额外的或不同的剪接形式。或者，某些跨物种蛋白质可能会映射到此前没有预测到基因的基因组区域。可视化检查还可以揭示注释流程中的某些系统性问题，例如倾向于漏掉在数据库中已有已知同源物的基因，或者出现与许多蛋白质编码基因重叠或将其遮蔽的重复序列。这些问题可以通过修改基因组注释流水线中的参数设置、手动进行必要编辑，或改用其他工具来解决。为了完成一套完整而彻底的基因组注释，通常需要经历多轮、迭代式的人工审查与人工编辑，然后再接 automated pipeline annotation（自动化流水线注释）。

040

Summary + Acknowledgments + Internet Resources + Further Reading + References

PDF page 165 下部 - PDF page 174；印刷页码 145-154

English SourcePDF extracted

Genome annotation has evolved considerably over the past two decades. These changes have been driven, in part, by significant improvements in computational techniques (for gene prediction) and in part by a significant expansion in the number of known and annotated genomes from an ever-growing number of diverse species. The availability of improved gene prediction tools, along with significantly expanded databases of well-annotated genes, proteins, and genomes, has moved genome annotation away from pure gene prediction to a more integrated, holistic approach that combines multiple lines of evidence to locate, identify, and functionally annotate genes. When combined with experimental data such as RNA-seq data or protein sequence data (from structural proteomics or expression-based proteomics), it is possible to obtain remarkably accurate and impressively complete annotations. This comprehensive blending of evidence is the basis for many newly developed, semi-automated or automated genome annotation pipelines and for many of the newer genome browsers and editors. However, not all genome annotation efforts can yield the same quantity or quality of information. Certainly prokaryotic genome annotation is faster, easier, and much more accurate than eukaryotic genome annotation. Indeed, the challenge of prokaryotic genome annotation is essentially a “solved problem,” while the challenge of eukaryotic genome annotation has to be considered as a “work in progress.”

Supplemental source: Acknowledgments / Internet Resources / Further Reading / References

PDF pages: 166-173 | Print pages: 146-153

Acknowledgments

The author thanks Andy Baxevanis and Roderic Guigó for their helpful comments and the use of material from prior editions of this book.

Internet Resources

Ab Initio Prokaryotic Gene Predictors

EasyGene(server) www.cbs.dtu.dk/services/EasyGene

GeneMark.hmm(server) opal.biology.gatech.edu/GeneMark/gmhmmp.cgi

GeneMarkS(server) opal.biology.gatech.edu/GeneMark/genemarks.cgi

GLIMMER(program) www.cs.jhu.edu/~genomics/Glimmer

Prodigal(program) github.com/hyattpd/Prodigal

Ab Initio Eukaryotic Gene Predictors

GeneID(server) genome.crg.es/geneid.html

GeneMark-ES(program) opal.biology.gatech.edu/GeneMark

GeneZilla(program) www.genezilla.org

GenomeScan(server) hollywood.mit.edu/genomescan.html

GENSCAN(server) hollywood.mit.edu/GENSCAN.html

HMMgene(server) www.cbs.dtu.dk/services/HMMgene

SNAP(program) korflab.ucdavis.edu/software.html

Hybrid/Extrinsic Eukaryotic Genome Finders

AUGUSTUS(server) bioinf.uni-greifswald.de/augustus

AUGUSTUS-PPX(program) bioinf.uni-greifswald.de/augustus

CONTRAST(program) contra.stanford.edu/contrast

GeneID(server) genome.crg.es/software/geneid

GeneWise(server) www.ebi.ac.uk/Tools/psa/genewise

GenomeThreader(program) genomethreader.org

GSNAP(program) research-pub.gene.com/gmap

mGENE(program) www.mgene.org

Mugsy-Annotator(program) mugsy.sourceforge.net

SGP-2(program) genome.crg.es/software/sgp2

STAR(program) code.google.com/archive/p/rna-star

Transomics(program) linux1.softberry.com/berry.phtml?topic=transomics

tRNA and rRNA Finders

Rfam(server) rfam.xfam.org

RNAmmer(server) www.cbs.dtu.dk/services/RNAmmer

RNAMotif(program) casegroup.rutgers.edu/casegr-sh-2.5.html

tRNAdb(server) trnadb.bioinf.uni-leipzig.de/DataOutput/Welcome

tRNADB-CE(server) trna.ie.niigata-u.ac.jp/cgi-bin/trnadb/index.cgi

tRNAfinder(server) ei4web.yz.yamagata-u.ac.jp/~kinouchi/tRNAfinder

tRNAscan-SE(server) lowelab.ucsc.edu/tRNAscan-SE

Phage-Finding Tools

Phage_Finder(program) phage-finder.sourceforge.net

PHAST(server) phast.wishartlab.com

PHASTER(server) phaster.ca

Repeat Finding/Masking Tools

Dfam(server) www.dfam.org

LTR_FINDER(server) tlife.fudan.edu.cn/tlife/ltr_finder

LTRharvest(program) genometools.org/index.html

MITE-Hunter(program) target.iplantcollaborative.org/mite_hunter.html

Repbase(server) www.girinst.org/repbase

RepeatMasker(program) www.repeatmasker.org

RepeatScout(program) bix.ucsd.edu/repeatscout

RetroPred(program) www.juit.ac.in/attachments/RetroPred/home.html

Prokaryotic Genome Annotation Pipelines

BASys(server) www.basys.ca

Prokka(program) www.vicbioinformatics.com/software.prokka.shtml

RAST(server/program) rast.nmpdr.org

Eukaryotic Genome Annotation Pipelines

BRAKER1(program) bioinf.uni-greifswald.de/bioinf/braker

EVM(program) evidencemodeler.github.io

JIGSAW(program) www.cbcb.umd.edu/software/jigsaw

MAKER2(program) www.yandell-lab.org/software/maker.html

PASA(program) github.com/PASApipeline/PASApipeline/wiki

Genome Browsers and/or Editors

Artemis(program) www.sanger.ac.uk/science/tools/artemis

Ensembl(program) uswest.ensembl.org/downloads.html

GenomeView(program) genomeview.org

JBrowse(program) jbrowse.org

UCSC Genome Browser hgdownload.cse.ucsc.edu/downloads.html

WebApollo(program) genomearchitect.github.io

译文：Ch5 Genome Annotation / Summary + Acknowledgments + Internet Resources + Further Reading + References

章节：Ch5 Genome Annotation

Canonical 小节：Summary + Acknowledgments + Internet Resources + Further Reading + References

范围：PDF page 165 下部 - PDF page 174；印刷页码 145-154

---

第五章基因组注释

Summary（小结）

在过去二十年中，基因组注释已经发生了显著演变。这些变化一方面源于计算技术（用于基因预测）的显著改进，另一方面也源于来自越来越多不同物种的已知和已注释基因组数量的大幅增加。改进后的基因预测工具，加上显著扩展且注释良好的基因、蛋白质和基因组数据库，已经推动基因组注释从单纯的 gene prediction（基因预测）转向一种更加 integrated、holistic（整合性、整体性）的方法：这种方法结合多条证据线索来定位、识别并在功能上注释基因。当这些证据再与 RNA-seq 数据或蛋白质序列数据（来自结构蛋白质组学或基于表达的蛋白质组学）等实验数据结合时，就有可能获得非常准确且相当完整的注释。这种对证据的综合融合，是许多新近开发的 semi-automated 或 automated genome annotation pipelines（半自动化或自动化基因组注释流水线），以及许多较新的 genome browsers 和 editors 的基础。

然而，并非所有基因组注释工作都能产生同等数量或同等质量的信息。可以肯定的是，原核基因组注释比真核基因组注释更快、更容易，也准确得多。事实上，原核基因组注释的挑战基本上已经是一个“已解决的问题”，而真核基因组注释的挑战则必须被视为一个“仍在推进中的工作”。

致谢

作者感谢 Andy Baxevanis 和 Roderic Guigó 提供有益评论，并感谢他们允许使用本书前几版中的相关材料。

网络资源

Ab Initio 原核基因预测器

EasyGene（server）：www.cbs.dtu.dk/services/EasyGene
GeneMark.hmm（server）：opal.biology.gatech.edu/GeneMark/gmhmmp.cgi
GeneMarkS（server）：opal.biology.gatech.edu/GeneMark/genemarks.cgi
GLIMMER（program）：www.cs.jhu.edu/~genomics/Glimmer
Prodigal（program）：github.com/hyattpd/Prodigal

Ab Initio 真核基因预测器

GeneID（server）：genome.crg.es/geneid.html
GeneMark-ES（program）：opal.biology.gatech.edu/GeneMark
GeneZilla（program）：www.genezilla.org
GenomeScan（server）：hollywood.mit.edu/genomescan.html
GENSCAN（server）：hollywood.mit.edu/GENSCAN.html
HMMgene（server）：www.cbs.dtu.dk/services/HMMgene
SNAP（program）：korflab.ucdavis.edu/software.html

Hybrid / Extrinsic 真核基因查找器

AUGUSTUS（server）：bioinf.uni-greifswald.de/augustus
AUGUSTUS-PPX（program）：bioinf.uni-greifswald.de/augustus
CONTRAST（program）：contra.stanford.edu/contrast
GeneID（server）：genome.crg.es/software/geneid
GeneWise（server）：www.ebi.ac.uk/Tools/psa/genewise
GenomeThreader（program）：genomethreader.org
GSNAP（program）：research-pub.gene.com/gmap
mGENE（program）：www.mgene.org
Mugsy-Annotator（program）：mugsy.sourceforge.net
SGP-2（program）：genome.crg.es/software/sgp2
STAR（program）：code.google.com/archive/p/rna-star
Transomics（program）：linux1.softberry.com/berry.phtml?topic=transomics

tRNA 和 rRNA 查找工具

Rfam（server）：rfam.xfam.org
RNAmmer（server）：www.cbs.dtu.dk/services/RNAmmer
RNAMotif（program）：casegroup.rutgers.edu/casegr-sh-2.5.html
tRNAdb（server）：trnadb.bioinf.uni-leipzig.de/DataOutput/Welcome
tRNADB-CE（server）：trna.ie.niigata-u.ac.jp/cgi-bin/trnadb/index.cgi
tRNAfinder（server）：ei4web.yz.yamagata-u.ac.jp/~kinouchi/tRNAfinder
tRNAscan-SE（server）：lowelab.ucsc.edu/tRNAscan-SE

噬菌体查找工具

Phage_Finder（program）：phage-finder.sourceforge.net
PHAST（server）：phast.wishartlab.com
PHASTER（server）：phaster.ca

重复序列查找 / 遮蔽工具

Dfam（server）：www.dfam.org
LTR_FINDER（server）：tlife.fudan.edu.cn/tlife/ltr_finder
LTRharvest（program）：genometools.org/index.html
MITE-Hunter（program）：target.iplantcollaborative.org/mite_hunter.html
Repbase（server）：www.girinst.org/repbase
RepeatMasker（program）：www.repeatmasker.org
RepeatScout（program）：bix.ucsd.edu/repeatscout
RetroPred（program）：www.juit.ac.in/attachments/RetroPred/home.html

原核基因组注释流水线

BASys（server）：www.basys.ca
Prokka（program）：www.vicbioinformatics.com/software.prokka.shtml
RAST（server/program）：rast.nmpdr.org

真核基因组注释流水线

BRAKER1（program）：bioinf.uni-greifswald.de/bioinf/braker
EVM（program）：evidencemodeler.github.io
JIGSAW（program）：www.cbcb.umd.edu/software/jigsaw
MAKER2（program）：www.yandell-lab.org/software/maker.html
PASA（program）：github.com/PASApipeline/PASApipeline/wiki

基因组浏览器和 / 或编辑器

Artemis（program）：www.sanger.ac.uk/science/tools/artemis
Ensembl（program）：uswest.ensembl.org/downloads.html
GenomeView（program）：genomeview.org
JBrowse（program）：jbrowse.org
UCSC Genome Browser：hgdownload.cse.ucsc.edu/downloads.html
Web Apollo（program）：genomearchitect.github.io

延伸阅读

Hoff, K.J. and Stanke, M. (2015). Current methods for automated annotation of protein-coding genes. Curr. Opin. Insect Sci. 7, 8–14.

一篇写得很好且内容较新的综述，总结了基因组注释领域的一些最新进展，并就应使用哪些注释工具提供了非常实用的建议。

Nielsen, P. and Krogh, A. (2005). Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics. 21, 4322–4329.

一篇非常易读的文章，评估了原核基因预测和基因组注释。

Yandell, M. and Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nat. Rev. Genet. 13, 329–342.

一篇优秀、易读的入门文章，介绍真核基因组注释涉及的流程，并对可用计算工具和最佳实践作了有用说明。

Yoon, B. (2009). Hidden Markov models and their applications in biological sequence analysis. Curr. Genomics 10, 402–415.

一篇关于 HMM 的综合教程，提供了许多有用示例，并解释不同 HMM 如何构建，以及如何用于基因预测和基因序列分析。


