Chapter 1

Biological Sequence Databases


Andreas D. Baxevanis

Introduction

Over the past several decades, there has been a feverish push to understand, at the most elementary of levels, what constitutes the basic “book of life.” Biologists (and scientists in general) are driven to understand how the millions or billions of bases in an organism’s genome contain all of the information needed for the cell to conduct the myriad metabolic processes necessary for the organism’s survival – information that is propagated from generation to generation. To have a basic understanding of how the collection of individual nucleotide bases drives the engine of life, large amounts of sequence data must be collected and stored in a way that allows these data to be searched and analyzed easily. To this end, much effort has gone into the design and maintenance of biological sequence databases. These databases have had a significant impact on the advancement of our understanding of biology, not just from a computational standpoint but also through their integrated use alongside studies being performed at the bench.

The history of sequence databases began in the early 1960s, when Margaret Dayhoff and colleagues (1965) at the National Biomedical Research Foundation (NBRF) collected all of the protein sequences known at that time – all 65 of them – and published them in a book called the Atlas of Protein Sequence and Structure. It is important to remember that, at this point in the history of biology, the focus was on sequencing proteins through traditional techniques such as the Edman degradation rather than on sequencing DNA, hence the overall small number of available sequences. By the late 1970s, when a significant number of nucleotide sequences became available, those were also included in later editions of the Atlas. As this collection evolved, it included text-based descriptions to accompany the protein sequences, as well as information regarding the evolution of many protein families. This work, in essence, was the first annotated sequence database, even though it was in printed form. Over time, the amount of data contained in the Atlas became unwieldy and the need for it to be available in electronic form became obvious. From the early 1970s to the late 1980s, the contents of the Atlas were distributed electronically by NBRF (and later by the Protein Information Resource, or PIR) on magnetic tape, and the distribution included some basic programs that could be used to search and evaluate distant evolutionary relationships.

The next phase in the history of sequence databases was precipitated by the veritable explosion in the amount of nucleotide sequence data available to researchers by the end of the 1970s. To address the need for more robust public sequence databases, the Los Alamos National Laboratory (LANL) created the Los Alamos DNA Sequence Database in 1979, which became known as GenBank in 1982 (Benson et al. 2018). Meanwhile, the European Molecular Biology Laboratory (EMBL) created the EMBL Nucleotide Sequence Data Library in 1980. Throughout the 1980s, EMBL (then based in Heidelberg, Germany), LANL, and (later) the National Center for Biotechnology Information (NCBI, part of the National Library of Medicine at the National Institutes of Health) jointly contributed DNA sequence data to these databases. This was done by having teams of curators manually transcribe and interpret what was published in print journals into an electronic format more appropriate for computational analyses. The DNA Databank of Japan (DDBJ; Kodama et al. 2018) joined this DNA data-collecting collaboration a few years later. By the late 1980s, the quantity of DNA sequence data being produced was so overwhelming that print journals began asking scientists to electronically submit their DNA sequences directly to these databases, rather than publishing them in printed journals or papers. In 1988, after a meeting of these three groups (now referred to as the International Nucleotide Sequence Database Collaboration, or INSDC; Karsch-Mizrachi et al. 2018), there was an agreement to use a common data exchange format and to have each database update only the records that were directly submitted to it. Thanks to this agreement, all three centers (EMBL, DDBJ, and NCBI) now collect direct DNA sequence submissions and distribute them so that each center has copies of all of the sequences, with each center acting as a primary distribution center for these sequences. DDBJ/EMBL/GenBank records are updated automatically every 24 hours at all three sites, meaning that all sequences can be found within DDBJ, the European Nucleotide Archive (ENA; Silvester et al. 2018), and GenBank in short order. That said, each database within the INSDC has the freedom to display and annotate the sequence data as it sees fit.

In parallel with the early work being done on DNA sequence databases, the foundations for the Swiss-Prot protein sequence database were also being laid in the early 1980s by Amos Bairoch, who recounts its history from an engaging perspective in a first-person review (Bairoch 2000). Bairoch converted PIR’s Atlas to a format similar to that used by EMBL for its nucleotide database. In this initial release, called PIR+, additional information about each of the proteins was added, increasing its value as a curated, well-annotated source of information on proteins. In the summer of 1986, Bairoch began distributing PIR+ on the US BIONET (a precursor to the Internet), renaming it Swiss-Prot. At that time, it contained the grand sum of 3900 protein sequences. This was seen as an overwhelming amount of data at the time, in stark contrast to today’s standards. As Swiss-Prot and EMBL followed similar formats, a natural collaboration developed between these two groups, and these collaborative efforts strengthened when both EMBL’s and Swiss-Prot’s operations were moved to EMBL’s European Bioinformatics Institute (EBI; Cook et al. 2018) in Hinxton, UK. One of the first collaborative projects undertaken by the Swiss-Prot and EMBL teams was to create a new and much larger protein sequence database supplement to Swiss-Prot. As maintaining the high quality of Swiss-Prot entries was a time-consuming process involving extensive sequence analysis and detailed curation by expert annotators (Apweiler 2001), and to allow the quick release of protein data not yet annotated to Swiss-Prot’s stringent standards, a new database called TrEMBL (for “translation of EMBL nucleotide sequences”) was created. This supplement to Swiss-Prot initially consisted of computationally annotated sequence entries derived from the translation of all coding sequences (CDSs) found in INSDC databases. In 2002, a new effort involving the Swiss Institute of Bioinformatics, EMBL-EBI, and PIR was launched, called the UniProt consortium (UniProt Consortium 2017). This effort gave rise to the UniProt Knowledgebase (UniProtKB), consisting of Swiss-Prot, TrEMBL, and PIR. A similar effort also gave rise to the NCBI Protein Database, which brings together data from numerous sources and is described more fully in the text that follows.

The completion of human genome sequencing and the sequencing of numerous model genomes, as well as the existence of a gargantuan number of sequences in general, provide a golden opportunity for biological scientists, owing to the inherent value of these data. At the same time, the sheer magnitude of data also presents a conundrum to the inexperienced user, resulting not just from the size of the “sequence information space” but from the fact that the information space continues to get larger by leaps and bounds. Indeed, the sequencing landscape has changed significantly in recent years with the development of new high-throughput technologies that generate more and more sequence data in a way that is best described as “better, cheaper, faster,” with these advances feeding into the “insatiable appetite” that scientists have for more and more sequence data (Green et al. 2017). Given the inherent value of the data contained within these sequence databases, this chapter will focus on providing the reader with a solid understanding of these major public sequence databases, as a first step toward being able to perform robust and accurate bioinformatic analyses.



Nucleotide Sequence Databases

As described above, the major sources of nucleotide sequence data are the databases involved in INSDC – DDBJ, ENA, and GenBank – with new or updated data being shared between these three entities once every 24 hours. This transfer is facilitated by the use of common data formats for the kinds of information described in detail below.

The elementary format underlying the information held in sequence databases is a text file called the flatfile. The correspondence between individual flatfile formats greatly facilitates the daily exchange of data between each of these databases. In most cases, fields can be mapped on a one-to-one basis from one flatfile format to the other. Over time, various file formats have been adopted and have found continued widespread use; others have fallen by the wayside for a variety of reasons. The success of a given format depends on its usefulness in a variety of contexts, as well as its power in effectively containing and representing the types of biological data that need to be archived and communicated to scientists.

In its simplest form, a sequence record can be represented as a string of nucleotides with some basic tag or identifier. The most widely used of these simple formats is FASTA, originally introduced as part of the FASTA software suite developed by Lipman and Pearson (1985) that is described in detail in Chapter 3. This inherently simple format provides an easy way of handling primary data for both humans and computers, taking the following form.

>U54469.1
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA
GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG
TAACCTACGCAGCTTGAGTGCGTAACCGATATCTAGTATACATTTCGATACATCGAAATCATGGTAGTGT
TGGAGACGGAGAAGGTAAGACGATGATAGACGGCGAGCCGCATGGGTTCGATTTGCGCTGAGCCGTGGCA
GGGAACAACAAAAACAGGGTTGTTGCACAAGAGGGGAGGCGATAGTCGAGCGGAAAAGAGTGCAGTTGGC

For brevity, only the first few lines of the sequence are shown. In the simplest incarnation of the FASTA format, the “greater than” character (>) designates the beginning of a new sequence record; this line is referred to as the definition line (commonly called the “def line”). A unique identifier – in this case, the accession.version number (U54469.1) – is followed by the nucleotide sequence, in either uppercase or lowercase letters, with a fixed number of characters per line (usually 60, although the example above uses 70). The accession number is the number that is always associated with this sequence (and should be cited in publications), while the version number suffix allows users to easily determine whether they are looking at the most up-to-date record for a particular sequence. The version number suffix is incremented by one each time the sequence is updated.
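The structure just described is simple enough to read with a few lines of code. The following is a minimal sketch, not a reference parser: the function names `read_fasta` and `split_accession_version` are illustrative, and it assumes that the identifier is the first whitespace-delimited token on the definition line.

```python
from typing import Iterator, Optional, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (definition line, sequence) pairs from FASTA-formatted text."""
    defline: Optional[str] = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new '>' line closes the previous record, if there was one.
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif defline is not None:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def split_accession_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'U54469.1' into ('U54469', 1); the version is None if absent."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


# First two sequence lines of the record shown above.
example = """>U54469.1
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA
GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG
"""

records = list(read_fasta(example))
defline, sequence = records[0]
accession, version = split_accession_version(defline.split()[0])
print(accession, version, len(sequence))
```

Because line breaks inside the sequence carry no meaning, the sketch concatenates the sequence lines, so the result does not depend on how many characters appear per line.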

Additional information can be included on the definition line to make this simple format a bit more informative, as follows.

>ENA|U54469|U54469.1 Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene, complete cds, alternatively spliced.

This modified FASTA definition line now has information on the source database (ENA), its accession.version number (U54469.1), and a short description of what biological entity is represented by the sequence.
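An extended definition line of this kind can be split into its parts as sketched below. This is only an illustration under stated assumptions: the three pipe-delimited fields (database, accession, accession.version) are inferred from the single example above rather than from a formal specification, and `parse_defline` is a hypothetical helper name.

```python
from typing import Dict, Optional


def parse_defline(defline: str) -> Dict[str, Optional[str]]:
    """Split a FASTA definition line into identifier fields and description."""
    # The identifier is everything up to the first space; the rest is free text.
    identifier, _, description = defline.lstrip(">").partition(" ")
    fields = identifier.split("|")
    if len(fields) == 3:
        # Assumed layout: database|accession|accession.version
        database, accession, acc_version = fields
    else:
        # Fall back to a bare identifier such as 'U54469.1'.
        database = None
        acc_version = fields[-1]
        accession = acc_version.partition(".")[0]
    return {
        "database": database,
        "accession": accession,
        "accession_version": acc_version,
        "description": description or None,
    }


line = (">ENA|U54469|U54469.1 Drosophila melanogaster eukaryotic initiation "
        "factor 4E (eIF4E) gene, complete cds, alternatively spliced.")
info = parse_defline(line)
print(info["database"], info["accession_version"])
```

Definition lines produced by other resources may order or delimit these fields differently, so a real pipeline would need to check the conventions of the database that produced the file.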



Nucleotide Sequence Flatfiles: A Dissection

As flatfiles represent the elementary unit of information within sequence databases and facilitate the interchange of information between these databases, it is important to understand what each individual field within the flatfile represents and what kinds of information can be found in varying parts of the record. While there are minor differences in flatfile formats, they can all be separated into three major parts: the header, containing information and descriptors pertaining to the entire record; the feature table, which provides relevant annotations to the sequence; and the sequence itself.


The Header

The header is the most database-specific part of the record. Here, we will use the ENA version of the record for discussion (shown in its entirety in Appendix 1.1), with the corresponding DDBJ and GenBank versions of the header appearing in Appendix 1.2. The first line of the record provides basic identifying information about the sequence contained in the record, appropriately named the ID line; this corresponds to the LOCUS line in DDBJ/GenBank.

ID   U54469; SV 1; linear; genomic DNA; STD; INV; 2881 BP.

The accession number is shown on the ID line, followed by its sequence version (here, the first version, or SV 1). As this is SV 1, this is equivalent to writing U54469.1, as described above. This is then followed by the topology of the DNA molecule (linear) and the molecule type (genomic DNA). The next element represents the ENA data class for this sequence (STD, denoting a “standard” annotated and assembled sequence). Data classes are used to group sequence records within functional divisions, enabling users to query specific subsets of the database. A description of these functional divisions can be found in Box 1.1. Finally, the ID line presents the taxonomic division for the sequence of interest (INV, for invertebrate; see Internet Resources) and its length (2881 base pairs). The accession number will also be shown separately on the AC line that immediately follows the ID line.
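The seven semicolon-delimited fields of the ID line map directly onto a small record. The sketch below assumes the field order described above and shown in the example; the function name `parse_id_line` and the dictionary keys are illustrative choices, not part of any ENA tool.

```python
from typing import Dict, Union


def parse_id_line(line: str) -> Dict[str, Union[str, int]]:
    """Parse an ENA-style ID line into its seven named fields."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Drop the line code and the trailing period, then split on semicolons.
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, found {len(fields)}")
    accession, sv, topology, mol_type, data_class, division, length = fields
    return {
        "accession": accession,
        "sequence_version": int(sv.split()[1]),   # 'SV 1' -> 1
        "topology": topology,
        "molecule_type": mol_type,
        "data_class": data_class,
        "taxonomic_division": division,
        "length_bp": int(length.split()[0]),      # '2881 BP' -> 2881
    }


id_line = "ID   U54469; SV 1; linear; genomic DNA; STD; INV; 2881 BP."
entry = parse_id_line(id_line)
accession_version = f"{entry['accession']}.{entry['sequence_version']}"
print(accession_version, entry["data_class"], entry["length_bp"])
```

Joining the accession and the sequence version reproduces the accession.version identifier used on the FASTA definition line.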

Box 1.1 Functional Divisions in Nucleotide Databases

The organization of nucleotide sequence records into discrete functional types provides a way for users to query specific subsets of the records within these databases. In addition, knowledge that a particular sequence is from a given technique-oriented database allows users to interpret the data from the proper biological point of view. Several of these divisions are described below, and examples of each of these functional divisions (called “data classes” by ENA) can be found by following the example links listed on the ENA Data Formats page listed in the Internet Resources section of this chapter.

CON  Constructed (or “contigged”) records of chromosomes, genomes, and other long DNA sequences resulting from whole-genome sequencing efforts. The records in this division do not contain sequence data; rather, they contain instructions for the assembly of sequence data found within multiple database records.

EST  Expressed Sequence Tags. These records contain short (300–500 bp) single reads from mRNA (cDNA) that are usually produced in large numbers. ESTs represent a snapshot of what is expressed in a given tissue or at a given developmental stage. They represent tags – some coding, some not – of expression for a given cDNA library.

GSS  Genome Survey Sequences. Similar to the EST division, except that the sequences are genomic in origin. The GSS division contains (but is not limited to) single-pass read genome survey sequences, bacterial artificial chromosome (BAC) or yeast artificial chromosome (YAC) ends, exon-trapped genomic sequences, and Alu polymerase chain reaction (PCR) sequences.

HTG  High-Throughput Genome sequences. Unfinished DNA sequences generated by high-throughput sequencing centers, made available in an expedited fashion to the scientific community for homology and similarity searches. Entries in this division contain keywords indicating their phase within the sequencing process. Once finished, HTG sequences are moved into the appropriate database taxonomic division.

STD  A record containing a standard, annotated, and assembled sequence.

STS  Sequence-Tagged Sites. Short (200–500 bp) operationally unique sequences that identify a combination of primer pairs used in a PCR assay, generating a reagent that maps to a single position within the genome. The STS division is intended to facilitate cross-comparison of STSs with sequences in other divisions for the purpose of correlating map positions of anonymous sequences with known genes.

WGS  Whole-Genome Shotgun sequences. Sequence data from projects using shotgun approaches that generate large numbers of short sequence reads that can then be assembled by computer algorithms into sequence contigs, higher-order scaffolds, and sometimes into near-chromosome- or chromosome-length sequences.

Following the ID line are one or more date lines (denoted by DT), indicating when the entry was first created or last updated. For our sequence of interest, the entry was originally created on May 19, 1996 and was last updated in ENA on June 23, 2017:

DT   19-MAY-1996 (Rel. 47, Created)
DT   23-JUN-2017 (Rel. 133, Last updated, Version 5)

The release number in each line indicates the first quarterly release made after the entry was created or last updated. The version number for the entry appears on the second line and allows the user to determine easily whether they are looking at the most up-to-date record for a particular sequence. Please note that this entry version is different from the sequence version in the accession.version format described above: an element of the record may have changed while the sequence itself remained the same, so these two types of version numbers will not always correspond to one another. Here, for example, the entry is at Version 5 while the sequence is still at SV 1.
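The two DT lines can be read with a single regular expression, which also makes the distinction between the entry version and the sequence version concrete. This is a sketch that assumes the exact layout of the two lines shown above; `parse_dt_line` and the pattern are illustrative, and month names are mapped by hand to avoid locale-dependent date parsing.

```python
import re
from datetime import date
from typing import Dict, Optional, Union

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# Assumed layout: 'DT   DD-MON-YYYY (Rel. N, Created)' or
#                 'DT   DD-MON-YYYY (Rel. N, Last updated, Version M)'
DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \(Rel\. (\d+), "
    r"(Created|Last updated)(?:, Version (\d+))?\)$"
)


def parse_dt_line(line: str) -> Dict[str, Union[date, int, str, None]]:
    """Extract date, release number, event, and entry version from a DT line."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized DT line: {line!r}")
    day, month, year, release, event, version = match.groups()
    entry_version: Optional[int] = int(version) if version else None
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "release": int(release),
        "event": event,
        "entry_version": entry_version,
    }


created = parse_dt_line("DT   19-MAY-1996 (Rel. 47, Created)")
updated = parse_dt_line("DT   23-JUN-2017 (Rel. 133, Last updated, Version 5)")
print(created["date"], updated["date"], updated["entry_version"])
```

The entry version recovered here (5) counts changes to any part of the record, whereas the sequence version on the ID line (SV 1) changes only when the sequence itself does.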

The next part of the header contains the definition lines, providing a succinct description of the kinds of biological information contained within the record. The definition line (DE in ENA, DEFINITION in DDBJ/GenBank) takes the following form.

DE   Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene,
DE   complete cds, alternatively spliced.

Much care is taken in the generation of these definition lines and, although many of them can be generated automatically from other parts of the record, they are reviewed to ensure that consistency and richness of information are maintained. It is not possible to capture all of the biology underlying a sequence in a line or two of text, but that wealth of information follows in downstream parts of the same record.

Continuing down the flatfile record, one finds the full taxonomic information on the sequence of interest. The OS line (or SOURCE line in DDBJ/GenBank) provides the preferred scientific name of the organism from which the sequence was derived, followed by the common name of the organism in parentheses. The OC lines (or ORGANISM lines in DDBJ/GenBank) contain the complete taxonomic classification of the source organism. The classification is listed top-down, as nodes in a taxonomic tree, with the most general grouping (Eukaryota) given first.

OS   Drosophila melanogaster (fruit fly)
OC   Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota;
OC   Neoptera; Holometabola; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC   Drosophilidae; Drosophila; Sophophora.
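Because the classification is a semicolon-delimited list that runs across several OC lines, it can be recovered as an ordered list of taxonomic nodes. The following sketch assumes the layout shown above (a terminal period on the last OC line); `parse_lineage` is an illustrative name.

```python
from typing import Iterable, List


def parse_lineage(lines: Iterable[str]) -> List[str]:
    """Join OC lines and return the taxonomic nodes, most general first."""
    # Strip the two-letter line code, then join continuation lines.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("OC"))
    return [node.strip() for node in text.rstrip(".").split(";") if node.strip()]


header_lines = [
    "OS   Drosophila melanogaster (fruit fly)",
    "OC   Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota;",
    "OC   Neoptera; Holometabola; Diptera; Brachycera; Muscomorpha; Ephydroidea;",
    "OC   Drosophilidae; Drosophila; Sophophora.",
]

lineage = parse_lineage(header_lines)
print(" > ".join(lineage))
```

Keeping the nodes in their original order preserves the top-down reading of the taxonomic tree, so `lineage[0]` is the most general grouping and `lineage[-1]` the most specific.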

Each record must have at least one reference or citation, noted within what are called reference blocks. These reference blocks offer scientific credit and set a context explaining why this particular sequence was determined. The reference blocks take the following form.

RN   [1]
RP   1-2881
RX   DOI; 10.1074/jbc.271.27.16393.
RX   PUBMED; 8663200.
RA   Lavoie C.A., Lachance P.E., Sonenberg N., Lasko P.;
RT   "Alternatively spliced transcripts from the Drosophila eIF4E gene produce
RT   two different Cap-binding proteins";
RL   J Biol Chem 271(27):16393-16398(1996).
XX
RN   [2]
RP   1-2881
RA   Lasko P.F.;
RT   ;
RL   Submitted (09-APR-1996) to the INSDC.
RL   Paul F. Lasko, Biology, McGill University, 1205 Avenue Docteur Penfield,
RL   Montreal, QC H3A 1B1, Canada

In this case, two references are shown, one referring to a published paper and the other referring to the submission of the sequence record itself. In the example above, the second block provides information on the senior author of the paper listed in the first block, as well as the author’s postal address. While the date shown in the second block indicates when the sequence (and accompanying information) was submitted to the database, it does not indicate when the record was first made public, so no inferences or claims based on first public release can be made based on this date. Additional submitter blocks may be added to the record each time the sequence is updated.
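Reference blocks can be separated by treating each RN line as the start of a new block and collecting the remaining R-prefixed lines under it. The sketch below is a simplification under stated assumptions: it concatenates repeated line codes (such as the three RL lines of the submission block) with spaces, and `parse_references` is an illustrative name. Only a subset of the lines shown above is used as input.

```python
from typing import Dict, Iterable, List


def parse_references(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Group R* header lines into one dictionary per reference block."""
    references: List[Dict[str, str]] = []
    for line in lines:
        code, value = line[:2], line[2:].strip()
        if code == "RN":
            # RN opens a new reference block.
            references.append({"RN": value})
        elif code.startswith("R") and references:
            current = references[-1]
            # Repeated codes (continuation lines) are joined with a space.
            current[code] = f"{current[code]} {value}" if code in current else value
    return references


block = [
    "RN   [1]",
    "RP   1-2881",
    "RX   PUBMED; 8663200.",
    "RA   Lavoie C.A., Lachance P.E., Sonenberg N., Lasko P.;",
    "RL   J Biol Chem 271(27):16393-16398(1996).",
    "XX",
    "RN   [2]",
    "RP   1-2881",
    "RA   Lasko P.F.;",
    "RT   ;",
    "RL   Submitted (09-APR-1996) to the INSDC.",
    "RL   Paul F. Lasko, Biology, McGill University, 1205 Avenue Docteur Penfield,",
    "RL   Montreal, QC H3A 1B1, Canada",
]

refs = parse_references(block)
submissions = [ref for ref in refs if ref.get("RL", "").startswith("Submitted")]
print(len(refs), submissions[0]["RA"])
```

Checking whether the RL value begins with "Submitted" is one simple way to tell a submission block from a citation of a published paper.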

Some headers may contain COMMENT (DDBJ/GenBank) or CC (ENA) lines. These lines can include a great variety of notes and comments (descriptors) that refer to the entire record. Often, genome centers will use these lines to provide contact information and to confer acknowledgments. Comments also may include the history of the sequence. If the sequence of a particular record is updated, the comment will contain a pointer to the previous versions of the record. Alternatively, if an earlier version of the record is retrieved, the comment will point forward to the newer version, as well as backwards, if there was a still earlier version. Finally, there are database cross-reference lines (marked DR) that provide links to allied databases containing information related to the sequence of interest. Here, a cross-reference to FlyBase can be seen in the complete header for this record in Appendix 1.1. Note that the corresponding DDBJ/GenBank header in Appendix 1.2 does not contain these cross-references.


The Feature Table

Early on in the collaboration between INSDC partner organizations, an effort was made to come up with a common way to represent the biological information found within a given database record. This common representation is called the feature table, consisting of feature keys (a single word or abbreviation indicating the described biological property), location information denoting where the feature is located within the sequence, and qualifiers providing additional descriptive information about the feature. The online INSDC feature table documentation is extensive and describes in great detail what features are allowed and what qualifiers can be used with each individual feature. Wording within the feature table uses common biological research terminology wherever possible and is consistent between DDBJ, ENA, and GenBank entries.

Here, we will dissect the feature table for the eukaryotic initiation factor 4E (eIF4E) gene from Drosophila melanogaster, shown in its entirety in both Appendices 1.3 (in ENA format) and 1.4 (in DDBJ/GenBank format). This particular sequence is alternatively spliced, producing two distinct gene products, 4E-I and 4E-II. The first block of information in the feature table is always the source feature, indicating the biological source of the sequence and additional information relating to the entire sequence. This feature must be present in all INSDC entries, as all DNA or RNA sequences derive from some specific biological source, including synthetic DNA.

FT   source          1..2881
FT                   /organism="Drosophila melanogaster"
FT                   /chromosome="3"
FT                   /map="67A8-B2"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:7227"
FT   gene            80..2881
FT                   /gene="eIF4E"

In the first line of the source key, notice that the numbering scheme shows the range of positions covered by this feature key as two numbers separated by two dots (1..2881). As the source key pertains to the entire sequence, we can infer that the sequence described in this entry is 2881 nucleotides in length. The various ways in which the location of any given feature can be indicated are shown in Table 1.1, accounting for a wide range of biological scenarios. The qualifiers then follow, each preceded by a slash. The full scientific name of the organism is provided, as are specific mapping coordinates, indicating that this sequence is at map location 67A8-B2 on chromosome 3. Also indicated is the type of molecule that was sequenced (genomic DNA). Finally, the last line indicates a database cross-reference (abbreviated as db_xref) to the NCBI taxonomy database, where taxon 7227 corresponds to D. melanogaster. In general, these cross-references are controlled qualifiers that allow entries to be connected to an external database, using an identifier that is unique to that external database. Following the source block above is the gene feature, indicating that the gene itself is a subset of the entire sequence in this entry, starting at position 80 and ending at position 2881.
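The key, location, and qualifier structure of the source and gene features can be captured with a short parser. This is a deliberately limited sketch: it tells feature-key lines from qualifier lines by their indentation after the FT code, ignores continuation lines, and stores qualifiers in a dictionary, so a qualifier that occurs more than once in a feature would be overwritten. The name `parse_feature_table` is illustrative.

```python
from typing import Dict, Iterable, List


def parse_feature_table(lines: Iterable[str]) -> List[Dict]:
    """Parse simple FT lines into features with a key, location, and qualifiers."""
    features: List[Dict] = []
    for raw in lines:
        if not raw.startswith("FT"):
            continue
        rest = raw[2:]
        indent = len(rest) - len(rest.lstrip())
        body = rest.strip()
        if not body:
            continue
        if indent < 10:
            # Lightly indented lines carry a feature key and its location.
            key, location = body.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
        elif body.startswith("/") and features:
            # Deeply indented lines starting with '/' are qualifiers.
            name, _, value = body[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
        # Continuation lines (wrapped locations or values) are skipped here.
    return features


ft_lines = [
    'FT   source          1..2881',
    'FT                   /organism="Drosophila melanogaster"',
    'FT                   /chromosome="3"',
    'FT                   /map="67A8-B2"',
    'FT                   /mol_type="genomic DNA"',
    'FT                   /db_xref="taxon:7227"',
    'FT   gene            80..2881',
    'FT                   /gene="eIF4E"',
]

features = parse_feature_table(ft_lines)
source, gene = features
database, identifier = source["qualifiers"]["db_xref"].split(":")
print(source["location"], gene["location"], database, identifier)
```

Splitting the db_xref value on the colon separates the external database name from the identifier that is unique within that database.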

FT   mRNA            join(80..224,892..1458,1550..1920,1986..2085,2317..2404,
FT                   2466..2881)
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-I"
FT   mRNA            join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-II"

Table 1.1 Indicating locations within the feature table.

Location	Description
345	Single position within the sequence
345..500	A continuous range of positions bounded by and including the indicated positions
<345..500	A continuous range of positions, where the exact lower boundary is not known; the feature begins somewhere prior to position 345 but ends at position 500
345..>500	A continuous range of positions, where the exact upper boundary is not known; the feature begins at position 345 but ends somewhere after position 500
<1..888	The feature starts before the first sequenced base and continues to position 888
(102.110)	Indicates that the exact location is unknown, but that it is one of the positions between 102 and 110, inclusive
123^124	Points to a site between positions 123 and 124
123^177	Points to a site between two adjacent nucleotides or amino acids anywhere between positions 123 and 177
join(12..78,134..202)	Regions 12–78 and 134–202 are joined to form one contiguous sequence
complement(4918..5126)	The sequence complementary to that found from 4918 to 5126 in the sequence record
J00194:100..202	Positions 100–202, inclusive, in the entry in this database having accession number J00194
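To make the notation in Table 1.1 more concrete, here is a minimal parsing sketch. It is not an official INSDC parser: the function names are hypothetical, and it deliberately covers only single positions, simple ranges, the < and > partial markers, join(...), and an outermost complement(...). The (102.110), 123^124, and remote-accession forms are left out.

```python
import re

# One span: optional "<", a start, then optionally ".." with optional ">" and an end.
SPAN = re.compile(r"^(<?)(\d+)(?:\.\.(>?)(\d+))?$")


def parse_span(text):
    """Parse '345', '345..500', '<345..500', or '345..>500' into a dict."""
    match = SPAN.match(text)
    if match is None:
        raise ValueError(f"unsupported span: {text!r}")
    start = int(match.group(2))
    end = int(match.group(4)) if match.group(4) else start
    return {
        "start": start,
        "end": end,
        "partial_start": match.group(1) == "<",
        "partial_end": match.group(3) == ">",
    }


def parse_location(text):
    """Parse a location, optionally wrapped in complement(...) and/or join(...)."""
    text = text.strip()
    if text.startswith("complement(") and text.endswith(")"):
        inner = parse_location(text[len("complement("):-1])
        inner["complement"] = not inner["complement"]
        return inner
    if text.startswith("join(") and text.endswith(")"):
        spans = [parse_span(part.strip()) for part in text[len("join("):-1].split(",")]
    else:
        spans = [parse_span(text)]
    return {"complement": False, "spans": spans}


print(parse_location("join(12..78,134..202)"))
```

Keeping the partial markers as separate flags, rather than discarding them, preserves the distinction between a known boundary and one that lies outside the sequenced region.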


The next feature in this example indicates which regions form the two mRNA transcripts for this gene, the first for eukaryotic initiation factor 4E-I and the second for eukaryotic initiation factor 4E-II. In the first case (shown above), the join line indicates that six distinct DNA segments are transcribed and spliced together to form the mature RNA transcript. In the second case, the second segment (892..1458) is missing, so only five distinct DNA segments contribute to the mature RNA transcript. These are the two splice variants that are ultimately encoded by this molecule.
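The segment counts described above can be checked directly from the two join locations. The following sketch uses a hypothetical helper, join_segments, and assumes simple join(a..b,...) locations with no partial markers.

```python
def join_segments(location):
    """Turn 'join(a..b,c..d,...)' into a list of (start, end) tuples."""
    inner = location.strip()[len("join("):-1]
    segments = []
    for part in inner.split(","):
        start, end = part.split("..")
        segments.append((int(start), int(end)))
    return segments


def spliced_length(segments):
    """Total number of bases covered by the segments (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in segments)


mrna_4e_i = join_segments(
    "join(80..224,892..1458,1550..1920,1986..2085,2317..2404,2466..2881)"
)
mrna_4e_ii = join_segments(
    "join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)"
)

# Segments present in the 4E-I transcript but absent from the 4E-II transcript.
skipped = [segment for segment in mrna_4e_i if segment not in mrna_4e_ii]

print(len(mrna_4e_i), len(mrna_4e_ii), skipped)
print(spliced_length(mrna_4e_i), spliced_length(mrna_4e_ii))
```

The two transcripts differ only by the 567-base segment at 892..1458.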

FT   CDS             join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
FT                   /codon_start=1
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-II"
FT                   /note="Method: conceptual translation with partial peptide
FT                   sequencing"
FT                   /db_xref="GOA:P48598"
FT                   /db_xref="InterPro:IPR001040"
FT                   /db_xref="InterPro:IPR019770"
FT                   /db_xref="InterPro:IPR023398"
FT                   /db_xref="PDB:4AXG"
FT                   /db_xref="PDB:4UE8"
FT                   /db_xref="PDB:4UE9"
FT                   /db_xref="PDB:4UEA"
FT                   /db_xref="PDB:4UEB"
FT                   /db_xref="PDB:4UEC"
FT                   /db_xref="PDB:5ABU"
FT                   /db_xref="PDB:5ABV"
FT                   /db_xref="PDB:5T47"
FT                   /db_xref="PDB:5T48"
FT                   /db_xref="UniProtKB/Swiss-Prot:P48598"
FT                   /protein_id="AAC03524.1"
FT                   /translation="MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGE
FT                   PAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTVED
FT                   FWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDNLWL
FT                   DVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDALRLGR
FT                   NNSLQYQLHKDTMVKQGSNVKSIYTL"

Following the mRNA feature is the CDS feature shown above, describing the region that ultimately encodes the protein product. Focusing just on eukaryotic initiation factor 4E-II, the CDS feature also shows a join line with coordinates that are slightly different from those shown in the mRNA feature, specifically at the beginning and end positions. The difference lies in the fact that the 5′ and 3′ untranslated regions (UTRs) are included in the mRNA feature but not in the CDS feature. The CDS feature corresponds to the sequence of amino acids found in the translated protein product whose sequence is shown in the /translation qualifier above. The /codon_start qualifier indicates that the amino acid translation of the first codon begins at the first position of this joined region, with no offset.
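One way to see how the CDS coordinates, /codon_start, and /translation fit together is to compare the length of the joined region with the length of the protein. The sketch below does so for the 4E-II CDS. It assumes that the final codon of a complete CDS is a stop codon that does not appear in the /translation string.

```python
# CDS segments copied from the join(...) location of the 4E-II CDS feature.
cds_segments = [(201, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]
codon_start = 1  # value of the /codon_start qualifier

# Amino acid sequence from the /translation qualifier, one string per FT line.
translation = (
    "MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGE"
    "PAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTVED"
    "FWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDNLWL"
    "DVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDALRLGR"
    "NNSLQYQLHKDTMVKQGSNVKSIYTL"
)

# Coordinates are inclusive, so each segment contributes end - start + 1 bases.
cds_length = sum(end - start + 1 for start, end in cds_segments)

# /codon_start=1 means no bases are skipped before the first codon.
coding_length = cds_length - (codon_start - 1)
codons = coding_length // 3

print(cds_length, codons, len(translation))
```

Under the stated assumption, the joined region holds one more codon than there are residues in the protein, the extra codon being the stop.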

The /protein_id qualifier shows the accession number for the corresponding entry in the protein databases (AAC03524.1) and is hyperlinked, enabling the user to go directly to that entry. These unique identifiers use a “3 + 5” format: three letters, followed by five digits. Versions are indicated by the number that follows the decimal point; when the protein sequence in the record changes, the version is incremented by one. The assignment of a gene product or protein name (via the /product qualifier) often is subjective, sometimes being assigned via weak similarities to other (and sometimes poorly annotated) sequences. Given the potential for the transitive propagation of poor annotations (that is, bad data tend to beget more bad data), users are advised to consult curated nucleotide and protein sequence databases for the most up-to-date, accurate information regarding the putative function of a given sequence. Finally, notice the extensive cross-referencing via the /db_xref qualifier to entries in InterPro, the Protein Data Bank (PDB), and UniProtKB/Swiss-Prot, as well as to a Gene Ontology annotation (GOA; Gene Ontology Consortium 2017).
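The “3 + 5” identifier format and its version suffix can be expressed as a short pattern. This is only a sketch of the format as described in the text; the function name is hypothetical, and identifiers that follow other layouts are rejected rather than handled.

```python
import re

# Three uppercase letters, five digits, a dot, and an integer version.
PROTEIN_ID = re.compile(r"^([A-Z]{3})(\d{5})\.(\d+)$")


def parse_protein_id(text):
    """Split an identifier such as 'AAC03524.1' into its parts."""
    match = PROTEIN_ID.match(text)
    if match is None:
        raise ValueError(f"not in 3 + 5 accession.version format: {text!r}")
    prefix, digits, version = match.groups()
    return {"accession": prefix + digits, "version": int(version)}


print(parse_protein_id("AAC03524.1"))
```

Keeping the accession and the version as separate fields makes it straightforward to ask whether two records refer to the same protein entry at different sequence versions.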

Implicit in the source feature and the organism that is assigned to it is the genetic code used to translate the nucleic acid sequence into a protein sequence when a CDS feature is present in the record. Also, the DNA-centric nature of these feature tables means that all features are mapped through a DNA coordinate system, not that of amino acid reference points, as shown in the examples in Appendices 1.3 and 1.4.
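Because features are mapped through DNA coordinates, locating a given amino acid means walking through the joined CDS segments. The sketch below, using a hypothetical function name and the 4E-II CDS segments from this record, returns the three DNA positions that encode a chosen residue.

```python
def codon_coordinates(cds_segments, residue, codon_start=1):
    """Return the three DNA positions encoding the given 1-based residue number."""
    if residue < 1:
        raise IndexError("residue numbers start at 1")
    # Flatten the inclusive segments into one ordered list of DNA positions.
    positions = [p for start, end in cds_segments for p in range(start, end + 1)]
    offset = (codon_start - 1) + 3 * (residue - 1)
    codon = positions[offset:offset + 3]
    if len(codon) < 3:
        raise IndexError(f"residue {residue} lies beyond the end of the CDS")
    return codon


cds_4e_ii = [(201, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]

print(codon_coordinates(cds_4e_ii, 1))    # first codon of the CDS
print(codon_coordinates(cds_4e_ii, 9))    # first codon of the second segment
print(codon_coordinates(cds_4e_ii, 132))  # codon split across two segments
```

The third call shows why amino acid positions alone are not enough: a single codon can straddle a splice junction, so its three bases are not contiguous in the genomic sequence.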

SQ   Sequence 2881 BP; 849 A; 699 C; 585 G; 748 T; 0 other;
     cggttgcttg ggttttataa catcagtcag tgacaggcat ttccagagtt gccctgttca        60
     acaatcgata gctgcctttg gccaccaaaa tcccaaactt aattaaagaa ttaaataatt       120
     cgaataataa ttaagcccag taacctacgc agcttgagtg cgtaaccgat atctagtata       180
     .
     . <truncated for brevity>
     .
     aaacggaacc ccctttgtta tcaaaaatcg gcataatata aaatctatcc gctttttgta      2820
     gtcactgtca ataatggatt agacggaaaa gtatattaat aaaaacctac attaaaaccg      2880
     g                                                                      2881
//

Finally, at the end of every nucleotide sequence record, one finds the actual nucleotide sequence, with 60 bases per row. Note that, in the SQ line signaling the beginning of this section of the record, not only is the overall length of the sequence provided, but a count of how many of each individual type of nucleotide base is also provided, making it quite easy to compute the GC content of this sequence.
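As a small worked example, the GC content can be read off the SQ line without touching the sequence itself. The sketch below assumes the SQ line layout shown in this record; the function name is hypothetical.

```python
import re


def base_counts(sq_line):
    """Extract per-base counts such as '849 A;' from an ENA SQ line."""
    pairs = re.findall(r"(\d+) ([ACGT]|other);", sq_line)
    return {base: int(count) for count, base in pairs}


sq_line = "Sequence 2881 BP; 849 A; 699 C; 585 G; 748 T; 0 other;"
counts = base_counts(sq_line)

total = sum(counts.values())
gc_content = (counts["G"] + counts["C"]) / total

print(counts)
print(f"GC content: {gc_content:.1%}")
```

Here the counts sum to the stated length of 2881 bases, and (699 + 585) / 2881 gives a GC content of roughly 44.6%.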

中文译文

第 1 章 Biological Sequence Databases

Nucleotide Sequence Flatfiles: A Dissection

由于 flatfile 是序列数据库中信息的基本单位，并且承担着促进这些数据库之间信息交换的作用，因此有必要理解 flatfile 中每一个字段代表什么，以及记录的不同部分可以包含哪些类型的信息。虽然不同 flatfile 格式之间存在一些细微差异，但它们都可以分为三个主要部分：header，即包含整条记录相关信息和描述符的头部；feature table，即为序列提供相关注释的特征表；以及序列本身。

The Header

header 是一条记录中最能体现数据库特异性的部分。这里我们将以 ENA 版本的记录作为讨论对象（完整记录见附录 1.1），并在附录 1.2 中给出相应的 DDBJ 和 GenBank 版本 header。记录的第一行提供了该记录所含序列的基本识别信息，名称也很贴切，称为 ID 行；它对应于 DDBJ/GenBank 中的 LOCUS 行。

ID
U54469; SV 1; linear; genomic DNA; STD; INV; 2881 BP.

登录号显示在 ID 行中，后面跟着它的序列版本（这里是第一个版本，即 SV 1）。由于这里是 SV 1，因此它等同于前文所述的 U54469.1。随后给出的是 DNA 分子的拓扑结构（linear，线性）和分子类型（genomic DNA，基因组 DNA）。下一个元素表示该序列在 ENA 中的数据类别：STD，表示一条“标准的、已注释并已组装的序列”。数据类别用于将序列记录归入不同的功能分区，使用户能够查询数据库中的特定子集。Box 1.1 对这些功能分区作了说明。最后，ID 行还给出目标序列的分类分区（INV，表示无脊椎动物；见 Internet Resources）及其长度（2881 个碱基对）。登录号也会单独出现在紧随 ID 行之后的 AC 行中。

Box 1.1 核苷酸数据库中的功能分区

将核苷酸序列记录组织为彼此分离的功能类型，使用户能够查询这些数据库中特定的记录子集。此外，如果知道某条序列来自某个以特定技术为导向的数据库，用户就可以从适当的生物学角度解释这些数据。下面介绍其中几个分区。每类功能分区的示例（ENA 称之为“数据类别”）可通过本章 Internet Resources 部分列出的 ENA Data Formats 页面中的示例链接查看。

CON

由全基因组测序工作产生的染色体、基因组和其他长 DNA 序列的构建记录（或“contigged”记录）。这一分区中的记录不包含序列数据；相反，它们包含的是组装指令，用于说明如何组装多个数据库记录中的序列数据。

EST

Expressed Sequence Tags，表达序列标签。这些记录包含来自 mRNA（cDNA）的短单次读取序列，长度通常为 300-500 bp，且通常会大量产生。EST 代表给定组织或给定发育阶段中表达情况的一个快照。它们是给定 cDNA 文库中表达的标签，其中有些编码蛋白，有些则不编码。

GSS

Genome Survey Sequences，基因组调查序列。它与 EST 分区类似，但这些序列来源于基因组。GSS 分区包含但不限于：单次通过读取的基因组调查序列、细菌人工染色体（bacterial artificial chromosome, BAC）或酵母人工染色体（yeast artificial chromosome, YAC）的末端序列、外显子捕获的基因组序列，以及 Alu 聚合酶链式反应（polymerase chain reaction, PCR）序列。

HTG

High-Throughput Genome sequences，高通量基因组序列。它们是由高通量测序中心产生的未完成 DNA 序列，会以加速方式提供给科学界，用于同源性和相似性搜索。这一分区中的条目包含关键词，用来指示其在测序流程中所处的阶段。HTG 序列一旦完成，就会被移入相应的数据库分类分区。

STD

包含一条标准的、已注释并已组装序列的记录。

STS

Sequence-Tagged Sites，序列标签位点。这类序列较短，长度为 200-500 bp，是操作上唯一的序列，可识别 PCR 实验中使用的一组引物对，从而生成一种能够定位到基因组中单一位置的试剂。STS 分区旨在促进 STS 与其他分区中序列之间的交叉比较，以便将匿名序列的图谱位置与已知基因关联起来。

WGS

Whole-Genome Shotgun sequences，全基因组鸟枪法序列。它们是采用鸟枪法策略的项目所产生的序列数据。这类项目会生成大量短序列读段，随后可由计算机算法将这些读段组装为序列重叠群（contigs）、更高阶的脚手架序列（scaffolds），有时还可以组装为接近染色体长度或达到染色体长度的序列。

ID 行之后是一个或多个日期行（以 DT 表示），用于说明该条目最初创建或最后更新的时间。对于我们关注的这条序列，该条目最初创建于 1996 年 5 月 19 日，并于 2017 年 6 月 23 日在 ENA 中最后更新：

DT
19-MAY-1996 (Rel. 47, Created)
DT
23-JUN-2017 (Rel. 133, Last updated, Version 5)

每一行中的 release number 表示该条目创建或最后更新之后的第一个季度发布版本。条目的版本号出现在第二行，使用户能够很容易地判断自己查看的是否是某条特定序列的最新记录。请注意，这不同于前面描述的 accession.version 格式：记录中的某些元素可能发生了变化，但序列本身可能保持不变，因此这两种不同类型的版本号并不总是彼此对应。

header 的下一部分包含 definition lines，用于简明描述该记录中包含的生物学信息类型。definition line 在 ENA 中标记为 DE，在 DDBJ/GenBank 中标记为 DEFINITION，其形式如下。

DE
Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene,
DE
complete cds, alternatively spliced.

生成这些 definition lines 时需要非常谨慎。虽然许多 definition lines 可以由记录中的其他部分自动生成，但仍会经过人工审查，以确保信息的一致性和丰富度。显然，想用一行文本捕捉一条序列背后的全部生物学信息是不可能的；不过，同一条记录后续部分很快就会给出这些丰富的信息。

沿着 flatfile 记录继续往下看，可以看到目标序列的完整分类学信息。OS 行（在 DDBJ/GenBank 中为 SOURCE 行）给出该序列来源物种的首选科学名称，后面用括号给出生物体的常用名称。OC 行（在 DDBJ/GenBank 中为 ORGANISM 行）包含来源生物体的完整分类学分类。分类信息按照自上而下的方式列出，就像分类树中的节点一样，最一般的类群（Eukaryota，真核生物）排在最前面。

OS
Drosophila melanogaster (fruit fly)
OC
Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota;
OC
Neoptera; Holometabola; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC
Drosophilidae; Drosophila; Sophophora.

每条记录都必须至少包含一条参考文献或引用，记录在所谓的 reference blocks 中。这些 reference blocks 用于给予科学贡献归属，并提供背景，说明为什么要测定这条特定序列。reference blocks 的形式如下。

RN
[1]
RP
1-2881
RX
DOI; 10.1074/jbc.271.27.16393.
RX
PUBMED; 8663200.
RA
Lavoie C.A., Lachance P.E., Sonenberg N., Lasko P.;
RT
"Alternatively spliced transcripts from the Drosophila eIF4E gene produce
RT
two different Cap-binding proteins";
RL
J Biol Chem 271(27):16393-16398(1996).
XX
RN
[2]
RP
1-2881
RA
Lasko P.F.;
RT
;
RL
Submitted (09-APR-1996) to the INSDC.
RL
Paul F. Lasko, Biology, McGill University, 1205 Avenue Docteur Penfield,
RL
Montreal, QC H3A 1B1, Canada

在这个例子中显示了两条参考信息，一条指向已发表的论文，另一条指向该序列记录本身的提交。上面的第二个 reference block 提供了第一篇论文中资深作者的信息，以及该作者的邮寄地址。虽然第二个 reference block 中的日期说明了该序列（及其附带信息）提交到数据库的时间，但它并不表示该记录首次公开发布的时间，因此不能根据这个日期推断或声称该记录的首次公开发布时间。每当序列更新时，还可以向记录中添加新的提交者信息块。

有些 header 可能包含 COMMENT 行（DDBJ/GenBank）或 CC 行（ENA）。这些行可以包含多种多样的说明和注释（描述符），它们都指向整条记录。基因组中心常常使用这些行提供联系信息并表达致谢。注释还可以包含序列的历史。如果某条记录中的序列被更新，comment 会包含一个指向该记录先前版本的指针。反过来，如果检索到的是较早版本的记录，comment 会指向较新的版本；如果还存在更早的版本，也会向后指向那些版本。最后，还有数据库交叉引用行（标记为 DR），它们提供链接，指向包含目标序列相关信息的关联数据库。在附录 1.1 中这条记录的完整 header 里，可以看到一条指向 FlyBase 的交叉引用。需要注意的是，附录 1.2 中相应的 DDBJ/GenBank header 并不包含这些交叉引用。

The Feature Table

在 INSDC 各合作机构早期协作时，人们就努力寻找一种共同方式，用来表示某一数据库记录中包含的生物学信息。这种共同表示方式称为 feature table，即特征表。它由三类内容组成：feature keys，即特征键，用一个单词或缩写表示所描述的生物学属性；location information，即位置信息，说明该特征位于序列中的什么位置；以及额外的 qualifiers，即限定符，用于提供关于该特征的补充描述信息。INSDC 在线 feature table 文档非常详尽，详细说明了允许使用哪些特征，以及每一种特征可以搭配哪些限定符。feature table 中的措辞会尽可能采用常见的生物学研究术语，并且在 DDBJ、ENA 和 GenBank 条目之间保持一致。

这里我们将解析来自 Drosophila melanogaster 的 eukaryotic transcription factor 4E 基因的 feature table。该表在附录 1.3（ENA 格式）和附录 1.4（DDBJ/GenBank 格式）中均完整给出。这条特定序列存在可变剪接，产生两个不同的基因产物：4E-I 和 4E-II。feature table 中的第一个信息块始终是 source feature，它指出序列的生物学来源，以及与整条序列相关的补充信息。所有 INSDC 条目都必须包含这个 feature，因为所有 DNA 或 RNA 序列都来源于某种具体的生物学来源，包括合成 DNA。

FT
source
1..2881
FT
/organism="Drosophila melanogaster"
FT
/chromosome="3"
FT
/map="67A8-B2"
FT
/mol_type="genomic DNA"
FT
/db_xref="taxon:7227"
FT
gene
80..2881
FT
/gene="eIF4E"

在 source key 的第一行中，请注意其编号方式：它用两个由两个点分隔的数字（1..2881）表示该 feature key 覆盖的位置范围。由于 source key 涉及整条序列，因此可以推断，这个条目所描述的序列长度为 2881 个核苷酸。表 1.1 展示了表示任一特征位置的多种方式，这些方式能够覆盖范围很广的生物学场景。随后出现的是限定符，每个限定符前面都有一个斜杠。这里给出了该生物体的完整科学名称，也给出了具体的图谱坐标，说明这条序列位于 3 号染色体的 67A8-B2 图谱位置。同时还指出了被测序的分子类型（genomic DNA）。最后一行表示一个数据库交叉引用，缩写为 db_xref，指向 NCBI taxonomy database；其中 taxon 7227 对应 D. melanogaster。一般来说，这些交叉引用是受控限定符，允许条目通过外部数据库中唯一的标识符连接到外部数据库。在上面的 source block 之后是 gene feature，它表明该基因本身是此条目中整条序列的一个子集，起始于位置 80，终止于位置 2881。

FT
mRNA
join(80..224,892..1458,1550..1920,1986..2085,2317..2404,
FT
2466..2881)
FT
/gene="eIF4E"
FT
/product="eukaryotic initiation factor 4E-I"
FT
mRNA
join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)
FT
/gene="eIF4E"
FT
/product="eukaryotic initiation factor 4E-II"

Table 1.1 feature table 中位置的表示方式

表示方式	含义
345	序列中的单一位置
345..500	一个连续的位置范围，包含所示的两个边界位置
<345..500	一个连续的位置范围，但精确的下边界未知；该特征始于位置 345 之前的某处，并终止于位置 500
345..>500	一个连续的位置范围，但精确的上边界未知；该特征始于位置 345，并终止于位置 500 之后的某处
<1..888	该特征起始于第一个已测序碱基之前，并延续到位置 888
(102.110)	表示精确位置未知，但它是 102 到 110 之间的某一个位置，包含两端位置
123^124	指向位置 123 和 124 之间的一个位点
123^177	指向位置 123 到 177 之间任意两个相邻核苷酸或氨基酸之间的一个位点
join(12..78,134..202)	区域 12-78 和 134-202 被连接起来，形成一条连续序列
complement(4918..5126)	序列记录中 4918 到 5126 位置所对应序列的互补序列
J00194:100..202	登录号为 J00194 的数据库条目中 100-202 的位置，包含两端位置

本例中的下一个 feature 指出哪些区域构成该基因的两个 mRNA 转录本：第一个对应 eukaryotic initiation factor 4E-I，第二个对应 eukaryotic initiation factor 4E-II。在第一种情况（如上所示）中，join 行表示 6 个不同的 DNA 片段被转录形成成熟 RNA 转录本；而在第二种情况中，第二个区域缺失，只有 5 个不同的 DNA 片段被转录为成熟 RNA 转录本。因此，这个分子最终编码出两个剪接变体。

FT
CDS
join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
FT
/codon_start=1
FT
/gene="eIF4E"
FT
/product="eukaryotic initiation factor 4E-II"
FT
/note="Method: conceptual translation with partial peptide
FT
sequencing"
FT
/db_xref="GOA:P48598"
FT
/db_xref="InterPro:IPR001040"
FT
/db_xref="InterPro:IPR019770"
FT
/db_xref="InterPro:IPR023398"
FT
/db_xref="PDB:4AXG"
FT
/db_xref="PDB:4UE8"
FT
/db_xref="PDB:4UE9"
FT
/db_xref="PDB:4UEA"
FT
/db_xref="PDB:4UEB"
FT
/db_xref="PDB:4UEC"
FT
/db_xref="PDB:5ABU"
FT
/db_xref="PDB:5ABV"
FT
/db_xref="PDB:5T47"
FT
/db_xref="PDB:5T48"
FT
/db_xref="UniProtKB/Swiss-Prot:P48598"
FT
/protein_id="AAC03524.1"
FT
/translation="MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGE
FT
PAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTVED
FT
FWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDNLWL
FT
DVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDALRLGR
FT
NNSLQYQLHKDTMVKQGSNVKSIYTL"

mRNA feature 之后是上面所示的 CDS feature，它描述最终编码蛋白质产物的区域。只看 eukaryotic initiation factor 4E-II，CDS feature 也显示了一个 join 行，其坐标与 mRNA feature 中显示的坐标略有不同，差异尤其体现在起始位置和终止位置。原因在于，5′ 和 3′ untranslated regions（UTRs，非翻译区）包含在 mRNA feature 中，但不包含在 CDS feature 中。CDS feature 对应于翻译后蛋白质产物中的氨基酸序列，该序列显示在上面的 /translation 限定符中。/codon_start 限定符表示，第一个密码子的氨基酸翻译从这一连接区域的第一个位置开始，没有偏移。

/protein_id 限定符显示蛋白质数据库中相应条目的登录号（AAC03524.1），并带有超链接，使用户能够直接进入该条目。这些唯一标识符采用“3 + 5”格式，即三个字母后接五个数字。版本号由后面的十进制小数表示；当记录中的蛋白质序列发生变化时，版本号递增 1。为基因产物或蛋白质指定名称（通过 /protein 限定符）往往带有主观性，有时是根据其与其他序列之间较弱的相似性来指定的，而那些其他序列本身有时也注释不佳。由于低质量注释可能会传递式扩散（也就是说，坏数据往往会产生更多坏数据），因此建议用户查阅经过人工审查的核苷酸和蛋白质序列数据库，以获得关于某条序列推定功能的最新、准确信息。最后，请注意 /db_xref 限定符通过大量交叉引用，链接到 InterPro、Protein Data Bank（PDB）和 UniProtKB/Swiss-Prot 中的条目，以及 Gene Ontology annotation（GOA；Gene Ontology Consortium 2017）。

当记录中存在 CDS feature 时，source feature 及其指定的生物体隐含了用于将核酸序列翻译为蛋白质序列的遗传密码。此外，这些 feature table 以 DNA 为中心，这意味着所有特征都是通过 DNA 坐标系统进行定位的，而不是通过氨基酸参考点来定位；附录 1.3 和附录 1.4 中的示例体现了这一点。

SQ
Sequence 2881 BP; 849 A; 699 C; 585 G; 748 T; 0 other;
cggttgcttg ggttttataa catcagtcag tgacaggcat ttccagagtt gccctgttca
60
acaatcgata gctgcctttg gccaccaaaa tcccaaactt aattaaagaa ttaaataatt
120
cgaataataa ttaagcccag taacctacgc agcttgagtg cgtaaccgat atctagtata
180
.
. <truncated for brevity>
.
aaacggaacc ccctttgtta tcaaaaatcg gcataatata aaatctatcc gctttttgta
2820
gtcactgtca ataatggatt agacggaaaa gtatattaat aaaaacctac attaaaaccg
2880
g
2881
//

最后，在每条核苷酸序列记录的末尾，都可以看到实际的核苷酸序列，每行 60 个碱基。请注意，标志着记录这一部分开始的 SQ 行不仅提供了序列的总长度，还给出了每一种核苷酸碱基的数量，因此可以很容易地计算这条序列的 GC 含量。

004

Graphical Interfaces

PDF page 29 - PDF page 30；印刷页码 9-10

English Source (PDF extracted)

===== PDF page 29 / printed page 9 =====

Graphical Interfaces

Graphical interfaces have been developed to facilitate the interpretation of the data found within text-based flatfiles, with an example of the graphical view of the ENA record for our sequence of interest (U54469.1) shown in Figure 1.1. These graphical views are particularly useful when there is a long list of documented biological features within the feature table, enabling the user to visualize potential interactions or relationships between biological features. An additional example of the use of graphical views to assist in the interpretation of the information found within a database record is provided in the discussion of the NCBI Entrez discovery pathway in Chapter 2, as well as later in this chapter.

===== PDF page 30 / printed page 10, Figure 1.1 caption =====

Figure 1.1 The landing page for ENA record U54469.1, providing a graphical view of biological features found within the sequence of the Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene. The tracks within the graphical view show the position of the gene, mRNAs, and coding regions (marked CDS) within the 2881 bp sequence reported in this record.

中文译文

第 1 章 Biological Sequence Databases

Graphical Interfaces

为了帮助解释基于文本的 flatfile 中包含的数据，研究者开发了图形界面。图 1.1 展示了我们关注的序列（U54469.1）对应 ENA 记录的图形视图示例。当 feature table 中记录了很长一列生物学特征时，这类图形视图尤其有用，因为它能帮助用户直观看到不同生物学特征之间潜在的相互作用或关系。第 2 章关于 NCBI Entrez discovery pathway 的讨论，以及本章后面的内容，还会提供使用图形视图辅助解释数据库记录信息的其他示例。

Figure 1.1

ENA 记录 U54469.1 的登录页面，提供了 Drosophila melanogaster eukaryotic initiation factor 4E（eIF4E）基因序列中生物学特征的图形视图。图形视图中的 tracks 显示了该记录所报告的 2881 bp 序列中，基因、mRNA 和编码区（标记为 CDS）的位置。


005

RefSeq

PDF page 30 - PDF page 31；印刷页码 10-11

English Source (PDF extracted)

===== PDF page 30 / printed page 10 =====

Box 1.2 RefSeq

The first several chapters of this book describe a variety of ways in which sequence data and sequence annotations find their way into public databases. While the combination of data derived from systematic sequencing projects and individual investigators’ laboratories yields a rich and highly valuable set of sequence data, some problems are apparent. The most important issue is that a single biological entity may be represented by many different entries in various databases. It also may not be clear whether a given sequence has been experimentally determined or is simply the result of a computational prediction.

To address these issues, NCBI developed the RefSeq project, the major goal of which is to provide a reference sequence for each molecule in the central dogma (DNA, mRNA, and protein). As each biological entity is represented only once, RefSeq is, by definition, non-redundant. Nucleotide and protein sequences in RefSeq are explicitly linked to one

===== PDF page 31 / printed page 11 =====

中文译文

第 1 章 Biological Sequence Databases

Box 1.2 RefSeq

本书前几章将介绍序列数据和序列注释进入公共数据库的多种途径。系统性测序项目产生的数据，与单个研究者实验室产生的数据结合在一起，形成了一套丰富且极具价值的序列数据资源；但与此同时，也出现了一些明显的问题。最重要的问题是，同一个生物学实体可能在不同数据库中由许多不同条目表示。此外，有时也并不清楚某条序列究竟是通过实验测定得到的，还是仅仅来自计算预测。

为了解决这些问题，NCBI 开发了 RefSeq 项目。该项目的主要目标，是为中心法则中的每一类分子（DNA、mRNA 和蛋白质）提供一条参考序列。由于每个生物学实体只被表示一次，RefSeq 按定义就是非冗余的。RefSeq 中的核苷酸序列和蛋白质序列彼此之间有明确链接。最重要的是，RefSeq 条目会持续接受人工审查，从而保证 RefSeq 条目能够代表关于某条特定 DNA、mRNA 或蛋白质序列的最新知识状态。

RefSeq 条目通过一套独立的登录号系列，与 GenBank 中的其他条目区分开来。RefSeq 登录号遵循“2 + 6”格式：先是一个表示参考序列类型的双字母代码，随后是一个下划线和一个六位数字。通过实验测定得到的序列数据表示如下：

NT_123456
Genomic contigs (DNA)
NM_123456
mRNAs
NP_123456
Proteins

通过基因组注释工作推导得到的参考序列表示如下：

XM_123456
Model mRNAs
XP_123456
Model proteins

理解“N”编号和“X”编号之间的区别很重要：前者表示真实的、通过实验测定得到的序列，而后者表示从原始 DNA 序列推导出的计算预测结果。

更多类型的 RefSeq 条目，以及关于 RefSeq 项目的更多信息，可以在 NCBI RefSeq 网站上找到。

006

Protein Sequence Databases

PDF page 31；印刷页码 11

English Source (PDF extracted)

===== PDF page 31 / printed page 11 =====

Protein Sequence Databases

With the availability of myriad complete genome sequences from both prokaryotes and eukaryotes, significant effort is being dedicated to the identification and functional analysis of the proteins encoded by these genomes. The large-scale analysis of these proteins continues to generate huge amounts of data, including through the use of proteomic methods (Chapter 11) and through protein structure analysis (Chapter 12), to name a few. These and other methods make it possible to identify large numbers of proteins quickly, to map their interactions (Chapter 13), to determine their location within the cell, and to analyze their biological activities. This ever-increasing “information space” reinforces the central role that protein sequence databases play as a resource for storing data generated by these efforts, making them freely available to the life sciences community.

As most sequence data in protein databases are derived from the translation of nucleotide sequences, they can be, in large part, thought of as “secondary databases.” Universal protein sequence databases cover proteins from all species, whereas specialized protein sequence databases concentrate on particular protein families, groups of proteins, or those from a specific organism. Representative model organism databases include the Mouse Genome Database (MGD; Smith et al. 2018) and WormBase (Lee et al. 2018), among others (Baxevanis and Bateman 2015; Rigden and Fernández 2018). Organismal sequence databases are discussed in greater detail in Chapter 2.

Universal protein databases can be divided further into two broad categories: sequence repositories, where the data are stored with little or no manual intervention, and curated databases, in which experts enhance the original data through biocuration. Ensuring interoperability, creating and implementing standards, and adopting best practices aimed at accurately representing the biological knowledge found within the sequence databases are all of paramount importance. Indeed, these curation goals are so important that there is an organization called the International Society for Biocuration, the primary mission of which is to advance these central tenets.

中文译文

第 1 章 Biological Sequence Databases

Protein Sequence Databases

随着原核生物和真核生物众多完整基因组序列的可用，研究者正投入大量精力来鉴定这些基因组所编码的蛋白质，并分析它们的功能。这些蛋白质的大规模分析持续产生海量数据，其中包括蛋白质组学方法（第 11 章）和蛋白质结构分析（第 12 章）等方法所产生的数据，只是其中几个例子。这些方法及其他类似方法使得人们能够迅速鉴定大量蛋白质，绘制它们之间的相互作用（第 13 章），确定它们在细胞中的位置，并分析它们的生物学活性。这一不断增长的“信息空间”进一步强化了蛋白质序列数据库的核心地位：这些数据库承担着存储这些努力所产生数据的任务，并将其免费提供给生命科学社区。

由于蛋白质数据库中的大多数序列数据都来源于核苷酸序列的翻译，因此在很大程度上可以把它们视为“二级数据库”。通用蛋白质序列数据库涵盖所有物种的蛋白质，而专门的蛋白质序列数据库则聚焦于特定蛋白家族、蛋白群或某个特定生物体。具有代表性的模式生物数据库包括 Mouse Genome Database（MGD；Smith et al. 2018）和 WormBase（Lee et al. 2018）等（Baxevanis and Bateman 2015; Rigden and Fernández 2018）。关于生物体序列数据库的内容将在第 2 章更详细讨论。

通用蛋白质数据库还可以进一步分为两大类：序列库，即数据几乎不经过人工干预或几乎不经人工干预即存储的数据库；以及人工审查数据库，即专家通过专业人工审查对原始数据进行增强。确保互操作性、建立并实施标准、采用旨在准确表示序列数据库中生物学知识的最佳实践，其重要性无论怎样强调都不为过。事实上，这些审查目标如此重要，以至于还有一个名为 International Society for Biocuration 的组织，其主要使命就是推进这些核心原则。

007

The NCBI Protein Database

PDF page 32；印刷页码 12

English Source (PDF extracted)

===== PDF page 32 / printed page 12 =====

The NCBI Protein Database

NCBI maintains the Protein database, which derives its content from a number of sources. These include the translations of the annotated coding regions from INSDC databases described above, from RefSeq (Box 1.2), and from NCBI’s Third Party Annotation (TPA) database. The TPA dataset is quite interesting in its own right, as it captures both experimental and inferential data provided by the scientific community to supplement the information found in an INSDC nucleotide entry. As the name suggests, the information in the TPA is provided by third parties and not by the original submitter of the corresponding INSDC entry. The NCBI Protein database also includes additional non-NCBI sources of protein sequence data, including Swiss-Prot, PIR, PDB, and the Protein Research Foundation. Step-by-step methods for performing searches against the NCBI Protein database are described in detail in Chapter 3.


中文译文

第 1 章 Biological Sequence Databases

The NCBI Protein Database

NCBI 维护着 Protein database，其内容来源于多个不同来源。其中包括上文所述的 INSDC 数据库中已注释编码区的翻译结果、RefSeq（Box 1.2），以及 NCBI 的 Third Party Annotation（TPA）数据库。TPA 数据集本身就很有意思，因为它收录了由科学界提供的实验数据和推断数据，用于补充 INSDC 核苷酸条目中的信息。顾名思义，TPA 中的信息由第三方提供，而不是由对应 INSDC 条目的原始提交者提供。NCBI Protein database 还包括来自非 NCBI 来源的其他蛋白质序列数据，例如 Swiss-Prot、PIR、PDB 和 Protein Research Foundation。关于如何对 NCBI Protein database 执行检索的逐步方法，将在第 3 章中详细介绍。

008

UniProt

PDF page 32 下半 - PDF page 35 上半；印刷页码 12-15

English Source (PDF extracted)

===== PDF page 32 / printed page 12 =====

UniProt

Although data repositories are an essential vehicle through which scientists can access sequence data as quickly as possible, it is clear that the addition of biological information from multiple, highly regarded sources greatly increases the power of the underlying sequence data. The UniProt Consortium was formed to accomplish just that, bringing together Swiss-Prot, TrEMBL, and the Protein Information Resource Protein Sequence Database under a single umbrella, called UniProt (UniProt Consortium 2017). UniProt comprises three main databases: the UniProt Archive (UniParc), a non-redundant set of all publicly available protein sequences compiled from a variety of source databases; UniProtKB, combining entries from UniProtKB/Swiss-Prot and UniProtKB/TrEMBL; and the UniProt Reference Clusters (UniRef), containing non-redundant views of the data contained in UniParc and UniProtKB that are clustered at three different levels of sequence identity (Suzek et al. 2015).

Figure 1.2 Results of a search for the human heterogeneous nuclear ribonucleoprotein A1 record within UniProtKB, using the accession number P09651 as the search term. See text for details.

The wealth of information found within a UniProtKB entry can be best illustrated by an example. Here, we will consider the entry for the human heterogeneous nuclear ribonucleoprotein A1, with accession number P09651. A search of UniProtKB using this accession number as the search term produces the view seen in Figure 1.2. The lower part of the left-hand column shows the various types of information available for this protein, and the user can select or de-select sections based on their interests. The main part of the window provides basic identifying information about this sequence, as well as an indication of whether the entry has been manually reviewed and annotated by UniProtKB curators. Here, we see that the entry has indeed been reviewed and that there is experimental evidence that supports the existence of the protein. The next section in the file is devoted to conveying functional information, also providing Gene Ontology (GO) terms that are associated with the entry, as well as links to enzyme and pathway databases such as Reactome (see Chapter 13). Clicking on any of the blue tiles in the left-hand column will jump the user down to the selected section of the entry. For instance, if one clicks on Subcellular location, the view seen in Figure 1.3 is produced, providing a color-coded schematic of the cell indicating the type of annotation (manual or automatic) and links to publications supporting the annotation. The lower part of Figure 1.3 also shows information regarding the protein’s involvement in disease, documenting variants that have been implicated in early onset Paget disease and amyotrophic lateral sclerosis (Kim et al. 2013; Liu et al. 2016).

Figure 1.3 The Subcellular location and Pathology & Biotech sections of the record for the human heterogeneous nuclear ribonucleoprotein A1 within UniProtKB. These sections can be accessed by clicking on the blue tiles in the left-hand column of the window. See text for details.

In the upper left corner of the UniProtKB window are display options that are quite useful in visualizing the significant amount of data found in this entry’s feature table. By clicking on Feature viewer, one is presented with the view shown in Figure 1.4, neatly summarizing the annotations for this sequence in a coordinate-based fashion. Any of the sections can be expanded by clicking on the labels in the blue boxes to the left of the graphic. Here, the post-translational modification (PTM) section has been expanded, showing the position of modified residues in this protein; clicking on any of the markers in the track will produce a pop-up with additional information on the PTM, along with relevant links to the literature. In Figure 1.5, the Structural features and Variants sections have also been expanded, showing the positions of all alpha helices, beta strands, and beta turns within the protein, as well as the location of putatively clinically relevant point mutations. Here, a variant at position 351 is highlighted: a proline-to-leucine variant, identified as part of the ClinVar project (Landrum et al. 2016), that has a possible association with relapsing–remitting multiple sclerosis. By examining different sections of this very useful graphical display, the user can start to see how various features overlap with one another, perhaps indicating whether a known or predicted disease-causing variant falls within a structured region of the protein. These annotations and observations can provide important insights with respect to experimental design and the interpretation of experimental data.

Figure 1.4 The Feature viewer rendering of the record for the human heterogeneous nuclear ribonucleoprotein A1 within UniProtKB. Clicking the Display link, found in the upper left portion of the window, provides access to the Feature viewer. Any of the sections can be expanded by clicking on the labels in the blue boxes to the left of the graphic. See text for details.

Figure 1.5 Expanding the PTM, Structural features, and Variants sections within the Feature viewer display shows the position of all post-translational modifications (PTMs), alpha helices, beta strands, and beta turns within the human heterogeneous nuclear ribonucleoprotein A1, as well as the location of putatively clinically relevant point mutations. Clicking on any of the variants produces a pop-up window with additional information; here, the pop-up window provides disease association data for the proline-to-leucine variant at position 351 of the sequence. See text for details.

中文译文

第 1 章 Biological Sequence Databases

UniProt

数据仓库固然是科学家尽可能快速获取序列数据的重要途径，但显然，如果再加入来自多个高度可信来源的生物学信息，底层序列数据的效力就会大大增强。UniProt Consortium 正是为了实现这一点而成立的，它把 Swiss-Prot、TrEMBL 和 Protein Information Resource Protein Sequence Database 统一纳入一个称为 UniProt 的框架之下（UniProt Consortium 2017）。UniProt 包含三个主要数据库：UniProt Archive，即对来自多种源数据库的所有公开蛋白质序列进行汇编而成的非冗余集合；UniProtKB，它整合了 UniProtKB/Swiss-Prot 和 UniProtKB/TrEMBL 的条目；以及 UniProt Reference Clusters（UniRef），它提供 UniParc 和 UniProtKB 中数据的非冗余视图，并在三个不同的序列一致性水平上进行聚类（Suzek et al. 2015）。

UniProtKB 条目所包含的信息之丰富，最适合通过一个实例来说明。这里我们以人类 heterogeneous nuclear ribonuclear protein A1 的条目为例，其 accession number 为 P09651。使用该 accession number 作为检索词搜索 UniProtKB，得到的界面如图 1.2 所示。左侧栏下半部分显示了该蛋白可用的信息类型，用户可以根据自己的兴趣选择或取消选择相应部分。窗口的主体部分提供该序列的基本识别信息，并指出该条目是否已由 UniProtKB 人工审查人员手工审查和注释。这里我们可以看到，这一条目确实已经经过审查，而且有实验性证据支持该蛋白的存在。文件中的下一部分用于传达功能信息，同时给出与该条目相关的 Gene Ontology（GO）术语，以及指向 Reactome 等酶和通路数据库的链接（见第 13 章）。点击左侧栏中的任意蓝色块，用户就会跳转到条目的相应部分。例如，如果点击 Subcellular location，便会得到图 1.3 所示的视图，其中展示了一个颜色编码的细胞示意图，说明注释类型（人工或自动），并提供支持该注释的文献链接。图 1.3 的下半部分还显示了该蛋白与疾病相关的信息，记录了与早发型 Paget 病和肌萎缩侧索硬化症相关的变体（Kim et al. 2013; Liu et al. 2016）。

在 UniProtKB 窗口的左上角，有一些显示选项，对于可视化该条目 feature table 中的大量数据非常有用。点击 Feature viewer 后，会出现图 1.4 所示的视图，以坐标化方式清晰总结了该序列的注释。任何部分都可以通过点击图形左侧蓝色方框中的标签来展开。这里已经展开了 post-translational modification（PTM）部分，显示该蛋白中修饰残基的位置；点击轨道中的任一标记都会弹出窗口，提供关于该 PTM 的更多信息，以及相关文献链接。

在图 1.5 中，Structural features 和 Variants 两部分也已展开，显示了蛋白内所有 alpha helices、beta strands 和 beta turns 的位置，以及推定具有临床相关性的点突变位置。这里高亮显示的是位置 351 的一个变体；作为 ClinVar project（Landrum et al. 2016）的一部分所识别出的 proline-to-leucine 变体，可能与 relapsing–remitting multiple sclerosis 有关。通过查看这一非常有用的图形显示中的不同部分，用户可以开始看到各类特征如何彼此重叠，这或许能提示一个已知或预测的致病变体是否位于蛋白的某个结构区域内。这些注释和观察可为实验设计和实验数据解释提供重要启发。

Figure 1.2

以 accession number P09651 作为检索词，在 UniProtKB 中搜索人类 heterogeneous nuclear ribosomal protein A1 记录得到的结果。详见正文。

Figure 1.3

UniProtKB 中人类 heterogeneous nuclear ribosomal protein A1 记录的 Subcellular location 和 Pathology & Biotech 部分。点击窗口左侧栏中的蓝色图块即可访问这些部分。详见正文。

Figure 1.4

UniProtKB 中人类 heterogeneous nuclear ribosomal protein A1 记录的 Feature viewer 渲染图。点击窗口左上方的 Display 链接，可以进入 Feature viewer。点击图形左侧蓝色框中的标签，可展开任一部分。详见正文。

Figure 1.5

在 Feature viewer 显示中展开 PTM、Structural features 和 Variants 部分，可以显示人类 heterogeneous nuclear ribosomal protein A1 中所有 post-translational modifications（PTMs）、alpha helices、beta strands 和 beta turns 的位置，以及推定具有临床相关性的点突变位置。点击任一变体会弹出包含更多信息的窗口；这里弹出的窗口提供了该序列 351 位 proline-to-leucine 变体的疾病关联数据。详见正文。


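Glossary (7 entries)

English	Chinese
curator / curation	Continue to use “人工审查人员 / 人工审查”; do not use “策展”.
accession number	Keep “accession number”; it may be glossed in Chinese as “登录号/检索号”, but the English term is preferred in the body text of this project.
Feature viewer	Rendered as “Feature viewer / 特征查看器” on first use; Feature viewer may be kept in the body text.
post-translational modification	Translated as “翻译后修饰”; the abbreviation PTM is kept.
alpha helix / beta strand / beta turn	Translated as “α 螺旋 / β 链 / β 转角”; in a pure ASCII environment, write alpha helix / beta strand / beta turn.
variant	Translated as “变体” in the UniProt/ClinVar context.
heterogeneous nuclear ribonuclear protein A1	The source text likely intends heterogeneous nuclear ribonucleoprotein A1. The translation keeps the English name untranslated; flagged here for later verification.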


009

Summary + Box 1.3 Ensuring the Continued Quality of Data in Public Sequence Databases

PDF page 36 to the upper half of PDF page 37; printed pages 16-17

English source (extracted from PDF)

===== PDF page 36 / printed page 16 =====

Summary

The rapid pace of discovery in the genomic and proteomic arenas requires that databases are built in a way that facilitates not just the storage of these data, but the efficient handling and retrieval of information from these databases. Many lessons have been learned over the decades regarding how to approach critical questions regarding design and content, often the hard way. Thus, the continued development of currently existing databases, as well as the conceptualization and creation of new types of databases, will be a critical focal point for the advancement of biological discovery. As should be obvious from this chapter, keeping databases up to date and accurate is a task that requires the active involvement of the biological community (Box 1.3). Therefore, it is incumbent upon all users to ensure the accuracy of these data in an active fashion, engaging the curators in a continuous dialog so that these widely used resources continue to remain a valuable resource to biologists worldwide.

Box 1.3 Ensuring the Continued Quality of Data in Public Sequence Databases

Given the roles of DDBJ, EMBL, and GenBank in maintaining the archive of all publicly available DNA, RNA, and protein sequences, the continued usefulness of this resource is highly dependent on the quality of data found within it. Despite the high degree of both manual and automated checking that takes place before a record becomes public, errors will still find their way into the databases. These errors may be trivial and have no biological consequence (e.g. an incorrect postal code), may be misleading (e.g. an organism having the correct genus but wrong species name), or downright incorrect (e.g. a full-length mRNA not having a CDS annotated on it). Sometimes, records may have incorrect reference blocks, preventing researchers from linking to the correct publication describing the sequence. Over time, many have taken an active role in reporting these errors but, more often than not, these errors are left uncorrected.

While the individual INSDC members have the responsibility for hosting and disseminating the data found within their databases, keep in mind that the ownership of the data rests with the original submitter – and these original submitters (or their designees) are the only ones who can make updates to their database records. To keep these community resources as accurate and up to date as possible, users are actively encouraged to report any errors found when using the databases in the course of their work so that the database administrators can follow up with the original submitters as appropriate.

Given below are the current e-mail addresses for submitting information regarding errors to the three major sequence databases. As all the databases share information with each other nightly, it is only necessary to report the error to one of the three members of the consortium. Authors are actively encouraged to check their own records periodically to ensure that the information they previously submitted is still accurate. Even though this charge to the community is discussed here in the context of the three major sequence databases, all databases provide similar mechanisms through which incorrect information can be brought to the attention of the database administrators.

DDBJ

ddbjupdt@ddbj.nig.ac.jp

EMBL

datasubs@ebi.ac.uk

GenBank

gb-admin@ncbi.nlm.nih.gov

As alluded to above, the range of publicly available data obviously goes well beyond human data, whether sequence based or not. As the major public sequence databases need to be able to store data in a fairly generalized fashion, these databases often do not contain more specialized types of information that would be of interest to specific segments of the biological community. To address this, many smaller, specialized databases have emerged and have been developed and curated by biologists “in the trenches” to fulfill specific needs. These databases, which contain information ranging from strain crosses to gene expression data, provide a valuable

===== PDF page 37 / printed page 17 =====

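Chapter 1 Biological Sequence Databases

Summary

The pace of discovery in genomics and proteomics is extremely rapid, which requires that databases be built in a way that facilitates not only the storage of these data but also the efficient handling and retrieval of the information they hold. Over the decades, researchers have learned many lessons about how to approach critical questions of database design and content, often the hard way. The continued development of existing databases, together with the conception and creation of new types of databases, will therefore be a key focal point for advancing biological discovery. As this chapter has made clear, keeping databases up to date and accurate is a task that requires the active involvement of the biological community (Box 1.3). It is therefore incumbent upon all users to help ensure the accuracy of these data, engaging the curators in a continuous dialog so that these widely used resources remain valuable to biologists worldwide.

Box 1.3 Ensuring the Continued Quality of Data in Public Sequence Databases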

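Given that DDBJ, EMBL, and GenBank are responsible for maintaining the archive of all publicly available DNA, RNA, and protein sequences, the continued usefulness of this resource depends heavily on the quality of the data within it. Although a record undergoes extensive manual and automated checking before it becomes public, errors still find their way into the databases. These errors may be trivial and have no biological consequence (e.g. an incorrect postal code), may be misleading (e.g. an organism with the correct genus but the wrong species name), or may be outright incorrect (e.g. a full-length mRNA with no CDS annotated on it). Sometimes the reference blocks in a record are incorrect, preventing researchers from linking to the correct publication describing the sequence. Over time, many people have taken an active role in reporting these errors but, more often than not, the errors remain uncorrected.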
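Some of the errors described above can be screened for automatically before a report is sent to the database. As a rough illustration, the sketch below lists the feature keys in a GenBank-style flatfile and flags an mRNA record that has no CDS feature. It assumes the usual flatfile layout (feature keys indented by five spaces beneath a `FEATURES` header, qualifiers indented further); the record and the helper names are hypothetical. The check cannot tell whether an mRNA is full-length, so a flagged record still needs to be examined by a person.

```python
def feature_keys(flatfile_text: str) -> list[str]:
    """Return the feature keys (source, gene, CDS, ...) in order of appearance."""
    keys = []
    in_features = False
    for line in flatfile_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
        elif in_features and line[:1].strip():
            # A new unindented section header (e.g. ORIGIN) ends the feature table.
            in_features = False
        elif in_features and line.startswith(" " * 5) and line[5:6].strip():
            # Feature-key lines start in column 6; qualifier lines are indented further.
            keys.append(line.split()[0])
    return keys


def missing_cds(flatfile_text: str, molecule_type: str) -> bool:
    """Flag an mRNA record whose feature table has no CDS feature."""
    return molecule_type == "mRNA" and "CDS" not in feature_keys(flatfile_text)


# Made-up record for illustration; not a real database entry.
RECORD = """\
LOCUS       EXAMPLE1     600 bp    mRNA    linear   PRI 01-JAN-2000
FEATURES             Location/Qualifiers
     source          1..600
                     /organism="Homo sapiens"
     gene            1..600
                     /gene="EXAMPLE"
ORIGIN
"""

print(feature_keys(RECORD))
print(missing_cds(RECORD, "mRNA"))
```

For real records, a dedicated flatfile parser is a safer choice than this hand-rolled line scan, which ignores many details of the format.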

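Although the individual INSDC members are responsible for hosting and disseminating the data in their databases, keep in mind that ownership of the data rests with the original submitters, and only these original submitters (or their designees) can update their database records. To keep these community resources as accurate and up to date as possible, users are actively encouraged to report any errors they find while using the databases in the course of their work, so that the database administrators can follow up with the original submitters as appropriate.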

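Listed below are the current e-mail addresses for reporting errors to the three major sequence databases. Because all of the databases share information with one another nightly, an error needs to be reported to only one of the three members of the consortium. Authors are also actively encouraged to check their own records periodically to ensure that the information they previously submitted is still accurate. Although this charge to the community is discussed here in the context of the three major sequence databases, all databases provide similar mechanisms through which incorrect information can be brought to the attention of the database administrators.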

DDBJ

ddbjupdt@ddbj.nig.ac.jp

EMBL

datasubs@ebi.ac.uk

GenBank

gb-admin@ncbi.nlm.nih.gov

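As alluded to above, the range of publicly available data obviously goes well beyond human data and is not limited to sequence-based data. Because the major public sequence databases need to store data in a fairly generalized fashion, they often do not contain the more specialized types of information that would be of interest to specific segments of the biological community. To address this, many smaller, specialized databases have emerged, developed and curated by biologists “in the trenches” to fulfill specific needs. These databases, which contain information ranging from strain crosses to gene expression data, are a valuable complement to the better-known public sequence databases, and users are encouraged to make intelligent use of both types of databases. An annotated list of such databases can be found in the yearly Database issue of Nucleic Acids Research (Rigden and Fernández 2018).

This chapter appears at the start of the book because an understanding of biological databases is the first step toward performing robust and accurate bioinformatic analyses. Readers are strongly encouraged to take the time to understand the structure of the data in these databases, as this is the foundation for finding sequence data of interest and for performing the more advanced analyses described in later chapters.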

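Glossary (7 entries)

English	Chinese
curator	Continue to translate as “人工审查人员”.
curated by biologists	Translated as “由生物学家开发和人工审查”; avoid “策展”.
in the trenches	Translated in context as “身处‘一线’”, which keeps the metaphor of the original while suiting Chinese textbook style.
public sequence databases	Translated as “公共序列数据库”.
original submitter	Translated as “最初提交者”.
database administrators	Translated as “数据库管理员”.
Database issue	Translated as “Database 专刊”, keeping the proper name of the journal issue.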


010

Acknowledgments + Internet Resources

Middle of PDF page 37; printed page 17

English source (extracted from PDF)

===== PDF page 37 / printed page 17 =====

Acknowledgments

The author thanks Rolf Apweiler for the use of material from the third edition of this book.

Internet Resources

DDBJ Database Divisions

www.ddbj.nig.ac.jp/ddbj/data-categories-e.html

DNA Data Bank of Japan (DDBJ)

www.ddbj.nig.ac.jp

EMBL Nucleotide Sequence Database

www.embl.org

ENA Data Formats

www.ebi.ac.uk/ena/submit/data-formats

European Bioinformatics Institute

www.ebi.ac.uk

GenBank

www.ncbi.nlm.nih.gov

GenBank Database Divisions

www.ncbi.nlm.nih.gov/genbank/htgs/divisions

Gene Ontology Consortium

geneontology.org

INSDC Feature Table Definition

insdc.org/documents/feature_table.html

International Society for Biocuration

biocuration.org

NCBI Data Model

www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/DATAMODL.HTML

NCBI Protein Database

www.ncbi.nlm.nih.gov/protein

Nucleic Acids Research Database issue

academic.oup.com/nar

Protein Data Bank (PDB)

www.rcsb.org

Protein Identification Resource (PIR)

pir.georgetown.edu

Protein Research Foundation

www.proteinresearch.net

RefSeq

www.ncbi.nlm.nih.gov/refseq

Swiss-Prot (EBI)

www.ebi.ac.uk/uniprot

Swiss-Prot (ExPASy)

web.expasy.org/docs/swiss-prot_guideline.html

UniProt Consortium

www.uniprot.org

Translation

Chapter 1 Biological Sequence Databases

Acknowledgments

The author thanks Rolf Apweiler for permission to use material from the third edition of this book.

Internet Resources

DDBJ Database Divisions

www.ddbj.nig.ac.jp/ddbj/data-categories-e.html

DNA Data Bank of Japan (DDBJ)

www.ddbj.nig.ac.jp

EMBL Nucleotide Sequence Database

www.embl.org

ENA Data Formats

www.ebi.ac.uk/ena/submit/data-formats

European Bioinformatics Institute

www.ebi.ac.uk

GenBank

www.ncbi.nlm.nih.gov

GenBank Database Divisions

www.ncbi.nlm.nih.gov/genbank/htgs/divisions

Gene Ontology Consortium

geneontology.org

INSDC Feature Table Definition

insdc.org/documents/feature_table.html

International Society for Biocuration

biocuration.org

NCBI Data Model

www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/DATAMODL.HTML

NCBI Protein Database

www.ncbi.nlm.nih.gov/protein

Nucleic Acids Research Database issue

academic.oup.com/nar

Protein Data Bank (PDB)

www.rcsb.org

Protein Identification Resource (PIR)

pir.georgetown.edu

Protein Research Foundation

www.proteinresearch.net

RefSeq

www.ncbi.nlm.nih.gov/refseq

Swiss-Prot (EBI)

www.ebi.ac.uk/uniprot

Swiss-Prot (ExPASy)

web.expasy.org/docs/swiss-prot_guideline.html

UniProt Consortium

www.uniprot.org

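Glossary (6 entries)

English	Chinese
Internet Resources	Translated as “互联网资源”.
Database Divisions	Translated as “数据库分部”.
European Bioinformatics Institute	Translated as “欧洲生物信息学研究所”.
INSDC Feature Table Definition	Translated as “INSDC Feature Table 定义”; Feature Table is kept for consistency with the earlier terminology.
International Society for Biocuration	The English name of the organization is kept for now, to avoid mechanically rendering biocuration as “策展”.
Nucleic Acids Research Database issue	Translated as “Nucleic Acids Research Database 专刊”.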


011

References

PDF page 38; printed page 18

English source (extracted from PDF)

References

Apweiler, R. (2001). Functional information in Swiss-Prot: the basis for large-scale characterization of protein sequences. Briefings Bioinf. 2:9-18.

Bairoch, A. (2000). Serendipity in bioinformatics: the tribulations of a Swiss bioinformatician through exciting times! Bioinformatics. 16:48-64.

Baxevanis, A.D. and Bateman, A. (2015). The importance of biological databases in biological discovery. Curr. Protoc. Bioinf. 50:1.1.1-1.1.8.

Benson, D.A., Cavanaugh, M., Clark, K. et al. (2018). GenBank. Nucleic Acids Res. 46:D41-D47.

Cook, C.E., Bergman, M.T., Cochrane, G. et al. (2018). The European Bioinformatics Institute in 2017: data coordination and integration. Nucleic Acids Res. 46:D21-D29.

Dayhoff, M.O., Eck, R.V., Chang, M.A., and Sochard, M.R. (1965). Atlas of Protein Sequence and Structure. Silver Spring, MD: National Biomedical Research Foundation.

Gene Ontology Consortium (2017). Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 45:D331-D338.

Green, E.D., Rubin, E.M., and Olson, M.V. (2017). The future of DNA sequencing. Nature. 550:179-181.

Karsch-Mizrachi, I., Takagi, T., and Cochrane, G., on behalf of the International Nucleotide Sequence Database Collaboration (2018). The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 46:D48-D51.

Kim, H.J., Kim, N.C., Wang, Y.D. et al. (2013). Mutations in prion-like domains in hnRNPA2B1 and hnRNPA1 cause multisystem proteinopathy and ALS. Nature. 495:467-473.

Kodama, Y., Mashima, J., Kosuge, T. et al. (2018). DNA Data Bank of Japan: 30th anniversary. Nucleic Acids Res. 46:D30-D35.

Landrum, M.J., Lee, J.M., Benson, M. et al. (2016). ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44:D862-D868.

Lee, R.Y.N., Howe, K.L., Harris, T.W. et al. (2018). WormBase 2017: molting into a new stage. Nucleic Acids Res. 46:D869-D874.

Lipman, D.J. and Pearson, W.R. (1985). Rapid and sensitive protein similarity searches. Science. 227:1435-1441.

Liu, Q., Shu, S., Wang, R.R. et al. (2016). Whole-exome sequencing identifies a missense mutation in hnRNPA1 in a family with flail arm ALS. Neurology. 87:1763-1769.

Rigden, D.J. and Fernández, X.M. (2018). The 2018 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res. 46:D1-D7.

Silvester, N., Alako, B., Amid, C. et al. (2018). The European Nucleotide Archive in 2017. Nucleic Acids Res. 46:D36-D40.

Smith, C.L., Blake, J.A., Kadin, J.A. et al., and The Mouse Genome Database Group (2018). Mouse Genome Database (MGD)-2018: knowledgebase for the laboratory mouse. Nucleic Acids Res. 46:D836-D842.

Suzek, B.E., Wang, Y., Huang, H. et al., and The UniProt Consortium (2015). UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 31:926-932.

UniProt Consortium (2017). UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45:D158-D169.

Chapter author statement

This chapter was written by Dr. Andreas D. Baxevanis in his private capacity. No official support or endorsement by the National Institutes of Health or the United States Department of Health and Human Services is intended or should be inferred.

Translation

Chapter 1 Biological Sequence Databases

References

Apweiler, R. (2001). Functional information in Swiss-Prot: the basis for large-scale characterization of protein sequences. Briefings Bioinf. 2:9-18.

Bairoch, A. (2000). Serendipity in bioinformatics: the tribulations of a Swiss bioinformatician through exciting times! Bioinformatics. 16:48-64.

Baxevanis, A.D. and Bateman, A. (2015). The importance of biological databases in biological discovery. Curr. Protoc. Bioinf. 50:1.1.1-1.1.8.

Benson, D.A., Cavanaugh, M., Clark, K. et al. (2018). GenBank. Nucleic Acids Res. 46:D41-D47.

Cook, C.E., Bergman, M.T., Cochrane, G. et al. (2018). The European Bioinformatics Institute in 2017: data coordination and integration. Nucleic Acids Res. 46:D21-D29.

Dayhoff, M.O., Eck, R.V., Chang, M.A., and Sochard, M.R. (1965). Atlas of Protein Sequence and Structure. Silver Spring, MD: National Biomedical Research Foundation.

Gene Ontology Consortium (2017). Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 45:D331-D338.

Green, E.D., Rubin, E.M., and Olson, M.V. (2017). The future of DNA sequencing. Nature. 550:179-181.

Karsch-Mizrachi, I., Takagi, T., and Cochrane, G., on behalf of the International Nucleotide Sequence Database Collaboration (2018). The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 46:D48-D51.

Kim, H.J., Kim, N.C., Wang, Y.D. et al. (2013). Mutations in prion-like domains in hnRNPA2B1 and hnRNPA1 cause multisystem proteinopathy and ALS. Nature. 495:467-473.

Kodama, Y., Mashima, J., Kosuge, T. et al. (2018). DNA Data Bank of Japan: 30th anniversary. Nucleic Acids Res. 46:D30-D35.

Landrum, M.J., Lee, J.M., Benson, M. et al. (2016). ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44:D862-D868.

Lee, R.Y.N., Howe, K.L., Harris, T.W. et al. (2018). WormBase 2017: molting into a new stage. Nucleic Acids Res. 46:D869-D874.

Lipman, D.J. and Pearson, W.R. (1985). Rapid and sensitive protein similarity searches. Science. 227:1435-1441.

Liu, Q., Shu, S., Wang, R.R. et al. (2016). Whole-exome sequencing identifies a missense mutation in hnRNPA1 in a family with flail arm ALS. Neurology. 87:1763-1769.

Rigden, D.J. and Fernández, X.M. (2018). The 2018 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res. 46:D1-D7.

Silvester, N., Alako, B., Amid, C. et al. (2018). The European Nucleotide Archive in 2017. Nucleic Acids Res. 46:D36-D40.

Smith, C.L., Blake, J.A., Kadin, J.A. et al., and The Mouse Genome Database Group (2018). Mouse Genome Database (MGD)-2018: knowledgebase for the laboratory mouse. Nucleic Acids Res. 46:D836-D842.

Suzek, B.E., Wang, Y., Huang, H. et al., and The UniProt Consortium (2015). UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 31:926-932.

UniProt Consortium (2017). UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45:D158-D169.

---

This chapter was written by Dr. Andreas D. Baxevanis in his private capacity. No official support or endorsement by the National Institutes of Health or the United States Department of Health and Human Services is intended or should be inferred.

English	Chinese
Further Reading	Translated as “延伸阅读”.
bioinformatics landscape	Translated as “生物信息学格局”.
Database issue	Continue to use “Database 专刊”.
publicly available bioinformatic databases	Translated as “公开可用的生物信息学数据库”.
