Biological Sequence Databases
Andreas D. Baxevanis
Introduction
Over the past several decades, there has been a feverish push to understand, at the most
elementary of levels, what constitutes the basic “book of life.” Biologists (and scientists in gen-
eral) are driven to understand how the millions or billions of bases in an organism’s genome
contain all of the information needed for the cell to conduct the myriad metabolic processes
necessary for the organism’s survival – information that is propagated from generation to
generation. To have a basic understanding of how the collection of individual nucleotide
bases drives the engine of life, large amounts of sequence data must be collected and stored
in a way that allows them to be searched and analyzed easily. To this end, much effort has
gone into the design and maintenance of biological sequence databases. These databases have
had a significant impact on the advancement of our understanding of biology not just from
a computational standpoint but also through their integrated use alongside studies being
performed at the bench.
The history of sequence databases began in the early 1960s, when Margaret Dayhoff and
colleagues (1965) at the National Biomedical Research Foundation (NBRF) collected all of the
protein sequences known at that time – all 65 of them – and published them in a book called
the Atlas of Protein Sequence and Structure. It is important to remember that, at this point in the
history of biology, the focus was on sequencing proteins through traditional techniques such
as the Edman degradation rather than on sequencing DNA, hence the overall small number
of available sequences. By the late 1970s, when a significant number of nucleotide sequences
became available, those were also included in later editions of the Atlas. As this collection
evolved, it included text-based descriptions to accompany the protein sequences, as well as
information regarding the evolution of many protein families. This work, in essence, was the
first annotated sequence database, even though it was in printed form. Over time, the amount
of data contained in the Atlas became unwieldy and the need for it to be available in electronic
form became obvious. From the early 1970s to the late 1980s, the contents of the Atlas were
distributed electronically by NBRF (and later by the Protein Information Resource, or PIR) on
magnetic tape, and the distribution included some basic programs that could be used to search
and evaluate distant evolutionary relationships.
The next phase in the history of sequence databases was precipitated by the veritable explo-
sion in the amount of nucleotide sequence data available to researchers by the end of the
1970s. To address the need for more robust public sequence databases, the Los Alamos National
Laboratory (LANL) created the Los Alamos DNA Sequence Database in 1979, which became
known as GenBank in 1982 (Benson et al. 2018). Meanwhile, the European Molecular Biology
Laboratory (EMBL) created the EMBL Nucleotide Sequence Data Library in 1980. Throughout
the 1980s, EMBL (then based in Heidelberg, Germany), LANL, and (later) the National Center
for Biotechnology Information (NCBI, part of the National Library of Medicine at the National
Institutes of Health) jointly contributed DNA sequence data to these databases. This was done
by having teams of curators manually transcribe and interpret what was published in
print journals to an electronic format more appropriate for computational analyses. The DNA
Databank of Japan (DDBJ; Kodama et al. 2018) joined this DNA data-collecting collabora-
tion a few years later. By the late 1980s, the quantity of DNA sequence data being produced
was so overwhelming that print journals began asking scientists to electronically submit their
DNA sequences directly to these databases, rather than publishing them in printed journals
or papers. In 1988, after a meeting of these three groups (now referred to as the International
Nucleotide Sequence Database Collaboration, or INSDC; Karsch-Mizrachi et al. 2018), there
was an agreement to use a common data exchange format and to have each database update
only the records that were directly submitted to it. Thanks to this agreement, all three centers
(EMBL, DDBJ, and NCBI) now collect direct DNA sequence submissions and distribute them
so that each center has copies of all of the sequences, with each center acting as a primary distri-
bution center for these sequences. DDBJ/EMBL/GenBank records are updated automatically
every 24 hours at all three sites, meaning that all sequences can be found within DDBJ, the
European Nucleotide Archive (ENA; Silvester et al. 2018), and GenBank in short order. That
said, each database within the INSDC has the freedom to display and annotate the sequence
data as it sees fit.
In parallel with the early work being done on DNA sequence databases, the foundations
for the Swiss-Prot protein sequence database were also being laid in the early 1980s by Amos
Bairoch, who recounts its history from an engaging perspective in a first-person review (Bairoch
2000). Bairoch converted PIR’s Atlas to a format similar to that used by EMBL for its nucleotide
database. In this initial release, called PIR+, additional information about each of the pro-
teins was added, increasing its value as a curated, well-annotated source of information on
proteins. In the summer of 1986, Bairoch began distributing PIR+ on the US BIONET (a pre-
cursor to the Internet), renaming it Swiss-Prot. At that time, it contained the grand sum of
3900 protein sequences. Although tiny by today’s standards, this was seen at the time as an
overwhelming amount of data. As Swiss-Prot and EMBL followed similar formats, a natural collaboration
developed between these two groups, and these collaborative efforts strengthened when both
EMBL’s and Swiss-Prot’s operations were moved to EMBL’s European Bioinformatics Insti-
tute (EBI; Cook et al. 2018) in Hinxton, UK. One of the first collaborative projects undertaken
by the Swiss-Prot and EMBL teams was to create a new and much larger protein sequence
database supplement to Swiss-Prot. Maintaining the high quality of Swiss-Prot entries was a
time-consuming process involving extensive sequence analysis and detailed curation by expert
annotators (Apweiler 2001). To allow the quick release of protein data not yet annotated
to Swiss-Prot’s stringent standards, a new database called TrEMBL (for “translation of EMBL
nucleotide sequences”) was created. This supplement to Swiss-Prot initially consisted of
computationally annotated sequence entries derived from the translation of all coding sequences
(CDSs) found in INSDC databases. In 2002, a new effort involving the Swiss Institute of Bioin-
formatics, EMBL-EBI, and PIR was launched, called the UniProt consortium (UniProt Con-
sortium 2017). This effort gave rise to the UniProt Knowledgebase (UniProtKB), consisting
of Swiss-Prot, TrEMBL, and PIR. A similar effort also gave rise to the NCBI Protein Database,
bringing together data from numerous sources and described more fully in the text that follows.
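The computational translation of coding sequences that underlies TrEMBL can be illustrated with a minimal sketch. It is not the pipeline used by the TrEMBL team; it assumes the standard genetic code, a CDS supplied on the coding strand and already in frame, and it ignores alternative translation tables and other annotation details. The names `translate_cds` and `CODON_TABLE` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
# Assumption: only the standard code is handled; organelle and other
# alternative translation tables are out of scope for this sketch.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence into a protein sequence.

    Translation stops at the first stop codon. Codons containing ambiguous
    bases are reported as 'X'. Trailing bases that do not form a complete
    codon are ignored.
    """
    seq = cds.upper().replace("U", "T")  # accept RNA input as well as DNA
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":  # stop codon reached
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCTAA"))  # prints MA
```

A protein sequence produced this way carries no expert annotation, which is why the resulting entries are distinguished from the manually curated Swiss-Prot entries.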
The completion of human genome sequencing and the sequencing of numerous model
genomes, as well as the existence of a gargantuan number of sequences in general, provide
a golden opportunity for biological scientists, owing to the inherent value of these data. At
the same time, the sheer magnitude of data also presents a conundrum to the inexperienced
user, resulting not just from the size of the “sequence information space” but from the
fact that the information space continues to get larger by leaps and bounds. Indeed, the
sequencing landscape has changed significantly in recent years with the development of new
high-throughput technologies that generate more and more sequence data in a way that is
best described as “better, cheaper, faster,” with these advances feeding into the “insatiable
appetite” that scientists have for more and more sequence data (Green et al. 2017). Given the
inherent value of the data contained within these sequence databases, this chapter will focus
on providing the reader with a solid understanding of these major public sequence databases,
as a first step toward being able to perform robust and accurate bioinformatic analyses.