Chapter 2

Information Retrieval from Biological Databases

4 sections

013

Introduction

PDF page 39 to top of PDF page 40; printed pages 19-20

English Source (PDF extracted)

Andreas D. Baxevanis

Introduction

On April 14, 2003, the biological community celebrated the achievement of the Human Genome Project’s major goal: the complete, accurate, and high-quality sequencing of the human genome (International Human Genome Sequencing Consortium 2001; Schmutz et al. 2004). The attainment of this goal, which many have compared to landing a person on the moon, has had a profound effect on how biological and biomedical research is conducted and will undoubtedly continue to have a profound effect on its direction in the future. The availability of not just human genome data, but also human sequence variation data, model organism sequence data, and information on gene structure and function provides fertile ground for biologists to better design and interpret their experiments in the laboratory, fulfilling the promise of bioinformatics in advancing and accelerating biological discovery.

One of the most important databases available to biologists is GenBank, the annotated collection of all publicly available DNA and protein sequences (Benson et al. 2017; see Chapter 1). This database, maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH), represents a collaborative effort between NCBI, the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ). At the time of this writing, GenBank contained over 200 million sequences and over 300 trillion nucleotide bases. The completion of human genome sequencing and the sequencing of an ever-expanding number of model organism genomes, as well as the existence of a gargantuan number of sequences in general, provides a golden opportunity for biological scientists, owing to the inherent value of these data. However, at the same time, the sheer magnitude of data presents a conundrum to the inexperienced user, resulting not just from the size of the “sequence information space” but from the fact that the information space continues to get larger and larger – by leaps and bounds – at a pace that will continue to accelerate, even though human genome sequencing has long been “completed.”

The effect of the Human Genome Project and other systematic sequencing projects on the continued accumulation of sequence data is illustrated by the growth of GenBank, as shown in Figure 2.1; the exponential growth rate illustrated in the figure is expected to continue for some time to come. The continued expansion of not just the sequence space but of the myriad biological data now available because of the expansion of the sequence space underscores the necessity for all biologists to learn how to effectively navigate this information for effective use in their work – even allowing investigators to avoid performing expensive experiments themselves based on the data found within these virtual treasure troves.

GenBank (or any other biological database, for that matter) serves little purpose unless the data can be easily searched and entries retrieved in a useful, meaningful format. Otherwise, sequencing efforts such as those described above have no useful end – without effective search and retrieval tools, the biological community as a whole cannot make use of the information hidden within these millions of bases and amino acids, much less the structures they form or the mutations they harbor. Much effort has gone into making such data accessible to the biologist, and a selection of the programs and interfaces resulting from these efforts is the focus of this chapter. The discussion will center on querying databases maintained by NCBI, as these more “general” repositories are far and away the ones most often accessed by biologists, but attention will also be given to specialized databases that provide information not necessarily found through the use of Entrez, NCBI’s integrated information retrieval system.

[Figure 2.1 plot: x-axis, years 1982-2016; y-axis labels “Base pairs (squares, billions)” and “Sequences (circles, millions)”.]

Figure 2.1 The exponential growth of GenBank in terms of number of nucleotides (squares, in millions) and number of sequences submitted (circles, in thousands). Source data for the figure have been obtained from the National Center for Biotechnology Information (NCBI) web site. Note that the period of accelerated growth after 1997 coincides with the completion of the Human Genome Project’s genetic and physical mapping goals, setting the stage for high-accuracy, high-throughput sequencing, as well as the development of new sequencing technologies (Collins et al. 1998, 2003; Green et al. 2011).

Bioinformatics, Fourth Edition. Edited by Andreas D. Baxevanis, Gary D. Bader, and David S. Wishart.

Companion Website: www.wiley.com/go/baxevanis/Bioinformatics_4e

Integrated Information Retrieval: The Entrez System

One of the most widely used interfaces for the retrieval of information from biological databases is the NCBI Entrez system. Entrez capitalizes on the fact that there are pre-existing, logical relationships between the individual entries found in numerous public databases. For example, a paper in PubMed may describe the sequencing of a gene whose sequence appears in GenBank. The nucleotide sequence, in turn, may code for a protein product whose sequence is stored in NCBI’s Protein database. The three-dimensional structure of that protein may be known, and the coordinates for that structure may appear in NCBI’s Structure database. Finally, there may be allelic or structural variants documented for the gene of interest, cataloged in databases such as the Single Nucleotide Polymorphism Database (called dbSNP) or the Database of Genomic Structural Variation (called dbVar), respectively. The existence of such natural connections, all having a biological underpinning, motivated the development of a method through which all of the information about a particular biological entity could be found without having to sequentially visit and query individual databases, one by one.
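The cross-database links described above can also be followed programmatically. The sketch below is illustrative only and is not taken from the chapter: it assumes NCBI's E-utilities web service (the `esearch.fcgi` and `elink.fcgi` endpoints) as the access route, and it only builds request URLs rather than contacting NCBI. The search term, the record ID `12345`, and the helper names `esearch_url`, `elink_url`, and `LINK_HOPS` are placeholders chosen for illustration.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities, assumed here as the programmatic route into Entrez.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that asks one Entrez database for record IDs matching `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def elink_url(dbfrom: str, db: str, uid: str) -> str:
    """Build an ELink URL that asks which records in `db` are linked to `uid` in `dbfrom`."""
    params = {"dbfrom": dbfrom, "db": db, "id": uid}
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


# Hops mirroring the example in the text:
# paper -> nucleotide sequence -> protein product -> three-dimensional structure.
LINK_HOPS = [("pubmed", "nuccore"), ("nuccore", "protein"), ("protein", "structure")]

if __name__ == "__main__":
    # Placeholder query; "[Gene Name]" restricts the search to the gene name field.
    print(esearch_url("nuccore", "BRCA1[Gene Name]"))
    # Placeholder ID. In a real traversal, each hop would use the IDs
    # returned by the previous request rather than one fixed value.
    for source_db, target_db in LINK_HOPS:
        print(elink_url(source_db, target_db, "12345"))
```

Each printed URL could be fetched with any HTTP client. Keeping URL construction separate from the network call makes the database-to-database hops easy to inspect before any request is sent.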

Entrez, to be clear, is not a database itself. Rather, it is the interface through which its component databases can be accessed and traversed – an integrated information retrieval system. The Entrez information space includes PubMed records, nucleotide and protein sequence data, information on conserved protein domains, three-dimensional structure information, and genomic variation data with potential clinical relevance, a good number of which will be touched upon in this chapter. The strength of Entrez lies in the fact that all of this information, across a large number of component databases, can be accessed by issuing one – and only

[EOF - end of section; next section: Integrated Information Retrieval: The Entrez System]

Chinese Translation

2 Information Retrieval from Biological Databases

Introduction

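On April 14, 2003, the biological community celebrated the achievement of the Human Genome Project’s major goal: sequencing the human genome completely, accurately, and to a high standard of quality (International Human Genome Sequencing Consortium 2001; Schmutz et al. 2004). Many have compared this achievement to landing a person on the moon; it has profoundly changed how biomedical research is conducted and will undoubtedly continue to shape the direction of the field. Human genome data, human sequence variation data, model organism sequence data, and information on gene structure and function together give biologists a rich foundation for better designing and interpreting laboratory experiments, and they fulfill the promise of bioinformatics to advance and accelerate biological discovery.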

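For biologists, one of the most important databases is GenBank, an annotated collection of all publicly available DNA and protein sequences (Benson et al. 2017; see Chapter 1). The database is maintained by the National Center for Biotechnology Information (NCBI), part of the National Institutes of Health (NIH), and is the product of a collaboration among NCBI, the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ). At the time of writing, GenBank contained over 200 million sequences and over 300 trillion nucleotide bases. The completion of human genome sequencing, the ever-expanding sequencing of model organism genomes, and the sharp overall growth in sequence data give biological scientists an extraordinary opportunity, because these data have great inherent value. At the same time, such a volume of data poses a problem for inexperienced users, not only because the “sequence information space” is enormous but also because, even though human genome sequencing has long been “completed,” that space keeps expanding at a rapid and still accelerating pace.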

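The growth of GenBank (Figure 2.1) clearly shows the effect of the Human Genome Project and other systematic sequencing projects on the continued accumulation of sequence data; the exponential growth shown in the figure is expected to continue for some time. The “sequence space” keeps expanding, and the mass of biological data derived from it is growing just as quickly. This underscores the need for all biologists to learn how to navigate this information effectively: both to use these data efficiently in their own work and, where possible, to avoid performing expensive, duplicative experiments by drawing on the data already held in these virtual treasure troves.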

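If the data in GenBank (or any other biological database) cannot be searched easily, and its entries cannot be retrieved in a useful, meaningful format, the database has almost no practical value. In that case, the sequencing efforts described above cannot truly pay off: without effective search and retrieval tools, the biological community as a whole can use neither the information hidden in billions of bases and amino acids nor the structures these molecules form or the mutations they carry. The research community has invested a great deal of effort in making these data practically accessible and usable by biologists, and this chapter focuses on the programs and interfaces that resulted. The discussion centers on querying the databases maintained by NCBI, since these more “general” repositories are the resources biologists access most often; the chapter also covers specialized databases whose information is not necessarily available through Entrez, NCBI’s integrated information retrieval system.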

---

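Figure 2.1 The exponential growth of GenBank, in number of nucleotides (squares, in billions) and number of sequences submitted (circles, in thousands). The source data for the figure were obtained from the National Center for Biotechnology Information (NCBI) web site. Note that the period of accelerated growth after 1997 coincides with the completion of the Human Genome Project’s genetic and physical mapping goals, which laid the groundwork for high-accuracy, high-throughput sequencing and for the development of new technologies (Collins et al. 1998, 2003; Green et al. 2011).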

014

Integrated Information Retrieval: The Entrez System

PDF page 40 to PDF page 58, before the heading "Organismal Sequence Databases Beyond NCBI"; printed pages 20-38