Chapter 7

Predictive Methods Using Protein Sequences

4 subsections

052

Introduction

PDF pages 205-206 (first part); printed pages 185-186

English Source (PDF extracted)

Predictive Methods Using Protein Sequences

Jonas Reeb, Tatyana Goldberg, Yanay Ofran, and Burkhard Rost

Introduction

Simply put, DNA encodes the instructions for life, while proteins constitute the machinery of life. DNA is transcribed into RNA and from there information is delivered into the amino acid sequence of a protein. This simplified version of the “central dogma of molecular biology” formulated by Francis Crick (1958) essentially remains valid, although new discoveries have extended our view (Elbarbary et al. 2016). Furthermore, epigenetic studies have demonstrated that chromatin contains more complex information than just a one-dimensional (1D) string of letters, with the heritability of epigenetic traits having a profound effect on gene expression (Allis and Jenuwein 2016). Nonetheless, the 1D protein sequence ultimately determines the three-dimensional (3D) structure into which the protein will fold (Anfinsen 1973), where it will reside in the cell, with which other molecules it will interact, its biochemical and physiological function, and when and how it will eventually be broken down and reduced back into its building blocks. In sum, the function (or, in the case of a disease, the malfunction) of every protein is encoded in the sequence of amino acids.

The central dogma suggests that everything about a protein can be inferred from its DNA sequence – so, why then analyze protein sequences? It turns out that computationally converting DNA to protein sequence is challenging and we still do not understand exactly how to identify the structure of a protein based on the DNA that encodes it. It is even more difficult to predict transcripts from DNA. Fortunately, many experimental approaches, including proteomics methods, can be used to deduce protein sequences, as discussed in Chapter 11.

The advent of “next-generation” DNA sequencing technologies is generating a wealth of raw sequence data about which very little is known (Martinez and Nelson 2010; Goodwin et al. 2016). The pace at which sequences are accumulating far exceeds the ability of experimental biologists to decipher their biochemical traits and biological functions. The gap between the number of proteins of known sequence and known function – the “sequence–function gap” – is ever increasing, requiring improved computational approaches to predict aspects of a protein’s function from its amino acid sequence. A similar sequence–structure gap exists for proteins, in that there are 180 million protein sequences available but only about 150 000 different known protein 3D structures have been determined as of this writing (Berman et al. 2000; UniProt Consortium 2016).
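As a rough illustration of the sequence–structure gap, the sketch below divides the two round figures quoted in the text. The variable names are chosen here for illustration only, and the result is an order-of-magnitude estimate rather than a database statistic.

```python
# Round figures quoted in the text (as of the chapter's writing).
known_sequences = 180_000_000   # protein sequences available
known_structures = 150_000      # experimentally determined 3D structures

# Fraction of sequences that have an experimental 3D structure.
structure_coverage = known_structures / known_sequences

# Printed as a percentage with four decimal places.
print(f"Structure coverage: {structure_coverage:.4%}")
```

With these figures the coverage is well below one-tenth of one percent, that is, roughly one structure for every 1200 sequences.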

Determining a protein’s function begins with an analysis of what is already known. This means that every protein must be compared with all others, which implies that the computational time needed to study protein function grows as the square of sequence growth – a tremendous challenge for computational biology and bioinformatics. In the following sections, we survey some of the research approaches and computational tools that have been shown to successfully predict aspects of the structure and function of a protein from its amino acid sequence.
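To make the quadratic growth concrete, the minimal sketch below counts the unordered pairs in an all-against-all comparison of n sequences, which is n(n - 1)/2. The function name `pairwise_comparisons` is hypothetical and introduced only for this illustration; it counts comparisons and says nothing about the cost of any single comparison.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


# Doubling the number of sequences roughly quadruples the number of comparisons.
small = pairwise_comparisons(1_000)
large = pairwise_comparisons(2_000)
growth_factor = large / small
print(small, large, round(growth_factor, 3))
```

Under this simple counting model, a database that grows by a factor of k requires about k squared times as many pairwise comparisons.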

Bioinformatics, Fourth Edition. Edited by Andreas D. Baxevanis, Gary D. Bader, and David S. Wishart.

Companion Website: www.wiley.com/go/baxevanis/Bioinformatics_4e

One-Dimensional Prediction of Protein Structure

Synopsis

The 1D structure of a protein can be written as a simple string of characters representing the set of natural amino acids – that is, the information content is one dimensional. While more details on protein structure can be found in Chapter 12, in this chapter, we will focus specifically on 1D prediction methods. Predictions of 1D features are relevant for two reasons. First, features such as the number of membrane helices, the disorder in a protein, or the surface residues are often important for protein function. We could determine 1D structure from 3D structures if such structures were experimentally available but, given the sequence–structure gap discussed above, experimental 3D structures are available for fewer than 1% of all known sequences, while 1D predictions can be obtained for all 180 million protein sequences known today. Second, predictions of 1D structure are used as an input for most of the methods that will be described in the section about functional prediction that follows. All of the features that will be described here are available from the PredictProtein server, illustrated in Figure 7.1 and providing pre-computed data on over 20 million proteins (Rost et al. 2004; Kajan et al. 2013; Yachdav et al. 2014).
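The idea of a protein as a plain string over the amino acid alphabet can be sketched as follows. This is a minimal illustration, assuming the 20 standard one-letter codes; the function name `is_standard_protein_sequence` and the toy peptide are made up for this example and do not come from PredictProtein or any other tool named in the text.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_standard_protein_sequence(sequence: str) -> bool:
    """Return True if the non-empty string uses only the 20 standard one-letter codes."""
    return bool(sequence) and set(sequence.upper()) <= STANDARD_AMINO_ACIDS


toy_peptide = "MKTAYIAKQR"  # made-up example, not a real protein
print(len(toy_peptide), is_standard_protein_sequence(toy_peptide))
```

Real sequence records may also contain ambiguity or non-standard codes, which this sketch deliberately rejects.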

Figure 7.1 Dashboard of the PredictProtein web server. PredictProtein (Yachdav et al. 2014) provides a centralized interface to many methods that predict aspects of protein structure and function from sequence. Shown here is a sample of the dashboard for the protein picturesquely named Mothers against decapentaplegic homolog 7 (UniProtKB identifier smad7_human). The black, numbered line in the upper middle indicates the input amino acid sequence of length 428. Below follow predictions of different sequence-based tools, along with a synopsis of the protein family. Predictions include protein–protein binding, protein–DNA/RNA binding, residue exposure (solvent accessibility), secondary structure, and residue flexibility; if found, the predictions also include membrane, long disordered, and coiled-coil regions. Additional information is shown through mouse-over events, here illustrated through the beta strand prediction from the method ReProf. Tabs on the left give access to more detailed views of various predictions and analyses.

053

One-Dimensional Prediction of Protein Structure

PDF pages 206-220; stops before the actual `Predicting Protein Function` heading on PDF page 221; printed pages 186-200

English Source (PDF extracted)

Canonical section: 02_One_Dimensional_Prediction_of_Protein_Structure
English title: One-Dimensional Prediction of Protein Structure
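PDF range: PDF pages 206-220; stops before the actual Predicting Protein Function heading on PDF page 221
Boundary note: This section contains internal subheadings, including Synopsis, Secondary Structure and Solvent Accessibility, Performance Assessment of Secondary Structure Prediction, Transmembrane Alpha Helices and Beta Strands, and Disordered Regions; they are not exposed as peer entries in progress/README.
Extraction note: The PDF text shows obvious run-together words, broken hyphenation, and caption/Box layout noise; the source text below keeps its page markers for manual checking.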