Chapter 4

Genome Browsers

Tyra G. Wolfsberg

Introduction

The first complete sequence of a eukaryotic genome – that of Saccharomyces cerevisiae – was published in 1996 (Goffeau et al. 1996). The chromosomes of this organism, which range in size from 270 to 1500 kb, presented an immediate challenge in data management, as the upper limit for single database entries in GenBank at the time was 350 kb. To better manage the yeast genome sequence, as well as other chromosome and genome-length sequences being deposited into GenBank around that time, the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) established the Genomes division of Entrez (Benson et al. 1997). Entries in this division were organized around a reference sequence onto which all other sequences from that organism were aligned. As these reference sequences have no size limit, “virtual” reference sequences of large genomes or chromosomes could be assembled from shorter GenBank sequences. For partially sequenced chromosomes, NCBI developed methods to integrate genetic, physical, and cytogenetic maps onto the framework of the whole chromosome. Thus, Entrez Genomes was able to provide the first graphical views of large-scale genomic sequence data.

The working draft of the human genome, completed in February 2001 (Lander et al. 2001), generated virtual reference sequences for each human chromosome, ranging in size from 46 to 246 Mb. NCBI created the first version of its human Map Viewer (Wheeler et al. 2001) shortly thereafter, in order to display these longer sequences. Around the same time, the University of California, Santa Cruz (UCSC) Genome Bioinformatics Group was developing its own human genome browser, based on software originally designed for displaying the much smaller Caenorhabditis elegans genome (Kent and Zahler 2000). Similarly, the Ensembl project at the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) was also producing a system to automatically annotate the human genome sequence, as well as store and visualize the data (Hubbard et al. 2002). The three genome browsers all came online at about the same time, and researchers began using them to help navigate the human genome (Wolfsberg et al. 2002). Today, each site provides free access not only to human sequence data but also to a myriad of other assembled genomic sequences, from commonly used model organisms such as mouse to more recently released assemblies such as those of the domesticated turkey. Although the NCBI’s Map Viewer is not being further developed and will be replaced by its new Genome Data Viewer (Sayers et al. 2019), the UCSC and Ensembl Genome Browsers continue to be popular resources, used by most members of the bioinformatics and genomics communities. This chapter will focus on the last two genome browsers.

The reference human genome was sequenced in a clone-by-clone shotgun sequencing strategy and was declared complete in April 2003, although sequencing of selected regions is still continuing. This strategy includes constructing a bacterial artificial chromosome (BAC) tiling map for each human chromosome, then sequencing each BAC using a shotgun sequencing approach (reviewed in Green 2001).

The sequences of individual BACs were deposited into the High Throughput Genomic (HTG) division of GenBank as they became available. UCSC began assembling these BAC sequences into longer contigs in May 2000 (Kent and Haussler 2001), followed by assembly efforts undertaken at NCBI (Kitts 2003). These contigs, which contained gaps and regions of uncertain order, became the basis for the development of the genome browsers. Over time, as the genome sequence was finished, the human genome assembly was updated every few months. After UCSC stopped producing its own human genome assemblies in August 2001, NCBI built eight reference human genome assemblies for the bioinformatics community, culminating with a final assembly in March 2006. Subsequently, an international collaboration that includes the Wellcome Trust Sanger Institute (WTSI), the Genome Institute at Washington University, EBI, and NCBI formed the Genome Reference Consortium (GRC), which took over responsibility for subsequent assemblies of the human genome. This consortium has produced two human genome assemblies, namely GRCh37 in February 2009 and GRCh38 in December 2013. As one might expect, each new genome assembly leads to changes in the sequence coordinates of annotated features. In between the release of major assemblies, GRC creates patches, which either correct errors in the assembly or add alternate loci. These alternate loci are multiple representations of regions that are too variable to be represented by a single reference sequence, such as the killer cell immunoglobulin-like receptor (KIR) gene cluster on chromosome 19 and the major histocompatibility complex (MHC) locus on chromosome 6. Unlike new genome assemblies, patches do not affect the chromosomal coordinates of annotated features. GRCh38.p10 has 282 alternate loci or patches.
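The relationship between major assemblies and patch releases can be made concrete with a small sketch. The Python below is only an illustration: the function names parse_grc_name and same_chromosome_coordinates are hypothetical, and the naming pattern is inferred from the examples in the text (GRCh37, GRCh38, GRCh38.p10). Following the statement above, two releases are treated as sharing chromosomal coordinates when they differ only in patch number.

```python
import re

# Assumed naming pattern, inferred from the names used in the text:
# "GRC" + species letter(s) + major version, optionally followed by ".p<N>".
_GRC_NAME = re.compile(r"(GRC[a-z]+\d+)(?:\.p(\d+))?")


def parse_grc_name(name):
    """Split a GRC assembly name into (major assembly, patch number)."""
    match = _GRC_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognised GRC assembly name: {name!r}")
    major = match.group(1)
    patch = int(match.group(2) or 0)  # no ".pN" suffix means the initial release
    return major, patch


def same_chromosome_coordinates(name_a, name_b):
    """True when both names refer to the same major assembly.

    Patches do not move annotated features on the chromosomes,
    whereas a new major assembly does.
    """
    return parse_grc_name(name_a)[0] == parse_grc_name(name_b)[0]


print(parse_grc_name("GRCh38.p10"))                           # ('GRCh38', 10)
print(same_chromosome_coordinates("GRCh38", "GRCh38.p10"))    # True
print(same_chromosome_coordinates("GRCh37", "GRCh38"))        # False
```

A check of this kind is useful before comparing coordinates copied from two different sources.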

While the GRC also assembles the mouse, zebrafish, and chicken genomes, other genomes are sequenced and assembled by specialized sequencing consortia. The panda genome sequence, published in 2009, was the first mammalian genome to abandon the clone-based sequencing strategies used for human and mouse, relying entirely on next generation sequencing methodologies (Li et al. 2010). Subsequent advances in sequencing technologies have led to rapid increases in the number of complete genome sequences. At the time of this writing, both the UCSC Genome Browser and the main Ensembl web site host genome assemblies of over 100 organisms. The look and feel of each genome browser is the same regardless of the species displayed; however, the types of annotation differ depending on what data are available for each organism.

The backbone of each browser is an assembled genomic sequence. Although the underlying genomic sequence is, with a few exceptions, the same in both genome browsers, each team calculates its annotations independently. Depending on the type of analysis, a user may find that one genome browser has more relevant information than the other. The location of genes, both known and predicted, is a central focus of both genome browsers. For human, at present, both browsers feature the GENCODE gene predictions, an effort that is aimed at providing robust evidence-based reference gene sets (Harrow et al. 2012). Other types of genomic data are also mapped to the genome assembly, including NCBI reference sequences, single-nucleotide polymorphisms (SNPs) and other variants, gene regulatory regions, and gene expression data, as well as homologous sequences from other organisms. Both genome browsers can be accessed through a web interface that allows users to navigate through a graphical view of the genome. However, for those wishing to carry out their own calculations, sequences and annotations can also be retrieved in text format. Each browser also provides a sequence search tool – BLAT (Kent 2002) or BLAST (Camacho et al. 2009) – for interrogating the data via a nucleotide or protein sequence query. (Additional information on both BLAT and BLAST is provided in Chapter 3.)

In order to provide stability and ensure that old analyses can be reproduced, both genome browsers make available not only the current version of the genome assemblies but older ones as well. In addition, annotation tracks, such as the GENCODE gene track and the SNP track, may be based on different versions of the underlying data. Thus, users are encouraged to verify the version of all data (both genome assembly and annotations) when comparing a region of interest between the UCSC and Ensembl Genome Browsers.

This chapter presents general guidelines for accessing the genome sequence and annotations using the UCSC and Ensembl Genome Browsers. Although similar analyses could be carried out with either browser, we have chosen to use different examples at the two sites to illustrate different types of questions that a researcher might want to ask. We finish with a short description of JBrowse (Buels et al. 2016), another web-based genome browser that users can set up on their own servers to share custom genome assemblies and annotations. All of the resources discussed in this chapter are freely available.

The UCSC Genome Browser

After starting in 2000 with just a display of an early draft of the human genome assembly, the UCSC Genome Browser now provides access to assemblies and annotations from over 100 organisms (Haeussler et al. 2019). The majority of assemblies are of mammalian genomes, but other vertebrates, insects, nematodes, deuterostomes, and the Ebola virus are also included. The assemblies from some organisms, including human and mouse, are available in multiple versions. New organisms and assembly versions are added regularly.

The UCSC Browser presents genomic annotation in the form of tracks. Each track provides a different type of feature, from genes to SNPs to predicted gene regulatory regions to expression data. Each organism has its own set of tracks, some created by the UCSC Genome Bioinformatics team and others provided by members of the bioinformatics community. Over 200 tracks are available for the GRCh37 version of the human genome assembly. The newer human genome assembly, GRCh38, has fewer tracks, as not all the data have been remapped from the older assembly. Other genomes are not as well annotated as human; for example, fewer than 20 tracks are available for the sea hare. Some tracks, such as those created from NCBI transcript data, are updated weekly, while others, such as the SNP tracks created from NCBI variant data (Sayers et al. 2019), are updated less frequently, depending on the release schedule of the underlying data. For ease of use, tracks are organized into subsections. For example, depending on the organism, the Genes and Gene Predictions section may include evidence-based gene predictions, ab initio gene predictions, and/or alignment of protein sequences from other species.

The home page of the UCSC Genome Browser provides a stepping-off point for many of the resources developed by the Genome Bioinformatics group at UCSC, including the Genome Browser, BLAT, and the Table Browser, which will be described in detail later in this chapter. The Tools menu provides a link to liftOver, a widely used tool that converts genomic coordinates from one assembly to another. Using this tool, it is possible to update annotation files so that old data can be integrated into a new genome assembly. The Download menu provides an option to download all the sequence and annotation data for each genome assembly hosted by UCSC, as well as some of the source code. The What’s New section provides updates on new genome assemblies, as well as new tools and features. Finally, there is an extensive Help menu, with detailed documentation as well as videos. Users may also submit questions to a mailing list, and most queries are answered within a day.
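To give a feel for what converting coordinates between assemblies involves, the following Python sketch maps a position through a table of aligned blocks. This is a deliberately simplified illustration and not the liftOver implementation: the Block type, the lift_position function, and the block values are all hypothetical, and real conversions must also deal with strand changes, rearrangements, and features that span several blocks.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One ungapped aligned block shared by an old and a new assembly (hypothetical)."""
    old_start: int  # 0-based start on the old assembly
    new_start: int  # 0-based start on the new assembly
    size: int       # number of aligned bases


def lift_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a 0-based position on the old assembly onto the new assembly.

    Returns None when the position lies outside every aligned block,
    meaning it has no counterpart in the new assembly.
    """
    for block in blocks:
        if block.old_start <= pos < block.old_start + block.size:
            return block.new_start + (pos - block.old_start)
    return None


# Made-up example: the new assembly gained 50 bases ahead of the second block.
blocks = [Block(0, 0, 1000), Block(1000, 1050, 500)]

print(lift_position(10, blocks))    # 10
print(lift_position(1200, blocks))  # 1250
print(lift_position(2000, blocks))  # None
```

The last call shows why some features cannot be carried over: a region that is absent from the new assembly simply has no target coordinate.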

The UCSC Genome Browser provides multiple ways for both individual users and larger genome centers to share data with collaborators or even the entire bioinformatics community. These sharing options are available on the My Data link on the home page. Custom Tracks allow users to display their own data as a separate annotation track in the browser. User data must be formatted in a standard data structure in order to be interpreted correctly by the browser. Many commonly used file formats are supported, including Browser Extensible Data (BED), Binary Alignment/Map (BAM), and Variant Call Format (VCF; Box 4.1). Small data files can be uploaded or pasted into the Genome Browser for personal use. Larger files must be saved on the user’s web server and accessed by URL through the Genome Browser. As anyone with the URL can access the data, this method can be used to share data with collaborators. Alternatively, Custom Tracks, along with track configurations and settings, can be shared with selected collaborators using a named Session. Some groups choose to make their Sessions available to the world at large in My Data → Public Sessions. Finally, groups with very large datasets can host their data in the form of a Track Hub so that it can be viewed on the UCSC Genome Browser. When a Track Hub is paired with an Assembly Hub, it can be used to create a browser for a genome assembly not already hosted by UCSC.
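A small Custom Track of the kind that can be pasted into the browser might be assembled as in the Python sketch below. The helper make_custom_track and the example features are hypothetical. The track line attributes used here (name, description, visibility) follow the UCSC custom track conventions as we understand them; the current Help pages should be consulted before relying on them.

```python
def make_custom_track(name, description, features, visibility="pack"):
    """Return custom track text: one "track" header line followed by BED lines.

    `features` is an iterable of (chrom, start, end, label) tuples using
    BED coordinates (0-based start, end not included).
    """
    lines = [f'track name="{name}" description="{description}" visibility={visibility}']
    for chrom, start, end, label in features:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Made-up features for illustration only.
text = make_custom_track(
    "myPeaks",
    "Example peaks",
    [("chr14", 1000, 1500, "peak1"), ("chr14", 2000, 2600, "peak2")],
)
print(text)
```

The resulting text is small enough to paste directly; a larger dataset would instead be written to a file on a web server and loaded by URL, as described above.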

Box 4.1 Common File Types for Genomic Data

Both the UCSC and Ensembl Genome Browsers allow users to upload their own data so that they can be viewed in context with other genome-scale data. User data must be formatted in a commonly used data structure in order to be interpreted correctly by the browser.

Browser Extensible Data (BED) format is a tab-delimited format that is flexible enough to display many types of data. It can be used to display fairly simple features like the location of transcription factor binding sites, as well as more complex ones like transcripts and their exons.
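As an illustration of how simple BED features are laid out, the Python sketch below reads the first few tab-delimited columns of a BED line. The BedFeature type and parse_bed_line function are hypothetical names. One detail worth making explicit is that BED start coordinates are 0-based and the end coordinate is not included, whereas positions shown in the browser are 1-based.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    """Minimal view of a BED record: the three required columns plus a name."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str = "."

    @property
    def length(self) -> int:
        return self.end - self.start

    def browser_position(self) -> str:
        """Render the interval as a 1-based, inclusive position string."""
        return f"{self.chrom}:{self.start + 1}-{self.end}"


def parse_bed_line(line: str) -> BedFeature:
    """Parse one tab-delimited BED line, ignoring columns beyond the name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    name = fields[3] if len(fields) > 3 else "."
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), name)


# Made-up feature for illustration.
feature = parse_bed_line("chr14\t999\t2000\tsite1")
print(feature.length)              # 1001
print(feature.browser_position())  # chr14:1000-2000
```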

Binary Alignment/Map (BAM) format is the compressed binary version of the Sequence Alignment/Map (SAM) format. It is a compact format designed for use with very large files of nucleotide sequence alignments. Because it can be indexed, only the portion of the file that is needed for display is transferred to the browser. Many tools for next generation sequence analysis use BAM format as output or input.

Variant Call Format (VCF) is a flexible format for large files of variation data including single-nucleotide variants, insertions/deletions, copy number variants, and structural variants. Like BAM format, it is compressed and indexed, and only the portion of the file that is needed for display is transferred to the browser. Many tools for variant analysis use VCF format as output or input.
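To show how the variant types listed for VCF appear in a record, the following Python sketch reads the first five tab-delimited columns of a VCF data line (CHROM, POS, ID, REF, ALT) and labels each alternate allele. The function classify_vcf_record and its category labels are hypothetical simplifications; a real analysis would normally use a dedicated VCF library and the INFO fields.

```python
def classify_vcf_record(line):
    """Return (chrom, pos, kinds) for one uncompressed VCF data line.

    POS is 1-based. `kinds` holds one rough label per ALT allele.
    """
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    kinds = []
    for allele in alt.split(","):
        if allele.startswith("<") or "[" in allele or "]" in allele:
            # Symbolic alleles such as <DEL>, or breakend notation.
            kinds.append("symbolic/structural")
        elif len(ref) == 1 and len(allele) == 1:
            kinds.append("SNV")
        elif len(ref) == len(allele):
            kinds.append("MNV")
        else:
            kinds.append("indel")
    return chrom, int(pos), kinds


# Made-up records for illustration.
print(classify_vcf_record("chr14\t100\t.\tA\tG\t.\tPASS\t."))
# ('chr14', 100, ['SNV'])
print(classify_vcf_record("chr14\t200\tvar1\tAT\tA,<DEL>\t50\tPASS\t."))
# ('chr14', 200, ['indel', 'symbolic/structural'])
```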

The UCSC Genome Browser home page lists commonly accessed tools, as well as a frequently updated news section that highlights major data and software updates. To reach the Genome Browser Gateway, the main entry point for text-based searches, click on the Gateway link on the home page (Figure 4.1). The default assembly is the most recent human assembly, GRCh38, from December 2013. The genomes of other species can be selected from the phylogenetic tree on the left side of the Gateway page, or by typing their name in the selection box. On the human Gateway page, there is also the option to select one of four older human genome assemblies. Details about the GRCh38 assembly and instructions for searching are available on the Gateway page.

To perform a search, enter text into the Position/Search Term box. If the query maps to a unique position in the genome, such as a search for a particular chromosome and position, the Go button links directly to the Genome Browser. However, if there is more than one hit for the query, such as a search for the term metalloprotease, the resulting page will contain a list of results that all contain that term. For some species, the terms have been indexed, and typing a gene symbol into the search box will bring up a list of possible matches. In this example, we will search for the human hypoxia inducible factor 1 alpha subunit (HIF1A) gene (Figure 4.1), which produces a single hit on GRCh38.
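The distinction between a positional query and a free-text query can be sketched as follows. This Python fragment is an assumption about how one might pre-check a query string in one's own scripts, not a description of the browser's search code; parse_position is a hypothetical helper, and the coordinates are made up.

```python
import re

# chrom:start-end, with optional thousands separators in the numbers.
_POSITION = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_position(text):
    """Return (chrom, start, end) for a position query, or None for free text.

    Coordinates are 1-based and inclusive, as displayed in the browser.
    """
    match = _POSITION.match(text.strip())
    if match is None:
        return None
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    if start < 1 or end < start:
        return None
    return match.group("chrom"), start, end


print(parse_position("chr14:61,000,001-61,050,000"))  # ('chr14', 61000001, 61050000)
print(parse_position("metalloprotease"))              # None
print(parse_position("HIF1A"))                        # None
```

A query that returns None here would be treated as a search term, which may match one record (as with HIF1A) or many (as with metalloprotease).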

Figure 4.1 The home page of the UCSC Genome Browser, showing a query for the gene HIF1A on the human GRCh38 genome assembly. The organism can be selected by clicking on its name in the phylogenetic tree. For many organisms, more than one genome assembly is available. Typing a term into the Position/Search Term box returns a list of matching gene symbols.

The default Genome Browser view showing the genomic context of the HIF1A gene is shown in Figure 4.2. The navigation controls are presented across the top of the display. The arrows move the window to the left and right along the chromosome. Alternatively, the user can move the display left and right by holding down the mouse button and dragging the window. To zoom in and out, use the buttons at the top of the display. The base button zooms in so far that individual nucleotides are displayed, while the zoom out 100× button will show the entire chromosome if it is pressed a few times. The current genomic position and the length of the window (in nucleotides) are shown above a schematic of chromosome 14, where the current genomic position is highlighted with a red box. A new search term can be entered into the search box.
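The arithmetic behind the zoom buttons can be approximated with a short sketch. The Python below is an assumption about how a window might be rescaled around its center and clipped to the chromosome; it is not the browser's own code, and the zoom function and the chromosome length used in the example are hypothetical.

```python
def zoom(start, end, factor, chrom_length):
    """Rescale a 1-based inclusive window around its center.

    A factor above 1 zooms out and a factor below 1 zooms in.
    The result is clipped to the range 1..chrom_length.
    """
    width = end - start + 1
    new_width = max(1, round(width * factor))
    center = (start + end) // 2
    new_start = center - new_width // 2
    new_end = new_start + new_width - 1
    if new_start < 1:
        # Ran off the left end: pin the window to the start of the chromosome.
        new_start, new_end = 1, min(chrom_length, new_width)
    if new_end > chrom_length:
        # Ran off the right end: pin the window to the end of the chromosome.
        new_end = chrom_length
        new_start = max(1, chrom_length - new_width + 1)
    return new_start, new_end


CHROM_LENGTH = 1_000_000  # made-up chromosome length

print(zoom(10_001, 11_000, 3, CHROM_LENGTH))  # (9000, 11999)

# Pressing a 100x zoom-out twice already covers this small chromosome.
window = zoom(10_001, 11_000, 100, CHROM_LENGTH)
window = zoom(*window, 100, CHROM_LENGTH)
print(window)  # (1, 1000000)
```

Because each press multiplies the window width, a 100× zoom out reaches the whole chromosome after only a few presses, even from a window of a few thousand nucleotides.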

Below the browser window illustrated in Figure 4.2, one would find a list of tracks that are available for display on the assembly. The tracks are separated into nine categories: Mapping and Sequencing, Genes and Gene Predictions, Phenotype and Literature, mRNA and Expressed Sequence Tag (EST), Expression, Regulation, Comparative Genomics, Variation, and Repeats. Clicking on a track name opens the Track Settings page for that track, providing a description of the data displayed in that track. Most tracks can be displayed in one of the following five modes.

Hide: the track is not displayed at all.
Dense: all features are collapsed into a single line; features are not labeled.
Squish: each feature is shown separately, but at 50% the height of full mode; features are not labeled.
Pack: each feature is shown separately, but not necessarily on separate lines; features are labeled.
Full: each feature is labeled and displayed on a separate line.
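The difference between the dense, pack, and full modes listed above is essentially a difference in how features are assigned to rows. The Python sketch below illustrates one plausible layout scheme (merging for dense, greedy row assignment for pack); it is an assumption for teaching purposes, ignores the space taken by labels, and is not the browser's rendering code. Features are (start, end) pairs with inclusive coordinates.

```python
def dense(features):
    """Collapse all features onto one line by merging overlapping intervals."""
    merged = []
    for start, end in sorted(features):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def pack(features):
    """Place each feature on the first row where it does not overlap anything."""
    rows = []
    for feature in sorted(features):
        for row in rows:
            if row[-1][1] < feature[0]:  # row's last feature ends before this one starts
                row.append(feature)
                break
        else:
            rows.append([feature])
    return rows


def full(features):
    """Give every feature its own row."""
    return [[feature] for feature in sorted(features)]


# Made-up features for illustration.
features = [(1, 100), (50, 150), (200, 300), (120, 180)]

print(dense(features))      # [(1, 180), (200, 300)]
print(pack(features))       # [[(1, 100), (120, 180), (200, 300)], [(50, 150)]]
print(len(full(features)))  # 4
```

In this example, pack mode needs two rows where full mode needs four, which is why pack is often the more economical choice for crowded regions.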

Figure 4.2 The default view of the UCSC Genome Browser, showing the genomic context of the human HIF1A gene.

In order to simplify the display, most tracks are in hide mode by default. To change the mode, use the pull-down menu below the track name or on the Track Settings page. Other settings, such as color or annotation details, can also be configured on the Track Settings page. For example, the NCBI RefSeq track allows users to select if they want to view all reference sequences or only those that are curated or predicted (Box 1.2). One possible point of confusion is that the UCSC Genome Browser will “remember” the mode in which each track is displayed from session to session. Custom settings can be cleared by selecting Reset all User Settings under the Genome Browser pull-down menu at the top of any page.

The annotation tracks in the window below the chromosome are the focus of the Genome Browser (Figure 4.2). Tracks are depicted horizontally, with a title above the track and labels on the left. The first two lines show the scale and chromosomal position. The term that was searched for and matched (HIF1A in this case) is highlighted on the annotation tracks. The next tracks shown by default are gene prediction tracks. The default gene track on GRCh38 is the GENCODE Genes set, which replaces the UCSC Genes track that is still displayed on GRCh37 and older human assemblies. GENCODE genes are annotated using a combination of computational analysis and manual curation, and are used by the ENCODE Consortium and other groups as reference gene sets (Box 4.2). The GENCODE v24 track depicts all of the gene models from the GENCODE v24 release, which includes both protein-coding genes and non-coding RNA genes.

Box 4.2 GENCODE

The GENCODE gene set was originally developed by the ENCODE Consortium as a comprehensive source of high-quality human gene annotations (Harrow et al. 2012). It has now been expanded to include the mouse genome (Mudge and Harrow 2015). The goal of the GENCODE project is to include all alternative splice variants of protein-coding loci, as well as non-coding loci and pseudogenes. The GENCODE Consortium uses computational methods, manual curation, and experimental validation to identify these gene features. The first step is carried out by the same Ensembl gene annotation pipeline that is used to annotate all vertebrate genomes displayed at Ensembl (Aken et al. 2016). This pipeline aligns cDNAs, proteins, and RNA-seq data to the human genome in order to create candidate transcript models. All Ensembl transcript models are supported by experimental evidence; no models are created solely from ab initio predictions. The Human and Vertebrate Analysis and Annotation (HAVANA) group produces manually curated gene sets for several vertebrate genomes, including mouse and human. These manually curated genes are merged with the Ensembl transcript models to create the GENCODE gene sets for mouse and human. A subset of the human models has been confirmed by an experimental validation pipeline (Howald et al. 2012).

The consortium makes available two types of GENCODE gene sets. The Comprehensive set encompasses all gene models, and may include many alternatively spliced transcripts (isoforms) for each gene. The Basic set includes a subset of representative transcripts for each gene that prioritizes full-length protein-coding transcripts over partial- or non-protein-coding transcripts. The Ensembl Genome Browser displays the Comprehensive set by default. Although the UCSC Genome Browser displays the Basic set by default, the Comprehensive set can be selected by changing the GENCODE track settings. At the time of this writing, Ensembl is displaying GENCODE v27, released in August 2017. The GENCODE version available by default at the UCSC Genome Browser is v24, from December 2015. More recent versions of GENCODE can be added to the browser by selecting them in the All GENCODE super-track.
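For readers who download the annotation files rather than view them in a browser, the Basic subset can be approximated by filtering the Comprehensive file. The Python sketch below assumes that the GENCODE GTF marks Basic transcripts with a tag "basic" attribute in the ninth column, which matches our understanding of the GENCODE file format but should be confirmed against the release notes. The function names and the example records (with made-up identifiers) are hypothetical.

```python
import re

# Matches attribute entries of the form: tag "value";
_TAG = re.compile(r'(?:^|;\s*)tag "([^"]+)"')


def transcript_tags(gtf_line):
    """Return the set of tag values in the attribute column of a GTF line."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF line has exactly nine tab-delimited columns")
    return set(_TAG.findall(fields[8]))


def basic_transcripts(gtf_lines):
    """Keep only transcript records that carry the (assumed) "basic" tag."""
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        feature_type = line.split("\t")[2]
        if feature_type == "transcript" and "basic" in transcript_tags(line):
            kept.append(line)
    return kept


# Made-up records for illustration.
records = [
    "##description: example",
    'chr14\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; tag "basic"; tag "CCDS";',
    'chr14\tHAVANA\ttranscript\t100\t700\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; tag "mRNA_start_NF";',
    'chr14\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; tag "basic";',
]

print(len(basic_transcripts(records)))  # 1
```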

GENCODE and RefSeq both aim to provide a comprehensive gene set for mouse and human. Frankish et al. (2015) have shown that, in human, the RefSeq gene set is more similar to the GENCODE Basic set, while the GENCODE Comprehensive set contains more alternative splicing and exons, as well as more novel protein-coding sequences, thus covering more of the genome. They also sought to determine which gene set would provide the best reference transcriptome for annotating variants. They found that the GENCODE Comprehensive set, because of its better genomic coverage, was better for discovering new variants with functional potential, while the GENCODE Basic set may be better suited for applications where a less complex set of transcripts is needed. Similarly, Wu et al. (2013) compared the use of different gene sets to quantify RNA-seq reads and determine gene expression levels. Like Frankish et al., they recommend using less complex gene annotations (such as the RefSeq gene set) for gene expression estimates, but more complex gene annotations (such as GENCODE) for exploratory research on novel transcriptional or regulatory mechanisms.

In the GENCODE track, as well as other gene tracks, exons (regions of the transcript that align with the genome) are depicted as blocks, while introns are drawn as the horizontal lines that connect the exons. The direction of transcription is indicated by arrowheads on the introns. Coding regions of exons are depicted as tall blocks, while non-coding exons are shorter. In this example, the GENCODE track depicts five alternatively spliced transcripts, labeled HIF1A on the left, for the HIF1A gene. As shown by the arrowheads, all transcripts are transcribed from left to right. The 5′-most exon of each transcript (on the left side of the display) is shorter on the left, indicating an untranslated region (UTR), and taller on the right, indicating a coding sequence. The reverse is true for the 3′-most exon of each transcript. A very close visual inspection of the Genome Browser shows that the last four HIF1A transcripts have a different pattern of exons from each other; a BLAST search (not shown) reveals that the first two transcripts differ by only three nucleotides in one exon. There is also a transcript labeled HIF1A-AS2, an anti-sense HIF1A transcript that is transcribed from right to left. Another transcript, labeled RP11-618G20.1, is a synthetic construct DNA. Zooming the display out by 3× allows a view of the genes immediately upstream and downstream of HIF1A (Figure 4.3). A second HIF1A antisense transcript, HIF1A-AS1, is also visible.
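The tall and short exon blocks described above follow directly from where the coding sequence starts and ends within the transcript. The Python sketch below shows one way to split an exon into UTR and coding segments given the coding-region boundaries. The function split_exon and the coordinates are hypothetical, intervals are half-open (start included, end excluded), and the sketch says nothing about how the browser itself draws the blocks.

```python
def split_exon(exon_start, exon_end, cds_start, cds_end):
    """Split one exon into ("UTR" = short block, "coding" = tall block) segments.

    All intervals are half-open. A transcript with no coding region can be
    represented with cds_start == cds_end, in which case every exon is UTR.
    """
    parts = []
    # Part of the exon that lies before the coding region.
    left_end = min(exon_end, cds_start)
    if exon_start < left_end:
        parts.append((exon_start, left_end, "UTR"))
    # Part of the exon that overlaps the coding region.
    coding_start, coding_end = max(exon_start, cds_start), min(exon_end, cds_end)
    if coding_start < coding_end:
        parts.append((coding_start, coding_end, "coding"))
    # Part of the exon that lies after the coding region.
    right_start = max(exon_start, cds_end)
    if right_start < exon_end:
        parts.append((right_start, exon_end, "UTR"))
    return parts


# Made-up transcript on the forward strand with coding region 200-900.
print(split_exon(100, 300, 200, 900))   # [(100, 200, 'UTR'), (200, 300, 'coding')]
print(split_exon(800, 1000, 200, 900))  # [(800, 900, 'coding'), (900, 1000, 'UTR')]
```

For a transcript read from left to right, the first result corresponds to a 5′-most exon that is short on the left and tall on the right, and the second to a 3′-most exon with the reverse pattern.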

Figure 4.3 The genomic context of the human HIF1A gene, after clicking on zoom out 3×. The genes immediately upstream (FLJ22447) and downstream (SNAPC1) of HIF1A are now visible.

The track below the GENCODE track is the RefSeq gene predictions from NCBI track. This is a composite track showing human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq; Box 1.2). By default, the RefSeq track is shown in dense mode, with the exons of the individual transcripts condensed into a single line (Figure 4.2). Note that, in this dense mode, the exons are displayed as blocks, as in the GENCODE track, but there are no arrowheads on the gene model to show the direction of transcription. To change the display of the RefSeq track to view individual transcripts, open the Track Settings page for the NCBI RefSeq track by clicking on the track name in the first row of the Genes and Gene Predictions section (below the graphical view shown in Figure 4.2). The resulting Track Settings page (Figure 4.4) allows the user to choose which type of RefSeqs to display (e.g. all, curated only, or predicted only). In this example, we change the mode of the RefSeq Curated track from dense to full, and the resulting graphical view (Figure 4.5) displays each curated RefSeq as a separate transcript. In contrast to the GENCODE track, there are only three RefSeq transcripts for the HIF1A gene, and the HIF1A-AS2 RefSeq transcript is much shorter than the GENCODE transcript with the same name. These discrepancies are due to differences in how the RefSeq and GENCODE transcript sets are assembled (Boxes 1.2 and 4.2).

Figure 4.4 The RefSeq Track Settings page. The track settings pages are used to configure the display of annotation tracks. By default, all of the RefSeq tracks are set to display in dense mode, with all features condensed into a single line. In this example, the Curated RefSeqs are being set to display in full mode, in which each RefSeq transcript will be labeled and displayed on a separate line. The remainder of the RefSeqs will be displayed in dense mode. The types of RefSeqs, curated and predicted, are described in Box 1.2. After changing the settings, press the submit button to apply them.

Additional information about each transcript in the GENCODE and RefSeq tracks is available by clicking on the gene symbol (HIF1A, in this case); as the original search was for HIF1A, the gene name is highlighted in inverse type. For GENCODE genes, UCSC has collected information from a variety of public sources and includes a text description, accession numbers, expression data, protein structure, Gene Ontology terms, and more. For RefSeq transcripts, UCSC provides links to NCBI resources. Both GENCODE and RefSeq details pages provide a link to Genomic Sequence in the Sequence and Links section, allowing users to retrieve genomic sequences connected to an individual transcript. From the selection menu (Figure 4.6), users can choose whether to download the sequence upstream or downstream of the gene, as well as the exon or intron sequence. The sequence is returned in FASTA format.

Figure 4.5 The genomic context of the human HIF1A gene, after displaying RefSeq Curated genes in full mode. Each RefSeq transcript is now drawn on a separate line, so that individual exons, as well as the direction of transcription, are visible. Compare this rendition with Figure 4.2, where all RefSeq transcripts are condensed on a single line.

Figure 4.6 The Get Genomic Sequence page that provides an interface for users to retrieve the sequence for a feature of interest. Click on an individual transcript in the GENCODE or RefSeq track to open a page with additional details for that transcript. On either of those details pages, click the link for Genomic Sequence to open the page displayed here, which provides choices for retrieving sequences upstream or downstream of the transcript, as well as intron or exon sequences. In this example, retrieve the sequence 1000 nt upstream of the annotated transcription start site. Shown in the inset is the result of retrieving the FASTA-formatted sequence 1000 nt upstream of the HIF1A transcript.
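Retrieving the sequence upstream of a transcript comes down to strand-aware coordinate arithmetic followed by FASTA formatting. The sketch below illustrates that idea only and is not how the Genome Browser itself is implemented. The function names, the 0-based half-open coordinate convention, and the toy coordinates are all assumptions made for this example.

```python
def upstream_interval(tx_start, tx_end, strand, n=1000):
    """Return (start, end) covering n bases upstream of the transcription
    start site, in 0-based half-open coordinates.

    Assumption: tx_start < tx_end are the transcript's genomic bounds.
    On the + strand the start site is tx_start; on the - strand it is tx_end,
    so "upstream" lies at higher genomic coordinates.
    """
    if strand == "+":
        return max(0, tx_start - n), tx_start
    if strand == "-":
        # The sequence fetched for this interval would still need to be
        # reverse-complemented to read 5' to 3' relative to the gene.
        return tx_end, tx_end + n
    raise ValueError("strand must be '+' or '-'")


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical transcript bounds, chosen only to show the arithmetic.
plus = upstream_interval(5000, 9000, "+")
minus = upstream_interval(5000, 9000, "-")
record = to_fasta("toy_upstream", "ACGT" * 20, width=60)
```

The same transcript bounds give different upstream windows depending on strand, which is why the direction of transcription shown in full display mode matters when choosing a region to retrieve.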

Further down on the graphical view shown in Figure 4.3 are tracks from the ENCODE

Regulation super-track: Layered H3K27Ac and DNase Clusters. These data were generated

by the Encyclopedia of DNA Elements (ENCODE) Consortium between 2003 and 2012

(ENCODE Project Consortium 2012). The ENCODE Consortium has developed reagents

and tools to identify all functional elements in the human genome sequence. The Layered

H3K27Ac track indicates regions where there are modified histones that may indicate active

enhancers (Box 4.3).


Box 4.3 Histone Marks

Histone proteins package DNA into chromosomes. Post-translational modifications of these histones can affect gene expression, as well as DNA replication and repair, by changing chromatin structure or recruiting histone modifiers (Lawrence et al. 2016). The post-translational modifications include methylation, phosphorylation, acetylation, ubiquitylation, and sumoylation. Histone H3 is primarily acetylated on lysine residues, methylated at arginine or lysine, or phosphorylated on serine or threonine. Histone H4 is primarily acetylated on lysine, methylated at arginine or lysine, or phosphorylated on serine.

Histone modification (or “marking”) is identified by the name of the histone, the residue on which it is marked, and the type of mark. Thus, H3K27Ac is histone H3 that is acetylated on lysine 27, while H3K79me2 is histone H3 that is dimethylated on lysine 79. Different histone marks are associated with different types of chromatin structure. Some are more likely to be found near enhancers and others near promoters and, while some increase expression of nearby genes, others decrease it. For example, H3K4me3 is associated with active promoters, and H3K27me3 is associated with developmentally controlled repressive chromatin states.
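The naming convention for histone marks (histone, residue, position, modification) is regular enough to parse mechanically. The following sketch is a hypothetical helper, not part of any browser or library. The regular expression and lookup tables cover only the modifications and residues named in this box, plus a few common histone names, and would need extending for other marks.

```python
import re

# Pattern for names such as H3K27Ac or H3K79me2: histone name, one-letter
# residue code, residue position, then the modification (case-insensitive).
MARK_PATTERN = re.compile(
    r"^(H(?:1|2A|2B|3|4))([A-Z])(\d+)(ac|me[123]?|ph|ub)$", re.IGNORECASE
)

RESIDUES = {"K": "lysine", "R": "arginine", "S": "serine", "T": "threonine"}
MODIFICATIONS = {
    "ac": "acetylated",
    "me": "methylated",
    "me1": "monomethylated",
    "me2": "dimethylated",
    "me3": "trimethylated",
    "ph": "phosphorylated",
    "ub": "ubiquitylated",
}


def parse_histone_mark(name):
    """Split a histone mark name into its four components."""
    match = MARK_PATTERN.match(name)
    if match is None:
        raise ValueError("unrecognized histone mark: " + name)
    histone, residue, position, modification = match.groups()
    return {
        "histone": histone.upper(),
        "residue": RESIDUES.get(residue.upper(), residue.upper()),
        "position": int(position),
        "modification": MODIFICATIONS[modification.lower()],
    }


h3k27ac = parse_histone_mark("H3K27Ac")
h3k79me2 = parse_histone_mark("H3K79me2")
```

Because matching is case-insensitive, the spellings H3K4me3 and H3K4Me3, both of which appear in this chapter, parse to the same components.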

The DNase Clusters track depicts regions where chromatin is hypersensitive to cutting by the DNaseI enzyme. In these hypersensitive regions, the nucleosome structure is less compacted, meaning that the DNA is available to bind transcription factors. Thus, regulatory regions, especially promoters, tend to be DNase sensitive. The track settings for the ENCODE Regulation super-track allow other ENCODE tracks to be added to the browser window, including additional histone modification and DNaseI hypersensitivity data. Changing the display of the H3K4Me3 peaks from hide to full highlights the peaks in the H3K4Me3 track near the 5′ ends of the HIF1A and SNAPC1 transcripts that overlap with DNase hypersensitive sites (Figure 4.7, blue highlights). These peaks may represent promoter elements that regulate the start of transcription.
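Spotting H3K4Me3 peaks that coincide with DNase hypersensitive sites near a transcript's 5′ end is, computationally, an interval-overlap test. The sketch below uses invented gene names and coordinates. The 500-base flank around each transcription start site is an arbitrary assumption for illustration, not a threshold used by ENCODE or UCSC.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def candidate_promoters(tss_positions, h3k4me3_peaks, dnase_sites, flank=500):
    """Return genes whose TSS window overlaps both an H3K4Me3 peak and a
    DNase hypersensitive site.

    tss_positions: dict mapping gene name to TSS coordinate (one chromosome).
    h3k4me3_peaks, dnase_sites: lists of (start, end) intervals.
    """
    hits = []
    for gene, tss in tss_positions.items():
        window = (max(0, tss - flank), tss + flank)
        has_peak = any(overlaps(window, peak) for peak in h3k4me3_peaks)
        has_dnase = any(overlaps(window, site) for site in dnase_sites)
        if has_peak and has_dnase:
            hits.append(gene)
    return sorted(hits)


# Toy data: only GENE_A has both kinds of evidence near its start site.
tss_positions = {"GENE_A": 10_000, "GENE_B": 50_000, "GENE_C": 90_000}
h3k4me3_peaks = [(9_800, 10_600), (49_000, 49_400), (70_000, 71_000)]
dnase_sites = [(9_900, 10_100), (60_000, 60_200)]
promoters = candidate_promoters(tss_positions, h3k4me3_peaks, dnase_sites)
```

Requiring both signals mirrors the visual reasoning in the browser, where a histone peak alone is weaker evidence than a peak that coincides with open chromatin.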

The UCSC Genome Browser displays data from NCBI’s Single Nucleotide Polymorphism

Database (dbSNP) in four SNP tracks. Common SNPs contains SNPs and small insertions and

deletions (indels) from NCBI’s dbSNP that have a minor allele frequency of at least 1% and

are mapped to a single location in the genome. Researchers looking for disease-causing SNPs

can use this track to filter their data, hypothesizing that their variant of interest will be rare

and therefore not displayed in this track. Flagged SNPs are those that are deemed by NCBI to

be clinically associated, while Mult. SNPs have been mapped to more than one region in the

genome. NCBI filters out most multiple-mapping SNPs as they may not be true SNPs, so there

are not many variants in this track. All SNPs includes all SNPs from the three subcategories.

dbSNP is in a continuous state of growth, and new data are incorporated a few times each year

as a new release, or new build, of dbSNP. These four SNP tracks are available for a few of the

most recent builds of dbSNP, indicated by the number in the track name. Thus, for example,

Common SNPs (150) are SNPs found in ≥1% of samples from dbSNP build 150.
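The track definitions above can be read as simple rules on three properties of a variant. The sketch below encodes only what this paragraph states. The `Variant` record and its field names are hypothetical, and the rules applied by UCSC when building the tracks may be more restrictive than this simplification.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    rsid: str
    minor_allele_freq: float   # fraction, e.g. 0.01 for 1%
    mapped_locations: int      # number of places the variant maps in the genome
    clinically_flagged: bool   # deemed clinically associated


def snp_tracks(variant, maf_threshold=0.01):
    """Return the SNP tracks a variant would fall into under the simplified
    rules described in the text."""
    tracks = ["All SNPs"]
    if variant.mapped_locations == 1 and variant.minor_allele_freq >= maf_threshold:
        tracks.append("Common SNPs")
    if variant.clinically_flagged:
        tracks.append("Flagged SNPs")
    if variant.mapped_locations > 1:
        tracks.append("Mult. SNPs")
    return tracks


# Made-up identifiers and values, for illustration only.
common = snp_tracks(Variant("rs_toy1", 0.25, 1, False))
rare_flagged = snp_tracks(Variant("rs_toy2", 0.001, 1, True))
multi = snp_tracks(Variant("rs_toy3", 0.30, 2, False))
```

A researcher filtering for rare disease-causing candidates would keep variants for which "Common SNPs" is absent from the returned list.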

By default, the Common SNPs (150) track is displayed in dense mode, with all variants in the

region compressed onto a single line. Variants in the Common SNPs track are color coded by

function. Open the Track Settings for this track in order to modify the display (Figure 4.8). Set

the Display mode to pack in order to show each variant separately. At the same time, modify the

Coloring Options so that SNPs in UTRs of transcripts are set to blue and SNPs in coding regions

of transcripts are set to green if they are synonymous (no change to the protein sequence) or

red if they are non-synonymous (altering the protein sequence), with all remaining classes of

SNPs set to display in black. Note the changes in the resulting browser window, with the green

synonymous and blue untranslated SNPs clearly visible (Figure 4.9).

Figure 4.7 The genomic context of the human HIF1A gene, after changing the display of the H3K4Me3 peaks from hide to full. The H3K4Me3

track is part of the ENCODE Regulation super-track. Below the graphic display window in Figure 4.5, open up the ENCODE Regulation

Super-track, in the Regulation menu. Change the track display from hide to full to reproduce the page shown here. Note that the H3K4Me3

peaks, which can indicate promoter regions (Box 4.3), overlap with the transcription starts of the SNAPC1 and HIF1A genes (light blue

highlight). These regions also overlap with the DNase HS track, indicating that the chromatin should be available to bind transcription

factors in this region. The highlights were added within the Genome Browser using the Drag-and-select tool. This tool is accessed by

clicking anywhere in the Scale track at the top of the Genome Browser display and dragging the selection window across a region of

interest. The Drag-and-select tool provides options to Highlight the selected region or Zoom directly to it.

Figure 4.8 Configuring the track settings for the Common SNPs(150) track. Set the Coloring Options so that all SNPs are black, except for

untranslated SNPs (blue), coding-synonymous SNPs (green), and coding-non-synonymous SNPs (red). In addition, change the Display mode

of the track from dense to pack so that the individual SNPs can be seen. By default, the function of each variant is defined by its position

within transcripts in the GENCODE track. However, the track used for annotation can be changed in the settings called Use Gene Tracks for

Functional Annotation.


Figure 4.9 The genomic context of the human HIF1A gene, after changing the colors and display mode of the Common SNPs(150) track as

shown in Figure 4.8. The SNPs in the 5′ and 3′ untranslated regions of the HIF1A GENCODE transcripts are now colored blue, while the

coding-synonymous SNP is colored green.

Two types of Expression tracks display data from the NIH Genotype-Tissue Expression

(GTEx) project (GTEx Consortium 2015). The GTEx Gene track displays gene expression

levels in 51 tissues and two cell lines, based on RNA-seq data from 8555 samples. The GTEx

Transcript track provides additional analysis of the same data and displays median transcript

expression levels. By default, the GTEx Gene track is shown in pack mode, while the GTEx

Transcript track is hidden. Figure 4.10 shows the Gene track in pack display mode, in the

region of the phenylalanine hydroxylase (PAH) gene. The height of each bar in the bar graph

represents the median expression level of the gene across all samples for a tissue, and the

bar color indicates the tissue. The PAH gene is highly expressed in kidney and liver (the two

brown bars). The expression is more clearly visible in the details page for the GTEx track

(Figure 4.10, inset, purple box). The GTEx Transcript track is similar, but depicts expression

for individual transcripts rather than an average for the gene.
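Each bar in the GTEx Gene track summarizes many samples as a single median per tissue. The sketch below shows that aggregation on invented expression values. The function name and the toy numbers are assumptions for this example and are not GTEx data.

```python
from collections import defaultdict
from statistics import median


def median_expression_by_tissue(samples):
    """Group (tissue, expression) pairs by tissue and return the median
    expression for each tissue."""
    by_tissue = defaultdict(list)
    for tissue, value in samples:
        by_tissue[tissue].append(value)
    return {tissue: median(values) for tissue, values in by_tissue.items()}


# Invented per-sample expression values for one gene.
toy_samples = [
    ("Liver", 40.0), ("Liver", 60.0), ("Liver", 50.0),
    ("Kidney", 30.0), ("Kidney", 34.0),
    ("Lung", 0.5), ("Lung", 0.1), ("Lung", 0.3),
]
medians = median_expression_by_tissue(toy_samples)
highest = max(medians, key=medians.get)
```

Using the median rather than the mean keeps a few unusually high or low samples from dominating the bar height for a tissue.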

Figure 4.10 The GTEx Gene track, which depicts median gene expression levels in 51 tissues and two cell lines, based on RNA-seq data from the GTEx project from 8555 tissue samples. The main browser window depicts the GTEx Gene track for the human PAH gene, showing high expression in the two tissues colored brown (liver and kidney) but low or no expression in others. Clicking on the GTEx track opens it in a larger window, shown in the inset.

An alternate entry point to the UCSC Genome Browser is via a BLAT search (see Chapter 3), where a user can input a nucleotide or protein sequence to find an aligned region in a selected genome. BLAT excels at quickly identifying a matching sequence in the same or a highly similar organism. We will attempt to use BLAT to find a lizard homolog of the human gene disintegrin and metalloproteinase domain-containing protein 18 (ADAM18). The ADAM18

protein sequence is copied in FASTA format from the NCBI view of accession number

NP_001307242.1 and pasted into the BLAT Search box that can be accessed from the Tools

pull-down menu; the method for retrieving this sequence in the correct format is described

in Chapter 2. Select the lizard genome and assembly AnoCar2.0/anoCar2. BLAT will automatically

determine that the query sequence is a protein and will compare it with the lizard

genome translated in all six reading frames. A single result is returned (Figure 4.11a). The

alignment between the ADAM18 protein sequence and lizard chromosome Un_GL343418

runs from amino acid 368 to amino acid 383, with 81.3% identity. The browser link depicts

the genomic context of this 48 nt hit (Figure 4.11b). Although the ADAM18 protein sequence

aligns to a region in which other human ADAM genes have also been aligned, the other

human genes are represented by a thin line, indicating a gap in their alignment. The details

link shown in Figure 4.11a produces the alignment between the ADAM18 protein and lizard

chromosome Un_GL343418 (Figure 4.11c). The top section of the results shows the protein

query sequence, with the blue letters indicating the short region of alignment with the

genome. The bottom section shows the pairwise alignment between the protein and genomic

sequence translated in six frames. Vertical black lines indicate identical sequences. Taken

together, the BLAT results show that only 16 amino acids of the 715 amino acid ADAM18

protein align to the lizard genome (Figure 4.11c). This alignment is short and likely does not

represent a homologous region between the ADAM18 protein and the lizard genome. Thus,

the BLAT algorithm, although fast, is not always sensitive enough to detect cross-species

orthologs. The BLAST algorithm, described in the Ensembl Genome Browser section, is more

sensitive, and is a better choice for identifying such homologs.
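The numbers quoted for this hit can be checked with a few lines of arithmetic. The sketch below assumes 1-based inclusive query coordinates and a single gap-free alignment block, so that each aligned amino acid corresponds to three nucleotides. The function name is hypothetical.

```python
def alignment_summary(q_start, q_end, q_size, identity_pct):
    """Summarize a protein-to-genome alignment block.

    q_start, q_end: 1-based inclusive coordinates on the protein query.
    q_size: length of the query in amino acids.
    identity_pct: percent identity reported for the block.
    """
    aligned = q_end - q_start + 1
    return {
        "aligned_residues": aligned,
        "genomic_span_nt": aligned * 3,  # assumes no gaps in the block
        "query_coverage_pct": round(100 * aligned / q_size, 1),
        "identical_residues": round(aligned * identity_pct / 100),
    }


# Values reported in the text for the ADAM18 query against the lizard genome.
summary = alignment_summary(368, 383, 715, 81.3)
```

With these inputs the block covers 16 residues (48 nt), of which 13 are identical, and spans only about 2% of the 715 amino acid query, which is why the hit is not convincing evidence of homology.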


Figure 4.11 BLAT search at the UCSC Genome Browser. (a) This page shows the results of running a BLAT search against the lizard genome, using as a query the human protein sequence of the gene ADAM18, accession NP_001307242.1. The ADAM18 protein sequence is available from NCBI at www.ncbi.nlm.nih.gov/protein/NP_001307242.1?report=fasta. At the UCSC Genome Browser, the web interface to the BLAT search is in the Tools menu at the top of each page. The BLAT search was run against the lizard genome assembly from May 2010, also called anoCar2. The columns on the results page are as follows: ACTIONS, links to the browser (Figure 4.11b) and details (Figure 4.11c); QUERY, the name of the query sequence; SCORE, the BLAT score, determined by the number of matches vs. mismatches in the final alignment of the query to the genome; START, the start coordinate of the alignment, on the query sequence; END, the end coordinate of the alignment, on the query sequence; QSIZE, the length of the query; IDENTITY, the percent identity between the query and the genomic sequences; CHRO, the chromosome to which the query sequence aligns; STRAND, the chromosome strand to which the query sequence aligns; START, the start coordinate of the alignment, on the genomic sequence; END, the end coordinate of the alignment, on the genomic sequence; and SPAN, the length of the alignment, on the genomic sequence. Note that, in this example, there is a single alignment; searches with other sequences may result in many alignments, each shown on a separate line. It is possible to search with up to 25 sequences at a time, but each sequence must be in FASTA format. (b) This page shows the browser link from the BLAT summary page. The alignment between the query and genome is shown as a new track called Your Sequence from BLAT Search. (c) The details link from the BLAT summary page, showing the alignment between the query (human ADAM18 protein) and the lizard genome, translated in six frames. The protein query sequence is shown at the top, with the blue letters indicating the amino acids that align to the genome. The bottom section shows the pairwise alignment between the protein and genomic sequence translated in six frames. Black lines indicate identical sequences; red and green letters indicate where the genomic sequence encodes a different amino acid. Although the ADAM18 protein sequence has a length of 715 amino acids, only 16 amino acids align as a single block to the lizard genome.
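Comparing a protein query with a genome "translated in six frames" means translating both strands of the DNA at each of three codon offsets. The sketch below shows the idea on a nine-base toy sequence using the standard genetic code. It illustrates the concept only and is not the BLAT algorithm, which indexes the genome rather than translating it naively like this.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

A protein-versus-genome search then looks for the query's amino acids in any of the six translated strings, which is how a protein can be aligned to DNA on either strand.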


UCSC Table Browser

The Table Browser tool provides users a text-based interface with which to query, intersect, filter, and download the data that are displayed graphically in the Genome Browser. These data can then be saved in a spreadsheet for further analysis, or used as input into a different program. Using a web-based interface, users select a genome assembly, track, and position, then choose how to manipulate that track data and what fields to return. This example will demonstrate how to retrieve a list of all NCBI mRNA reference sequences that overlap with an SNP from the Genome-Wide Association Study (GWAS) Catalog track, which identifies genetic loci associated with common diseases or traits. The GWAS Catalog is a manually curated collection of published genome-wide association studies that assayed at least 100 000 SNPs, in which all SNP-trait associations have p values of <1 × 10⁻⁵ (Buniello et al. 2019).

The Table Browser landing page is accessible from either the UCSC Genome Browser home

page or the Tools pull-down menu. First, reset all user cart settings by clicking on the click here

link at the bottom of the Table Browser settings section.

Then, select the NCBI RefSeq track on the GRCh38 genome assembly (Figure 4.12a). Create

a filter to limit the search to curated mRNA reference sequences in the NM_ accession series

(Box 1.2; Figure 4.12b). Next, intersect the RefSeq track with variants from the GWAS Catalog

(Figure 4.12c). Finally, on the Table Browser form, change the output format to hyperlinks to

Genome Browser, then click get output. The output is a list of 3000+ RefSeq mRNAs that overlap

with a variant from the GWAS Catalog (Figure 4.12d). The Genome Browser view of one of the

transcripts, from the gene arginine–glutamic acid dipeptide (RE) repeats (RERE), and the six

SNPs from the GWAS Catalog that it overlaps, can be found by clicking on the first link in the

results list and is shown in Figure 4.12e.
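The filter and intersection steps performed in the Table Browser have a straightforward computational equivalent. The sketch below reproduces the logic on made-up records: keep only accessions in the NM_ series, then report those whose genomic interval contains at least one variant. The accession names, rsIDs, coordinates, and the 0-based half-open convention are assumptions for illustration, and the Table Browser's own implementation is not shown here.

```python
def curated_mrnas_overlapping_snps(transcripts, snps):
    """Return {accession: [rsid, ...]} for NM_ transcripts that overlap a SNP.

    transcripts: list of (accession, chrom, start, end), half-open intervals.
    snps: list of (rsid, chrom, position), 0-based positions.
    """
    hits = {}
    for accession, chrom, start, end in transcripts:
        if not accession.startswith("NM_"):
            continue  # filter step: curated mRNA accessions only
        overlapping = [
            rsid
            for rsid, snp_chrom, position in snps
            if snp_chrom == chrom and start <= position < end
        ]
        if overlapping:  # intersection step: keep transcripts with a SNP
            hits[accession] = overlapping
    return hits


# Placeholder records, not real accessions or variants.
transcripts = [
    ("NM_TOY_A", "chr1", 100, 500),
    ("XM_TOY_B", "chr1", 100, 500),
    ("NM_TOY_C", "chr2", 100, 500),
    ("NM_TOY_D", "chr1", 1000, 2000),
]
snps = [
    ("rs_toy1", "chr1", 150),
    ("rs_toy2", "chr1", 499),
    ("rs_toy3", "chr1", 500),
    ("rs_toy4", "chr3", 150),
]
hits = curated_mrnas_overlapping_snps(transcripts, snps)
```

This nested loop is adequate for a toy example. For genome-scale tables, sorted intervals or an interval tree would be the usual choice.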


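The UCSC Genome Browser

This chapter presents general guidelines for accessing genome sequence and annotations using the UCSC and Ensembl Genome Browsers. Although similar analyses can be carried out with either browser, we use different examples at the two sites to illustrate the different types of questions a researcher might ask. Finally, we briefly describe JBrowse (Buels et al. 2016), a web-based genome browser that users can deploy on their own servers to share custom genome assemblies and annotations. All of the resources discussed in this chapter are freely available.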


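The UCSC Genome Browser began in 2000, initially displaying only an early draft of the human genome assembly. Today it provides access to assemblies and annotations for more than 100 species (Haeussler et al. 2019). Most of the assemblies are mammalian genomes, but other vertebrates, insects, nematodes, deuterostomes, and the Ebola virus are also included. Some species, including human and mouse, have multiple assembly versions. New species and new versions are added regularly.

The UCSC Browser presents genome annotations in the form of tracks. Each track provides a different type of feature, from genes to SNPs, predicted gene regulatory regions, and expression data. Each species has its own set of tracks, some created by the UCSC Genome Bioinformatics team and others contributed by members of the bioinformatics community. More than 200 tracks are available for the GRCh37 version of the human genome. The newer human genome assembly, GRCh38, has fewer tracks, as not all of the data have been remapped from the older assembly. Other genomes are less well annotated than human; for example, the sea hare has fewer than 20 tracks. Some tracks, such as those created from NCBI transcript data, are available for many species; others apply to only one or a few species.

This chapter first describes how to access the UCSC Genome Browser from the Gateway link on the UCSC home page (Figure 4.1). The default assembly is the most recent human assembly (currently GRCh38). Other genomes and assembly versions can also be selected on the home page.

There are two ways to begin a search for a gene. First, enter a search term in the search box near the top of the browser window (Figure 4.1); by default, the browser searches for matching genes in the current genome. Second, use the BLAT search function (see below). In our example, we search for the human hypoxia-inducible factor 1 alpha subunit (HIF1A) gene (Figure 4.1); clicking go displays the view shown in Figure 4.2.

The navigation controls are at the top of the display. The arrows move the view left or right along the chromosome. The zoom controls enlarge or shrink the displayed region. Clicking zoom out 1.5× or zoom out 3× expands the view from the gene to a larger region, showing not only HIF1A itself but also its flanking regions. Clicking zoom in 1.5× or zoom in 3× narrows the view to a smaller region, so that detail down to the single-nucleotide level becomes visible. The chr position field shows the range of genomic coordinates currently displayed in the browser. Instead of using the search function, coordinates or a gene name can be typed into this box, followed by the Enter key.

Figure 4.1 The home page of the UCSC Genome Browser, showing a query for the HIF1A gene on the human GRCh38 genome assembly.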

Source: Reproduced with permission of UCSC Genome Browser, https://genome.ucsc.edu.

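Below the browser window shown in Figure 4.2 is a set of tracks, grouped logically by function. By default, many tracks are hidden. To display a hidden track, change the control to the left of the track label from hide to dense, pack, or full. Dense mode condenses all features into a single line; pack mode displays the track in a space-saving manner, usually on multiple lines; full mode shows the most detailed information about each feature.

Figure 4.2 The default view of the UCSC Genome Browser, showing the genomic context of the human HIF1A gene.

Below the graphical browser window (Figure 4.2) is the list of tracks, grouped by function. The Genes and Gene Predictions section contains the gene annotation tracks. The ENCODE Regulation and ENCODE Combined sections contain tracks with added experimental data. The Variation and Repeats section contains the dbSNP Common SNPs tracks, as well as the repeat tracks. Settings such as track colors and annotation details can also be configured on the Track Settings page.

Each track is displayed as a horizontal band, with a title above the track and labels within it. Some tracks also include descriptions of track items or display controls (Figure 4.2). Tracks in dense mode are shown as a single line, for example the RefSeq and GENCODE tracks that are displayed by default in the UCSC Browser.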

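Modifying the Track Display

Clicking anywhere on a track's title line, or on the button at the far left of the track (Figure 4.2), opens the Track Settings page. Here, users can set the track visibility to dense, pack, full, or hide. Settings specific to that track can also be configured, such as the colors used to draw track items or the subset of data shown.

The current view displays the reference genes and GENCODE annotation V41 in full mode, and the RefSeq annotation in dense mode. In pack mode, the GENCODE and RefSeq tracks are displayed as a compact overview, with multiple transcripts laid out on each line.

Clicking zoom out 3× three times expands the view about 27-fold from the single gene, showing the regions upstream and downstream of HIF1A (Figure 4.3). Another HIF1A antisense transcript (HIF1A-AS1) becomes visible. Note that in dense mode exons are shown as blocks, whereas full mode displays whole transcripts with intron/exon boundaries and exon numbers (Figure 4.2).

Figure 4.3 The genomic context of the human HIF1A gene, after clicking on zoom out 3×. The genes immediately upstream (FLJ22447) and downstream (SNAPC1) of HIF1A are now visible.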

Source: Reproduced with permission of UCSC Genome Browser, https://genome.ucsc.edu.

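In the Genes and Gene Predictions section, set the UCSC Genes track to hide, change the control to the left of the RefSeq Curated track label to full, and click the submit button at the bottom of the Track Settings page. The resulting Track Settings page (Figure 4.4) allows the user to choose which types of RefSeqs to display, including curated RefSeq mRNAs (NM_ prefix), predicted RefSeq mRNAs (XM_ prefix), and others. Select NM_ and click submit, and change the display mode of the RefSeq Curated track from dense to full, to produce the graphical view in Figure 4.5.

Figure 4.4 The RefSeq Track Settings page. The Track Settings pages are used to configure the display of annotation tracks. By default, all curated RefSeq mRNAs (NM_ prefix) are displayed.

Figure 4.5 The genomic context of the human HIF1A gene, after displaying RefSeq Curated genes in full mode. Each RefSeq transcript is drawn on a separate line, showing its exon-intron structure. Compare this with Figure 4.2, where all RefSeq transcripts are condensed on a single line.

Figure 4.6 The Get Genomic Sequence page, which provides an interface for users to retrieve the sequence for a feature of interest. Click on an exon of a transcript to reach this page.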

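Retrieving Sequences

Scroll down to the bottom of the graphical view in Figure 4.3 to find the DNA link near the Display button. Clicking this link opens the Get Genomic Sequence page (Figure 4.6). Users can choose to extract the sequence of the region corresponding to the entire browser window, or sequences connected to an individual transcript. From the pull-down menu (Figure 4.6), users can select exon, coding region, 5′ UTR, or 3′ UTR sequences. The output can be displayed in FASTA format, with options to show repeats in lowercase or to include introns in the sequence, which helps in examining alternative splicing patterns.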

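Adding Annotation Tracks

Below the graphical view shown in Figure 4.3 are tracks from the ENCODE (Encyclopedia of DNA Elements) project. These tracks provide transcriptome and epigenome data across many cell types. To display H3K4Me3 histone modification data for the HIF1A locus, find the ENCODE Regulation super-track and use the pull-down menu to change the display of the H3K4Me3 mark from hide to full. The H3K4Me3 peaks for genes such as SNAPC1 are visible in Figure 4.7.

Adding SNP data allows researchers to relate genomic features to known variants. Scroll to the Variation and Repeats section and locate the Common SNPs(150) track. Open the Track Settings for this track to modify its display (Figure 4.8), and change the Display mode from dense to pack. In the Coloring Options section, set all classes of SNPs to black, except for SNPs in the 5′ and 3′ UTRs (blue), coding-synonymous SNPs (green), and coding-non-synonymous SNPs (red). The settings are shown in Figure 4.8. After clicking submit, the green synonymous and blue untranslated SNPs are clearly visible (Figure 4.9).

Figure 4.7 The genomic context of the human HIF1A gene, after changing the display of the H3K4Me3 peaks from hide to full. The H3K4Me3 track is part of the ENCODE Regulation super-track.

Figure 4.8 Configuring the Track Settings for the Common SNPs(150) track. Set the Coloring Options so that all SNPs are black, except for SNPs in the 5′ and 3′ UTRs (blue), coding-synonymous SNPs (green), and coding-non-synonymous SNPs (red).

Figure 4.9 The genomic context of the human HIF1A gene, after changing the colors and display mode of the Common SNPs(150) track as shown in Figure 4.8. The SNPs in the 5′ and 3′ UTRs of the HIF1A GENCODE transcripts are now colored blue, while the coding-synonymous SNP is colored green.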

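Comparing Gene Tracks

Below the Genes and Gene Predictions section, set the GENCODE V41 track to pack mode. Hide the MANE Select and MANE Plus Clinical tracks. Figure 4.10 shows the Gene track in pack mode, compared with the RefSeq Curated track in full mode used in Figure 4.2. In pack mode, transcripts are grouped to reduce overlap and are all shown in a space-saving view. This view gives a good overview of which transcripts belong to GENCODE, RefSeq, or both. For example, HIF1A-204 belongs to both the GENCODE V41 and RefSeq Curated tracks (Figure 4.10). The GTEx Transcript track is similar, but shows expression data from the Genotype-Tissue Expression (GTEx) project (Figure 4.10, inset).

Figure 4.10 Comparison of the Gene track displayed in pack mode with the RefSeq Curated track displayed in full mode. Transcripts shared by the two tracks are clearly visible (e.g. HIF1A-204).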

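Using BLAT

The BLAST-Like Alignment Tool (BLAT; Kent 2002) can be used to align a query sequence to a genome. BLAT can be accessed from the UCSC Genome Browser home page or from the Tools pull-down menu. BLAT supports alignment of DNA, RNA, or protein sequences, against one or more genomes. Figure 4.11 shows the results of a BLAT search with the human HIF1A coding sequence. Results are sorted by decreasing score, with perfect matches to the query listed first. Clicking the browser link jumps directly to the position of the gene in its genomic context.

Figure 4.11 BLAT search results showing the alignment of the human HIF1A coding sequence. The highest-scoring match is listed first.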

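Retrieving Data with the UCSC Table Browser

The Table Browser (Karolchik et al. 2004) can be accessed from the Tools pull-down menu on the UCSC Genome Browser home page. First, reset all user cart settings by clicking on the click here link at the bottom of the Table Browser settings section. Then, select the NCBI RefSeq track on the GRCh38 genome assembly (Figure 4.12a). Create a filter to limit the search to curated mRNA reference sequences in the NM_ accession series (Box 1.2; Figure 4.12b). Next, intersect the RefSeq track with variants from the GWAS Catalog (Figure 4.12c). Finally, on the Table Browser form, change the output format to hyperlinks to Genome Browser, then click get output. The output is a list of more than 3000 RefSeq mRNAs that overlap with a variant from the GWAS Catalog (Figure 4.12d). The Genome Browser view of one of the transcripts, from the gene arginine–glutamic acid dipeptide (RE) repeats (RERE), and the six SNPs from the GWAS Catalog that it overlaps, can be found by clicking on the first link in the results list and is shown in Figure 4.12e.

Figure 4.12 (a) Selecting the NCBI RefSeq track on GRCh38. (b) Creating a filter restricted to NM_ accessions. (c) Intersecting with GWAS Catalog variants. (d) Output of 3000+ matching RefSeq mRNAs. (e) Genome Browser view of the RERE locus with GWAS Catalog SNPs.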
