This chapter presents general guidelines for accessing the genome sequence and annotations
using the UCSC and Ensembl Genome Browsers. Although similar analyses could be carried
out with either browser, we have chosen to use different examples at the two sites to illustrate
different types of questions that a researcher might want to ask. We finish with a short descrip-
tion of JBrowse (Buels et al. 2016), another web-based genome browser that users can set up
on their own servers to share custom genome assemblies and annotations. All of the resources
discussed in this chapter are freely available.
The UCSC Genome Browser
After starting in 2000 with just a display of an early draft of the human genome assembly,
the UCSC Genome Browser now provides access to assemblies and annotations from over 100
organisms (Haeussler et al. 2019). The majority of assemblies are of mammalian genomes, but
other vertebrates, insects, nematodes, deuterostomes, and the Ebola virus are also included.
The assemblies from some organisms, including human and mouse, are available in multiple
versions. New organisms and assembly versions are added regularly.
The UCSC Browser presents genomic annotation in the form of tracks. Each track provides a
different type of feature, from genes to SNPs to predicted gene regulatory regions to expression
data. Each organism has its own set of tracks, some created by the UCSC Genome Bioinformat-
ics team and others provided by members of the bioinformatics community. Over 200 tracks are
available for the GRCh37 version of the human genome assembly. The newer human genome
assembly, GRCh38, has fewer tracks, as not all the data have been remapped from the older
assembly. Other genomes are not as well annotated as human; for example, fewer than 20
tracks are available for the sea hare. Some tracks, such as those created from NCBI transcript
data, are updated weekly, while others, such as the SNP tracks created from NCBI variant data
(Sayers et al. 2019), are updated less frequently, depending on the release schedule of the under-
lying data. For ease of use, tracks are organized into subsections. For example, depending on
the organism, the Genes and Gene Predictions section may include evidence-based gene pre-
dictions, ab initio gene predictions, and/or alignment of protein sequences from other species.
The home page of the UCSC Genome Browser provides a stepping-off point for many of the
resources developed by the Genome Bioinformatics group at UCSC, including the Genome
Browser, BLAT, and the Table Browser, which will be described in detail later in this chapter.
The Tools menu provides a link to liftOver, a widely used tool that converts genomic coordinates
from one assembly to another. Using this tool, it is possible to update annotation files so that old
data can be integrated into a new genome assembly. The Download menu provides an option
to download all the sequence and annotation data for each genome assembly hosted by UCSC,
as well as some of the source code. The What’s New section provides updates on new genome
assemblies, as well as new tools and features. Finally, there is an extensive Help menu, with
detailed documentation as well as videos. Users may also submit questions to a mailing list,
and most queries are answered within a day.
The UCSC Genome Browser provides multiple ways for both individual users and larger
genome centers to share data with collaborators or even the entire bioinformatics commu-
nity. These sharing options are available on the My Data link on the home page. Custom
Tracks allow users to display their own data as a separate annotation track in the browser.
User data must be formatted in a standard data structure in order to be interpreted correctly by
the browser. Many commonly used file formats are supported, including Browser Extensible
Data (BED), Binary Alignment/Map (BAM), and Variant Call Format (VCF; Box 4.1). Small
data files can be uploaded or pasted into the Genome Browser for personal use. Larger files
must be saved on the user’s web server and accessed by URL through the Genome Browser.
As anyone with the URL can access the data, this method can be used to share data with col-
laborators. Alternatively, Custom Tracks, along with track configurations and settings, can be
shared with selected collaborators using a named Session. Some groups choose to make their
Sessions available to the world at large in My Data →Public Sessions. Finally, groups with very
large datasets can host their data in the form of a Track Hub so that it can be viewed on the
UCSC Genome Browser. When a Track Hub is paired with an Assembly Hub, it can be used to
create a browser for a genome assembly not already hosted by UCSC.
Box 4.1 Common File Types for Genomic Data
Both the UCSC and Ensembl Genome Browsers allow users to upload their own data so
that they can be viewed in context with other genome-scale data. User data must be
formatted in a commonly used data structure in order to be interpreted correctly by the
browser.
Browser Extensible Data (BED) format is a tab-delimited format that is flexible enough to
display many types of data. It can be used to display fairly simple features like the
location of transcription factor binding sites, as well as more complex ones like transcripts
and their exons.
Binary Alignment/Map (BAM) format is the compressed binary version of the Sequence
Alignment/Map (SAM) format. It is a compact format designed for use with very large files of
nucleotide sequence alignments. Because it can be indexed, only the portion of the file
that is needed for display is transferred to the browser. Many tools for next-generation
sequence analysis use BAM format as output or input.
Variant Call Format (VCF) is a flexible format for large files of variation data including
single-nucleotide variants, insertions/deletions, copy number variants, and structural
variants. Like BAM format, it is compressed and indexed, and only the portion of the file
that is needed for display is transferred to the browser. Many tools for variant analysis
use VCF format as output or input.
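As an illustration of the simplest of these formats, the following Python sketch writes a small BED custom track. The feature names and coordinates are hypothetical placeholders, and the helper names (`BedFeature`, `to_bed_line`, `custom_track`) are ours rather than part of any browser software. The sketch assumes the usual BED convention of 0-based, half-open coordinates and a leading `track` line.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    """One six-column BED record (coordinates are 0-based, half-open)."""
    chrom: str
    start: int
    end: int
    name: str
    score: int = 0
    strand: str = "+"


def to_bed_line(feature: BedFeature) -> str:
    """Format a feature as one tab-delimited BED line."""
    if feature.start < 0 or feature.end <= feature.start:
        raise ValueError(f"invalid interval for {feature.name}")
    return "\t".join(str(value) for value in feature)


def custom_track(name: str, description: str, features: list) -> str:
    """Build the text of a custom track: a track line followed by BED lines."""
    header = f'track name="{name}" description="{description}" visibility=pack'
    return "\n".join([header] + [to_bed_line(f) for f in features]) + "\n"


if __name__ == "__main__":
    # Hypothetical binding sites; the coordinates are illustrative only.
    sites = [
        BedFeature("chr14", 1000, 1020, "siteA"),
        BedFeature("chr14", 5000, 5015, "siteB", strand="-"),
    ]
    print(custom_track("demoSites", "Example binding sites", sites), end="")
```

The resulting text could be pasted into the custom track upload form or saved to a file and referenced by URL.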
The UCSC Genome Browser home page lists commonly accessed tools, as well as a
frequently updated news section that highlights major data and software updates. To reach
the Genome Browser Gateway, the main entry point for text-based searches, click on the
Gateway link on the home page (Figure 4.1). The default assembly is the most recent
human assembly, GRCh38, from December 2013. The genomes of other species can be
selected from the phylogenetic tree on the left side of the Gateway page, or by typing
their name in the selection box. On the human Gateway page, there is also the option to
select one of four older human genome assemblies. Details about the GRCh38 assembly
and instructions for searching are available on the Gateway page.
To perform a search, enter text into the Position/Search Term box. If the query maps to a
unique position in the genome, such as a search for a particular chromosome and position, the
Go button links directly to the Genome Browser. However, if there is more than one hit for the
query, such as a search for the term metalloprotease, the resulting page will contain a list
of results that all contain that term. For some species, the terms have been indexed, and typing
a gene symbol into the search box will bring up a list of possible matches. In this example, we
will search for the human hypoxia inducible factor 1 alpha subunit (HIF1A) gene (Figure 4.1),
which produces a single hit on GRCh38.
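A position query of the kind described above can also be handled programmatically. The sketch below parses a chromosome-and-position string, computes the window length, and builds a link to the browser. It assumes that positions are written as 1-based, inclusive coordinates, that `hg38` is the database name for GRCh38, and that the browser accepts `db` and `position` parameters in the URL. The coordinates in the example are placeholders, not the actual location of HIF1A.

```python
import re
from urllib.parse import urlencode

# Matches strings such as "chr14:1,000-2,000" (commas in numbers are optional).
POSITION_RE = re.compile(r"^(chr[\w.]+):([\d,]+)-([\d,]+)$")


def parse_position(text: str) -> tuple:
    """Split a position string into (chromosome, start, end)."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a position string: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates: {text!r}")
    return chrom, start, end


def window_length(start: int, end: int) -> int:
    """Length in nucleotides of a 1-based, inclusive window."""
    return end - start + 1


def browser_url(db: str, chrom: str, start: int, end: int) -> str:
    """Build a Genome Browser link (assumed URL parameters: db, position)."""
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"})
    return f"https://genome.ucsc.edu/cgi-bin/hgTracks?{query}"


if __name__ == "__main__":
    chrom, start, end = parse_position("chr14:1,000-2,000")  # placeholder window
    print(window_length(start, end))
    print(browser_url("hg38", chrom, start, end))
```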
The default Genome Browser view showing the genomic context of the HIF1A gene is shown
in Figure 4.2. The navigation controls are presented across the top of the display. The arrows
move the window to the left and right along the chromosome. Alternatively, the user can
move the display left and right by holding down the mouse button and dragging the window.
To zoom in and out, use the buttons at the top of the display. The base button zooms in so
far that individual nucleotides are displayed, while the zoom out 100× button will show the
entire chromosome if it is pressed a few times. The current genomic position and the length
of the window (in nucleotides) are shown above a schematic of chromosome 14, where the current
genomic position is highlighted with a red box. A new search term can be entered into the
search box.
Figure 4.1 The home page of the UCSC Genome Browser, showing a query for the gene HIF1A on the human GRCh38 genome assembly.
The organism can be selected by clicking on its name in the phylogenetic tree. For many organisms, more than one genome assembly is
available. Typing a term into the Position/Search Term box returns a list of matching gene symbols.
Below the browser window illustrated in Figure 4.2, one would find a list of tracks that
are available for display on the assembly. The tracks are separated into nine categories: Map-
ping and Sequencing, Genes and Gene Predictions, Phenotype and Literature, mRNA and
Expressed Sequence Tag (EST), Expression, Regulation, Comparative Genomics, Variation,
and Repeats. Clicking on a track name opens the Track Settings page for that track, provid-
ing a description of the data displayed in that track. Most tracks can be displayed in one of the
following five modes.
- Hide: the track is not displayed at all.
- Dense: all features are collapsed into a single line; features are not labeled.
- Squish: each feature is shown separately, but at 50% of the height of full mode; features are
not labeled.
- Pack: each feature is shown separately, but not necessarily on separate lines; features are
labeled.
- Full: each feature is labeled and displayed on a separate line.
Figure 4.2 The default view of the UCSC Genome Browser, showing the genomic context of the human HIF1A gene.
In order to simplify the display, most tracks are in hide mode by default. To change the
mode, use the pull-down menu below the track name or on the Track Settings page. Other
settings, such as color or annotation details, can also be configured on the Track Settings page.
For example, the NCBI RefSeq track allows users to select whether they want to view all reference
sequences or only those that are curated or predicted (Box 1.2). One possible point of confusion
is that the UCSC Genome Browser will “remember” the mode in which each track is displayed
from session to session. Custom settings can be cleared by selecting Reset all User Settings under
the Genome Browser pull-down menu at the top of any page.
The annotation tracks in the window below the chromosome are the focus of the Genome
Browser (Figure 4.2). Tracks are depicted horizontally, with a title above the track and labels
on the left. The first two lines show the scale and chromosomal position. The term that was
searched for and matched (HIF1A in this case) is highlighted on the annotation tracks. The
next tracks shown by default are gene prediction tracks. The default gene track on GRCh38
is the GENCODE Genes set, which replaces the UCSC Genes track that is still displayed on
GRCh37 and older human assemblies. GENCODE genes are annotated using a combination
of computational analysis and manual curation, and are used by the ENCODE Consortium
and other groups as reference gene sets (Box 4.2). The GENCODE v24 track depicts all of the
gene models from the GENCODE v24 release, which includes both protein-coding genes and
non-coding RNA genes.
Box 4.2 GENCODE
The GENCODE gene set was originally developed by the ENCODE Consortium as a com-
prehensive source of high-quality human gene annotations (Harrow et al. 2012). It has
now been expanded to include the mouse genome (Mudge and Harrow 2015). The goal of
the GENCODE project is to include all alternative splice variants of protein-coding loci, as
well as non-coding loci and pseudogenes. The GENCODE Consortium uses computational
methods, manual curation, and experimental validation to identify these gene features.
The first step is carried out by the same Ensembl gene annotation pipeline that is used
to annotate all vertebrate genomes displayed at Ensembl (Aken et al. 2016). This pipeline
aligns cDNAs, proteins, and RNA-seq data to the human genome in order to create can-
didate transcript models. All Ensembl transcript models are supported by experimental
evidence; no models are created solely from ab initio predictions. The Human and Verte-
brate Analysis and Annotation (HAVANA) group produces manually curated gene sets for
several vertebrate genomes, including mouse and human. These manually curated genes
are merged with the Ensembl transcript models to create the GENCODE gene sets for
mouse and human. A subset of the human models has been confirmed by an experimental
validation pipeline (Howald et al. 2012).
The consortium makes available two types of GENCODE gene sets. The Comprehen-
sive set encompasses all gene models, and may include many alternatively spliced tran-
scripts (isoforms) for each gene. The Basic set includes a subset of representative tran-
scripts for each gene that prioritizes full-length protein-coding transcripts over partial- or
non-protein-coding transcripts. The Ensembl Genome Browser displays the Comprehen-
sive set by default. Although the UCSC Genome Browser displays the Basic set by default,
the Comprehensive set can be selected by changing the GENCODE track settings. At the
time of this writing, Ensembl is displaying GENCODE v27, released in August 2017. The
GENCODE version available by default at the UCSC Genome Browser is v24, from Decem-
ber 2015. More recent versions of GENCODE can be added to the browser by selecting
them in the All GENCODE super-track.
GENCODE and RefSeq both aim to provide a comprehensive gene set for mouse and
human. Frankish et al. (2015) have shown that, in human, the RefSeq gene set is more
similar to the GENCODE Basic set, while the GENCODE Comprehensive set contains more
alternative splicing and exons, as well as more novel protein-coding sequences, thus cov-
ering more of the genome. They also sought to determine which gene set would provide
the best reference transcriptome for annotating variants. They found that the GENCODE
Comprehensive set, because of its better genomic coverage, was better for discovering new
variants with functional potential, while the GENCODE Basic set may be better suited for
applications where a less complex set of transcripts is needed. Similarly, Wu et al. (2013)
compared the use of different gene sets to quantify RNA-seq reads and determine gene
expression levels. Like Frankish et al., they recommend using less complex gene anno-
tations (such as the RefSeq gene set) for gene expression estimates, but more complex
gene annotations (such as GENCODE) for exploratory research on novel transcriptional or
regulatory mechanisms.
In the GENCODE track, as well as other gene tracks, exons (regions of the transcript that
align with the genome) are depicted as blocks, while introns are drawn as the horizontal
lines that connect the exons. The direction of transcription is indicated by arrowheads on
the introns. Coding regions of exons are depicted as tall blocks, while non-coding exons
are shorter. In this example, the GENCODE track depicts five alternatively spliced tran-
scripts, labeled HIF1A on the left, for the HIF1A gene. As shown by the arrowheads, all
transcripts are transcribed from left to right. The 5′-most exon of each transcript (on the
left side of the display) is shorter on the left, indicating an untranslated region (UTR), and
taller on the right, indicating a coding sequence. The reverse is true for the 3′-most exon
of each transcript. A very close visual inspection of the Genome Browser shows that the
last four HIF1A transcripts have a different pattern of exons from each other; a BLAST
search (not shown) reveals that the first two transcripts differ by only three nucleotides in
one exon. There is also a transcript labeled HIF1A-AS2, an anti-sense HIF1A transcript that
is transcribed from right to left. Another transcript, labeled RP11-618G20.1, is a synthetic
construct DNA. Zooming the display out by 3× allows a view of the genes immediately
upstream and downstream of HIF1A (Figure 4.3). A second HIF1A antisense transcript,
HIF1A-AS1, is also visible.
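The way gene tracks draw exons as blocks, with taller coding portions and shorter untranslated portions, can be sketched from a 12-column BED record. The example below assumes the standard BED12 field order (including `thickStart`, `thickEnd`, `blockSizes`, and `blockStarts`) and uses a made-up transcript; it is not taken from the GENCODE data, and the function name `exon_blocks` is ours.

```python
def exon_blocks(bed12_line: str) -> list:
    """Return exon intervals and their coding portions from a BED12 line.

    Coordinates are 0-based, half-open. The coding portion of an exon is
    its overlap with [thickStart, thickEnd); None means the exon is
    entirely untranslated and would be drawn as a short block.
    """
    fields = bed12_line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-delimited fields")
    chrom_start = int(fields[1])
    thick_start, thick_end = int(fields[6]), int(fields[7])
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]

    blocks = []
    for size, relative_start in zip(sizes, starts):
        exon_start = chrom_start + relative_start
        exon_end = exon_start + size
        coding_start = max(exon_start, thick_start)
        coding_end = min(exon_end, thick_end)
        coding = (coding_start, coding_end) if coding_start < coding_end else None
        blocks.append({"exon": (exon_start, exon_end), "coding": coding})
    return blocks


if __name__ == "__main__":
    # Hypothetical three-exon transcript on the forward strand.
    line = "chr1\t100\t1000\ttx1\t0\t+\t200\t900\t0\t3\t150,100,200\t0,400,700"
    for block in exon_blocks(line):
        print(block)
```

In this toy transcript, the first and last exons are only partly coding, which corresponds to the exons that appear shorter on one side and taller on the other in the browser display.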
The track below the GENCODE track is the RefSeq gene predictions from NCBI track. This is
a composite track showing human protein-coding and non-protein-coding genes taken from
the NCBI RNA reference sequences collection (RefSeq; Box 1.2). By default, the RefSeq track
is shown in dense mode, with the exons of the individual transcripts condensed into a single
line (Figure 4.2). Note that, in this dense mode, the exons are displayed as blocks, as in the
GENCODE track, but there are no arrowheads on the gene model to show the direction of
transcription. To change the display of the RefSeq track to view individual transcripts, open
the Track Settings page for the NCBI RefSeq track by clicking on the track name in the first row
of the Genes and Gene Predictions section (below the graphical view shown in Figure 4.2). The
resulting Track Settings page (Figure 4.4) allows the user to choose which type of RefSeqs to
display (e.g. all, curated only, or predicted only). In this example, we change the mode of the
RefSeq Curated track from dense to full, and the resulting graphical view (Figure 4.5) displays
each curated RefSeq as a separate transcript. In contrast to the GENCODE track, there are
only three RefSeq transcripts for the HIF1A gene, and the HIF1A-AS2 RefSeq transcript is
much shorter than the GENCODE transcript with the same name. These discrepancies are
due to differences in how the RefSeq and GENCODE transcript sets are assembled (Boxes 1.2
and 4.2).
Figure 4.3 The genomic context of the human HIF1A gene, after clicking on zoom out 3×. The genes immediately upstream (FLJ22447) and
downstream (SNAPC1) of HIF1A are now visible.
Figure 4.4 The RefSeq Track Settings page. The track settings pages are used to configure the display of annotation tracks. By default, all
of the RefSeq tracks are set to display in dense mode, with all features condensed into a single line. In this example, the Curated RefSeqs
are being set to display in full mode, in which each RefSeq transcript will be labeled and displayed on a separate line. The remainder of
the RefSeqs will be displayed in dense mode. The types of RefSeqs, curated and predicted, are described in Box 1.2. After changing the
settings, press the submit button to apply them.
Additional information about each transcript in the GENCODE and RefSeq tracks is available
by clicking on the gene symbol (HIF1A, in this case); as the original search was for HIF1A,
the gene name is highlighted in inverse type. For GENCODE genes, UCSC has collected
information from a variety of public sources and includes a text description, accession numbers,
expression data, protein structure, Gene Ontology terms, and more. For RefSeq transcripts,
UCSC provides links to NCBI resources. Both GENCODE and RefSeq details pages provide a
link to Genomic Sequence in the Sequence and Links section, allowing users to retrieve genomic
sequences connected to an individual transcript. From the selection menu (Figure 4.6), users
can choose whether to download the sequence upstream or downstream of the gene, as well
as the exon or intron sequence. The sequence is returned in FASTA format.
Figure 4.5 The genomic context of the human HIF1A gene, after displaying RefSeq Curated genes in full mode. Each RefSeq transcript is
now drawn on a separate line, so that individual exons, as well as the direction of transcription, are visible. Compare this rendition with
Figure 4.2, where all RefSeq transcripts are condensed on a single line.
Figure 4.6 The Get Genomic Sequence page that provides an interface for users to retrieve the sequence for a feature of interest. Click on
an individual transcript in the GENCODE or RefSeq track to open a page with additional details for that transcript. On either of those details
pages, click the link for Genomic Sequence to open the page displayed here, which provides choices for retrieving sequences upstream
or downstream of the transcript, as well as intron or exon sequences. In this example, retrieve the sequence 1000 nt upstream of the
annotated transcription start site. Shown in the inset is the result of retrieving the FASTA-formatted sequence 1000 nt upstream of the
HIF1A transcript.
Further down on the graphical view shown in Figure 4.3 are tracks from the ENCODE
Regulation super-track: Layered H3K27Ac and DNase Clusters. These data were generated
by the Encyclopedia of DNA Elements (ENCODE) Consortium between 2003 and 2012
(ENCODE Project Consortium 2012). The ENCODE Consortium has developed reagents
and tools to identify all functional elements in the human genome sequence. The Layered
H3K27Ac track indicates regions where there are modified histones that may indicate active
enhancers (Box 4.3).
Box 4.3 Histone Marks
Histone proteins package DNA into chromosomes. Post-translational modifications of
these histones can affect gene expression, as well as DNA replication and repair, by
changing chromatin structure or recruiting histone modifiers (Lawrence et al. 2016).
The post-translational modifications include methylation, phosphorylation, acetylation,
ubiquitylation, and sumoylation. Histone H3 is primarily acetylated on lysine residues,
methylated at arginine or lysine, or phosphorylated on serine or threonine. Histone H4
is primarily acetylated on lysine, methylated at arginine or lysine, or phosphorylated on
serine.
Histone modification (or “marking”) is identified by the name of the histone, the residue
on which it is marked, and the type of mark. Thus, H3K27Ac is histone H3 that is acetylated
on lysine 27, while H3K79me2 is histone H3 that is dimethylated on lysine 79. Different
histone marks are associated with different types of chromatin structure. Some are more
likely found near enhancers and others near promoters and, while some cause an increase
of expression from nearby genes, others cause less. For example, H3K4me3 is associ-
ated with active promoters, and H3K27me3 is associated with developmentally controlled
repressive chromatin states.
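The naming convention for histone marks is regular enough to be decoded automatically. The following sketch splits a mark name into histone, residue, position, and modification. The lookup tables cover only the residues and modifications mentioned in this box, and the abbreviation `ph` for phosphorylation is an assumption about common usage rather than something defined here.

```python
import re

# One-letter residue codes for the amino acids discussed in this box.
RESIDUES = {"K": "lysine", "R": "arginine", "S": "serine", "T": "threonine"}

# Modification abbreviations (lower-cased before lookup); "ph" is assumed.
MODIFICATIONS = {
    "ac": "acetylation",
    "me": "methylation",
    "me1": "monomethylation",
    "me2": "dimethylation",
    "me3": "trimethylation",
    "ph": "phosphorylation",
}

MARK_RE = re.compile(r"^(H\d[AB]?)([KRST])(\d+)([A-Za-z]+\d?)$")


def parse_histone_mark(name: str) -> dict:
    """Decode a mark such as 'H3K27Ac' into its four components."""
    match = MARK_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized histone mark: {name!r}")
    histone, residue, position, modification = match.groups()
    key = modification.lower()
    if key not in MODIFICATIONS:
        raise ValueError(f"unrecognized modification in {name!r}")
    return {
        "histone": histone,
        "residue": RESIDUES[residue],
        "position": int(position),
        "modification": MODIFICATIONS[key],
    }


if __name__ == "__main__":
    for mark in ("H3K27Ac", "H3K79me2", "H3K4me3"):
        print(mark, parse_histone_mark(mark))
```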
The DNase Clusters track depicts regions where chromatin is hypersensitive to cutting
by the DNaseI enzyme. In these hypersensitive regions, the nucleosome structure
is less compacted, meaning that the DNA is available to bind transcription factors.
Thus, regulatory regions, especially promoters, tend to be DNase sensitive. The track
settings for the ENCODE Regulation super-track allow other ENCODE tracks to be
added to the browser window, including additional histone modification and DNaseI
hypersensitivity data. Changing the display of the H3K4Me3 peaks from hide to
full highlights the peaks in the H3K4Me3 track near the 5′ ends of the HIF1A and
SNAPC1 transcripts that overlap with DNase hypersensitive sites (Figure 4.7, blue
highlights). These peaks may represent promoter elements that regulate the start of
transcription.
The UCSC Genome Browser displays data from NCBI’s Single Nucleotide Polymorphism
Database (dbSNP) in four SNP tracks. Common SNPs contains SNPs and small insertions and
deletions (indels) from NCBI’s dbSNP that have a minor allele frequency of at least 1% and
are mapped to a single location in the genome. Researchers looking for disease-causing SNPs
can use this track to filter their data, hypothesizing that their variant of interest will be rare
and therefore not displayed in this track. Flagged SNPs are those that are deemed by NCBI to
be clinically associated, while Mult. SNPs have been mapped to more than one region in the
genome. NCBI filters out most multiple-mapping SNPs as they may not be true SNPs, so there
are not many variants in this track. All SNPs includes all SNPs from the three subcategories.
dbSNP is in a continuous state of growth, and new data are incorporated a few times each year
as a new release, or new build, of dbSNP. These four SNP tracks are available for a few of the
most recent builds of dbSNP, indicated by the number in the track name. Thus, for example,
Common SNPs (150) are SNPs found in ≥1% of samples from dbSNP build 150.
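The 1% threshold that defines the Common SNPs track can be expressed in a few lines. The sketch below treats the minor allele frequency as the frequency of the second most common allele, which is one common definition and is assumed here. The variant identifiers and frequencies are invented for illustration and are not dbSNP records.

```python
def minor_allele_frequency(allele_freqs: dict) -> float:
    """Frequency of the second most common allele (0.0 if only one allele)."""
    if len(allele_freqs) < 2:
        return 0.0
    ordered = sorted(allele_freqs.values(), reverse=True)
    return ordered[1]


def is_common(allele_freqs: dict, threshold: float = 0.01) -> bool:
    """True when the minor allele frequency is at least the threshold."""
    return minor_allele_frequency(allele_freqs) >= threshold


if __name__ == "__main__":
    # Hypothetical variants mapped to allele frequencies.
    variants = {
        "variant_1": {"C": 0.70, "T": 0.30},
        "variant_2": {"A": 0.995, "G": 0.005},
    }
    rare = [name for name, freqs in variants.items() if not is_common(freqs)]
    print(rare)
```

A researcher filtering for rare candidate variants, as described above, would keep the variants for which `is_common` returns `False`.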
By default, the Common SNPs (150) track is displayed in dense mode, with all variants in the
region compressed onto a single line. Variants in the Common SNPs track are color coded by
function. Open the Track Settings for this track in order to modify the display (Figure 4.8). Set
the Display mode to pack in order to show each variant separately. At the same time, modify the
Coloring Options so that SNPs in UTRs of transcripts are set to blue and SNPs in coding regions
of transcripts are set to green if they are synonymous (no change to the protein sequence) or
red if they are non-synonymous (altering the protein sequence), with all remaining classes of
SNPs set to display in black. Note the changes in the resulting browser window, with the green
synonymous and blue untranslated SNPs clearly visible (Figure 4.9).
Figure 4.7 The genomic context of the human HIF1A gene, after changing the display of the H3K4Me3 peaks from hide to full. The H3K4Me3
track is part of the ENCODE Regulation super-track. Below the graphic display window in Figure 4.5, open up the ENCODE Regulation
Super-track, in the Regulation menu. Change the track display from hide to full to reproduce the page shown here. Note that the H3K4Me3
peaks, which can indicate promoter regions (Box 4.3), overlap with the transcription starts of the SNAPC1 and HIF1A genes (light blue
highlight). These regions also overlap with the DNase HS track, indicating that the chromatin should be available to bind transcription
factors in this region. The highlights were added within the Genome Browser using the Drag-and-select tool. This tool is accessed by
clicking anywhere in the Scale track at the top of the Genome Browser display and dragging the selection window across a region of
interest. The Drag-and-select tool provides options to Highlight the selected region or Zoom directly to it.
Figure 4.8 Configuring the track settings for the Common SNPs(150) track. Set the Coloring Options so that all SNPs are black, except for
untranslated SNPs (blue), coding-synonymous SNPs (green), and coding-non-synonymous SNPs (red). In addition, change the Display mode
of the track from dense to pack so that the individual SNPs can be seen. By default, the function of each variant is defined by its position
within transcripts in the GENCODE track. However, the track used for annotation can be changed in the settings called Use Gene Tracks for
Functional Annotation.
Figure 4.9 The genomic context of the human HIF1A gene, after changing the colors and display mode of the Common SNPs(150) track as
shown in Figure 4.8. The SNPs in the 5′ and 3′ untranslated regions of the HIF1A GENCODE transcripts are now colored blue, while the
coding-synonymous SNP is colored green.
Two types of Expression tracks display data from the NIH Genotype-Tissue Expression
(GTEx) project (GTEx Consortium 2015). The GTEx Gene track displays gene expression
levels in 51 tissues and two cell lines, based on RNA-seq data from 8555 samples. The GTEx
Transcript track provides additional analysis of the same data and displays median transcript
expression levels. By default, the GTEx Gene track is shown in pack mode, while the GTEx
Transcript track is hidden. Figure 4.10 shows the Gene track in pack display mode, in the
region of the phenylalanine hydroxylase (PAH) gene. The height of each bar in the bar graph
represents the median expression level of the gene across all samples for a tissue, and the
bar color indicates the tissue. The PAH gene is highly expressed in kidney and liver (the two
brown bars). The expression is more clearly visible in the details page for the GTEx track
(Figure 4.10, inset, purple box). The GTEx Transcript track is similar, but depicts expression
for individual transcripts rather than an average for the gene.
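The bar heights in the GTEx Gene track are medians taken across all samples of a tissue. A minimal sketch of that summary step is shown below; the tissue names and expression values are invented and are not GTEx data, and `median_by_tissue` is a name chosen for this example.

```python
from statistics import median


def median_by_tissue(samples: list) -> dict:
    """Group (tissue, expression) pairs by tissue and take the median of each."""
    grouped = {}
    for tissue, value in samples:
        grouped.setdefault(tissue, []).append(value)
    return {tissue: median(values) for tissue, values in grouped.items()}


if __name__ == "__main__":
    # Hypothetical per-sample expression values for a single gene.
    samples = [
        ("Liver", 10.0), ("Liver", 30.0), ("Liver", 20.0),
        ("Lung", 0.0), ("Lung", 1.0),
    ]
    for tissue, level in median_by_tissue(samples).items():
        print(tissue, level)
```

Using the median rather than the mean keeps a single outlying sample from dominating the bar for a tissue.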
Figure 4.10 The GTEx Gene track, which depicts median gene expression levels in 51 tissues and two cell lines, based on RNA-seq data
from the GTEx project from 8555 tissue samples. The main browser window depicts the GTEx Gene track for the human PAH gene, showing
high expression in the two tissues colored brown (liver and kidney) but low or no expression in others. Clicking on the GTEx track opens it
in a larger window, shown in the inset.
An alternate entry point to the UCSC Genome Browser is via a BLAT search (see Chapter 3),
where a user can input a nucleotide or protein sequence to find an aligned region in a
selected genome. BLAT excels at quickly identifying a matching sequence in the same or a highly
similar organism. We will attempt to use BLAT to find a lizard homolog of the human gene
disintegrin and metalloproteinase domain-containing protein 18 (ADAM18). The ADAM18
protein sequence is copied in FASTA format from the NCBI view of accession number
NP_001307242.1 and pasted into the BLAT Search box that can be accessed from the Tools
pull-down menu; the method for retrieving this sequence in the correct format is described
in Chapter 2. Select the lizard genome and assembly AnoCar2.0/anoCar2. BLAT will auto-
matically determine that the query sequence is a protein and will compare it with the lizard
genome translated in all six reading frames. A single result is returned (Figure 4.11a). The
alignment between the ADAM18 protein sequence and lizard chromosome Un_GL343418
runs from amino acid 368 to amino acid 383, with 81.3% identity. The browser link depicts
the genomic context of this 48 nt hit (Figure 4.11b). Although the ADAM18 protein sequence
aligns to a region in which other human ADAM genes have also been aligned, the other
human genes are represented by a thin line, indicating a gap in their alignment. The details
link shown in Figure 4.11a produces the alignment between the ADAM18 protein and lizard
chromosome Un_GL343418 (Figure 4.11c). The top section of the results shows the protein
query sequence, with the blue letters indicating the short region of alignment with the
genome. The bottom section shows the pairwise alignment between the protein and genomic
sequence translated in six frames. Vertical black lines indicate identical sequences. Taken
together, the BLAT results show that only 16 amino acids of the 715 amino acid ADAM18
protein align to the lizard genome (Figure 4.11c). This alignment is short and likely does not
represent a homologous region between the ADAM18 protein and the lizard genome. Thus,
the BLAT algorithm, although fast, is not always sensitive enough to detect cross-species
orthologs. The BLAST algorithm, described in the Ensembl Genome Browser section, is more
sensitive, and is a better choice for identifying such homologs.
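The figures quoted for this BLAT hit follow from simple arithmetic on the query coordinates, which the sketch below reproduces. It assumes that the reported start and end positions on the query are 1-based and inclusive; the function names are ours.

```python
def aligned_length(query_start: int, query_end: int) -> int:
    """Number of query residues covered by a 1-based, inclusive alignment."""
    return query_end - query_start + 1


def nucleotide_span(amino_acids: int) -> int:
    """Nucleotides encoding an ungapped run of amino acids (three per codon)."""
    return amino_acids * 3


def query_coverage(aligned: int, query_size: int) -> float:
    """Percentage of the query sequence included in the alignment."""
    return 100.0 * aligned / query_size


if __name__ == "__main__":
    residues = aligned_length(368, 383)   # coordinates from the BLAT result
    print(residues)                       # amino acids aligned
    print(nucleotide_span(residues))      # size of the genomic hit in nt
    print(f"{query_coverage(residues, 715):.1f}% of the ADAM18 protein")
```

A coverage of only a few percent of the query is one quick indication that a hit is unlikely to represent a full-length homolog.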
Figure 4.11 BLAT search at the UCSC Genome Browser. (a) This page shows the results of running a BLAT search against the lizard
genome, using as a query the human protein sequence of the gene ADAM18, accession NP_001307242.1. The ADAM18 protein sequence
is available from NCBI at www.ncbi.nlm.nih.gov/protein/NP_001307242.1?report=fasta. At the UCSC Genome Browser, the web inter-
face to the BLAT search is in the Tools menu at the top of each page. The BLAT search was run against the lizard genome assembly from
May 2010, also called anoCar2. The columns on the results page are as follows: ACTIONS, links to the browser (Figure 4.11b) and details
(Figure 4.11c); QUERY, the name of the query sequence; SCORE, the BLAT score, determined by the number of matches vs. mismatches
in the final alignment of the query to the genome; START, the start coordinate of the alignment, on the query sequence; END, the end
coordinate of the alignment, on the query sequence; QSIZE, the length of the query; IDENTITY, the percent identity between the query
and the genomic sequences; CHRO, the chromosome to which the query sequence aligns; STRAND, the chromosome strand to which
the query sequence aligns; START, the start coordinate of the alignment, on the genomic sequence; END, the end coordinate of the
alignment, on the genomic sequence; and SPAN, the length of the alignment, on the genomic sequence. Note that, in this example,
there is a single alignment; searches with other sequences may result in many alignments, each shown on a separate line. It is possible
to search with up to 25 sequences at a time, but each sequence must be in FASTA format. (b) This page shows the browser link from the
BLAT summary page. The alignment between the query and genome is shown as a new track called Your Sequence from BLAT Search.
(c) The details link from the BLAT summary page, showing the alignment between the query (human ADAM18 protein) and the lizard
genome, translated in six frames. The protein query sequence is shown at the top, with the blue letters indicating the amino acids
that align to the genome. The bottom section shows the pairwise alignment between the protein and genomic sequence translated in
six frames. Black lines indicate identical sequences; red and green letters indicate where the genomic sequence encodes a different
amino acid. Although the ADAM18 protein sequence has a length of 715 amino acids, only 16 amino acids align as a single block to
the lizard genome.
UCSC Table Browser
The Table Browser tool provides users a text-based interface with which to query, inter-
sect, filter, and download the data that are displayed graphically in the Genome Browser.
These data can then be saved in a spreadsheet for further analysis, or used as input into a
different program. Using a web-based interface, users select a genome assembly, track, and
position, then choose how to manipulate that track data and what fields to return. This
example will demonstrate how to retrieve a list of all NCBI mRNA reference sequences that
overlap with an SNP from the Genome-Wide Association Study (GWAS) Catalog track, which
identifies genetic loci associated with common diseases or traits. The GWAS Catalog is a
manually curated collection of published genome-wide association studies that assayed at
least 100 000 SNPs, in which all SNP-trait associations have p values of <1 × 10⁻⁵ (Buniello
et al. 2019).
The Table Browser landing page is accessible from either the UCSC Genome Browser home
page or the Tools pull-down menu. First, reset all user cart settings by clicking on the click here
link at the bottom of the Table Browser settings section.
Then, select the NCBI RefSeq track on the GRCh38 genome assembly (Figure 4.12a). Create
a filter to limit the search to curated mRNA reference sequences in the NM_ accession series
(Box 1.2; Figure 4.12b). Next, intersect the RefSeq track with variants from the GWAS Catalog
(Figure 4.12c). Finally, on the Table Browser form, change the output format to hyperlinks to
Genome Browser, then click get output. The output is a list of 3000+ RefSeq mRNAs that overlap
with a variant from the GWAS Catalog (Figure 4.12d). The Genome Browser view of one of the
transcripts, from the gene arginine–glutamic acid dipeptide (RE) repeats (RERE), and the six
SNPs from the GWAS Catalog that it overlaps, can be found by clicking on the first link in the
results list and is shown in Figure 4.12e.
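The logic of this Table Browser query, a filter on accession prefix followed by an intersection with variant positions, can be sketched as follows. The accessions and coordinates are placeholders invented for illustration, intervals are assumed to be 0-based and half-open, and the simple nested scan is meant for clarity rather than for genome-scale data.

```python
def overlaps(start_a: int, end_a: int, start_b: int, end_b: int) -> bool:
    """True when two half-open intervals share at least one base."""
    return start_a < end_b and start_b < end_a


def curated_mrnas_with_variants(transcripts: list, variants: list) -> list:
    """Return NM_ accessions whose interval overlaps any variant.

    transcripts: (accession, chrom, start, end) tuples
    variants:    (chrom, start, end) tuples
    """
    hits = []
    for accession, chrom, start, end in transcripts:
        if not accession.startswith("NM_"):
            continue  # keep only curated mRNA reference sequences
        for v_chrom, v_start, v_end in variants:
            if v_chrom == chrom and overlaps(start, end, v_start, v_end):
                hits.append(accession)
                break
    return hits


if __name__ == "__main__":
    # Hypothetical transcripts and a single hypothetical GWAS variant.
    transcripts = [
        ("NM_demo1", "chr1", 100, 500),
        ("NR_demo2", "chr1", 100, 500),
        ("NM_demo3", "chr1", 600, 700),
    ]
    variants = [("chr1", 250, 251)]
    print(curated_mrnas_with_variants(transcripts, variants))
```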