5 Bioconductor

R is an open source software programming language that has been widely used for studying gene expression data thanks in part to the Bioconductor project. You can install R for free from CRAN on macOS, Windows, and Linux operating systems.

The Bioconductor community has defined common structures that enable users to store their data in common formats and analyzed through many different solutions (DOI: 10.1038/nmeth.3252). One such common format is the RangedSummarizedExperiment object from the SummarizedExperiment package that can then be used in downstream analayses of differential expression using packages such as DESeq2, edgeR, and limma.

To facilitate the re-analysis of public expression data, we have provided several R/Bioconductor packages that interface with the recount3 data. These are:

These packages can also benefit from using recount and derfinder, which we made in earlier phases of the ReCount project.

In the Bioconductor landing pages for our packages you can find detailed vignette documents, such as the one called recount3 quick start guide. Vignette documents are user guides illustrating how the different functions in a given R package can be used together. The code shown in this chapter is a subset from these vignette documents, thus we highly recommend that you read the vignettes if you want to learn more details about these R packages than what we have illustrated here.

5.1 recount3

The following information is taken from the recount3 documentation.

After installing recount3, we need to load the package, which will automatically load the required dependencies.

## Load recount3 R package
library("recount3")

If you have identified a study of interest and want to access the gene level expression data, use create_rse() as shown below. create_rse() has arguments that will allow you to specify the annotation of interest for the given organism, and whether you want to download gene, exon or exon-exon junction expression data.

## Find all available human projects
human_projects <- available_projects()
## 2021-09-15 14:26:37 caching file sra.recount_project.MD.gz.
## adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/sra.recount_project.MD.gz'
## 2021-09-15 14:26:42 caching file gtex.recount_project.MD.gz.
## adding rname 'http://duffel.rail.bio/recount3/human/data_sources/gtex/metadata/gtex.recount_project.MD.gz'
## 2021-09-15 14:26:44 caching file tcga.recount_project.MD.gz.
## adding rname 'http://duffel.rail.bio/recount3/human/data_sources/tcga/metadata/tcga.recount_project.MD.gz'
## Find the project you are interested in,
## here we use SRP009615 as an example
proj_info <- subset(
    human_projects,
    project == "SRP009615" & project_type == "data_sources"
)

## Create a RangedSummarizedExperiment (RSE) object at the gene level
rse_gene_SRP009615 <- create_rse(proj_info)
## 2021-09-15 14:26:51 downloading and reading the metadata.
## 2021-09-15 14:26:52 caching file sra.sra.SRP009615.MD.gz.
## adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.sra.SRP009615.MD.gz'
## 2021-09-15 14:26:53 caching file sra.recount_project.SRP009615.MD.gz.
## adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_project.SRP009615.MD.gz'
## 2021-09-15 14:26:55 caching file sra.recount_qc.SRP009615.MD.gz.
## adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_qc.SRP009615.MD.gz'
## 2021-09-15 14:26:56 caching file sra.recount_seq_qc.SRP009615.MD.gz.
## adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_seq_qc.SRP009615.MD.gz'
## 2021-09-15 14:26:58 caching file sra.recount_pred.SRP009615.MD.gz.
## adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_pred.SRP009615.MD.gz'
## 2021-09-15 14:26:59 downloading and reading the feature information.
## 2021-09-15 14:27:00 caching file human.gene_sums.G026.gtf.gz.
## adding rname 'http://duffel.rail.bio/recount3/human/annotations/gene_sums/human.gene_sums.G026.gtf.gz'
## 2021-09-15 14:27:04 downloading and reading the counts: 12 samples across 63856 features.
## 2021-09-15 14:27:04 caching file sra.gene_sums.SRP009615.G026.gz.
## adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/gene_sums/15/SRP009615/sra.gene_sums.SRP009615.G026.gz'
## 2021-09-15 14:27:08 construcing the RangedSummarizedExperiment (rse) object.
## Explore that RSE object
rse_gene_SRP009615
## class: RangedSummarizedExperiment 
## dim: 63856 12 
## metadata(8): time_created recount3_version ... annotation recount3_url
## assays(1): raw_counts
## rownames(63856): ENSG00000278704.1 ENSG00000277400.1 ...
##   ENSG00000182484.15_PAR_Y ENSG00000227159.8_PAR_Y
## rowData names(10): source type ... havana_gene tag
## colnames(12): SRR387777 SRR387778 ... SRR389077 SRR389078
## colData names(175): rail_id external_id ...
##   recount_pred.curated.cell_line BigWigURL

You can also interactively choose your study of interest

## Note that you can interactively explore the available projects
proj_info_interactive <- interactiveDisplayBase::display(human_projects)

## Select a single row, then hit "send". The following code checks this.
stopifnot(nrow(proj_info_interactive) == 1)

## Then create the RSE object
rse_gene_interactive <- create_rse(proj_info_interactive)

Once you have a RSE file, you can use transform_counts() to transform the raw coverage counts.

## Once you have your RSE object, you can transform the raw coverage
## base-pair coverage counts using transform_counts().
## For RPKM, TPM or read outputs, check the details in transform_counts().
assay(rse_gene_SRP009615, "counts") <- transform_counts(rse_gene_SRP009615)

Now you are ready to continue with downstream analysis software.

recount3 also supports accessing the BigWig raw coverage files as well as specific study or collection sample metadata. Please continue to the users guide for more detailed information.

5.2 snapcount

The following information is taken from the snapcount documentation. As stated in the documentation, snapcount provides access to data from the recount2 project, but it can also be used with the data from the recount3 project.

To get started, we need to load the snapcount package into our R session. This will load all the required dependencies.

library("snapcount")

snapcount makes it easy to query the Snaptron web services, with results presented as RangedSummarizedExperiment objects. You can query measurements for genes, exons, splice junctions and coverage vectors from the RNA-seq samples indexed in Snaptron. Samples are organized into compilations (e.g. srav2) that altogether contain the same studies and summaries in the recount resource. Queries can be filtered to narrow the focus to particular genes or genomic intervals, to events with certain prevalence, to events that do or don’t appear in gene annotation, or to samples with particular metadata.

snapcount complements the recount package, which also allows for searching by coordinates or by HUGO gene names. In general, recount works best when you are interested in all the genes, exons, or splice junctions in a study, whereas snapcount is best for queries over a particular subset of genes or intervals across all or most of the samples in recount2/Snaptron. The more specific your query, the faster and easier it will be to use snapcount.

All the RNA-seq samples in the recount and Snaptron resources were analyzed using the Rail-RNA aligner. Rail produces spliced alignments that in turn produce coverage and junction-level summaries that are further processed to obtain the data in recount and Snaptron.

Similar to the recount all coverage counts in Snaptron/snapcount are stored/retrieved as raw, un-normalized counts.

5.2.1 Basic queries

A basic query returns a RangedSummarizedExperiment (RSE) object with one or more features (genes, exons, or splice junctions) as rowRanges. Raw coverage counts are returned as the counts assay in the RSE Full sample metadata is returned as the colData of the RSE

Basic queries include:

  • exon-exon splice junction counts: query_jx
  • exon-level quantifications: query_exon
  • genes-level quantifications: query_gene

The Gencode v25 annotation defines what genes and exons can be queried (as in recount). For splice junctions, both annotated and novel junctions are queried at the same time unless one is explicitly filtered.

Metadata columns will vary by compilation (e.g. TCGA vs. GTEx). Please be cautious with metadata. We strive to make it as complete and usable as possible, but it can still be incomplete, incorrect or inconsistently formatted (e.g. “age” in the srav2 compilation).

Start by querying for all junctions within the region of gene CD99:

##Query coverage for gene, exon, and annotated junctions across all
#in the region of the CD99 gene
#from GTEx v6 sample compilation
#CD99 is chosen for its size
sb <- QueryBuilder(compilation="gtex", regions="CD99")
cd99.gene <- query_gene(sb)
dim(cd99.gene)
## [1]    1 9662
head(cd99.gene)
## class: RangedSummarizedExperiment 
## dim: 1 9662 
## metadata(1): uri
## assays(1): counts
## rownames(1): 23
## rowData names(12): DataSource:Type snaptron_id ... coverage_median
##   compilation_id
## colnames(9662): rail_50099 rail_50100 ... rail_59759 rail_59760
## colData names(322): rail_id Run ... junction_coverage
##   junction_avg_coverage

Now query for junctions:

##Query all exon-exon splice junctions within the region of gene CD99
cd99.jx.all <- query_jx(sb)
dim(cd99.jx.all)
## [1] 3485 9662
cd99.jx.all
## class: RangedSummarizedExperiment 
## dim: 3485 9662 
## metadata(1): uri
## assays(1): counts
## rownames(3485): 28340058 28340273 ... 28352407 28352408
## rowData names(12): DataSource:Type snaptron_id ... coverage_median
##   source_dataset_id
## colnames(9662): rail_50099 rail_50100 ... rail_59759 rail_59760
## colData names(322): rail_id Run ... junction_coverage
##   junction_avg_coverage

Now query for junctions and filter by sample type: Brain:

#now subfilter by sample tissue
#GTEx samples that are labeled with tissue type "Brain"
sb <- set_column_filters(sb, SMTS == "Brain")
cd99.jx.all <- query_jx(sb)
dim(cd99.jx.all)
## [1]  637 1409
head(cd99.jx.all)
## class: RangedSummarizedExperiment 
## dim: 6 1409 
## metadata(1): uri
## assays(1): counts
## rownames(6): 28340058 28341126 ... 28341507 28341508
## rowData names(12): DataSource:Type snaptron_id ... coverage_median
##   source_dataset_id
## colnames(1409): rail_50099 rail_50105 ... rail_59748 rail_59753
## colData names(322): rail_id Run ... junction_coverage
##   junction_avg_coverage

Same query/filter again but for exons:

cd99.exon <- query_exon(sb)
dim(cd99.exon)
## [1]   32 1409
head(cd99.exon)
## class: RangedSummarizedExperiment 
## dim: 6 1409 
## metadata(1): uri
## assays(1): counts
## rownames(6): 691 692 ... 695 696
## rowData names(12): DataSource:Type snaptron_id ... coverage_median
##   compilation_id
## colnames(1409): rail_50099 rail_50105 ... rail_59748 rail_59753
## colData names(322): rail_id Run ... junction_coverage
##   junction_avg_coverage

Now query junctions but further filter for only those which are fully annotated:

###Only query junctions which are fully annotated---both left and
#right splice sites are found together in one or more of the
#Snaptron sourced annotations
sb <- set_row_filters(sb, annotated == 1)
cd99.jx <- query_jx(sb)
dim(cd99.jx)
## [1]   23 1409
head(cd99.jx)
## class: RangedSummarizedExperiment 
## dim: 6 1409 
## metadata(1): uri
## assays(1): counts
## rownames(6): 28350154 28350172 ... 28350658 28350686
## rowData names(12): DataSource:Type snaptron_id ... coverage_median
##   source_dataset_id
## colnames(1409): rail_50099 rail_50105 ... rail_59748 rail_59753
## colData names(322): rail_id Run ... junction_coverage
##   junction_avg_coverage

Now check an example of the metadata stored in the RSE, InsertSize:

##Metadata is stored directly in the RSE object.
#For example the library insert size can be retrieved
#across all runs  in the RSE
head(cd99.jx.all$InsertSize)
## [1] 162 302 163 150 223 150

5.3 megadepth

The following information is taken from the megadepth documentation.

To get started, we need to load the megadepth package into our R session. This will load all the required dependencies.

library("megadepth")

Once we have the R package loaded, we need to install the Megadepth software. We can do so with install_megadepth(), which downloads a binary for your OS (Linux, Windows or macOS).2 We can then use with an example BigWig file to compute the coverage at a set of regions.

## Install the latest version of Megadepth
install_megadepth(force = TRUE)
## The latest megadepth version is 1.1.1
## This is not an interactive session, therefore megadepth has been installed temporarily to 
## /tmp/RtmpY0c1zv/megadepth

Next, we might want to use megadepth to quantify the coverage at a set of regions of the genome of interest to us. Here we will use two example files that are include in the package for illustration and testing purposes. One of them is a bigWig file that contains the base-pair coverage information for a sample of interest and the second one is BED file which contains the genomic region coordinates of interest. So we first locate them.

## Next, we locate the example BigWig and annotation files
example_bw <- system.file("tests", "test.bam.all.bw",
    package = "megadepth", mustWork = TRUE
)
annotation_file <- system.file("tests", "testbw2.bed",
    package = "megadepth", mustWork = TRUE
)

## Where are they?
example_bw
## [1] "/usr/local/lib/R/host-site-library/megadepth/tests/test.bam.all.bw"
annotation_file
## [1] "/usr/local/lib/R/host-site-library/megadepth/tests/testbw2.bed"

Once we have located the example files we can proceed to calculating the base-pair coverage for our genomic regions of interest. There are several ways to do this with megadepth, but here we use the user-friendly function get_coverage(). This function will perform a given operation op on the bigWig file for a given set of regions of interest (annotation). One of those operations is to compute the mean base-pair coverage for each input region. This is what we’ll do with our example bigWig file.

## We can then use megadepth to compute the coverage
bw_cov <- get_coverage(
    example_bw,
    op = "mean",
    annotation = annotation_file
)
bw_cov
## GRanges object with 4 ranges and 1 metadata column:
##         seqnames          ranges strand |     score
##            <Rle>       <IRanges>  <Rle> | <numeric>
##   [1]      chr10            0-10      * |      0.00
##   [2]      chr10 8756697-8756762      * |     15.85
##   [3]      chr10 4359156-4359188      * |      3.00
##   [4] GL000219.1   168500-168620      * |      1.26
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

get_coverage() returns an object that is familiar to GenomicRanges users, that is, a GRanges object that can be used with other Bioconductor software packages.

This example is just the tip of the iceberg, as Megadepth and thus megadepth can do a lot of useful processing operations on BAM and bigWig files.

Please continue to the users guide for more detailed information.


  1. Please check Megadepth for instructions on how to compile the software from source if the binary version doesn’t work for you.↩︎