9 Your data

Note that this section will need to be refined once we have gone through this process. Currently we are focused on releasing the initial version of recount3.

9.1 Process your data

While recount_pump contains code for generating recount3, if you want to use the Monorail RNA-seq processing pipeline (alignment/quantification) for your own data, we highly recommend that you start by looking at the monorail-external GitHub repository. It contains a very detailed README with instructions on how to install the software, download the annotation files, and process raw RNA-seq data (with some example files used for illustration purposes).

In order to add your data to recount3, we will need all the raw files for your study. That is:

  • the gene count text files for each annotation we support (organism-dependent)
  • similarly, the exon count text files for each annotation
  • the exon-exon junction files [1]
  • the five metadata files
  • the bigWig files for each sample

We will then need to organize your data into the directory structure expected by recount3.

9.2 Contribute your collections

As described in the raw files section, collections involve creating a metadata file for a custom set of samples across one or more studies (typically more than one study). You can find some example collections at http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/. For example, http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/gtex_geuvadis/metadata/ contains two files:

  • <collection_name>.custom.gz: gtex_geuvadis.custom.gz
  • <collection_name>.recount_project.gz: gtex_geuvadis.recount_project.gz (optional!)

These files include the 3 columns we use internally for identifying all samples:

  • rail_id: used by Snaptron
  • study: the project name. See recount3::available_projects() for supported options (or use your own custom recount3_url).
  • external_id: typically the SRA run ID, although each data source uses its own unique IDs. TCGA, for example, uses much longer sample IDs.
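
As a minimal sketch of the three identifier columns listed above, the following base R code checks that a metadata table contains them and that none of their values are empty. The helper name check_collection_ids() is hypothetical and is not part of recount3; the two rows are taken from the gtex_geuvadis example collection.

```r
## Hypothetical helper (not part of recount3): check that a collection
## metadata table has the three identifier columns with no missing values.
check_collection_ids <- function(metadata) {
    required <- c("rail_id", "study", "external_id")
    missing_cols <- setdiff(required, colnames(metadata))
    if (length(missing_cols) > 0) {
        stop(
            "Missing identifier column(s): ",
            paste(missing_cols, collapse = ", ")
        )
    }
    ids <- metadata[, required, drop = FALSE]
    ## Flag columns containing NA or empty strings
    has_gap <- vapply(
        ids,
        function(x) any(is.na(x) | as.character(x) == ""),
        logical(1)
    )
    if (any(has_gap)) {
        stop(
            "Empty values in: ",
            paste(names(has_gap)[has_gap], collapse = ", ")
        )
    }
    invisible(TRUE)
}

## Two rows taken from the gtex_geuvadis example collection
example_ids <- data.frame(
    rail_id = c(98371L, 153881L),
    study = c("ERP001942", "BLADDER"),
    external_id = c("ERR188021", "GTEX-U3ZM-0826-SM-4DXU6.1"),
    stringsAsFactors = FALSE
)
check_collection_ids(example_ids)
```

This check only looks at the structure of the table. Whether each study value is a supported project would still need to be compared against recount3::available_projects().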

9.2.1 Collection metadata

The <collection_name>.custom.gz file can then contain any additional columns of interest. For example, in the recount-brain project we manually curated samples from 62 studies and standardized variables across them.

Here’s what the example collection metadata file looks like:

read.delim(
    recount3::file_retrieve(
        "http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/gtex_geuvadis/metadata/gtex_geuvadis.custom.gz"
    )
)
## 2024-12-13 16:45:03.109457 caching file gtex_geuvadis.custom.gz.
## adding rname 'http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/gtex_geuvadis/metadata/gtex_geuvadis.custom.gz'
##   rail_id               external_id     study sequencing_type  tissue
## 1   98371                 ERR188021 ERP001942            bulk     LCL
## 2  153881 GTEX-U3ZM-0826-SM-4DXU6.1   BLADDER            bulk BLADDER

This file has to be readable with R using the following code:

utils::read.delim(
    file_path,
    sep = "\t",
    check.names = FALSE,
    quote = "",
    comment.char = ""
)
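
One way to produce a file that satisfies this requirement is sketched below using base R only. It writes a small gzip-compressed, tab-separated table and reads it back with the same read.delim() arguments. The collection name my_collection and the use of tempdir() are assumptions for illustration; the rows are copied from the gtex_geuvadis example.

```r
## Example rows copied from the gtex_geuvadis collection metadata
collection <- data.frame(
    rail_id = c(98371L, 153881L),
    external_id = c("ERR188021", "GTEX-U3ZM-0826-SM-4DXU6.1"),
    study = c("ERP001942", "BLADDER"),
    sequencing_type = c("bulk", "bulk"),
    tissue = c("LCL", "BLADDER"),
    stringsAsFactors = FALSE
)

## Hypothetical collection name; the file is written to a temporary directory
file_path <- file.path(tempdir(), "my_collection.custom.gz")

## Write a gzip-compressed, tab-separated file without quotes or row names
con <- gzfile(file_path, open = "wt")
utils::write.table(
    collection,
    file = con,
    sep = "\t",
    quote = FALSE,
    row.names = FALSE
)
close(con)

## Read the file back with the same arguments as the call above
round_trip <- utils::read.delim(
    file_path,
    sep = "\t",
    check.names = FALSE,
    quote = "",
    comment.char = ""
)

## The column names and number of samples should survive the round trip
stopifnot(identical(colnames(round_trip), colnames(collection)))
stopifnot(nrow(round_trip) == nrow(collection))
```

Writing with quote = FALSE matters here because the file is read with quote = "", so any quote characters in the file would otherwise be kept as part of the values.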

Since this is the main file that needs to be produced for adding a collection to recount3, if your collection involves data that is already present in recount3, it will be very easy for us to add it to our resource. Otherwise we will need the raw files for the corresponding new data.

9.2.2 Collection project location

The second text file specifies the required information for locating the samples. While we had considered using this type of file, we don’t require it anymore. For historical purposes, this is how that file looked: [2]

read.delim(
    recount3::file_retrieve(
        "http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/gtex_geuvadis/metadata/gtex_geuvadis.recount_project.gz"
    )
)
## 2024-12-13 16:45:03.666874 caching file gtex_geuvadis.recount_project.gz.
## adding rname 'http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/gtex_geuvadis/metadata/gtex_geuvadis.recount_project.gz'
##   rail_id               external_id     study   project     organism
## 1   98371                 ERR188021 ERP001942 ERP001942 Homo sapiens
## 2  153881 GTEX-U3ZM-0826-SM-4DXU6.1   BLADDER   BLADDER Homo sapiens
##         file_source           metadata_source date_processed
## 1  data_sources/sra collections/gtex_geuvadis     2019-10-01
## 2 data_sources/gtex collections/gtex_geuvadis     2019-11-01

  1. Hm… I think this will be tricky if new exon-exon junctions are found in someone’s data which are absent in recount3. This is VERY likely to happen!

  2. Once we have more collections I can verify this!