9 Your data
Note that this section will need to be refined once we have gone through this process. Currently we are focused on releasing the initial version of recount3.
9.1 Process your data
While recount_pump contains code for generating recount3, if you want to use the Monorail RNA-seq processing pipeline (alignment/quantification) for your own data, we highly recommend that you start by looking at the monorail-external GitHub repository. It contains a very detailed README with instructions on how to install the software, download the annotation files, and process raw RNA-seq data (with some example files used for illustration purposes).
In order to add your data to recount3, we will need all the raw files for your study. That is:
- the gene count text files for each annotation we support (organism-dependent)
- similarly, the exon count text files for each annotation
- the exon-exon junction files
- the five metadata files
- the bigWig files for each sample
We will then need to organize your data into the directory structure expected by recount3.
9.2 Contribute your collections
As described in the raw files section, collections involve creating a metadata file for a custom set of samples across one or more studies (typically more than one study). You can find some example collections at http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/. For example, http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/gtex_geuvadis/metadata/ contains two files:
- <collection_name>.custom.gz: gtex_geuvadis.custom.gz
- <collection_name>.recount_project.gz: gtex_geuvadis.recount_project.gz (optional!)
These files include the 3 columns we use internally for identifying all samples:
- rail_id: used by Snaptron
- study: the project name. See recount3::available_projects() for supported options (or use your own custom recount3_url).
- external_id: typically the SRA run ID, but each data source has different unique IDs, like TCGA which uses much longer sample IDs.
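As a rough illustration of these three identifier columns, the base R sketch below checks that a collection metadata table contains them. The helper name check_collection_ids() and the object example_ids are hypothetical and are not part of recount3; the two rows are copied from the gtex_geuvadis example collection.

```r
## Hypothetical helper (not part of recount3): check that a collection
## metadata table contains the three identifier columns.
check_collection_ids <- function(metadata) {
    required <- c("rail_id", "external_id", "study")
    missing_cols <- setdiff(required, colnames(metadata))
    if (length(missing_cols) > 0) {
        stop(
            "Missing required column(s): ",
            paste(missing_cols, collapse = ", "),
            call. = FALSE
        )
    }
    invisible(TRUE)
}

## Toy table with values taken from the gtex_geuvadis example
example_ids <- data.frame(
    rail_id = c(98371, 153881),
    external_id = c("ERR188021", "GTEX-U3ZM-0826-SM-4DXU6.1"),
    study = c("ERP001942", "BLADDER"),
    stringsAsFactors = FALSE
)

## Stops with an error if any identifier column is absent
check_collection_ids(example_ids)
```

A check like this only confirms that the columns exist; it does not verify that the rail_id values match those used by Snaptron.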
9.2.1 Collection metadata
The <collection_name>.custom.gz can then contain any additional columns of interest. For example, in the recount-brain project we manually curated samples from 62 studies and standardized variables across the 62 studies.
Here’s what the example collection metadata file looks like:
read.delim(
recount3::file_retrieve(
"http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/gtex_geuvadis/metadata/gtex_geuvadis.custom.gz"
)
)
## 2024-12-13 16:45:03.109457 caching file gtex_geuvadis.custom.gz.
## adding rname 'http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/gtex_geuvadis/metadata/gtex_geuvadis.custom.gz'
## rail_id external_id study sequencing_type tissue
## 1 98371 ERR188021 ERP001942 bulk LCL
## 2 153881 GTEX-U3ZM-0826-SM-4DXU6.1 BLADDER bulk BLADDER
This file has to be readable with R using read.delim(), as in the code shown above.
Since this is the main file that needs to be produced for adding a collection to recount3, if your collection involves data that is already present in recount3, it will be very easy for us to add it to our resource. Otherwise we will need the raw files for the corresponding new data.
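A minimal sketch of producing such a file with base R is shown next. It assumes a tab-separated, gzip-compressed text file with a header row, which is what read.delim() reads by default. The collection name my_collection and the objects custom_metadata, custom_file and custom_check are made up for illustration; the values are copied from the gtex_geuvadis example.

```r
## Toy collection metadata: the three identifier columns plus two
## additional columns of interest.
custom_metadata <- data.frame(
    rail_id = c(98371, 153881),
    external_id = c("ERR188021", "GTEX-U3ZM-0826-SM-4DXU6.1"),
    study = c("ERP001942", "BLADDER"),
    sequencing_type = c("bulk", "bulk"),
    tissue = c("LCL", "BLADDER"),
    stringsAsFactors = FALSE
)

## Hypothetical collection name, written to a temporary directory
custom_file <- file.path(tempdir(), "my_collection.custom.gz")

## Write a gzip-compressed, tab-separated file with a header row
con <- gzfile(custom_file, open = "wt")
write.table(
    custom_metadata,
    file = con,
    sep = "\t",
    quote = FALSE,
    row.names = FALSE
)
close(con)

## Confirm that the file can be read back with read.delim()
custom_check <- read.delim(custom_file)
stopifnot(identical(colnames(custom_check), colnames(custom_metadata)))
```

Reading the file back before sharing it is a cheap way to catch problems such as a missing header or an unexpected separator.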
9.2.2 Collection project location
The second text file specifies the information required for locating the samples. While we had considered using this type of file, we no longer require it. For historical purposes, this is how that file looked:
read.delim(
recount3::file_retrieve(
"http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/gtex_geuvadis/metadata/gtex_geuvadis.recount_project.gz"
)
)
## 2024-12-13 16:45:03.666874 caching file gtex_geuvadis.recount_project.gz.
## adding rname 'http://snaptron.cs.jhu.edu/data/temp/recount3/human/collections/gtex_geuvadis/metadata/gtex_geuvadis.recount_project.gz'
## rail_id external_id study project organism
## 1 98371 ERR188021 ERP001942 ERP001942 Homo sapiens
## 2 153881 GTEX-U3ZM-0826-SM-4DXU6.1 BLADDER BLADDER Homo sapiens
## file_source metadata_source date_processed
## 1 data_sources/sra collections/gtex_geuvadis 2019-10-01
## 2 data_sources/gtex collections/gtex_geuvadis 2019-11-01