6 Raw Files
This chapter explains the raw file formats we provide and how they are organized.
6.1 Data source vs collection
A data source specifies where the data is hosted, most commonly the Sequence Read Archive (SRA). A collection is a manually curated set of samples from one or more studies. A collection has a custom metadata file where the curator(s) can specify metadata variables for the collection. In other words:
- data_source: samples from the original data origin
- collection: manually selected samples with curated collection-specific sample metadata
6.2 Annotation files
The annotation files can also be downloaded directly. In the R package, you can use recount3::locate_url_ann() to obtain these URLs.
The URL structure is:
<recount3_url>/<organism>/annotations/<gene|exon>_sums/<organism>.<gene|exon>_sums.<annotation file extension>.gtf.gz
These are the annotation file extensions.
Human:
- Gencode v26: G026
- Gencode v29: G029
- RefSeq: R109
- FANTOM6_cat: F006
- ERCC: ERCC
- SIRV: SIRV
Mouse:
- Gencode v23: M023
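The URL pattern above can also be filled in by hand, for example when the R package is not installed. The following is a minimal base-R sketch: `annotation_gtf_url()` is a hypothetical helper written for this illustration (it is not part of recount3), and its default base URL is one of the two recount3_url options listed at the end of this chapter. recount3::locate_url_ann() remains the supported way to obtain these links.

```r
## Hypothetical helper (NOT part of the recount3 package): builds an
## annotation file URL by filling in the pattern described above.
annotation_gtf_url <- function(organism = c("human", "mouse"),
                               type = c("gene", "exon"),
                               ext = "G026",
                               recount3_url = "http://duffel.rail.bio/recount3") {
    organism <- match.arg(organism)
    type <- match.arg(type)
    sums <- paste0(type, "_sums")
    paste0(
        recount3_url, "/", organism, "/annotations/", sums, "/",
        organism, ".", sums, ".", ext, ".gtf.gz"
    )
}

## Human gene annotation for Gencode v26
human_gene_gtf <- annotation_gtf_url("human", "gene", "G026")
## Mouse exon annotation for Gencode v23
mouse_exon_gtf <- annotation_gtf_url("mouse", "exon", "M023")
```

The helper only concatenates strings; it does not check that the resulting file exists on the server.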
6.3 Project-level count files
For every project, we have files at the gene, exon, and exon-exon junction expression levels. For genes and exons, we provide a file for each of the annotations. That is, for every project we provide:
- gene files: one count matrix per annotation
- exon files: one count matrix per annotation
- 3 exon-exon junction files: the sparse count matrix data in Matrix Market (MM) format, the small list of sample identifiers (IDs), and the exon-exon junctions coordinate information (RR file)
All these files can be located with recount3::locate_url(). The following R code creates a table with links to the files for the default annotation for each organism. Note that you can replace the annotation file extension (like G026) with the one for another annotation shown in the previous section (or use recount3::annotation_ext() to see the available options).
## Load recount3 so that locate_url() and annotation_options() are available
library("recount3")

## Obtain all available projects
projects <- rbind(
    recount3::available_projects("human"),
    recount3::available_projects("mouse")
)

## Locate the project raw files at the gene level using the default annotation
projects$gene <- apply(projects, 1, function(x)
    locate_url(
        project = x["project"],
        project_home = x["project_home"],
        type = "gene",
        organism = x["organism"],
        annotation = annotation_options(x["organism"])[1] # Use default annotation
    ))

## Locate the project raw files at the exon level using the default annotation
projects$exon <- apply(projects, 1, function(x)
    locate_url(
        project = x["project"],
        project_home = x["project_home"],
        type = "exon",
        organism = x["organism"],
        annotation = annotation_options(x["organism"])[1] # Use default annotation
    ))

## Locate the project raw exon-exon junction files
projects <-
    cbind(projects, do.call(rbind, apply(projects, 1, function(x) {
        x <-
            locate_url(
                project = x["project"],
                project_home = x["project_home"],
                type = "jxn",
                organism = x["organism"]
            )
        ## One row per project, one column per junction file (MM, RR, ID)
        res <- data.frame(t(x))
        colnames(res) <-
            paste0("jxn_", gsub("^.*\\.", "", gsub("\\.gz", "", colnames(res))))
        return(res)
    })))
rownames(projects) <- NULL

## Dimensions of the table
dim(projects)
# [1] 18830 11

## Export
write.csv(projects, file = "recount3_raw_project_files_with_default_annotation.csv", row.names = FALSE)
As a teaser, here you can see the first 20 rows of this long table. Or you can download the CSV file to your computer from GitHub.
The URL structure is:
- gene: <recount3_url>/<organism>/data_sources/<data_source>/gene_sums/<last 2 project letters or digits>/<project>/<data_source>.gene_sums.<project>.<annotation file extension>.gz
- exon: <recount3_url>/<organism>/data_sources/<data_source>/exon_sums/<last 2 project letters or digits>/<project>/<data_source>.exon_sums.<project>.<annotation file extension>.gz
- junctions: <recount3_url>/<organism>/data_sources/<data_source>/junctions/<last 2 project letters or digits>/<project>/<data_source>.junctions.<project>.<junction type: typically ALL>.<junction file extension: RR, MM or ID>.gz [3]
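To make the three patterns concrete, here is a minimal base-R sketch that fills them in for a single project. `project_count_url()` is a hypothetical helper written for this illustration and is not part of recount3; the project ID SRP009615 is the one used in the bigWig example later in this chapter, and the base URL is one of the supported recount3_url options. In practice recount3::locate_url() should be preferred.

```r
## Hypothetical helper (NOT part of recount3): fills in the project-level
## count file URL patterns listed above.
project_count_url <- function(project, organism, data_source,
                              type = c("gene", "exon", "jxn"),
                              ext,
                              jxn_type = "ALL",
                              recount3_url = "http://duffel.rail.bio/recount3") {
    type <- match.arg(type)
    ## Directory shard: the last 2 characters of the project ID
    shard <- substr(project, nchar(project) - 1, nchar(project))
    folder <- if (type == "jxn") "junctions" else paste0(type, "_sums")
    ## Junction files carry the junction type before the file extension
    suffix <- if (type == "jxn") paste0(jxn_type, ".", ext) else ext
    paste0(
        recount3_url, "/", organism, "/data_sources/", data_source, "/",
        folder, "/", shard, "/", project, "/",
        data_source, ".", folder, ".", project, ".", suffix, ".gz"
    )
}

## Gene counts for the Gencode v26 annotation
gene_url <- project_count_url("SRP009615", "human", "sra", "gene", ext = "G026")

## The three junction files: MM, RR and ID
jxn_urls <- vapply(
    c("MM", "RR", "ID"),
    function(e) project_count_url("SRP009615", "human", "sra", "jxn", ext = e),
    character(1)
)
```

The `jxn_type` argument is exposed because the junction type is part of the file name; it defaults to ALL, the typical value.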
6.4 Project-level metadata files
Every project from an original data source has 5 different sample metadata tables. These are:
- project_meta (sra, gtex, tcga): information mostly used by the R interface for locating files
- recount_project: information downloaded from the original data source, such as the SRA Run Table selector
- recount_qc: quality check fields using the QC annotation
- recount_seq_qc: sequence quality check fields
- recount_pred: curated and predicted sample information described in the recount3 manuscript
You can use the following R code to obtain the links to all these raw metadata files, or use recount3::locate_url().
## Obtain all the metadata files
## (uses the projects table and the loaded recount3 package from the previous section)
metadata_files <- do.call(rbind, apply(projects, 1, function(x) {
    x <-
        locate_url(
            project = x[["project"]],
            project_home = x[["project_home"]],
            type = "metadata",
            organism = x[["organism"]]
        )
    res <- data.frame(t(x))
    ## Keep only the table name from each file name
    colnames(res) <-
        gsub("\\..*", "", gsub("^[a-z]+\\.", "", colnames(res)))
    ## The data source specific table (sra, gtex, tcga) becomes project_meta
    colnames(res)[colnames(res) %in% unique(projects$file_source)] <-
        "project_meta"
    return(res)
}))
dim(metadata_files)
# [1] 18830 5

## Export
write.csv(metadata_files, file = "recount3_metadata_files.csv", row.names = FALSE)
As a teaser, here you can see the first 6 rows of this long table. Or you can download the CSV file to your computer from GitHub. If you want to, you can combine it with the project raw files table from the previous section.
## project_meta
## 1 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP107565/sra.sra.SRP107565.MD.gz
## 2 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP149665/sra.sra.SRP149665.MD.gz
## 3 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP017465/sra.sra.SRP017465.MD.gz
## 4 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP119165/sra.sra.SRP119165.MD.gz
## 5 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP133965/sra.sra.SRP133965.MD.gz
## 6 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP096765/sra.sra.SRP096765.MD.gz
## recount_project
## 1 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP107565/sra.recount_project.SRP107565.MD.gz
## 2 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP149665/sra.recount_project.SRP149665.MD.gz
## 3 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP017465/sra.recount_project.SRP017465.MD.gz
## 4 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP119165/sra.recount_project.SRP119165.MD.gz
## 5 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP133965/sra.recount_project.SRP133965.MD.gz
## 6 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP096765/sra.recount_project.SRP096765.MD.gz
## recount_qc
## 1 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP107565/sra.recount_qc.SRP107565.MD.gz
## 2 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP149665/sra.recount_qc.SRP149665.MD.gz
## 3 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP017465/sra.recount_qc.SRP017465.MD.gz
## 4 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP119165/sra.recount_qc.SRP119165.MD.gz
## 5 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP133965/sra.recount_qc.SRP133965.MD.gz
## 6 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP096765/sra.recount_qc.SRP096765.MD.gz
## recount_seq_qc
## 1 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP107565/sra.recount_seq_qc.SRP107565.MD.gz
## 2 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP149665/sra.recount_seq_qc.SRP149665.MD.gz
## 3 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP017465/sra.recount_seq_qc.SRP017465.MD.gz
## 4 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP119165/sra.recount_seq_qc.SRP119165.MD.gz
## 5 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP133965/sra.recount_seq_qc.SRP133965.MD.gz
## 6 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP096765/sra.recount_seq_qc.SRP096765.MD.gz
## recount_pred
## 1 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP107565/sra.recount_pred.SRP107565.MD.gz
## 2 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP149665/sra.recount_pred.SRP149665.MD.gz
## 3 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP017465/sra.recount_pred.SRP017465.MD.gz
## 4 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP119165/sra.recount_pred.SRP119165.MD.gz
## 5 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP133965/sra.recount_pred.SRP133965.MD.gz
## 6 http://idies.jhu.edu/recount3/data/human/data_sources/sra/metadata/65/SRP096765/sra.recount_pred.SRP096765.MD.gz
The URL structure is:
<recount3_url>/<organism>/data_sources/<data_source>/metadata/<last 2 project letters or digits>/<project>/<data_source>.<table name>.<project>.MD.gz
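As a quick check of this pattern, the base-R sketch below rebuilds the five metadata URLs for project SRP107565, the first row of the output shown above. `metadata_url()` is a hypothetical helper for illustration only and is not part of recount3. Note that for the project_meta table the table name in the file is the data source itself (here sra).

```r
## Hypothetical helper (NOT part of recount3): fills in the metadata URL
## pattern described above.
metadata_url <- function(project, organism, data_source, table,
                         recount3_url = "http://idies.jhu.edu/recount3/data") {
    ## Directory shard: the last 2 characters of the project ID
    shard <- substr(project, nchar(project) - 1, nchar(project))
    paste0(
        recount3_url, "/", organism, "/data_sources/", data_source,
        "/metadata/", shard, "/", project, "/",
        data_source, ".", table, ".", project, ".MD.gz"
    )
}

## Table names as they appear in the file names for an SRA project
tables <- c("sra", "recount_project", "recount_qc", "recount_seq_qc", "recount_pred")
metadata_urls <- vapply(
    tables,
    function(tb) metadata_url("SRP107565", "human", "sra", tb),
    character(1)
)
```

The resulting vector is named by table, so for example `metadata_urls[["recount_qc"]]` gives the link to the quality check table.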
6.5 Sample-level BigWig files
Each sample in recount3 has a publicly available bigWig file whose URL can be obtained using recount3::locate_url(). Below we show the URL for one such sample.
## sra.base_sums.SRP009615_SRR387777.ALL.bw
## "http://duffel.rail.bio/recount3/human/data_sources/sra/base_sums/15/SRP009615/77/sra.base_sums.SRP009615_SRR387777.ALL.bw"
The URL structure is:
<recount3_url>/<organism>/data_sources/<data_source>/base_sums/<last 2 project letters or digits>/<project>/<last 2 sample letters or digits>/<data_source>.base_sums.<project>_<sample>.ALL.bw
Valid recount3_url options we support are http://duffel.rail.bio/recount3 and http://idies.jhu.edu/recount3/data.
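The bigWig pattern uses two directory shards, one from the project ID and one from the sample ID. The base-R sketch below reproduces the example URL shown above for sample SRR387777 of project SRP009615; `bigwig_url()` is a hypothetical helper for illustration and is not part of recount3.

```r
## Hypothetical helper (NOT part of recount3): fills in the sample-level
## bigWig URL pattern described above.
bigwig_url <- function(project, sample, organism, data_source,
                       recount3_url = "http://duffel.rail.bio/recount3") {
    ## Last 2 characters of an identifier, used for the directory shards
    last2 <- function(id) substr(id, nchar(id) - 1, nchar(id))
    paste0(
        recount3_url, "/", organism, "/data_sources/", data_source,
        "/base_sums/", last2(project), "/", project, "/", last2(sample), "/",
        data_source, ".base_sums.", project, "_", sample, ".ALL.bw"
    )
}

bw <- bigwig_url("SRP009615", "SRR387777", "human", "sra")
```

Passing `recount3_url = "http://idies.jhu.edu/recount3/data"` switches to the other supported host.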
3. Only GTEx and TCGA have junction type UNIQUE available in addition to ALL.