I am coming into this project in a state of perfect ignorance. Carrie kindly sent me a review or two yesterday, but I have yet to open the email and start reading them.
The only things I know for certain:
The document ‘preprocess.Rmd’ outlines the commands I ran. I used my pipeline’s Process_RNASeq function, which trims, runs fastqc, kraken, hisat, and htseq by default.
I received a complete sample sheet from either Najib or Carrie and modified it slightly to match the places where I put the raw data. I then copied it to ‘sample_sheets/all_samples.xlsx’ so that I need not worry about messing up the original.
My function ‘gather_preprocessing_metadata()’ has defaults which should provide some helpful new columns in this metadata sheet. Upon completion, it should write a new copy of the same file with a suffix ’_modified.xlsx’.
FIXME: Modify the function to detect columns with dates and make sure to keep the encoding the same. FIXED!
FIXME: Najib changed his template and added a new first row; add a check against that, or just delete the row manually. FIXED!
FIXME: Set the default significant digits back to NULL; having all these darn .000’s is annoying. FIXED!
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Skipping for now
## Writing new metadata to: sample_sheets/all_samples_modified.xlsx
## Deleting the file sample_sheets/all_samples_modified.xlsx before writing the tables.
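The messages above suggest two behaviours: missing condition/batch columns get filled in as "undefined", and the result is written next to the input with a '_modified' suffix. Here is a minimal base-R sketch of just those two ideas. It assumes nothing about the internals of gather_preprocessing_metadata(); `fill_missing_factors()` and `modified_name()` are hypothetical helpers, and the toy sheet is made up.

```r
# Hypothetical helper: add any missing design columns, filled with "undefined".
fill_missing_factors <- function(meta, wanted = c("condition", "batch")) {
  for (column in setdiff(wanted, colnames(meta))) {
    message("Did not find the ", column, " column, filling it in as undefined.")
    meta[[column]] <- "undefined"
  }
  meta
}

# Hypothetical helper: foo.xlsx -> foo_modified.xlsx
modified_name <- function(path) {
  sub("\\.xlsx$", "_modified.xlsx", path)
}

# Toy sample sheet, for illustration only.
toy_meta <- data.frame(sampleid = c("A1", "A2", "B1"),
                       time = c("d10", "d10", "d13"),
                       stringsAsFactors = FALSE)
toy_meta <- fill_missing_factors(toy_meta)
new_file <- modified_name("sample_sheets/all_samples.xlsx")
```

The real function also gathers statistics from the preprocessing logs; none of that is attempted here.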
I immediately learned that I somehow forgot to process the first sample!? It is processing now; I have no clue how that happened.
load_biomart_annotations, if not told anything else, will connect to ensembl and attempt to download the most commonly requested annotations for homo_sapiens from the archive server 2 years before the current date. This is because, as a general rule, I use genomes which are ~ 2-3 years old.
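The "2 years before the current date" rule is simple date arithmetic. A small sketch of it in base R, assuming the archive is chosen by year and month; `archive_year_month()` is a hypothetical name and not how load_biomart_annotations() is necessarily written, and the example date is arbitrary.

```r
# Hypothetical helper: pick the archive year/month a given number of years back.
archive_year_month <- function(today = Sys.Date(), years_back = 2) {
  # seq() on Dates accepts character steps such as "-2 years".
  past <- seq(today, by = paste0("-", years_back, " years"), length.out = 2)[2]
  list(year = as.integer(format(past, "%Y")),
       month = as.integer(format(past, "%m")))
}

# An arbitrary example date.
wanted <- archive_year_month(as.Date("2024-07-15"))
```

Passing an explicit year and month, rather than relying on the default, keeps the annotation release from drifting when the document is re-rendered later.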
I am also downloading the ontology data, though most tools are aware of Mus.
In addition, I will load the gff annotations from the gff file used to count the genes, just in case there are some mismatches between the ensembl and gff gene IDs.
## The biomart annotations file already exists, loading from it.
tx_annotations <- annot[["annotation"]]
gene_annotations <- tx_annotations
kept <- !duplicated(gene_annotations[["ensembl_gene_id"]])
gene_annotations <- gene_annotations[kept, ]
rownames(gene_annotations) <- gene_annotations[["ensembl_gene_id"]]
gene_annotations[["ensembl_gene_id"]] <- NULL
go_db <- load_biomart_go(species = "mmusculus", year = 2022, month = 7)
## The biomart annotations file already exists, loading from it.
## Returning a df with 32 columns and 3899382 rows.
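The de-duplication above collapses the transcript-level table to one row per gene by keeping the first row seen for each ensembl_gene_id. A toy illustration of the same three steps follows; the IDs and descriptions are made up.

```r
# Toy transcript-level table; the IDs are made up for illustration.
tx_toy <- data.frame(
  ensembl_transcript_id = c("T1", "T2", "T3", "T4"),
  ensembl_gene_id = c("G1", "G1", "G2", "G3"),
  description = c("first G1 tx", "second G1 tx", "only G2 tx", "only G3 tx"),
  stringsAsFactors = FALSE)

# Keep the first row seen for each gene, then move the gene ID to the rownames.
gene_toy <- tx_toy[!duplicated(tx_toy[["ensembl_gene_id"]]), ]
rownames(gene_toy) <- gene_toy[["ensembl_gene_id"]]
gene_toy[["ensembl_gene_id"]] <- NULL
```

Which transcript survives depends entirely on row order, so any transcript-specific columns in the gene-level table describe an arbitrary transcript of that gene.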
The experimental metadata now includes the count table filenames and I have a reasonable set of gene annotations. I should therefore be able to merge them all into an ExpressionSet and/or SummarizedExperiment.
mm_expt <- create_expt(modified[["new_file"]],
                       gene_info = gene_annotations,
                       annotation = "org.Mm.eg.db",
                       file_column = "hisatcounttable") %>%
  set_expt_conditions(fact = "time") %>%
  set_expt_batches(fact = "datecategorical")
## Reading the sample metadata.
## The sample definitions comprises: 17 rows(samples) and 46 columns(metadata fields).
## Matched 25654 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 25760 features and 17 samples.
## The numbers of samples by condition are:
##
## d10 d13 d15
##   5   6   6
## The number of samples by batch are:
##
## b1 b2 b3 b4
## 11  2  2  2
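The log reports more features in the final expressionset than matched annotations, with the lost annotations set to 'undefined'. A base-R sketch of that kind of merge, assuming the counts define the rows and the annotations are aligned to them; the data are toy values and this is not the create_expt() implementation.

```r
# Toy counts for three genes in two samples; G9 has no annotation.
counts <- matrix(c(10, 0, 5, 12, 3, 7), nrow = 3,
                 dimnames = list(c("G1", "G2", "G9"), c("A1", "A2")))
annot <- data.frame(description = c("gene one", "gene two"),
                    row.names = c("G1", "G2"),
                    stringsAsFactors = FALSE)

# Align annotations to the count rows; unmatched genes become NA rows.
merged <- annot[match(rownames(counts), rownames(annot)), , drop = FALSE]
rownames(merged) <- rownames(counts)

# Replace the missing annotations with a placeholder.
lost <- is.na(merged[["description"]])
merged[["description"]][lost] <- "undefined"
```

Keeping the unannotated rows means no counts are silently dropped before normalization.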
## Writing the first sheet, containing a legend and some summary data.
## The following samples have less than 16744 genes.
## [1] "A1" "A2" "A5" "B1" "B3" "B4"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 152012 entries are 0. We are on a log scale, adding 1 to the data.
##
## Changed 152012 zero count features.
##
## Naively calculating coefficient of variation/dispersion with respect to condition.
##
## Finished calculating dispersion estimates.
##
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Subsetting on genes.
##
## remove_genes_expt(), before removal, there were 13457 genes, now there are 13457.
##
##
## Total:98 s
## Library sizes of 17 samples,
## ranging from 27,324,171 to 69,203,380.
## The following samples have less than 16744 genes.
## [1] "A1" "A2" "A5" "B1" "B3" "B4"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## A non-zero genes plot of 17 samples.
## These samples have an average 37.06 CPM coverage and 16818 genes observed, ranging from 16399 to
## 17275.
## Removing 12303 low-count genes (13457 remaining).
## transform_counts: Found 57 values equal to 0, adding 1 to the matrix.
## A heatmap of pairwise sample correlations ranging from:
## 0.745893022508784 to 0.996487658158264.
## A heatmap of pairwise sample distances ranging from:
## 19.647203152876 to 167.113181354411.
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by d10, d13, d15
## Shapes are defined by b1, b2, b3, b4.
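The messages above describe a fairly standard sequence: drop low-count genes, convert to CPM, log2 transform with a pseudocount, then look at sample correlations, distances, and a PCA. Below is a rough base-R sketch of that sequence on simulated counts. The filter threshold, Pearson correlation, and Euclidean distance are my assumptions here, and prcomp() stands in for the fast_svd reduction named in the log.

```r
set.seed(42)
# Simulated counts: 200 genes by 6 samples, purely illustrative.
counts <- matrix(rnbinom(200 * 6, mu = 50, size = 2), nrow = 200,
                 dimnames = list(paste0("g", 1:200), paste0("s", 1:6)))

# Drop low-count genes; a mean of 2 reads is an arbitrary threshold.
kept <- counts[rowMeans(counts) >= 2, , drop = FALSE]

# Counts per million, then log2 with a pseudocount of 1 to cope with zeros.
cpm <- t(t(kept) / colSums(kept)) * 1e6
log_cpm <- log2(cpm + 1)

# Pairwise sample correlations and Euclidean distances (the two heatmaps).
sample_cor <- cor(log_cpm)
sample_dist <- as.matrix(dist(t(log_cpm)))

# PCA on samples; the first two components are what gets plotted.
pca <- prcomp(t(log_cpm), center = TRUE)
pc_scores <- pca$x[, 1:2]
```

With real data the interesting part is whether the first two components separate by condition or by batch, which is what the colors and shapes in the plot encode.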
Holy ass crackers! I am not sure I have ever had a dataset which split this coherently. I need better names than ‘A’ ‘B’ ‘C’.
Ok, I will just do a no-batch DE because I am not sure of the actual batches and/or surrogates. And who cares: the data split so well that I am worried (not really) it is simulated.
Oh, before I forget, April has been asking about rRNA content. I think I quantified that?
No, it appears I didn’t submit rRNA queries. Let’s do that now before I forget.
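cd preprocessing
start=$(pwd)
for i in A* B* C*; do
  # Skip anything we cannot enter (e.g. a stray file matching the glob).
  cd "$i" || continue
  # Join the trimmed read files into a single ':'-separated string.
  cyoa --method hisat --species mm38_100 --libtype rRNA --gff_type misc_feature --gff_tag ID \
       --input "$(/bin/ls *-trimmed.fastq.xz | tr '\n' ':' | sed 's/:$//')"
  cd "$start"
done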
I checked the logs for a few samples and it looks like the rRNA content is less than 1% in most (all?) of the samples.
## Subsetting on genes.
## remove_genes_expt(), before removal, there were 18857 genes, now there are 17392.
##
## Total:130 s
## The result of using variancePartition with the model:
## ~ condition + batch
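For intuition about what a model like ~ condition + batch is partitioning: for each gene, how much of its variance goes with each term versus the residual. variancePartition fits its own per-gene models. The sketch below is a much cruder stand-in for a single simulated gene, using sequential ANOVA sums of squares from lm(). The batch assignment is invented and only the condition group sizes (5/6/6) come from the data.

```r
set.seed(1)
condition <- factor(rep(c("d10", "d13", "d15"), times = c(5, 6, 6)))
# Invented batch labels, cycling through four batches.
batch <- factor(rep(c("b1", "b2", "b3", "b4"), length.out = 17))

# One simulated gene whose expression shifts with condition.
expr <- rnorm(17) + as.integer(condition)

# Fixed-effects fit; rows of the ANOVA table are condition, batch, residuals.
fit <- lm(expr ~ condition + batch)
ss <- anova(fit)[["Sum Sq"]]
frac <- setNames(ss / sum(ss), c("condition", "batch", "residual"))
```

Because the design is unbalanced, sequential sums of squares depend on the order of the terms, which is one reason to prefer the real package over this shortcut.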
Since I have not read the kindly-sent reviews, I will cheat a little and use GSVA to get some ideas about potential papers. I default to C2 which is likely not the right gene set list.
I just downloaded the new msigdb, let us use that instead of the much less interesting GSVAdata set. Frustratingly, the new version of MSigDB provides invalid XML (there are apparently ‘<’ characters in the text fields of this file, which is explicitly forbidden in the XML standard), so I wrote a function to read the annotations from the SQLite database.
FIXME: I need to do some work to clean up the IDs with this new function.
mm_gsva <- simple_gsva(mm_expt, orgdb = "org.Mm.eg.db", signature_category = "c2",
                       msig_db = "reference/msigdb_v2023.2.Mm/msigdb_v2023.2.Mm.db",
                       signatures = "reference/msigdb_v2023.2.Mm/m2.all.v2023.2.Mm.entrez.gmt")
## Converting the rownames() of the expressionset to ENTREZID.
## 4032 ENSEMBL ID's didn't have a matching ENTEREZ ID. Dropping them now.
## Before conversion, the expressionset has 25760 entries.
## Subsetting on genes.
## remove_genes_expt(), before removal, there were 25760 genes, now there are 21669.
## After conversion, the expressionset has 21669 entries.
## Adding descriptions and IDs to the gene set annotations.
## Adding annotations from reference/msigdb_v2023.2.Mm/msigdb_v2023.2.Mm.db.
## Not subsetting the msigdb metadata, the wanted_meta argument was NULL.
## GSVA result using method: ssgsea against the c2 dataset.
## Scores range from: -0.4357 to: 0.5643.
## Starting limma pairwise comparison.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$libsize.
## Limma step 1/6: choosing model.
## Assuming this data is similar to a micro array and not performign voom.
## Limma step 3/6: running lmFit with method: ls.
## Limma step 4/6: making and fitting contrasts with no intercept. (~ 0 + factors)
## Limma step 5/6: Running eBayes with robust = FALSE and trend = FALSE.
## Limma step 6/6: Writing limma outputs.
## Limma step 6/6: 1/3: Creating table: d13_vs_d10. Adjust = BH
## Limma step 6/6: 2/3: Creating table: d15_vs_d10. Adjust = BH
## Limma step 6/6: 3/3: Creating table: d15_vs_d13. Adjust = BH
## Limma step 6/6: 1/3: Creating table: d10. Adjust = BH
## Limma step 6/6: 2/3: Creating table: d13. Adjust = BH
## Limma step 6/6: 3/3: Creating table: d15. Adjust = BH
## The factor d10 has 5 rows.
## The factor d13 has 6 rows.
## The factor d15 has 6 rows.
## Testing each factor against the others.
## Scoring d10 against everything else.
## Scoring d13 against everything else.
## Scoring d15 against everything else.
## Deleting the file excel/gsva_sig_categories.xlsx before writing the tables.
## The set of GSVA categories deemed significantly higher than the
## distribution of all scores. It comprises 54 gene sets.
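The ID conversion reported near the top of this log (Ensembl rownames converted to Entrez IDs, unmatched IDs dropped) boils down to a lookup followed by a subset. A toy base-R version is below. The map is made up; in practice it would come from an OrgDb such as org.Mm.eg.db, and this sketch ignores one-to-many and many-to-one mappings.

```r
# Toy ID map; ENSMUSG03 deliberately has no Entrez match.
id_map <- c(ENSMUSG01 = "101", ENSMUSG02 = "102", ENSMUSG04 = "104")
expr <- matrix(1:8, nrow = 4,
               dimnames = list(paste0("ENSMUSG0", 1:4), c("A1", "A2")))

# Look up every rowname; IDs absent from the map come back as NA.
entrez <- unname(id_map[rownames(expr)])
dropped <- is.na(entrez)
message(sum(dropped), " IDs had no match and are dropped.")

# Keep only the matched rows and rename them.
converted <- expr[!dropped, , drop = FALSE]
rownames(converted) <- entrez[!dropped]
```

Every dropped ID is a gene which can no longer contribute to any gene set score, so the number dropped is worth keeping an eye on.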
Come back to this. Note to self: the previous iteration was explicitly looking for Pseudomonas contamination.
genus_expt <- create_expt(gathered[["new_file"]],
                          file_column = "krakenmatrix", file_type = "table")
genus_norm <- normalize_expt(genus_expt, convert = "cpm")
plot_disheat(genus_norm)
genus_normv2 <- normalize_expt(genus_expt, convert = "cpm", transform = "log2")
plot_pca(genus_normv2)
plot_libsize(genus_expt)
head(exprs(genus_expt))
exprs(genus_expt)["Pseudomonas", ]
Until I get more meaningful condition names, I will just do B/A, C/A, and C/B.
keepers <- list(
  "d13_vs_d10" = c("d13", "d10"),
  "d15_vs_d10" = c("d15", "d10"),
  "d15_vs_d13" = c("d15", "d13"))
de <- all_pairwise(mm_expt, filter = TRUE, model_batch = FALSE)
##
## d10 d13 d15
##   5   6   6
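To make the contrast names concrete: I read each keepers entry as c(numerator, denominator), which is what names like d13_vs_d10 suggest. On the log2 scale that is just a difference of group means. The sketch below uses simulated values and only shows the arithmetic behind a contrast. It is not what limma, DESeq2, or edgeR actually fit.

```r
set.seed(7)
condition <- rep(c("d10", "d13", "d15"), times = c(5, 6, 6))
# Simulated log2 expression for 4 genes by 17 samples, illustrative only.
log_expr <- matrix(rnorm(4 * 17, mean = 8), nrow = 4,
                   dimnames = list(paste0("g", 1:4), NULL))

keepers <- list(
  "d13_vs_d10" = c("d13", "d10"),
  "d15_vs_d10" = c("d15", "d10"),
  "d15_vs_d13" = c("d15", "d13"))

# For each contrast: mean(numerator) - mean(denominator) on the log2 scale.
logfc <- sapply(keepers, function(pair) {
  rowMeans(log_expr[, condition == pair[1], drop = FALSE]) -
    rowMeans(log_expr[, condition == pair[2], drop = FALSE])
})
```

One consequence of this arithmetic is that the three contrasts are not independent: d15_vs_d13 equals d15_vs_d10 minus d13_vs_d10 for every gene.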
mm38 is nicely supported in gProfiler/clusterProfiler.
## Running gProfiler on every set of significant genes found:
##                   GO KEGG REAC WP  TF MIRNA HPA CORUM HP
## d13_vs_d10_up   1291    4   14  1 478     0   0     0  0
## d13_vs_d10_down 1294    8   58  1 401     0   0     0  0
## d15_vs_d10_up   1467    6   10  0 521     0   0     0  0
## d15_vs_d10_down 1346   10   83  2 483     0   0     0  0
## d15_vs_d13_up    625    4    3  0 142     0   0     0  1
## d15_vs_d13_down  156    0    0  0  34     0   0     0  0
## A set of ontologies produced by gprofiler using 2437
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1291 GO hits, 4, KEGG hits, 14 reactome hits, 1 wikipathway hits, 478 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 1497
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1294 GO hits, 8, KEGG hits, 58 reactome hits, 1 wikipathway hits, 401 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 2916
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1467 GO hits, 6, KEGG hits, 10 reactome hits, 0 wikipathway hits, 521 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 1840
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1346 GO hits, 10, KEGG hits, 83 reactome hits, 2 wikipathway hits, 483 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 753
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 625 GO hits, 4, KEGG hits, 3 reactome hits, 0 wikipathway hits, 142 transcription factor hits, 0 miRNA hits, 0 HPA hits, 1 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 199
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 156 GO hits, 0, KEGG hits, 0 reactome hits, 0 wikipathway hits, 34 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category BP is the most populated with 30 hits.
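As a reminder of what a single "hit" means here: as far as I know, gProfiler scores each category with a hypergeometric test and then applies its own multiple-testing correction. The sketch below does the uncorrected test for one made-up category with base R's phyper(). Only the query size (2437, from the first report above) comes from this document. The universe size, category size, and overlap are hypothetical.

```r
# Hypothetical numbers, except for the query size.
universe <- 20000      # annotated genes (assumed)
in_category <- 150     # genes annotated to the category (assumed)
query <- 2437          # genes in the query, from the first report above
overlap <- 40          # query genes falling in the category (assumed)

# P(X >= overlap) when drawing `query` genes without replacement.
p_value <- phyper(overlap - 1, in_category, universe - in_category, query,
                  lower.tail = FALSE)

# Overlap expected by chance, for comparison with the observed overlap.
expected <- query * in_category / universe
```

With thousands of categories tested per query, the raw p-value is only the starting point; the 0.05 cutoff in the reports applies after correction.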
I spent a little time messing with suppa (https://doi.org/10.1186/s13059-018-1417-1) and have what I think are the various comparisons across time for the different splicing variants etc.
I do not currently have an automagic way to load the results from suppa, so let us take a moment and see if the results make any sense.
d13_d10_all_isoforms <- "outputs/90suppa_mm38_100/diff/d13_d10_ioi.dpsi"
d13_d10_a3 <- "outputs/90suppa_mm38_100/diff/d13_d10_A3.dpsi"
d13_d10_a5 <- "outputs/90suppa_mm38_100/diff/d13_d10_A5.dpsi"
d13_d10_af <- "outputs/90suppa_mm38_100/diff/d13_d10_AF.dpsi"
d13_d10_al <- "outputs/90suppa_mm38_100/diff/d13_d10_AL.dpsi"
d13_d10_mx <- "outputs/90suppa_mm38_100/diff/d13_d10_MX.dpsi"
d13_d10_ri <- "outputs/90suppa_mm38_100/diff/d13_d10_RI.dpsi"
d13_d10_se <- "outputs/90suppa_mm38_100/diff/d13_d10_SE.dpsi"
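Since the filenames above differ only in the event type, the same paths can be built in one go, which should make it easier to repeat this for the other comparisons. A small sketch: the directory and suffixes are copied from the assignments above, while `d13_d10_files` and `present` are names I made up.

```r
suppa_dir <- "outputs/90suppa_mm38_100/diff"
event_types <- c("ioi", "A3", "A5", "AF", "AL", "MX", "RI", "SE")

# One path per event type, named by type, for the d13 vs d10 comparison.
d13_d10_files <- setNames(
  file.path(suppa_dir, paste0("d13_d10_", event_types, ".dpsi")),
  event_types)

# Check which of the expected files are actually present on disk.
present <- file.exists(d13_d10_files)
```

Anything FALSE in `present` points at a suppa step which did not finish or wrote somewhere else.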