I am coming into this project in a state of perfect ignorance. Carrie kindly sent me a few or two yesterday but I have yet to open the email and start reading them.
The only things I know for certain:
The document ‘preprocess.Rmd’ outlines the commands I ran. I used my pipeline’s Process_RNASeq function, which trims, runs fastqc, kraken, hisat, and htseq by default.
I received a complete sample sheet from either Najib or Carrie and modified it slightly to match the places where I put the raw data. I then copied it to ‘sample_sheets/all_samples.xlsx’ so that I need not worry about messing up the original.
My function ‘gather_preprocessing_metadata()’ has defaults which should provide some helpful new columns in this metadata sheet. Upon completion, it should write a new copy of the same file with a suffix ’_modified.xlsx’.
FIXME: modify the function to detect columns with dates and make sure to keep the encoding the same. FIXME: Najib changed his template and added a new first row, add a check against that - or just delete the row manually. FIXME: Set the default significant digits back to NULL, having all these darn .000’s is annoying.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Skipping for now
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/A1/outputs/*kraken_*/kraken_report_matrix.tsv.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/A2/outputs/*kraken_*/kraken_report_matrix.tsv.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/B1/outputs/*kraken_*/kraken_report_matrix.tsv.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/B3/outputs/*kraken_*/kraken_report_matrix.tsv.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/B4/outputs/*kraken_*/kraken_report_matrix.tsv.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/B6/outputs/*kraken_*/kraken_report_matrix.tsv.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/C4/outputs/*kraken_*/kraken_report_matrix.tsv.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/C6/outputs/*kraken_*/kraken_report_matrix.tsv.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/A1/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/A2/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/A3/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/A4/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/A5/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/B1/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/B2/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/B3/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/B4/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/B5/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/B6/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/C1/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/C2/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/C3/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/C4/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/C5/outputs/*salmon_*/quant.sf.
## Warning in dispatch_filename_search(meta, input_file_spec, verbose = verbose, : The input file is NA for:
## preprocessing/C6/outputs/*salmon_*/quant.sf.
## Writing new metadata to: sample_sheets/all_samples_modified.xlsx
## Deleting the file sample_sheets/all_samples_modified.xlsx before writing the tables.
I immediately learned that I somehow forgot to process the first sample!? It is processing now, I have no clue how that happened.
load_biomart_annotations, if not told anything else, will connect to ensembl and attempt to download the most commonly requested annotations for homo_sapiens from the archive server 2 years before the current date. This is because, as a general rule, I use genomes which are ~ 2-3 years old.
I am also downloading the ontology data, though most tools are aware of Mus.
In addition, I will load the gff annotations from the gff file used to count the genes, just in case there are some mismatches between the ensembl and gff gene IDs.
## The biomart annotations file already exists, loading from it.
gene_annotations <- annot[["gene_annotations"]]
go_db <- load_biomart_go(species = "mmusculus", year = 2022, month = 7)
## The biomart annotations file already exists, loading from it.
## Warning in load_gff_annotations("~/libraries/genome/mm38_100.gff", id_col = "gene"): Attempting to create a dataframe with gene and locus_tag both
## failed.
The experimental metadata now includes the count table filenames and I have a reasonable set of gene annotations. I should be able therefore the merge them all into an expressionset and/or summarizedExperiment.
mm_expt <- create_expt(modified[["new_file"]],
gene_info = gene_annotations,
file_column = "hisatcounttable") %>%
set_expt_conditions(fact = "abc") %>%
set_expt_batches(fact = "number")
## Reading the sample metadata.
## The sample definitions comprises: 17 rows(samples) and 41 columns(metadata fields).
## Matched 25760 annotations and counts.
## Bringing together the count matrix and gene information.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 25760 features and 17 samples.
## The numbers of samples by condition are:
##
## A B C
## 5 6 6
## The number of samples by batch are:
##
## b1 b2 b3 b4 b5 b6
## 3 3 3 3 3 2
## Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'exprs' for signature '"function"'
## Library sizes of 17 samples,
## ranging from 27,324,171 to 69,203,380.
## The following samples have less than 16744 genes.
## [1] "A1" "A2" "A5" "B1" "B3" "B4"
## A non-zero genes plot of 17 samples.
## These samples have an average 37.06 CPM coverage and 16818 genes observed, ranging from 16399 to
## 17275.
## Removing 12303 low-count genes (13457 remaining).
## transform_counts: Found 57 values equal to 0, adding 1 to the matrix.
## A heatmap of pairwise sample correlations ranging from:
## 0.745893022508784 to 0.996487658158264.
## A heatmap of pairwise sample distances ranging from:
## 19.647203152876 to 167.113181354411.
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by A, B, C
## Shapes are defined by b1, b2, b3, b4, b5, b6.
Holy ass crackers! I am not sure I have ever had a dataset which split this coherently. I need better names than ‘A’ ‘B’ ‘C’.
Ok, I will just do a no-batch DE because I am not sure of the actual batches and/or surrogates, and who cares the data split so well I am worried (not really) it is simulated.
Oh, before I forget, April has been asking about rRNA content. I think I quantified that?
No, it appears I didn’t submit rRNA queries. Lets do that now before I forget.
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
##
## Total:162 s
## The result of using variancePartition with the model:
## ~ condition + batch
Since I have not read the kindly-sent reviews, I will cheat a little and use GSVA to get some ideas about potential papers. I default to C2 which is likely not the right gene set list.
I just downloaded the new msigdb, let us use that instead of the much less interesting GSVAdata set. Frustratingly, the new version of MSigDB provides invalid XML (there are apparently ‘<’ characters in the text fields of this file, which is explicitly forbidden in the XML standard), so I wrote a function to read the annotations from the SQLite database.
FIXME: I need to do some work to clean up the IDs with this new function.
## Converting the rownames() of the expressionset to ENTREZID.
## 4032 ENSEMBL ID's didn't have a matching ENTEREZ ID. Dropping them now.
## Before conversion, the expressionset has 25760 entries.
## After conversion, the expressionset has 22019 entries.
#msig_meta <- get_msigdb_metadata(mm_gsva,
# msig_db = "reference/msigdb_v2023.2.Mm/msigdb_v2023.2.Mm.db")
mm_gsva_sig <- get_sig_gsva_categories(mm_gsva)
## Starting limma pairwise comparison.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$libsize.
## Limma step 1/6: choosing model.
## Assuming this data is similar to a micro array and not performign voom.
## Limma step 3/6: running lmFit with method: ls.
## Limma step 4/6: making and fitting contrasts with no intercept. (~ 0 + factors)
## Limma step 5/6: Running eBayes with robust = FALSE and trend = FALSE.
## Limma step 6/6: Writing limma outputs.
## Limma step 6/6: 1/3: Creating table: B_vs_A. Adjust = BH
## Limma step 6/6: 2/3: Creating table: C_vs_A. Adjust = BH
## Limma step 6/6: 3/3: Creating table: C_vs_B. Adjust = BH
## Limma step 6/6: 1/3: Creating table: A. Adjust = BH
## Limma step 6/6: 2/3: Creating table: B. Adjust = BH
## Limma step 6/6: 3/3: Creating table: C. Adjust = BH
## The factor A has 5 rows.
## The factor B has 6 rows.
## The factor C has 6 rows.
## Testing each factor against the others.
## Scoring A against everything else.
## Scoring B against everything else.
## Scoring C against everything else.
## Deleting the file excel/gsva_subset.xlsx before writing the tables.
Come back to this, note to self the previous iteration was explicitly looking for Pseudomonas contamination.
genus_expt <- create_expt(gathered[["new_file"]],
file_column = "krakenmatrix", file_type = "table")
genus_norm <- normalize_expt(genus_expt, convert = "cpm")
plot_disheat(genus_norm)
genus_normv2 <- normalize_expt(genus_expt, convert = "cpm", transform = "log2")
plot_pca(genus_normv2)
plot_libsize(genus_expt)
head(exprs(genus_expt))
exprs(genus_expt)["Pseudomonas", ]
Until I get more meaningful condition names, I will just do B/A C/A C/B
keepers <- list(
"ba" = c("B", "A"),
"ca" = c("C", "A"),
"cb" = c("C", "B"))
de <- all_pairwise(mm_expt, filter = TRUE, model_batch = FALSE)
##
## A B C
## 5 6 6
## Checking limma for name ba:B_vs_A
## Checking deseq for name ba:B_vs_A
## Checking edger for name ba:B_vs_A
## Checking ebseq for name ba:B_vs_A
## Checking noiseq for name ba:B_vs_A
## Checking basic for name ba:B_vs_A
## Checking limma for name ca:C_vs_A
## Checking deseq for name ca:C_vs_A
## Checking edger for name ca:C_vs_A
## Checking ebseq for name ca:C_vs_A
## Checking noiseq for name ca:C_vs_A
## Checking basic for name ca:C_vs_A
## Checking limma for name cb:C_vs_B
## Checking deseq for name cb:C_vs_B
## Checking edger for name cb:C_vs_B
## Checking ebseq for name cb:C_vs_B
## Checking noiseq for name cb:C_vs_B
## Checking basic for name cb:C_vs_B
## About to start combine_mapped_table()
## Finished combine_mapped_table()
## About to start combine_mapped_table()
## Finished combine_mapped_table()
## About to start combine_mapped_table()
## Finished combine_mapped_table()
mm38 is nicely supported in gProfiler/clusterProfiler.
## Running gProfiler on every set of significant genes found:
## GO KEGG REAC WP TF MIRNA HPA CORUM HP
## ba_up 1329 4 14 2 483 0 0 0 0
## ba_down 1296 9 58 1 407 0 0 0 0
## ca_up 1484 5 9 0 524 0 0 0 0
## ca_down 1374 11 80 2 489 0 0 0 0
## cb_up 656 3 3 0 146 0 0 0 1
## cb_down 168 0 0 0 36 0 0 0 0
## A set of ontologies produced by gprofiler using 2437
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1329 GO hits, 4, KEGG hits, 14 reactome hits, 2 wikipathway hits, 483 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 1497
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1296 GO hits, 9, KEGG hits, 58 reactome hits, 1 wikipathway hits, 407 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 2916
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1484 GO hits, 5, KEGG hits, 9 reactome hits, 0 wikipathway hits, 524 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 1840
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1374 GO hits, 11, KEGG hits, 80 reactome hits, 2 wikipathway hits, 489 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 753
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 656 GO hits, 3, KEGG hits, 3 reactome hits, 0 wikipathway hits, 146 transcription factor hits, 0 miRNA hits, 0 HPA hits, 1 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 199
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 168 GO hits, 0, KEGG hits, 0 reactome hits, 0 wikipathway hits, 36 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category BP is the most populated with 30 hits.