I am coming into this project in a state of perfect ignorance. Carrie kindly sent me a few or two yesterday but I have yet to open the email and start reading them.
The only things I know for certain:
The document ‘preprocess.Rmd’ outlines the commands I ran. I used my pipeline’s Process_RNASeq function, which trims, runs fastqc, kraken, hisat, and htseq by default.
I received a complete sample sheet from either Najib or Carrie and modified it slightly to match the places where I put the raw data. I then copied it to ‘sample_sheets/all_samples.xlsx’ so that I need not worry about messing up the original.
My function ‘gather_preprocessing_metadata()’ has defaults which should provide some helpful new columns in this metadata sheet. Upon completion, it should write a new copy of the same file with a suffix ’_modified.xlsx’.
FIXME: modify the function to detect columns with dates and make sure to keep the encoding the same. FIXED! FIXME: Najib changed his template and added a new first row, add a check against that - or just delete the row manually. FIXED! FIXME: Set the default significant digits back to NULL, having all these darn .000’s is annoying. FIXED!
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Skipping for now
## Writing new metadata to: sample_sheets/all_samples_modified.xlsx
## Deleting the file sample_sheets/all_samples_modified.xlsx before writing the tables.
I immediately learned that I somehow forgot to process the first sample!? It is processing now, I have no clue how that happened.
load_biomart_annotations, if not told anything else, will connect to ensembl and attempt to download the most commonly requested annotations for homo_sapiens from the archive server 2 years before the current date. This is because, as a general rule, I use genomes which are ~ 2-3 years old.
I am also downloading the ontology data, though most tools are aware of Mus.
In addition, I will load the gff annotations from the gff file used to count the genes, just in case there are some mismatches between the ensembl and gff gene IDs.
## The biomart annotations file already exists, loading from it.
gene_annotations <- annot[["gene_annotations"]]
go_db <- load_biomart_go(species = "mmusculus", year = 2022, month = 7)
## The biomart annotations file already exists, loading from it.
## Returning a df with 32 columns and 3899382 rows.
The experimental metadata now includes the count table filenames and I have a reasonable set of gene annotations. I should be able therefore the merge them all into an expressionset and/or summarizedExperiment.
mm_expt <- create_expt(modified[["new_file"]],
gene_info = gene_annotations,
annotation = "org.Mm.eg.db",
file_column = "hisatcounttable") %>%
set_expt_conditions(fact = "time") %>%
set_expt_batches(fact = "datecategorical")
## Reading the sample metadata.
## The sample definitions comprises: 17 rows(samples) and 42 columns(metadata fields).
## Matched 25760 annotations and counts.
## Bringing together the count matrix and gene information.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 25760 features and 17 samples.
## The numbers of samples by condition are:
##
## d10 d13 d15
## 5 6 6
## The number of samples by batch are:
##
## b1 b2 b3 b4
## 11 2 2 2
## Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'exprs' for signature '"function"'
## Library sizes of 17 samples,
## ranging from 27,324,171 to 69,203,380.
## The following samples have less than 16744 genes.
## [1] "A1" "A2" "A5" "B1" "B3" "B4"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## A non-zero genes plot of 17 samples.
## These samples have an average 37.06 CPM coverage and 16818 genes observed, ranging from 16399 to
## 17275.
## Removing 12303 low-count genes (13457 remaining).
## transform_counts: Found 57 values equal to 0, adding 1 to the matrix.
## A heatmap of pairwise sample correlations ranging from:
## 0.745893022508784 to 0.996487658158264.
## A heatmap of pairwise sample distances ranging from:
## 19.647203152876 to 167.113181354411.
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by d10, d13, d15
## Shapes are defined by b1, b2, b3, b4.
Holy ass crackers! I am not sure I have ever had a dataset which split this coherently. I need better names than ‘A’ ‘B’ ‘C’.
Ok, I will just do a no-batch DE because I am not sure of the actual batches and/or surrogates, and who cares the data split so well I am worried (not really) it is simulated.
Oh, before I forget, April has been asking about rRNA content. I think I quantified that?
No, it appears I didn’t submit rRNA queries. Lets do that now before I forget.
cd preprocessing
start=$(pwd)
for i in A* B* C*; do
cd $i
cyoa --method hisat --species mm38_100 --libtype rRNA --gff_type misc_feature --gff_tag ID \
--input $(/bin/ls *-trimmed.fastq.xz | tr '\n' ':' | sed 's/:$//g')
cd $start
done
I checked the logs for a few samples and it looks like the rRNA content is less than 1% in most(all?) of the samples.
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in cov2cor(as.matrix(V)): diag(.) had 0 or NA entries; non-finite result is doubtful
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in cov2cor(as.matrix(V)): diag(.) had 0 or NA entries; non-finite result is doubtful
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in cov2cor(as.matrix(V)): diag(.) had 0 or NA entries; non-finite result is doubtful
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in cov2cor(as.matrix(V)): diag(.) had 0 or NA entries; non-finite result is doubtful
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in cov2cor(as.matrix(V)): diag(.) had 0 or NA entries; non-finite result is doubtful
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Error in { : task 7884 failed - "missing value where TRUE/FALSE needed"
## Retrying with only condition in the model.
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be unreliable
##
## Total:139 s
## The result of using variancePartition with the model:
## ~ condition + batch
Since I have not read the kindly-sent reviews, I will cheat a little and use GSVA to get some ideas about potential papers. I default to C2 which is likely not the right gene set list.
I just downloaded the new msigdb, let us use that instead of the much less interesting GSVAdata set. Frustratingly, the new version of MSigDB provides invalid XML (there are apparently ‘<’ characters in the text fields of this file, which is explicitly forbidden in the XML standard), so I wrote a function to read the annotations from the SQLite database.
FIXME: I need to do some work to clean up the IDs with this new function.
mm_gsva <- simple_gsva(mm_expt, orgdb = "org.Mm.eg.db", signature_category = "c2",
msig_db = "reference/msigdb_v2023.2.Mm/msigdb_v2023.2.Mm.db",
signatures = "reference/msigdb_v2023.2.Mm/m2.all.v2023.2.Mm.entrez.gmt")
## Converting the rownames() of the expressionset to ENTREZID.
## 4032 ENSEMBL ID's didn't have a matching ENTEREZ ID. Dropping them now.
## Before conversion, the expressionset has 25760 entries.
## Subsetting on genes.
## remove_genes_expt(), before removal, there were 25760 genes, now there are 21669.
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'converted_eset' not found
## Error in eval(expr, envir, enclos): object 'mm_gsva' not found
## Error in eval(expr, envir, enclos): object 'mm_gsva' not found
## Error in eval(expr, envir, enclos): object 'mm_gsva_sig' not found
## Error in eval(expr, envir, enclos): object 'mm_gsva_sig' not found
Come back to this, note to self the previous iteration was explicitly looking for Pseudomonas contamination.
genus_expt <- create_expt(gathered[["new_file"]],
file_column = "krakenmatrix", file_type = "table")
genus_norm <- normalize_expt(genus_expt, convert = "cpm")
plot_disheat(genus_norm)
genus_normv2 <- normalize_expt(genus_expt, convert = "cpm", transform = "log2")
plot_pca(genus_normv2)
plot_libsize(genus_expt)
head(exprs(genus_expt))
exprs(genus_expt)["Pseudomonas", ]
Until I get more meaningful condition names, I will just do B/A C/A C/B
keepers <- list(
"d13_vs_d10" = c("d13", "d10"),
"d15_vs_d10" = c("d15", "d10"),
"d15_vs_d13" = c("d15", "d13"))
de <- all_pairwise(mm_expt, filter = TRUE, model_batch = FALSE)
##
## d10 d13 d15
## 5 6 6
## Deleting the file excel/de_tables-v20240205.xlsx before writing the tables.
sig <- extract_significant_genes(
tables, according_to = "deseq", excel = glue("excel/de_sig-v{ver}.xlsx"))
## Deleting the file excel/de_sig-v20240205.xlsx before writing the tables.
mm38 is nicely supported in gProfiler/clusterProfiler.
## Running gProfiler on every set of significant genes found:
## GO KEGG REAC WP TF MIRNA HPA CORUM HP
## d13_vs_d10_up 1329 4 14 2 483 0 0 0 0
## d13_vs_d10_down 1296 9 58 1 407 0 0 0 0
## d15_vs_d10_up 1484 5 9 0 524 0 0 0 0
## d15_vs_d10_down 1374 11 80 2 489 0 0 0 0
## d15_vs_d13_up 656 3 3 0 146 0 0 0 1
## d15_vs_d13_down 168 0 0 0 36 0 0 0 0
## A set of ontologies produced by gprofiler using 2437
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1329 GO hits, 4, KEGG hits, 14 reactome hits, 2 wikipathway hits, 483 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 1497
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1296 GO hits, 9, KEGG hits, 58 reactome hits, 1 wikipathway hits, 407 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 2916
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1484 GO hits, 5, KEGG hits, 9 reactome hits, 0 wikipathway hits, 524 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 1840
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1374 GO hits, 11, KEGG hits, 80 reactome hits, 2 wikipathway hits, 489 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 753
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 656 GO hits, 3, KEGG hits, 3 reactome hits, 0 wikipathway hits, 146 transcription factor hits, 0 miRNA hits, 0 HPA hits, 1 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.
## A set of ontologies produced by gprofiler using 199
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 168 GO hits, 0, KEGG hits, 0 reactome hits, 0 wikipathway hits, 36 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category BP is the most populated with 30 hits.