1 Introduction

I am coming into this project in a state of perfect ignorance. Carrie kindly sent me a review or two yesterday, but I have yet to open the email and start reading them.

The only things I know for certain:

  1. There are ~ 24 samples with names prefixed with ‘A’, ‘B’, and ‘C’.
  2. The most likely reference is the Ensembl house mouse mm39, though the real reference is the Charles River CD-1. Yesterday I was reasonably certain that this mouse line’s reference could be downloaded, but I now think that is untrue, or at least my attempts to find it failed.
  3. The sequencing libraries are likely in the reverse orientation; at least, that was my assumption.

The document ‘preprocess.Rmd’ outlines the commands I ran. I used my pipeline’s Process_RNASeq function, which by default trims the reads and then runs fastqc, kraken, hisat, and htseq.

2 Metadata

I received a complete sample sheet from either Najib or Carrie and modified it slightly to match the places where I put the raw data. I then copied it to ‘sample_sheets/all_samples.xlsx’ so that I need not worry about messing up the original.

My function ‘gather_preprocessing_metadata()’ has defaults which should provide some helpful new columns in this metadata sheet. Upon completion, it should write a new copy of the same file with a suffix ’_modified.xlsx’.

FIXME: Modify the function to detect columns with dates and make sure to keep the encoding the same.
FIXME: Najib changed his template and added a new first row; add a check against that, or just delete the row manually.
FIXME: Set the default significant digits back to NULL; having all these darn .000’s is annoying.

modified <- gather_preprocessing_metadata("sample_sheets/all_samples.xlsx")
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Skipping for now
I immediately learned that I somehow forgot to process the first sample!? It is processing now; I have no clue how that happened.
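The log above shows the missing condition and batch columns being filled in as undefined. Here is a minimal base-R sketch of that kind of check. The helper name fill_missing_columns() and the toy sheet are made up for illustration; this is not necessarily what gather_preprocessing_metadata() does internally.

```r
# Hypothetical helper: add any missing required column, filled with "undefined".
fill_missing_columns <- function(sheet, required = c("condition", "batch")) {
  for (column in required) {
    if (!column %in% colnames(sheet)) {
      message("Did not find the ", column, " column in the sample sheet.")
      sheet[[column]] <- "undefined"
    }
  }
  sheet
}

# Toy sample sheet (assumed layout) with neither column present.
toy_sheet <- data.frame(sampleid = c("A1", "A2", "B1"), stringsAsFactors = FALSE)
filled <- fill_missing_columns(toy_sheet)
print(filled)
```

Columns which already exist are left untouched, so running it a second time on the same sheet changes nothing.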

3 Annotations

load_biomart_annotations, if not told anything else, will connect to Ensembl and attempt to download the most commonly requested annotations for homo_sapiens from the archive server 2 years before the current date. This is because, as a general rule, I use genomes which are ~ 2-3 years old. Here I ask explicitly for mmusculus from July 2022.
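To make the "2 years before the current date" rule concrete, here is a rough base-R sketch of how one might derive an Ensembl archive hostname from a date. The function name archive_host() is hypothetical and the logic is my assumption, not the code inside load_biomart_annotations. Ensembl also does not publish an archive for every month, so a real implementation would need a fallback.

```r
# Hypothetical helper: build an Ensembl archive hostname N years before a date.
archive_host <- function(today = Sys.Date(), years_back = 2) {
  # Step back the requested number of years from the starting date.
  target <- seq(today, by = paste0("-", years_back, " years"), length.out = 2)[2]
  # month.abb avoids locale-dependent month names from format(x, "%b").
  month <- tolower(month.abb[as.integer(format(target, "%m"))])
  year <- format(target, "%Y")
  paste0(month, year, ".archive.ensembl.org")
}

# A fixed date keeps the example reproducible.
host <- archive_host(as.Date("2024-07-15"))
print(host)
```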

I am also downloading the ontology data, though most tools are aware of Mus.

In addition, I will load the gff annotations from the gff file used to count the genes, just in case there are some mismatches between the ensembl and gff gene IDs.

annot <- load_biomart_annotations(species = "mmusculus", year = 2022, month = 7)
## The biomart annotations file already exists, loading from it.
gene_annotations <- annot[["gene_annotations"]]

go_db <- load_biomart_go(species = "mmusculus", year = 2022, month = 7)
## The biomart annotations file already exists, loading from it.
gff_annot <- load_gff_annotations("~/libraries/genome/mm38_100.gff", id_col = "gene")
## Warning in load_gff_annotations("~/libraries/genome/mm38_100.gff", id_col = "gene"): Attempting to create a dataframe with gene and locus_tag both
## failed.
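A quick way to look for the ID mismatches mentioned above is to compare the two ID sets directly. This is a sketch on toy vectors; in practice the IDs would come from gene_annotations and gff_annot, and which column holds them is an assumption I have not checked.

```r
# Toy ID vectors standing in for the biomart and gff gene IDs.
biomart_ids <- c("ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000028")
gff_ids <- c("ENSMUSG00000000001", "ENSMUSG00000000028", "ENSMUSG00000000037")

# IDs present in only one of the two sources, and those shared by both.
only_biomart <- setdiff(biomart_ids, gff_ids)
only_gff <- setdiff(gff_ids, biomart_ids)
shared <- intersect(biomart_ids, gff_ids)

message(length(shared), " shared, ", length(only_biomart), " biomart-only, ",
        length(only_gff), " gff-only.")
```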

4 Initial expressionset

The experimental metadata now includes the count table filenames and I have a reasonable set of gene annotations. I should therefore be able to merge them all into an expressionset and/or SummarizedExperiment.

mm_expt <- create_expt(modified[["new_file"]],
                       gene_info = gene_annotations,
                       file_column = "hisatcounttable") %>%
  set_expt_conditions(fact = "abc") %>%
  set_expt_batches(fact = "number")
## Reading the sample metadata.
## The sample definitions comprises: 17 rows(samples) and 41 columns(metadata fields).
## Matched 25760 annotations and counts.
## Bringing together the count matrix and gene information.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 25760 features and 17 samples.
## The numbers of samples by condition are:
## A B C 
## 5 6 6
## The number of samples by batch are:
## b1 b2 b3 b4 b5 b6 
##  3  3  3  3  3  2
# Pass the expressionset itself; a bare 'expt' resolves to a function, which is
# what produced the "unable to find an inherited method for function 'exprs'" error.
written <- write_expt(mm_expt, excel = glue("excel/all_samples-v{ver}.xlsx"))

5 Poke at it

## Library sizes of 17 samples, 
## ranging from 27,324,171 to 69,203,380.

## The following samples have less than 16744 genes.
## [1] "A1" "A2" "A5" "B1" "B3" "B4"
## A non-zero genes plot of 17 samples.
## These samples have an average 37.06 CPM coverage and 16818 genes observed, ranging from 16399 to
## 17275.

norm <- normalize_expt(mm_expt, transform = "log2", convert = "cpm",
                       norm = "quant", filter = TRUE)
## Removing 12303 low-count genes (13457 remaining).
## transform_counts: Found 57 values equal to 0, adding 1 to the matrix.
## A heatmap of pairwise sample correlations ranging from: 
## 0.745893022508784 to 0.996487658158264.

## A heatmap of pairwise sample distances ranging from: 
## 19.647203152876 to 167.113181354411.

## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by A, B, C
## Shapes are defined by b1, b2, b3, b4, b5, b6.
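For reference, the library sizes, observed-gene counts, and cpm/log2 values reported above boil down to fairly simple arithmetic. Here is a base-R sketch on a made-up count matrix. It leaves out the low-count filter and the quantile normalization that normalize_expt also applied, so it is an illustration rather than a reimplementation.

```r
# Toy count matrix: 3 genes (rows) by 3 samples (columns); values are invented.
counts <- matrix(c(10, 0, 90,
                   20, 5, 75,
                   0, 0, 100),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3")))

lib_sizes <- colSums(counts)           # reads per sample
observed_genes <- colSums(counts > 0)  # genes with at least one read

# Counts per million: scale each column by its library size.
cpm <- sweep(counts, 2, lib_sizes, "/") * 1e6
# Adding 1 before log2 avoids taking the log of zero.
log_cpm <- log2(cpm + 1)

# Pairwise sample correlations, as shown in the correlation heatmap.
sample_cor <- cor(log_cpm)
print(round(sample_cor, 3))
```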

Holy ass crackers! I am not sure I have ever had a dataset which split this coherently. I need better names than ‘A’, ‘B’, and ‘C’.

Ok, I will just do a no-batch DE because I am not sure of the actual batches and/or surrogates. Besides, who cares: the data split so well that I am worried (not really) it is simulated.

Oh, before I forget, April has been asking about rRNA content. I think I quantified that?

5.1 Check the rRNA content

No, it appears I didn’t submit rRNA queries. Let’s do that now before I forget.

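cd preprocessing
# Remember the preprocessing directory so each iteration can return to it.
start=$(pwd)
for i in A* B* C*; do
    cd "$i" || continue
    # Join the trimmed reads into one colon-separated --input argument.
    cyoa --method hisat --species mm38_100 --libtype rRNA --gff_type misc_feature --gff_tag ID \
         --input $(/bin/ls *-trimmed.fastq.xz | tr '\n' ':' | sed 's/:$//g')
    cd "$start"
done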

6 varpart

varpart <- simple_varpart(mm_expt)
## Total:162 s
## The result of using variancePartition with the model:
## ~ condition + batch
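As a rough intuition for what a ~ condition + batch variance breakdown reports, the sketch below splits the variance of one simulated gene using sequential sums of squares from a plain linear model. This is a crude stand-in and not variancePartition's actual algorithm; the data and the effect sizes are invented.

```r
set.seed(1)
# Balanced toy design: 3 conditions by 4 batches, one sample per combination.
condition <- factor(rep(c("A", "B", "C"), each = 4))
batch <- factor(rep(c("b1", "b2", "b3", "b4"), times = 3))

# Simulated expression for one gene: a strong condition effect plus noise.
expression <- rnorm(12) + c(A = 0, B = 2, C = 4)[as.character(condition)]

fit <- lm(expression ~ condition + batch)
fit_anova <- anova(fit)

# Fraction of the total sum of squares attributed to each term and the residual.
sums_of_squares <- fit_anova[["Sum Sq"]]
names(sums_of_squares) <- rownames(fit_anova)
fractions <- sums_of_squares / sum(sums_of_squares)
print(round(fractions, 3))
```

With a clean split by condition, as in this dataset, one would expect the condition fraction to dominate for many genes.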

7 GSVA

Since I have not read the kindly-sent reviews, I will cheat a little and use GSVA to get some ideas about potential papers. I default to C2, which is likely not the right gene set list.

I just downloaded the new MSigDB; let us use that instead of the much less interesting GSVAdata set. Frustratingly, the new version of MSigDB provides invalid XML: there are apparently unescaped ‘<’ characters in the text fields of this file, which the XML standard explicitly forbids. So I wrote a function to read the annotations from the SQLite database instead.

FIXME: I need to do some work to clean up the IDs with this new function.

mm_gsva <- simple_gsva(mm_expt, orgdb = "org.Mm.eg.db")
## Converting the rownames() of the expressionset to ENTREZID.
## 4032 ENSEMBL ID's didn't have a matching ENTEREZ ID. Dropping them now.
## Before conversion, the expressionset has 25760 entries.
## After conversion, the expressionset has 22019 entries.
#msig_meta <- get_msigdb_metadata(mm_gsva,
#                                 msig_db = "reference/msigdb_v2023.2.Mm/msigdb_v2023.2.Mm.db")

mm_gsva_sig <- get_sig_gsva_categories(mm_gsva)
## Starting limma pairwise comparison.
## Deleting the file excel/gsva_subset.xlsx before writing the tables.

8 Kraken matrix

Come back to this. Note to self: the previous iteration was explicitly looking for Pseudomonas contamination.

# The gathered metadata was saved as 'modified' above, not 'gathered'.
genus_expt <- create_expt(modified[["new_file"]],
                          file_column = "krakenmatrix", file_type = "table")
genus_norm <- normalize_expt(genus_expt, convert = "cpm")
genus_normv2 <- normalize_expt(genus_expt, convert = "cpm", transform = "log2")
exprs(genus_expt)["Pseudomonas", ]

9 Differential Expression

Until I get more meaningful condition names, I will just do B/A, C/A, and C/B.

keepers <- list(
  "ba" = c("B", "A"),
  "ca" = c("C", "A"),
  "cb" = c("C", "B"))
de <- all_pairwise(mm_expt, filter = TRUE, model_batch = FALSE)
## A B C 
## 5 6 6
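# Assumed missing step: 'tables' is never defined above, so combine the pairwise
# results here, restricted to the contrasts named in 'keepers'.
tables <- combine_de_tables(de, keepers = keepers)
sig <- extract_significant_genes(
  tables, according_to = "deseq", excel = glue("excel/de_sig-v{ver}.xlsx"))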
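To spell out what the keepers list and model_batch = FALSE imply, here is a base-R sketch of a condition-only design matrix and the three contrast vectors, with the numerator first and the denominator second. The sample counts per condition are taken from the output above. Building the contrasts by hand like this is my own illustration and not how all_pairwise does it.

```r
# 5 A, 6 B, and 6 C samples, as reported when the expressionset was created.
condition <- factor(rep(c("A", "B", "C"), times = c(5, 6, 6)))

# Cell-means design without a batch term (the "no-batch" model).
design <- model.matrix(~ 0 + condition)
colnames(design) <- levels(condition)

keepers <- list(
  "ba" = c("B", "A"),
  "ca" = c("C", "A"),
  "cb" = c("C", "B"))

# One column per contrast: +1 for the numerator, -1 for the denominator.
contrasts <- sapply(keepers, function(pair) {
  weights <- setNames(numeric(ncol(design)), colnames(design))
  weights[pair[1]] <- 1
  weights[pair[2]] <- -1
  weights
})
print(contrasts)
```

Adding batch back in later would only mean changing the formula to ~ 0 + condition + batch and padding the contrast vectors with zeros for the batch columns.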

10 Ontology enrichment

mm38 is nicely supported in gProfiler/clusterProfiler.

all_gp <- all_gprofiler(sig, species = "mmusculus")
## all_cp <- all_clusterprofiler(sig, species = "mmusculus")