1 Introduction

I am coming into this project in a state of perfect ignorance. Carrie kindly sent me a review or two yesterday, but I have yet to open the email and start reading them.

The only things I know for certain:

  1. There are ~ 24 samples with names prefixed with ‘A’, ‘B’, and ‘C’, corresponding to three time point collections at 10.5, 13.5, and 15.5 days. I am going to call these d10, d13, and d15.
  2. The most likely reference is the Ensembl house mouse mm39, though the real reference is actually the Charles River CD-1. I was reasonably certain yesterday that it is possible to download this mouse line’s reference, but I think that is untrue, or at least my attempts to find it failed. At this time I am not certain which samples are which mouse line.
  3. The sequence libraries are likely in the reverse orientation. At least that was my assumption.
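The prefix-to-time-point naming in item 1 can be sketched as a small lookup. This is only an illustration: the helper name `sample_time()` is hypothetical, and it assumes the first character of every sample name is the collection prefix.

```r
# Hypothetical helper: translate a sample name such as "A1" into its time point label.
# Assumes the first character of the name is the collection prefix (A, B, or C).
prefix_to_time <- c(A = "d10", B = "d13", C = "d15")

sample_time <- function(sample_names) {
  prefixes <- substr(sample_names, 1, 1)
  # Unknown prefixes come back as NA rather than raising an error.
  unname(prefix_to_time[prefixes])
}

sample_time(c("A1", "B3", "C2"))
```

Returning NA for an unexpected prefix should make a misnamed sample easy to spot before it reaches the condition column.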

The document ‘preprocess.Rmd’ outlines the commands I ran. I used my pipeline’s Process_RNASeq function, which trims, runs fastqc, kraken, hisat, and htseq by default.

2 Metadata

I received a complete sample sheet from either Najib or Carrie and modified it slightly to match the places where I put the raw data. I then copied it to ‘sample_sheets/all_samples.xlsx’ so that I need not worry about messing up the original.

My function ‘gather_preprocessing_metadata()’ has defaults which should provide some helpful new columns in this metadata sheet. Upon completion, it should write a new copy of the same file with a suffix ’_modified.xlsx’.
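As a rough sketch of the naming rule only (not the actual internals of gather_preprocessing_metadata()), the output filename could be derived like this; `modified_name()` is a hypothetical helper.

```r
# Hypothetical sketch: insert a suffix between the file's base name and its extension.
modified_name <- function(path, suffix = "_modified") {
  ext <- tools::file_ext(path)            # "xlsx"
  base <- tools::file_path_sans_ext(path) # keeps the directory part
  paste0(base, suffix, ".", ext)
}

modified_name("sample_sheets/all_samples.xlsx")
```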

FIXME: modify the function to detect columns with dates and make sure to keep the encoding the same. FIXED!

FIXME: Najib changed his template and added a new first row, add a check against that - or just delete the row manually. FIXED!

FIXME: Set the default significant digits back to NULL, having all these darn .000’s is annoying. FIXED!

modified <- gather_preprocessing_metadata("sample_sheets/all_samples.xlsx")
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Skipping for now
## Writing new metadata to: sample_sheets/all_samples_modified.xlsx
## Deleting the file sample_sheets/all_samples_modified.xlsx before writing the tables.

I immediately learned that I somehow forgot to process the first sample!? It is processing now; I have no clue how that happened.

3 Annotations

load_biomart_annotations, if not told anything else, will connect to ensembl and attempt to download the most commonly requested annotations for homo_sapiens from the archive server 2 years before the current date. This is because, as a general rule, I use genomes which are ~ 2-3 years old.
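The "two years back" default can be sketched as follows. This is an assumption-laden illustration, not the code inside load_biomart_annotations(): the function name `default_archive()` is hypothetical, and the host pattern (three-letter month plus year, followed by archive.ensembl.org) is my understanding of how Ensembl names its archive sites. Not every month has an archive, so a real implementation would need a fallback.

```r
# Sketch of the "two years before today" default; names and host pattern are assumptions.
default_archive <- function(today = Sys.Date(), years_back = 2) {
  parts <- as.POSIXlt(today)
  year <- parts$year + 1900 - years_back
  month <- tolower(month.abb[parts$mon + 1])
  list(year = year,
       month = month,
       host = paste0(month, year, ".archive.ensembl.org"))
}

# An example date chosen so that the result lands on July 2022.
default_archive(as.Date("2024-07-15"))
```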

I am also downloading the ontology data, though most tools are aware of Mus.

In addition, I will load the gff annotations from the gff file used to count the genes, just in case there are some mismatches between the ensembl and gff gene IDs.

annot <- load_biomart_annotations(species = "mmusculus", year = 2022, month = 7)
## The biomart annotations file already exists, loading from it.
tx_annotations <- annot[["annotation"]]
gene_annotations <- tx_annotations
kept <- !duplicated(gene_annotations[["ensembl_gene_id"]])
gene_annotations <- gene_annotations[kept, ]
rownames(gene_annotations) <- gene_annotations[["ensembl_gene_id"]]
gene_annotations[["ensembl_gene_id"]] <- NULL

go_db <- load_biomart_go(species = "mmusculus", year = 2022, month = 7)
## The biomart annotations file already exists, loading from it.
gff_annot <- load_gff_annotations("~/libraries/genome/mm38_100.gff", id_col = "gene_id")
## Returning a df with 32 columns and 3899382 rows.
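To check for the ID mismatches mentioned above, something like the following sketch should do. The vectors are made-up stand-ins for rownames(gene_annotations) and gff_annot[["gene_id"]], and `compare_ids()` is a hypothetical helper.

```r
# Toy stand-ins for the biomart and gff gene IDs; these IDs are invented.
biomart_ids <- c("gene_a", "gene_b", "gene_c")
gff_ids <- c("gene_b", "gene_c", "gene_d", "gene_c")  # gff rows repeat gene IDs

compare_ids <- function(first, second) {
  first <- unique(first)
  second <- unique(second)
  list(shared = intersect(first, second),
       only_first = setdiff(first, second),
       only_second = setdiff(second, first))
}

id_check <- compare_ids(biomart_ids, gff_ids)
id_check
```

With the real objects, the lengths of `only_first` and `only_second` would say how many genes have annotations but no counts, and vice versa.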

4 Initial expressionset

The experimental metadata now includes the count table filenames and I have a reasonable set of gene annotations. I should therefore be able to merge them all into an expressionset and/or summarizedExperiment.

mm_expt <- create_expt(modified[["new_file"]],
                       gene_info = gene_annotations,
                       annotation = "org.Mm.eg.db",
                       file_column = "hisatcounttable") %>%
  set_expt_conditions(fact = "time") %>%
  set_expt_batches(fact = "datecategorical")
## Reading the sample metadata.
## The sample definitions comprises: 17 rows(samples) and 46 columns(metadata fields).
## Matched 25654 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 25760 features and 17 samples.
## The numbers of samples by condition are:
## 
## d10 d13 d15 
##   5   6   6
## The number of samples by batch are:
## 
## b1 b2 b3 b4 
## 11  2  2  2
written <- write_expt(mm_expt, excel = glue("excel/all_samples-v{ver}.xlsx"))
## Writing the first sheet, containing a legend and some summary data.
## The following samples have less than 16744 genes.
## [1] "A1" "A2" "A5" "B1" "B3" "B4"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 152012 entries are 0.  We are on a log scale, adding 1 to the data.
## 
## Changed 152012 zero count features.
## 
## Naively calculating coefficient of variation/dispersion with respect to condition.
## 
## Finished calculating dispersion estimates.
## 
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Subsetting on genes.
## 
## remove_genes_expt(), before removal, there were 13457 genes, now there are 13457.
## 
## 
## Total:98 s

5 Poke at it

plot_libsize(mm_expt)
## Library sizes of 17 samples, 
## ranging from 27,324,171 to 69,203,380.

plot_nonzero(mm_expt)
## The following samples have less than 16744 genes.
## [1] "A1" "A2" "A5" "B1" "B3" "B4"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## A non-zero genes plot of 17 samples.
## These samples have an average 37.06 CPM coverage and 16818 genes observed, ranging from 16399 to
## 17275.
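For reference, the two quantities reported above (genes observed and CPM) can be approximated in base R. This is a sketch on a made-up matrix, and I am assuming "observed" simply means a count greater than zero, which may not be exactly what plot_nonzero() does.

```r
# Made-up count matrix: rows are genes, columns are samples.
counts <- matrix(c(0, 5, 12, 0,
                   3, 0, 7, 1),
                 nrow = 4,
                 dimnames = list(paste0("g", 1:4), c("s1", "s2")))

# Genes with at least one read in each sample.
genes_observed <- colSums(counts > 0)

# Counts per million: scale every column by its library size.
cpm <- t(t(counts) / colSums(counts)) * 1e6

genes_observed
colSums(cpm)
```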

norm <- normalize_expt(mm_expt, transform = "log2", convert = "cpm",
                       norm = "quant", filter = TRUE)
## Removing 12303 low-count genes (13457 remaining).
## transform_counts: Found 57 values equal to 0, adding 1 to the matrix.
plot_corheat(norm)
## A heatmap of pairwise sample correlations ranging from: 
## 0.745893022508784 to 0.996487658158264.

plot_disheat(norm)
## A heatmap of pairwise sample distances ranging from: 
## 19.647203152876 to 167.113181354411.

plot_pca(norm)
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by d10, d13, d15
## Shapes are defined by b1, b2, b3, b4.

Holy ass crackers! I am not sure I have ever had a dataset which split this coherently. I need better names than ‘A’ ‘B’ ‘C’.

Ok, I will just do a no-batch DE because I am not sure of the actual batches and/or surrogates. Besides, the data split so well that I am worried (not really) it is simulated.

Oh, before I forget, April has been asking about rRNA content. I think I quantified that?

5.1 Check the rRNA content

No, it appears I didn’t submit rRNA queries. Let’s do that now before I forget.

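cd preprocessing
start=$(pwd)
# Visit each sample directory and map its trimmed reads against the rRNA library.
for i in A* B* C*; do
    cd "$i" || continue
    # Join the trimmed read files with ':'; the sed removes the trailing ':'.
    cyoa --method hisat --species mm38_100 --libtype rRNA --gff_type misc_feature --gff_tag ID \
         --input "$(/bin/ls *-trimmed.fastq.xz | tr '\n' ':' | sed 's/:$//g')"
    cd "$start" || break
done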

I checked the logs for a few samples and it looks like the rRNA content is less than 1% in most (all?) of the samples.
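The percentage itself is just rRNA-mapped reads over total reads. A small sketch, with a hypothetical helper and invented read counts (these are not numbers from the logs):

```r
# Hypothetical helper; the read counts below are invented for illustration only.
rrna_percent <- function(rrna_reads, total_reads) {
  stopifnot(all(total_reads > 0), all(rrna_reads <= total_reads))
  100 * rrna_reads / total_reads
}

pct <- rrna_percent(rrna_reads = c(s1 = 150000, s2 = 420000),
                    total_reads = c(s1 = 30000000, s2 = 60000000))
pct < 1
```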

6 varpart

varpart <- simple_varpart(mm_expt)
## Subsetting on genes.
## remove_genes_expt(), before removal, there were 18857 genes, now there are 17392.
## 
## Total:130 s
varpart
## The result of using variancePartition with the model:
## ~ condition + batch

7 GSVA

Since I have not read the kindly-sent reviews, I will cheat a little and use GSVA to get some ideas about potential papers. I default to C2 which is likely not the right gene set list.

I just downloaded the new msigdb, let us use that instead of the much less interesting GSVAdata set. Frustratingly, the new version of MSigDB provides invalid XML (there are apparently ‘<’ characters in the text fields of this file, which is explicitly forbidden in the XML standard), so I wrote a function to read the annotations from the SQLite database.

FIXME: I need to do some work to clean up the IDs with this new function.
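While cleaning up IDs, it may help to look at the gene sets directly. The signatures are distributed as GMT files, which are plain tab-separated text: set name, a description or URL, then the member gene IDs. A minimal base R reader, assuming that layout, with a made-up two-set file standing in for the real one:

```r
# Minimal GMT reader: each line is "set name <TAB> description <TAB> gene1 <TAB> gene2 ...".
read_gmt <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # Drop the name and description fields, keep the unique gene IDs.
  sets <- lapply(fields, function(x) unique(x[-c(1, 2)]))
  names(sets) <- vapply(fields, function(x) x[1], character(1))
  sets
}

# Made-up two-set file written to a temporary location.
tmp <- tempfile(fileext = ".gmt")
writeLines(c("SET_ONE\thttp://example.org/one\t11\t22\t33",
             "SET_TWO\thttp://example.org/two\t22\t44"), tmp)
toy_sets <- read_gmt(tmp)
lengths(toy_sets)
```

The gene IDs come back as character strings, so they would need to be compared against the ENTREZ IDs as strings too.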

mm_gsva <- simple_gsva(mm_expt, orgdb = "org.Mm.eg.db", signature_category = "c2",
                       msig_db =  "reference/msigdb_v2023.2.Mm/msigdb_v2023.2.Mm.db",
                       signatures = "reference/msigdb_v2023.2.Mm/m2.all.v2023.2.Mm.entrez.gmt")
## Converting the rownames() of the expressionset to ENTREZID.
## 4032 ENSEMBL ID's didn't have a matching ENTEREZ ID. Dropping them now.
## Before conversion, the expressionset has 25760 entries.
## Subsetting on genes.
## remove_genes_expt(), before removal, there were 25760 genes, now there are 21669.
## After conversion, the expressionset has 21669 entries.
## Adding descriptions and IDs to the gene set annotations.
## Adding annotations from reference/msigdb_v2023.2.Mm/msigdb_v2023.2.Mm.db.
## Not subsetting the msigdb metadata, the wanted_meta argument was NULL.
mm_gsva
## GSVA result using method: ssgsea against the c2 dataset.
## Scores range from: -0.4357 to: 0.5643.
mm_gsva_sig <- get_sig_gsva_categories(mm_gsva, excel = "excel/gsva_sig_categories.xlsx")
## Starting limma pairwise comparison.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$libsize.
## Limma step 1/6: choosing model.
## Assuming this data is similar to a micro array and not performign voom.
## Limma step 3/6: running lmFit with method: ls.
## Limma step 4/6: making and fitting contrasts with no intercept. (~ 0 + factors)
## Limma step 5/6: Running eBayes with robust = FALSE and trend = FALSE.
## Limma step 6/6: Writing limma outputs.
## Limma step 6/6: 1/3: Creating table: d13_vs_d10.  Adjust = BH
## Limma step 6/6: 2/3: Creating table: d15_vs_d10.  Adjust = BH
## Limma step 6/6: 3/3: Creating table: d15_vs_d13.  Adjust = BH
## Limma step 6/6: 1/3: Creating table: d10.  Adjust = BH
## Limma step 6/6: 2/3: Creating table: d13.  Adjust = BH
## Limma step 6/6: 3/3: Creating table: d15.  Adjust = BH
## The factor d10 has 5 rows.
## The factor d13 has 6 rows.
## The factor d15 has 6 rows.
## Testing each factor against the others.
## Scoring d10 against everything else.
## Scoring d13 against everything else.
## Scoring d15 against everything else.
## Deleting the file excel/gsva_sig_categories.xlsx before writing the tables.
mm_gsva_sig
## The set of GSVA categories deemed significantly higher than the
## distribution of all scores.  It comprises 54 gene sets.

mm_gsva_sig$score_pca

8 Kraken matrix

Come back to this. Note to self: the previous iteration was explicitly looking for Pseudomonas contamination.

genus_expt <- create_expt(modified[["new_file"]],
                          file_column = "krakenmatrix", file_type = "table")
genus_norm <- normalize_expt(genus_expt, convert = "cpm")
plot_disheat(genus_norm)
genus_normv2 <- normalize_expt(genus_expt, convert = "cpm", transform = "log2")
plot_pca(genus_normv2)
plot_libsize(genus_expt)
head(exprs(genus_expt))
exprs(genus_expt)["Pseudomonas", ]

9 Differential Expression

Until I get more meaningful condition names, I will just do B/A, C/A, and C/B.

keepers <- list(
  "d13_vs_d10" = c("d13", "d10"),
  "d15_vs_d10" = c("d15", "d10"),
  "d15_vs_d13" = c("d15", "d13"))
de <- all_pairwise(mm_expt, filter = TRUE, model_batch = FALSE)
## 
## d10 d13 d15 
##   5   6   6

tables <- combine_de_tables(
  de, keepers = keepers, excel = glue("excel/de_tables-v{ver}.xlsx"))
sig <- extract_significant_genes(
  tables, according_to = "deseq", excel = glue("excel/de_sig-v{ver}.xlsx"))
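If more time points show up, the keepers list above could be generated instead of typed. A sketch, assuming the conditions are given in chronological order and that every later time point is compared against every earlier one; `make_keepers()` is a hypothetical helper.

```r
# Build every later-vs-earlier contrast from an ordered vector of condition names.
make_keepers <- function(conditions) {
  pairs <- combn(conditions, 2, simplify = FALSE)
  # combn() yields (earlier, later); flip so the later time point is the numerator.
  keepers <- lapply(pairs, function(p) c(p[2], p[1]))
  names(keepers) <- vapply(keepers, function(k) paste0(k[1], "_vs_", k[2]),
                           character(1))
  keepers
}

auto_keepers <- make_keepers(c("d10", "d13", "d15"))
names(auto_keepers)
```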

10 Ontology enrichment

mm38 is nicely supported in gProfiler/clusterProfiler.

all_gp <- all_gprofiler(sig, species = "mmusculus", plot_type = "dotplot")
all_gp
## Running gProfiler on every set of significant genes found:
##                   GO KEGG REAC WP  TF MIRNA HPA CORUM HP
## d13_vs_d10_up   1291    4   14  1 478     0   0     0  0
## d13_vs_d10_down 1294    8   58  1 401     0   0     0  0
## d15_vs_d10_up   1467    6   10  0 521     0   0     0  0
## d15_vs_d10_down 1346   10   83  2 483     0   0     0  0
## d15_vs_d13_up    625    4    3  0 142     0   0     0  1
## d15_vs_d13_down  156    0    0  0  34     0   0     0  0
all_gp[["d13_vs_d10_up"]]
## A set of ontologies produced by gprofiler using 2437
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1291 GO hits, 4, KEGG hits, 14 reactome hits, 1 wikipathway hits, 478 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.

all_gp[["d13_vs_d10_down"]]
## A set of ontologies produced by gprofiler using 1497
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1294 GO hits, 8, KEGG hits, 58 reactome hits, 1 wikipathway hits, 401 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.

all_gp[["d15_vs_d10_up"]]
## A set of ontologies produced by gprofiler using 2916
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1467 GO hits, 6, KEGG hits, 10 reactome hits, 0 wikipathway hits, 521 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.

all_gp[["d15_vs_d10_down"]]
## A set of ontologies produced by gprofiler using 1840
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 1346 GO hits, 10, KEGG hits, 83 reactome hits, 2 wikipathway hits, 483 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.

all_gp[["d15_vs_d13_up"]]
## A set of ontologies produced by gprofiler using 753
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 625 GO hits, 4, KEGG hits, 3 reactome hits, 0 wikipathway hits, 142 transcription factor hits, 0 miRNA hits, 0 HPA hits, 1 HP hits, and 0 CORUM hits.
## Category MF is the most populated with 30 hits.

all_gp[["d15_vs_d13_down"]]
## A set of ontologies produced by gprofiler using 199
## genes against the mmusculus annotations and significance cutoff 0.05.
## There are 156 GO hits, 0, KEGG hits, 0 reactome hits, 0 wikipathway hits, 34 transcription factor hits, 0 miRNA hits, 0 HPA hits, 0 HP hits, and 0 CORUM hits.
## Category BP is the most populated with 30 hits.

## all_cp <- all_clusterprofiler(sig, species = "mmusculus")

11 Splicing isoform differences

I spent a little time messing with suppa (https://doi.org/10.1186/s13059-018-1417-1) and have what I think are the various comparisons across time for the different splicing variants etc.

I do not currently have an automagic way to load the results from suppa, so let us take a moment and see if the results make any sense.

d13_d10_all_isoforms <- "outputs/90suppa_mm38_100/diff/d13_d10_ioi.dpsi"
d13_d10_a3 <- "outputs/90suppa_mm38_100/diff/d13_d10_A3.dpsi"
d13_d10_a5 <- "outputs/90suppa_mm38_100/diff/d13_d10_A5.dpsi"
d13_d10_af <- "outputs/90suppa_mm38_100/diff/d13_d10_AF.dpsi"
d13_d10_al <- "outputs/90suppa_mm38_100/diff/d13_d10_AL.dpsi"
d13_d10_mx <- "outputs/90suppa_mm38_100/diff/d13_d10_MX.dpsi"
d13_d10_ri <- "outputs/90suppa_mm38_100/diff/d13_d10_RI.dpsi"
d13_d10_se <- "outputs/90suppa_mm38_100/diff/d13_d10_SE.dpsi"
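Since the filenames follow one pattern, the same paths can be built for any contrast. This sketch only constructs and checks the paths; it does not read the files. `dpsi_paths()` is a hypothetical helper, and the event list is just the set used above.

```r
suppa_dir <- "outputs/90suppa_mm38_100/diff"
event_types <- c("ioi", "A3", "A5", "AF", "AL", "MX", "RI", "SE")

# Return a named character vector of .dpsi paths for one contrast.
dpsi_paths <- function(contrast, dir = suppa_dir, events = event_types) {
  paths <- file.path(dir, paste0(contrast, "_", events, ".dpsi"))
  names(paths) <- tolower(events)
  paths
}

d13_d10_files <- dpsi_paths("d13_d10")
d13_d10_files[["se"]]
# Which of the expected files are actually present?
file.exists(d13_d10_files)
```

Calling dpsi_paths("d15_d10") or dpsi_paths("d15_d13") would give the other two contrasts, assuming suppa was run with the same naming for them.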
