Take a look at these samples. I will spend a few minutes reading their previous document to fill this intro out.
I used the ensembl mm39 release 112. That should work find with the most recent biomart gene annotations (202411).
For the moment, the useast server seems to be down, I think I will just use the archive until that is fixed.
##mm_annot <- load_biomart_annotations(species = "mmusculus", archive = FALSE)
mm_annot <- load_biomart_annotations(species = "mmusculus")
## The biomart annotations file already exists, loading from it.
I will probably disable the following block once I have the starting metadata in a completed state so that if someone wants to see all the numbers of mapped reads etc without having my working tree they will still be available.
meta <- gather_preprocessing_metadata("sample_sheets/all_samples.xlsx", species = "mm39_112",
new_metadata = sample_sheet)
This results in a new column ‘hisat_count_table’, so let us use that, and the gene annotations, to create an expressionset.
mm_expt <- create_expt(sample_sheet,
gene_info = mm_genes,
file_column = "hisat_count_table") %>%
set_expt_conditions(fact = "flox")
## Reading the sample metadata.
## The sample definitions comprises: 24 rows(samples) and 27 columns(metadata fields).
## Matched 25409 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 25425 features and 24 samples.
## The numbers of samples by condition are:
##
## control_1a control_2b control_s9 ko_1a_c25b ko_2b_54C1 ko_2b_54C2 ko_s9_s8 so_s9_r14
## 3 3 3 3 3 3 3 3
## A modified expressionSet containing 25425 genes and 24 sample. There are 28 metadata columns and 13 annotation columns.
## The primary condition is comprised of:
## control_1a, control_2b, control_s9, ko_1a_c25b, ko_2b_54C1, ko_2b_54C2, ko_s9_s8, so_s9_r14.
## Its current state is: raw(data).
combined <- paste0(pData(mm_expt)[["state"]], "_",
pData(mm_expt)[["line"]])
pData(mm_expt)[["combined"]] <- combined
plot_meta_sankey(mm_expt, factors = c("flox", "line", "batch"))
## A sankey plot describing the metadata of 24 samples,
## including 40 out of 0 nodes and traversing metadata factors:
## .
## Library sizes of 24 samples,
## ranging from 14,686,859 to 17,960,838.
## The following samples have less than 16526.25 genes.
## [1] "1_IATF-1A_" "2_IATF-1A_" "3_IATF-1A_" "4_IATF-2B_" "5_IATF-2B_" "6_IATF-2B_" "7_IATF-S9_" "8_IATF-S9_"
## [9] "9_IATF-S9_" "10_IATF-1A" "11_IATF-1A" "12_IATF-1A" "13_IATF-2B" "14_IATF-2B" "15_IATF-2B" "16_IATF-2B"
## [17] "19_IATF-S9" "20_IATF-S9" "21_IATF-S9" "22_IATF-S9" "23_IATF-S9" "24_IATF-S9"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## A non-zero genes plot of 24 samples.
## These samples have an average 16.65 CPM coverage and 15156 genes observed, ranging from 14576 to
## 16635.
## A heatmap of pairwise sample correlations ranging from:
## 0.573161898662224 to 0.999309770057417.
mm_norm <- normalize_expt(mm_expt, transform = "log2", convert = "cpm",
norm = "quant", filter = TRUE)
## Removing 13165 low-count genes (12260 remaining).
## transform_counts: Found 1918 values equal to 0, adding 1 to the matrix.
## A heatmap of pairwise sample distances ranging from:
## 14.9259724949041 to 139.831122474879.
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by control_1a, control_2b, control_s9, ko_1a_c25b, ko_2b_54C1, ko_2b_54C2, ko_s9_s8, so_s9_r14
## Shapes are defined by b1, b2, b3.
Wow, these samples are clustering absurdly well.
## The numbers of samples by condition are:
##
## control_l1a control_l2b control_ls9 ko_l1a ko_l2b ko_ls9
## 3 3 3 3 6 6
keepers <- list(
"l1a" = c("ko_l1a", "control_l1a"),
"l2b" = c("ko_l2b", "control_l2b"),
"ls9" = c("ko_ls9", "control_ls9"))
mm_de <- all_pairwise(mm_combined, keepers = keepers, keep_underscore = TRUE)
##
## control_l1a control_l2b control_ls9 ko_l1a ko_l2b ko_ls9
## 3 3 3 3 6 6
##
## b1 b2 b3
## 8 8 8
## A pairwise differential expression with results from: basic, deseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: batch in model/limma.
## The primary analysis performed 10 comparisons.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup limma_sigdown
## 1 ko_l1a_vs_control_l1a 1203 1651 1333 1714 1711 1457
## 2 ko_l2b_vs_control_l2b 1612 1557 1759 1684 1723 1655
## 3 ko_ls9_vs_control_ls9 1089 539 1081 581 752 735
## Plot describing unique/shared genes in a differential expression table.
## A set of genes deemed significant according to limma, edger, deseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down basic_up basic_down
## ko_l1a_vs_control_l1a 1711 1457 1333 1714 1203 1651 1198 1079
## ko_l2b_vs_control_l2b 1723 1655 1759 1684 1612 1557 1261 1194
## ko_ls9_vs_control_ls9 752 735 1081 581 1089 539 536 588
## Running gProfiler on every set of significant genes found:
## BP CORUM HP KEGG MIRNA MF REAC TF WP
## ko_l1a_vs_control_l1a_up 330 0 0 0 0 63 0 150 0
## ko_l1a_vs_control_l1a_down 882 0 0 0 0 87 5 293 0
## ko_l2b_vs_control_l2b_up 181 0 0 0 0 69 0 106 0
## ko_l2b_vs_control_l2b_down 896 0 1 2 0 89 15 302 1
## ko_ls9_vs_control_ls9_up 517 0 0 4 0 61 5 254 0
## ko_ls9_vs_control_ls9_down 429 0 0 0 2 31 3 21 0