This will be a very minimal analysis until we get some replicates.
I am using mm38_95.
## My load_biomart_annotations() function defaults to human, so that will be quick.
mm_annot <- load_biomart_annotations(species="mmusculus")
## The biomart annotations file already exists, loading from it.
mm_annot <- mm_annot[["annotation"]]
mm_annot[["txid"]] <- paste0(mm_annot[["ensembl_transcript_id"]], ".", mm_annot[["version"]])
rownames(mm_annot) <- make.names(mm_annot[["ensembl_gene_id"]], unique=TRUE)
tx_gene_map <- mm_annot[, c("txid", "ensembl_gene_id")]
So, I now have 2 data frames of parasite annotations and 1 human.
I am going to write a quick sample sheet in the current working directory called ‘all_samples.xlsx’ and put the names of the count tables in it.
Here I combine the metadata, count data, and annotations.
It is worth noting that the gene IDs from htseq-count probably do not match the annotations retrieved because they are likely exon-based rather than gene based. This is not really a problem, but don’t forget it!
mm38_expt <- create_expt("sample_sheets/all_samples.xlsx", tx_gene_map=tx_gene_map,
gene_info=mm_annot, file_column="salmonfile")
## Reading the sample metadata.
## The sample definitions comprises: 8 rows(samples) and 8 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count tables.
## Matched 6764 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 6764 rows and 8 columns.
mmtx_annot <- mm_annot
rownames(mmtx_annot) <- mm_annot[["txid"]]
mm38_tx <- create_expt("sample_sheets/all_samples.xlsx",
gene_info=mmtx_annot, file_column="salmonfile")
## Reading the sample metadata.
## The sample definitions comprises: 8 rows(samples) and 8 columns(metadata fields).
## Reading count tables.
## Reading salmon data with tximport.
## Finished reading count tables.
## Matched 6897 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 56886 rows and 8 columns.
In this block I will calculate all the diagnostic plots, but not show them. I will show them next with a little annotation.
I will leave the output for the first of each invocation and silence it for the second.
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, adding 1 to the data.
## Changed 21398 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 21398 zero count features.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.
## Printing a color to condition legend.
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 2794 low-count genes (3970 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 1105 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
The only interesting DE I see in this is to compare the retinas to the dlgns. I can treat them as replicates and compare.
mm <- set_expt_conditions(mm38_expt, fact="celltype")
mm_norm <- normalize_expt(mm, norm="quant",
convert="cpm", transform="log2", filter=TRUE)
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 2794 low-count genes (3970 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 1105 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 2794 low-count genes (3970 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 1105 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## Plotting a PCA before surrogates/batch inclusion.
## Not putting labels on the plot.
## Assuming no batch in model for testing pca.
## Not putting labels on the plot.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
dlgn_ko <- "iprgc_06"
dlgn_wt <- "iprgc_01"
dlgn_ht <- "iprgc_04"
scn_ko <- "iprgc_08"
scn_wt <- "iprgc_03"
retina_ko <- "iprgc_07"
retina_wt <- "iprgc_02"
retina_ht <- "iprgc_05"
het_retina_vs_het_dlgn <- exprs(mm38_norm)[, retina_ht] - exprs(mm38_norm)[, dlgn_ht]
ko_retina_vs_ko_dlgn <- exprs(mm38_norm)[, retina_ko] - exprs(mm38_norm)[, dlgn_ko]
ko_retina_vs_ko_scn <- exprs(mm38_norm)[, retina_ko] - exprs(mm38_norm)[, scn_ko]
het_retina_vs_ko_retina <- exprs(mm38_norm)[, retina_ht] - exprs(mm38_norm)[, retina_ko]
het_dlgn_vs_ko_scn <- exprs(mm38_norm)[, dlgn_ht] - exprs(mm38_norm)[, scn_ko]
ko_dlgn_vs_ko_scn <- exprs(mm38_norm)[, dlgn_ko] - exprs(mm38_norm)[, scn_ko]
dlgn_kowt <- exprs(mm38_norm)[, dlgn_ko] - exprs(mm38_norm)[, dlgn_wt]
dlgn_htwt <- exprs(mm38_norm)[, dlgn_ht] - exprs(mm38_norm)[, dlgn_wt]
dlgn_koht <- exprs(mm38_norm)[, dlgn_ko] - exprs(mm38_norm)[, dlgn_ht]
scn_kowt <- exprs(mm38_norm)[, scn_ko] - exprs(mm38_norm)[, scn_wt]
retina_kowt <- exprs(mm38_norm)[, retina_ko] - exprs(mm38_norm)[, retina_wt]
retina_htwt <- exprs(mm38_norm)[, retina_ht] - exprs(mm38_norm)[, retina_wt]
retina_koht <- exprs(mm38_norm)[, retina_ko] - exprs(mm38_norm)[, retina_ht]
pair_mtrx <- cbind(het_retina_vs_het_dlgn,
ko_retina_vs_ko_dlgn,
ko_retina_vs_ko_scn,
het_retina_vs_ko_retina,
het_dlgn_vs_ko_scn,
ko_dlgn_vs_ko_scn,
dlgn_kowt, dlgn_htwt, dlgn_koht,
scn_kowt,
retina_kowt, retina_htwt, retina_koht)
mm_tables <- combine_de_tables(mm_de, extra_annot=pair_mtrx,
excel=glue::glue("excel/{rundate}mm_tables.xlsx"))
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on table 1/3: retina_vs_dlgn
## Working on table 2/3: scn_vs_dlgn
## Working on table 3/3: scn_vs_retina
## Adding venn plots for retina_vs_dlgn.
## Limma expression coefficients for retina_vs_dlgn; R^2: 0.934; equation: y = 0.967x + 0.0347
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Deseq expression coefficients for retina_vs_dlgn; R^2: 0.822; equation: y = 0.955x + 0.0956
## Edger expression coefficients for retina_vs_dlgn; R^2: 0.917; equation: y = 0.977x + 0.0265
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Adding venn plots for scn_vs_dlgn.
## Limma expression coefficients for scn_vs_dlgn; R^2: 0.934; equation: y = 0.967x + 0.0347
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Deseq expression coefficients for scn_vs_dlgn; R^2: 0.822; equation: y = 0.955x + 0.0956
## Edger expression coefficients for scn_vs_dlgn; R^2: 0.917; equation: y = 0.977x + 0.0265
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Adding venn plots for scn_vs_retina.
## Limma expression coefficients for scn_vs_retina; R^2: 0.934; equation: y = 0.967x + 0.0347
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Deseq expression coefficients for scn_vs_retina; R^2: 0.822; equation: y = 0.955x + 0.0956
## Edger expression coefficients for scn_vs_retina; R^2: 0.917; equation: y = 0.977x + 0.0265
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Writing summary information, compare_plot is: TRUE.
## Performing save of excel/20200108mm_tables.xlsx.