This will be a very minimal analysis until we get some replicates.
I am using the mm38 annotation, Ensembl release 95 (mm38_95).
## My load_biomart_annotations() function defaults to human, so explicitly request mouse.
mm_annot <- load_biomart_annotations(species="mmusculus")
## The biomart annotations file already exists, loading from it.
mm_annot <- mm_annot[["annotation"]]
mm_annot[["txid"]] <- paste0(mm_annot[["ensembl_transcript_id"]], ".", mm_annot[["version"]])
rownames(mm_annot) <- make.names(mm_annot[["ensembl_gene_id"]], unique=TRUE)
tx_gene_map <- mm_annot[, c("txid", "ensembl_gene_id")]
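For reference, tximport expects the transcript-to-gene map as a two-column data frame with transcript IDs in the first column. A toy sketch of that shape, using made-up versioned Ensembl IDs (the real map above comes from mm_annot):

```r
## Made-up illustration of the shape tximport wants for tx2gene;
## the actual tx_gene_map is built from the biomart annotations above.
toy_map <- data.frame(
  txid = c("ENSMUST00000000001.4", "ENSMUST00000000003.13"),
  ensembl_gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000003"))
head(toy_map)
```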
So, I now have a data frame of mouse gene annotations and a transcript-to-gene map.
I am going to write a quick sample sheet, 'sample_sheets/all_samples.xlsx', and put the names of the count tables in it.
Here I combine the metadata, count data, and annotations.
It is worth noting that the IDs from salmon probably do not match the annotations retrieved directly, because salmon quantifies versioned transcripts rather than genes; tximport handles the aggregation. This is not really a problem, but don’t forget it!
mm38_salmon <- create_expt("sample_sheets/all_samples.xlsx", tx_gene_map=tx_gene_map,
                           gene_info=mm_annot, file_column="salmonfile")
## Reading the sample metadata.
## The sample definitions comprises: 8 rows(samples) and 8 columns(metadata fields).
## Reading count tables.
## Using the transcript to gene mapping.
## Reading salmon data with tximport.
## Finished reading count tables.
## Matched 6764 annotations and counts.
## Bringing together the count matrix and gene information.
## The mapped IDs are not the rownames of your gene information, changing them now.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 6764 rows and 8 columns.
mmtx_annot <- mm_annot
rownames(mmtx_annot) <- mm_annot[["txid"]]
mm38_saltx <- create_expt("sample_sheets/all_samples.xlsx",
                          gene_info=mmtx_annot, file_column="salmonfile")
## Reading the sample metadata.
## The sample definitions comprises: 8 rows(samples) and 8 columns(metadata fields).
## Reading count tables.
## Reading salmon data with tximport.
## Finished reading count tables.
## Matched 6897 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 56886 rows and 8 columns.
In this block I will calculate all the diagnostic plots but not show them; I will show them next with a little annotation. I will leave the output for the first of each invocation and silence it for the second.
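The code for this block was not preserved; the output below looks like what hpgltools' metric plotting emits, so the invocations were perhaps along the lines of the following sketch (the function names are hpgltools', but the exact arguments are an assumption):

```r
## Assumed invocations; the actual calls were lost from this block.
## graph_metrics() collects the non-zero/CPM, library size, boxplot,
## correlation/distance heatmap, PCA/T-SNE, density, CV, and top-n plots.
mm38_metrics <- graph_metrics(mm38_salmon)
mmtx_metrics <- sm(graph_metrics(mm38_saltx))  ## sm() silences the second run.
```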
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, adding 1 to the data.
## Changed 21398 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 21398 zero count features.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.
## Printing a color to condition legend.
mm38_norm <- normalize_expt(mm38_salmon, norm="quant", convert="cpm",
                            transform="log2", filter=TRUE)
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 2794 low-count genes (3970 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 1105 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
The only interesting DE I see in this is to compare the retinas to the dLGNs. I can treat them as replicates and compare.
mm <- set_expt_conditions(mm38_salmon, fact="celltype")
mm_norm <- normalize_expt(mm, norm="quant",
                          convert="cpm", transform="log2", filter=TRUE)
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 2794 low-count genes (3970 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 1105 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
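The DE invocation that defines mm_de (used by combine_de_tables() below) was not preserved in this block; the repeated normalization messages and the PCA/DE output that follow look like what hpgltools' all_pairwise() emits, since it normalizes internally. A sketch under that assumption, with the arguments guessed from the "no batch in model" message:

```r
## Assumed invocation producing mm_de; the exact arguments were lost.
mm_de <- all_pairwise(mm, model_batch=FALSE)
```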
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 2794 low-count genes (3970 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 1105 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
## Plotting a PCA before surrogates/batch inclusion.
## Not putting labels on the plot.
## Assuming no batch in model for testing pca.
## Not putting labels on the plot.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
Since we don’t have replicates, I am going to do simple subtractions. The most important caveat is that I must subtract each baseline from its het/ko samples before considering the contrasts of interest.
dlgnbaseline <- "iprgc_01"
scnbaseline <- "iprgc_03"
retbaseline <- "iprgc_02"
dlgnko <- "iprgc_06"
dlgnht <- "iprgc_04"
scnht <- NULL ## Does not yet exist.
scnko <- "iprgc_08"
retko <- "iprgc_07"
retht <- "iprgc_05"
hetret <- exprs(mm38_norm)[, retht] - exprs(mm38_norm)[, retbaseline]
koret <- exprs(mm38_norm)[, retko] - exprs(mm38_norm)[, retbaseline]
koscn <- exprs(mm38_norm)[, scnko] - exprs(mm38_norm)[, scnbaseline]
hetdlgn <- exprs(mm38_norm)[, dlgnht] - exprs(mm38_norm)[, dlgnbaseline]
kodlgn <- exprs(mm38_norm)[, dlgnko] - exprs(mm38_norm)[, dlgnbaseline]
Now that we have the 5 available ‘conditions’ (as log2 differences from baseline), perform the subtractions to query the data.
## het vs het
hetret_vs_hetdlgn <- hetret - hetdlgn
## ko vs ko
koret_vs_kodlgn <- koret - kodlgn
koret_vs_koscn <- koret - koscn
kodlgn_vs_koscn <- kodlgn - koscn
## ko vs het
koret_vs_hetret <- koret - hetret
kodlgn_vs_hetdlgn <- kodlgn - hetdlgn
koscn_vs_hetdlgn <- koscn - hetdlgn
## ratio of ratios
rethet_vs_dlgnhet_over_retko_vs_dlgnko <- (hetret - hetdlgn) - (koret - kodlgn)
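Since the expression values are log2-scale, each subtraction above is a log fold change, and the ratio of ratios is just a difference of differences. A quick sanity check in plain R:

```r
## On the log2 scale, a subtraction is a ratio of the raw values:
## log2(a) - log2(b) == log2(a / b).
a <- 8
b <- 2
stopifnot(all.equal(log2(a) - log2(b), log2(a / b)))
## So (hetret - hetdlgn) - (koret - kodlgn) is
## log2((het retina / het dlgn) / (ko retina / ko dlgn)).
```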
pair_mtrx <- cbind(
  ## Baseline subtractions
  hetret, koret, koscn, hetdlgn, kodlgn,
  ## het_vs_het, of which there is only 1 because we do not have hetscn
  hetret_vs_hetdlgn,
  ## ko_vs_ko, of which we have 3
  koret_vs_kodlgn, koret_vs_koscn, kodlgn_vs_koscn,
  ## ko_vs_het, 3 including one getting around the missing hetscn
  koret_vs_hetret, kodlgn_vs_hetdlgn, koscn_vs_hetdlgn,
  ## ratio of ratios
  rethet_vs_dlgnhet_over_retko_vs_dlgnko)
mm_tables <- combine_de_tables(mm_de, extra_annot=pair_mtrx,
                               excel=glue::glue("excel/{rundate}mm_tables.xlsx"))
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on table 1/3: retina_vs_dlgn
## Working on table 2/3: scn_vs_dlgn
## Working on table 3/3: scn_vs_retina
## Adding venn plots for retina_vs_dlgn.
## Limma expression coefficients for retina_vs_dlgn; R^2: 0.934; equation: y = 0.967x + 0.0347
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Deseq expression coefficients for retina_vs_dlgn; R^2: 0.822; equation: y = 0.955x + 0.0956
## Edger expression coefficients for retina_vs_dlgn; R^2: 0.917; equation: y = 0.977x + 0.0265
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Adding venn plots for scn_vs_dlgn.
## Limma expression coefficients for scn_vs_dlgn; R^2: 0.934; equation: y = 0.967x + 0.0347
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Deseq expression coefficients for scn_vs_dlgn; R^2: 0.822; equation: y = 0.955x + 0.0956
## Edger expression coefficients for scn_vs_dlgn; R^2: 0.917; equation: y = 0.977x + 0.0265
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Adding venn plots for scn_vs_retina.
## Limma expression coefficients for scn_vs_retina; R^2: 0.934; equation: y = 0.967x + 0.0347
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Deseq expression coefficients for scn_vs_retina; R^2: 0.822; equation: y = 0.955x + 0.0956
## Edger expression coefficients for scn_vs_retina; R^2: 0.917; equation: y = 0.977x + 0.0265
## Warning: Removed 1 rows containing missing values (geom_hline).
## Warning: Removed 1 rows containing missing values (geom_hline).
## Writing summary information, compare_plot is: TRUE.
## Performing save of excel/20200109mm_tables.xlsx.
As I understand it, there is some interest in an ontology search using the ratio of ratios.
ror <- rethet_vs_dlgnhet_over_retko_vs_dlgnko
## >= 1 on the log2 scale, i.e. at least 2-fold.
up_idx <- ror >= 1
down_idx <- ror <= -1
ror_up <- ror[up_idx]
length(ror_up)
ror_down <- ror[down_idx]
length(ror_down)
ror_gprofiler <- simple_gprofiler(sig_genes=ror_up, species="mmusculus")
ror_gprofiler$pvalue_plots$mfp_plot_over
ror_gprofiler$pvalue_plots$bpp_plot_over
ror_gprofiler$pvalue_plots$ccp_plot_over
ror_gprofiler$pvalue_plots$tf_plot_over
ror_gprofiler$pvalue_plots$hp_plot_over