I still need to do some reading to understand this series of experiments; but I believe the idea was to use a short-hairpin to lower expression of hdac5 in some cases, and an adenovirus overexpression vector in others.
The overarching goals I do not yet understand.
I passed these samples to my default preprocessing pipeline on the umiacs cluster with the following command:
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d *); do
cd $i
cyoa --task pipe --method prnaseq --species rnorvegicus_7.2_107 --gff_type gene \
--stranded reverse --gff_tag ID \
--input $(/bin/ls *.fastq.gz | tr '\n' ':' | sed 's/:$//g')
cd $start
done
The above command iterates over every directory of raw data I created and runs my rnaseq (–method prnaseq) pipeline (–task pipe) using a new copy of the ensembl Rattus norvegicus genome (version 7.2) from ensembl’s release 107 (–species rnorvegicus_7.2_107). When I downloaded it, I saved a gene feature file and the fasta genome with the same prefix and therefore htseq will use the gene annotations (–gff_type gene) by identifier ‘ID’ (–gff_tag ID).
I did not provide any instructions for trimming nor mapping, so it will use my defaults. All of the scripts run on the cluster are therefore located in the preprocessing/*/scripts/ directories in files named lexically in the order they were performed.
I will extract annotations from ensembl, since it is my favorite. My load_biomart_annotation() function attempts to find an archive server from one year ago, this may be made explicit if one wishes.
One important caveat, when creating the expressionset data structure, the gene IDs must match and the gff usually uses IDs which are a specific combination of columns from the ensembl table. It usually takes a moment to figure out how the two data sources line up; as a result sometimes it is easier to just extract annotations directly from the gff file.
load_biomart_annotations(species="rnorvegicus") rn_annot <-
## The biomart annotations file already exists, loading from it.
I often just copy/paste the relevant fields from the output files created by trimomatic/htseq/hisat/etc. However, I did make a fun function: gather_preprocessing_metadata() which in theory at least can fill out a metadata xlsx file.
gather_preprocessing_metadata("sample_sheets/all_samples.xlsx") written <-
I noticed a couple problems with my function:
Let us create an expression set of the metadata and annotations. Note that the
rn_annot[["annotation"]]
annot <-rownames(annot) <- make.names(paste0("gene:", annot[["ensembl_gene_id"]]), unique=TRUE)
rownames(annot) <- gsub(pattern="\\.", replacement=":", x=rownames(annot))
create_expt(metadata="sample_sheets/all_samples_modified.xlsx",
rn_expt <-gene_info=annot, file_column="hisatcounttable")
## Reading the sample metadata.
## The sample definitions comprises: 24 rows(samples) and 27 columns(metadata fields).
## Matched 18641 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 23139 features and 24 samples.
write_expt(rn_expt, excel="excel/rn_expt.xlsx") written <-
## Deleting the file excel/rn_expt.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## `geom_smooth()` using formula 'y ~ x'
## Error in checkModelStatus(fit, showWarnings = showWarnings, colinearityCutoff = colinearityCutoff, :
## The variables specified in this model are redundant,
## so the design matrix is not full rank
## Retrying with only condition in the model.
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
##
## expand
##
## Total:86 s
## `geom_smooth()` using formula 'y ~ x'
## Error in checkModelStatus(fit, showWarnings = showWarnings, colinearityCutoff = colinearityCutoff, :
## The variables specified in this model are redundant,
## so the design matrix is not full rank
## Retrying with only condition in the model.
##
## Total:66 s
Let us look briefly at a couple of likely plots of the data.
plot_libsize(rn_expt)$plot
plot_nonzero(rn_expt)$plot
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
This is worrisome. The trimming kept ~95% of the reads, which is awesome. Mapping appears to have been in the 90%s. I would therefore have assumed the numbers of reads mapped to genes to be higher than 1.1<x<3.7 million.
My first place to look into this is the fastqc output. My initial hypothesis is that the sequence duplication file will show a huge number of reads mapping to rRNA or some other ncRNA. It appears that guess was incorrect, there do not appear to be significantly over represented sequences in this data. Maybe lets look at the reads in IGV?
normalize_expt(rn_expt, filter=TRUE, convert="cpm",
rn_norm <-norm="quant", transform="log2")
## Removing 8998 low-count genes (14141 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
plot_pca(rn_norm)$plot
## Warning: ggrepel: 12 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
normalize_expt(rn_expt, filter=TRUE, convert="cpm",
rn_nbsva <-batch="svaseq", transform="log2")
## Removing 8998 low-count genes (14141 remaining).
## Setting 920 low elements to zero.
## transform_counts: Found 920 values equal to 0, adding 1 to the matrix.
plot_pca(rn_nbsva)$plot
## Warning: ggrepel: 11 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The sample metadata is a little limited vis a vis the ways in which the samples are categorized. We have, in effect, 4 primary factors:
With the above in mind, let us consider some of the other categorizations of the data and see if any make sense. I know already from the first picture that knockdown and overexpression are the two primary sources of variance.
set_expt_conditions(rn_expt, fact="insertion")
kd_over <-## This will, by definition, just be the exact plot as before colored differently.
normalize_expt(kd_over, transform="log2", convert="cpm",
kd_over_norm <-filter=TRUE, norm="quant")
## Removing 8998 low-count genes (14141 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
plot_pca(kd_over_norm)$plot
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
set_expt_conditions(rn_expt, fact="perturbation")
perturb <-## This will, by definition, just be the exact plot as before colored differently.
normalize_expt(perturb, transform="log2", convert="cpm",
perturb_norm <-filter=TRUE, norm="quant")
## Removing 8998 low-count genes (14141 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
plot_pca(perturb_norm)$plot
## Warning: ggrepel: 12 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
normalize_expt(perturb, transform="log2", convert="cpm",
perturb_nb <-filter=TRUE, batch="svaseq")
## Removing 8998 low-count genes (14141 remaining).
## Setting 662 low elements to zero.
## transform_counts: Found 662 values equal to 0, adding 1 to the matrix.
plot_pca(perturb_nb)$plot
set_expt_conditions(rn_expt, fact="target")
target <-## This will, by definition, just be the exact plot as before colored differently.
normalize_expt(target, transform="log2", convert="cpm",
target_norm <-filter=TRUE, norm="quant")
## Removing 8998 low-count genes (14141 remaining).
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
plot_pca(target_norm)$plot
normalize_expt(target, transform="log2", convert="cpm",
target_nb <-filter=TRUE, batch="svaseq")
## Removing 8998 low-count genes (14141 remaining).
## Setting 560 low elements to zero.
## transform_counts: Found 560 values equal to 0, adding 1 to the matrix.
plot_pca(target_nb)$plot
subset_expt(rn_expt, subset="insertion=='short_hairpin'") kd <-
## subset_expt(): There were 24, now there are 10 samples.
normalize_expt(kd, transform="log2", convert="cpm", norm="quant", filter=TRUE) kd_norm <-
## Removing 9136 low-count genes (14003 remaining).
## transform_counts: Found 1 values equal to 0, adding 1 to the matrix.
plot_pca(kd_norm)$plot
normalize_expt(kd, transform="log2", convert="cpm", batch="svaseq", filter=TRUE) kd_nb <-
## Removing 9136 low-count genes (14003 remaining).
## Setting 220 low elements to zero.
## transform_counts: Found 220 values equal to 0, adding 1 to the matrix.
plot_pca(kd_nb)$plot
subset_expt(rn_expt, subset="insertion=='adeno_assoc'") oe <-
## subset_expt(): There were 24, now there are 14 samples.
normalize_expt(oe, transform="log2", convert="cpm", norm="quant", filter=TRUE) oe_norm <-
## Removing 9977 low-count genes (13162 remaining).
## transform_counts: Found 4 values equal to 0, adding 1 to the matrix.
plot_pca(oe_norm)$plot
normalize_expt(oe, transform="log2", convert="cpm", batch="svaseq", filter=TRUE) oe_nb <-
## Removing 9977 low-count genes (13162 remaining).
## Setting 75 low elements to zero.
## transform_counts: Found 75 values equal to 0, adding 1 to the matrix.
plot_pca(oe_nb)$plot
all_pairwise(kd, model_batch="svaseq", filter=TRUE) kd_de <-
## This DE analysis will perform all pairwise comparisons among:
##
## control_kd hdac5_kd
## 5 5
## This analysis will include surrogate estimates from: svaseq.
## This will pre-filter the input data using normalize_expt's: TRUE argument.
## Removing 0 low-count genes (14003 remaining).
## Setting 220 low elements to zero.
## transform_counts: Found 220 values equal to 0, adding 1 to the matrix.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
list("down_control" = c("hdac5kd", "controlkd"))
kd_keeper <- combine_de_tables(kd_de, keepers=kd_keeper, excel="excel/kd_table.xlsx") kd_table <-
## Deleting the file excel/kd_table.xlsx before writing the tables.
extract_significant_genes(kd_table, according_to="deseq", p=0.1,
kd_sig <-excel="excel/kd_sig.xlsx")
## Deleting the file excel/kd_sig.xlsx before writing the tables.
## Using p column: deseq_adjp.
all_pairwise(oe, model_batch="svaseq", filter=TRUE) oe_de <-
## This DE analysis will perform all pairwise comparisons among:
##
## control_over hdac5_over
## 7 7
## This analysis will include surrogate estimates from: svaseq.
## This will pre-filter the input data using normalize_expt's: TRUE argument.
## Removing 0 low-count genes (13162 remaining).
## Setting 75 low elements to zero.
## transform_counts: Found 75 values equal to 0, adding 1 to the matrix.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
list("up_control" = c("hdac5over", "controlover"))
oe_keeper <- combine_de_tables(oe_de, keepers=oe_keeper, excel="excel/oe_table.xlsx") oe_table <-
## Deleting the file excel/oe_table.xlsx before writing the tables.
extract_significant_genes(oe_table, according_to="deseq", p=0.1,
oe_sig <-excel="excel/oe_sig.xlsx")
## Deleting the file excel/oe_sig.xlsx before writing the tables.
## Using p column: deseq_adjp.
kd_sig[["deseq"]][["ups"]][["down_control"]]
kd_up <-nrow(kd_up)
## [1] 74
kd_sig[["deseq"]][["downs"]][["down_control"]]
kd_down <-nrow(kd_down)
## [1] 74
oe_sig[["deseq"]][["ups"]][["up_control"]]
oe_up <-nrow(oe_up)
## [1] 84
oe_sig[["deseq"]][["downs"]][["up_control"]]
oe_down <-nrow(oe_down)
## [1] 5
list(
up_lst <-"kd_up" = rownames(kd_up),
"oe_down" = rownames(oe_down))
::Venn(up_lst) Vennerable
## A Venn object on 2 sets named
## kd_up,oe_down
## 00 10 01 11
## 0 73 4 1
list(
down_lst <-"kd_down" = rownames(kd_down),
"oe_up" = rownames(oe_up))
Vennerable::Venn(down_lst)
first <-
list(
up_up <-"kd_up" = rownames(kd_up),
"oe_up" = rownames(oe_up))
::Venn(up_up) Vennerable
## A Venn object on 2 sets named
## kd_up,oe_up
## 00 10 01 11
## 0 72 82 2
list(
down_down <-"kd_down" = rownames(kd_down),
"oe_down" = rownames(oe_down))
::Venn(down_down) Vennerable
## A Venn object on 2 sets named
## kd_down,oe_down
## 00 10 01 11
## 0 74 5 0
::pander(sessionInfo()) pander
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: splines, stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: ruv(v.0.9.7.1), lme4(v.1.1-30), Matrix(v.1.4-1), BiocParallel(v.1.30.3), variancePartition(v.1.26.0), hpgltools(v.1.0), testthat(v.3.1.4), reticulate(v.1.25), SummarizedExperiment(v.1.26.1), GenomicRanges(v.1.48.0), GenomeInfoDb(v.1.32.2), IRanges(v.2.30.0), S4Vectors(v.0.34.0), MatrixGenerics(v.1.8.1), matrixStats(v.0.62.0), Biobase(v.2.56.0) and BiocGenerics(v.0.42.0)
loaded via a namespace (and not attached): rappdirs(v.0.3.3), rtracklayer(v.1.56.1), R.methodsS3(v.1.8.2), tidyr(v.1.2.0), ggplot2(v.3.3.6), bit64(v.4.0.5), knitr(v.1.39), R.utils(v.2.12.0), DelayedArray(v.0.22.0), data.table(v.1.14.2), KEGGREST(v.1.36.3), RCurl(v.1.98-1.7), doParallel(v.1.0.17), generics(v.0.1.3), GenomicFeatures(v.1.48.3), preprocessCore(v.1.58.0), callr(v.3.7.1), RhpcBLASctl(v.0.21-247.1), usethis(v.2.1.6), RSQLite(v.2.2.15), shadowtext(v.0.1.2), bit(v.4.0.4), enrichplot(v.1.16.1), xml2(v.1.3.3), httpuv(v.1.6.5), assertthat(v.0.2.1), viridis(v.0.6.2), xfun(v.0.31), hms(v.1.1.1), jquerylib(v.0.1.4), IHW(v.1.24.0), evaluate(v.0.15), promises(v.1.2.0.1), DEoptimR(v.1.0-11), fansi(v.1.0.3), restfulr(v.0.0.15), progress(v.1.2.2), caTools(v.1.18.2), dbplyr(v.2.2.1), geneplotter(v.1.74.0), igraph(v.1.3.4), DBI(v.1.1.3), htmlwidgets(v.1.5.4), purrr(v.0.3.4), ellipsis(v.0.3.2), dplyr(v.1.0.9), backports(v.1.4.1), annotate(v.1.74.0), aod(v.1.3.2), biomaRt(v.2.52.0), vctrs(v.0.4.1), remotes(v.2.4.2), cachem(v.1.0.6), withr(v.2.5.0), ggforce(v.0.3.3), robustbase(v.0.95-0), GenomicAlignments(v.1.32.1), treeio(v.1.20.1), fdrtool(v.1.2.17), prettyunits(v.1.1.1), DOSE(v.3.22.0), ape(v.5.6-2), lazyeval(v.0.2.2), crayon(v.1.5.1), genefilter(v.1.78.0), slam(v.0.1-50), edgeR(v.3.38.1), pkgconfig(v.2.0.3), labeling(v.0.4.2), tweenr(v.1.0.2), nlme(v.3.1-158), pkgload(v.1.3.0), devtools(v.2.4.4), rlang(v.1.0.4), lifecycle(v.1.0.1), miniUI(v.0.1.1.1), downloader(v.0.4), filelock(v.1.0.2), BiocFileCache(v.2.4.0), directlabels(v.2021.1.13), rprojroot(v.2.0.3), polyclip(v.1.10-0), graph(v.1.74.0), aplot(v.0.1.6), lpsymphony(v.1.24.0), boot(v.1.3-28), processx(v.3.7.0), png(v.0.1-7), viridisLite(v.0.4.0), rjson(v.0.2.21), bitops(v.1.0-7), R.oo(v.1.25.0), KernSmooth(v.2.23-20), pander(v.0.6.5), Biostrings(v.2.64.0), blob(v.1.2.3), stringr(v.1.4.0), qvalue(v.2.28.0), gridGraphics(v.0.5-1), scales(v.1.2.0), memoise(v.2.0.1), magrittr(v.2.0.3), plyr(v.1.8.7), gplots(v.3.1.3), zlibbioc(v.1.42.0), compiler(v.4.2.1), scatterpie(v.0.1.7), BiocIO(v.1.6.0), RColorBrewer(v.1.1-3), DESeq2(v.1.36.0), Rsamtools(v.2.12.0), cli(v.3.3.0), XVector(v.0.36.0), urlchecker(v.1.0.1), patchwork(v.1.1.1), ps(v.1.7.1), MASS(v.7.3-58), mgcv(v.1.8-40), tidyselect(v.1.1.2), stringi(v.1.7.8), highr(v.0.9), yaml(v.2.3.5), GOSemSim(v.2.22.0), locfit(v.1.5-9.6), ggrepel(v.0.9.1), grid(v.4.2.1), sass(v.0.4.2), fastmatch(v.1.1-3), tools(v.4.2.1), parallel(v.4.2.1), rstudioapi(v.0.13), foreach(v.1.5.2), gridExtra(v.2.3), farver(v.2.1.1), Rtsne(v.0.16), ggraph(v.2.0.5), digest(v.0.6.29), shiny(v.1.7.2), quadprog(v.1.5-8), Rcpp(v.1.0.9), broom(v.1.0.0), later(v.1.3.0), httr(v.1.4.3), AnnotationDbi(v.1.58.0), Rdpack(v.2.4), colorspace(v.2.0-3), brio(v.1.1.3), XML(v.3.99-0.10), fs(v.1.5.2), RBGL(v.1.72.0), yulab.utils(v.0.0.5), PROPER(v.1.28.0), tidytree(v.0.3.9), graphlayouts(v.0.8.0), ggplotify(v.0.1.0), plotly(v.4.10.0), sessioninfo(v.1.2.2), xtable(v.1.8-4), jsonlite(v.1.8.0), nloptr(v.2.0.3), ggtree(v.3.4.1), tidygraph(v.1.2.1), corpcor(v.1.6.10), ggfun(v.0.0.6), Vennerable(v.3.1.0.9000), R6(v.2.5.1), profvis(v.0.3.7), pillar(v.1.8.0), htmltools(v.0.5.3), mime(v.0.12), glue(v.1.6.2), fastmap(v.1.1.0), minqa(v.1.2.4), clusterProfiler(v.4.4.4), codetools(v.0.2-18), fgsea(v.1.22.0), pkgbuild(v.1.3.1), utf8(v.1.2.2), lattice(v.0.20-45), bslib(v.0.4.0), tibble(v.3.1.8), sva(v.3.44.0), pbkrtest(v.0.5.1), curl(v.4.3.2), gtools(v.3.9.3), zip(v.2.2.0), GO.db(v.3.15.0), openxlsx(v.4.2.5), survival(v.3.3-1), limma(v.3.52.2), rmarkdown(v.2.14), desc(v.1.4.1), munsell(v.0.5.0), DO.db(v.2.9), fastcluster(v.1.2.3), GenomeInfoDbData(v.1.2.8), iterators(v.1.0.14), reshape2(v.1.4.4), gtable(v.0.3.0) and rbibutils(v.2.2.8)
message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 605cc89b5f1cadea6923b53ac71e234ba0181fe7
## This is hpgltools commit: Wed Aug 10 22:39:40 2022 -0400: 605cc89b5f1cadea6923b53ac71e234ba0181fe7
paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
this_save <-message(paste0("Saving to ", this_save))
## Saving to index-v202208.rda.xz
sm(saveme(filename=this_save)) tmp <-