This is a hopefully straight forward comparison of 5 mouse DNA samples vs. the mm38 reference in order to find if/what changes have occurred in the colony and show that they are minimal and unrelated to the various ribosomal tagging processes.
FIXME: The previous paragraph is a woefully incomplete and bad representation of the goals of the project, I am assuming that when we meet I can acquire from Colenso’s team a more complete understanding of the goals.
Information on the mouse: https://jax.org/strain/029977
What are the most likely useful metrics? I am guessing some plots from IGV showing variants surrounding/in RPL22 and melanopsin. Other genes/loci of note?
So I can use my variant finder to compare the 5 samples with a specific goal of looking at how/if het_722C_old has variants which are relevant. The assumption is that the other samples will be otherwise relatively wt with the expected exceptions.
Images/worksheets which are useful?
The initial preprocessing of the samples should be relatively consistent.
It is worth noting that I have a pre-defined pipeline of tasks that I usually perform for DNA sequencing data, but am choosing not to use it because I do not know what all I want to play with in this dataset.
The following is what I ran on our cluster. It in turn created a series of trimomatic scripts which are responsible for trimming the data in preparation for mapping.
Another note: the sequencer already does an appropriate trimming process. This step is likely redundant and probably can be skipped entirely, however I like to run it still because of the quality check it adds; and upon completion it aggressively recompresses the data.
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v sequencer); do
cd $i
echo "Starting $i"
cyoa --method trim --input $(/bin/ls | tr '\n' ':' | sed 's/:$//g')
cd $start
done
The following is the actual trimming script for one sample. (It is actually the KO sample because my fingers hit ‘k’ when I went looking for an example).
#!/usr/bin/bash
#SBATCH --export=ALL --requeue --mail-type=NONE --open-mode=append
#SBATCH --chdir=/home/trey/sshfs/scratch/atb/dnaseq/mmusculus_iprgc_2023/preprocessing/KO_810B
#SBATCH --job-name=01trim_KO_810B_S3_R1_001 --nice=0
#SBATCH --output=outputs/log.txt.sbatch
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=4
#SBATCH --time=36:00:00
#SBATCH --mem=24G
set -o errexit
set -o errtrace
set -o pipefail
export LESS='--buffers 0 -B'
echo "## Started /home/trey/sshfs/scratch/atb/dnaseq/mmusculus_iprgc_2023/preprocessing/KO_810B/scripts/01trim_KO_810B_S3_R1_001.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt
mod=$( { type -t module || true; } )
if [[ -z "${mod}" ]]; then
module() {
# shellcheck disable=SC2086
eval "$(/usr/bin/modulecmd bash $*)"
}
export -f module
fi
module purge
module add trimomatic 2>/dev/null 1>&2
## This call to trimomatic removes illumina and epicentre adapters from KO_810B_S3_R1_001.fastq.gz:KO_810B_S3_R2_001.fastq.gz.
## It also performs a sliding window removal of anything with quality <25;
## cutadapt provides an alternative to this tool.
## The original sequence data is recompressed and saved in the sequences/ directory.
mkdir -p outputs/01trimomatic
## Note that trimomatic prints all output and errors to STDERR, so send both to output
TrimmomaticPE \
-threads 1 \
-phred33 \
\
KO_810B_S3_R1_001.fastq.gz KO_810B_S3_R2_001.fastq.gz \
KO_810B_S3_R1_001-trimmed_paired.fastq KO_810B_S3_R1_001-trimmed_unpaired.fastq \
KO_810B_S3_R2_001-trimmed_paired.fastq KO_810B_S3_R2_001-trimmed_unpaired.fastq \
ILLUMINACLIP:/sw/local/cyoa/202302/prefix/lib/perl5/auto/share/dist/Bio-Adventure/genome/adapters.fa:2:20:10:2:keepBothReads \
SLIDINGWINDOW:4:20 MINLEN:50 1>outputs/01trimomatic/KO_810B_S3_R1_001-trimomatic.stdout \
2>outputs/01trimomatic/KO_810B_S3_R1_001-trimomatic.stderr
excepted=$( { grep "Exception" "outputs/01trimomatic/KO_810B_S3_R1_001-trimomatic.stdout" || test $? = 1; } )
## The following is in case the illumina clipping fails, which it does if this has already been run I think.
if [[ "${excepted}" != "" ]]; then
TrimmomaticPE \
-threads 1 \
-phred33 \
\
KO_810B_S3_R1_001.fastq.gz KO_810B_S3_R2_001.fastq.gz \
KO_810B_S3_R1_001-trimmed_paired.fastq KO_810B_S3_R1_001-trimmed_unpaired.fastq \
KO_810B_S3_R2_001-trimmed_paired.fastq KO_810B_S3_R2_001-trimmed_unpaired.fastq \
SLIDINGWINDOW:4:25 MINLEN:50 1>outputs/01trimomatic/KO_810B_S3_R1_001-trimomatic.stdout \
2>outputs/01trimomatic/KO_810B_S3_R1_001-trimomatic.stderr
fi
sleep 10
mv KO_810B_S3_R1_001-trimmed_paired.fastq KO_810B_S3_R1_001-trimmed.fastq
mv KO_810B_S3_R2_001-trimmed_paired.fastq KO_810B_S3_R2_001-trimmed.fastq
## Recompress the unpaired reads, this should not take long.
xz -9e -f KO_810B_S3_R1_001-trimmed_unpaired.fastq
xz -9e -f KO_810B_S3_R2_001-trimmed_unpaired.fastq
## Recompress the paired reads.
xz -9e -f KO_810B_S3_R1_001-trimmed.fastq
xz -9e -f KO_810B_S3_R2_001-trimmed.fastq
ln -sf KO_810B_S3_R1_001-trimmed.fastq.xz r1_trimmed.fastq.xz
ln -sf KO_810B_S3_R2_001-trimmed.fastq.xz r2_trimmed.fastq.xz
## The following lines give status codes and some logging
minutes_used=$(( SECONDS / 60 ))
echo " $(hostname) Finished ${SLURM_JOBID} 01trim_KO_810B_S3_R1_001.sh at $(date), it took ${minutes_used} minutes." >> outputs/log.txt
if [[ -x "$(command -v sstat)" && -n "${SLURM_JOBID}" ]]; then
echo " walltime used by ${SLURM_JOBID} was: ${minutes_used:-null} minutes." >> outputs/log.txt
maxmem=$(sstat -n -P --format=MaxVMSize -j "${SLURM_JOBID}")
## I am not sure why, but when I run a script in an interactive session, the maxmem variable
## gets set correctly everytime, but when it is run by another node, sometimes it does not.
## Lets try and figure that out...
echo "TESTME: ${maxmem}"
if [[ -n "${maxmem}" ]]; then
echo " maximum memory used by ${SLURM_JOBID} was: ${maxmem}." >> outputs/log.txt
else
echo " The maximum memory did not get set for this job: ${SLURM_JOBID}." >> outputs/log.txt
sstat -P -j "${SLURM_JOBID}" >> outputs/log.txt
fi
echo "" >> outputs/log.txt
fi
touch outputs/logs/01trim_KO_810B_S3_R1_001.finished
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v sequencer); do
cd $i
echo "Starting $i"
cyoa --method hisat --species mm38_100 --input r1_trimmed.fastq.xz:r2_trimmed.fastq.xz \
--gff_type gene --gff_tag gene_id --stranded reverse
cd $start
done
Interestingly, I went over my memory allocation for these mapping jobs for 2 of them. As a result I had to rerun them manually.
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v sequencer); do
cd $i
echo "Starting $i"
cyoa --method freebayes --species mm38_100 --input outputs/40hisat_mm38_100/mm38_100_genome.bam \
--gff_type gene --gff_tag gene_id --stranded reverse --intron 1
cd $start
done
::pander(sessionInfo()) pander
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: hpgltools(v.1.0), testthat(v.3.1.8), reticulate(v.1.28), glue(v.1.6.2), SummarizedExperiment(v.1.28.0), GenomicRanges(v.1.50.2), GenomeInfoDb(v.1.34.9), IRanges(v.2.32.0), S4Vectors(v.0.36.2), MatrixGenerics(v.1.10.0), matrixStats(v.0.63.0), Biobase(v.2.58.0) and BiocGenerics(v.0.44.0)
loaded via a namespace (and not attached): utf8(v.1.2.3), RUnit(v.0.4.32), tidyselect(v.1.2.0), lme4(v.1.1-33), RSQLite(v.2.3.1), AnnotationDbi(v.1.60.2), htmlwidgets(v.1.6.2), grid(v.4.2.0), BiocParallel(v.1.32.6), scatterpie(v.0.1.9), devtools(v.2.4.5), munsell(v.0.5.0), codetools(v.0.2-19), miniUI(v.0.1.1.1), withr(v.2.5.0), colorspace(v.2.1-0), GOSemSim(v.2.24.0), filelock(v.1.0.2), knitr(v.1.44), rstudioapi(v.0.14), DOSE(v.3.24.2), Rdpack(v.2.4), GenomeInfoDbData(v.1.2.9), polyclip(v.1.10-4), farver(v.2.1.1), bit64(v.4.0.5), downloader(v.0.4), rprojroot(v.2.0.3), treeio(v.1.22.0), vctrs(v.0.6.3), generics(v.0.1.3), gson(v.0.1.0), clusterGeneration(v.1.3.7), xfun(v.0.39), BiocFileCache(v.2.6.1), R6(v.2.5.1), doParallel(v.1.0.17), graphlayouts(v.1.0.0), locfit(v.1.5-9.7), gridGraphics(v.0.5-1), bitops(v.1.0-7), cachem(v.1.0.8), fgsea(v.1.24.0), DelayedArray(v.0.24.0), promises(v.1.2.0.1), BiocIO(v.1.8.0), scales(v.1.2.1), ggraph(v.2.1.0), enrichplot(v.1.18.4), gtable(v.0.3.3), sva(v.3.46.0), processx(v.3.8.1), tidygraph(v.1.2.3), rlang(v.1.1.1), genefilter(v.1.80.3), splines(v.4.2.0), rtracklayer(v.1.58.0), lazyeval(v.0.2.2), broom(v.1.0.4), yaml(v.2.3.7), reshape2(v.1.4.4), GenomicFeatures(v.1.50.4), backports(v.1.4.1), httpuv(v.1.6.10), qvalue(v.2.30.0), clusterProfiler(v.4.6.2), tools(v.4.2.0), usethis(v.2.1.6), ggplotify(v.0.1.0), ggplot2(v.3.4.2), ellipsis(v.0.3.2), gplots(v.3.1.3), RColorBrewer(v.1.1-3), jquerylib(v.0.1.4), sessioninfo(v.1.2.2), Rcpp(v.1.0.10), plyr(v.1.8.8), progress(v.1.2.2), zlibbioc(v.1.44.0), purrr(v.1.0.1), RCurl(v.1.98-1.12), ps(v.1.7.5), prettyunits(v.1.1.1), remaCor(v.0.0.11), viridis(v.0.6.3), cowplot(v.1.1.1), urlchecker(v.1.0.1), ggrepel(v.0.9.3), fs(v.1.6.2), variancePartition(v.1.28.9), magrittr(v.2.0.3), data.table(v.1.14.8), mvtnorm(v.1.1-3), pkgload(v.1.3.2), patchwork(v.1.1.2), hms(v.1.1.3), mime(v.0.12), evaluate(v.0.21), xtable(v.1.8-4), HDO.db(v.0.99.1), pbkrtest(v.0.5.2), RhpcBLASctl(v.0.23-42), XML(v.3.99-0.14), gridExtra(v.2.3), compiler(v.4.2.0), biomaRt(v.2.54.1), tibble(v.3.2.1), shadowtext(v.0.1.2), KernSmooth(v.2.23-21), crayon(v.1.5.2), minqa(v.1.2.5), htmltools(v.0.5.5), ggfun(v.0.0.9), mgcv(v.1.8-42), later(v.1.3.1), aplot(v.0.1.10), tidyr(v.1.3.0), DBI(v.1.1.3), tweenr(v.2.0.2), dbplyr(v.2.3.2), MASS(v.7.3-60), rappdirs(v.0.3.3), boot(v.1.3-28.1), Matrix(v.1.5-4), brio(v.1.1.3), cli(v.3.6.1), rbibutils(v.2.2.13), igraph(v.1.4.2), parallel(v.4.2.0), forcats(v.1.0.0), pkgconfig(v.2.0.3), GenomicAlignments(v.1.34.1), plotly(v.4.10.1), xml2(v.1.3.4), foreach(v.1.5.2), ggtree(v.3.6.2), annotate(v.1.76.0), bslib(v.0.4.2), XVector(v.0.38.0), yulab.utils(v.0.0.6), stringr(v.1.5.0), callr(v.3.7.3), digest(v.0.6.31), graph(v.1.76.0), Biostrings(v.2.66.0), rmarkdown(v.2.21), fastmatch(v.1.1-3), tidytree(v.0.4.2), edgeR(v.3.40.2), PROPER(v.1.30.0), GSEABase(v.1.60.0), restfulr(v.0.0.15), curl(v.5.0.1), shiny(v.1.7.4), Rsamtools(v.2.14.0), gtools(v.3.9.4), rjson(v.0.2.21), nloptr(v.2.0.3), lifecycle(v.1.0.3), nlme(v.3.1-162), jsonlite(v.1.8.7), aod(v.1.3.2), desc(v.1.4.2), viridisLite(v.0.4.2), limma(v.3.54.2), fansi(v.1.0.4), pillar(v.1.9.0), lattice(v.0.21-8), KEGGREST(v.1.38.0), fastmap(v.1.1.1), httr(v.1.4.6), pkgbuild(v.1.4.0), survival(v.3.5-5), GO.db(v.3.16.0), remotes(v.2.4.2), png(v.0.1-8), iterators(v.1.0.14), pander(v.0.6.5), bit(v.4.0.5), ggforce(v.0.4.1), stringi(v.1.7.12), sass(v.0.4.6), profvis(v.0.3.8), blob(v.1.2.4), caTools(v.1.18.2), memoise(v.2.0.1), dplyr(v.1.1.2) and ape(v.5.7-1)
message("This is hpgltools commit: ", get_git_commit())
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 169bc6dbf7ff8f90c297d2da5082d1bbc132566a
## This is hpgltools commit: Fri Sep 29 13:10:10 2023 -0400: 169bc6dbf7ff8f90c297d2da5082d1bbc132566a
<- paste0(gsub(pattern = "\\.Rmd", replace = "", x = rmd_file), "-v", ver, ".rda.xz")
this_save #message("Saving to ", this_save)
#tmp <- sm(saveme(filename = this_save))