This dataset has gone through a couple of sequencing iterations, the second of which appears to have been processed in three different ways. By that I mean the following: when I downloaded the data, there were three directories for each sample, one at the top level, one named ‘UGMC’, and one named ‘index 1 only’. There was also a note with some text describing these three trees.
In my emails with Luke, I got the strong sense that the most likely way to handle these is to concatenate them into a single set.
The following block does the concatenation and then runs my trimming scripts; the consolidation step, which uses the logic from my consolidate_reads.pl script to search each read pair for the leading TA, comes a little later.
Note: every time I insert an example script produced by cyoa, I am taking it from l1_input’s scripts/ directory. I am reasonably certain the scripts for the other samples are functionally identical, since I invoke them all via shell for loops. (That was actually not true for a couple of steps, because I have had some weird and annoying problems getting some things to work in this data, most notably the first steps: trimming, consolidation, and getting tpp to finish running.)
start="$(pwd)/preprocessing/reseq"
dirs="l1_input l2_input l3_input l1_blood l2_blood l3_blood l1_brain l2_brain l3_brain"
cd ${start}
for dir in $dirs; do
echo $dir
## cd into each sample directory so the globs below pick up that sample's files.
cd "${start}/${dir}"
mkdir -p "${start}/${dir}/concat"
cat L1*_R1_*.fastq.gz > "${start}/${dir}/concat/r1.fastq.gz"
cat L1*_R2_*.fastq.gz > "${start}/${dir}/concat/r2.fastq.gz"
cat index1/L1*_R1*.fastq.gz >> "${start}/${dir}/concat/r1.fastq.gz"
cat index1/L1*_R2*.fastq.gz >> "${start}/${dir}/concat/r2.fastq.gz"
cat ugmc/L1*_R1*.fastq.gz >> "${start}/${dir}/concat/r1.fastq.gz"
cat ugmc/L1*_R2*.fastq.gz >> "${start}/${dir}/concat/r2.fastq.gz"
cd "${start}/${dir}/concat"
../../../trimmer_r1.sh r1.fastq.gz
../../../trimmer_r2.sh r2.fastq.gz
done
The trimming sequences I am using were chosen by scanning the two read sets for regions where the sequence was relatively invariant. I explicitly recall that these sequences do not exactly match what I expected of the mariner boundary and PCR primer, but they were quite close.
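In case I forget how I did that scan: it amounts to tallying the most common fixed-length read prefixes and eyeballing the winner. A minimal sketch of the idea (the filename, read count, and k-mer length here are placeholders, not what I actually ran):

#!/usr/bin/env python
## Sketch: tally the most common 21-mer read prefixes from the first
## ~100k reads; an (almost) invariant prefix shows up as one dominant k-mer.
import collections
import gzip
import itertools

counts = collections.Counter()
with gzip.open("r1.fastq.gz", "rt") as handle:
    ## In a fastq, every 4th line starting from the 2nd is a sequence.
    for seq in itertools.islice(handle, 1, 400000, 4):
        counts[seq[0:21]] += 1
for kmer, num in counts.most_common(5):
    print(f"{kmer}\t{num}")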
The trimmer_r1.sh and trimmer_r2.sh scripts look like this:
#!/usr/bin/env bash
function module() {
eval $(/usr/bin/modulecmd bash $*)
}
module add cutadapt
## trimmer_r1.sh performs:
cutadapt \
--trim-n \
-g GGGACTTATCATCCAACCTGT \
-o r1_ca_v3.fastq.gz \
--untrimmed-output untrimmed_r1_v3.fastq.gz \
${1} \
2>cutadapt_v3_r1.err 1>cutadapt_v3_r1.out
## trimmer_r2.sh performs:
cutadapt \
--trim-n \
-b ACAGGTTGGATGATAAGTCCC \
-o r2_ca_v3.fastq.gz \
--untrimmed-output untrimmed_r2_v3.fastq.gz \
${1} \
2>cutadapt_v3_r2.err 1>cutadapt_v3_r2.out
Upon completion of this step, I expect to find the leading TA at the beginning of r1 or the end of r2. I initially expected it always to be at the beginning of r1, but when I actually looked at the data I found a significant minority of reads which are reversed but otherwise seem fine. As a result, I wrote a little script which scans each read pair and checks both reads for the expected TA. If it finds the TA on the r2 sequence, it reverse complements that sequence and writes it to consolidated.fastq; if it finds it on r1, it writes it as-is; if the TA is found on both, only the read from r2 is kept.
I added the consolidate functionality to my cyoa toy so that I have better control over the script outputs and can more easily run it on the cluster. It defaults to r1 = ‘r1_ca_v3.fastq.gz’ and r2 = ‘r2_ca_v3.fastq.gz’, along with some defaults for read length and the like.
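For posterity, the per-pair logic is approximately the following sketch; the real implementation lives in cyoa (and consolidate_reads.pl before it), so treat the helper names and the drop-on-no-TA behavior here as my illustration rather than cyoa's exact code:

#!/usr/bin/env python
## Approximation of the consolidation logic described above.
## Pairs with no TA on either mate are simply dropped in this sketch.
import gzip

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]

def fastq_records(handle):
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  ## skip the '+' line
        qual = handle.readline().rstrip()
        yield header, seq, qual

with gzip.open("r1_ca_v3.fastq.gz", "rt") as r1_in, \
     gzip.open("r2_ca_v3.fastq.gz", "rt") as r2_in, \
     open("consolidated.fastq", "w") as out:
    for (h1, s1, q1), (h2, s2, q2) in zip(fastq_records(r1_in),
                                          fastq_records(r2_in)):
        if s2.endswith("TA"):
            ## TA at the end of r2: reverse complement it so the TA leads,
            ## and prefer r2 even when r1 also carries the TA.
            out.write(f"{h2}\n{revcomp(s2)}\n+\n{q2[::-1]}\n")
        elif s1.startswith("TA"):
            out.write(f"{h1}\n{s1}\n+\n{q1}\n")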
start="$(pwd)/preprocessing/reseq"
dirs="l1_input l2_input l3_input l1_blood l2_blood l3_blood l1_brain l2_brain l3_brain"
cd ${start}
for dir in $dirs; do
echo $dir
cd "${start}/${dir}/concat"
cyoa --method consolidate
cd ${start}
done
I want to apply my favorite methods of preprocessing the data along with the TPP method provided by transit. Let us do the latter first. TPP has also been added to cyoa, but with some logic to follow up after tpp runs; notably, I convert the tpp sam to a bam and count it against the genome’s gff file.
start="$(pwd)/preprocessing/reseq"
dirs="l1_input l2_input l3_input l1_blood l2_blood l3_blood l1_brain l2_brain l3_brain"
cd ${start}
for dir in $dirs; do
echo $dir
cd "${start}/${dir}/concat"
cyoa --method tpp --species sagalactiae_cjb111 --input consolidated.fastq \
--gff_type CDS --gff_tag locus_tag
cd ${start}
done
The processing I perform using TPP happens in 3 steps:
In the first step, I invoke tpp. There is one thing I do which is definitely silly: the less consolidated.fastq > outputs… line is there because sometimes I have already xz compressed the data, and TPP hates compressed data (less transparently decompresses it).
The other big caveat is that I cannot use the --adapter argument, because I already removed the adapter in the trimmer above. But every read in consolidated.fastq starts with the TA, so I don’t feel too bad about this.
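(That claim is easy to spot check before handing the file to tpp; a few throwaway lines like these count any reads which do not begin with TA:)

#!/usr/bin/env python
## Quick sanity check: how many consolidated reads do not begin with TA?
import itertools

total = bad = 0
with open("consolidated.fastq") as handle:
    for seq in itertools.islice(handle, 1, None, 4):
        total += 1
        if not seq.startswith("TA"):
            bad += 1
print(f"{bad} of {total} reads do not start with TA.")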
#!/usr/bin/bash
#SBATCH --export=ALL --requeue --mail-type=NONE --open-mode=append
#SBATCH --chdir=/home/trey/sshfs/scratch/atb/tnseq/sagalacticae_2022/preprocessing/reseq/l1_input/concat
#SBATCH --job-name=61tpp_sagalactiae_cjb111 --nice=0
#SBATCH --output=outputs/log.txt.sbatch
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=1
#SBATCH --time=8:00:00
#SBATCH --mem=20
module add bwa transit htseq
set -o errexit
set -o errtrace
set -o pipefail
export LESS='--buffers 0 -B'
echo "## Started /home/trey/sshfs/scratch/atb/tnseq/sagalacticae_2022/preprocessing/reseq/l1_input/concat/scripts/61tpp_sagalactiae_cjb111.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt
## This is a transit preprocessing alignment of consolidated.fastq against
## ${HOME}/libraries/genome/sagalactiae_cjb111.fasta using arguments: .
mkdir -p outputs/61transit_sagalactiae_cjb111
rm -f tpp.cfg
sleep 3
less consolidated.fastq > outputs/61transit_sagalactiae_cjb111/r1.fastq
tpp -ref ${HOME}/libraries/genome/sagalactiae_cjb111.fasta \
-protocol Sassetti \
-primer '' \
-bwa $(which bwa) \
-reads1 outputs/61transit_sagalactiae_cjb111/r1.fastq \
-output outputs/61transit_sagalactiae_cjb111/concat \
2>outputs/61transit_sagalactiae_cjb111/concat.stderr 1>outputs/61transit_sagalactiae_cjb111/concat.stdout
rm -f outputs/61transit_sagalactiae_cjb111/r1.fastq outputs/61transit_sagalactiae_cjb111/r2.fastq
## The following lines give status codes and some logging
echo "Job status: $? " >> outputs/log.txt
minutes_used=$(( SECONDS / 60 ))
echo " $(hostname) Finished ${SLURM_JOBID} 61tpp_sagalactiae_cjb111.sh at $(date), it took ${minutes_used} minutes." >> outputs/log.txt
if [[ -x "$(command -v sstat)" && ! -z "${SLURM_JOBID}" ]]; then
echo " walltime used by ${SLURM_JOBID} was: ${minutes_used:-null} minutes." >> outputs/log.txt
maxmem=$(sstat -n -P --format=MaxVMSize -j "${SLURM_JOBID}")
echo " maximum memory used by ${SLURM_JOBID} was: ${maxmem:-null}." >> outputs/log.txt
echo "" >> outputs/log.txt
fi
touch outputs/logs/61tpp_sagalactiae_cjb111.finished
Next, convert the sam output from tpp to a compressed/sorted/indexed bam. This time I will strip out the sbatch preamble and the logging suffix.
module add samtools bamtools
set -o errexit
set -o errtrace
set -o pipefail
export LESS='--buffers 0 -B'
echo "## Started /home/trey/sshfs/scratch/atb/tnseq/sagalacticae_2022/preprocessing/reseq/l1_input/concat/scripts/61s2b_tpp_sagalactiae_cjb111_sagalactiae_cjb111.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt
## Converting the text sam to a compressed, sorted, indexed bamfile.
## Also printing alignment statistics to outputs/60transit_sagalactiae_cjb111/concat.bam.stats
## This job depended on: 23670
echo "Starting samtools"
if [[ ! -f "outputs/60transit_sagalactiae_cjb111/concat.sam" ]]; then
echo "Could not find the samtools input file."
exit 1
fi
## If a previous sort file exists due to running out of memory,
## then we need to get rid of them first.
## hg38_100_genome-sorted.bam.tmp.0000.bam
if [[ -f "outputs/60transit_sagalactiae_cjb111/concat.bam.tmp.000.bam" ]]; then
rm -f outputs/60transit_sagalactiae_cjb111/concat.bam.tmp.*.bam
fi
samtools view -u -t ${HOME}/libraries/genome/sagalactiae_cjb111.fasta \
-S outputs/60transit_sagalactiae_cjb111/concat.sam -o outputs/60transit_sagalactiae_cjb111/concat.bam \
2>outputs/60transit_sagalactiae_cjb111/concat.bam_samtools.stderr \
1>outputs/60transit_sagalactiae_cjb111/concat.bam_samtools.stdout
echo "First samtools command finished with $?"
samtools sort -l 9 outputs/60transit_sagalactiae_cjb111/concat.bam \
-o outputs/60transit_sagalactiae_cjb111/concat-sorted.bam \
2>outputs/60transit_sagalactiae_cjb111/concat-sorted_samtools.stderr \
1>outputs/60transit_sagalactiae_cjb111/concat-sorted_samtools.stdout
rm outputs/60transit_sagalactiae_cjb111/concat.bam
rm outputs/60transit_sagalactiae_cjb111/concat.sam
mv outputs/60transit_sagalactiae_cjb111/concat-sorted.bam outputs/60transit_sagalactiae_cjb111/concat.bam
samtools index outputs/60transit_sagalactiae_cjb111/concat.bam \
2>>outputs/60transit_sagalactiae_cjb111/concat-sorted_samtools.stderr \
1>>outputs/60transit_sagalactiae_cjb111/concat-sorted_samtools.stderr
echo "Second samtools command finished with $?"
bamtools stats -in outputs/60transit_sagalactiae_cjb111/concat.bam \
2>>outputs/60transit_sagalactiae_cjb111/concat.bam_samtools.stats 1>&2
echo "Bamtools finished with $?"
## The following will fail if this is single-ended.
samtools view -b -f 2 \
-o outputs/60transit_sagalactiae_cjb111/concat-paired.bam \
outputs/60transit_sagalactiae_cjb111/concat.bam \
2>>outputs/60transit_sagalactiae_cjb111/concat-sorted_samtools.stderr \
1>>outputs/60transit_sagalactiae_cjb111/concat-sorted_samtools.stdout
samtools index outputs/60transit_sagalactiae_cjb111/concat-paired.bam \
2>>outputs/60transit_sagalactiae_cjb111/concat-sorted_samtools.stderr \
1>>outputs/60transit_sagalactiae_cjb111/concat-sorted_samtools.stdout
bamtools stats -in outputs/60transit_sagalactiae_cjb111/concat-paired.bam \
2>>outputs/60transit_sagalactiae_cjb111/concat.bam_samtools.stats 1>&2
bamtools filter -tag XM:0 \
-in outputs/60transit_sagalactiae_cjb111/concat.bam \
-out outputs/60transit_sagalactiae_cjb111/concat-sorted_nomismatch.bam \
2>>outputs/60transit_sagalactiae_cjb111/concat.bam_samtools.stats 1>&2
echo "bamtools filter finished with: $?"
samtools index \
outputs/60transit_sagalactiae_cjb111/concat-sorted_nomismatch.bam \
2>>outputs/60transit_sagalactiae_cjb111/concat-sorted_samtools.stderr \
1>>outputs/60transit_sagalactiae_cjb111/concat-sorted_samtools.stdout
echo "final samtools index finished with: $?"
I do a couple of things in this script which don’t matter for this dataset: make a filtered copy of the perfectly matching alignments, and make a filtered copy of only the properly paired reads (which fails here because I consolidated the two reads of each pair into a singleton).
Finally, count the results against the cjb111 annotations. This is somewhat redundant with (but also complementary to) the .wig output provided by TPP.
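To make the redundancy concrete: the .wig records insertions per TA position, so summing the positions which fall inside each CDS approximates the htseq table produced below. A sketch, assuming a single contig and the file layout above:

#!/usr/bin/env python
## Sketch: collapse TPP's per-TA-site wig counts into per-CDS counts,
## approximating the htseq-count invocation below. Paths are illustrative.
import bisect
import re

site_counts = {}
with open("outputs/61transit_sagalactiae_cjb111/concat.wig") as wig:
    ## Data lines are 'position count'; skip track/variableStep headers.
    for line in wig:
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit():
            site_counts[int(fields[0])] = int(float(fields[1]))
positions = sorted(site_counts)

with open("/home/trey/libraries/genome/sagalactiae_cjb111.gff") as gff:
    for line in gff:
        if line.startswith("#"):
            continue
        cols = line.rstrip().split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        tag = re.search(r"locus_tag=([^;]+)", cols[8])
        if not tag:
            continue
        lo = bisect.bisect_left(positions, int(cols[3]))
        hi = bisect.bisect_right(positions, int(cols[4]))
        total = sum(site_counts[pos] for pos in positions[lo:hi])
        print(f"{tag.group(1)}\t{total}")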
htseq-count \
-q -f bam \
-s no -a 0 \
--type CDS --idattr locus_tag \
outputs/61transit_sagalactiae_cjb111/concat.bam \
/home/trey/libraries/genome/sagalactiae_cjb111.gff \
2>outputs/61transit_sagalactiae_cjb111/concat_all_sno_all_gene_id.stderr \
1>outputs/61transit_sagalactiae_cjb111/concat_all_sno_all_gene_id.count
xz -f -9e outputs/61transit_sagalactiae_cjb111/concat_all_sno_all_gene_id.count \
2>outputs/61transit_sagalactiae_cjb111/concat_all_sno_all_gene_id.stderr.xz \
1>outputs/61transit_sagalactiae_cjb111/concat_all_sno_all_gene_id.count.xz
In the ideal world, these bwa-derived counts will be nearly identical to those provided by bowtie2. I will test that later on.
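When I get to that test, it should only take a few lines; something like this sketch (pandas/scipy assumed, paths per the outputs above; pandas reads the xz-compressed tables directly):

#!/usr/bin/env python
## Sketch: Spearman correlation between the bwa-derived (TPP) and
## bowtie2-derived count tables for one sample.
import pandas
import scipy.stats

def read_counts(path, name):
    ## htseq-count output is gene<TAB>count, with __no_feature and
    ## friends at the bottom; drop those summary rows.
    counts = pandas.read_csv(path, sep="\t", header=None, names=["gene", name])
    return counts[~counts["gene"].str.startswith("__")]

bwa = read_counts("outputs/61transit_sagalactiae_cjb111/concat_all_sno_all_gene_id.count.xz", "bwa")
bt2 = read_counts("outputs/20bowtie2_sagalactiae_cjb111/concat_CDS_sno_CDS_locus_tag.count.xz", "bt2")
both = bwa.merge(bt2, on="gene")
rho, _ = scipy.stats.spearmanr(both["bwa"], both["bt2"])
print(f"Spearman rho across {len(both)} genes: {rho:.4f}")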
In parallel, I am going to use bowtie2 to map these reads against the plasmid and cjb111. The sam2bam and counting scripts produced by cyoa are essentially identical to those used for the bwa output, so I will not repeat them here.
With that in mind, here is what I used to poke krmit:
cd ${start}
for dir in $dirs; do
echo $dir
cd "${start}/${dir}/concat"
cyoa --method bt2 --species krmit --input consolidated.fastq
cd ${start}
done
I don’t maintain a gff file for pkrmit (perhaps I should?), so it does not get an htseq script; but here is an example of the bowtie2 script generated.
mkdir -p outputs/20bowtie2_krmit
bowtie2 -x /home/trey/libraries/genome/indexes/krmit --very-sensitive -L 14 \
-p 2 \
-q <(less /home/trey/sshfs/scratch/atb/tnseq/sagalacticae_2022/preprocessing/reseq/l1_input/concat/consolidated.fastq) \
--un outputs/20bowtie2_krmit/concat_unaligned_krmit.fastq \
--al outputs/20bowtie2_krmit/concat_aligned_krmit.fastq \
-S outputs/20bowtie2_krmit/concat.sam \
2>outputs/20bowtie2_krmit/concat.stderr \
1>outputs/20bowtie2_krmit/concat.stdout
The corresponding invocation for CJB111 follows:
cd ${start}
for dir in $dirs; do
echo $dir
cd "${start}/${dir}/concat"
cyoa --method bt2 --species sagalactiae_cjb111 --input consolidated.fastq \
--gff_type CDS --gff_tag locus_tag --stranded no
cd ${start}
done
What the heck, while I am at it I will insert a copy of the sam2bam and htseq invocations produced by cyoa.
## This is a bowtie2 alignment of <(less /home/trey/sshfs/scratch/atb/tnseq/sagalacticae_2022/preprocessing/reseq/l1_input/concat/consolidated.fastq) against
## /home/trey/libraries/genome/indexes/sagalactiae_cjb111 using arguments: --very-sensitive -L 14 .
mkdir -p outputs/20bowtie2_sagalactiae_cjb111
bowtie2 -x /home/trey/libraries/genome/indexes/sagalactiae_cjb111 --very-sensitive -L 14 \
-p 2 \
-q <(less /home/trey/sshfs/scratch/atb/tnseq/sagalacticae_2022/preprocessing/reseq/l1_input/concat/consolidated.fastq) \
--un outputs/20bowtie2_sagalactiae_cjb111/concat_unaligned_sagalactiae_cjb111.fastq \
--al outputs/20bowtie2_sagalactiae_cjb111/concat_aligned_sagalactiae_cjb111.fastq \
-S outputs/20bowtie2_sagalactiae_cjb111/concat.sam \
2>outputs/20bowtie2_sagalactiae_cjb111/concat.stderr \
1>outputs/20bowtie2_sagalactiae_cjb111/concat.stdout
echo "Starting samtools"
if [[ -f "outputs/20bowtie2_sagalactiae_cjb111/concat.bam" && -f "outputs/20bowtie2_sagalactiae_cjb111/concat.sam" ]]; then
echo "Both the bam and sam files exist, rerunning."
elif [[ -f "outputs/20bowtie2_sagalactiae_cjb111/concat.bam" ]]; then
echo "The output file exists, quitting."
exit 0
elif [[ ! -f "outputs/20bowtie2_sagalactiae_cjb111/concat.sam" ]]; then
echo "Could not find the samtools input file."
exit 1
fi
## If a previous sort file exists due to running out of memory,
## then we need to get rid of them first.
## hg38_100_genome-sorted.bam.tmp.0000.bam
if [[ -f "outputs/20bowtie2_sagalactiae_cjb111/concat.bam.tmp.000.bam" ]]; then
rm -f outputs/20bowtie2_sagalactiae_cjb111/concat.bam.tmp.*.bam
fi
samtools view -u -t ${HOME}/libraries/genome/sagalactiae_cjb111.fasta \
-S outputs/20bowtie2_sagalactiae_cjb111/concat.sam -o outputs/20bowtie2_sagalactiae_cjb111/concat.bam \
2>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stderr \
1>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stdout
echo "First samtools command finished with $?"
samtools sort -l 9 outputs/20bowtie2_sagalactiae_cjb111/concat.bam \
-o outputs/20bowtie2_sagalactiae_cjb111/concat-sorted.bam \
2>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stderr \
1>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stdout
rm outputs/20bowtie2_sagalactiae_cjb111/concat.bam
rm outputs/20bowtie2_sagalactiae_cjb111/concat.sam
mv outputs/20bowtie2_sagalactiae_cjb111/concat-sorted.bam outputs/20bowtie2_sagalactiae_cjb111/concat.bam
samtools index outputs/20bowtie2_sagalactiae_cjb111/concat.bam \
2>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stderr \
1>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stdout
echo "Second samtools command finished with $?"
bamtools stats -in outputs/20bowtie2_sagalactiae_cjb111/concat.bam \
2>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stats 1>&2
echo "Bamtools finished with $?"
## The following will fail if this is single-ended.
samtools view -b -f 2 \
-o outputs/20bowtie2_sagalactiae_cjb111/concat-paired.bam \
outputs/20bowtie2_sagalactiae_cjb111/concat.bam \
2>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stderr \
1>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stdout
samtools index outputs/20bowtie2_sagalactiae_cjb111/concat-paired.bam \
2>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stderr \
1>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stdout
bamtools stats -in outputs/20bowtie2_sagalactiae_cjb111/concat-paired.bam \
2>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stats 1>&2
bamtools filter -tag XM:0 \
-in outputs/20bowtie2_sagalactiae_cjb111/concat.bam \
-out outputs/20bowtie2_sagalactiae_cjb111/concat-sorted_nomismatch.bam \
2>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stats 1>&2
echo "bamtools filter finished with: $?"
samtools index \
outputs/20bowtie2_sagalactiae_cjb111/concat-sorted_nomismatch.bam \
2>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stderr \
1>>outputs/20bowtie2_sagalactiae_cjb111/concat.bam_samtools.stdout
echo "final samtools index finished with: $?"
htseq-count \
-q -f bam \
-s no -a 0 \
--type CDS --idattr locus_tag \
outputs/20bowtie2_sagalactiae_cjb111/concat.bam \
/home/trey/libraries/genome/sagalactiae_cjb111.gff \
2>outputs/20bowtie2_sagalactiae_cjb111/concat_CDS_sno_CDS_locus_tag.stderr \
1>outputs/20bowtie2_sagalactiae_cjb111/concat_CDS_sno_CDS_locus_tag.count
xz -f -9e outputs/20bowtie2_sagalactiae_cjb111/concat_CDS_sno_CDS_locus_tag.count \
2>outputs/20bowtie2_sagalactiae_cjb111/concat_CDS_sno_CDS_locus_tag.stderr.xz \
1>outputs/20bowtie2_sagalactiae_cjb111/concat_CDS_sno_CDS_locus_tag.count.xz
One thing to note about the various mappings performed: there are still quite a large number of unmapped reads. I blasted a random assortment of them, and it looks like they are dominated by a different expression vector. Does this make sense?
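For reference, pulling such a random assortment out of bowtie2’s --un output is a one-screen job; a reservoir-sampling sketch (the sample size is arbitrary):

#!/usr/bin/env python
## Sketch: reservoir-sample reads from bowtie2's --un output into a
## fasta suitable for blasting.
import itertools
import random

wanted = 50
sample = []
unaligned = "outputs/20bowtie2_sagalactiae_cjb111/concat_unaligned_sagalactiae_cjb111.fastq"
with open(unaligned) as handle:
    for idx, seq in enumerate(itertools.islice(handle, 1, None, 4)):
        if idx < wanted:
            sample.append((idx, seq.strip()))
        else:
            slot = random.randint(0, idx)
            if slot < wanted:
                sample[slot] = (idx, seq.strip())
with open("unmapped_sample.fasta", "w") as out:
    for idx, seq in sample:
        out.write(f">read_{idx}\n{seq}\n")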
At this point I have the wig files from TPP along with the various alignments from bowtie2, so let us start poking at them. Note the grep -v in the creation of the following bash variables: it is there because my preprocessing script by default creates every file with the same name (‘concat.wig’ in this case). I am going to change that as soon as I finish here; in the meantime I symbolically linked each file to its parent directory’s name (l1_input.wig, etc.).
control_files=$(find . -name '*.wig' | grep input | grep -v concat.wig | sed 's/\.\///g')
echo $control_files
blood_files=$(find . -name '*.wig' | grep blood | grep -v concat.wig | sed 's/\.\///g')
echo $blood_files
brain_files=$(find . -name '*.wig' | grep brain | grep -v concat.wig | sed 's/\.\///g')
echo $brain_files
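I made those symlinks by hand, but a few lines like the following would do it programmatically (a sketch assuming the directory layout above):

#!/usr/bin/env python
## Sketch: give each concat.wig a link named after its sample directory
## (l1_input.wig, etc.) so the find/grep invocations above can tell the
## samples apart.
from pathlib import Path

base = Path("preprocessing/reseq")
for wig in base.glob("*/concat/outputs/*transit*/concat.wig"):
    sample = wig.relative_to(base).parts[0]   ## e.g. 'l1_input'
    link = wig.with_name(f"{sample}.wig")
    if not link.exists():
        link.symlink_to(wig.name)             ## relative link, same directory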
I think I finally got my python setup/skills to a place where I can confidently play with and invoke transit. Let’s find out! I have never written down how I use python with R in my editor, so let’s just do it here; it is definitely idiosyncratic. I have tried using jupyter on at least 10 occasions and every time I get annoyed and frustrated with it. I have also spent a significant amount of time messing with org and its literate programming interface; that works pretty well, but I still find it difficult to use. Fundamentally, I like Rmarkdown and just want to use that.
I therefore set up my editor with some awareness of python virtual environments and symbolically linked my transit 202212 virtual environment (which is what I used for all the steps above) into this working tree as ‘venv’. Thus, when I start my editor, it picks that up and uses it for accessing python/transit.
The first test of my setup is to just load transit. I am using the examples from https://transit.readthedocs.io/en/latest/transit.html to extract the various functions from transit.
I am not yet comfortable enough with python to easily have it find the .wig filenames generated by tpp, so I am copy/pasting the filenames produced by my shell in the previous bash block:
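(Note to future self: the glob module would have saved the copy/paste; something like the following, using the layout above:)

import glob
## e.g. the three input wigs, sorted for reproducibility:
input_wigs = sorted(glob.glob(
    "preprocessing/reseq/*_input/concat/outputs/*transit*/*_input.wig"))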
import os
import numpy
import scipy
import pandas
import matplotlib
import pytransit.norm_tools as norm_tools
import pytransit.tnseq_tools as tnseq_tools
import pytransit.transit_tools as transit_tools
import pytransit.transit_gui as transit_gui
(input_data, input_positions) = tnseq_tools.get_data(
    ['preprocessing/reseq/l3_input/concat/outputs/61transit_sagalactiae_cjb111/l3_input.wig',
     'preprocessing/reseq/l1_input/concat/outputs/61transit_sagalactiae_cjb111/l1_input.wig',
     'preprocessing/reseq/l2_input/concat/outputs/61transit_sagalactiae_cjb111/l2_input.wig',])
(blood_data, blood_positions) = tnseq_tools.get_data(
    ['preprocessing/reseq/l3_blood/concat/outputs/61transit_sagalactiae_cjb111/l3_blood.wig',
     'preprocessing/reseq/l1_blood/concat/outputs/61transit_sagalactiae_cjb111/l1_blood.wig',
     'preprocessing/reseq/l2_blood/concat/outputs/61transit_sagalactiae_cjb111/l2_blood.wig',])
(brain_data, brain_positions) = tnseq_tools.get_data(
    ['preprocessing/reseq/l3_brain/concat/outputs/61transit_sagalactiae_cjb111/l3_brain.wig',
     'preprocessing/reseq/l1_brain/concat/outputs/61transit_sagalactiae_cjb111/l1_brain.wig',
     'preprocessing/reseq/l2_brain/concat/outputs/61transit_sagalactiae_cjb111/l2_brain.wig',])
## Hmm it looks like the documentation is a little out of date...
## The following does not work, but is trivially run using normalize_data()
## input_norm = norm.aBGC_norm(input_data)
As I understand it, transit parses the input gff file and creates a ‘prot_table’ file. I assumed it did this while TPP was running, but I think that is incorrect.
gff_file = 'reference/only_cds.gff'
## Make sure I typed that correctly (I must have, emacs did it for me, but I am learning python here)
os.stat(gff_file)
## Note to self tnseq.get_gene_info() and friends all fail with 'list index out of range'
## But I was able to read this gff file in the GUI, so I am definitely missing something obvious.
## os.stat_result(st_mode=33204, st_ino=122997, st_dev=66, st_nlink=1, st_uid=10186, st_gid=18062, st_size=641422, st_atime=1678384690, st_mtime=1678384690, st_ctime=1678384690)
gff_annot = tnseq_tools.get_pos_hash_gff(gff_file) ## also fails with list index
## Ahh I see, it is trying to get the ID tag from every line of the
## gff file, including the first one which is the entire chromosome
## and doesn't have a tag 'ID'. Thus, the lambda never finds the tag
## and fails.
## In my forked version of transit, I added parameters to explicitly
## state which tags and types one might want.
## The other normalization methods include: 'emphist', 'nzmean', 'nonorm', 'quantile',
## 'totreads', 'zinfb', 'TTR'
## Holy crap aBGC takes forever to run, let's just do quantile.
chosen_norm = 'quantile'
input_norm = norm_tools.normalize_data(input_data, method = chosen_norm)
blood_norm = norm_tools.normalize_data(blood_data, method = chosen_norm)
brain_norm = norm_tools.normalize_data(brain_data, method = chosen_norm)
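Before moving on: to illustrate the fix I mentioned above (restricting the parse to lines which actually carry the desired type and tag), a tolerant version of the pos-hash builder might look like this sketch. This is my approximation, not transit's actual code nor the exact change in my fork:

import re

def get_pos_hash_gff_tolerant(gff_file, feature_type="CDS", tag="ID"):
    ## Map each genomic position to the features covering it, skipping any
    ## line (like the leading region line) which lacks the requested tag.
    pos_hash = {}
    with open(gff_file) as gff:
        for line in gff:
            if line.startswith("#"):
                continue
            cols = line.rstrip().split("\t")
            if len(cols) < 9 or cols[2] != feature_type:
                continue
            match = re.search(rf"{tag}=([^;]+)", cols[8])
            if not match:
                continue
            for pos in range(int(cols[3]), int(cols[4]) + 1):
                pos_hash.setdefault(pos, []).append(match.group(1))
    return pos_hash

gff_annot = get_pos_hash_gff_tolerant(gff_file, feature_type="CDS", tag="locus_tag")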
At this point I should be able to make some of the same plots as the GUI provides? Looking at the code, it appears I cannot call the plotting functions without a wxgui event. If the authors don’t mind, I may eventually rip those out into plain callable functions; until then I will just yank the relevant bits out and put them here.
import matplotlib.pyplot as plt
import numpy

def plot_scatter(control, experimental, control_rep = 0,
                 experimental_rep = 0):
    X = numpy.log1p(control[control_rep, :])
    Y = numpy.log1p(experimental[experimental_rep, :])
    plt.plot(X, Y, "bo")
    plt.title('Scatter plot - Reads at TA sites')
    ## plt.xlabel(transit_tools.fetch_name(datasets[0]))
    ## plt.ylabel(transit_tools.fetch_name(datasets[1]))
    plt.show(block = False)
## Plot input 1 vs. blood 1
plot_scatter(input_data, blood_data)
## Compare input 1 vs input 2
plot_scatter(input_data, input_data, control_rep = 0, experimental_rep = 1)
## input 1 vs input 3
plot_scatter(input_data, input_data, control_rep = 0, experimental_rep = 2)
## input 2 vs input 3
plot_scatter(input_data, input_data, control_rep = 1, experimental_rep = 2)
## blood 1 vs brain 1
plot_scatter(blood_data, brain_data, control_rep = 0, experimental_rep = 0)
cjb_microbes <- load_microbesonline_annotations(species="CJB111")
## Found 1 entry.
## Streptococcus agalactiae CJB111  Firmicutes  yes  2007-05-08  no  NA  NA  2208  342617
## The species being downloaded is: Streptococcus agalactiae CJB111
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=342617;export=tab
cjb_gff <- load_gff_annotations("~/scratch/libraries/genome/sagalactiae_cjb111.gff", type = "CDS")
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = TRUE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = TRUE)
## Returning a df with 42 columns and 2013 rows.
cjb_microbes <- as.data.frame(cjb_microbes)
rownames(cjb_gff) <- make.names(cjb_gff[["locus_tag"]], unique=TRUE)
## I am going to only pay attention to the first annotation for each locus tag from microbesonline.
cjb_microbes[["sysName"]] <- make.names(cjb_microbes[["sysName"]], unique=TRUE)
## I should ask Lindsey if they would like me to merge these
cjb_annot <- cjb_gff
##cjb_annot <- merge(cjb_gff, cjb_microbes, by.x="old_locus_tag", by.y="sysName")
## rownames(cjb_annot) <- make.names(cjb_annot[["locus_tag"]], unique=TRUE)
cjb_expt <- create_expt(metadata="sample_sheets/all_samples.xlsx",
                        gene_info=cjb_annot)
cjb_written <- write_expt(cjb_expt, batch="raw",
                          excel=glue::glue("excel/{rundate}-cjb_counts-v{ver}.xlsx"))
"legend_plot"]]
cjb_written[["raw_libsize"]]
cjb_written[["raw_density"]]
cjb_written[[## awesome
"norm_disheat"]]
cjb_written[["norm_corheat"]] cjb_written[[
pander::pander(sessionInfo())
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: hpgltools(v.1.0), testthat(v.3.1.6), SummarizedExperiment(v.1.28.0), GenomicRanges(v.1.50.2), GenomeInfoDb(v.1.34.9), IRanges(v.2.32.0), S4Vectors(v.0.36.1), MatrixGenerics(v.1.10.0), matrixStats(v.0.63.0), Biobase(v.2.58.0) and BiocGenerics(v.0.44.0)
loaded via a namespace (and not attached): rappdirs(v.0.3.3), rtracklayer(v.1.58.0), tidyr(v.1.3.0), ggplot2(v.3.4.1), clusterGeneration(v.1.3.7), bit64(v.4.0.5), knitr(v.1.42), DelayedArray(v.0.24.0), data.table(v.1.14.6), KEGGREST(v.1.38.0), RCurl(v.1.98-1.10), doParallel(v.1.0.17), generics(v.0.1.3), GenomicFeatures(v.1.50.4), callr(v.3.7.3), RhpcBLASctl(v.0.23-42), cowplot(v.1.1.1), usethis(v.2.1.6), RSQLite(v.2.2.20), shadowtext(v.0.1.2), tzdb(v.0.3.0), bit(v.4.0.5), enrichplot(v.1.18.3), xml2(v.1.3.3), httpuv(v.1.6.8), assertthat(v.0.2.1), viridis(v.0.6.2), xfun(v.0.37), hms(v.1.1.2), jquerylib(v.0.1.4), evaluate(v.0.20), promises(v.1.2.0.1), fansi(v.1.0.4), restfulr(v.0.0.15), progress(v.1.2.2), caTools(v.1.18.2), dbplyr(v.2.3.0), igraph(v.1.4.0), DBI(v.1.1.3), htmlwidgets(v.1.6.1), purrr(v.1.0.1), ellipsis(v.0.3.2), selectr(v.0.4-2), dplyr(v.1.1.0), backports(v.1.4.1), annotate(v.1.76.0), aod(v.1.3.2), biomaRt(v.2.54.0), vctrs(v.0.5.2), remotes(v.2.4.2), here(v.1.0.1), cachem(v.1.0.6), withr(v.2.5.0), ggforce(v.0.4.1), HDO.db(v.0.99.1), vroom(v.1.6.1), GenomicAlignments(v.1.34.0), treeio(v.1.22.0), prettyunits(v.1.1.1), DOSE(v.3.24.2), ape(v.5.6-2), lazyeval(v.0.2.2), crayon(v.1.5.2), genefilter(v.1.80.3), edgeR(v.3.40.2), pkgconfig(v.2.0.3), tweenr(v.2.0.2), nlme(v.3.1-162), pkgload(v.1.3.2), devtools(v.2.4.5), rlang(v.1.0.6), lifecycle(v.1.0.3), miniUI(v.0.1.1.1), downloader(v.0.4), filelock(v.1.0.2), BiocFileCache(v.2.6.0), rprojroot(v.2.0.3), polyclip(v.1.10-4), graph(v.1.76.0), Matrix(v.1.5-3), aplot(v.0.1.9), boot(v.1.3-28.1), processx(v.3.8.0), png(v.0.1-8), viridisLite(v.0.4.1), rjson(v.0.2.21), bitops(v.1.0-7), gson(v.0.0.9), KernSmooth(v.2.23-20), pander(v.0.6.5), Biostrings(v.2.66.0), blob(v.1.2.3), stringr(v.1.5.0), qvalue(v.2.30.0), readr(v.2.1.4), remaCor(v.0.0.11), gridGraphics(v.0.5-1), scales(v.1.2.1), memoise(v.2.0.1), GSEABase(v.1.60.0), magrittr(v.2.0.3), plyr(v.1.8.8), gplots(v.3.1.3), zlibbioc(v.1.44.0), compiler(v.4.2.0), scatterpie(v.0.1.8), BiocIO(v.1.8.0), RColorBrewer(v.1.1-3), lme4(v.1.1-31), Rsamtools(v.2.14.0), cli(v.3.6.0), XVector(v.0.38.0), urlchecker(v.1.0.1), patchwork(v.1.1.2), ps(v.1.7.2), MASS(v.7.3-58.2), mgcv(v.1.8-41), tidyselect(v.1.2.0), stringi(v.1.7.12), highr(v.0.10), yaml(v.2.3.7), GOSemSim(v.2.24.0), locfit(v.1.5-9.7), ggrepel(v.0.9.3), grid(v.4.2.0), sass(v.0.4.5), fastmatch(v.1.1-3), tools(v.4.2.0), parallel(v.4.2.0), rstudioapi(v.0.14), foreach(v.1.5.2), gridExtra(v.2.3), farver(v.2.1.1), ggraph(v.2.1.0), digest(v.0.6.31), shiny(v.1.7.4), Rcpp(v.1.0.10), broom(v.1.0.3), later(v.1.3.0), httr(v.1.4.4), AnnotationDbi(v.1.60.0), Rdpack(v.2.4), colorspace(v.2.1-0), rvest(v.1.0.3), brio(v.1.1.3), XML(v.3.99-0.13), fs(v.1.6.1), reticulate(v.1.28), splines(v.4.2.0), yulab.utils(v.0.0.6), PROPER(v.1.30.0), tidytree(v.0.4.2), graphlayouts(v.0.8.4), ggplotify(v.0.1.0), plotly(v.4.10.1), sessioninfo(v.1.2.2), xtable(v.1.8-4), jsonlite(v.1.8.4), nloptr(v.2.0.3), ggtree(v.3.6.2), tidygraph(v.1.2.3), ggfun(v.0.0.9), R6(v.2.5.1), RUnit(v.0.4.32), profvis(v.0.3.7), pillar(v.1.8.1), htmltools(v.0.5.4), mime(v.0.12), glue(v.1.6.2), fastmap(v.1.1.0), minqa(v.1.2.5), clusterProfiler(v.4.6.0), BiocParallel(v.1.32.5), codetools(v.0.2-19), fgsea(v.1.24.0), pkgbuild(v.1.4.0), mvtnorm(v.1.1-3), utf8(v.1.2.3), lattice(v.0.20-45), bslib(v.0.4.2), tibble(v.3.1.8), sva(v.3.46.0), pbkrtest(v.0.5.2), curl(v.5.0.0), gtools(v.3.9.4), GO.db(v.3.16.0), survival(v.3.5-3), limma(v.3.54.1), rmarkdown(v.2.20), desc(v.1.4.2), munsell(v.0.5.0), GenomeInfoDbData(v.1.2.9), 
iterators(v.1.0.14), variancePartition(v.1.28.4), reshape2(v.1.4.4), gtable(v.0.3.1) and rbibutils(v.2.2.13)
message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset c100d5666f5032d24c933739015d267ef651c323
## This is hpgltools commit: Wed Mar 1 09:50:14 2023 -0500: c100d5666f5032d24c933739015d267ef651c323
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to index-v202210.rda.xz
tmp <- sm(saveme(filename=this_save))
loadme(filename=this_save)