This document seeks to lay out my process in poking at the DNA sequencing results of a series of Pseudomonas aeruginosa PA14 and PAK strains.
If I understand Dr. Lee and co.’s goal, they wish to ensure that these strains are still reasonably close to the associated reference strains. I therefore am running my default trimming/mapping/variant search methods.
I have a single command that can run all of these steps in one go, but I have been actively breaking my tools recently, so I decided to run them one at a time on the assumption that something would not work (everything did work on the first try, so that was nice).
I downloaded the .zip archive file using the link in Dr. Lee’s email. I did not save it though, so if we need to download the data again, we will have to go to him. I created my usual work directory ‘preprocessing/’ within this tree and moved it there. I unzipped it and moved each pair of reads to a directory which follows Dr. Lee’s desired naming convention.
I then created the directories: ‘reference/’ and ‘sample_sheets/’. The sample_sheets remained empty for a while, but I immediately downloaded the full genbank flat file for the Pseudomonas PAK strain from NCBI, found here:
https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP020659
Note that when downloading, one must hit the ‘customize view’ button on the right and ensure that the entire sequence and all annotations are included. Then hit the ‘send to’ button and send it to a file. I copied this file to reference/paeruginosa_pak.gb.
Given the full PAK genbank file, I converted it to the expected fasta/gff file for mapping:
This command created a series of fasta and gff files which provide the coordinates for the various annotations (genes/cds/rRNA/intercds) and sequence for the genome, CDS nucleotides, and amino acids. I then copied the genome/gff files to my global reference directory and prepared it for usage by my favorite mapper:
Now all of the pieces are in place for me to play. Each of the following steps was performed twice, once for the PA14 samples, once for the PAK samples. The only difference in the invocations was due to the fact that the PAK annotations provide different tags. E.g. I used the ‘Alias’ tag for PA14 and the ‘locus_tag’ tag for PAK. As a result I am only going to write down in this document the PA14 invocations and assume the reader can figure out the difference.
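The `--gff_tag` difference comes down to which key in the ninth GFF column carries the gene identifier. As a rough illustration, a small shell function can pull one attribute out of each feature line. The function name `gff_attr` and the two feature lines are made up for this example and are not taken from either annotation.

```shell
#!/bin/sh
# Print the value of one attribute (e.g. Alias or locus_tag) for every
# feature of a given type in a GFF3 stream read from stdin.
# Usage: gff_attr TYPE TAG < file.gff
# gff_attr is a hypothetical helper, not part of cyoa.
gff_attr() {
    awk -F '\t' -v type="$1" -v tag="$2" '
        /^#/ { next }                      # skip header and comment lines
        $3 == type {
            n = split($9, attrs, ";")      # column 9 holds key=value pairs
            for (i = 1; i <= n; i++) {
                split(attrs[i], kv, "=")
                if (kv[1] == tag) print kv[2]
            }
        }'
}

# Hypothetical feature lines, only to show the two tag styles.
printf 'chr\tsrc\tgene\t1\t90\t.\t+\t.\tID=gene1;Alias=geneA\n' | gff_attr gene Alias
printf 'chr\tsrc\tgene\t1\t90\t.\t+\t.\tID=gene2;locus_tag=TAG_0001\n' | gff_attr gene locus_tag
```

Running something like `gff_attr gene locus_tag < reference.gff | head` before mapping is a quick way to confirm that the chosen tag actually exists in a given annotation.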
I have a couple of trimming methods; in this instance I just used the default and will operate under the assumption that it is sufficient until I see otherwise.
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d PA14*); do
cd $i
cyoa --method trim --input $(/bin/ls *.fastq.gz | tr '\n' ':' | sed 's/:$//g')
cd $start
done
The above command line invocation produced a series of trimming jobs which when examined look like this (I am only showing examples from PA14_exoUTY, and am leaving off the beginning and end).
## This is a portion of file:
## preprocessing/PA14_exoUTY/scripts/01trim_7_UTY_S138_R1_001.sh
module add trimomatic
mkdir -p outputs/01trimomatic
## Note that trimomatic prints all output and errors to STDERR, so send both to output
trimmomatic PE \
-threads 1 \
-phred33 \
7_UTY_S138_R1_001.fastq.gz 7_UTY_S138_R2_001.fastq.gz \
7_UTY_S138_R1_001-trimmed_paired.fastq 7_UTY_S138_R1_001-trimmed_unpaired.fastq \
7_UTY_S138_R2_001-trimmed_paired.fastq 7_UTY_S138_R2_001-trimmed_unpaired.fastq \
ILLUMINACLIP:/fs/cbcb-software/RedHat-8-x86_64/local/cyoa/202302/prefix/lib/perl5/auto/share/dist/Bio-Adventure/genome/adapters.fa:2:20:10:2:keepBothReads \
SLIDINGWINDOW:4:20 MINLEN:50 \
1>outputs/01trimomatic/7_UTY_S138_R1_001-trimomatic.stdout \
2>outputs/01trimomatic/7_UTY_S138_R1_001-trimomatic.stderr
excepted=$( { grep "Exception" "outputs/01trimomatic/7_UTY_S138_R1_001-trimomatic.stdout" || test $? = 1; } )
One thing I did not include in the above: upon completion, the script aggressively compresses the trimmed output and symbolically links it to r1_trimmed.fastq.xz and r2_trimmed.fastq.xz. Thus any following steps can use the same input name (r1_trimmed.fastq.xz:r2_trimmed.fastq.xz).
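The `--input` value in the trimming loop is simply the read files joined with colons, which is the same form the later steps reuse. A minimal sketch of building that string without parsing `ls` output follows; the function name `join_colon` is my own and is not part of cyoa.

```shell
#!/bin/sh
# Join all arguments with ':' so that "a.fastq.gz b.fastq.gz" becomes
# "a.fastq.gz:b.fastq.gz". The name join_colon is hypothetical.
join_colon() {
    joined=""
    for f in "$@"; do
        if [ -z "$joined" ]; then
            joined="$f"
        else
            joined="${joined}:${f}"
        fi
    done
    printf '%s\n' "$joined"
}

# Example with the read names shown in the trimming script.
join_colon 7_UTY_S138_R1_001.fastq.gz 7_UTY_S138_R2_001.fastq.gz
```

Inside a sample directory this could be called as `join_colon *.fastq.gz`, since the shell expands the glob in sorted order and so lists R1 before R2.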
My default mappers run the actual alignment, convert it to a compressed/indexed bam, and count it against the reference genome. In this context, the counting is a little silly, but does have the potential to help find duplications and such.
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d PA14*); do
cd $i
cyoa --method hisat --input r1_trimmed.fastq.xz:r2_trimmed.fastq.xz \
--stranded no --species paeruginosa_pa14 --gff_type gene --gff_tag Alias
cd $start
done
## Here is what I ran for PAK
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d PAK*); do
cd $i
cyoa --method hisat --input r1_trimmed.fastq.xz:r2_trimmed.fastq.xz \
--stranded no --species paeruginosa_pa01 --gff_type gene --gff_tag locus_tag
cd $start
done
Similarly, I am only showing the meaty part of the resulting script.
module add hisat2 samtools htseq bamtools
mkdir -p outputs/40hisat2_paeruginosa_pa14
hisat2 -x ${HOME}/libraries/genome/indexes/paeruginosa_pa14 \
-p 8 \
-q -1 <(less /home/trey/sshfs/scratch/atb/dnaseq/paeruginosa_strains_202304/preprocessing/PA14_exoUTY/r1_trimmed.fastq.xz) -2 <(less /home/trey/sshfs/scratch/atb/dnaseq/paeruginosa_strains_202304/preprocessing/PA14_exoUTY/r2_trimmed.fastq.xz) \
--phred33 \
--un outputs/40hisat2_paeruginosa_pa14/unaldis_paeruginosa_pa14_genome.fastq \
--al outputs/40hisat2_paeruginosa_pa14/aldis_paeruginosa_pa14_genome.fastq \
--un-conc outputs/40hisat2_paeruginosa_pa14/unalcon_paeruginosa_pa14_genome.fastq \
--al-conc outputs/40hisat2_paeruginosa_pa14/alcon_paeruginosa_pa14_genome.fastq \
-S outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.sam \
2>outputs/40hisat2_paeruginosa_pa14/hisat2_paeruginosa_pa14_genome_PA14_exoUTY.stderr \
1>outputs/40hisat2_paeruginosa_pa14/hisat2_paeruginosa_pa14_genome_PA14_exoUTY.stdout
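The `<(less …)` process substitutions above hand hisat2 a decompressed stream, which presumably depends on `less` being configured with an input preprocessor that understands xz. A more explicit variant, sketched here with a hypothetical helper named `stream_reads`, picks the decompressor from the file extension. The demonstration uses gzip only because it is available almost everywhere.

```shell
#!/bin/sh
# Write the decompressed contents of a read file to stdout.
# stream_reads is a hypothetical helper, not part of cyoa or hisat2.
stream_reads() {
    case "$1" in
        *.xz) xz -dc -- "$1" ;;
        *.gz) gzip -dc -- "$1" ;;
        *)    cat -- "$1" ;;
    esac
}

# Small demonstration with a temporary gzip-compressed file.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmpdir/demo.fastq"
gzip -c "$tmpdir/demo.fastq" > "$tmpdir/demo.fastq.gz"
stream_reads "$tmpdir/demo.fastq.gz" | head -n 2
rm -rf "$tmpdir"
```

In bash, the hisat2 arguments could then read `-1 <(stream_reads r1_trimmed.fastq.xz) -2 <(stream_reads r2_trimmed.fastq.xz)`, which removes the dependence on how `less` happens to be set up on a given machine.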
The above cyoa invocation also creates this script. It is a little long because it does some checks and creates a couple of filtered versions of the output.
module add samtools bamtools
echo "Starting samtools"
if [[ -f "outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam" && -f "outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.sam" ]]; then
echo "Both the bam and sam files exist, rerunning."
elif [[ -f "outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam" ]]; then
echo "The output file exists, quitting."
exit 0
elif [[ ! -f "outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.sam" ]]; then
echo "Could not find the samtools input file."
exit 1
fi
## If a previous sort file exists due to running out of memory,
## then we need to get rid of them first.
## hg38_100_genome-sorted.bam.tmp.0000.bam
if [[ -f "outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam.tmp.000.bam" ]]; then
rm -f outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam.tmp.*.bam
fi
samtools view -u -t ${HOME}/libraries/genome/paeruginosa_pa14.fasta \
-S outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.sam -o outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam \
2>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stderr \
1>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stdout
echo "First samtools command finished with $?"
samtools sort -l 9 outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam \
-o outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-sorted.bam \
2>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stderr \
1>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stdout
rm outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam
rm outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.sam
mv outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-sorted.bam outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam
samtools index outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam \
2>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stderr \
1>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stdout
echo "Second samtools command finished with $?"
bamtools stats -in outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam \
2>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stats 1>&2
echo "Bamtools finished with $?"
## The following will fail if this is single-ended.
samtools view -b -f 2 \
-o outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired.bam \
outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam \
2>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stderr \
1>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stdout
samtools index outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired.bam \
2>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stderr \
1>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stdout
bamtools stats -in outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired.bam \
2>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stats 1>&2
bamtools filter -tag XM:0 \
-in outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam \
-out outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-sorted_nomismatch.bam \
2>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stats 1>&2
echo "bamtools filter finished with: $?"
samtools index \
outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-sorted_nomismatch.bam \
2>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stderr \
1>>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam_samtools.stdout
echo "final samtools index finished with: $?"
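The `samtools view -b -f 2` call in the script above keeps only alignments whose FLAG field has the 0x2 bit set, which the SAM format uses to mark reads aligned as a proper pair. A small sketch of that bit test, with a function name of my own choosing:

```shell
#!/bin/sh
# Return success when a SAM FLAG value has the 0x2 (proper pair) bit set.
# The name is_proper_pair is hypothetical.
is_proper_pair() {
    [ $(( $1 & 2 )) -ne 0 ]
}

# 99 = 1 + 2 + 32 + 64: paired, proper pair, mate reverse, first in pair.
# 73 = 1 + 8 + 64: paired, mate unmapped, first in pair.
for flag in 99 73; do
    if is_proper_pair "$flag"; then
        echo "$flag kept by -f 2"
    else
        echo "$flag dropped by -f 2"
    fi
done
```

This is also why the script warns that the step fails for single-ended data: unpaired reads never carry the 0x2 bit, so the filtered file would be empty.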
Note that the counting step (htseq-count, shown below) is not really useful for a dnaseq dataset in most instances. I also have the default orientation set to reverse because most of the samples off our sequencer are reversed, but that is likely not true for this dataset. If it turns out we actually care about these counts, I may need to come back and rerun them.
module add htseq
htseq-count \
-q -f bam \
-s reverse -a 0 \
--type all --idattr Alias \
outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired.bam \
/home/trey/libraries/genome/paeruginosa_pa14.gff \
2>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.stderr \
1>outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.count
xz -f -9e outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.count
I tend to like to use freebayes for the variant search. It is a little conservative, but it seems to work quite well. I can also use mpileup and snippy. freebayes and mpileup are set up to feed a post-processing script which I think is kind of fun and will be described momentarily.
cd preprocessing
start=$(pwd)
for i in $(/bin/ls -d PA14*); do
cd $i
cyoa --method freebayes \
--input outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired.bam \
--species paeruginosa_pa14 --gff_type gene --gff_tag Alias --intron 0
cd $start
done
## Here is what I ran for PAK
for i in $(/bin/ls -d PAK*); do
cd $i
cyoa --method freebayes \
--input outputs/40hisat2_paeruginosa_pa01/paeruginosa_pa01_genome-paired.bam \
--species paeruginosa_pa01 --gff_type gene --gff_tag locus_tag --intron 0
cd $start
done
Unlike the hisat step, the variant search invocation includes the conversion to the binary/compressed/indexed format. It also includes the duplicate search functionality from gatk.
module add gatk freebayes libgsl libhts samtools bcftools vcftools
mkdir -p outputs/50freebayes_paeruginosa_pa14
gatk MarkDuplicates \
-I outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome.bam \
-O outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14_genome_deduplicated.bam \
-M outputs/50freebayes_paeruginosa_pa14/deduplication_stats.txt --REMOVE_DUPLICATES true --COMPRESSION_LEVEL 9 \
2>outputs/50freebayes_paeruginosa_pa14/deduplication.stderr \
1>outputs/50freebayes_paeruginosa_pa14/deduplication.stdout
echo "Finished gatk deduplication." >> outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stdout
samtools index outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14_genome_deduplicated.bam
echo "Finished samtools index." >> outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stdout
freebayes -f /home/trey/libraries/genome/paeruginosa_pa14.fasta \
-v outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.vcf \
outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14_genome_deduplicated.bam \
1>>outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stdout \
2>>outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stderr
echo "Finished freebayes." >> outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stdout
bcftools convert outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.vcf \
-Ob -o outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf \
2>>outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stderr \
1>>outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stdout
echo "Finished bcftools convert." >> outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stdout
bcftools index outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf \
2>>outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stderr \
1>>outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stdout
echo "Finished bcftools index." >> outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.stdout
rm outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.vcf
The result from the above freebayes script is a bcf containing the high-quality observed variants. The cyoa invocation also creates the following script, which will require a bit of explanation.
use Bio::Adventure::SNP;
my $result = $h->Bio::Adventure::SNP::SNP_Ratio_Worker(
input => 'outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf',
species => 'paeruginosa_pa14',
vcf_method => 'freebayes',
vcf_cutoff => '5',
vcf_minpct => '0.8',
gff_tag => 'Alias',
gff_type => 'gene',
output_dir => 'outputs/50freebayes_paeruginosa_pa14',
output => 'outputs/50freebayes_paeruginosa_pa14/all_tags.txt',
output_count => 'outputs/50freebayes_paeruginosa_pa14/count.txt',
output_genome => 'outputs/50freebayes_paeruginosa_pa14/modified.fasta',
output_by_gene => 'outputs/50freebayes_paeruginosa_pa14/variants_by_gene.txt',
output_pkm => 'outputs/50freebayes_paeruginosa_pa14/pkm.txt',
);
The function ‘SNP_Ratio_Worker()’ reads the reference genome, the set of variants, and the genome annotations in order to create a new copy of the genome (modified.fasta) which should be equivalent to the input reads. It also rewrites the bcf data into a matrix which is easier to play with in R/python (all_tags.txt). Finally, it uses the annotation information to explicitly show the amino acid substitutions observed in every ORF (variants_by_gene.txt). In theory it should also give an rpkm-esque count of the variants observed per ORF, but I turned that off because it doesn’t seem very useful and it is a little tricky to get right.
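To make the idea of the rewritten genome concrete, here is a toy sketch of applying single-base substitutions to a reference sequence. It only illustrates the concept and is not how SNP_Ratio_Worker() is implemented: the function name, the `position:base` argument format, and the example sequence are all invented, and indels are ignored.

```shell
#!/bin/sh
# Apply 1-based single-base substitutions to a sequence string.
# Usage: apply_snps SEQUENCE POS:ALT [POS:ALT ...]
# apply_snps and its argument format are hypothetical.
apply_snps() {
    sequence="$1"
    shift
    for snp in "$@"; do
        pos=${snp%%:*}     # text before the colon: the position
        alt=${snp#*:}      # text after the colon: the replacement base
        sequence=$(printf '%s\n' "$sequence" | awk -v p="$pos" -v a="$alt" \
            '{ print substr($0, 1, p - 1) a substr($0, p + 1) }')
    done
    printf '%s\n' "$sequence"
}

# Toy reference with substitutions at positions 2 and 5.
apply_snps ACGTACGT 2:T 5:G
```

For the toy input this prints `ATGTGCGT`. A real implementation also has to handle insertions and deletions, which shift every downstream coordinate.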
In order to play further with the data, I will need a sample sheet. So I will start out by creating a blank one in excel (libreoffice) which contains only the sample names in the same format as my directories in preprocessing/.
Once completed, I can use it as the input for my hpgltools package and it should extract the interesting information from the preprocessing logs and fill out the sample sheet accordingly. Let’s see if it works!
Here is the before:
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
| | sampleid | parent | condition | batch |
|---|---|---|---|---|
| PA14_exoUTY | PA14_exoUTY | PA14 | undefined | undefined |
| PA14_JC | PA14_JC | PA14 | undefined | undefined |
| PA14_lux | PA14_lux | PA14 | undefined | undefined |
| PA14_NBH | PA14_NBH | PA14 | undefined | undefined |
| PA14_pscD_A5 | PA14_pscD_A5 | PA14 | undefined | undefined |
| PA14_pscD_E4 | PA14_pscD_E4 | PA14 | undefined | undefined |
| PA14_xcp | PA14_xcp | PA14 | undefined | undefined |
| PA14_xcp_pscD | PA14_xcp_pscD | PA14 | undefined | undefined |
| PAK | PAK | PAK | undefined | undefined |
| PAK_pscC | PAK_pscC | PAK | undefined | undefined |
| PAK_xcp | PAK_xcp | PAK | undefined | undefined |
| PAK_xcp_pscC | PAK_xcp_pscC | PAK | undefined | undefined |
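A starting sheet like the one above could also be generated from the directory names rather than typed by hand. This is only a sketch: the sheet used here was made in a spreadsheet program, the CSV layout below is an assumption rather than something hpgltools requires, and the `parent` value is simply taken as the text before the first underscore, which matches the table above.

```shell
#!/bin/sh
# Print a minimal CSV sample sheet (sampleid,parent) for every PA14*/PAK*
# directory under the given directory. make_sheet is a hypothetical helper.
make_sheet() {
    echo "sampleid,parent"
    for d in "$1"/PA14* "$1"/PAK*; do
        [ -d "$d" ] || continue        # skip unmatched globs and plain files
        id=$(basename "$d")
        echo "${id},${id%%_*}"         # parent: text before the first '_'
    done
}

# Demonstration in a scratch directory with three of the sample names.
scratch=$(mktemp -d)
mkdir "$scratch/PA14_exoUTY" "$scratch/PAK" "$scratch/PAK_xcp_pscC"
make_sheet "$scratch"
rm -rf "$scratch"
```

In this project the call would be `make_sheet preprocessing > sample_sheets/all_samples.csv`, followed by a conversion to xlsx if the downstream tools need that format.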
Like I said, not much going on. Let’s see what it looks like after I run the gatherer on it… (Note, I have been meaning to change this to drop the unused columns, but have not done so yet).
spec <- make_dnaseq_spec()
queried_species <- c("paeruginosa_pak", "paeruginosa_pa01", "paeruginosa_pa14")
modified <- sm(gather_preprocessing_metadata("sample_sheets/all_samples.xlsx",
species = queried_species, verbose = FALSE,
specification = spec))
knitr::kable(extract_metadata("sample_sheets/all_samples_modified.xlsx"))
rownames | sampleid | parent | condition | batch | trimomaticinput | trimomaticoutput | trimomaticpercent | fastqcpctgc | hisatgenomesingleconcordantpaeruginosapak | hisatgenomesingleconcordantpaeruginosapa01 | hisatgenomesingleconcordantpaeruginosapa14 | hisatgenomemulticoncordantpaeruginosapak | hisatgenomemulticoncordantpaeruginosapa01 | hisatgenomemulticoncordantpaeruginosapa14 | hisatgenomesingleallpaeruginosapak | hisatgenomesingleallpaeruginosapa01 | hisatgenomesingleallpaeruginosapa14 | hisatgenomemultiallpaeruginosapak | hisatgenomemultiallpaeruginosapa01 | hisatgenomemultiallpaeruginosapa14 | hisatgenomepercentlogpaeruginosapak | hisatgenomepercentlogpaeruginosapa01 | hisatgenomepercentlogpaeruginosapa14 | gatkunpairedpaeruginosapak | gatkunpairedpaeruginosapa01 | gatkunpairedpaeruginosapa14 | gatkpairedpaeruginosapak | gatkpairedpaeruginosapa01 | gatkpairedpaeruginosapa14 | gatksupplementarypaeruginosapak | gatksupplementarypaeruginosapa01 | gatksupplementarypaeruginosapa14 | gatkunmappedpaeruginosapak | gatkunmappedpaeruginosapa01 | gatkunmappedpaeruginosapa14 | gatkunpairedduplicatespaeruginosapak | gatkunpairedduplicatespaeruginosapa01 | gatkunpairedduplicatespaeruginosapa14 | gatkpairedduplicatespaeruginosapak | gatkpairedduplicatespaeruginosapa01 | gatkpairedduplicatespaeruginosapa14 | gatkpairedoptduplicatespaeruginosapak | gatkpairedoptduplicatespaeruginosapa01 | gatkpairedoptduplicatespaeruginosapa14 | gatkduplicatepctpaeruginosapak | gatkduplicatepctpaeruginosapa01 | gatkduplicatepctpaeruginosapa14 | gatklibsizepaeruginosapak | gatklibsizepaeruginosapa01 | gatklibsizepaeruginosapa14 | freebayesobservedpaeruginosapak | freebayesobservedpaeruginosapa01 | freebayesobservedpaeruginosapa14 | freebayesobservedfilepaeruginosapak | freebayesobservedfilepaeruginosapa01 | freebayesobservedfilepaeruginosapa14 | hisatcounttablepaeruginosapak | hisatcounttablepaeruginosapa01 | hisatcounttablepaeruginosapa14 | deduplicationstatspaeruginosapak | deduplicationstatspaeruginosapa01 | deduplicationstatspaeruginosapa14 | freebayesvariantsbygenepaeruginosapak | freebayesvariantsbygenepaeruginosapa01 | freebayesvariantsbygenepaeruginosapa14 | freebayesvariantstablepaeruginosapak | freebayesvariantstablepaeruginosapa01 | freebayesvariantstablepaeruginosapa14 | freebayesmodifiedgenomepaeruginosapak | freebayesmodifiedgenomepaeruginosapa14 | freebayesbcffilepaeruginosapak | freebayesbcffilepaeruginosapa01 | freebayesbcffilepaeruginosapa14 | freebayespenetrancefilepaeruginosapak | freebayespenetrancefilepaeruginosapa01 | freebayespenetrancefilepaeruginosapa14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PA14_exoUTY | PA14_exoUTY | PA14_exoUTY | PA14 | undefined | undefined | 4706372 | 4350712 | 0.924 | NA | 3690737 | NA | 4280240 | 664 | NA | 25722 | 307644 | NA | 34468 | 513 | NA | 562 | 88.56 | NA | 99.60 | NA | NA | 0 | NA | NA | 4305962 | NA | NA | 83320 | NA | NA | 0 | NA | NA | 0 | NA | NA | 823232 | NA | NA | 81874 | NA | NA | 0.1912 | NA | NA | 10580352 | NA | NA | 199 | preprocessing/PA14_exoUTY/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_exoUTY/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PA14_exoUTY/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.count.xz | preprocessing/PA14_exoUTY/outputs/50freebayes_paeruginosa_pa14/deduplication_stats.txt | preprocessing/PA14_exoUTY/outputs/50freebayes_paeruginosa_pa14/variants_by_gene.txt.xz | preprocessing/PA14_exoUTY/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_exoUTY/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14-PA14_exoUTY.fasta | preprocessing/PA14_exoUTY/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf | preprocessing/PA14_exoUTY/outputs/50freebayes_paeruginosa_pa14/variants_penetrance.txt.xz | ||||||||||||||
PA14_JC | PA14_JC | PA14_JC | PA14 | undefined | undefined | 5786839 | 5336197 | 0.922 | 66 | 4550108 | NA | 5275687 | 1028 | NA | 32239 | 330047 | NA | 23521 | 586 | NA | 421 | 88.46 | NA | 99.77 | NA | NA | 0 | NA | NA | 5307926 | NA | NA | 107378 | NA | NA | 0 | NA | NA | 0 | NA | NA | 1127763 | NA | NA | 103875 | NA | NA | 0.2125 | NA | NA | 11426698 | NA | NA | 196 | preprocessing/PA14_JC/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_JC/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PA14_JC/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.count.xz | preprocessing/PA14_JC/outputs/50freebayes_paeruginosa_pa14/deduplication_stats.txt | preprocessing/PA14_JC/outputs/50freebayes_paeruginosa_pa14/variants_by_gene.txt.xz | preprocessing/PA14_JC/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_JC/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14-PA14_JC.fasta | preprocessing/PA14_JC/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf | preprocessing/PA14_JC/outputs/50freebayes_paeruginosa_pa14/variants_penetrance.txt.xz | ||||||||||||||
PA14_lux | PA14_lux | PA14_lux | PA14 | undefined | undefined | 6622570 | 6099776 | 0.921 | NA | 5205608 | NA | 6028065 | 999 | NA | 36827 | 354183 | NA | 24917 | 645 | NA | 461 | 88.33 | NA | 99.70 | NA | NA | 0 | NA | NA | 6064892 | NA | NA | 121596 | NA | NA | 0 | NA | NA | 0 | NA | NA | 1432296 | NA | NA | 127776 | NA | NA | 0.2362 | NA | NA | 11448973 | NA | NA | 197 | preprocessing/PA14_lux/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_lux/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PA14_lux/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.count.xz | preprocessing/PA14_lux/outputs/50freebayes_paeruginosa_pa14/deduplication_stats.txt | preprocessing/PA14_lux/outputs/50freebayes_paeruginosa_pa14/variants_by_gene.txt.xz | preprocessing/PA14_lux/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_lux/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14-PA14_lux.fasta | preprocessing/PA14_lux/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf | preprocessing/PA14_lux/outputs/50freebayes_paeruginosa_pa14/variants_penetrance.txt.xz | ||||||||||||||
PA14_NBH | PA14_NBH | PA14_NBH | PA14 | undefined | undefined | 5151127 | 4581433 | 0.889 | NA | 3883544 | NA | 4516421 | 711 | NA | 26915 | 313829 | NA | 32560 | 547 | NA | 460 | 88.31 | NA | 99.64 | NA | NA | 0 | NA | NA | 4543336 | NA | NA | 88454 | NA | NA | 0 | NA | NA | 0 | NA | NA | 896720 | NA | NA | 83669 | NA | NA | 0.1974 | NA | NA | 10694127 | NA | NA | 196 | preprocessing/PA14_NBH/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_NBH/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PA14_NBH/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.count.xz | preprocessing/PA14_NBH/outputs/50freebayes_paeruginosa_pa14/deduplication_stats.txt | preprocessing/PA14_NBH/outputs/50freebayes_paeruginosa_pa14/variants_by_gene.txt.xz | preprocessing/PA14_NBH/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_NBH/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14-PA14_NBH.fasta | preprocessing/PA14_NBH/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf | preprocessing/PA14_NBH/outputs/50freebayes_paeruginosa_pa14/variants_penetrance.txt.xz | ||||||||||||||
PA14_pscD_A5 | PA14_pscD_A5 | PA14_pscD_A5 | PA14 | undefined | undefined | 5898210 | 5417082 | 0.918 | NA | 4579807 | NA | 5359077 | 989 | NA | 32852 | 338340 | NA | 21806 | 664 | NA | 440 | 87.75 | NA | 99.79 | NA | NA | 0 | NA | NA | 5391929 | NA | NA | 110988 | NA | NA | 0 | NA | NA | 0 | NA | NA | 1134595 | NA | NA | 110041 | NA | NA | 0.2104 | NA | NA | 11790543 | NA | NA | 204 | preprocessing/PA14_pscD_A5/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_pscD_A5/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PA14_pscD_A5/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.count.xz | preprocessing/PA14_pscD_A5/outputs/50freebayes_paeruginosa_pa14/deduplication_stats.txt | preprocessing/PA14_pscD_A5/outputs/50freebayes_paeruginosa_pa14/variants_by_gene.txt.xz | preprocessing/PA14_pscD_A5/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_pscD_A5/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14-PA14_pscD_A5.fasta | preprocessing/PA14_pscD_A5/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf | preprocessing/PA14_pscD_A5/outputs/50freebayes_paeruginosa_pa14/variants_penetrance.txt.xz | ||||||||||||||
PA14_pscD_E4 | PA14_pscD_E4 | PA14_pscD_E4 | PA14 | undefined | undefined | 5854559 | 5418227 | 0.925 | NA | 4589839 | NA | 5361935 | 920 | NA | 33424 | 325823 | NA | 19561 | 572 | NA | 387 | 87.80 | NA | 99.81 | NA | NA | 0 | NA | NA | 5395359 | NA | NA | 112228 | NA | NA | 0 | NA | NA | 0 | NA | NA | 1204336 | NA | NA | 118910 | NA | NA | 0.2232 | NA | NA | 10998073 | NA | NA | 203 | preprocessing/PA14_pscD_E4/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_pscD_E4/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PA14_pscD_E4/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.count.xz | preprocessing/PA14_pscD_E4/outputs/50freebayes_paeruginosa_pa14/deduplication_stats.txt | preprocessing/PA14_pscD_E4/outputs/50freebayes_paeruginosa_pa14/variants_by_gene.txt.xz | preprocessing/PA14_pscD_E4/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_pscD_E4/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14-PA14_pscD_E4.fasta | preprocessing/PA14_pscD_E4/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf | preprocessing/PA14_pscD_E4/outputs/50freebayes_paeruginosa_pa14/variants_penetrance.txt.xz | ||||||||||||||
PA14_xcp | PA14_xcp | PA14_xcp | PA14 | undefined | undefined | 5683132 | 5214875 | 0.918 | NA | 4430035 | NA | 5134054 | 1300 | NA | 30352 | 356590 | NA | 39479 | 588 | NA | 589 | 88.55 | NA | 99.62 | NA | NA | 0 | NA | NA | 5164406 | NA | NA | 97676 | NA | NA | 0 | NA | NA | 0 | NA | NA | 1092613 | NA | NA | 103935 | NA | NA | 0.2116 | NA | NA | 11202466 | NA | NA | 195 | preprocessing/PA14_xcp/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_xcp/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PA14_xcp/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.count.xz | preprocessing/PA14_xcp/outputs/50freebayes_paeruginosa_pa14/deduplication_stats.txt | preprocessing/PA14_xcp/outputs/50freebayes_paeruginosa_pa14/variants_by_gene.txt.xz | preprocessing/PA14_xcp/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_xcp/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14-PA14_xcp.fasta | preprocessing/PA14_xcp/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf | preprocessing/PA14_xcp/outputs/50freebayes_paeruginosa_pa14/variants_penetrance.txt.xz | ||||||||||||||
PA14_xcp_pscD | PA14_xcp_pscD | PA14_xcp_pscD | PA14 | undefined | undefined | 2026150 | 1514509 | 0.747 | NA | 1223766 | NA | 1471930 | 191 | NA | 10238 | 137765 | NA | 27094 | 182 | NA | 367 | 85.66 | NA | 99.16 | NA | NA | 0 | NA | NA | 1482168 | NA | NA | 41804 | NA | NA | 0 | NA | NA | 0 | NA | NA | 188223 | NA | NA | 20007 | NA | NA | 0.1270 | NA | NA | 5857317 | NA | NA | 216 | preprocessing/PA14_xcp_pscD/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_xcp_pscD/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PA14_xcp_pscD/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sreverse_all_Alias.count.xz | preprocessing/PA14_xcp_pscD/outputs/50freebayes_paeruginosa_pa14/deduplication_stats.txt | preprocessing/PA14_xcp_pscD/outputs/50freebayes_paeruginosa_pa14/variants_by_gene.txt.xz | preprocessing/PA14_xcp_pscD/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz | preprocessing/PA14_xcp_pscD/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14-PA14_xcp_pscD.fasta | preprocessing/PA14_xcp_pscD/outputs/50freebayes_paeruginosa_pa14/paeruginosa_pa14.bcf | preprocessing/PA14_xcp_pscD/outputs/50freebayes_paeruginosa_pa14/variants_penetrance.txt.xz | ||||||||||||||
PAK | PAK | PAK | PAK | undefined | undefined | 4779558 | 4318745 | 0.904 | NA | 4049183 | 3925784 | 3731781 | 1093 | 22836 | 19482 | 179265 | 179736 | 308170 | 455 | 2255 | 2452 | 96.71 | 94.27 | 91.10 | 168494 | 0 | NA | 4092362 | 3948620 | NA | 3393 | 99172 | NA | 284272 | 0 | NA | 126668 | 0 | NA | 902830 | 873059 | NA | 98051 | 94715 | NA | 0.2313 | 0.2211 | NA | 8530650 | 8207871 | NA | 333 | 26786 | NA | preprocessing/PAK/outputs/50freebayes_paeruginosa_pak/all_tags.txt.xz | preprocessing/PAK/outputs/50freebayes_paeruginosa_pa01/all_tags.txt.xz | preprocessing/PAK/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sreverse_all_locus_tag.count.xz | preprocessing/PAK/outputs/40hisat2_paeruginosa_pa01/paeruginosa_pa01_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PAK/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sno_gene_Alias.count.xz | preprocessing/PAK/outputs/50freebayes_paeruginosa_pak/deduplication_stats.txt | preprocessing/PAK/outputs/50freebayes_paeruginosa_pa01/deduplication_stats.txt | preprocessing/PAK/outputs/50freebayes_paeruginosa_pak/variants_by_gene.txt.xz | preprocessing/PAK/outputs/50freebayes_paeruginosa_pa01/variants_by_gene.txt.xz | preprocessing/PAK/outputs/50freebayes_paeruginosa_pak/all_tags.txt.xz | preprocessing/PAK/outputs/50freebayes_paeruginosa_pa01/all_tags.txt.xz | preprocessing/PAK/outputs/50freebayes_paeruginosa_pak/paeruginosa_pak-PAK.fasta | preprocessing/PAK/outputs/50freebayes_paeruginosa_pak/paeruginosa_pak.bcf | preprocessing/PAK/outputs/50freebayes_paeruginosa_pa01/paeruginosa_pa01.bcf | preprocessing/PAK/outputs/50freebayes_paeruginosa_pak/variants_penetrance.txt.xz | preprocessing/PAK/outputs/50freebayes_paeruginosa_pa01/variants_penetrance.txt.xz | |||||||
PAK_pscC | PAK_pscC | PAK_pscC | PAK | undefined | undefined | 5734960 | 5271470 | 0.919 | NA | 5090759 | 4853631 | 4638113 | 1343 | 29403 | 25963 | 116625 | 126483 | 266811 | 475 | 1937 | 2341 | 97.78 | 93.94 | 91.15 | 115014 | 0 | NA | 5097071 | 4883034 | NA | 4081 | 142224 | NA | 233784 | 0 | NA | 92389 | 0 | NA | 1102959 | 1056234 | NA | 106532 | 102207 | NA | 0.2229 | 0.2163 | NA | 10771706 | 10325724 | NA | 373 | 26856 | NA | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pak/all_tags.txt.xz | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pa01/all_tags.txt.xz | preprocessing/PAK_pscC/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sreverse_gene_locus_tag.count.xz | preprocessing/PAK_pscC/outputs/40hisat2_paeruginosa_pa01/paeruginosa_pa01_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PAK_pscC/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sno_gene_Alias.count.xz | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pak/deduplication_stats.txt | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pa01/deduplication_stats.txt | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pak/variants_by_gene.txt.xz | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pa01/variants_by_gene.txt.xz | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pak/all_tags.txt.xz | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pa01/all_tags.txt.xz | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pak/paeruginosa_pak-PAK_pscC.fasta | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pak/paeruginosa_pak.bcf | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pa01/paeruginosa_pa01.bcf | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pak/variants_penetrance.txt.xz | preprocessing/PAK_pscC/outputs/50freebayes_paeruginosa_pa01/variants_penetrance.txt.xz | |||||||
PAK_xcp | PAK_xcp | PAK_xcp | PAK | undefined | undefined | 4843414 | 4443669 | 0.917 | NA | 4293814 | 4088322 | 3904688 | 977 | 24341 | 21592 | 96933 | 110085 | 229330 | 394 | 1787 | 2140 | 97.82 | 93.90 | 91.07 | 95387 | 0 | NA | 4299065 | 4112663 | NA | 3145 | 116582 | NA | 193821 | 0 | NA | 73985 | 0 | NA | 889293 | 849827 | NA | 90548 | 86772 | NA | 0.2131 | 0.2066 | NA | 9634785 | 9231062 | NA | 363 | 26756 | NA | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pak/all_tags.txt.xz | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pa01/all_tags.txt.xz | preprocessing/PAK_xcp/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sreverse_gene_locus_tag.count.xz | preprocessing/PAK_xcp/outputs/40hisat2_paeruginosa_pa01/paeruginosa_pa01_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PAK_xcp/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sno_gene_Alias.count.xz | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pak/deduplication_stats.txt | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pa01/deduplication_stats.txt | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pak/variants_by_gene.txt.xz | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pa01/variants_by_gene.txt.xz | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pak/all_tags.txt.xz | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pa01/all_tags.txt.xz | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pak/paeruginosa_pak-PAK_xcp.fasta | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pak/paeruginosa_pak.bcf | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pa01/paeruginosa_pa01.bcf | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pak/variants_penetrance.txt.xz | preprocessing/PAK_xcp/outputs/50freebayes_paeruginosa_pa01/variants_penetrance.txt.xz | |||||||
PAK_xcp_pscC | PAK_xcp_pscC | PAK_xcp_pscC | PAK | undefined | undefined | 5195158 | 4611474 | 0.888 | NA | 4344138 | 4220150 | 4008749 | 1070 | 23899 | 20629 | 177601 | 174592 | 308658 | 400 | 2279 | 2409 | 96.74 | 94.46 | 91.21 | 168983 | 0 | NA | 4376720 | 4244049 | NA | 3164 | 102714 | NA | 300525 | 0 | NA | 129907 | 0 | NA | 943947 | 917470 | NA | 92772 | 89839 | NA | 0.2261 | 0.2162 | NA | 9299455 | 8989454 | NA | 384 | 26949 | NA | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pak/all_tags.txt.xz | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pa01/all_tags.txt.xz | preprocessing/PAK_xcp_pscC/outputs/40hisat2_paeruginosa_pak/paeruginosa_pak_genome-paired_sreverse_gene_locus_tag.count.xz | preprocessing/PAK_xcp_pscC/outputs/40hisat2_paeruginosa_pa01/paeruginosa_pa01_genome-paired_sno_gene_locus_tag.count.xz | preprocessing/PAK_xcp_pscC/outputs/40hisat2_paeruginosa_pa14/paeruginosa_pa14_genome-paired_sno_gene_Alias.count.xz | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pak/deduplication_stats.txt | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pa01/deduplication_stats.txt | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pak/variants_by_gene.txt.xz | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pa01/variants_by_gene.txt.xz | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pak/all_tags.txt.xz | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pa01/all_tags.txt.xz | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pak/paeruginosa_pak-PAK_xcp_pscC.fasta | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pak/paeruginosa_pak.bcf | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pa01/paeruginosa_pa01.bcf | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pak/variants_penetrance.txt.xz | preprocessing/PAK_xcp_pscC/outputs/50freebayes_paeruginosa_pa01/variants_penetrance.txt.xz |
I reran the missing PAK samples and looked through the logs. The PAK genome I downloaded may be of somewhat lower quality than the PA14 reference, and that could be skewing the results.
Let's go one small step further. I have a series of modified genomes as well as the references, so we can make a quick tree of them. First I will copy each modified genome to the tree/ directory and rename it to the sampleID.
start=$(pwd)
mkdir -p tree
cd preprocessing
# Copy every per-sample modified genome (PAK and PA14) into tree/.
for i in $(/bin/ls -d PA*); do
  cp "$i"/outputs/50*/paeruginosa_pak-*.fasta "${start}/tree/"
  cp "$i"/outputs/50*/paeruginosa_pa14-*.fasta "${start}/tree/"
done
cd "$start"
# Add the two reference genomes.
cp ~/libraries/genome/paeruginosa_pa14.fa "${start}/tree"
cp ~/libraries/genome/paeruginosa_pak.fa "${start}/tree"
Oh, it turns out that at the time of this writing, I forgot to run 3 samples, so this section will need to be redone. But I can at least run it for the samples that I didn’t forget.
## Error in genomic_sequence_phylo("tree", root = "paeruginosa_pa14"): could not find function "genomic_sequence_phylo"
## Error in eval(expr, envir, enclos): object 'funkytown' not found
In theory the hisat counts are not very interesting for DNAseq data, but in this instance we want to see the coverage of the knockouts.
pa14_annot <- load_gff_annotations("~/libraries/genome/paeruginosa_pa14.gff",
type = "gene", id_col = "Alias")
## Returning a df with 16 columns and 5979 rows.
rownames(pa14_annot) <- pa14_annot[["Alias"]]
pa14_expt <- create_expt("sample_sheets/all_samples_modified.xlsx", gene_info = pa14_annot,
file_column = "hisatcounttablepaeruginosapa14")
## Reading the sample metadata.
## The sample definitions comprises: 12 rows(samples) and 77 columns(metadata fields).
## Matched 5979 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 5979 features and 12 samples.
## Deleting the file excel/pa14_strains.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 1758 entries are 0. We are on a log scale, adding 1 to the data.
##
## Changed 1758 zero count features.
##
## Naively calculating coefficient of variation/dispersion with respect to condition.
##
## Finished calculating dispersion estimates.
##
## `geom_smooth()` using formula = 'y ~ x'
## The expressionset has a minimal or missing set of conditions/batches.
##
## `geom_smooth()` using formula = 'y ~ x'
pak_annot <- load_gff_annotations("~/libraries/genome/paeruginosa_pak.gff", type = "gene", id_col = "locus_tag")
## Returning a df with 35 columns and 5871 rows.
rownames(pak_annot) <- pak_annot[["locus_tag"]]
pak_expt <- create_expt("sample_sheets/all_samples_modified.xlsx", file_column = "hisatcounttablepaeruginosapak")
## Reading the sample metadata.
## The sample definitions comprises: 12 rows(samples) and 77 columns(metadata fields).
## Matched 5871 annotations and counts.
## Bringing together the count matrix and gene information.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 5871 features and 12 samples.
## Writing the first sheet, containing a legend and some summary data.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 2527 entries are 0. We are on a log scale, adding 1 to the data.
## Changed 2527 zero count features.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## `geom_smooth()` using formula = 'y ~ x'
## The expressionset has a minimal or missing set of conditions/batches.
## `geom_smooth()` using formula = 'y ~ x'
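Since the point of these count tables is to check coverage of the knockouts, one way to use them is to flag genes with no reads in one sample but normal coverage in the others. What follows is only a minimal sketch on a made-up count matrix. The function name `flag_uncovered_genes`, both thresholds, and the toy gene and sample names are assumptions for illustration and are not part of hpgltools.

```r
# Hypothetical helper: flag genes that look deleted in one sample.
# counts: numeric matrix with genes in rows and samples in columns.
flag_uncovered_genes <- function(counts, max_count = 0, min_other_median = 10) {
  hits <- list()
  for (s in colnames(counts)) {
    # Median coverage of each gene across all the other samples.
    others <- counts[, colnames(counts) != s, drop = FALSE]
    other_median <- apply(others, 1, median)
    low_here <- counts[, s] <= max_count
    covered_elsewhere <- other_median >= min_other_median
    hits[[s]] <- rownames(counts)[low_here & covered_elsewhere]
  }
  hits
}

# Toy data: geneB is missing only in ko1, geneD only in ko2,
# and geneC has no reads anywhere (so it is not informative).
toy <- matrix(c(120, 115, 130,
                 95,   0, 101,
                  0,   0,   0,
                210, 190,   0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                              c("wt", "ko1", "ko2")))
candidates <- flag_uncovered_genes(toy)
print(candidates)
```

Genes that are zero in every sample are deliberately skipped, because a zero everywhere says more about mappability or annotation than about a knockout.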
pa14_variants <- pData(pa14_expt)[["variantspenetrancefilepaeruginosapa14"]]
names(pa14_variants) <- rownames(pData(pa14_expt))
## Error in names(pa14_variants) <- rownames(pData(pa14_expt)): attempt to set an attribute on NULL
start <- init_xlsx(excel = "excel/pa14_variants.xlsx")
wb <- start[["wb"]]
for (s in seq_along(pa14_variants)) {
  sample_name <- names(pa14_variants)[[s]]
  if (pa14_variants[[s]] == "") {
    next
  }
  sample_data <- readr::read_tsv(pa14_variants[[s]])
  if (nrow(sample_data) == 0) {
    next
  }
  written <- write_xlsx(data = sample_data, sheet = sample_name, wb = wb)
}
saved <- openxlsx::saveWorkbook(written[["workbook"]], file = "excel/pa14_variants.xlsx")
## Error in eval(expr, envir, enclos): object 'written' not found
pak_variants <- pData(pak_expt)[["variantspenetrancefilepaeruginosapak"]]
names(pak_variants) <- rownames(pData(pak_expt))
## Error in names(pak_variants) <- rownames(pData(pak_expt)): attempt to set an attribute on NULL
start <- init_xlsx(excel = "excel/pak_variants.xlsx")
wb <- start[["wb"]]
for (s in seq_along(pak_variants)) {
  sample_name <- names(pak_variants)[[s]]
  if (pak_variants[[s]] == "") {
    next
  }
  sample_data <- readr::read_tsv(pak_variants[[s]])
  if (nrow(sample_data) == 0) {
    next
  }
  written <- write_xlsx(data = sample_data, sheet = sample_name, wb = wb)
}
saved <- openxlsx::saveWorkbook(written[["workbook"]], file = "excel/pak_variants.xlsx")
## Error in eval(expr, envir, enclos): object 'written' not found
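The "attempt to set an attribute on NULL" errors above most likely mean the requested column is not in the sample sheet, so `pData()` hands back NULL and everything downstream fails. Here is a minimal sketch of a check that fails early and suggests similarly named columns. The helper name `get_metadata_column` is hypothetical, the toy data frame only stands in for `pData(pa14_expt)`, and its column names are made up.

```r
# Hypothetical helper: fetch a metadata column or fail with a useful message.
get_metadata_column <- function(meta, column, hint = NULL) {
  if (!column %in% colnames(meta)) {
    similar <- character(0)
    if (!is.null(hint)) {
      similar <- grep(hint, colnames(meta), value = TRUE)
    }
    stop("Column '", column, "' is not in the metadata. Similar columns: ",
         if (length(similar) > 0) toString(similar) else "none found")
  }
  values <- meta[[column]]
  names(values) <- rownames(meta)
  # Drop samples with no file recorded.
  values[!is.na(values) & values != ""]
}

# Toy stand-in for pData(pa14_expt); these column names are made up.
toy_meta <- data.frame(
  freebayespenetrancefile = c("a/variants_penetrance.txt.xz", "", NA),
  condition = c("wt", "ko", "ko"),
  row.names = c("s1", "s2", "s3"),
  stringsAsFactors = FALSE)

files <- get_metadata_column(toy_meta, "freebayespenetrancefile")
print(files)

# Asking for a column that does not exist reports the near misses.
result <- tryCatch(
  get_metadata_column(toy_meta, "variantspenetrancefile", hint = "penetrance"),
  error = function(e) conditionMessage(e))
print(result)
```

Filtering out the empty entries up front also means the writing loop never sees a blank path, so the `written` object should exist whenever at least one sample has a file.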
In this following block we will instead write out the nt/aa mutations of CDS/proteins.
pa14_mutations <- pData(pa14_expt)[["variantsbygenefilepaeruginosapa14"]]
names(pa14_mutations) <- rownames(pData(pa14_expt))
## Error in names(pa14_mutations) <- rownames(pData(pa14_expt)): attempt to set an attribute on NULL
start <- init_xlsx(excel = "excel/pa14_mutations.xlsx")
wb <- start[["wb"]]
for (s in seq_along(pa14_mutations)) {
  sample_name <- names(pa14_mutations)[[s]]
  if (pa14_mutations[[s]] == "") {
    next
  }
  sample_data <- readr::read_tsv(pa14_mutations[[s]])
  if (nrow(sample_data) == 0) {
    next
  }
  written <- write_xlsx(data = sample_data, sheet = sample_name, wb = wb)
}
saved <- openxlsx::saveWorkbook(written[["workbook"]], file = "excel/pa14_mutations.xlsx")
## Error in eval(expr, envir, enclos): object 'written' not found
pak_mutations <- pData(pak_expt)[["variantsbygenefilepaeruginosapak"]]
names(pak_mutations) <- rownames(pData(pak_expt))
## Error in names(pak_mutations) <- rownames(pData(pak_expt)): attempt to set an attribute on NULL
start <- init_xlsx(excel = "excel/pak_mutations.xlsx")
wb <- start[["wb"]]
for (s in seq_along(pak_mutations)) {
  sample_name <- names(pak_mutations)[[s]]
  if (pak_mutations[[s]] == "") {
    next
  }
  sample_data <- readr::read_tsv(pak_mutations[[s]])
  if (nrow(sample_data) == 0) {
    next
  }
  written <- write_xlsx(data = sample_data, sheet = sample_name, wb = wb)
}
saved <- openxlsx::saveWorkbook(written[["workbook"]], file = "excel/pak_mutations.xlsx")
## Error in eval(expr, envir, enclos): object 'written' not found
spec <- make_dnaseq_spec()
modified_exoUTY <- gather_preprocessing_metadata("sample_sheets/exoUTY_224_samples.xlsx",
species = "paeruginosa_exoUTY_224", verbose = FALSE,
specification = spec)
## Error in `colnames<-`(`*tmp*`, value = tolower(gsub(pattern = "[[:punct:]]", : attempt to set 'colnames' on an object with less than two dimensions
modified_pscd <- sm(gather_preprocessing_metadata("sample_sheets/pscd_222_samples.xlsx",
species = "paeruginosa_pscd_222", verbose = FALSE,
specification = spec))
## Error in `colnames<-`(`*tmp*`, value = tolower(gsub(pattern = "[[:punct:]]", : attempt to set 'colnames' on an object with less than two dimensions
modified_wt <- sm(gather_preprocessing_metadata("sample_sheets/wt_221_samples.xlsx",
species = "paeruginosa_wt_221", verbose = FALSE,
specification = spec))
## Error in `colnames<-`(`*tmp*`, value = tolower(gsub(pattern = "[[:punct:]]", : attempt to set 'colnames' on an object with less than two dimensions
## Error in eval(expr, envir, enclos): object 'modified_exoUTY' not found
## Error in eval(expr, envir, enclos): object 'modified_pscd' not found
## Error in eval(expr, envir, enclos): object 'modified_wt' not found
As the above suggests but does not explicitly state, I mapped some of the samples against multiple potential parental strains to make it clear whether specific genes are observed. Hence these last three expressionsets. Now write them out so that Vince can check the mapping results with respect to these non-standard references.
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'samples_exo' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'samples_pscd' not found
## Deleting the file excel/samples_vs_wt_reference.xlsx before writing the tables.
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'samples_wt' not found
nbh_tags <- "preprocessing/PA14_NBH/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz"
jc_tags <- "preprocessing/PA14_JC/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz"
a5_tags <- "preprocessing/PA14_pscD_A5/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz"
e4_tags <- "preprocessing/PA14_pscD_E4/outputs/50freebayes_paeruginosa_pa14/all_tags.txt.xz"
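nbh_in <- as.data.frame(readr::read_tsv(nbh_tags)[, c(1,2,3)])
jc_in <- as.data.frame(readr::read_tsv(jc_tags)[, c(1,2,3)])
a5_in <- as.data.frame(readr::read_tsv(a5_tags)[, c(1,2,3)])
e4_in <- as.data.frame(readr::read_tsv(e4_tags)[, c(1,2,3)])
dim(nbh_in)
dim(jc_in)
dim(a5_in)
dim(e4_in)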
## New names:
## Rows: 194 Columns: 47
## ── Column specification
## ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Delimiter: "\t" chr
## (13): position, AF, PAO, PQA, SAP, AB, ABP, RPP, RPR, EPP, DPRA, TYPE, CIGAR dbl (21): NS, DP...3, DPB, AN, RO...8, PRO, QR...12, PQR, SRF, SRR,
## SRP, RPPR, EPPR, ODDS, GTI, NUMALT, MQMR, PAIREDR, DP...42, RO...43, QR...44 num (13): AC, AO...9, QA...13, SAF, SAR, RUN, RPL, LEN, MQM, PAIRED,
## AO...45, QA...46, MEANALT
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this
## message.
## • `DP` -> `DP...3`
## • `RO` -> `RO...8`
## • `AO` -> `AO...9`
## • `QR` -> `QR...12`
## • `QA` -> `QA...13`
## • `DP` -> `DP...42`
## • `RO` -> `RO...43`
## • `QR` -> `QR...44`
## • `AO` -> `AO...45`
## • `QA` -> `QA...46`
## New names:
## Rows: 194 Columns: 47
## ── Column specification
## ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Delimiter: "\t" chr
## (13): position, AF, PAO, PQA, SAP, AB, ABP, RPP, RPR, EPP, DPRA, TYPE, CIGAR dbl (21): NS, DP...3, DPB, AN, RO...8, PRO, QR...12, PQR, SRF, SRR,
## SRP, RPPR, EPPR, ODDS, GTI, NUMALT, MQMR, PAIREDR, DP...42, RO...43, QR...44 num (13): AC, AO...9, QA...13, SAF, SAR, RUN, RPL, LEN, MQM, PAIRED,
## AO...45, QA...46, MEANALT
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this
## message.
## • `DP` -> `DP...3`
## • `RO` -> `RO...8`
## • `AO` -> `AO...9`
## • `QR` -> `QR...12`
## • `QA` -> `QA...13`
## • `DP` -> `DP...42`
## • `RO` -> `RO...43`
## • `QR` -> `QR...44`
## • `AO` -> `AO...45`
## • `QA` -> `QA...46`
## New names:
## Rows: 202 Columns: 47
## ── Column specification
## ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Delimiter: "\t" chr
## (13): position, AF, PAO, PQA, SAP, AB, ABP, RPP, RPR, EPP, DPRA, TYPE, CIGAR dbl (21): NS, DP...3, DPB, AN, RO...8, PRO, QR...12, PQR, SRF, SRR,
## SRP, RPPR, EPPR, ODDS, GTI, NUMALT, MQMR, PAIREDR, DP...42, RO...43, QR...44 num (13): AC, AO...9, QA...13, SAF, SAR, RUN, RPL, LEN, MQM, PAIRED,
## AO...45, QA...46, MEANALT
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this
## message.
## • `DP` -> `DP...3`
## • `RO` -> `RO...8`
## • `AO` -> `AO...9`
## • `QR` -> `QR...12`
## • `QA` -> `QA...13`
## • `DP` -> `DP...42`
## • `RO` -> `RO...43`
## • `QR` -> `QR...44`
## • `AO` -> `AO...45`
## • `QA` -> `QA...46`
## New names:
## Rows: 201 Columns: 47
## ── Column specification
## ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Delimiter: "\t" chr
## (3): position, TYPE, CIGAR dbl (44): NS, DP...3, DPB, AC, AN, AF, RO...8, AO...9, PRO, PAO, QR...12, QA...13, PQR, PQA, SRF, SRR, SAF, SAR, SRP,
## SAP, AB, ABP, RUN, RPP, R...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this
## message.
## • `DP` -> `DP...3`
## • `RO` -> `RO...8`
## • `AO` -> `AO...9`
## • `QR` -> `QR...12`
## • `QA` -> `QA...13`
## • `DP` -> `DP...42`
## • `RO` -> `RO...43`
## • `QR` -> `QR...44`
## • `AO` -> `AO...45`
## • `QA` -> `QA...46`
## [1] 194 3
## [1] 194 3
## [1] 202 3
## [1] 201 3
shared_nbh <- nbh_in[["position"]] %in% jc_in[["position"]]
shared_jc <- jc_in[["position"]] %in% nbh_in[["position"]]
nbh_in[!shared_nbh, "position"]
## [1] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_732792_ref_A_alt_G" "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_732857_ref_A_alt_C"
## [3] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_4952996_ref_A_alt_C"
jc_in[!shared_jc, "position"]
## [1] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_807679_ref_A_alt_G" "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_5536163_ref_G_alt_A"
## [3] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_6070084_ref_T_alt_C"
together <- merge(nbh_in, jc_in, by = "position", all = TRUE)
shared_a5 <- a5_in[["position"]] %in% e4_in[["position"]]
shared_e4 <- e4_in[["position"]] %in% a5_in[["position"]]
a5_in[!shared_a5, "position"]
## [1] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_807679_ref_A_alt_G"
## [2] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_1640196_ref_G_alt_T"
## [3] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_2787819_ref_CAG_alt_CAAG,CAA"
## [4] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_3515863_ref_TTCA_alt_CTCA"
## [5] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_4952882_ref_T_alt_G"
## [6] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_4957228_ref_T_alt_C"
## [7] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_5028194_ref_CGGGTGGGTTCTTCCC_alt_CGGTCGCGTCTTCCC"
## [8] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_5460283_ref_A_alt_G"
e4_in[!shared_e4, "position"]
## [1] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_732792_ref_A_alt_G"
## [2] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_732857_ref_A_alt_C"
## [3] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_2787819_ref_CAG_alt_CAAG"
## [4] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_3515863_ref_T_alt_C"
## [5] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_3558713_ref_T_alt_C"
## [6] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_3832569_ref_T_alt_A"
## [7] "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_5028194_ref_CGGGTGGGT_alt_CGGTCGCG"
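One caveat with matching on the full tag: the two lists just above both contain positions 2787819, 3515863 and 5028194, but with different ref/alt strings, so `%in%` reports them as unshared even though they are the same locus. A minimal sketch of comparing on chromosome and position instead is below. The helper names are hypothetical, the tag layout is inferred from the output above, and the input is only a handful of tags copied from those two lists (which I am assuming are the A5 and E4 samples, in that order).

```r
# Hypothetical helper: split tags of the form
# chr_<name>_pos_<n>_ref_<ref>_alt_<alt> into their parts.
parse_variant_tags <- function(tags) {
  pattern <- "^chr_(.+)_pos_([0-9]+)_ref_([A-Za-z]+)_alt_([A-Za-z,]+)$"
  pieces <- regmatches(tags, regexec(pattern, tags))
  if (any(lengths(pieces) != 5)) {
    stop("Some tags do not match the expected format.")
  }
  mat <- do.call(rbind, pieces)
  data.frame(chr = mat[, 2], pos = as.integer(mat[, 3]),
             ref = mat[, 4], alt = mat[, 5], stringsAsFactors = FALSE)
}

prefix <- "chr_Pseudomonas_aeruginosa_UCBPP_PA14_pos_"
a5_only <- paste0(prefix, c("807679_ref_A_alt_G",
                            "2787819_ref_CAG_alt_CAAG,CAA",
                            "3515863_ref_TTCA_alt_CTCA",
                            "5460283_ref_A_alt_G"))
e4_only <- paste0(prefix, c("732792_ref_A_alt_G",
                            "2787819_ref_CAG_alt_CAAG",
                            "3515863_ref_T_alt_C",
                            "3832569_ref_T_alt_A"))

a5 <- parse_variant_tags(a5_only)
e4 <- parse_variant_tags(e4_only)

# Key each variant by chromosome and start position only.
locus <- function(df) paste(df[["chr"]], df[["pos"]], sep = ":")

same_locus <- intersect(locus(a5), locus(e4))
a5_unique <- a5[!locus(a5) %in% locus(e4), ]
print(same_locus)
print(a5_unique[["pos"]])
```

This only groups variants by reported start position. Two callers (or two runs of the same caller) can still place the same indel at different starts, so anything flagged this way deserves a look at the actual alignments.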
I keep forgetting to send Vince a sheet describing the state of the cup/pel and vibrio samples. Let us fix that now. I did process them and create a sample sheet, so it should at least be pretty easy.
Oh crap I used the gene# instead of PA# when mapping. My previous IDs are invalid for these samples.
pa14_annot <- load_gff_annotations("~/libraries/genome/paeruginosa_pa14.gff",
type = "gene", id_col = "gene_id")
## Returning a df with 16 columns and 5979 rows.
rownames(pa14_annot) <- pa14_annot[["gene_id"]]
cup_expt <- create_expt("sample_sheets/all_samples_pa14_202308_modified.xlsx",
gene_info = pa14_annot, file_column = "hisatcounttable")
## Reading the sample metadata.
## The sample definitions comprises: 6 rows(samples) and 27 columns(metadata fields).
## Matched 5979 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 5979 features and 6 samples.
## Writing the first sheet, containing a legend and some summary data.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 67 entries are 0. We are on a log scale, adding 1 to the data.
##
## Changed 67 zero count features.
##
## Naively calculating coefficient of variation/dispersion with respect to condition.
##
## Finished calculating dispersion estimates.
##
## `geom_smooth()` using formula = 'y ~ x'
## The expressionset has a minimal or missing set of conditions/batches.
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
## `geom_smooth()` using formula = 'y ~ x'
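An alternative to reloading the annotations keyed by gene_id would be to translate the IDs on the count table itself. The sketch below shows that idea on toy data. The `gene_id` and `Alias` column names come from the calls above, but the helper `translate_ids` and every value in the toy tables are made up for illustration.

```r
# Hypothetical helper: rename count table rows from one annotation column to another.
translate_ids <- function(counts, annot, from = "gene_id", to = "Alias") {
  idx <- match(rownames(counts), annot[[from]])
  new_ids <- annot[[to]][idx]
  # Keep the original ID wherever no translation exists.
  missing <- is.na(new_ids)
  new_ids[missing] <- rownames(counts)[missing]
  if (anyDuplicated(new_ids) > 0) {
    stop("Translated IDs are not unique.")
  }
  rownames(counts) <- new_ids
  counts
}

# Toy annotation and counts; none of these values are real.
toy_annot <- data.frame(gene_id = c("gene1", "gene2", "gene3"),
                        Alias = c("PA14_00010", "PA14_00020", "PA14_00030"),
                        stringsAsFactors = FALSE)
toy_counts <- matrix(c(10, 20, 30, 40), nrow = 4,
                     dimnames = list(c("gene2", "gene1", "gene3", "gene9"), "s1"))
translated <- translate_ids(toy_counts, toy_annot)
print(rownames(translated))
```

Untranslatable IDs are kept as-is rather than dropped, so a mismatch between the count table and the annotation stays visible instead of silently shrinking the table.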
I do not think I ever wrote down the commands used to preprocess the vibrio samples, likely because I just did them with another set?
I just added a job to give per-base coverage stats, so let's run that. Note, if I use this new cyoa version, a lot of things will break horribly because I reorganized my reference data directory but have not yet finished the process.
# Preprocessing of the vibrio samples (cyoa 202302).
module purge
module add cyoa/202302
cd preprocessing/202308_vibrio/
start=$(pwd)
for i in $(/bin/ls -d A*); do
  cd $i
  echo $i
  input=$(find unprocessed -type f | tr '\n' ':')
  cyoa --method pdnaseq --species vibrio_cholerae_a1552 \
    --introns 0 --gff_type CDS --gff_tag locus_tag \
    --input $input
  cd $start
done

# Per-base coverage via the new bam2cov job (cyoa 202402).
module purge
module add cyoa/202402
cd preprocessing/202308_vibrio/
start=$(pwd)
for i in $(/bin/ls -d A*); do
  cd $i
  echo $i
  input=$(/bin/ls outputs/02hisat2_vibrio_cholerae_a1552/vibrio_cholerae_a1552_genome.bam)
  cyoa --method bam2cov --input ${input}
  cd $start
done
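Once per-base coverage exists, the interesting question for knockouts is where coverage drops to zero. Below is a minimal sketch of finding such stretches with `rle()`. I am assuming a table with one depth value per base in position order (the layout that `samtools depth -a` produces); I have not checked what the bam2cov job actually writes, so the input format, the helper name `find_low_coverage_runs`, and the toy numbers are all assumptions.

```r
# Hypothetical helper: find runs of consecutive positions with depth <= max_depth.
# Assumes a single chromosome with every base listed once, sorted by position.
find_low_coverage_runs <- function(pos, depth, max_depth = 0, min_length = 1) {
  runs <- rle(depth <= max_depth)
  ends <- cumsum(runs[["lengths"]])
  starts <- ends - runs[["lengths"]] + 1
  # Keep only the low-coverage runs that are long enough.
  keep <- runs[["values"]] & runs[["lengths"]] >= min_length
  data.frame(start = pos[starts[keep]], end = pos[ends[keep]],
             length = runs[["lengths"]][keep])
}

# Toy data: zero depth at positions 3-6, 9, and 11-12.
toy_pos <- 1:12
toy_depth <- c(30, 28, 0, 0, 0, 0, 25, 31, 0, 27, 0, 0)
gaps <- find_low_coverage_runs(toy_pos, toy_depth, min_length = 2)
print(gaps)
```

The `min_length` argument is there so that single uncovered bases do not drown out a real deletion. A sensible value depends on read length and depth, so treat the 2 used here as a placeholder.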
I am going to change my metadata collector to accept a nonexistent sample sheet.
queried_species <- "vibrio_cholerae_a1552"
modified <- gather_preprocessing_metadata(basedir = "preprocessing/202308_vibrio", verbose = TRUE,
species = queried_species, specification = make_dnaseq_spec(),
new_metadata = "sample_sheets/202308_vibrio.xlsx")
## Using provided specification
## Starting trimomatic_input: 1.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*trimomatic/*-trimomatic.stderr.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*trimomatic/*-trimomatic.stderr.
## Found the correct line:
## Input Read Pairs: 2512423 Both Surviving: 2265635 (90.18%) Forward Only Surviving: 138802 (5.52%) Reverse Only Surviving: 36832 (1.47%) Dropped: 71154 (2.83%)
## Found the correct line:
## Input Read Pairs: 2966181 Both Surviving: 2726646 (91.92%) Forward Only Surviving: 122563 (4.13%) Reverse Only Surviving: 44966 (1.52%) Dropped: 72006 (2.43%)
## Found the correct line:
## Input Read Pairs: 2716935 Both Surviving: 2468715 (90.86%) Forward Only Surviving: 136092 (5.01%) Reverse Only Surviving: 39967 (1.47%) Dropped: 72161 (2.66%)
## Found the correct line:
## Input Read Pairs: 2506738 Both Surviving: 2262819 (90.27%) Forward Only Surviving: 87915 (3.51%) Reverse Only Surviving: 80945 (3.23%) Dropped: 75059 (2.99%)
## Starting trimomatic_output: 2.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*trimomatic/*-trimomatic.stderr.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*trimomatic/*-trimomatic.stderr.
## Found the correct line:
## Input Read Pairs: 2512423 Both Surviving: 2265635 (90.18%) Forward Only Surviving: 138802 (5.52%) Reverse Only Surviving: 36832 (1.47%) Dropped: 71154 (2.83%)
## Found the correct line:
## Input Read Pairs: 2966181 Both Surviving: 2726646 (91.92%) Forward Only Surviving: 122563 (4.13%) Reverse Only Surviving: 44966 (1.52%) Dropped: 72006 (2.43%)
## Found the correct line:
## Input Read Pairs: 2716935 Both Surviving: 2468715 (90.86%) Forward Only Surviving: 136092 (5.01%) Reverse Only Surviving: 39967 (1.47%) Dropped: 72161 (2.66%)
## Found the correct line:
## Input Read Pairs: 2506738 Both Surviving: 2262819 (90.27%) Forward Only Surviving: 87915 (3.51%) Reverse Only Surviving: 80945 (3.23%) Dropped: 75059 (2.99%)
## Starting trimomatic_ratio: 3.
## Checking input_file_spec: .
## The numerator column is: trimomatic_output.
## The denominator column is: trimomatic_input.
## Starting fastqc_pct_gc: 4.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*fastqc/*_fastqc/fastqc_data.txt.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*fastqc/*_fastqc/fastqc_data.txt.
## Found the correct line:
## %GC 47
## Found the correct line:
## %GC 47
## Found the correct line:
## %GC 48
## Found the correct line:
## %GC 47
## Starting fastqc_most_overrepresented: 5.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*fastqc/*_fastqc/fastqc_data.txt.
## Not including new entries for: fastqc_most_overrepresented, it is empty.
## Starting hisat_genome_single_concordant: 6.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat*_{species}/hisat*_*genome*.stderr.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*hisat*_vibrio_cholerae_a1552/hisat*_*genome*.stderr.
## Found the correct line:
## 2220685 (98.02%) aligned concordantly exactly 1 time
## Found the correct line:
## 2666383 (97.79%) aligned concordantly exactly 1 time
## Found the correct line:
## 2428374 (98.37%) aligned concordantly exactly 1 time
## Found the correct line:
## 2214393 (97.86%) aligned concordantly exactly 1 time
## Starting hisat_genome_multi_concordant: 7.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat*_{species}/hisat*_*genome*.stderr.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*hisat*_vibrio_cholerae_a1552/hisat*_*genome*.stderr.
## Found the correct line:
## 40157 (1.77%) aligned concordantly >1 times
## Found the correct line:
## 56552 (2.07%) aligned concordantly >1 times
## Found the correct line:
## 35989 (1.46%) aligned concordantly >1 times
## Found the correct line:
## 45388 (2.01%) aligned concordantly >1 times
## Starting hisat_genome_single_all: 8.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat*_{species}/hisat*_*genome*.stderr.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*hisat*_vibrio_cholerae_a1552/hisat*_*genome*.stderr.
## Found the correct line:
## 2538 (50.70%) aligned exactly 1 time
## Found the correct line:
## 2086 (49.81%) aligned exactly 1 time
## Found the correct line:
## 2364 (49.35%) aligned exactly 1 time
## Found the correct line:
## 1855 (50.77%) aligned exactly 1 time
## Starting hisat_genome_multi_all: 9.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat*_{species}/hisat*_*genome*.stderr.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*hisat*_vibrio_cholerae_a1552/hisat*_*genome*.stderr.
## Found the correct line:
## 172 (3.44%) aligned >1 times
## Found the correct line:
## 117 (2.79%) aligned >1 times
## Found the correct line:
## 112 (2.34%) aligned >1 times
## Found the correct line:
## 112 (3.07%) aligned >1 times
## Starting hisat_genome_percent_log: 10.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat*_{species}/hisat*_*{type}*.stderr.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*hisat*_vibrio_cholerae_a1552/hisat*_*genome*.stderr.
## Found the correct line:
## 99.95% overall alignment rate
## Found the correct line:
## 99.96% overall alignment rate
## Found the correct line:
## 99.95% overall alignment rate
## Found the correct line:
## 99.96% overall alignment rate
## Starting hisat_genome_pct_vs_trimmed: 11.
## Warning in gather_preprocessing_metadata(basedir = "preprocessing/202308_vibrio", : Column: hisat_genome_percent_log already exists, replacing it.
## Checking input_file_spec: .
## Starting gatk_unpaired: 12.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/deduplication_stats.txt.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/deduplication_stats.txt.
## Found the correct line:
## Unknown Library 0 2260842 257414 0 0 205928 100039 0.091085 21320690
## Found the correct line:
## Unknown Library 0 2722935 327740 0 0 256760 103550 0.094295 21509499
## Found the correct line:
## Unknown Library 0 2464363 229232 0 0 233245 103907 0.094647 20745281
## Found the correct line:
## Unknown Library 0 2259781 263980 0 0 202776 86705 0.089733 19611117
## Not including new entries for: gatk_unpaired, it is empty.
## Starting gatk_paired: 13.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/deduplication_stats.txt.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/deduplication_stats.txt.
## Found the correct line:
## Unknown Library 0 2260842 257414 0 0 205928 100039 0.091085 21320690
## Found the correct line:
## Unknown Library 0 2722935 327740 0 0 256760 103550 0.094295 21509499
## Found the correct line:
## Unknown Library 0 2464363 229232 0 0 233245 103907 0.094647 20745281
## Found the correct line:
## Unknown Library 0 2259781 263980 0 0 202776 86705 0.089733 19611117
## Starting gatk_supplementary: 14.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/deduplication_stats.txt.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/deduplication_stats.txt.
## Found the correct line:
## Unknown Library 0 2260842 257414 0 0 205928 100039 0.091085 21320690
## Found the correct line:
## Unknown Library 0 2722935 327740 0 0 256760 103550 0.094295 21509499
## Found the correct line:
## Unknown Library 0 2464363 229232 0 0 233245 103907 0.094647 20745281
## Found the correct line:
## Unknown Library 0 2259781 263980 0 0 202776 86705 0.089733 19611117
## Starting gatk_unmapped: 15.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/deduplication_stats.txt.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/deduplication_stats.txt.
## Found the correct line:
## Unknown Library 0 2260842 257414 0 0 205928 100039 0.091085 21320690
## Found the correct line:
## Unknown Library 0 2722935 327740 0 0 256760 103550 0.094295 21509499
## Found the correct line:
## Unknown Library 0 2464363 229232 0 0 233245 103907 0.094647 20745281
## Found the correct line:
## Unknown Library 0 2259781 263980 0 0 202776 86705 0.089733 19611117
## Not including new entries for: gatk_unmapped, it is empty.
## Starting gatk_unpaired_duplicates: 16.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/deduplication_stats.txt.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/deduplication_stats.txt.
## Found the correct line:
## Unknown Library 0 2260842 257414 0 0 205928 100039 0.091085 21320690
## Found the correct line:
## Unknown Library 0 2722935 327740 0 0 256760 103550 0.094295 21509499
## Found the correct line:
## Unknown Library 0 2464363 229232 0 0 233245 103907 0.094647 20745281
## Found the correct line:
## Unknown Library 0 2259781 263980 0 0 202776 86705 0.089733 19611117
## Not including new entries for: gatk_unpaired_duplicates, it is empty.
## Starting gatk_paired_duplicates: 17.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/deduplication_stats.txt.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/deduplication_stats.txt.
## Found the correct line:
## Unknown Library 0 2260842 257414 0 0 205928 100039 0.091085 21320690
## Found the correct line:
## Unknown Library 0 2722935 327740 0 0 256760 103550 0.094295 21509499
## Found the correct line:
## Unknown Library 0 2464363 229232 0 0 233245 103907 0.094647 20745281
## Found the correct line:
## Unknown Library 0 2259781 263980 0 0 202776 86705 0.089733 19611117
## Starting gatk_paired_opt_duplicates: 18.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/deduplication_stats.txt.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/deduplication_stats.txt.
## Found the correct line:
## Unknown Library 0 2260842 257414 0 0 205928 100039 0.091085 21320690
## Found the correct line:
## Unknown Library 0 2722935 327740 0 0 256760 103550 0.094295 21509499
## Found the correct line:
## Unknown Library 0 2464363 229232 0 0 233245 103907 0.094647 20745281
## Found the correct line:
## Unknown Library 0 2259781 263980 0 0 202776 86705 0.089733 19611117
## Starting gatk_duplicate_pct: 19.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/deduplication_stats.txt.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/deduplication_stats.txt.
## Found the correct line:
## Unknown Library 0 2260842 257414 0 0 205928 100039 0.091085 21320690
## Found the correct line:
## Unknown Library 0 2722935 327740 0 0 256760 103550 0.094295 21509499
## Found the correct line:
## Unknown Library 0 2464363 229232 0 0 233245 103907 0.094647 20745281
## Found the correct line:
## Unknown Library 0 2259781 263980 0 0 202776 86705 0.089733 19611117
## Starting gatk_libsize: 20.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/deduplication_stats.txt.
## Example regex filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/deduplication_stats.txt.
## Found the correct line:
## Unknown Library 0 2260842 257414 0 0 205928 100039 0.091085 21320690
## Found the correct line:
## Unknown Library 0 2722935 327740 0 0 256760 103550 0.094295 21509499
## Found the correct line:
## Unknown Library 0 2464363 229232 0 0 233245 103907 0.094647 20745281
## Found the correct line:
## Unknown Library 0 2259781 263980 0 0 202776 86705 0.089733 19611117
## Starting freebayes_observed: 21.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/all_tags*.
## Example count filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/all_tags*.
## Starting input_r1: 22.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/scripts/*trim_*.sh.
## Example regex filename: preprocessing/202308_vibrio/A1552/scripts/*trim_*.sh.
## Found the correct line:
## <(less unprocessed/A1552_S52_R1_001.fastq.xz) <(less unprocessed/A1552_S52_R2_001.fastq.xz) \
## Found the correct line:
## <(less unprocessed/A1552_capV_S53_R1_001.fastq.xz) <(less unprocessed/A1552_capV_S53_R2_001.fastq.xz) \
## Found the correct line:
## <(less unprocessed/A1552_capV_dncV_S55_R1_001.fastq.xz) <(less unprocessed/A1552_capV_dncV_S55_R2_001.fastq.xz) \
## Found the correct line:
## <(less unprocessed/A1552_dncV_S54_R1_001.fastq.xz) <(less unprocessed/A1552_dncV_S54_R2_001.fastq.xz) \
## Starting input_r2: 23.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/scripts/*trim_*.sh.
## Example regex filename: preprocessing/202308_vibrio/A1552/scripts/*trim_*.sh.
## Found the correct line:
## <(less unprocessed/A1552_S52_R1_001.fastq.xz) <(less unprocessed/A1552_S52_R2_001.fastq.xz) \
## Found the correct line:
## <(less unprocessed/A1552_capV_S53_R1_001.fastq.xz) <(less unprocessed/A1552_capV_S53_R2_001.fastq.xz) \
## Found the correct line:
## <(less unprocessed/A1552_capV_dncV_S55_R1_001.fastq.xz) <(less unprocessed/A1552_capV_dncV_S55_R2_001.fastq.xz) \
## Found the correct line:
## <(less unprocessed/A1552_dncV_S54_R1_001.fastq.xz) <(less unprocessed/A1552_dncV_S54_R2_001.fastq.xz) \
## Starting freebayes_observed_file: 24.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/all_tags*.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/all_tags*.
## Starting hisat_count_table: 25.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat*_{species}/{species}_*{type}*.count.xz.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*hisat*_vibrio_cholerae_a1552/vibrio_cholerae_a1552_*genome*.count.xz.
## Starting deduplication_stats: 26.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/deduplication_stats.txt.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/deduplication_stats.txt.
## Starting freebayes_variants_by_gene: 27.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/variants_by_gene*.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/variants_by_gene*.
## Starting freebayes_variants_table: 28.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/all_tags*.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/all_tags*.
## Starting freebayes_modified_genome: 29.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/{species}*.fasta.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/vibrio_cholerae_a1552*.fasta.
## Starting freebayes_bcf_file: 30.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/{species}*.bcf.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/vibrio_cholerae_a1552*.bcf.
## Starting freebayes_penetrance_file: 31.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*freebayes_{species}/variants_penetrance*.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*freebayes_vibrio_cholerae_a1552/variants_penetrance*.
## Starting bedtools_coverage_file: 32.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*bedtools_coverage_{species}/*.bed.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*bedtools_coverage_vibrio_cholerae_a1552/*.bed.
## Not including new entries for: bedtools_coverage_file, it is empty.
## Starting bbmap_coverage_stats: 33.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*bam2coverage*/coverage.tsv.xz.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*bam2coverage*/coverage.tsv.xz.
## Not including new entries for: bbmap_coverage_stats, it is empty.
## Starting bbmap_coverage_per_nt: 34.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*bam2coverage*/base_coverage.tsv.xz.
## Example filename: preprocessing/202308_vibrio/A1552/outputs/*bam2coverage*/base_coverage.tsv.xz.
## Not including new entries for: bbmap_coverage_per_nt, it is empty.
## Writing new metadata to: sample_sheets/202308_vibrio.xlsx
## Deleting the file sample_sheets/202308_vibrio.xlsx before writing the tables.
 | rownames | sampleid | condition | batch | trimomaticinput | trimomaticoutput | trimomaticpercent | fastqcpctgc | hisatgenomesingleconcordant | hisatgenomemulticoncordant | hisatgenomesingleall | hisatgenomemultiall | hisatgenomepercentlog | gatkpaired | gatksupplementary | gatkpairedduplicates | gatkpairedoptduplicates | gatkduplicatepct | gatklibsize | freebayesobserved | inputr1 | inputr2 | freebayesobservedfile | hisatcounttable | deduplicationstats | freebayesvariantsbygene | freebayesvariantstable | freebayesmodifiedgenome | freebayesbcffile | freebayespenetrancefile |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A1552 | A1552 | A1552 | undefined | undefined | 2512423 | 2265635 | 0.902 | 47 | 2220685 | 40157 | 2538 | 172 | 0.980 | 2260842 | 257414 | 205928 | 100039 | 0.0911 | 21320690 | 10 | unprocessed/A1552_S52_R1_001.fastq.xz | unprocessed/A1552_S52_R2_001.fastq.xz | preprocessing/202308_vibrio/A1552/outputs/03freebayes_vibrio_cholerae_a1552/all_tags.txt.xz | preprocessing/202308_vibrio/A1552/outputs/02hisat2_vibrio_cholerae_a1552/vibrio_cholerae_a1552_genome-paired_sreverse_CDS_locus_tag.count.xz | preprocessing/202308_vibrio/A1552/outputs/03freebayes_vibrio_cholerae_a1552/deduplication_stats.txt | preprocessing/202308_vibrio/A1552/outputs/03freebayes_vibrio_cholerae_a1552/variants_by_gene.txt.xz | preprocessing/202308_vibrio/A1552/outputs/03freebayes_vibrio_cholerae_a1552/all_tags.txt.xz | preprocessing/202308_vibrio/A1552/outputs/03freebayes_vibrio_cholerae_a1552/vibrio_cholerae_a1552-A1552.fasta | preprocessing/202308_vibrio/A1552/outputs/03freebayes_vibrio_cholerae_a1552/vibrio_cholerae_a1552.bcf | preprocessing/202308_vibrio/A1552/outputs/03freebayes_vibrio_cholerae_a1552/variants_penetrance.txt.xz |
A1552_capV | A1552_capV | A1552_capV | undefined | undefined | 2966181 | 2726646 | 0.919 | 47 | 2666383 | 56552 | 2086 | 117 | 0.978 | 2722935 | 327740 | 256760 | 103550 | 0.0943 | 21509499 | 13 | unprocessed/A1552_capV_S53_R1_001.fastq.xz | unprocessed/A1552_capV_S53_R2_001.fastq.xz | preprocessing/202308_vibrio/A1552_capV/outputs/03freebayes_vibrio_cholerae_a1552/all_tags.txt.xz | preprocessing/202308_vibrio/A1552_capV/outputs/02hisat2_vibrio_cholerae_a1552/vibrio_cholerae_a1552_genome-paired_sreverse_CDS_locus_tag.count.xz | preprocessing/202308_vibrio/A1552_capV/outputs/03freebayes_vibrio_cholerae_a1552/deduplication_stats.txt | preprocessing/202308_vibrio/A1552_capV/outputs/03freebayes_vibrio_cholerae_a1552/variants_by_gene.txt.xz | preprocessing/202308_vibrio/A1552_capV/outputs/03freebayes_vibrio_cholerae_a1552/all_tags.txt.xz | preprocessing/202308_vibrio/A1552_capV/outputs/03freebayes_vibrio_cholerae_a1552/vibrio_cholerae_a1552-A1552_capV.fasta | preprocessing/202308_vibrio/A1552_capV/outputs/03freebayes_vibrio_cholerae_a1552/vibrio_cholerae_a1552.bcf | preprocessing/202308_vibrio/A1552_capV/outputs/03freebayes_vibrio_cholerae_a1552/variants_penetrance.txt.xz |
A1552_capV_dncV | A1552_capV_dncV | A1552_capV_dncV | undefined | undefined | 2716935 | 2468715 | 0.909 | 48 | 2428374 | 35989 | 2364 | 112 | 0.984 | 2464363 | 229232 | 233245 | 103907 | 0.0946 | 20745281 | 11 | unprocessed/A1552_capV_dncV_S55_R1_001.fastq.xz | unprocessed/A1552_capV_dncV_S55_R2_001.fastq.xz | preprocessing/202308_vibrio/A1552_capV_dncV/outputs/03freebayes_vibrio_cholerae_a1552/all_tags.txt.xz | preprocessing/202308_vibrio/A1552_capV_dncV/outputs/02hisat2_vibrio_cholerae_a1552/vibrio_cholerae_a1552_genome-paired_sreverse_CDS_locus_tag.count.xz | preprocessing/202308_vibrio/A1552_capV_dncV/outputs/03freebayes_vibrio_cholerae_a1552/deduplication_stats.txt | preprocessing/202308_vibrio/A1552_capV_dncV/outputs/03freebayes_vibrio_cholerae_a1552/variants_by_gene.txt.xz | preprocessing/202308_vibrio/A1552_capV_dncV/outputs/03freebayes_vibrio_cholerae_a1552/all_tags.txt.xz | preprocessing/202308_vibrio/A1552_capV_dncV/outputs/03freebayes_vibrio_cholerae_a1552/vibrio_cholerae_a1552-A1552_capV_dncV.fasta | preprocessing/202308_vibrio/A1552_capV_dncV/outputs/03freebayes_vibrio_cholerae_a1552/vibrio_cholerae_a1552.bcf | preprocessing/202308_vibrio/A1552_capV_dncV/outputs/03freebayes_vibrio_cholerae_a1552/variants_penetrance.txt.xz |
A1552_dncV | A1552_dncV | A1552_dncV | undefined | undefined | 2506738 | 2262819 | 0.903 | 47 | 2214393 | 45388 | 1855 | 112 | 0.979 | 2259781 | 263980 | 202776 | 86705 | 0.0897 | 19611117 | 14 | unprocessed/A1552_dncV_S54_R1_001.fastq.xz | unprocessed/A1552_dncV_S54_R2_001.fastq.xz | preprocessing/202308_vibrio/A1552_dncV/outputs/03freebayes_vibrio_cholerae_a1552/all_tags.txt.xz | preprocessing/202308_vibrio/A1552_dncV/outputs/02hisat2_vibrio_cholerae_a1552/vibrio_cholerae_a1552_genome-paired_sreverse_CDS_locus_tag.count.xz | preprocessing/202308_vibrio/A1552_dncV/outputs/03freebayes_vibrio_cholerae_a1552/deduplication_stats.txt | preprocessing/202308_vibrio/A1552_dncV/outputs/03freebayes_vibrio_cholerae_a1552/variants_by_gene.txt.xz | preprocessing/202308_vibrio/A1552_dncV/outputs/03freebayes_vibrio_cholerae_a1552/all_tags.txt.xz | preprocessing/202308_vibrio/A1552_dncV/outputs/03freebayes_vibrio_cholerae_a1552/vibrio_cholerae_a1552-A1552_dncV.fasta | preprocessing/202308_vibrio/A1552_dncV/outputs/03freebayes_vibrio_cholerae_a1552/vibrio_cholerae_a1552.bcf | preprocessing/202308_vibrio/A1552_dncV/outputs/03freebayes_vibrio_cholerae_a1552/variants_penetrance.txt.xz |
vibrio_annot <- load_gff_annotations("~/libraries/genome/vibrio_cholerae_a1552.gff", id_col = "locus_tag",
type = "CDS")
## Returning a df with 39 columns and 3833 rows.
rownames(vibrio_annot) <- make.names(vibrio_annot[["locus_tag"]], unique = TRUE)
vibrio_expt <- create_expt("sample_sheets/202308_vibrio.xlsx",
file_column = "hisatcounttable", gene_info = vibrio_annot)
## Reading the sample metadata.
## The sample definitions comprises: 4 rows(samples) and 30 columns(metadata fields).
## Matched 3825 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 3825 features and 4 samples.
## Writing the first sheet, containing a legend and some summary data.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Warning in plot_pca(..., pc_method = "tsne"): TSNE: Attempting to auto-detect perplexity failed, setting it to 1.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## `geom_smooth()` using formula = 'y ~ x'
## Warning in plot_pca(..., pc_method = "tsne"): TSNE: Attempting to auto-detect perplexity failed, setting it to 1.
## The expressionset has a minimal or missing set of conditions/batches.
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
## `geom_smooth()` using formula = 'y ~ x'
The following block was copied from above and modified to match the Vibrio samples.
## [1] "rownames" "sampleid" "condition" "batch"
## [5] "trimomaticinput" "trimomaticoutput" "trimomaticpercent" "fastqcpctgc"
## [9] "hisatgenomesingleconcordant" "hisatgenomemulticoncordant" "hisatgenomesingleall" "hisatgenomemultiall"
## [13] "hisatgenomepercentlog" "gatkpaired" "gatksupplementary" "gatkpairedduplicates"
## [17] "gatkpairedoptduplicates" "gatkduplicatepct" "gatklibsize" "freebayesobserved"
## [21] "inputr1" "inputr2" "freebayesobservedfile" "hisatcounttable"
## [25] "deduplicationstats" "freebayesvariantsbygene" "freebayesvariantstable" "freebayesmodifiedgenome"
## [29] "freebayesbcffile" "freebayespenetrancefile" "file"
gene_mutations <- pData(vibrio_expt)[["freebayesvariantsbygene"]]
names(gene_mutations) <- rownames(pData(vibrio_expt))
start <- init_xlsx(excel = "excel/vibrio_mutations.xlsx")
## Deleting the file excel/vibrio_mutations.xlsx before writing the tables.
wb <- start[["wb"]]
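for (s in seq_along(gene_mutations)) {
  sample_name <- names(gene_mutations)[[s]]
  # Skip samples with no variants-by-gene file recorded.
  if (gene_mutations[[s]] == "") {
    next
  }
  sample_data <- readr::read_tsv(gene_mutations[[s]])
  # Skip samples whose variant table is empty.
  if (nrow(sample_data) == 0) {
    next
  }
  # One worksheet per sample, named after the sample.
  written <- write_xlsx(data = sample_data, sheet = sample_name, wb = wb)
}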
## Rows: 6 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): gene, chromosome, from_to, aa_subst
## dbl (1): position
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 7 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): gene, chromosome, from_to, aa_subst
## dbl (1): position
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 8 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): gene, chromosome, from_to, aa_subst
## dbl (1): position
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 7 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): gene, chromosome, from_to, aa_subst
## dbl (1): position
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A couple of notes to myself: I think these newest samples are seeking to understand the evolutionary pressures which resulted in plasmid mutations to the yciV promoter. Thus, I think the primary place to look is at that locus, both in the genome and plasmid.
In PAO1 it is PA3200, located at approximately position 3592500 as the first ORF of a reverse-strand operon.
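As a quick way to look at that locus in a per-sample variant table, the sketch below keeps only the rows whose `position` falls within a window around a chosen coordinate. It is a minimal sketch under some assumptions: the table is tab-separated with a header containing a `position` column (as the variants_by_gene tables in this document report), and the function name `variants_near` is hypothetical. Note that the coordinate quoted above is from PAO1, so the window center has to be the equivalent coordinate in whichever reference the reads were actually mapped against.

```shell
# Hypothetical helper: read a tab-separated table on stdin and print the
# header plus every row whose 'position' column lies within
# center +/- window. The column is located by name, not by index.
variants_near() {
    awk -F '\t' -v center="$1" -v window="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "position") col = i
            print
            next
        }
        col && $col >= center - window && $col <= center + window
    '
}
```

For an xz-compressed table this could be used as `xz -dc variants_by_gene.txt.xz | variants_near 3592500 5000`, with the center adjusted to the mapped reference.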
Repeat the printing of counts/gene for a series of samples which have deleted operons for combinations of cupA/cupB/cupC/cupD, orn, pel, pilA and fliC.
I just created a blank sample sheet with appropriate IDs. Let us take a peek!
Caveat: I have yet to finish a full run of the freebayes variant caller. It is running now, but I do not think it is particularly needed for the purposes of this query.
spec <- make_dnaseq_spec()
meta_202401 <- gather_preprocessing_metadata("sample_sheets/202401_samples.xlsx",
species = "paeruginosa_pa14", verbose = FALSE,
specification = spec, basedir = "preprocessing/202401")
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Warning in gather_preprocessing_metadata("sample_sheets/202401_samples.xlsx", : Column: hisat_genome_percent_log already exists, replacing it.
## Writing new metadata to: sample_sheets/202401_samples_modified.xlsx
## Deleting the file sample_sheets/202401_samples_modified.xlsx before writing the tables.
expt_202401 <- create_expt(meta_202401[["new_meta"]], file_column = "hisatcounttable",
gene_info = pa14_annot)
## Reading the sample metadata.
## The sample definitions comprises: 13 rows(samples) and 22 columns(metadata fields).
## Warning in create_expt(meta_202401[["new_meta"]], file_column = "hisatcounttable", : Even after changing the rownames in gene info, they do not
## match the count table.
## Even after changing the rownames in gene info, they do not match the count table.
## Here are the first few rownames from the count tables:
## PA14_00010, PA14_00020, PA14_00030, PA14_00050, PA14_00060, PA14_00070
## Here are the first few rownames from the gene information table:
## gene1650835, gene1650837, gene1650839, gene1650841, gene1650843, gene1650845
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 5979 features and 13 samples.
## Deleting the file excel/cup_and_friends.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 182 entries are 0. We are on a log scale, adding 1 to the data.
##
## Changed 182 zero count features.
##
## Naively calculating coefficient of variation/dispersion with respect to condition.
##
## Finished calculating dispersion estimates.
##
## `geom_smooth()` using formula = 'y ~ x'
## The expressionset has a minimal or missing set of conditions/batches.
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
## `geom_smooth()` using formula = 'y ~ x'
I am basically going to copy/paste the previous block and assume all goes well.
spec <- make_dnaseq_spec()
meta_202406 <- gather_preprocessing_metadata("sample_sheets/202406_samples.xlsx",
species = "paeruginosa_pa14", verbose = FALSE,
specification = spec, basedir = "preprocessing/202406")
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Warning in gather_preprocessing_metadata("sample_sheets/202406_samples.xlsx", : Column: hisat_genome_percent_log already exists, replacing it.
## Writing new metadata to: sample_sheets/202406_samples_modified.xlsx
## Deleting the file sample_sheets/202406_samples_modified.xlsx before writing the tables.
expt_202406 <- create_expt(meta_202406[["new_meta"]], file_column = "hisatcounttable",
gene_info = pa14_annot)
## Reading the sample metadata.
## The sample definitions comprises: 9 rows(samples) and 29 columns(metadata fields).
## Warning in create_expt(meta_202406[["new_meta"]], file_column = "hisatcounttable", : Even after changing the rownames in gene info, they do not
## match the count table.
## Even after changing the rownames in gene info, they do not match the count table.
## Here are the first few rownames from the count tables:
## PA14_00010, PA14_00020, PA14_00030, PA14_00050, PA14_00060, PA14_00070
## Here are the first few rownames from the gene information table:
## gene1650835, gene1650837, gene1650839, gene1650841, gene1650843, gene1650845
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 5979 features and 9 samples.
## Deleting the file excel/cup_and_friends_202406.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 199 entries are 0. We are on a log scale, adding 1 to the data.
##
## Changed 199 zero count features.
##
## Naively calculating coefficient of variation/dispersion with respect to condition.
##
## Finished calculating dispersion estimates.
##
## `geom_smooth()` using formula = 'y ~ x'
## The expressionset has a minimal or missing set of conditions/batches.
## Error in dstat0[, i] : subscript out of bounds
## `geom_smooth()` using formula = 'y ~ x'
Vince made a query about the libraries: could the strange ‘dippyness’ of these new libraries be the reason some of the previous samples had so many contigs when we attempted to assemble them? He is therefore curious to see the assembly state of a few of the original samples.
To that end, I am going to invoke an assembly of a couple of the earliest samples.
While I am at it, I will remap these two to get rid of the junctions. In addition, I loaded them alongside one of the new samples.
cd preprocessing
cd PA14_JC
cyoa --method unicycler --input r1_trimmed.fastq.xz:r2_trimmed.fastq.xz
cyoa --method hisat --species paeruginosa_pa14 --intron 0 \
--input r1_trimmed.fastq.xz:r2_trimmed.fastq.xz \
--gff_tag Alias
cd ../PA14_NBH
cyoa --method unicycler --input r1_trimmed.fastq.xz:r2_trimmed.fastq.xz
cyoa --method hisat --species paeruginosa_pa14 --intron 0 \
--input r1_trimmed.fastq.xz:r2_trimmed.fastq.xz \
--gff_tag Alias
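Once the assemblies finish, a quick way to compare their state is to count the contigs and total bases in each assembly FASTA. The following is a minimal sketch, assuming the assembler writes an ordinary FASTA file; the function name `contig_summary` is hypothetical, and the path to the assembly output is deliberately left to the reader.

```shell
# Hypothetical helper: print "<contig count><TAB><total bases>" for a FASTA.
# Header lines start with '>'; every other line is counted as sequence
# after stripping spaces, tabs and carriage returns.
contig_summary() {
    awk '
        /^>/ { n++; next }
        { gsub(/[ \t\r]/, ""); bases += length($0) }
        END { printf "%d\t%d\n", n, bases }
    ' "$1"
}
```

Running `contig_summary path/to/assembly.fasta` in each sample directory gives one line per sample that can be compared side by side.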
The previous samples had some problems with duplicate reads and were resequenced. I downloaded the resequenced samples into a new directory (202408_pa14). I was away for a couple of weeks and seem to be having a little difficulty getting my head on straight, so I will hopefully get this done correctly on the first pass by doing the following:
module add cyoa/202404_hack
cd preprocessing/202408_pa14
start=$(pwd)
for i in $(/bin/ls -d PA14*); do
    cd "$i"
    echo "$i"
    # Join every .gz read file in this sample directory with ':' for --input.
    cyoa --method pdnaseq --species paeruginosa_pa14 --intron 0 \
        --input $(/bin/ls *.gz | tr '\n' ':') \
        --gff_tag Alias
    cd "$start"
done
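One detail of the loop above: `ls *.gz | tr '\n' ':'` turns every newline into a colon, so the resulting string ends with a trailing `:`. Whether cyoa minds that is not verified here. As a sketch of an alternative, `paste -sd:` joins the names without the trailing separator; the function name `join_inputs` is hypothetical.

```shell
# Hypothetical helper: print the *.gz file names in directory $1 joined
# by ':' with no trailing colon. If nothing matches, the literal pattern
# '*.gz' is printed, so check that the directory actually has reads.
join_inputs() {
    ( cd "$1" && printf '%s\n' *.gz | paste -sd: - )
}
```

Inside the loop this could replace the `ls | tr` substitution, for example `--input "$(join_inputs .)"`.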
Upon completion, I will collect the various stats using the following block, which I copy/pasted from above, with the caveat that I copied the sample sheet from the seqcenter to ‘202408_samples.xlsx’ and so all the metadata collected will get added as new columns to it.
spec <- make_dnaseq_spec()
meta_202408 <- gather_preprocessing_metadata("sample_sheets/202408_samples.xlsx",
species = "paeruginosa_pa14", verbose = FALSE,
specification = spec, basedir = "preprocessing/202408_pa14")
## Did not find the column: sampleid.
## Setting the ID column to the first column.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Warning in gather_preprocessing_metadata("sample_sheets/202408_samples.xlsx", : Column: hisat_genome_percent_log already exists, replacing it.
## Writing new metadata to: sample_sheets/202408_samples_modified.xlsx
## Deleting the file sample_sheets/202408_samples_modified.xlsx before writing the tables.
expt_202408 <- create_expt(meta_202408[["new_meta"]], file_column = "hisatcounttable",
gene_info = pa14_annot)
## Reading the sample metadata.
## The sample definitions comprises: 6 rows(samples) and 35 columns(metadata fields).
## Warning in create_expt(meta_202408[["new_meta"]], file_column = "hisatcounttable", : Even after changing the rownames in gene info, they do not
## match the count table.
## Even after changing the rownames in gene info, they do not match the count table.
## Here are the first few rownames from the count tables:
## PA14_00010, PA14_00020, PA14_00030, PA14_00050, PA14_00060, PA14_00070
## Here are the first few rownames from the gene information table:
## gene1650835, gene1650837, gene1650839, gene1650841, gene1650843, gene1650845
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 5979 features and 6 samples.
## Deleting the file excel/cup_and_friends_202408_reseq.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 106 entries are 0. We are on a log scale, adding 1 to the data.
##
## Changed 106 zero count features.
##
## Naively calculating coefficient of variation/dispersion with respect to condition.
##
## Finished calculating dispersion estimates.
##
## `geom_smooth()` using formula = 'y ~ x'
## The expressionset has a minimal or missing set of conditions/batches.
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
## `geom_smooth()` using formula = 'y ~ x'
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: ruv(v.0.9.7.1), hpgltools(v.1.0), testthat(v.3.2.1), reticulate(v.1.38.0), Matrix(v.1.6-5), glue(v.1.7.0), SummarizedExperiment(v.1.32.0), GenomicRanges(v.1.54.1), GenomeInfoDb(v.1.38.6), IRanges(v.2.36.0), S4Vectors(v.0.40.2), MatrixGenerics(v.1.14.0), matrixStats(v.1.2.0), Biobase(v.2.62.0) and BiocGenerics(v.0.48.1)
loaded via a namespace (and not attached): fs(v.1.6.3), bitops(v.1.0-7), enrichplot(v.1.22.0), devtools(v.2.4.5), HDO.db(v.0.99.1), httr(v.1.4.7), RColorBrewer(v.1.1-3), numDeriv(v.2016.8-1.1), profvis(v.0.3.8), tools(v.4.3.1), backports(v.1.4.1), utf8(v.1.2.4), R6(v.2.5.1), lazyeval(v.0.2.2), mgcv(v.1.9-1), urlchecker(v.1.0.1), withr(v.3.0.0), prettyunits(v.1.2.0), gridExtra(v.2.3), cli(v.3.6.2), scatterpie(v.0.2.1), labeling(v.0.4.3), sass(v.0.4.8), mvtnorm(v.1.2-4), readr(v.2.1.5), genefilter(v.1.84.0), Rsamtools(v.2.18.0), yulab.utils(v.0.1.4), gson(v.0.1.0), DOSE(v.3.28.2), sessioninfo(v.1.2.2), limma(v.3.58.1), rstudioapi(v.0.15.0), RSQLite(v.2.3.5), generics(v.0.1.3), gridGraphics(v.0.5-1), BiocIO(v.1.12.0), vroom(v.1.6.5), gtools(v.3.9.5), zip(v.2.3.1), dplyr(v.1.1.4), GO.db(v.3.18.0), fansi(v.1.0.6), abind(v.1.4-5), lifecycle(v.1.0.4), yaml(v.2.3.8), edgeR(v.4.0.16), Rtsne(v.0.17), gplots(v.3.1.3.1), qvalue(v.2.34.0), SparseArray(v.1.2.4), BiocFileCache(v.2.10.1), grid(v.4.3.1), blob(v.1.2.4), promises(v.1.2.1), crayon(v.1.5.2), miniUI(v.0.1.1.1), lattice(v.0.22-5), cowplot(v.1.1.3), GenomicFeatures(v.1.54.3), annotate(v.1.80.0), KEGGREST(v.1.42.0), pillar(v.1.9.0), knitr(v.1.45), varhandle(v.2.0.6), fgsea(v.1.28.0), rjson(v.0.2.21), boot(v.1.3-29), corpcor(v.1.6.10), codetools(v.0.2-19), fastmatch(v.1.1-4), ggfun(v.0.1.4), data.table(v.1.15.0), remotes(v.2.4.2.1), treeio(v.1.26.0), vctrs(v.0.6.5), png(v.0.1-8), Rdpack(v.2.6), gtable(v.0.3.4), cachem(v.1.0.8), openxlsx(v.4.2.5.2), xfun(v.0.42), rbibutils(v.2.2.16), S4Arrays(v.1.2.0), mime(v.0.12), tidygraph(v.1.3.1), survival(v.3.5-8), iterators(v.1.0.14), statmod(v.1.5.0), directlabels(v.2024.1.21), ellipsis(v.0.3.2), nlme(v.3.1-164), pbkrtest(v.0.5.2), ggtree(v.3.10.0), usethis(v.2.2.3), bit64(v.4.0.5), progress(v.1.2.3), EnvStats(v.2.8.1), filelock(v.1.0.3), rprojroot(v.2.0.4), bslib(v.0.6.1), KernSmooth(v.2.23-22), colorspace(v.2.1-0), DBI(v.1.2.2), tidyselect(v.1.2.0), bit(v.4.0.5), compiler(v.4.3.1), 
curl(v.5.2.0), graph(v.1.80.0), xml2(v.1.3.6), desc(v.1.4.3), DelayedArray(v.0.28.0), plotly(v.4.10.4), shadowtext(v.0.1.3), rtracklayer(v.1.62.0), scales(v.1.3.0), caTools(v.1.18.2), remaCor(v.0.0.18), quadprog(v.1.5-8), rappdirs(v.0.3.3), stringr(v.1.5.1), digest(v.0.6.34), minqa(v.1.2.6), variancePartition(v.1.32.5), rmarkdown(v.2.25), aod(v.1.3.3), XVector(v.0.42.0), RhpcBLASctl(v.0.23-42), htmltools(v.0.5.7), pkgconfig(v.2.0.3), lme4(v.1.1-35.1), dbplyr(v.2.4.0), fastmap(v.1.1.1), rlang(v.1.1.3), htmlwidgets(v.1.6.4), shiny(v.1.8.0), farver(v.2.1.1), jquerylib(v.0.1.4), jsonlite(v.1.8.8), BiocParallel(v.1.36.0), GOSemSim(v.2.28.1), RCurl(v.1.98-1.14), magrittr(v.2.0.3), GenomeInfoDbData(v.1.2.11), ggplotify(v.0.1.2), patchwork(v.1.2.0), munsell(v.0.5.0), Rcpp(v.1.0.12), ape(v.5.7-1), viridis(v.0.6.5), stringi(v.1.8.3), ggraph(v.2.1.0), brio(v.1.1.4), zlibbioc(v.1.48.0), MASS(v.7.3-60.0.1), plyr(v.1.8.9), pkgbuild(v.1.4.3), parallel(v.4.3.1), ggrepel(v.0.9.5), forcats(v.1.0.0), Biostrings(v.2.70.2), graphlayouts(v.1.1.0), splines(v.4.3.1), pander(v.0.6.5), hms(v.1.1.3), locfit(v.1.5-9.8), fastcluster(v.1.2.6), igraph(v.2.0.2), reshape2(v.1.4.4), biomaRt(v.2.58.2), pkgload(v.1.3.4), XML(v.3.99-0.16.1), evaluate(v.0.23), tzdb(v.0.4.0), nloptr(v.2.0.3), PROPER(v.1.34.0), foreach(v.1.5.2), tweenr(v.2.0.2), httpuv(v.1.6.14), tidyr(v.1.3.1), purrr(v.1.0.2), polyclip(v.1.10-6), ggplot2(v.3.5.0), ggforce(v.0.4.2), broom(v.1.0.5), xtable(v.1.8-4), restfulr(v.0.0.15), tidytree(v.0.4.6), fANCOVA(v.0.6-1), later(v.1.3.2), viridisLite(v.0.4.2), tibble(v.3.2.1), lmerTest(v.3.1-3), clusterProfiler(v.4.10.0), aplot(v.0.2.2), memoise(v.2.0.1), AnnotationDbi(v.1.64.1), GenomicAlignments(v.1.38.2), sva(v.3.50.0) and GSEABase(v.1.64.0)
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 5779a65508e724127c3394e5c0b4c56d3b650901
## This is hpgltools commit: Fri Jun 28 10:56:42 2024 -0400: 5779a65508e724127c3394e5c0b4c56d3b650901