Preprocessing Streptococcus pyogenes samples for TNSeq/RNASeq

Copy reads to the working tree

I prefer to concatenate the multiple sequence files into a single file for each sample. This process is however a bit different for rnaseq/tnseq.

RNASeq

cd preprocessing/rnaseq
for i in $(/bin/ls hpgl*); do
  ./copy_process.sh ${i}
done
## startdir=$(pwd)
## for i in "$@"; do
##   cd $startdir
##   cyoa --task rnaseq --method copyraw --raw_dir "/cbcb/lab/nelsayed/raw_data/ade" --hpgl ${i}
##   cd $i
##   inputs="${i}_forward.fastq.gz:${i}_reverse.fastq.gz"
##   trimmed="${i}_forward-trimmed.fastq.gz:${i}_reverse-trimmed.fastq.gz"
##   cyoa --task rnaseq --method fastqc --input ${inputs}
##   cyoa --task rnaseq --method trimomatic --input ${inputs}
##   trim_id=$(cat last_job.txt)
##   cyoa --task rnaseq --method tophat --input ${trimmed} -s lpanamensis --depends "${trim_id}"
##   lp_id=$(cat last_job.txt)
##   cyoa --task rnaseq --method tophat --input ${trimmed} -s lbraziliensis --depends "${trim_id}"
##   lb_id=$(cat last_job.txt)
##   cyoa --task rnaseq --method tophat --input ${trimmed} -s hsapiens --depends "${trim_id}"
##   hs_id=$(cat last_job.txt)
##   rm last_job.txt
##   cd $startdir
## done

TNSeq

The TNSeq data comes in a rather different format, the reads can not be successfully demultiplexed by the illumina fastatobcl.pl script because the indexes are in fact later in each read. As a result, the cyoa utility has a demultiplexing function. It is able to read a directory of sequence file and a .csv of index sequence to sample mappings. It goes one step further because of problems we have had in previous samples with N’s in the first 8 bases; and therefore attempts to take into account ambiguous heptamers and properly sort them.

Oh, random caveat: the computer cluster has a memory leak which will crash the computer if you try doing this on them, therefore this step is performed on a local workstation: DONT RUN IT ON IBIS OR YOU WILL GET YELLED AT!

cd preprocessing/tnseq/unprocessed/libs12  ## (repeat for libs34)
cyoa --task tnseq --method sort --sample_sheet=tnseq_samples.csv
## This assumes a file tnseq_samples.csv which looks like:
## Except the following example is from the new data, but still.
##TATAGCCT,subcuT0
##ATAGAGGC,subcuT12M2
##CCTATCCT,subcuT24M1
##GGCTCTGA,subcuT24M2
##AGGCGAAG,subcuT284M1
##TAATCTTA,subcuT284M2
##CAGGACGT,PMNLAVT0
##GTACTGAC,PMNLAVplasma
##ATTACTCG,PMNLAVexp1
##TCCGGAGA,PMNLAVexp2

Once that is complete, the data is ready for adapter trimming via cutadapt etc.

Fastqc (all sample types)

The cyoa tool writes out a script which looks like the following:

## This FastQC run is against rnaseq data and is used for
## an initial estimation of the overall sequencing quality.
mkdir -p outputs/fastqc && \
  fastqc --extract -o outputs/fastqc hpgl0241_forward-trimmed.fastq hpgl0241_reverse-trimmed.fastq \
  2>outputs/fastqc.out 1>&2

Adapter trimming

Trimomatic is the quickest tool for the rnaseq data, while cutadapt has flexibility for the tnseq.

RNASeq

The trimomatic script looks like: It is worth noting that in some cases, the illumina adapter removal results in a java exception. If that happens, rather than lose the data I made my cyoa script re-invoke trimomatic without the illumina clipping. In addition, if things need to be redone, then the sequences may in fact have been moved into the sequences/ directory and compressed with xz -9e, thus the if [[]] at the beginning testing for those files.

## Trimomatic_Pairwise: In case a trimming needs to be redone...
if [[ ! -r "hpgl0663_forward.fastq.gz" ]]; then
  if [[ -r "sequences/hpgl0663_forward.fastq.xz" ]]; then
    mv sequences/hpgl0663_forward.fastq.xz . && pxz -d hpgl0663_forward.fastq.xz && pigz hpgl0663_forward.fastq && mv sequences/hpgl0663_reverse.fastq.xz . &&
 pxz -d hpgl0663_reverse.fastq.xz && pigz hpgl0663_reverse.fastq
  else
    echo "Missing files. Did not find hpgl0663_forward.fastq.gz nor sequences/hpgl0663_forward.fastq.xz"
    exit 1
  fi
fi
trimomatic PE -threads 1 -phred33 hpgl0663_forward.fastq.gz hpgl0663_reverse.fastq.gz hpgl0663_forward-trimmed_paired.fastq.gz hpgl0663_forward-trimmed_unpair
ed.fastq.gz hpgl0663_reverse-trimmed_paired.fastq.gz hpgl0663_reverse-trimmed_unpaired.fastq.gz ILLUMINACLIP:/cbcbhomes/abelew/libraries/adapters.fa:2:20:4 SL
IDINGWINDOW:4:25 1>outputs/hpgl0663-trimomatic.out 2>&1
excepted=$(grep "Exception" outputs/hpgl0663-trimomatic.out)
## The following is in case the illumina clipping fails, which it does if this has already been run I think.
if [[ "${excepted}" != "" ]]; then
  trimomatic PE -threads 1 -phred33 hpgl0663_forward.fastq.gz hpgl0663_reverse.fastq.gz hpgl0663_forward-trimmed_paired.fastq.gz hpgl0663_forward-trimmed_unpa
ired.fastq.gz hpgl0663_reverse-trimmed_paired.fastq.gz hpgl0663_reverse-trimmed_unpaired.fastq.gz SLIDINGWINDOW:4:25 1>>outputs/hpgl0663-trimomatic.out 2>&1
fi
sleep 10
mv hpgl0663_forward-trimmed_paired.fastq.gz hpgl0663_forward-trimmed.fastq.gz && mv hpgl0663_reverse-trimmed_paired.fastq.gz hpgl0663_reverse-trimmed.fastq.gz

TNSeq

The tnseq adapter trimming requires some extra flexibility to take into account the read lengths which must be between ~10-16 nucleotides. It is worth noting that there are extra ‘-a’ arguments to cutadapt which attempt to take into account some of the strange apparent PCR artifacts observed.

cyoa --task tnseq --method cutadapt --input file.fastq.gz
## The resulting script looks like
## mkdir -p /cbcb/nelsayed-scratch/atb/tnseq/gasv2/preprocessing/l1t0/outputs/cutadapt && \
##  less 5448_l1t0.fastq.gz | cutadapt -    -a ACAGGTTGGATGATAAGTCCCCGGTCTGACACATC -a ACAGTCCCCGGTCTGACACATCTCCCTAT -a ACAGTCCNCGGTCTGACACATCTCCCTAT  -e 0.1 -n 3 -m 7
##  -M 20 \
##   --too-short-output=/cbcb/nelsayed-scratch/atb/tnseq/gasv2/preprocessing/l1t0/outputs/cutadapt/5448_l1t0_tooshort.fastq \
##   --too-long-output=/cbcb/nelsayed-scratch/atb/tnseq/gasv2/preprocessing/l1t0/outputs/cutadapt/5448_l1t0_toolong.fastq \
##   --untrimmed-output=/cbcb/nelsayed-scratch/atb/tnseq/gasv2/preprocessing/l1t0/outputs/cutadapt/5448_l1t0_untrimmed.fastq \
##  2>outputs/cutadapt.err 1>5448_l1t0-trimmed.fastq

Alignments

For both the rnaseq and tnseq data, the reads were aligned with bowtie1 against the 5448 genome (NCBI accession: CP008776). Bowtie1 was chosen because all the reads are relatively short (either 50 or ~10 nucleotides). The same sets of parameters were chosen for both the rna and tn data for all the following steps.

cyoa --task rnaseq --method btmulti --input file.fastq.gz
## This generates a set of bowtie mapping scripts, sorted bam conversions, and htseq countings, they look like this:

## bowtie, 0 mismatches tnseq l1t0
## mkdir -p outputs/bowtie_mgas_5448 && sleep 10 && bowtie /cbcbhomes/abelew/libraries/genome/indexes/mgas_5448##  --best -v 0 -M 1  \
##  -p 1 \
##  -q  \
##  --un outputs/bowtie_mgas_5448/l1t0-v0M1_unaligned_mgas_5448.fastq \
##  --al outputs/bowtie_mgas_5448/l1t0-v0M1_aligned_mgas_5448.fastq \
##  -S outputs/bowtie_mgas_5448/l1t0-v0M1.sam \
##  2>outputs/bowtie_mgas_5448/l1t0-v0M1.err \
##  1>outputs/bowtie_mgas_5448/l1t0-v0M1.out

## Sam->Bam conversion
##samtools view -u -t /cbcbhomes/abelew/libraries/genome/mgas_5448.fasta -S outputs/bowtie_mgas_5448/l1t0-def.sam 1>outputs/bowtie_mgas_5448/l1t0-def.bam
##samtools sort -l 9 outputs/bowtie_mgas_5448/l1t0-def.bam outputs/bowtie_mgas_5448/l1t0-def-sorted
##rm outputs/bowtie_mgas_5448/l1t0-def.bam && rm outputs/bowtie_mgas_5448/l1t0-def.sam && mv outputs/bowtie_mgas_5448/l1t0-def-sorted.bam outputs/bowtie_mgas_5448/l1
##t0-def.bam && samtools index outputs/bowtie_mgas_5448/l1t0-def.bam
##bamtools stats -in outputs/bowtie_mgas_5448/l1t0-def.bam 2>outputs/bowtie_mgas_5448/l1t0-def.bam.stats 1>&2

## And counting via htseq
##htseq-count -q -f bam -s no  -i ID  -t gene \
##  outputs/bowtie_mgas_5448/5448_l1t0-trimmed_ta-v0M1.bam /cbcbhomes/abelew/libraries/genome/mgas_5448.gff \
##  1>outputs/bowtie_mgas_5448/5448_l1t0-trimmed_ta-v0M1.count 2>outputs/bowtie_mgas_5448/5448_l1t0-trimmed_ta-v0M1_htseq.err && \
##    xz -f -9e outputs/bowtie_mgas_5448/5448_l1t0-trimmed_ta-v0M1.count
### Small caveat, the rnaseq data is in fact stranded, so the -s no is set to -s yes here.