1 Introduction

One way which might be useful to learn about how to use these computers is to do everything through a document like this. In this markdown document, I can interleave text and code; I am currently (of course) typing arbitrary text.

But I can just as easily try out commands

start=$(pwd)/preprocessing
cd $start
rsync -av /mnt/cifs/sequencer/SeqData/NextSeq1000/250729_VL00136_107_AAH7C2HM5/Analysis/1/Data/fastq/ .

I then dumped a list of all the files to my editor and removed everything in the filenames starting at _S[0-9]+* This resulted in the following string which I should be able to use in order to organize the samples into reasonable sub directories.

samplenames="5448_dciaR_pSin_DMSO_R2 5448_dciaR_pSin_DMSO_R3 5448_dciaR_pSin_DMSO_R5 5448_dciaR_pSin_Heme_R2_58 5448_dciaR_pSin_Heme_R3_59 5448_dciaR_pSin_Heme_R5_60 5448_pJRS525_DMSO_R2_46 5448_pJRS525_DMSO_R3_76 5448_pJRS525_DMSO_R5_48 5448_pJRS525_Heme_R2_54 5448_pJRS525_Heme_R3_55 5448_pJRS525_Heme_R5_56 CDM_glucose_manLMN_A_5 CDM_glucose_manLMN_B_6 CDM_glucose_manLMN_C_7 CDM_glucose_manLMN_D_8 CDM_glucose_manLMN_Rescue_A_9 CDM_glucose_manLMN_Rescue_B_10 CDM_glucose_manLMN_Rescue_C_11 CDM_glucose_manLMN_Rescue_D_12 CDM_glucose_WT_5448_A_1 CDM_glucose_WT_5448_B_2 CDM_glucose_WT_5448_C_3 CDM_glucose_WT_5448_D_4 CDM_glucose_WT_NZ131_A_33 CDM_glucose_WT_NZ131_B_34 CDM_glucose_WT_NZ131_C_35 CDM_glucose_WT_NZ131_D_36 CDM_mannose_manLMN_Rescue_A_29 CDM_mannose_manLMN_Rescue_B_30 CDM_mannose_manLMN_Rescue_C_13cycles_67 CDM_mannose_manLMN_Rescue_D_32 CDM_mannose_WT_NZ131_A_41 CDM_mannose_WT_NZ131_B_42 CDM_mannose_WT_NZ131_C_43 CDM_mannose_WT_NZ131_D_44 CDM_manose_WT_5448_A_13cycles_63 CDM_manose_WT_5448_B_13cycles_64 CDM_manose_WT_5448_C_13cycles_65 CDM_manose_WT_5448_D_13cycles_66 CDM_sucrose_manLMN_A_17 CDM_sucrose_manLMN_B_18 CDM_sucrose_manLMN_C_19 CDM_sucrose_manLMN_D_20 CDM_sucrose_manLMN_Rescue_A_21 CDM_sucrose_manLMN_Rescue_B_22 CDM_sucrose_manLMN_Rescue_C_23 CDM_sucrose_manLMN_Rescue_D_24 CDM_sucrose_WT_5448_A_13 CDM_sucrose_WT_5448_B_14 CDM_sucrose_WT_5448_C_13cycles_61 CDM_sucrose_WT_5448_D_13cycles_62 CDM_sucrose_WT_NZ131_A_37 CDM_sucrose_WT_NZ131_B_38 CDM_sucrose_WT_NZ131_C_39 CDM_sucrose_WT_NZ131_D_40 Undetermined"
for i in $samplenames; do
    echo $i
    mkdir $i
    mv ${i}_*.fastq.gz ${i}/
done

2 Invoke trimomatic

Maybe I should just move to using fastp all the time?

start=$(pwd)
for i in $(/bin/ls -d * ); do
    cd ${start}/${i}
    mkdir unprocessed
    mv ../$i*.fastq.gz unprocessed
    input=$(/bin/ls unprocessed/* | tr '\n' ':')
    cyoa --jprefix 01 --method trimom --input $input
done
cd $start

We previously did ls -l together, in this instance I added the ‘tr’ which orders the files by time (t) and in reverse order (r). Therefore the bottom lines are the newest files.

Note, McIver lab explicitly asked that we move to the following assemblies when aligning the data: NZ_CP140117.2 for 5448 and GCA_000018125.5 for NZ131.

Let us take a moment and see which assemblies I have on hand at the moment. It turns out that GCA_000018125 has been suppressed by NCBI because it is not deemed complete; but it remains the only NZ131 assembly available.

In addition, NZ_CP140117 is the accession for the chromosome associated with assembly ‘GCF_034434355.2’

cd ~/libraries/genome/downloads
cyoa --method dldataset --input GCA_000018125.1
cyoa --method dldataset --input GCF_034434355.2
mv *.gbff.gz ../genbank
mv *.fsa ../fasta
mv *.gff ../gff
cd ../genbank
gunzip *.gbff.gz
ln -s GCF_034434355.2.gbff spyogenes_nv1.gbff
ln -s GCA_000018125.1.gbff spyogenes_nz131.gbff
cd ../gff
ln -s GCF_034434355.2.gff spyogenes_nv1.gff
head -n 40 spyogenes_nv1.gff
ln -s GCA_000018125.1.gff spyogenes_nz131.gff
head -n 40 spyogenes_nz131.gff
cd ../fasta
ln -s GCF_034434355.2.fsa spyogenes_nv1.fasta
ln -s GCA_000018125.1.fsa spyogenes_nz131.fasta

Now that the two genomes of interest are in place and have easy-to-read synonyms, let us do the mapping… I am just going to map all samples against both genomes.

I printed the first 40 lines of the gff files, it looks like the most appropriate types/tags are gene and locus_tag for both.

But first, use my test directory to make sure the genomes have been indexed.

cd test
less ../Undetermined/unprocessed/Undetermined_S0_R1_001.fastq.gz | head -n 100000 > r1.fastq
less ../Undetermined/unprocessed/Undetermined_S0_R2_001.fastq.gz | head -n 100000 > r2.fastq
mkdir unprocessed
mv *.fastq unprocessed
input=$(/bin/ls unprocessed/* | tr '\n' ':')
cyoa --jprefix 01 --method trimom --input $input

sp1=spyogenes_nv1
sp2=spyogenes_nz131
input=$(/bin/ls outputs/01trimomatic/*-trimmed.fastq.xz | tr '\n' ':')
echo $input
cyoa --jprefix 02 --method hisat --input $input --species $sp1 \
     --gff_type gene --gff_tag locus_tag --introns 0
cyoa --jprefix 02 --method hisat --input $input --species $sp2 \
     --gff_type gene --gff_tag locus_tag --introns 0
## wait a minute or so
ls -l outputs/02*
sp1=spyogenes_nv1
sp2=spyogenes_nz131
cd /scratch/atb/rnaseq/spyogenes_mgas_2025/preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v test); do
    cd ${start}/${i}
    input=$(/bin/ls outputs/01trimomatic/*-trimmed.fastq.xz | tr '\n' ':')
    cyoa --jprefix 02 --method hisat --input $input --species $sp1 \
         --gff_type gene --gff_tag locus_tag --introns 0
    cyoa --jprefix 02 --method hisat --input $input --species $sp2 \
         --gff_type gene --gff_tag locus_tag --introns 0
done
cd $start
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
message(paste0("Saving to ", savefile))
tmp <- sm(saveme(filename = savefile))
tmp <- loadme(filename = savefile)
---
title: "Streptococcus pyogenes 5448 and NZ131 comparisons."
author: "atb abelew@gmail.com"
bibliography: /home/trey/scratch/zotero_library/atb.bib
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
---


```{r options, include = FALSE}
library(dplyr)
library(forcats)
library(glue)
library(hpgltools)
library(tidyr)

knitr::opts_knit$set(progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 8, fig.height = 8, fig.retina = 2,
  out.width = "100%", dev = "png",
  dev.args = list(png = list(type = "cairo-png")))
old_options <- options(digits = 4, stringsAsFactors = FALSE, knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- Sys.getenv("VERSION")
rundate <- format(Sys.Date(), format = "%Y%m%d")

rmd_file <- "01datasets.Rmd"
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)
methods <- list("basic" = TRUE, "deseq" = TRUE, "dream" = FALSE,
                "ebseq" = FALSE, "edger" = TRUE, "limma" = TRUE, "noiseq" = TRUE)
data_structures <- c("methods")
```

# Introduction

One way which might be useful to learn about how to use these
computers is to do everything through a document like this.  In this
markdown document, I can interleave text and code; I am currently (of
course) typing arbitrary text.


But I can just as easily try out commands

```{bash, eval=FALSE}
start=$(pwd)/preprocessing
cd $start
rsync -av /mnt/cifs/sequencer/SeqData/NextSeq1000/250729_VL00136_107_AAH7C2HM5/Analysis/1/Data/fastq/ .
```

I then dumped a list of all the files to my editor and removed
everything in the filenames starting at _S[0-9]+*  This resulted in
the following string which I should be able to use in order to
organize the samples into reasonable sub directories.

```{bash, eval=FALSE}
samplenames="5448_dciaR_pSin_DMSO_R2 5448_dciaR_pSin_DMSO_R3 5448_dciaR_pSin_DMSO_R5 5448_dciaR_pSin_Heme_R2_58 5448_dciaR_pSin_Heme_R3_59 5448_dciaR_pSin_Heme_R5_60 5448_pJRS525_DMSO_R2_46 5448_pJRS525_DMSO_R3_76 5448_pJRS525_DMSO_R5_48 5448_pJRS525_Heme_R2_54 5448_pJRS525_Heme_R3_55 5448_pJRS525_Heme_R5_56 CDM_glucose_manLMN_A_5 CDM_glucose_manLMN_B_6 CDM_glucose_manLMN_C_7 CDM_glucose_manLMN_D_8 CDM_glucose_manLMN_Rescue_A_9 CDM_glucose_manLMN_Rescue_B_10 CDM_glucose_manLMN_Rescue_C_11 CDM_glucose_manLMN_Rescue_D_12 CDM_glucose_WT_5448_A_1 CDM_glucose_WT_5448_B_2 CDM_glucose_WT_5448_C_3 CDM_glucose_WT_5448_D_4 CDM_glucose_WT_NZ131_A_33 CDM_glucose_WT_NZ131_B_34 CDM_glucose_WT_NZ131_C_35 CDM_glucose_WT_NZ131_D_36 CDM_mannose_manLMN_Rescue_A_29 CDM_mannose_manLMN_Rescue_B_30 CDM_mannose_manLMN_Rescue_C_13cycles_67 CDM_mannose_manLMN_Rescue_D_32 CDM_mannose_WT_NZ131_A_41 CDM_mannose_WT_NZ131_B_42 CDM_mannose_WT_NZ131_C_43 CDM_mannose_WT_NZ131_D_44 CDM_manose_WT_5448_A_13cycles_63 CDM_manose_WT_5448_B_13cycles_64 CDM_manose_WT_5448_C_13cycles_65 CDM_manose_WT_5448_D_13cycles_66 CDM_sucrose_manLMN_A_17 CDM_sucrose_manLMN_B_18 CDM_sucrose_manLMN_C_19 CDM_sucrose_manLMN_D_20 CDM_sucrose_manLMN_Rescue_A_21 CDM_sucrose_manLMN_Rescue_B_22 CDM_sucrose_manLMN_Rescue_C_23 CDM_sucrose_manLMN_Rescue_D_24 CDM_sucrose_WT_5448_A_13 CDM_sucrose_WT_5448_B_14 CDM_sucrose_WT_5448_C_13cycles_61 CDM_sucrose_WT_5448_D_13cycles_62 CDM_sucrose_WT_NZ131_A_37 CDM_sucrose_WT_NZ131_B_38 CDM_sucrose_WT_NZ131_C_39 CDM_sucrose_WT_NZ131_D_40 Undetermined"
for i in $samplenames; do
    echo $i
    mkdir $i
    mv ${i}_*.fastq.gz ${i}/
done
```

# Invoke trimomatic

Maybe I should just move to using fastp all the time?

```{bash, eval=FALSE}
start=$(pwd)
for i in $(/bin/ls -d * ); do
    cd ${start}/${i}
    mkdir unprocessed
    mv ../$i*.fastq.gz unprocessed
    input=$(/bin/ls unprocessed/* | tr '\n' ':')
    cyoa --jprefix 01 --method trimom --input $input
done
cd $start
```

We previously did ls -l together, in this instance I added the 'tr'
which orders the files by time (t) and in reverse order (r).
Therefore the bottom lines are the newest files.

Note, McIver lab explicitly asked that we move to the following
assemblies when aligning the data:  NZ_CP140117.2 for 5448 and
GCA_000018125.5 for NZ131.

Let us take a moment and see which assemblies I have on hand at the
moment.  It turns out that GCA_000018125 has been suppressed by NCBI
because it is not deemed complete; but it remains the only NZ131
assembly available.

In addition, NZ_CP140117 is the accession for the chromosome
associated with assembly 'GCF_034434355.2'

```{bash, eval=FALSE}
cd ~/libraries/genome/downloads
cyoa --method dldataset --input GCA_000018125.1
cyoa --method dldataset --input GCF_034434355.2
mv *.gbff.gz ../genbank
mv *.fsa ../fasta
mv *.gff ../gff
cd ../genbank
gunzip *.gbff.gz
ln -s GCF_034434355.2.gbff spyogenes_nv1.gbff
ln -s GCA_000018125.1.gbff spyogenes_nz131.gbff
cd ../gff
ln -s GCF_034434355.2.gff spyogenes_nv1.gff
head -n 40 spyogenes_nv1.gff
ln -s GCA_000018125.1.gff spyogenes_nz131.gff
head -n 40 spyogenes_nz131.gff
cd ../fasta
ln -s GCF_034434355.2.fsa spyogenes_nv1.fasta
ln -s GCA_000018125.1.fsa spyogenes_nz131.fasta
```

Now that the two genomes of interest are in place and have
easy-to-read synonyms, let us do the mapping...  I am just going to
map all samples against both genomes.

I printed the first 40 lines of the gff files, it looks like the most
appropriate types/tags are gene and locus_tag for both.

But first, use my test directory to make sure the genomes have been
indexed.

```{bash, eval=FALSE}
cd test
less ../Undetermined/unprocessed/Undetermined_S0_R1_001.fastq.gz | head -n 100000 > r1.fastq
less ../Undetermined/unprocessed/Undetermined_S0_R2_001.fastq.gz | head -n 100000 > r2.fastq
mkdir unprocessed
mv *.fastq unprocessed
input=$(/bin/ls unprocessed/* | tr '\n' ':')
cyoa --jprefix 01 --method trimom --input $input

sp1=spyogenes_nv1
sp2=spyogenes_nz131
input=$(/bin/ls outputs/01trimomatic/*-trimmed.fastq.xz | tr '\n' ':')
echo $input
cyoa --jprefix 02 --method hisat --input $input --species $sp1 \
     --gff_type gene --gff_tag locus_tag --introns 0
cyoa --jprefix 02 --method hisat --input $input --species $sp2 \
     --gff_type gene --gff_tag locus_tag --introns 0
## wait a minute or so
ls -l outputs/02*
```


```{bash, eval=FALSE}
sp1=spyogenes_nv1
sp2=spyogenes_nz131
cd /scratch/atb/rnaseq/spyogenes_mgas_2025/preprocessing
start=$(pwd)
for i in $(/bin/ls -d * | grep -v test); do
    cd ${start}/${i}
    input=$(/bin/ls outputs/01trimomatic/*-trimmed.fastq.xz | tr '\n' ':')
    cyoa --jprefix 02 --method hisat --input $input --species $sp1 \
         --gff_type gene --gff_tag locus_tag --introns 0
    cyoa --jprefix 02 --method hisat --input $input --species $sp2 \
         --gff_type gene --gff_tag locus_tag --introns 0
done
cd $start
```


```{r saveme, eval=FALSE}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
message(paste0("Saving to ", savefile))
tmp <- sm(saveme(filename = savefile))
```

```{r loadme_after, eval=FALSE}
tmp <- loadme(filename = savefile)
```
