---
title: "Preprocessing the IPRGC samples de-novo."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    toc_float: true
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
  font-size: 16px
}
body .main-container {
  max-width: 1600px;
}
</style>

```{r options, include=FALSE}
library(hpgltools)
library(reticulate)
tt <- try(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(
  progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 8, fig.height = 8, fig.retina = 2,
  out.width = "100%", dev = "png",
  dev.args = list(png = list(type = "cairo-png")))
old_options <- options(digits = 4, stringsAsFactors = FALSE, knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- "202305"
previous_file <- ""
ver <- format(Sys.Date(), "%Y%m%d")

##tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
rmd_file <- "preprocessing.Rmd"
```

# Introduction

This document aims to record the tasks performed to re-preprocess the
IPRGC samples of the Speer lab.  I have every confidence that Theresa
did everything perfectly; the only reason I wish to redo these tasks
is that we slightly reorganized the data to reflect the new files
produced by April following a reassessment of the samples in ~202404.
It turns out that a small number of samples were misnamed, and
therefore it is difficult to ensure with absolute certainty that the
existing state of Theresa's working tree is representative of the new
state of the samples.

It is therefore worth briefly explaining what I did before starting
this document:

1.  I spent some time with Najib and April in order to create the
    sample sheet '20240521_speer_consolidated_sample_sheet.xlsx'.
    This is mostly identical to Theresa's with a few caveats:

    a.  There is now a series of columns with the suffix 'AH'.  With
        Najib's help, I reconciled the existing IPRGC IDs created by
        Theresa with the extant samples and copied April's sample
        sheet information into these columns.  They have the names
        'Project # AH', 'Sample # AH', 'Rashmi's Sample Name AH',
        'Prep & Clean AH', 'Sequencing Run AH', 'Read 1 AH', 'Read 2
        AH', 'Reload & Map Again AH' and comprise columns AO-AV.
        These columns are only populated for samples iprgc62-iprgc130,
        because these are (I think) the only samples sequenced by
        April, and definitely the only samples potentially affected by
        the file re-copying.
    b.  Given these new columns, I created a series of columns with
        the suffix 'atb': 'genotype_atb', 'location_atb', 'time_atb',
        'time_geo_location_source_atb', and 'rashmi_code_atb'.  These
        columns are generally identical to Theresa's columns of the
        same information with potential exceptions in the cases of the
        files which were moved/reorganized.  The
        'time_geo_location_source_atb' column signifies where I took
        the information for the other 'atb' columns, e.g. if it says
        'TAA_meta', then I took the genotype etc. classifications from
        Theresa's metadata without any interpretation/modification.
        When it says 'AH filename', then I took that, as one might
        expect, from April's newly provided filenames.
    c.  I later added Column AN when surveying Najib's raw_data tree
        for these samples.  Najib had previously not populated his raw
        directory tree with any sample >= iprgc_64, so I recorded in
        this column any actions I took.  These actions fell into three
        categories: (1) nothing (the first 2); (2) almost nothing (the
        raw files were actually in Najib's tree, but he had not yet
        put them into the 'unprocessed' directory); (3) nothing
        existed yet, so I copied the files from my fresh tree
        (downloaded from the sequencer earlier in 202405) to the
        appropriate directories in Najib's tree.  I then copy/pasted
        my commands into this column so that we may later double-check
        them.  I probably could have put this information into this
        file...
    d.  At this moment I added a column 'sequence_type_atb' which is
        either 'previous' or 'takara' so that I can record the likely
        required trimming process.  I am hopeful that this column will
        not be needed and that I will be able to use fastp to treat
        all samples, old and new, identically.  I hypothesize this
        is possible after spending a little time reading the Takara
        Cogent package code.  It seems to me that their trimming
        operations are either close to or identical to what fastp will
        perform.  If that is not true, then I will do my default
        Trimmomatic/fastp trimming for the previous samples and the
        Cogent method for the others.  It should also be noted that I
        am assuming that any sample from before 2021 is not takara
        and vice versa (e.g. anything received with a date after
        20201221 is takara).

2.  It is hoped that 1c. above has left me in a consistent state and
    that I have every sample required in the appropriate location in
    Najib's tree and that they match the modified sample sheet.
3.  Thus I created a fresh working tree named 'mmusculus_iprgc_2024'
    and populated it with empty directories 'sample_sheets',
    'reference', and 'preprocessing'.
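
The date rule assumed in item 1d can be written down explicitly.  The
following is only a minimal sketch, assuming the received dates are
recorded as YYYYMMDD integers and reading the rule as "strictly after
20201221 means takara".  The function name `sequence_type_for_date` is
hypothetical and the two dates in the example calls are made up for
illustration.

```{bash, eval=FALSE}
## Hypothetical helper: classify a sample by its received date (YYYYMMDD).
## Assumption: anything strictly after 20201221 is 'takara', everything
## else is 'previous'.
sequence_type_for_date() {
    if [ "$1" -gt 20201221 ]; then
        echo "takara"
    else
        echo "previous"
    fi
}

## Made-up dates, one on each side of the cutoff.
sequence_type_for_date 20191105
sequence_type_for_date 20220314
```

If the real cutoff turns out to be different, only the single constant
in the comparison needs to change.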

# Populating the raw data

From this point on, any command run will be performed in this text
document, so you will get to read along as I make my
decisions/mistakes.

I decided I will work using the read-only copies in Najib's tree.

```{bash, eval=FALSE}
cd preprocessing
ln -s /fs/cbcb-lab/nelsayed/raw_data/speer/iprgc* .
```

I now have 129 symlinks in my preprocessing directory, with one hole
where a sample got changed and a couple of new iprgc IDs due to the
changes introduced by April's reprocessing.
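
The count and the hole can be checked rather than eyeballed.  This is
a sketch under two assumptions: the entries are named `iprgc_NN` with
zero-padded numbers (as the globs used later suggest), and the IDs run
from 1 to 130.  The helper name `missing_ids` is hypothetical.

```{bash, eval=FALSE}
## Hypothetical helper: print every iprgc ID between $2 and $3 which has
## no entry in directory $1.  The body runs in a subshell, ( ... ), so its
## variables do not leak into the calling shell.
missing_ids() (
    dir="$1"; n="$2"; last="$3"
    while [ "$n" -le "$last" ]; do
        name=$(printf 'iprgc_%02d' "$n")
        ## -e follows symlinks, so a dangling link is reported as missing too.
        [ -e "${dir}/${name}" ] || echo "${name}"
        n=$((n + 1))
    done
)

## Count the symlinks, then list the holes in the assumed range 1-130.
find . -maxdepth 1 -type l -name 'iprgc*' | wc -l
missing_ids . 1 130
```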

After staring at the screen for a moment, I decided to split these
samples into subdirectories named after the sequencing year, thus:

```{bash, eval=FALSE}
mkdir 2019 2020 2022 2023

mv iprgc_0[1-8] 2019

mv iprgc_09 iprgc_1* iprgc_2* iprgc_3* iprgc_4* iprgc_5* iprgc_60 iprgc_61 2020

mv iprgc_6* iprgc_7* iprgc_8[0-4] 2022

mv iprgc* 2023

## haha I was a doofus and moved the 100s into 2020 directory because iprgc_1* matches 100+
mv 2020/iprgc_1[0-9][0-9] 2023
```
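
The `iprgc_1*` slip comes from globs matching on text rather than on
numbers.  A sketch of an alternative which compares the numeric part of
each name is below.  The ranges are read off the mv commands above
(01-08, 09-61, 62-84, and everything later), the helper name
`year_for_sample` is hypothetical, and the loop only prints the mv
commands it would run.

```{bash, eval=FALSE}
## Hypothetical helper: map a sample directory name to its year bin.
year_for_sample() (
    num="${1#iprgc_}"    # strip the prefix, leaving e.g. "08" or "101"
    ## [ ... -le ... ] compares as decimal, so the zero padding is harmless.
    if   [ "$num" -le 8 ];  then echo 2019
    elif [ "$num" -le 61 ]; then echo 2020
    elif [ "$num" -le 84 ]; then echo 2022
    else                         echo 2023
    fi
)

for d in iprgc_*; do
    [ -e "$d" ] || continue                       # the glob matched nothing
    echo mv "$d" "$(year_for_sample "$d")/"       # dry run: drop 'echo' to move
done
```

Because iprgc_100 and above compare as larger than 84, they fall
through to 2023 without needing a second pass.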

I am doing this because I am guessing there may be relevant
differences in the various rounds of sequencing.  In addition I am
reasonably certain all the takara samples are in 2022 and 2023.

Ok, so now I should be able to comfortably process different
sequencing platform data using for-loops without worrying too much
about shenanigans.

I am therefore going to take a moment and peek in Theresa's tree and
see how the early samples were trimmed.

Looking in Theresa's tree, I see that she likely kept my original
cutadapt-trimmed data and also performed a Trimmomatic run.  I took a
moment and saw that I had used cutadapt because the version of
Trimmomatic at that time did not have the primers in it for this
library type, but that has since been remedied.  I therefore assume
that I can comfortably trim with either Trimmomatic or fastp without
problems.  I will definitely need to use fastqc after trimming and
make certain that none of the weirdo i3/i5 (or whatever they are
called) adapters remain.
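
One way to make the post-trimming fastqc check less manual is to scan
the `summary.txt` file which FastQC writes inside each report.  This is
a sketch only.  It assumes the reports have already been unzipped, and
the helper name `adapter_failures` and the path in the usage comment
are hypothetical because I have not yet looked at where the pipeline
puts its fastqc outputs.

```{bash, eval=FALSE}
## Hypothetical helper: for each FastQC summary.txt given as an argument,
## report the "Adapter Content" module when its status is not PASS.
adapter_failures() {
    for summary in "$@"; do
        ## summary.txt is tab-separated: status, module name, input file.
        awk -F '\t' -v src="$summary" \
            '$2 == "Adapter Content" && $1 != "PASS" { print src ": " $1 " (" $3 ")" }' \
            "$summary"
    done
}

## Usage, with a made-up directory layout:
##   adapter_failures iprgc_*/some_fastqc_output/summary.txt
```

No output means every report passed the adapter module; it does not
replace actually looking at the overrepresented sequences.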

Oh, an important note: I only configured fastp recently in my pipeline
and it defaults to trying to do a UMI extraction, which is
inappropriate for these first samples.  I need to remember to tell it
not to do that for these.  Also, I do not fully understand UMIs yet.

```{bash, eval=FALSE}
module add cyoa/202402
cd 2019
start=$(pwd)
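## Iterate over the sample directories; a glob avoids parsing ls output.
for i in iprgc*; do
    echo "Starting $i"
    cd "$i" || continue
    ## Join the input files into a single colon-separated string.
    inputs=$(/bin/ls unprocessed/*.fastq.gz | tr '\n' ':' | sed 's/:$//')
    echo "The inputs are: ${inputs}"
    ## Reminder from above: fastp's UMI extraction is inappropriate for these samples.
    cyoa --method trim --input "${inputs}" --jprefix 01
    cyoa --method fastp --input "${inputs}" --jprefix 01
    cd "$start" || break
done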
```
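
The colon-separated input string in the loop above can also be built
without piping ls through tr and sed.  This is a small sketch; the
helper name `join_with_colons` is hypothetical and the two file names
in the example call are placeholders.

```{bash, eval=FALSE}
## Hypothetical helper: join all arguments with ':'.
## "$*" joins the positional parameters using the first character of IFS;
## the subshell body, ( ... ), keeps the IFS change from leaking out.
join_with_colons() (
    IFS=':'
    echo "$*"
)

## Placeholder file names; inside a sample directory one would use
##   inputs=$(join_with_colons unprocessed/*.fastq.gz)
join_with_colons r1.fastq.gz r2.fastq.gz
```

Since the shell expands the glob directly into arguments, file names
with spaces survive intact, which is not true of the ls-based version.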

Next steps, which will have to wait a few minutes: download a new
musculus assembly and check that my trimming results look like
Theresa's.
