1 Introduction

This document records the tasks performed to re-preprocess the IPRGC samples of the Speer lab. I have every confidence that Theresa did everything perfectly; the only reason I wish to redo these tasks is that we slightly reorganized the data to reflect the new files produced by April following a reassessment of the samples in ~202404. It turns out that a small number of samples were misnamed, and it is therefore difficult to ensure with absolute certainty that the existing state of Theresa’s working tree is representative of the new state of the samples.

It is therefore worth briefly explaining what I did before starting this document:

  1. I spent some time with Najib and April in order to create the sample sheet ‘20240521_speer_consolidated_sample_sheet.xlsx’. This is mostly identical to Theresa’s with a few caveats:

    1. There is now a series of columns with the suffix ‘AH’. With Najib’s help, I reconciled the existing IPRGC IDs created by Theresa with the extant samples and copied April’s sample sheet information into these columns. They have the names ‘Project # AH’, ‘Sample # AH’, ‘Rashmi’s Sample Name AH’, ‘Prep & Clean AH’, ‘Sequencing Run AH’, ‘Read 1 AH’, ‘Read 2 AH’, ‘Reload & Map Again AH’ and comprise columns AO-AV. These columns are only populated for samples iprgc62-iprgc130, because these are (I think) the only samples sequenced by April, and definitely the only samples potentially affected by the file re-copying.
    2. Given these new columns, I created a series of columns with the suffix ‘atb’: ‘genotype_atb’, ‘location_atb’, ‘time_atb’, ‘time_geo_location_source_atb’, and ‘rashmi_code_atb’. These columns are generally identical to Theresa’s columns of the same information, with potential exceptions for the files which were moved/reorganized. The ‘time_geo_location_source_atb’ column records where I took the information for the preceding columns, e.g. if it says TAA_meta, then I took the genotype, location, and time classifications from Theresa’s metadata without any interpretation/modification. When it says ‘AH filename’, then I took them, as one might expect, from April’s newly provided filenames.
    3. I later added Column AN when assaying Najib’s raw_data tree for these samples. Najib had previously not populated his raw directory tree with any sample >= iprgc_64, so I recorded in this column any actions I took. These actions fell into three categories: (1) Nothing (the first 2); (2) Almost nothing (the raw files were actually in Najib’s tree, but he had not yet put them into the ‘unprocessed’ directory); (3) Nothing existed yet, so I copied the files from my fresh tree (downloaded from the sequencer earlier in 202405) to the appropriate directories in Najib’s tree. I then copy/pasted my commands into this column so that we may later double-check them. I probably could have put this information into this file…
    4. At this moment I added a column ‘sequence_type_atb’ which is either ‘previous’ or ‘takara’ so that I can record the likely required trimming process. I am hopeful that this column will not be needed and that I will be able to use fastp to treat all samples, old and new, identically. I hypothesize this is possible after spending a little time reading the takara cogent package code. It seems to me that their trimming operations are either close to or identical to what fastp will perform. If that is not true, then I will do my default Trimmomatic/fastp trimming for the previous samples and the cogent method for the others. Note also that I am assuming that any sample < 2021 is not takara and vice versa (e.g. anything received with a date after 20201221 is takara).
  2. It is hoped that step 1.3 above has left me in a consistent state: that I have every required sample in the appropriate location in Najib’s tree and that they match the modified sample sheet.

  3. Thus I created a fresh working tree named ‘mmusculus_iprgc_2024’ and populated it with empty directories ‘sample_sheets’, ‘reference’, and ‘preprocessing’.
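
For reference, the empty working tree described in step 3 can be created with something like the following. This is a minimal sketch: the directory names come from the text above, while the function name `make_tree` and the optional base-directory argument are hypothetical conveniences, not something recorded in these notes.

```shell
# make_tree [BASE]: create the empty working tree under BASE (default: the
# current directory) and print its path. 'make_tree' is a hypothetical name.
make_tree() {
    base=${1:-.}
    tree="$base/mmusculus_iprgc_2024"
    # -p makes this safe to re-run if the tree already exists.
    mkdir -p "$tree/sample_sheets" "$tree/reference" "$tree/preprocessing"
    printf '%s\n' "$tree"
}
```

Calling `make_tree` with no argument creates the tree in the current directory.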

2 Populating the raw data

From this point on, any command run will be performed in this text document, so you will get to read along as I make my decisions/mistakes.

I decided to work using the read-only copies in Najib’s tree.

cd preprocessing
ln -s /fs/cbcb-lab/nelsayed/raw_data/speer/iprgc* .

I now have 129 symlinks in my preprocessing directory, with one hole where a sample got changed and a couple of new iprgc IDs due to the changes introduced by April’s reprocessing.
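
A count like this, and the location of the hole, can be double-checked with a couple of small helpers. This is only a sketch: the function names are hypothetical, and it assumes the IDs follow the `iprgc_NN` pattern (zero-padded to two digits, so `iprgc_01` through `iprgc_130`) that the commands in this section imply.

```shell
# count_links DIR: number of iprgc_* symlinks directly inside DIR.
count_links() {
    find "$1" -maxdepth 1 -type l -name 'iprgc_*' | wc -l | tr -d ' '
}

# list_missing_ids DIR MAX: print every ID in iprgc_01..iprgc_MAX that has
# no entry in DIR. The -L test also catches dangling symlinks, which -e
# alone would report as missing.
list_missing_ids() {
    dir=$1
    max=$2
    n=1
    while [ "$n" -le "$max" ]; do
        id=$(printf 'iprgc_%02d' "$n")
        [ -e "$dir/$id" ] || [ -L "$dir/$id" ] || echo "$id"
        n=$((n + 1))
    done
}
```

Run from the preprocessing directory, `count_links .` should agree with the number above, and `list_missing_ids . 130` should print the hole.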

After staring at the screen for a moment, I decided to split these samples into subdirectories named after the sequencing year, thus:

mkdir 2019 2020 2022 2023

mv iprgc_0[1-8] 2019

mv iprgc_09 iprgc_1* iprgc_2* iprgc_3* iprgc_4* iprgc_5* iprgc_60 iprgc_61 2020

mv iprgc_6* iprgc_7* iprgc_8[0-4] 2022

mv iprgc* 2023

## haha I was a doofus and moved the 100s into 2020 directory because iprgc_1* matches 100+
mv 2020/iprgc_1[0-9][0-9] 2023

I am doing this because I am guessing there may be relevant differences in the various rounds of sequencing. In addition I am reasonably certain all the takara samples are in 2022 and 2023.
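
The glob mishap above (`iprgc_1*` also matching the 100s) can be avoided by comparing the numeric part of each ID instead of matching prefixes. The following is a sketch under assumptions: the year boundaries (01-08, 09-61, 62-84, 85 and up) are read off the `mv` commands above, and the function name `split_by_year` is hypothetical.

```shell
# split_by_year DIR: move each iprgc_NN entry in DIR into a per-year
# subdirectory, choosing the year by numeric comparison rather than by glob.
# The body runs in a subshell, so the caller's working directory is untouched.
split_by_year() (
    cd "$1" || exit 1
    mkdir -p 2019 2020 2022 2023
    for d in iprgc_*; do
        n=${d#iprgc_}
        # Skip anything whose suffix is not purely numeric (including the
        # unexpanded pattern when nothing matches).
        case "$n" in ''|*[!0-9]*) continue ;; esac
        n=${n#0}    # drop one leading zero so 08 and 09 compare as 8 and 9
        if [ "$n" -le 8 ]; then year=2019
        elif [ "$n" -le 61 ]; then year=2020
        elif [ "$n" -le 84 ]; then year=2022
        else year=2023
        fi
        mv "$d" "$year"/
    done
)
```

With this approach `iprgc_100` lands in 2023 directly, because 100 is compared as a number rather than as a string starting with `1`.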

Ok, so now I should be able to comfortably process different sequencing platform data using for-loops without worrying too much about shenanigans.

I am therefore going to take a moment and peek in Theresa’s tree and see how the early samples were trimmed.

Looking in Theresa’s tree, I see that she likely kept my original cutadapt-trimmed data and also performed a Trimmomatic run. I took a moment and saw that I had used cutadapt because the version of Trimmomatic at that time did not include the primers for this library type; that has since been remedied. I therefore assume that I can comfortably trim with either Trimmomatic or fastp without problems. I will definitely need to run fastqc after trimming and make certain that none of the weirdo i3/i5 (or whatever they are called) adapters remain.
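
As a quick complement to fastqc, leftover adapter sequence can be spot-checked by counting reads that still contain a given motif. This is a rough sketch, not a replacement for fastqc: the function name `count_reads_with` is hypothetical, and the motif is deliberately left as an argument because the actual adapter sequences for this library type are not recorded in these notes.

```shell
# count_reads_with FASTQ_GZ MOTIF: print how many reads in a gzipped fastq
# file contain MOTIF as an exact substring of their sequence line.
count_reads_with() {
    fastq_gz=$1
    motif=$2
    # Line 2 of every 4-line fastq record is the sequence; grep -c counts
    # matching lines. '|| true' keeps a count of 0 from being treated as an
    # error (grep exits non-zero when nothing matches).
    gzip -dc "$fastq_gz" | awk 'NR % 4 == 2' | grep -c -- "$motif" || true
}
```

A result of 0 for each adapter motif on the trimmed files is the hoped-for outcome; anything else is a reason to look more closely at the fastqc report. Exact matching will miss adapters containing sequencing errors, so this is only a sanity check.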

Oh, an important note: I only configured fastp recently in my pipeline and it defaults to trying to do a UMI extraction, which is inappropriate for these first samples. I need to remember to tell it not to do that for these. Also, I do not fully understand UMIs yet.

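module add cyoa/202402
cd 2019
start=$(pwd)
## Run both trimmers on every 2019 sample, reading from its unprocessed/ directory.
for i in $(/bin/ls -d iprgc*); do
    echo "Starting $i"
    ## Skip the sample rather than run in the wrong directory if the cd fails.
    cd "$i" || continue
    ## Join the fastq files into a single colon-separated string for --input.
    inputs=$(/bin/ls unprocessed/*.fastq.gz | tr '\n' ':' | sed 's/:$//')
    echo "The inputs are: ${inputs}"
    cyoa --method trim --input "${inputs}" --jprefix 01
    cyoa --method fastp --input "${inputs}" --jprefix 01
    cd "$start"
done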
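
The colon-separated input string used in the loop above can also be built without parsing `ls` output, which avoids surprises from aliases or odd filenames. A minimal sketch, with the hypothetical helper name `join_inputs`; the `--input` usage in the comment simply mirrors the cyoa commands above.

```shell
# join_inputs DIR: print the *.fastq.gz files in DIR as one colon-separated
# string (empty if there are none).
join_inputs() {
    dir=$1
    joined=""
    for f in "$dir"/*.fastq.gz; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -e "$f" ] || continue
        # Prepend a colon only when something has already been collected.
        joined="${joined:+${joined}:}${f}"
    done
    printf '%s\n' "$joined"
}

# Usage inside a sample directory, mirroring the loop above:
#   cyoa --method trim --input "$(join_inputs unprocessed)" --jprefix 01
```

An empty result is an easy signal that a sample directory has no raw files yet and should be skipped.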

Next steps, which will have to wait a few minutes: download a new musculus assembly and check that my trimming results look like Theresa’s.

