---
title: "Notes on our persistence project."
author: "atb abelew@gmail.com"
bibliography: /home/trey/scratch/zotero_library/atb.bib
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  rmdformats::readthedown:
    code_download: true
    code_folding: show
    df_print: paged
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    width: 300
    keep_md: false
    mode: selfcontained
    toc_float: true
  BiocStyle::html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    toc_float: true
---

<style type="text/css">
body, td {
  font-size: 16px;
}
code.r {
  font-size: 16px;
}
pre {
  font-size: 16px
}
body .main-container {
  max-width: 1600px;
}
</style>

```{r options, include=FALSE}
library(hpgltools)
library(reticulate)
tt <- try(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(
  progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 8, fig.height = 8, fig.retina = 2,
  out.width = "100%", dev = "png",
  dev.args = list(png = list(type = "cairo-png")))
old_options <- options(digits = 4, stringsAsFactors = FALSE, knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- "202305"
previous_file <- ""
ver <- format(Sys.Date(), "%Y%m%d")

##tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
rmd_file <- "template.Rmd"
```

# Introduction

Assuming we get to a place where we want to publish this work, this
will be an introduction to what happened here.  In the interim, it is
a place for me to take some notes and lay out some ideas.

# SL-Seq

The spliced leader sequencing protocol used follows
@cuypersMultiplexedSplicedLeaderSequencing2017 very closely.  I
would like to write down my understanding of the protocol here:

## Given: Capped mRNA with SL>5'>CDS>3'>pA

One way to imagine our canonical protist RNA:

<pre>
7mGCapnnnnnnnnnnnnnnnnnnnnnAGTTTCTGTACTTTATTGGxxxxxxxxxxxxxxxxxxxxxx5pUTRAUGnnnnnnnnnnnnnnnnnnnnnUAA3pUTRAAAAAAAAAAAAA
</pre>

Note: I did not type out the complete SL sequence here because it is
somewhat variable [@gibsonStructureSequenceVariation2000].

The full 'canonical' sequence looks like this:

<pre>
7mGCapAACTAACGCTATTATTGATACAGTTTCTGTACTATATTG
</pre>

with the caveat that I am pretty sure it leaves off the 2 or 3 nt at
the 5' end which are methylated and modified.  Also note that the n's
at the 5' end of my SL sequence above are all in the variant region of
the Trypanosome SL.

Here is the relevant alignment from the paper above:

![Note the portion in the primer](1-s2.0-S0166685100001936-gr2_lrg.gif)

I am reasonably certain that *L. panamensis* has a substitution at
~ position 17 in some but not all SL sequences (I think it has the AGA
like *T. brucei* but also ACA?).  Either way, the multiplexed paper
chose wisely to avoid those shenanigans...  With all that in mind, let
us step through the protocol.

## Step 1: 1st Strand cDNA synthesis

First strand cDNA with 5'GTATAAGACACAGNNNNNNNNN3': I am reasonably
certain the random N9 tail anneals semi-randomly and leaves a 5'
hanging end which happens to coincide with the Nextera handle;
I assume this will be important later.  I also assume, without reading
the next step, that the protocol will add some RNase H or T1 to blow
up the RNA in the nascent cDNA:RNA duplex.

So now we have an ssDNA which looks something like this:

<pre>
5'GTATAAGACACAGNNNNNNNNN....CAATATAGTACAGAAACTGTATCAATAATAGCGTTA3'
  ^nextera handle (trans2)  ^RC of the SL
</pre>

with the caveat that the .... is likely quite long (mean ~ 1700
nt) and those last ~ 18 nt are likely to have a few variants depending
on species.
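
As a sanity check on the diagram above, here is a minimal sketch in base R that reverse-complements the 'canonical' SL and compares it with the 3' end I typed.  The helper `revcomp()` and the variable names are my own illustration and are not part of hpgltools.

```{r sl_revcomp}
## Illustrative helper (not from hpgltools): reverse-complement a DNA string.
revcomp <- function(seq) {
  bases <- rev(strsplit(seq, "")[[1]])
  paste(chartr("ACGTN", "TGCAN", bases), collapse = "")
}

sl <- "AACTAACGCTATTATTGATACAGTTTCTGTACTATATTG"    # 'canonical' SL as typed above
typed_rc <- "CAATATAGTACAGAAACTGTATCAATAATAGCGTTA" # 3' end of the first strand as typed above

sl_rc <- revcomp(sl)
sl_rc
## Is the typed 3' end a prefix of the full reverse complement?
startsWith(sl_rc, typed_rc)
## Whatever the typed version lacks at its 3' end:
substring(sl_rc, nchar(typed_rc) + 1)
```

The typed 3' end agrees with the full reverse complement except that it stops three bases short: the trailing GTT, i.e. the complement of the first AAC of the SL, is not in the diagram.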

## Step 2: Destroy the RNA strand and purify

We don't want a bunch of weirdo concatemers to form.

## Step 3: Second strand synthesis

Second strand cDNA synthesis (I probably would have just done PCR
here, which I am guessing would not have worked for reasons I
cannot remember).  The noteworthy part: add 3 µL of 10 µM Strand 2
Leishmania primer.  My first guess was that this primer looks
like the primer in step 1 with another handle for indexing.  Nope,
I found it in the text: 5'TCAGTTTCTGTA3'

So, this will make the hybrid:

<pre>
  ...CAATATAGTACAGAAACTGTATCAATAATAGCGTTA3'
      .......ATGTCTTTGACT5'  : once again missing the variant bases.
</pre>

This must mean that the following PCR steps add the overhangs for the
rest of the library adapters?
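
To double check where the strand 2 primer lands, the following sketch slides the primer's reverse complement along the first-strand 3' end typed above and counts matching bases at each offset.  It uses base R only.  The helper and variable names are hypothetical, and the template is the 'canonical' SL from these notes rather than a verified *Leishmania* sequence.

```{r strand2_primer_site}
## Illustrative helper (not from hpgltools): reverse-complement a DNA string.
revcomp <- function(seq) {
  bases <- rev(strsplit(seq, "")[[1]])
  paste(chartr("ACGTN", "TGCAN", bases), collapse = "")
}

first_strand_3p <- "CAATATAGTACAGAAACTGTATCAATAATAGCGTTA" # as typed above
strand2_primer <- "TCAGTTTCTGTA"

## The primer pairs with the stretch of first strand that looks like its
## reverse complement, so slide that across the template and count matches.
target <- strsplit(revcomp(strand2_primer), "")[[1]]
template <- strsplit(first_strand_3p, "")[[1]]
starts <- seq_len(length(template) - length(target) + 1)
paired <- vapply(starts, function(i) {
  sum(template[i:(i + length(target) - 1)] == target)
}, integer(1))

best_start <- which.max(paired)
best_start
paired[best_start]
## Position(s) within the best window which fail to pair.
unpaired <- which(template[best_start:(best_start + length(target) - 1)] != target)
unpaired
```

With the sequences as typed, the best window pairs 11 of the 12 primer bases.  The single unpaired position is the last one in the window, which corresponds to the 5'-most T of the primer, so the 3' end that the polymerase extends from is fully paired.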

## Step 4: Adapter addition

I would note that the extension time is only 60 seconds; where does
this cap the dsDNA amplicon size given the HiFi HotStart mix?  (I used
to know this, but no longer.)
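
The 60 second question is a simple division once the polymerase's extension rate is known.  The rates in this sketch are placeholders chosen for illustration, not values taken from the enzyme's manual or from the paper; the real number should be substituted before drawing any conclusion.

```{r extension_cap}
extension_seconds <- 60
## Hypothetical extension rates in seconds per kb; replace these with the
## value from the polymerase documentation.
assumed_sec_per_kb <- c(15, 30, 60)
approx_max_kb <- extension_seconds / assumed_sec_per_kb
data.frame(sec_per_kb = assumed_sec_per_kb, approx_max_kb = approx_max_kb)
```

Under these assumed rates the longest fully extended product would be somewhere between 1 and 4 kb, which is worth comparing against the ~ 1700 nt mean noted in step 1.  This is a soft limit at best, since incomplete products can still be finished in later cycles.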

In any event, the following shows the regions of identity and where
they will therefore anneal on the opposite strand.

<pre>
                               GTATAAGACACAG: I am the trans2 from step 1
Forward: 5'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG3'
</pre>

<pre>
Reverse: 5'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGATCAGTTTCTGTA3'
invariant SL region at 3' of the other strand :CAGTTTCTGTA: First time I wrote it I put the whole SL and was confused why they didn't match
                                              TCAGTTTCTGTA3'

</pre>

I typed these so they line up: the forward primer will anneal to the
2nd strand copy of the handle on the random primer from step 1, and
the reverse primer thus must hit on the SL, yay!  Note that, as typed
here, the step 1 handle and the 3' end of the forward primer differ at
one base (GACACAG versus GAGACAG).
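
Rather than trusting my eyes on the alignments above, here is a small base R sketch that compares the 3' ends of the two adapter primers against the step 1 handle and the strand 2 primer, all exactly as typed in these notes.  The helper `three_prime()` and the variable names are my own illustration.

```{r adapter_overlap_check}
step1_handle <- "GTATAAGACACAG"   # 5' handle of the step 1 primer as typed above
strand2_primer <- "TCAGTTTCTGTA"  # step 3 primer
forward <- "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
reverse <- "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGATCAGTTTCTGTA"

## Illustrative helper: the last n bases of a sequence.
three_prime <- function(seq, n) substring(seq, nchar(seq) - n + 1)

## Forward primer 3' end versus the step 1 handle, base by base.
fwd_tail <- three_prime(forward, nchar(step1_handle))
mismatches <- which(strsplit(fwd_tail, "")[[1]] != strsplit(step1_handle, "")[[1]])
fwd_tail
mismatches

## Reverse primer 3' end versus the strand 2 primer.
rev_tail_matches <- three_prime(reverse, nchar(strand2_primer)) == strand2_primer
rev_tail_matches
```

The reverse primer ends in exactly the strand 2 primer.  The forward primer's last 13 bases differ from the typed step 1 handle only at position 9, so the handle as written in step 1 is worth checking against the paper.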

## Step 5: Indexing

Finally, add the Nextera primers and amplify again.  So,
those primers will have yet another overhang with the chosen
index.  Depending on the chosen indexes/primers, this may be important
for how April sets up the sequencing runs.

The final amplicon should therefore look something like:

<pre>
                                                                     I feel like these SL bases might benefit from dark cycles?
                                                                                  vvvvvvvvvvvv
nextera_index_sequence_TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG...3'endofrna...5'endofrnaTACAGAAACTGATCTGTCTCTTATACACATCTCCGAGCCCACGAGAC_nextera_index_sequence
                                                                                  TACAGAAACTGA
</pre>
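
The amplicon drawn above can be rebuilt from the step 4 primers, which is a quick way to confirm the right-hand side really is the reverse complement of the reverse primer.  This is a sketch in base R with hypothetical names.  The `insert` string is a placeholder rather than a real sequence, and the estimate of constant read-start bases assumes that the read primer on that side ends at the shared AGATGTGTATAAGAGACAG stretch, which I have not confirmed for this library.

```{r final_amplicon}
## Illustrative helper (not from hpgltools): reverse-complement a DNA string.
revcomp <- function(seq) {
  bases <- rev(strsplit(seq, "")[[1]])
  paste(chartr("ACGTN", "TGCAN", bases), collapse = "")
}

forward <- "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
reverse <- "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGATCAGTTTCTGTA"
## Placeholder for the cDNA between the adapters, not a real sequence.
insert <- "...3'endofrna...5'endofrna"

## Same strand as drawn above: forward primer, insert, then the reverse
## primer as it appears on the opposite strand.
amplicon <- paste0(forward, insert, revcomp(reverse))
amplicon

## Primer-derived bases which would open every read from the SL side,
## assuming the read primer ends at the 19 nt stretch shared by both primers.
shared_stretch <- "AGATGTGTATAAGAGACAG"
stretch_start <- regexpr(shared_stretch, reverse, fixed = TRUE)
constant_start <- substring(reverse, stretch_start + nchar(shared_stretch))
constant_start
nchar(constant_start)
```

Under that assumption every read from this side would open with the same 13 primer-derived bases (the 12 marked above plus the A in front of them), before even counting the remainder of the SL, which is the low-diversity stretch the dark cycle comment is worried about.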

# Bibliography
