Tnseq

Go Back, index, next

Overview

TN-seq provides high-throughput information about the essentiality of (nearly) every locus in a bacterial genome. This is largely thanks to the remarkable promiscuity of the mariner transposon, which will happily hop into any TA dinucleotide.
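Since mariner inserts only at TA dinucleotides, the theoretical number of insertion sites in a sequence is just the number of TAs it contains. A minimal sketch, using a made-up toy sequence rather than a real genome:

```shell
# Count candidate mariner insertion sites (TA dinucleotides) in a
# hypothetical toy sequence; a real genome would be read from a fasta file.
seq="GCTTAGGTATACCTA"
# grep -o prints each match on its own line; TA cannot overlap itself,
# so this counts every TA site in the string.
n_sites=$(printf '%s' "$seq" | grep -o "TA" | wc -l)
echo "TA sites: $n_sites"
```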

Making the libraries

I haven't made any libraries myself. My only sense of the process comes from Yoann's description and the couple of reviews I have read.

Processing the raw sequencing data

Processing the sequencing data comes in a few steps:

Quality filtering

When a library is successfully sequenced, the raw data arrives on our cluster as a set of gzipped fastq files. I have written a shell script to handle the quality filtering and adapter removal. It is called quality_filtering.sh.

Here is what a session would look like:

bash/quality_filtering.sh -i tnseq1_8.fastq.gz

The result of this process is a large, still-multiplexed, gzipped fasta file: 03-tnseq1_8_completed.fasta.gz.
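The core of this step (drop low-quality reads, emit fasta) can be sketched in awk. This is only an illustration of the idea, not the contents of quality_filtering.sh, and it skips adapter removal entirely; the read names, sequences, and the mean-quality-20 threshold are all assumptions:

```shell
# Toy fastq input (Phred+33 quality encoding assumed).
cat > toy.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTTTTT
+
!!!!!!!!
EOF
# Keep reads whose mean quality is at least 20 and write them as fasta.
awk 'BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
     NR % 4 == 1 { name = substr($0, 2) }   # header line, strip the @
     NR % 4 == 2 { seq = $0 }               # sequence line
     NR % 4 == 0 {                          # quality line
       total = 0
       for (i = 1; i <= length($0); i++)
         total += ord[substr($0, i, 1)] - 33
       if (total / length($0) >= 20)
         print ">" name "\n" seq
     }' toy.fastq > toy_filtered.fasta
```

Here read1 (quality 40 throughout) survives and read2 (quality 0 throughout) is dropped.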

Demultiplexing

When this is complete, the resulting fasta files must be further demultiplexed. Each read starts with a 7-mer which identifies the sample.

For example I received the following from Yoann for the most recent libraries:

I have a small perl script, sort_indexes.pl, which handles this. At the moment I need to copy/paste the index information from Yoann for each run; I should probably change this. As a result, I just copy the perl script into the current working directory, edit it, and run it there. I also drop the first letter of each index, because it seems to help with the occasional mismatch on the first base.

A session looks like this:

./sort_indexes.pl 03-tnseq1_8_completed.fasta
#Opening: >NZ131_t1.fasta
#Opening: >Alab49_t3.fasta
#Opening: >Alab49_t1.fasta
#Opening: >NZ131_t0.fasta
#Opening: >NZ131_t3.fasta
#Opening: >Alab49_t2.fasta
#Opening: >Alab49_t0.fasta
#Opening: >NZ131_t2.fasta
## And takes a few hours to run...
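The demultiplexing logic above (match a 6-mer at positions 2-7 of each read, i.e. the 7-mer index with its first base dropped, then trim the index and write to a per-sample file) can be sketched in awk. The indexes and sample names here are invented; the real ones come from Yoann's sheet for each run:

```shell
# Toy multiplexed fasta: each read begins with a 7-mer sample index.
cat > toy_reads.fasta <<'EOF'
>r1
AACGTTAGGGCCTTAA
>r2
TTGGCCAACGTACGTA
EOF
awk 'BEGIN {
       # hypothetical 6-mer keys (first base of the 7-mer dropped) -> sample
       idx["ACGTTA"] = "NZ131_t0"
       idx["TGGCCA"] = "Alab49_t0"
     }
     /^>/ { name = $0; next }
     {
       key = substr($0, 2, 6)        # skip base 1, take the next 6 bases
       if (key in idx)
         # trim the full 7-mer index off and append to that sample file
         print name "\n" substr($0, 8) >> (idx[key] ".fasta")
     }' toy_reads.fasta
```

Reads whose first seven bases match no index are silently discarded, which roughly matches how an index-sorting script behaves.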

Reference Alignment

Alignment against the reference genome(s) is handled by another shell script: tnseq.sh. This script is nearly a copy of a similar script for ribosome profiling. It handles mapping reads against a reference genome; creating a sorted, indexed bam file; and generating a count table of the features in the genome.

The bam file is the raw input for essentiality. The count table is the raw input for cbcbSEQ. Together, they provide inputs for a few other tools including IGV.

It is also worth noting that the first time this is run, it asks some questions; it requires the locations of:

A sample session with tnseq.sh looks like the following:

tnseq.sh -s alab -f alab
## The -s argument sets the species (in this case the alab strain of group A streptococcus).
## -f alab tells it to look in the alab/ directory for a set of fasta files.  The alignments etc. will be performed on all of them.

I haven't yet made a test data set. Until I do, note that it creates:
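The final step of tnseq.sh, tallying reads per genome feature into a count table, can be illustrated with awk on flat toy inputs. In the real pipeline the positions come from the sorted bam and the features from the genome annotation, so everything below (file names, coordinates, gene names) is invented:

```shell
# Toy feature table: name, start, end (1-based, inclusive).
cat > genes.txt <<'EOF'
geneA 1 100
geneB 101 200
EOF
# Toy alignment start positions, one per read.
cat > positions.txt <<'EOF'
5
50
150
EOF
# First pass (NR == FNR) loads the features; second pass assigns each
# read position to the feature whose interval contains it.
awk 'NR == FNR { name[NR] = $1; lo[NR] = $2; hi[NR] = $3; n = NR; next }
     {
       for (i = 1; i <= n; i++)
         if ($1 >= lo[i] && $1 <= hi[i])
           count[name[i]]++
     }
     END { for (i = 1; i <= n; i++)
             print name[i], count[name[i]] + 0 }' genes.txt positions.txt > counts.txt
cat counts.txt
```

A table like this, one row per feature, is the raw input that downstream tools such as cbcbSEQ expect.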

Data Analysis

This is a fairly wide-ranging set of topics, making use of the bam files and count tables generated above. I will step through these analyses on separate pages, starting with how to load the data.

Data Loading

This is handled in data_input.html.

cbcbSEQ

Handled in cbcbSEQ.html

DEseq analyses

Differential expression analyses of RNAseq data are handled with a few tools, notably voom/limma. I used them to look at changes in the 'fitness' of each gene over time.

essentiality

Handled in essentiality.html

tn-hmm

tn_hmm.html

Visualization

I am learning some new graph tools for visualizing a circular genome.

Usage

I am throwing a couple notes together on how to use this tree of data in usage.html.

References

The data

The excel sheets generated by the following data manipulations may be found here.

The source

To view this rmd document: view source
The archive of this data and its README


Ashton Trey Belew ()