1 RNAseq of ticks with and without borellia burgdorferi

This comprises my report of a recent windfall from Najib of some sequencing data of ticks which have and have not been infected with the parasite which causes lyme disease.

Much of the text/code following was plagarized from my rnaseq_cbf5.rmd file.

1.1 Setting up

The following block of text loads up my R code and may be used to save the data structures created by it as well as write pdf reports of the resulting data.

2 Preprocessing

The original data consists of single-ended 50 bp reads. Therefore, I preprocess it with trimomatic, bowtie, and htseq-count.

The genome for ixodes I downloaded from: http://vectordb.org/ along with the .gff annotations. The .gff required a quick sed filter to make it parseable by htseq (It doesn’t like strings with []).

The script used to handle these processes is process.pl and was invoked as follows:

cd preprocessing
start=$(pwd)
for i in $(/bin/ls HPGL*); do
  cd $start
  cd $i && ../process.pl --species ixodes_scapularis --input $i.fastq.gz
done

cd $start/..
mkdir -p processed_data/count_tables
for i in $(find . -name '*.count.xz'); do
  cp $i processed_data/count_tables
done

One that was finished, I spent a couple minutes writing all_samples.csv to include what meta-information I know at this time.

2.1 Alignment stats

I am just going to copy/paste the output stats for the libraries here.

  • hpgl0612:
  • reads processed: 23981162
  • reads with at least one reported alignment: 6956769 (29.01%)
  • reads that failed to align: 16041911 (66.89%)
  • reads with alignments sampled due to -M: 982482 (4.10%)
  • Reported 6956769 alignments to 1 output stream(s)

  • hpgl0613:
  • reads processed: 19650248
  • reads with at least one reported alignment: 1695721 (8.63%)
  • reads that failed to align: 17124763 (87.15%)
  • reads with alignments sampled due to -M: 829764 (4.22%)
  • Reported 1695721 alignments to 1 output stream(s)

  • hpgl0614:
  • reads processed: 23687463
  • reads with at least one reported alignment: 6938925 (29.29%)
  • reads that failed to align: 15770327 (66.58%)
  • reads with alignments sampled due to -M: 978211 (4.13%)
  • Reported 6938925 alignments to 1 output stream(s)

  • hpgl0615:
  • reads processed: 19996260
  • reads with at least one reported alignment: 5860058 (29.31%)
  • reads that failed to align: 13345455 (66.74%)
  • reads with alignments sampled due to -M: 790747 (3.95%)
  • Reported 5860058 alignments to 1 output stream(s)

  • hpgl0616:
  • reads processed: 26501869
  • reads with at least one reported alignment: 7877812 (29.73%)
  • reads that failed to align: 17596723 (66.40%)
  • reads with alignments sampled due to -M: 1027334 (3.88%)
  • Reported 7877812 alignments to 1 output stream(s)

  • hpgl0617:
  • reads processed: 26300269
  • reads with at least one reported alignment: 7820617 (29.74%)
  • reads that failed to align: 17459657 (66.39%)
  • reads with alignments sampled due to -M: 1019995 (3.88%)
  • Reported 7820617 alignments to 1 output stream(s)

3 Analyses using alignments with 0 mismatches!

3.1 Setting up

Here are the commands I invoke to get ready to play with new data, including everything required to install hpgltools, the software it uses, and the data.

## These first 4 lines are not needed once hpgltools is installed.
source("http://bioconductor.org/biocLite.R")
biocLite("devtools")
library(devtools)
library(hpgltools)

3.2 Reading data

I think the following is no longer valid?

annotation_file <- "reference/ixodes_exons.gff"
annot_df <- load_gff_annotations(annotation_file)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Returning a df with 13 columns and 89845 rows.
annot_df$ID <- gsub("\\-", "\\.", annot_df$ID, perl=TRUE)
annot_df$Parent <- gsub("\\-RA","", annot_df$Parent)
rownames(annot_df) <- annot_df$ID

description_file <- "reference/ixodes_mRNA.gff"
description_df <- load_gff_annotations(description_file)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Returning a df with 13 columns and 20486 rows.
descriptions <- merge(annot_df, description_df, by="Parent")
descriptions <- descriptions[,c("ID.x","width.x","description.y")]
rownames(descriptions) <- descriptions$ID.x
colnames(descriptions) <- c("ID","width","description")
isc_biomart <- load_biomart_annotations(species="iscapularis", host="metazoa.ensembl.org",
                                        include_lengths=TRUE)
## The biomart annotations file already exists, loading from it.
ixo_expt <- create_expt(metadata="all_samples.csv", gene_info=descriptions)
## Reading the sample metadata.
## The sample definitions comprises: 6, 6 rows, columns.
## Reading count tables.
## Reading count tables with read.table().
## /cbcb/nelsayed-scratch/atb/rnaseq/iscapularis_2016/processed_data/count_tables/hpgl0612-accepted.count.xz contains 89839 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/iscapularis_2016/processed_data/count_tables/hpgl0613-accepted.count.xz contains 89839 rows and merges to 89839 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/iscapularis_2016/processed_data/count_tables/hpgl0614-accepted.count.xz contains 89839 rows and merges to 89839 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/iscapularis_2016/processed_data/count_tables/hpgl0615-accepted.count.xz contains 89839 rows and merges to 89839 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/iscapularis_2016/processed_data/count_tables/hpgl0616-accepted.count.xz contains 89839 rows and merges to 89839 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/iscapularis_2016/processed_data/count_tables/hpgl0617-accepted.count.xz contains 89839 rows and merges to 89839 rows.
## Finished reading count tables.
## Matched 89549 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.

3.3 Normalizing and exploring data

There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.

ixo_written <- write_expt(ixo_expt, excel="excel/iscapularis_written.xlsx")
## Writing the legend.
## Writing the raw reads.
## Graphing the raw reads.

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Writing the normalized reads.
## Graphing the normalized reads.

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Writing the median reads by factor.
## The factor uninfected has 3 rows.
## The factor infected has 6 rows.

## First make a bar plot of the library sizes in the experiment.
## Notice that the colors were auto-chosen by create_expt() and they should
## be maintained throughout this process
ixo_libsize <- plot_libsize(ixo_expt)

pp(file="images/libsize.tiff")
## Defaulting to tiff.
## Going to write the image to: images/libsize.tiff when dev.off() is called.
ixo_libsize$plot
dev.off()
## png 
##   2
## Here we see that the wild type replicate 3 sample for 15 minutes has fewer non-zero genes than all its friends.
ixo_nonzero <- plot_nonzero(ixo_expt, title="nonzero vs. cpm")
ixo_nonzero$plot

3.3.1 An initial pca plot

In most cases, raw data does not cluster very well, lets see if that is also true for the ixo experiment. Assuming it doesn’t, lets normalize the data using the defaults (cpm, quantile, log2) and try again.

## Unsurprisingly, the raw data doesn't cluster well at all...
##ixo_rawpca = hpgl_pca(expt=ixo_expt, labels=ixo_expt$condition)
##ixo_rawpca = hpgl_pca(expt=ixo_expt, labels="fancy")
ixo_rawpca <- plot_pca(ixo_expt, title="Fun!")
ixo_rawpca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

3.3.2 Normalized PCA

Next lets do a quantile, cpm, log2 normalization and try again.

## So, normalize the data
norm_expt <- normalize_expt(ixo_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE)
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(hpgl(data))))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: hpgl
## Removing 63749 low-count genes (26085 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## Step 5: not doing batch correction.
normed <- graph_metrics(norm_expt)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.

## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## Plotting the representation of the top-n genes.
## Printing a color to condition legend.
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

normed$pcaplot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

raw_data <- exprs(ixo_expt$expressionset)
head(raw_data)
##                   HPGL0612 HPGL0613 HPGL0614 HPGL0615 HPGL0616 HPGL0617
## ISCW000001.RA.E1A        0        0        0        0        0        0
## ISCW000002.RA.E1A       46        6       51       37       47       39
## ISCW000002.RA.E2A       28       15       33       19       33       41
## ISCW000002.RA.E3A      117       20      130      104      127      102
## ISCW000002.RA.E4A      120      107      120      120      124      119
## ISCW000003.RA.E1A       14        0       14       17       13       21
descriptions <- load_gff_annotations("reference/ixodes_mRNA.gff")
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)
## Returning a df with 13 columns and 20486 rows.
rownames(descriptions) <- descriptions$Parent

genelengths <- get_genesizes(annotation=annot_df, gene_type="exon")
## Taking only the 89834 exon rows.
rownames(genelengths) <- gsub("\\-","\\.", genelengths$ID, perl=TRUE)

genelengths <- merge(raw_data, genelengths, by="row.names")
rownames(genelengths) <- genelengths$Row.names
genelengths <- genelengths[, -1]
my_lenvec <- as.vector(genelengths$gene_size)
names(my_lenvec) <- genelengths$Row.names
short_annot <- annot_df
short_annot <- short_annot[,c("seqnames","description")]

raw_genes <- sum_exons(raw_data, annotdf=annot_df)
my_lenvec <- as.vector(raw_genes$width$width)
names(my_lenvec) <- rownames(raw_genes$gene_size)
my_genecounts <- raw_genes$counts
raw_rpkm <- edgeR::rpkm(my_genecounts, gene.length=my_lenvec)
uninf_rpkm <- rowMeans(raw_rpkm[,c(4,5,6)])
inf_rpkm <- rowMeans(raw_rpkm[,c(1,2,3)])

rpkm_values <- merge(raw_rpkm, as.data.frame(uninf_rpkm), by="row.names")
rpkm_values <- merge(rpkm_values, as.data.frame(inf_rpkm), by.x="Row.names", by.y="row.names")
rownames(rpkm_values) <- rpkm_values$Row.names
rpkm_values <- rpkm_values[-1]

rpkm_values <- merge(rpkm_values, descriptions, by="row.names")
rownames(rpkm_values) <- rpkm_values$Row.names
rpkm_values <- rpkm_values[-1]
rpkm_values <- rpkm_values[,c(1,2,3,4,5,6,7,8,12,21)]
write_xls(rpkm_values, sheet="rpkm_values", file="excel/rpkm_raw.xls")
## $workbook
## A Workbook object.
##  
## Worksheets:
##  Sheet 1: "rpkm_values"
##  
##  Custom column widths (column: width)
##    1: 20, 2: auto, 3: auto, 4: auto, 5: auto, 6: auto, 7: auto, 8: auto, 9: auto, 10: 30 
##  
## 
##  
##  Worksheet write order: 1
## $sheet
## [1] "rpkm_values"
## 
## $end_row
## [1] 20489
## 
## $end_col
## [1] 11

As you can see, the uninfected-2 sample is still deeply problematic.

3.3.3 Raw data with batch removal

That had a pretty profound effect, so lets try leaving the data alone and just removing the batch effect.

norm_expt_svaseq <- normalize_expt(ixo_expt, transform="raw",
                                   norm="raw", convert="cpm",
                                   filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## svaseq(cpm(hpgl(data)))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
##  some metrics are easier to see when the data is log2 transformed, but
##  EdgeR/DESeq do not accept transformed data.
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Step 1: performing count filter with option: hpgl
## Removing 63749 low-count genes (26085 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: not transforming the data.
## Step 5: doing batch correction with svaseq.
## In norm_batch, after testing logic of surrogate method/number, the
## number of surrogates is:  and the method is: be.
## Note to self:  If you get an error like 'x contains missing values'; I think this
##  means that the data has too many 0's and needs to have a better low-count filter applied.
## batch_counts: Before batch correction, 2372 entries 0<x<1.
## batch_counts: Before batch correction, 18018 entries are >= 0.
## After checking/setting the number of surrogates, it is: 1.
## batch_counts: Using sva::svaseq for batch correction.
## Note to self:  If you feed svaseq a data frame you will get an error like:
## data %*% (Id - mod %*% blah blah requires numeric/complex arguments.
## The number of elements which are < 0 after batch correction is: 276
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
pp(file="images/norm_pca.tiff")
## Defaulting to tiff.
## Going to write the image to: images/norm_pca.tiff when dev.off() is called.
plot_pca(norm_expt_svaseq)$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
dev.off()
## png 
##   2

That seems to be sufficient to smash the uninfected and infected samples into their own pairs.

norm_expt_sva <- normalize_expt(ixo_expt, transform="log2", norm="quant",
                               convert="cpm", filter=TRUE, batch="sva")
## This function will replace the expt$expressionset slot with:
## log2(sva(cpm(quant(hpgl(data)))))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Warning in normalize_expt(ixo_expt, transform = "log2", norm = "quant", :
## Quantile normalization and sva do not always play well together.
## Step 1: performing count filter with option: hpgl
## Removing 63749 low-count genes (26085 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## Step 5: doing batch correction with sva.
## In norm_batch, after testing logic of surrogate method/number, the
## number of surrogates is:  and the method is: be.
## Note to self:  If you get an error like 'x contains missing values'; I think this
##  means that the data has too many 0's and needs to have a better low-count filter applied.
## batch_counts: Before batch correction, 9869 entries 0<x<1.
## batch_counts: Before batch correction, 1255 entries are >= 0.
## After checking/setting the number of surrogates, it is: 1.
## Passing the batch method to get_model_adjust().
## It understands a few additional batch methods.
## Not able to discern the state of the data.
## Going to use a simplistic metric to guess if it is log scale.
## A specific number of surrogate variables was chosen: 1.
## Estimate type 'sva' is shorthand for 'sva_unsupervised'.
## Other sva options include: sva_supervised and svaseq.
## Attempting sva unsupervised surrogate estimation with 1 surrogates.
## The number of elements which are < 0 after batch correction is: 439
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
batchnorm_metrics <- graph_metrics(norm_expt_sva)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some data are negative.  We are on log scale, setting them to 0.
## Changed 439 negative features.
## Some entries are 0.  We are on log scale, adding 1 to the data.
## Changed 439 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.

## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some data are negative.  We are on log scale, setting them to 0.5.
## Changed 439 negative features.
## Plotting the representation of the top-n genes.
## Printing a color to condition legend.
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

batchnorm_metrics$corheat

batchnorm_metrics$disheat

batchnorm_metrics$pcaplot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

3.4 Remove the problematic sample

## The above suggests to me that uninfected #2 is no good.
ixo2 <- subset_expt(ixo_expt, subset="name!='up2_c2'")
## There were 6, now there are 5 samples.
norm_ixo2 <- normalize_expt(ixo2, transform="log2", norm="quant", convert="cpm", filter=TRUE)
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(hpgl(data))))
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: hpgl
## Removing 63856 low-count genes (25978 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## Step 5: not doing batch correction.
norm_ixo2_pca <- plot_pca(norm_ixo2)
norm_ixo2_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

4 Some simple differential expression analyses

testing_de <- all_pairwise(ixo_expt)
## Using limma's removeBatchEffect to visualize before/after batch inclusion.
## Finished running DE analyses, collecting outputs.
## Comparing analyses 1/1: uninfected_vs_infected
testing_combined <- combine_de_tables(
  all_pairwise_result=testing_de, excel="excel/testing_de.xlsx")
## Deleting the file excel/testing_de.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Working on table 1/1: uninfected_vs_infected
## Adding venn plots for uninfected_vs_infected.

## Limma expression coefficients for uninfected_vs_infected; R^2: 0.968; equation: y = 1.49x + 1.31
## Edger expression coefficients for uninfected_vs_infected; R^2: 0.965; equation: y = 1.04x - 0.301
## DESeq2 expression coefficients for uninfected_vs_infected; R^2: 0.93; equation: y = 1.11x - 0.194
## Writing summary information.
## Attempting to add the comparison plot to pairwise_summary at row: 19 and column: 1
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## Performing save of the workbook.

The following performs simple comparisons in the data using limma, EdgeR, and DESeq2.

ixo2_filt <- normalize_expt(ixo2, filter=TRUE)
## This function will replace the expt$expressionset slot with:
## hpgl(data)
## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
##  some metrics are easier to see when the data is log2 transformed, but
##  EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted.  It is often advisable to cpm/rpkm
##  the data to normalize for sampling differences, keep in mind though that rpkm
##  has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
##  will try to detect this).
## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.
## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: hpgl
## Removing 63856 low-count genes (25978 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
fun <- all_pairwise(ixo2_filt, model_batch="svaseq")
## The be method chose 1 surrogate variable(s).
## Attempting svaseq estimation with 1 surrogates.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## Finished running DE analyses, collecting outputs.
## Comparing analyses 1/1: uninfected_vs_infected
fun_combined <- combine_de_tables(fun, excel="excel/fun.xlsx")
## Deleting the file excel/fun.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Working on table 1/1: uninfected_vs_infected
## Adding venn plots for uninfected_vs_infected.

## Limma expression coefficients for uninfected_vs_infected; R^2: 0.978; equation: y = 0.985x + 0.0682
## Edger expression coefficients for uninfected_vs_infected; R^2: 0.977; equation: y = 0.993x + 0.0492
## DESeq2 expression coefficients for uninfected_vs_infected; R^2: 0.977; equation: y = 0.993x + 0.0501
## Writing summary information.
## Attempting to add the comparison plot to pairwise_summary at row: 19 and column: 1
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## Performing save of the workbook.

fun$comparison$comp
##    uninfected_vs_infected
## le                 0.9711
## ld                 0.9730
## ed                 0.9975
## lb                 0.9318
## eb                 0.8958
## db                 0.8995
abundances <- get_pairwise_gene_abundances(fun, excel="excel/written_abundance_test.xlsx")
##expression_written <- write_xls(data=abundances[["expression_values"]],
##                                sheet="expression_values",
##                                title="Values making up the contrast logFCs")
##error_written <- write_xls(data=abundances[["error_values"]], start_col=18,
##                           title="Error value obtained by dividing the expression / t-statistic",
##                           sheet="expression_values",
##                           wb=expression_written[["workbook"]])
##openxlsx::saveWorkbook(wb=expression_written[["workbook"]],
##                       file="excel/written_abundance_test.xlsx", overwrite=TRUE)

4.1 Plot some metrics from EdgeR/limma/DESeq2

Lets compare infected/uninfected for the three tools.

##hpgl_volcano_plot(toptable_data=fun$limma$all_tables$uninfected_minus_infected)
something <- extract_de_plots(fun_combined)
something$volcano$plot

limma_toptable <- fun$limma$all_tables$uninfected_vs_infected
##limma_scatter <- limma_coefficient_scatter(fun$limma, toptable=limma_toptable, flip=TRUE)
limma_scatter <- extract_coefficient_scatter(fun, type="limma")
## This can do comparisons among the following columns in the pairwise result:
## infected, uninfected
## Actually comparing infected and uninfected.
limma_scatter$scatter

limma_scatter_coef <- limma_scatter$df

##deseq_scatter <- deseq_coefficient_scatter(fun$deseq, flip=TRUE)
deseq_scatter <- extract_coefficient_scatter(fun, type="deseq")
## This can do comparisons among the following columns in the pairwise result:
## infected, uninfected, SV1
## Actually comparing infected and uninfected.
deseq_scatter$scatter

deseq_scatter_coef <- deseq_scatter$df

4.2 Make some tables of the result

combined <- combine_de_tables(fun, table='uninfected_minus_infected', annot_df=annot_df)
## Writing a legend of columns.
## Working on table 1/1: uninfected_vs_infected
combined_table <- combined[["data"]][[1]]
stuff <- merge(combined_table, genelengths, by.x="row.names", by.y="ID")

write_xls(combined, sheet="uninfected_minus_infected", file="excel/combined_analyses.xls")
## Error in `colnames<-`(`*tmp*`, value = final_colnames): attempt to set 'colnames' on an object with less than two dimensions
annot_df <- annot_df[, c("ID","description")]
annotated_limma <- merge(annot_df, limma_toptable, by="row.names")
annotated_limma$FC <- 1 / (2 ^ annotated_limma$logFC)
rownames(annotated_limma) <- annotated_limma$Row.names
annotated_limma <- annotated_limma[-1]
write_xls(annotated_limma, sheet="uninfected_minus_infected", file="excel/annoated_limma.xls")
## $workbook
## A Workbook object.
##  
## Worksheets:
##  Sheet 1: "uninfected_minus_infected"
##  
##  Custom column widths (column: width)
##    1: 20, 2: auto, 3: auto, 4: auto, 5: auto, 6: auto, 7: auto, 8: auto, 9: auto 
##  
## 
##  
##  Worksheet write order: 1
## $sheet
## [1] "uninfected_minus_infected"
## 
## $end_row
## [1] 25981
## 
## $end_col
## [1] 10
significant_up <- annotated_limma
significant_up <- subset(significant_up, logFC <= -1)
significant_up <- subset(significant_up, P.Value <= 0.05)
write_xls(significant_up, sheet="sigup_p0.05", file="excel/sigup.xls", overwritefile=TRUE)
## $workbook
## A Workbook object.
##  
## Worksheets:
##  Sheet 1: "sigup_p0.05"
##  
##  Custom column widths (column: width)
##    1: 20, 2: auto, 3: auto, 4: auto, 5: auto, 6: auto, 7: auto, 8: auto, 9: auto 
##  
## 
##  
##  Worksheet write order: 1
## $sheet
## [1] "sigup_p0.05"
## 
## $end_row
## [1] 111
## 
## $end_col
## [1] 10
significant_down <- annotated_limma
significant_down <- subset(significant_down, logFC >= 1)
significant_down <- subset(significant_down, P.Value <= 0.05)
write_xls(significant_down, sheet="sigdown_p0.05", file="excel/sigdown.xls", overwritefile=TRUE)
## $workbook
## A Workbook object.
##  
## Worksheets:
##  Sheet 1: "sigdown_p0.05"
##  
##  Custom column widths (column: width)
##    1: 20, 2: auto, 3: auto, 4: auto, 5: auto, 6: auto, 7: auto, 8: auto, 9: auto 
##  
## 
##  
##  Worksheet write order: 1
## $sheet
## [1] "sigdown_p0.05"
## 
## $end_row
## [1] 296
## 
## $end_col
## [1] 10
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 7de4503f6bb5724c28cce24af5dbee22bb1c0cae
## R> packrat::restore()
## This is hpgltools commit: Thu Apr 12 22:08:53 2018 -0400: 7de4503f6bb5724c28cce24af5dbee22bb1c0cae
## Saving to rnaseq_ixodes-v20170511.rda.xz
---
title: "I.scapularis 2016: Initial RNAseq analyses."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
if (!isTRUE(get0("skip_load"))) {
  library(hpgltools)
  tt <- devtools::load_all("~/hpgltools")
  knitr::opts_knit$set(progress=TRUE,
                       verbose=TRUE,
                       width=90,
                       echo=TRUE)
  knitr::opts_chunk$set(error=TRUE,
                        fig.width=8,
                        fig.height=8,
                        dpi=96)
  old_options <- options(digits=4,
                         stringsAsFactors=FALSE,
                         knitr.duplicate.label="allow")
  ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
  ver <- "20170511"
  previous_file <- "index.Rmd"

  tmp <- try(sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz"))))
  rmd_file <- "rnaseq_ixodes.Rmd"
}
```

# RNAseq of ticks with and without borellia burgdorferi

This comprises my report of a recent windfall from Najib of some
sequencing data of ticks which have and have not been infected with
the parasite which causes lyme disease.

Much of the text/code following was plagarized from my rnaseq_cbf5.rmd file.

## Setting up

The following block of text loads up my R code and may be used to save
the data structures created by it as well as write pdf reports of the
resulting data.

# Preprocessing

The original data consists of single-ended 50 bp reads.  Therefore, I
preprocess it with trimomatic, bowtie, and htseq-count.

The genome for ixodes I downloaded from: http://vectordb.org/ along
with the .gff annotations.  The .gff required a quick sed filter to
make it parseable by htseq (It doesn't like strings with []).

The script used to handle these processes is process.pl and was
invoked as follows:

```{bash preprocessing, eval=FALSE}
cd preprocessing
start=$(pwd)
for i in $(/bin/ls HPGL*); do
  cd $start
  cd $i && ../process.pl --species ixodes_scapularis --input $i.fastq.gz
done

cd $start/..
mkdir -p processed_data/count_tables
for i in $(find . -name '*.count.xz'); do
  cp $i processed_data/count_tables
done
```

One that was finished, I spent a couple minutes writing
all_samples.csv to include what meta-information I know at this time.

## Alignment stats

I am just going to copy/paste the output stats for the libraries here.

* hpgl0612:
  + reads processed: 23981162
  + reads with at least one reported alignment: 6956769 (29.01%)
  + reads that failed to align: 16041911 (66.89%)
  + reads with alignments sampled due to -M: 982482 (4.10%)
  + Reported 6956769 alignments to 1 output stream(s)

* hpgl0613:
  + reads processed: 19650248
  + reads with at least one reported alignment: 1695721 (8.63%)
  + reads that failed to align: 17124763 (87.15%)
  + reads with alignments sampled due to -M: 829764 (4.22%)
  + Reported 1695721 alignments to 1 output stream(s)

* hpgl0614:
  + reads processed: 23687463
  + reads with at least one reported alignment: 6938925 (29.29%)
  + reads that failed to align: 15770327 (66.58%)
  + reads with alignments sampled due to -M: 978211 (4.13%)
  + Reported 6938925 alignments to 1 output stream(s)

* hpgl0615:
  + reads processed: 19996260
  + reads with at least one reported alignment: 5860058 (29.31%)
  + reads that failed to align: 13345455 (66.74%)
  + reads with alignments sampled due to -M: 790747 (3.95%)
  + Reported 5860058 alignments to 1 output stream(s)

* hpgl0616:
  + reads processed: 26501869
  + reads with at least one reported alignment: 7877812 (29.73%)
  + reads that failed to align: 17596723 (66.40%)
  + reads with alignments sampled due to -M: 1027334 (3.88%)
  + Reported 7877812 alignments to 1 output stream(s)

* hpgl0617:
  + reads processed: 26300269
  + reads with at least one reported alignment: 7820617 (29.74%)
  + reads that failed to align: 17459657 (66.39%)
  + reads with alignments sampled due to -M: 1019995 (3.88%)
  + Reported 7820617 alignments to 1 output stream(s)

# Analyses using alignments with 0 mismatches!

## Setting up

Here are the commands I invoke to get ready to play with new data, including everything
required to install hpgltools, the software it uses, and the data.

```{r setup, include=TRUE, eval=FALSE}
## These first 4 lines are not needed once hpgltools is installed.
source("http://bioconductor.org/biocLite.R")
biocLite("devtools")
library(devtools)
library(hpgltools)
```

## Reading data

I think the following is no longer valid?

```{r previous_annot}
annotation_file <- "reference/ixodes_exons.gff"
annot_df <- load_gff_annotations(annotation_file)
annot_df$ID <- gsub("\\-", "\\.", annot_df$ID, perl=TRUE)
annot_df$Parent <- gsub("\\-RA","", annot_df$Parent)
rownames(annot_df) <- annot_df$ID

description_file <- "reference/ixodes_mRNA.gff"
description_df <- load_gff_annotations(description_file)

descriptions <- merge(annot_df, description_df, by="Parent")
descriptions <- descriptions[,c("ID.x","width.x","description.y")]
rownames(descriptions) <- descriptions$ID.x
colnames(descriptions) <- c("ID","width","description")
```

```{r data_import}
isc_biomart <- load_biomart_annotations(species="iscapularis", host="metazoa.ensembl.org",
                                        include_lengths=TRUE)

ixo_expt <- create_expt(metadata="all_samples.csv", gene_info=descriptions)
```

## Normalizing and exploring data

There are lots of toys we have learned to use to play with with raw data and explore stuff like
batch effects or non-canonical distributions or skewed counts.  hpgltools provides some functionality
to make this process easier.  The graphs shown below and many more are generated with the wrapper
'graph_metrics()' but that takes away the chance to explain the graphs as I generate them.

```{r norm_explore}
ixo_written <- write_expt(ixo_expt, excel="excel/iscapularis_written.xlsx")

## First make a bar plot of the library sizes in the experiment.
## Notice that the colors were auto-chosen by create_expt() and they should
## be maintained throughout this process
ixo_libsize <- plot_libsize(ixo_expt)

pp(file="images/libsize.tiff")
ixo_libsize$plot
dev.off()

## Here we see that the wild type replicate 3 sample for 15 minutes has fewer non-zero genes than all its friends.
ixo_nonzero <- plot_nonzero(ixo_expt, title="nonzero vs. cpm")
ixo_nonzero$plot
```

### An initial pca plot

In most cases, raw data does not cluster very well, lets see if that is also true for the ixo experiment.
Assuming it doesn't, lets normalize the data using the defaults (cpm, quantile, log2) and try again.

```{r pca}
## Unsurprisingly, the raw data doesn't cluster well at all...
##ixo_rawpca = hpgl_pca(expt=ixo_expt, labels=ixo_expt$condition)
##ixo_rawpca = hpgl_pca(expt=ixo_expt, labels="fancy")
ixo_rawpca <- plot_pca(ixo_expt, title="Fun!")
ixo_rawpca$plot
```

### Normalized PCA

Next lets do a quantile, cpm, log2 normalization and try again.

```{r pca_norm}

## So, normalize the data
norm_expt <- normalize_expt(ixo_expt, transform="log2", norm="quant", convert="cpm", filter=TRUE)
normed <- graph_metrics(norm_expt)
normed$pcaplot

raw_data <- exprs(ixo_expt$expressionset)
head(raw_data)

descriptions <- load_gff_annotations("reference/ixodes_mRNA.gff")
rownames(descriptions) <- descriptions$Parent

genelengths <- get_genesizes(annotation=annot_df, gene_type="exon")
rownames(genelengths) <- gsub("\\-","\\.", genelengths$ID, perl=TRUE)

genelengths <- merge(raw_data, genelengths, by="row.names")
rownames(genelengths) <- genelengths$Row.names
genelengths <- genelengths[, -1]
my_lenvec <- as.vector(genelengths$gene_size)
names(my_lenvec) <- genelengths$Row.names
short_annot <- annot_df
short_annot <- short_annot[,c("seqnames","description")]

raw_genes <- sum_exons(raw_data, annotdf=annot_df)
my_lenvec <- as.vector(raw_genes$width$width)
names(my_lenvec) <- rownames(raw_genes$gene_size)
my_genecounts <- raw_genes$counts
raw_rpkm <- edgeR::rpkm(my_genecounts, gene.length=my_lenvec)
uninf_rpkm <- rowMeans(raw_rpkm[,c(4,5,6)])
inf_rpkm <- rowMeans(raw_rpkm[,c(1,2,3)])

rpkm_values <- merge(raw_rpkm, as.data.frame(uninf_rpkm), by="row.names")
rpkm_values <- merge(rpkm_values, as.data.frame(inf_rpkm), by.x="Row.names", by.y="row.names")
rownames(rpkm_values) <- rpkm_values$Row.names
rpkm_values <- rpkm_values[-1]

rpkm_values <- merge(rpkm_values, descriptions, by="row.names")
rownames(rpkm_values) <- rpkm_values$Row.names
rpkm_values <- rpkm_values[-1]
rpkm_values <- rpkm_values[,c(1,2,3,4,5,6,7,8,12,21)]
write_xls(rpkm_values, sheet="rpkm_values", file="excel/rpkm_raw.xls")
```

As you can see, the uninfected-2 sample is still deeply problematic.

### Raw data with batch removal

That had a pretty profound effect, so lets try leaving the data alone and just removing the batch effect.

```{r pca_batchraw}
norm_expt_svaseq <- normalize_expt(ixo_expt, transform="raw",
                                   norm="raw", convert="cpm",
                                   filter=TRUE, batch="svaseq")

pp(file="images/norm_pca.tiff")
plot_pca(norm_expt_svaseq)$plot
dev.off()
```

That seems to be sufficient to smash the uninfected and infected samples into their own pairs.

```{r pca_batchnorm}
norm_expt_sva <- normalize_expt(ixo_expt, transform="log2", norm="quant",
                               convert="cpm", filter=TRUE, batch="sva")
batchnorm_metrics <- graph_metrics(norm_expt_sva)
batchnorm_metrics$corheat
batchnorm_metrics$disheat
batchnorm_metrics$pcaplot
```

## Remove the problematic sample

```{r sample_removal}
## The above suggests to me that uninfected #2 is no good.
ixo2 <- subset_expt(ixo_expt, subset="name!='up2_c2'")
norm_ixo2 <- normalize_expt(ixo2, transform="log2", norm="quant", convert="cpm", filter=TRUE)
norm_ixo2_pca <- plot_pca(norm_ixo2)
norm_ixo2_pca$plot
```

# Some simple differential expression analyses

```{r initial_de}
testing_de <- all_pairwise(ixo_expt)
testing_combined <- combine_de_tables(
  all_pairwise_result=testing_de, excel="excel/testing_de.xlsx")
```

The following performs simple comparisons in the data using limma, EdgeR, and DESeq2.

```{r simple_de}
ixo2_filt <- normalize_expt(ixo2, filter=TRUE)
fun <- all_pairwise(ixo2_filt, model_batch="svaseq")
fun_combined <- combine_de_tables(fun, excel="excel/fun.xlsx")
fun$comparison$comp
abundances <- get_pairwise_gene_abundances(fun, excel="excel/written_abundance_test.xlsx")
##expression_written <- write_xls(data=abundances[["expression_values"]],
##                                sheet="expression_values",
##                                title="Values making up the contrast logFCs")
##error_written <- write_xls(data=abundances[["error_values"]], start_col=18,
##                           title="Error value obtained by dividing the expression / t-statistic",
##                           sheet="expression_values",
##                           wb=expression_written[["workbook"]])
##openxlsx::saveWorkbook(wb=expression_written[["workbook"]],
##                       file="excel/written_abundance_test.xlsx", overwrite=TRUE)
```

## Plot some metrics from EdgeR/limma/DESeq2

Lets compare infected/uninfected for the three tools.

```{r plot_pairwise}
##hpgl_volcano_plot(toptable_data=fun$limma$all_tables$uninfected_minus_infected)
something <- extract_de_plots(fun_combined)
something$volcano$plot

limma_toptable <- fun$limma$all_tables$uninfected_vs_infected
##limma_scatter <- limma_coefficient_scatter(fun$limma, toptable=limma_toptable, flip=TRUE)
limma_scatter <- extract_coefficient_scatter(fun, type="limma")
limma_scatter$scatter
limma_scatter_coef <- limma_scatter$df

##deseq_scatter <- deseq_coefficient_scatter(fun$deseq, flip=TRUE)
deseq_scatter <- extract_coefficient_scatter(fun, type="deseq")
deseq_scatter$scatter
deseq_scatter_coef <- deseq_scatter$df
```

## Make some tables of the result

```{r tables}
combined <- combine_de_tables(fun, table='uninfected_minus_infected', annot_df=annot_df)
combined_table <- combined[["data"]][[1]]
stuff <- merge(combined_table, genelengths, by.x="row.names", by.y="ID")

write_xls(combined, sheet="uninfected_minus_infected", file="excel/combined_analyses.xls")

annot_df <- annot_df[, c("ID","description")]
annotated_limma <- merge(annot_df, limma_toptable, by="row.names")
annotated_limma$FC <- 1 / (2 ^ annotated_limma$logFC)
rownames(annotated_limma) <- annotated_limma$Row.names
annotated_limma <- annotated_limma[-1]
write_xls(annotated_limma, sheet="uninfected_minus_infected", file="excel/annoated_limma.xls")
significant_up <- annotated_limma
significant_up <- subset(significant_up, logFC <= -1)
significant_up <- subset(significant_up, P.Value <= 0.05)
write_xls(significant_up, sheet="sigup_p0.05", file="excel/sigup.xls", overwritefile=TRUE)

significant_down <- annotated_limma
significant_down <- subset(significant_down, logFC >= 1)
significant_down <- subset(significant_down, P.Value <= 0.05)
write_xls(significant_down, sheet="sigdown_p0.05", file="excel/sigdown.xls", overwritefile=TRUE)
```

```{r saveme}
if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  message(paste0("Saving to ", this_save))
  tmp <- sm(saveme(filename=this_save))
}
```
