1 Sample Estimation, Macrophages: 20170820

This document is concerned with analyzing RNAseq data of human and parasite during the infection of human macrophages. A few different strains of L. panamensis were used; the experiment is therefore segregated into groups named ‘self-healing’, ‘chronic’, and ‘uninfected’. Two separate sets of libraries were generated, one earlier set with greater coverage, and a later set including only 1 uninfected sample, and 2-3 chronic samples.

1.1 Extract the macrophage experiment

The following subset operation extract the samples used for the macrophage experiment. The next three lines then change the colors from the defaults.

hs_expt <- sm(create_expt(metadata="sample_sheets/all_samples-combined.xlsx",
                          gene_info=hs_annotations,
                          file_column="humanfile"))

hs_macr <- subset_expt(hs_expt, subset="experimentname=='macrophage'")
## There were 50, now there are 11 samples.
new_colors <- c("#009900", "#990000", "#000099")
names(new_colors) <- c("uninf", "chr", "sh")

hs_macr <- set_expt_colors(hs_macr, colors=new_colors)
## The new colors are a character, changing according to condition.
labels <- as.character(pData(hs_macr)[["label"]])
hs_macr <- set_expt_samplenames(hs_macr, labels)

hs_cds_expt <- exclude_genes_expt(hs_expt, column='Type', patterns="protein_coding")
## Before removal, there were 51041 entries.
## Now there are 32233 entries.
## Percent kept: 6.800, 4.913, 4.463, 3.420, 4.095, 3.758, 4.250, 3.901, 2.844, 2.354, 3.398, 2.979, 3.360, 2.824, 3.827, 3.483, 5.058, 5.009, 4.342, 4.748, 5.045, 5.140, 5.233, 3.496, 3.612, 4.589, 2.427, 2.134, 2.082, 1.978, 12.185, 10.054, 9.846, 9.990, 16.555, 7.559, 5.374, 3.647, 3.770, 4.198, 3.756, 3.777, 3.723, 6.072, 5.246, 5.389, 5.451, 5.071, 5.186, 4.931
## Percent removed: 93.200, 95.087, 95.537, 96.580, 95.905, 96.242, 95.750, 96.099, 97.156, 97.646, 96.602, 97.021, 96.640, 97.176, 96.173, 96.517, 94.942, 94.991, 95.658, 95.252, 94.955, 94.860, 94.767, 96.504, 96.388, 95.411, 97.573, 97.866, 97.918, 98.022, 87.815, 89.946, 90.154, 90.010, 83.445, 92.441, 94.626, 96.353, 96.230, 95.802, 96.244, 96.223, 96.277, 93.928, 94.754, 94.611, 94.549, 94.929, 94.814, 95.069
hs_cds_macr <- subset_expt(hs_cds_expt, subset="experimentname=='macrophage'")
## There were 50, now there are 11 samples.
hs_cds_macr <- set_expt_colors(hs_cds_macr, colors=new_colors)
## The new colors are a character, changing according to condition.
hs_cds_macr <- set_expt_samplenames(hs_cds_macr, labels)

2 Figure S1

Figure S1 should include nice versions of the sample metrics. The normalization chosen is batch-in-model.

First, however, we will make some plots of the raw data.

Sample names are going to be ‘infectionstate_strainnumber’ : chr_7721

  • Panel A: Library sizes.
  • Panel B: Heatmap distance raw.
  • Panel C: PCA
  • Panel D: TSNE
fig_s1 <- sm(write_expt(hs_macr, norm="quant", violin=FALSE, convert="cpm",
                        transform="log2", batch="pca", filter=TRUE,
                        excel=paste0("excel/figure_s1_sample_estimation-v", ver, ".xlsx")))
fig_s1_cds <- sm(write_expt(hs_cds_macr, norm="quant", violin=FALSE, convert="cpm",
                            transform="log2", batch="pca", filter=TRUE,
                            excel=paste0("excel/figure_s1_sample_estimation_cds_only-v", ver, ".xlsx")))

First look for clustering patterns in the raw data, in all data followed by only the CDS regions.

plot_tsne(hs_macr)$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

plot_tsne(hs_cds_macr)$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

2.1 Write S1 images

Once the experiment has been written with the various normalizations in place, we can use that to print and view some representative plots.

pp(file="images/figure_s1a.pdf", image=fig_s1$raw_libsize)
## Writing the image to: images/figure_s1a.pdf and calling dev.off().

## Show the library sizes of all features.

pp(file="images/figure_s1b.pdf", image=fig_s1$raw_disheat)
## Writing the image to: images/figure_s1b.pdf and calling dev.off().

## Show the clustering by euclidean distance of the raw data all features.

pp(file="images/figure_s1c.pdf", image=fig_s1$raw_scaled_pca)
## Writing the image to: images/figure_s1c.pdf and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Do a PCA plot of the log2(normalized(data)) without any batch-in-model

pp(file="images/figure_s1d.pdf", image=fig_s1$norm_tsne)
## Writing the image to: images/figure_s1d.pdf and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Do a TSNE of the normalized data after condition is in the model.

3 Figure 1

  • Figure 1a distance heatmap of the normalized data.
  • Figure 1b is a PCA of the normalized data.

In a similar fashion to what is above, I am printing the figure 1 images.

pp(file="images/figure_1a.pdf", fig_s1$norm_disheat)
## Writing the image to: images/figure_1a.pdf and calling dev.off().

## Show the changes in clustering after adding batch to the model and normalizing.

pp(file="images/figure_1a_cds.pdf", image=fig_s1_cds$norm_disheat)
## Writing the image to: images/figure_1a_cds.pdf and calling dev.off().

## Observe that the clustering is nearly identical when only looking at the CDS features.

pp(file="images/figure_1b.pdf", image=fig_s1$norm_pca)
## Writing the image to: images/figure_1b.pdf and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Visualize the clustering after normalization via PCA.
knitr::kable(fig_s1$norm_pca_table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.3001 0.5411
HPGL0242 sh a 1 #000099 sh_2271 -0.1826 -0.1392
HPGL0243 sh a 1 #000099 sh_2272 -0.1623 -0.0610
HPGL0244 chr a 1 #990000 chr_5433 0.5607 0.1367
HPGL0245 chr a 1 #990000 chr_1320 0.1739 -0.2052
HPGL0246 chr a 1 #990000 chr_2504 0.3305 -0.1978
HPGL0247 chr a 1 #990000 chr_5430 -0.0119 -0.1666
HPGL0248 chr a 1 #990000 chr_5397 0.2537 -0.2304
HPGL0637 uninf b 2 #009900 uninf 0.0142 0.5954
HPGL0638 sh b 2 #000099 sh_2189 -0.1111 0.0995
HPGL0639 sh b 2 #000099 sh_1022 -0.5651 -0.3727
write.csv(fig_s1$norm_pca_table, file="images/fig_1a.csv")

pp(file="images/figure_1b_cds.pdf", image=fig_s1_cds$norm_pca)
## Writing the image to: images/figure_1b_cds.pdf and calling dev.off().
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Once again, see that it is mostly unchanged when using only CDS features.
knitr::kable(fig_s1_cds$norm_pca_table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.7944 0.1829
HPGL0242 sh a 1 #000099 sh_2271 0.1567 -0.2831
HPGL0243 sh a 1 #000099 sh_2272 0.1705 -0.4056
HPGL0244 chr a 1 #990000 chr_5433 0.1587 0.7099
HPGL0245 chr a 1 #990000 chr_1320 0.1713 0.2040
HPGL0246 chr a 1 #990000 chr_2504 0.1925 0.1750
HPGL0247 chr a 1 #990000 chr_5430 0.1967 0.1279
HPGL0248 chr a 1 #990000 chr_5397 0.1991 -0.1948
HPGL0637 uninf b 2 #009900 uninf -0.3759 -0.2131
HPGL0638 sh b 2 #000099 sh_2189 -0.0127 -0.1561
HPGL0639 sh b 2 #000099 sh_1022 -0.0624 -0.1470
write.csv(fig_s1_cds$norm_pca_table, file="images/fig_1a_cds.csv")

4 Sample metrics

The following is mostly used to compare different methodologies and is therefore not likely to be useful for most.

The various metrics used to describe and examine the data come once before, and once after normalization.

4.1 Initial sample metrics

graph_metrics creates a set of plots we can use to look at the data before messing with it.

written <- sm(write_expt(hs_macr,
                         excel=paste0("excel/macrophage_human_data-v", ver, ".xlsx"),
                         violin=TRUE))
hs_macr_metrics <- sm(graph_metrics(hs_macr))

The two most likely plots one will want to see immediately are the library sizes and data densities.

hs_macr_metrics$libsize

## The libraries > 600 in HPGL id have lower coverage as they were sequenced many more to a lane.
hs_macr_metrics$density

## This same phenomenon is visible in the density plot
hs_macr_metrics$nonzero

## and the non-zero gene / cpm plot.

4.2 Normalization effects on the macrophage data

The plotted library sizes suggest that differences observed in the raw data are due at least partially to less saturation in the new samples. Therefore, low-count filtering is almost certainly to prove an important parameter. Similarly, a batch correction should help alleviate these differences.

hs_macr_nonil <- subset_expt(hs_macr, subset="condition=='sh'|condition=='chr'")
## There were 11, now there are 9 samples.
hs_macr_cpmlow <- sm(normalize_expt(hs_macr, filter="cbcb", convert="cpm"))
hs_macr_l2cpmlow <- sm(normalize_expt(hs_macr, filter="cbcb", convert="cpm", transform="log2"))
hs_macr_l2cpmpofa <- sm(normalize_expt(hs_macr, filter="pofa", convert="cpm", transform="log2"))
hs_macr_cpmrlelow <- sm(normalize_expt(hs_macr, filter="cbcb", convert="cpm", norm="rle"))
hs_macr_l2cpmrlelow <- sm(normalize_expt(hs_macr, filter="cbcb", convert="cpm",
                                         norm="rle", transform="log2"))

The normalizations in the block above do not attempt any batch correction. Now take a moment to view how they affect pca plots.

hs_macr_pca <- plot_pca(hs_macr)
## Start with the unmodified data
## It is clearly a mess.
hs_macr_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

knitr::kable(hs_macr_pca$table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.5683 -0.3528
HPGL0242 sh a 1 #000099 sh_2271 -0.1753 -0.2831
HPGL0243 sh a 1 #000099 sh_2272 -0.2902 -0.1149
HPGL0244 chr a 1 #990000 chr_5433 -0.0763 0.6999
HPGL0245 chr a 1 #990000 chr_1320 0.0355 0.0086
HPGL0246 chr a 1 #990000 chr_2504 -0.2197 0.3669
HPGL0247 chr a 1 #990000 chr_5430 -0.0294 -0.0494
HPGL0248 chr a 1 #990000 chr_5397 0.1368 0.0229
HPGL0637 uninf b 2 #009900 uninf 0.3007 0.1848
HPGL0638 sh b 2 #000099 sh_2189 0.4894 -0.1976
HPGL0639 sh b 2 #000099 sh_1022 0.3966 -0.2854
hs_macr_cpmlow_pca <- plot_pca(hs_macr_cpmlow)
hs_macr_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## low-filtering and cpm is insufficient to clarify the data, but it is a start
knitr::kable(hs_macr_pca$table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.5683 -0.3528
HPGL0242 sh a 1 #000099 sh_2271 -0.1753 -0.2831
HPGL0243 sh a 1 #000099 sh_2272 -0.2902 -0.1149
HPGL0244 chr a 1 #990000 chr_5433 -0.0763 0.6999
HPGL0245 chr a 1 #990000 chr_1320 0.0355 0.0086
HPGL0246 chr a 1 #990000 chr_2504 -0.2197 0.3669
HPGL0247 chr a 1 #990000 chr_5430 -0.0294 -0.0494
HPGL0248 chr a 1 #990000 chr_5397 0.1368 0.0229
HPGL0637 uninf b 2 #009900 uninf 0.3007 0.1848
HPGL0638 sh b 2 #000099 sh_2189 0.4894 -0.1976
HPGL0639 sh b 2 #000099 sh_1022 0.3966 -0.2854
hs_macr_l2cpmlow_pca <- plot_pca(hs_macr_l2cpmlow)
hs_macr_l2cpmlow_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Moving to the log scale makes much clearer the batch effect
## and shows that the data is actually pretty well separated by condition already.
knitr::kable(hs_macr_l2cpmlow_pca$table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.2740 -0.6084
HPGL0242 sh a 1 #000099 sh_2271 -0.2364 -0.3174
HPGL0243 sh a 1 #000099 sh_2272 -0.1964 -0.1616
HPGL0244 chr a 1 #990000 chr_5433 -0.1603 0.4705
HPGL0245 chr a 1 #990000 chr_1320 -0.1480 0.2027
HPGL0246 chr a 1 #990000 chr_2504 -0.1549 0.2445
HPGL0247 chr a 1 #990000 chr_5430 -0.1323 0.1467
HPGL0248 chr a 1 #990000 chr_5397 -0.1605 0.2113
HPGL0637 uninf b 2 #009900 uninf 0.5074 0.1015
HPGL0638 sh b 2 #000099 sh_2189 0.5063 0.0315
HPGL0639 sh b 2 #000099 sh_1022 0.4492 -0.3213
hs_macr_l2cpmpofa_pca <- plot_pca(hs_macr_l2cpmpofa)
hs_macr_l2cpmpofa_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## I have access to a few low-count filters, but they appear to make little or no difference.
knitr::kable(hs_macr_l2cpmpofa_pca$table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.2784 -0.6197
HPGL0242 sh a 1 #000099 sh_2271 -0.2380 -0.3080
HPGL0243 sh a 1 #000099 sh_2272 -0.1973 -0.1527
HPGL0244 chr a 1 #990000 chr_5433 -0.1575 0.4586
HPGL0245 chr a 1 #990000 chr_1320 -0.1471 0.2057
HPGL0246 chr a 1 #990000 chr_2504 -0.1526 0.2483
HPGL0247 chr a 1 #990000 chr_5430 -0.1320 0.1522
HPGL0248 chr a 1 #990000 chr_5397 -0.1590 0.2123
HPGL0637 uninf b 2 #009900 uninf 0.5043 0.0919
HPGL0638 sh b 2 #000099 sh_2189 0.5071 0.0361
HPGL0639 sh b 2 #000099 sh_1022 0.4506 -0.3249
hs_macr_cpmrlelow_pca <- plot_pca(hs_macr_cpmrlelow)
hs_macr_cpmrlelow_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## RLE normalization has an interesting effect on the data
knitr::kable(hs_macr_cpmrlelow_pca$table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.3996 0.1372
HPGL0242 sh a 1 #000099 sh_2271 -0.3481 0.0650
HPGL0243 sh a 1 #000099 sh_2272 -0.3362 -0.0977
HPGL0244 chr a 1 #990000 chr_5433 0.3113 -0.5311
HPGL0245 chr a 1 #990000 chr_1320 -0.1105 -0.1660
HPGL0246 chr a 1 #990000 chr_2504 0.0868 -0.2796
HPGL0247 chr a 1 #990000 chr_5430 -0.1558 -0.1185
HPGL0248 chr a 1 #990000 chr_5397 0.0940 -0.1828
HPGL0637 uninf b 2 #009900 uninf 0.5731 0.1938
HPGL0638 sh b 2 #000099 sh_2189 0.3514 0.4037
HPGL0639 sh b 2 #000099 sh_1022 -0.0665 0.5759
hs_macr_l2cpmrlelow_pca <- plot_pca(hs_macr_l2cpmrlelow, size_column="pctcategory")
hs_macr_l2cpmrlelow_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Still the batch effect is pretty ridiculous
## This plot resizes the glyphs to indicate the % mapping.
knitr::kable(hs_macr_l2cpmrlelow_pca$table)
sampleid condition batch batch_int colors labels PC1 PC2 pctcategory size
HPGL0241 uninf a 1 #009900 uninf_1 -0.2742 -0.5998 0 1
HPGL0242 sh a 1 #000099 sh_2271 -0.2458 -0.3404 3 4
HPGL0243 sh a 1 #000099 sh_2272 -0.1979 -0.1365 3 4
HPGL0244 chr a 1 #990000 chr_5433 -0.1559 0.4444 1 2
HPGL0245 chr a 1 #990000 chr_1320 -0.1431 0.2312 4 5
HPGL0246 chr a 1 #990000 chr_2504 -0.1560 0.2440 2 3
HPGL0247 chr a 1 #990000 chr_5430 -0.1326 0.1528 2 3
HPGL0248 chr a 1 #990000 chr_5397 -0.1557 0.2075 4 5
HPGL0637 uninf b 2 #009900 uninf 0.5065 0.1159 0 1
HPGL0638 sh b 2 #000099 sh_2189 0.5091 0.0200 5 6
HPGL0639 sh b 2 #000099 sh_1022 0.4456 -0.3390 3 4

4.2.1 Look at the principal components

hs_macr_pca_info <- sm(pca_information(hs_macr,
                                       expt_factors=c("condition","batch","pctmappedparasite",
                                                      "state","pathogenstrain","rnangul")))

##write.csv(macro_pca_info$rsquared_table, file="images/rsquared.csv")
##write.csv(macro_pca_info$anova_p, file="images/anova_p.csv")
hs_macr_pca_info$anova_fstat_heatmap

I could continue the above normalizations and plots, but it quickly becomes boring and redundant. The batch effect is too significant to continue without addressing it. One thing that was brought up in previous discussions: does the mapping percentage act as a surrogate variable? This suggests that if it does, it is pretty weak, and the last PCA plot shows that is cosegregates with the existing conditions: uninfected has the lowest mapping percent to the parasite [0%]; while chronic has the lower percentage of the infections, but not so much lower than self-healing in general.

4.2.2 Normalization with batch effect minimization

We have good information that suggests a strong batch effect in this data. Let us put that information to use in the following block.

How do these modifications affect the separation of the data?

hs_macr_cpmlowcbcb <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm", batch=TRUE))
hs_macr_l2cpmlowcom <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                         batch="combat_scale", transform="log2"))
hs_macr_l2cpmrlelowcom <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                            batch="combat_scale", transform="log2", norm="rle"))
hs_macr_l2cpmquantlowcmod <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                               batch="combatmod", transform="log2", norm="quant"))
hs_macr_l2cpmquantlowcom <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                              batch="combat_scale", transform="log2", norm="quant"))
hs_macr_l2cpmquantlowsva <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                              batch="svaseq", norm="quant"))
hs_macr_l2cpmquantlowfsva <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                               batch="fsva", norm="quant"))
hs_macr_l2cpmquantlowruv <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                              batch="combat_scale", transform="log2", norm="quant"))
tt <- sm(graph_metrics(hs_macr_l2cpmquantlowruv))
hs_macr_cpmlowcbcb_pca <- plot_pca(hs_macr_cpmlowcbcb)
hs_macr_cpmlowcbcb_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## cbcb's batch correction is insufficient, at least without log2
hs_macr_cpmlowcbcb_pca$table
##    sampleid condition batch batch_int  colors   labels      PC1      PC2
## 1  HPGL0241     uninf     a         1 #009900  uninf_1 -0.09337  0.66888
## 2  HPGL0242        sh     a         1 #000099  sh_2271 -0.23685 -0.15485
## 3  HPGL0243        sh     a         1 #000099  sh_2272 -0.06401 -0.01482
## 4  HPGL0244       chr     a         1 #990000 chr_5433  0.60514 -0.04000
## 5  HPGL0245       chr     a         1 #990000 chr_1320  0.09059 -0.20559
## 6  HPGL0246       chr     a         1 #990000 chr_2504  0.21245 -0.28906
## 7  HPGL0247       chr     a         1 #990000 chr_5430 -0.05050 -0.20619
## 8  HPGL0248       chr     a         1 #990000 chr_5397  0.19782 -0.22250
## 9  HPGL0637     uninf     b         2 #009900    uninf  0.13221  0.51916
## 10 HPGL0638        sh     b         2 #000099  sh_2189 -0.13449  0.11016
## 11 HPGL0639        sh     b         2 #000099  sh_1022 -0.65901 -0.16518
hs_macr_l2cpmlowcom_pca <- plot_pca(hs_macr_l2cpmlowcom)
hs_macr_l2cpmlowcom_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## oooo pretty
knitr::kable(hs_macr_l2cpmlowcom_pca$table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.5445 0.0803
HPGL0242 sh a 1 #000099 sh_2271 -0.2561 0.2708
HPGL0243 sh a 1 #000099 sh_2272 -0.0981 -0.2656
HPGL0244 chr a 1 #990000 chr_5433 0.4802 0.1559
HPGL0245 chr a 1 #990000 chr_1320 0.2490 -0.3045
HPGL0246 chr a 1 #990000 chr_2504 0.2955 0.4898
HPGL0247 chr a 1 #990000 chr_5430 0.2064 -0.5245
HPGL0248 chr a 1 #990000 chr_5397 0.2660 0.2246
HPGL0637 uninf b 2 #009900 uninf -0.2121 -0.2582
HPGL0638 sh b 2 #000099 sh_2189 -0.1054 -0.1486
HPGL0639 sh b 2 #000099 sh_1022 -0.2810 0.2800
hs_macr_l2cpmrlelowcom_pca <- plot_pca(hs_macr_l2cpmrlelowcom)
hs_macr_l2cpmrlelowcom_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## also looking good
knitr::kable(hs_macr_l2cpmrlelowcom_pca$table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.5606 0.0681
HPGL0242 sh a 1 #000099 sh_2271 -0.2444 -0.2858
HPGL0243 sh a 1 #000099 sh_2272 -0.0863 0.0807
HPGL0244 chr a 1 #990000 chr_5433 0.4560 0.1062
HPGL0245 chr a 1 #990000 chr_1320 0.2571 0.0781
HPGL0246 chr a 1 #990000 chr_2504 0.2825 -0.3644
HPGL0247 chr a 1 #990000 chr_5430 0.2205 0.4784
HPGL0248 chr a 1 #990000 chr_5397 0.2767 -0.3788
HPGL0637 uninf b 2 #009900 uninf -0.2143 0.4161
HPGL0638 sh b 2 #000099 sh_2189 -0.0971 0.2104
HPGL0639 sh b 2 #000099 sh_1022 -0.2902 -0.4090
hs_macr_l2cpmquantlowcmod_pca <- plot_pca(hs_macr_l2cpmquantlowcmod)
hs_macr_l2cpmquantlowcmod_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## not as good
knitr::kable(hs_macr_l2cpmquantlowcmod_pca$table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.6808 0.1257
HPGL0242 sh a 1 #000099 sh_2271 -0.3746 -0.1335
HPGL0243 sh a 1 #000099 sh_2272 -0.2021 0.2806
HPGL0244 chr a 1 #990000 chr_5433 0.3934 -0.1607
HPGL0245 chr a 1 #990000 chr_1320 0.1805 0.2191
HPGL0246 chr a 1 #990000 chr_2504 0.1726 -0.4126
HPGL0247 chr a 1 #990000 chr_5430 0.1340 0.3985
HPGL0248 chr a 1 #990000 chr_5397 0.1736 -0.2283
HPGL0637 uninf b 2 #009900 uninf 0.2009 0.3443
HPGL0638 sh b 2 #000099 sh_2189 0.1591 0.1152
HPGL0639 sh b 2 #000099 sh_1022 -0.1566 -0.5484
hs_macr_l2cpmquantlowfsva_pca <- plot_pca(hs_macr_l2cpmquantlowfsva)
hs_macr_l2cpmquantlowfsva_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pretty_pca <- plot_pca(hs_macr_l2cpmquantlowcom)
pretty_pca$plot
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## perhaps the best?
knitr::kable(pretty_pca$table)
sampleid condition batch batch_int colors labels PC1 PC2
HPGL0241 uninf a 1 #009900 uninf_1 -0.5479 0.0811
HPGL0242 sh a 1 #000099 sh_2271 -0.2444 0.2656
HPGL0243 sh a 1 #000099 sh_2272 -0.0984 -0.2749
HPGL0244 chr a 1 #990000 chr_5433 0.4800 0.1569
HPGL0245 chr a 1 #990000 chr_1320 0.2584 -0.3114
HPGL0246 chr a 1 #990000 chr_2504 0.2798 0.5005
HPGL0247 chr a 1 #990000 chr_5430 0.2043 -0.5226
HPGL0248 chr a 1 #990000 chr_5397 0.2774 0.2092
HPGL0637 uninf b 2 #009900 uninf -0.2271 -0.2421
HPGL0638 sh b 2 #000099 sh_2189 -0.1144 -0.1436
HPGL0639 sh b 2 #000099 sh_1022 -0.2676 0.2813
##write.csv(pretty_pca$table, file="excel/pretty_pca.csv")

4.2.3 Surrogate estimations

## Now try and combine this with
## https://github.com/jtleek/svaseq/blob/master/recount.Rmd
## To see if things become clearer
hs_macr_filt <- sm(normalize_expt(expt=hs_macr, filter=TRUE))
hs_macr_se <- sm(compare_surrogate_estimates(hs_macr_filt))

## I think my implementation of ruv_empirical may be borken.
hs_macr_se$pca_plots$ruvsup
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

hs_macr_se$ruv_supervised_adjust$svs_sample

hs_macr_se$pca_plots$svasup
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

hs_macr_se$sva_supervised_adjust$svs_sample

## The new uninfected sample still seems oddly placed
hs_macr_se$pca_plots$pca
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

hs_macr_se$pca_adjust$svs_sample

## Ditto here
hs_macr_se$pca_plots$svaunsup
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## This seems pretty reasonable to me
hs_macr_se$pca_plots$ruvresid
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## This looks a lot like pca and ruvsup, but the corrplot suggested to me that it would look weirder.

4.2.4 Effect on other metrics by batch correction

Lets see how these modification affect the visualizations

hs_macr_batchnorm_metrics <- sm(graph_metrics(hs_macr_l2cpmquantlowcom))
print(hs_macr_batchnorm_metrics$corheat)

## The combat adjusted counts seem to split  relatively clearly between types.
print(hs_macr_batchnorm_metrics$smc)

## This one sample is still problematic, though.
print(hs_macr_batchnorm_metrics$pcaplot)
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## But quantile normalization really does smash them into the same distribution.

Some other analyses performed suggest a possible switch between two samples. We can artifically see what the data would look like if that is true by switching those two samples in a separate experimental design.

5 TODO 201611 #4

PBMC: variance partition human, “Are there any genes with variance we can ascribe to condition?” That set of genes is the most interesting. – use the variance parition table and pull the top 200. Report back the % variance by condition for all of these genes.

6 Variance Partition

In the following, I will attempt to find the variance associated with different experimental factors in the macrophage data with mappings against the human transcriptome.

6.1 Setting up

As in macrophage_estimation_human.html, I need to pull out the relevant samples:

7 Perform variancePartition analyses

7.1 Condition + batch

Start out with just condition and batch

## The default arguments are to query ~ condition + (1|batch)
## Which, because condition is categorical, will end badly.
hs_macr_vp_cb <- sm(varpart(hs_macr_filt,
                    predictor=NULL,
                    factors=c("condition","batch")))
hs_macr_vp_cb$percent_plot

hs_macr_vp_cb$partition_plot

## Batch is dominant, but there is still quite a lot left in residuals.

7.2 Condition + batch + snpclades

hs_macr_vp_cbs3 <- sm(varpart(hs_macr_l2cpmquantlowsva,
                      predictor=NULL,
                      factors=c("condition","batch","snpcladev3")))
hs_macr_vp_cbs3$percent_plot

hs_macr_vp_cbs3$partition_plot

8 Repeat analyses, Leishmania samples

9 Sample metrics

I just realized that what I need to do is to take the surrogate estimates from the human experiment. Using those I think I will be able to get a more realistic estimate of what is going on.

9.1 But first, attempt a normal estimation

parasite_expt <- create_expt(metadata="sample_sheets/all_samples-combined.xlsx",
                                gene_info=lp_annotations,
                                file_column="parasitefile")
## Reading the sample metadata.
## The sample definitions comprises: 50, 37 rows, columns.
## Reading count tables.
## Reading count tables with read.table().
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0242/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0243/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0244/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0245/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0246/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0247/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0248/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0316/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0318/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0320/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0322/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0631/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0632/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0633/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0634/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0635/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0636/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0638/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0639/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0641/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0643/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0651/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0652/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0653/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0654/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0655/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0656/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0658/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0659/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0660/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0661/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0662/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## /cbcb/nelsayed-scratch/atb/rnaseq/lpanamensis_2016/preprocessing/hpgl0663/outputs/tophat_lpanamensis/accepted_paired.count.xz contains 8846 rows and merges to 8846 rows.
## Finished reading count tables.
## Matched 8778 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
lp_macr <- subset_expt(parasite_expt, subset="experimentname=='macrophage'")
## There were 33, now there are 9 samples.
chosen_colors <- c("#990000", "#000099")
names(chosen_colors) <- c("chr", "sh")
lp_macr <- set_expt_colors(lp_macr, colors=chosen_colors)
## The new colors are a character, changing according to condition.
lp_testing <- sm(write_expt(lp_macr,
                            excel=paste0("excel/macrophage_parasite_data-v", ver, ".xlsx"),
                            violin=TRUE))
lp_testing$norm_pca

Now lets try removing some of the surrogates, the two primary candidates are the strains which I proxy with 3 columns: snpclade, snpcladev2, and snpcladev3; and the batch. In this data set batch is either a or b.

In theory, sva should pick up both of those in one invocation.

## Perform a 'normal' sva
lp_macr_sva <- sm(normalize_expt(lp_macr, batch="svaseq", filter=TRUE))
## Play with the normalization of it
lp_macr_sva_norm <- sm(normalize_expt(lp_macr_sva, filter=TRUE, convert="cpm",
                                      norm="quant", transform="log2"))
lp_macr_sva_norm_metr <- sm(graph_metrics(lp_macr_sva_norm))
## Try limma's method
lp_macr_limma_fcqlr <- sm(normalize_expt(lp_macr, batch="limma", filter=TRUE,
                                        batch2="snpcladev3", convert="cpm", norm="quant"))

9.2 Start with a pca from sva

lp_macr_sva_norm_metr$pcaplot

9.3 Try limma’s removebatcheffect

Another method might be to try using limmaresid to explicitly pull both columns.

plot_pca(lp_macr_limma_fcqlr)$plot

Another method might be to try pca on two separate invocations.

## frrpr rprrf
## raw(pca(raw(raw(filter(data)))))
## thus fcqsl would be log2(sva(quant(cpm(filter(data)))))
lp_macr_rprrf <- sm(normalize_expt(lp_macr, batch="pca", filter=TRUE))
lp_macr_rprrf <- set_expt_batches(expt=lp_macr_rprrf, fact="snpcladev3")
lp_macr_rprrf <- sm(normalize_expt(lp_macr_rprrf, batch="pca"))
lp_macr_lpqcf <- sm(normalize_expt(lp_macr_rprrf, convert="cpm",
                                   norm="quant", transform="log2",
                                   filter=TRUE))
lp_macr_lpqcf_metr <- sm(graph_metrics(lp_macr_lpqcf))

lp_macr_lpqcf_metr$pcaplot

9.4 Attempting count extraction from human surrogates

If indeed the human surrogates are a primary determinant in the parasite data, then we should be able to see that in some sample estimations of the parasite data. Let us try!

summary(hs_macr_se)
##                         Length Class        Mode   
## pca_adjust               7     -none-       list   
## sva_supervised_adjust    7     -none-       list   
## sva_unsupervised_adjust  7     -none-       list   
## ruv_supervised_adjust    7     -none-       list   
## ruv_residual_adjust      7     -none-       list   
## ruv_empirical_adjust     7     -none-       list   
## adjustments              8     -none-       list   
## correlations            64     -none-       numeric
## plot                     3     recordedplot list   
## pca_plots                7     -none-       list   
## catplots                 0     -none-       NULL
adjusts <- hs_macr_se$sva_supervised_adjust$model_adjust
## Remove surrogate estimates #1 and #9
new_adjusts <- as.matrix(adjusts[c(-1, -9), ])
## The question of course,  how do I use these estimations to adjust the data?

test_adjusted <- counts_from_surrogates(lp_macr, new_adjusts)
sv_adjusted <- lp_macr
Biobase::exprs(sv_adjusted$expressionset) <- test_adjusted

hs_adjusted <- subset_expt(sv_adjusted, subset="sampleid!='HPGL0248'")
## There were 9, now there are 8 samples.
hs_adjusted_fcqsl <- sm(normalize_expt(hs_adjusted, convert="cpm", filter=TRUE,
                                       batch="sva", norm="quant", transform="log2"))
hs_adjusted_fcqsl_metr <- sm(graph_metrics(hs_adjusted_fcqsl))
hs_sv_adjusted <- subset_expt(sv_adjusted, subset="sampleid!='HPGL0248'")
## There were 9, now there are 8 samples.
hs_sv_adjusted_fcqsl <- sm(normalize_expt(hs_sv_adjusted, convert="cpm", filter=TRUE,
                                          batch="sva", norm="quant", transform="log2"))
hs_sv_adjusted_fcqsl_metr <- sm(graph_metrics(hs_sv_adjusted_fcqsl))
lp_macr_rrrrr_metr <- sm(graph_metrics(lp_macr))
lp_macr_fcqrr <- sm(normalize_expt(lp_macr, convert="cpm", filter=TRUE, norm="quant"))
lp_macr_fcqrr_metr <- sm(graph_metrics(lp_macr_fcqrr))
lp_macr_fcqfl <- sm(normalize_expt(lp_macr, convert="cpm", filter=TRUE, norm="quant",
                                   batch="fsva", transform="log2"))
lp_macr_fcqfl_metr <- sm(graph_metrics(lp_macr_fcqfl))
lp_macr_fcqcl <- sm(normalize_expt(lp_macr, convert="cpm", filter=TRUE,
                                   norm="quant", batch="combat_scale", transform="log2"))
lp_macr_fcqcl_metr <- sm(graph_metrics(lp_macr_fcqcl))
lp_macr_fcrfl <- sm(normalize_expt(lp_macr, filter=TRUE, norm="quant",
                                   transform="log2", batch="fsva"))
lp_macr_fcrfl_metr <- sm(graph_metrics(lp_macr_fcrfl))

10 Compare results

Now lets compare the results of using the human data to inform the parasite estimations.

10.1 Before including the new estimates

Lets first print a few using different metrics before using the human SV estimates. It looks like sva etc are not too bad.

lp_macr_fcqrr_metr$pcaplot

## Just normalizing shows the central concern.

hs_sv_adjusted_fcqsl_metr$pcaplot

## The human surrogates work pretty well

lp_macr_fcqcl_metr$pcaplot

## Combat cannot deal with this data very well.

lp_macr_fcrfl_metr$pcaplot

## fsva gets a little confused, but not bad.

11 Parasite variance partition

11.1 Condition + batch

Start out with just condition and batch

## The default arguments are to query ~ condition + (1|batch)
## Which, because condition is categorical, will end badly.
lp_macr_vp_cb <- sm(varpart(lp_macr, predictor=NULL, factors=c("condition","batch")))
lp_macr_vp_cb$percent_plot

lp_macr_vp_cb$partition_plot

lp_macr_vp_lpqcf <- sm(varpart(lp_macr_lpqcf, predictor=NULL, factors=c("condition","batch")))
lp_macr_vp_lpqcf$percent_plot

lp_macr_vp_lpqcf$partition_plot

lp_macr_vp_hsfcqsl <- sm(varpart(hs_adjusted_fcqsl, predictor=NULL, factors=c("condition","batch")))
lp_macr_vp_hsfcqsl$percent_plot

lp_macr_vp_hsfcqsl$partition_plot

pander::pander(sessionInfo())

R version 3.4.4 (2018-03-15)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: ruv(v.0.9.7) and hpgltools(v.2018.03)

loaded via a namespace (and not attached): backports(v.1.1.2), corrplot(v.0.84), aroma.light(v.3.8.0), plyr(v.1.8.4), lazyeval(v.0.2.1), survJamda.data(v.1.0.2), splines(v.3.4.4), BiocParallel(v.1.12.0), GenomeInfoDb(v.1.14.0), ggplot2(v.2.2.1), sva(v.3.26.0), digest(v.0.6.15), SuppDists(v.1.1-9.4), foreach(v.1.4.4), BiocInstaller(v.1.28.0), htmltools(v.0.3.6), gdata(v.2.18.0), magrittr(v.1.5), memoise(v.1.1.0), survJamda(v.1.1.4), doParallel(v.1.0.11), openxlsx(v.4.0.17), limma(v.3.34.9), Biostrings(v.2.46.0), annotate(v.1.56.2), matrixStats(v.0.53.1), R.utils(v.2.6.0), prettyunits(v.1.0.2), colorspace(v.1.3-2), blob(v.1.1.1), ggrepel(v.0.7.0), RCurl(v.1.95-4.10), graph(v.1.56.0), roxygen2(v.6.0.1), genefilter(v.1.60.0), lme4(v.1.1-17), survival(v.2.42-3), iterators(v.1.0.9), gtable(v.0.2.0), zlibbioc(v.1.24.0), XVector(v.0.18.0), DelayedArray(v.0.4.1), BiocGenerics(v.0.24.0), scales(v.0.5.0.9000), DESeq(v.1.30.0), DBI(v.0.8), edgeR(v.3.20.9), Rcpp(v.0.12.16), xtable(v.1.8-2), progress(v.1.1.2), bit(v.1.1-12), OrganismDbi(v.1.20.0), preprocessCore(v.1.40.0), stats4(v.3.4.4), lava(v.1.6.1), prodlim(v.2018.04.18), ecodist(v.2.0.1), httr(v.1.3.1), gplots(v.3.0.1), RColorBrewer(v.1.1-2), XML(v.3.98-1.11), R.methodsS3(v.1.7.1), locfit(v.1.5-9.1), labeling(v.0.3), rlang(v.0.2.0.9001), reshape2(v.1.4.3), AnnotationDbi(v.1.40.0), munsell(v.0.4.3), tools(v.3.4.4), RSQLite(v.2.1.0), devtools(v.1.13.5), evaluate(v.0.10.1), stringr(v.1.3.0), yaml(v.2.1.19), bootstrap(v.2017.2), knitr(v.1.20), bit64(v.0.9-7), pander(v.0.6.1), caTools(v.1.17.1), EDASeq(v.2.12.0), RBGL(v.1.54.0), nlme(v.3.1-137), R.oo(v.1.22.0), xml2(v.1.2.0), biomaRt(v.2.34.2), compiler(v.3.4.4), pbkrtest(v.0.4-7), testthat(v.2.0.0), variancePartition(v.1.8.1), tibble(v.1.4.2), statmod(v.1.4.30), geneplotter(v.1.56.0), stringi(v.1.1.7), highr(v.0.6), GenomicFeatures(v.1.30.3), lattice(v.0.20-35), Matrix(v.1.2-14), commonmark(v.1.5), survcomp(v.1.28.5), nloptr(v.1.0.4), survivalROC(v.1.0.3), pillar(v.1.2.2), data.table(v.1.10.4-3), bitops(v.1.0-6), corpcor(v.1.6.9), rtracklayer(v.1.38.3), GenomicRanges(v.1.30.3), colorRamps(v.2.3), R6(v.2.2.2), latticeExtra(v.0.6-28), directlabels(v.2017.03.31), hwriter(v.1.3.2), RMySQL(v.0.10.14), ShortRead(v.1.36.1), KernSmooth(v.2.23-15), gridExtra(v.2.3), IRanges(v.2.12.0), codetools(v.0.2-15), MASS(v.7.3-50), gtools(v.3.5.0), assertthat(v.0.2.0), SummarizedExperiment(v.1.8.1), rprojroot(v.1.3-2), withr(v.2.1.2), RUVSeq(v.1.12.0), GenomicAlignments(v.1.14.2), Rsamtools(v.1.30.0), S4Vectors(v.0.16.0), GenomeInfoDbData(v.1.0.0), mgcv(v.1.8-23), parallel(v.3.4.4), quadprog(v.1.5-5), grid(v.3.4.4), minqa(v.1.2.4), rmarkdown(v.1.9), Rtsne(v.0.13), Biobase(v.2.38.0), base64enc(v.0.1-3) and rmeta(v.3.0)

message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 19d49d10f887097fff276c8f5a1db2dbc9168e8d
## R> packrat::restore()
## This is hpgltools commit: Wed May 2 10:24:05 2018 -0400: 19d49d10f887097fff276c8f5a1db2dbc9168e8d
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 02_estimation_macrophage-v20170820.rda.xz
tmp <- sm(saveme(filename=this_save))
---
title: "L.panamensis 2016: Sample estimation of Human macrophages."
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
}
</style>

```{r options, include=FALSE}
library("hpgltools")
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(progress=TRUE,
                     verbose=TRUE,
                     width=90,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      fig.width=8,
                      fig.height=8,
                      dpi=96)
old_options <- options(
  digits=4,
  stringsAsFactors=FALSE,
  knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
set.seed(1)
previous_file <- "01_annotation.Rmd"
ver <- "20170820"

tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
rmd_file <- "02_estimation_macrophage.Rmd"
```

# Sample Estimation, Macrophages: `r ver`

This document is concerned with analyzing RNAseq data of human and parasite during the infection of
human macrophages.  A few different strains of L. panamensis were used; the experiment is therefore
segregated into groups named 'self-healing', 'chronic', and 'uninfected'.  Two separate sets of
libraries were generated, one earlier set with greater coverage, and a later set including only 1
uninfected sample, and 2-3 chronic samples.

## Extract the macrophage experiment

The following subset operation extract the samples used for the macrophage experiment. The next
three lines then change the colors from the defaults.

```{r macrophages}
hs_expt <- sm(create_expt(metadata="sample_sheets/all_samples-combined.xlsx",
                          gene_info=hs_annotations,
                          file_column="humanfile"))

hs_macr <- subset_expt(hs_expt, subset="experimentname=='macrophage'")
new_colors <- c("#009900", "#990000", "#000099")
names(new_colors) <- c("uninf", "chr", "sh")

hs_macr <- set_expt_colors(hs_macr, colors=new_colors)
labels <- as.character(pData(hs_macr)[["label"]])
hs_macr <- set_expt_samplenames(hs_macr, labels)

hs_cds_expt <- exclude_genes_expt(hs_expt, column='Type', patterns="protein_coding")
hs_cds_macr <- subset_expt(hs_cds_expt, subset="experimentname=='macrophage'")
hs_cds_macr <- set_expt_colors(hs_cds_macr, colors=new_colors)
hs_cds_macr <- set_expt_samplenames(hs_cds_macr, labels)
```

# Figure S1

Figure S1 should include nice versions of the sample metrics.
The normalization chosen is batch-in-model.

First, however, we will make some plots of the raw data.

Sample names are going to be 'infectionstate_strainnumber' : chr_7721

* Panel A: Library sizes.
* Panel B: Heatmap distance raw.
* Panel C: PCA
* Panel D: TSNE

```{r fig_s1_write, fig.show="hide"}
fig_s1 <- sm(write_expt(hs_macr, norm="quant", violin=FALSE, convert="cpm",
                        transform="log2", batch="pca", filter=TRUE,
                        excel=paste0("excel/figure_s1_sample_estimation-v", ver, ".xlsx")))
fig_s1_cds <- sm(write_expt(hs_cds_macr, norm="quant", violin=FALSE, convert="cpm",
                            transform="log2", batch="pca", filter=TRUE,
                            excel=paste0("excel/figure_s1_sample_estimation_cds_only-v", ver, ".xlsx")))
```

First look for clustering patterns in the raw data, in all data followed by only the CDS regions.

```{r figure_s1}
plot_tsne(hs_macr)$plot
plot_tsne(hs_cds_macr)$plot
```

## Write S1 images

Once the experiment has been written with the various normalizations in place,
we can use that to print and view some representative plots.

```{r view_fig_s1}
pp(file="images/figure_s1a.pdf", image=fig_s1$raw_libsize)
## Show the library sizes of all features.

pp(file="images/figure_s1b.pdf", image=fig_s1$raw_disheat)
## Show the clustering by euclidean distance of the raw data all features.

pp(file="images/figure_s1c.pdf", image=fig_s1$raw_scaled_pca)
## Do a PCA plot of the log2(normalized(data)) without any batch-in-model

pp(file="images/figure_s1d.pdf", image=fig_s1$norm_tsne)
## Do a TSNE of the normalized data after condition is in the model.
```

# Figure 1

 * Figure 1a distance heatmap of the normalized data.
 * Figure 1b is a PCA of the normalized data.

In a similar fashion to what is above, I am printing the figure 1 images.

```{r figure_1}
pp(file="images/figure_1a.pdf", fig_s1$norm_disheat)
## Show the changes in clustering after adding batch to the model and normalizing.

pp(file="images/figure_1a_cds.pdf", image=fig_s1_cds$norm_disheat)
## Observe that the clustering is nearly identical when only looking at the CDS features.

pp(file="images/figure_1b.pdf", image=fig_s1$norm_pca)
## Visualize the clustering after normalization via PCA.
knitr::kable(fig_s1$norm_pca_table)
write.csv(fig_s1$norm_pca_table, file="images/fig_1a.csv")

pp(file="images/figure_1b_cds.pdf", image=fig_s1_cds$norm_pca)
## Once again, see that it is mostly unchanged when using only CDS features.
knitr::kable(fig_s1_cds$norm_pca_table)
write.csv(fig_s1_cds$norm_pca_table, file="images/fig_1a_cds.csv")
```

# Sample metrics

The following is mostly used to compare different methodologies and is therefore
not likely to be useful for most.

The various metrics used to describe and examine the data come once before, and once after normalization.

## Initial sample metrics

graph_metrics creates a set of plots we can use to look at the data before messing with it.

```{r initial_metrics, fig.show="hide"}
written <- sm(write_expt(hs_macr,
                         excel=paste0("excel/macrophage_human_data-v", ver, ".xlsx"),
                         violin=TRUE))
hs_macr_metrics <- sm(graph_metrics(hs_macr))
```

The two most likely plots one will want to see immediately are the library sizes and data densities.

```{r vis_initial_metrics}
hs_macr_metrics$libsize
## The libraries > 600 in HPGL id have lower coverage as they were sequenced many more to a lane.
hs_macr_metrics$density
## This same phenomenon is visible in the density plot
hs_macr_metrics$nonzero
## and the non-zero gene / cpm plot.
```

## Normalization effects on the macrophage data

The plotted library sizes suggest that differences observed in the raw data are due at least
partially to less saturation in the new samples.  Therefore, low-count filtering is almost certainly
to prove an important parameter.  Similarly, a batch correction should help alleviate these
differences.

```{r macrophage_normalization}
hs_macr_nonil <- subset_expt(hs_macr, subset="condition=='sh'|condition=='chr'")
hs_macr_cpmlow <- sm(normalize_expt(hs_macr, filter="cbcb", convert="cpm"))
hs_macr_l2cpmlow <- sm(normalize_expt(hs_macr, filter="cbcb", convert="cpm", transform="log2"))
hs_macr_l2cpmpofa <- sm(normalize_expt(hs_macr, filter="pofa", convert="cpm", transform="log2"))
hs_macr_cpmrlelow <- sm(normalize_expt(hs_macr, filter="cbcb", convert="cpm", norm="rle"))
hs_macr_l2cpmrlelow <- sm(normalize_expt(hs_macr, filter="cbcb", convert="cpm",
                                         norm="rle", transform="log2"))
```

The normalizations in the block above do not attempt any batch correction.  Now take a moment to
view how they affect pca plots.

```{r pca_normalized}
hs_macr_pca <- plot_pca(hs_macr)
## Start with the unmodified data
## It is clearly a mess.
hs_macr_pca$plot
knitr::kable(hs_macr_pca$table)

hs_macr_cpmlow_pca <- plot_pca(hs_macr_cpmlow)
hs_macr_pca$plot
## low-filtering and cpm is insufficient to clarify the data, but it is a start
knitr::kable(hs_macr_pca$table)

hs_macr_l2cpmlow_pca <- plot_pca(hs_macr_l2cpmlow)
hs_macr_l2cpmlow_pca$plot
## Moving to the log scale makes much clearer the batch effect
## and shows that the data is actually pretty well separated by condition already.
knitr::kable(hs_macr_l2cpmlow_pca$table)

hs_macr_l2cpmpofa_pca <- plot_pca(hs_macr_l2cpmpofa)
hs_macr_l2cpmpofa_pca$plot
## I have access to a few low-count filters, but they appear to make little or no difference.
knitr::kable(hs_macr_l2cpmpofa_pca$table)

hs_macr_cpmrlelow_pca <- plot_pca(hs_macr_cpmrlelow)
hs_macr_cpmrlelow_pca$plot
## RLE normalization has an interesting effect on the data
knitr::kable(hs_macr_cpmrlelow_pca$table)

hs_macr_l2cpmrlelow_pca <- plot_pca(hs_macr_l2cpmrlelow, size_column="pctcategory")
hs_macr_l2cpmrlelow_pca$plot
## Still the batch effect is pretty ridiculous
## This plot resizes the glyphs to indicate the % mapping.
knitr::kable(hs_macr_l2cpmrlelow_pca$table)
```

### Look at the principal components

```{r pca_info, show.fig="hide"}
hs_macr_pca_info <- sm(pca_information(hs_macr,
                                       expt_factors=c("condition","batch","pctmappedparasite",
                                                      "state","pathogenstrain","rnangul")))
##write.csv(macro_pca_info$rsquared_table, file="images/rsquared.csv")
##write.csv(macro_pca_info$anova_p, file="images/anova_p.csv")
```

```{r pca_info_heatmap}
hs_macr_pca_info$anova_fstat_heatmap
```

I could continue the above normalizations and plots, but it quickly becomes boring and redundant.
The batch effect is too significant to continue without addressing it.  One thing that was brought
up in previous discussions: does the mapping percentage act as a surrogate variable?  This suggests
that if it does, it is pretty weak, and the last PCA plot shows that is cosegregates with the
existing conditions: uninfected has the lowest mapping percent to the parasite [0%]; while chronic
has the lower percentage of the infections, but not so much lower than self-healing in general.

### Normalization with batch effect minimization

We have good information that suggests a strong batch effect in this data.  Let us put that
information to use in the following block.

How do these modifications affect the separation of the data?

```{r macrobatch_normalization, fig.show="hide"}
hs_macr_cpmlowcbcb <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm", batch=TRUE))
hs_macr_l2cpmlowcom <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                         batch="combat_scale", transform="log2"))
hs_macr_l2cpmrlelowcom <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                            batch="combat_scale", transform="log2", norm="rle"))
hs_macr_l2cpmquantlowcmod <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                               batch="combatmod", transform="log2", norm="quant"))
hs_macr_l2cpmquantlowcom <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                              batch="combat_scale", transform="log2", norm="quant"))
hs_macr_l2cpmquantlowsva <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                              batch="svaseq", norm="quant"))
hs_macr_l2cpmquantlowfsva <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                               batch="fsva", norm="quant"))
hs_macr_l2cpmquantlowruv <- sm(normalize_expt(hs_macr, filter="hpgl", convert="cpm",
                                              batch="combat_scale", transform="log2", norm="quant"))
tt <- sm(graph_metrics(hs_macr_l2cpmquantlowruv))
```

```{r macrobatch_norm_plot}
hs_macr_cpmlowcbcb_pca <- plot_pca(hs_macr_cpmlowcbcb)
hs_macr_cpmlowcbcb_pca$plot
## cbcb's batch correction is insufficient, at least without log2
hs_macr_cpmlowcbcb_pca$table

hs_macr_l2cpmlowcom_pca <- plot_pca(hs_macr_l2cpmlowcom)
hs_macr_l2cpmlowcom_pca$plot
## oooo pretty
knitr::kable(hs_macr_l2cpmlowcom_pca$table)

hs_macr_l2cpmrlelowcom_pca <- plot_pca(hs_macr_l2cpmrlelowcom)
hs_macr_l2cpmrlelowcom_pca$plot
## also looking good
knitr::kable(hs_macr_l2cpmrlelowcom_pca$table)

hs_macr_l2cpmquantlowcmod_pca <- plot_pca(hs_macr_l2cpmquantlowcmod)
hs_macr_l2cpmquantlowcmod_pca$plot
## not as good
knitr::kable(hs_macr_l2cpmquantlowcmod_pca$table)

hs_macr_l2cpmquantlowfsva_pca <- plot_pca(hs_macr_l2cpmquantlowfsva)
hs_macr_l2cpmquantlowfsva_pca$plot

pretty_pca <- plot_pca(hs_macr_l2cpmquantlowcom)
pretty_pca$plot
## perhaps the best?
knitr::kable(pretty_pca$table)
##write.csv(pretty_pca$table, file="excel/pretty_pca.csv")
```

### Surrogate estimations

```{r surrogates_estimates}
## Now try and combine this with
## https://github.com/jtleek/svaseq/blob/master/recount.Rmd
## To see if things become clearer
hs_macr_filt <- sm(normalize_expt(expt=hs_macr, filter=TRUE))
hs_macr_se <- sm(compare_surrogate_estimates(hs_macr_filt))
## I think my implementation of ruv_empirical may be borken.
hs_macr_se$pca_plots$ruvsup
hs_macr_se$ruv_supervised_adjust$svs_sample
hs_macr_se$pca_plots$svasup
hs_macr_se$sva_supervised_adjust$svs_sample
## The new uninfected sample still seems oddly placed
hs_macr_se$pca_plots$pca
hs_macr_se$pca_adjust$svs_sample
## Ditto here
hs_macr_se$pca_plots$svaunsup
## This seems pretty reasonable to me
hs_macr_se$pca_plots$ruvresid
## This looks a lot like pca and ruvsup, but the corrplot suggested to me that it would look weirder.
```

### Effect on other metrics by batch correction

Lets see how these modification affect the visualizations

```{r visualize_norma, fig.show="hide"}
hs_macr_batchnorm_metrics <- sm(graph_metrics(hs_macr_l2cpmquantlowcom))
```

```{r see_norm}
print(hs_macr_batchnorm_metrics$corheat)
## The combat adjusted counts seem to split  relatively clearly between types.
print(hs_macr_batchnorm_metrics$smc)
## This one sample is still problematic, though.
print(hs_macr_batchnorm_metrics$pcaplot)
## But quantile normalization really does smash them into the same distribution.
```

Some other analyses performed suggest a possible switch between two samples.  We can artifically see
what the data would look like if that is true by switching those two samples in a separate
experimental design.

# TODO 201611 #4

PBMC: variance partition human, "Are there any genes with variance we can ascribe to condition?"
That set of genes is the most interesting. -- use the variance parition table and pull the top 200.
Report back the % variance by condition for all of these genes.

# Variance Partition

In the following, I will attempt to find the variance associated with different experimental factors
in the macrophage data with mappings against the human transcriptome.

## Setting up

As in [macrophage_estimation_human.html](macrophage_estimation_human.html), I need to pull out the
relevant samples:

# Perform variancePartition analyses

## Condition + batch

Start out with just condition and batch

```{r varpart_condbatch1}
## The default arguments are to query ~ condition + (1|batch)
## Which, because condition is categorical, will end badly.
hs_macr_vp_cb <- sm(varpart(hs_macr_filt,
                    predictor=NULL,
                    factors=c("condition","batch")))
hs_macr_vp_cb$percent_plot
hs_macr_vp_cb$partition_plot
## Batch is dominant, but there is still quite a lot left in residuals.
```

## Condition + batch + snpclades

```{r varpart_condbatchsnp}
hs_macr_vp_cbs3 <- sm(varpart(hs_macr_l2cpmquantlowsva,
                      predictor=NULL,
                      factors=c("condition","batch","snpcladev3")))
hs_macr_vp_cbs3$percent_plot
hs_macr_vp_cbs3$partition_plot
```

# Repeat analyses, Leishmania samples

# Sample metrics

I just realized that what I need to do is to take the surrogate estimates from the human experiment.
Using those I think I will be able to get a more realistic estimate of what is going on.

## But first, attempt a normal estimation

```{r parasite_estimates_normal, fig.show="hide"}
parasite_expt <- create_expt(metadata="sample_sheets/all_samples-combined.xlsx",
                                gene_info=lp_annotations,
                                file_column="parasitefile")

lp_macr <- subset_expt(parasite_expt, subset="experimentname=='macrophage'")
chosen_colors <- c("#990000", "#000099")
names(chosen_colors) <- c("chr", "sh")
lp_macr <- set_expt_colors(lp_macr, colors=chosen_colors)
lp_testing <- sm(write_expt(lp_macr,
                            excel=paste0("excel/macrophage_parasite_data-v", ver, ".xlsx"),
                            violin=TRUE))
lp_testing$norm_pca
```

Now lets try removing some of the surrogates, the two primary candidates are the strains
which I proxy with 3 columns: snpclade, snpcladev2, and snpcladev3; and the batch.  In this
data set batch is either a or b.

In theory, sva should pick up both of those in one invocation.

```{r generate_graphs, fig.show="hide"}
## Perform a 'normal' sva
lp_macr_sva <- sm(normalize_expt(lp_macr, batch="svaseq", filter=TRUE))
## Play with the normalization of it
lp_macr_sva_norm <- sm(normalize_expt(lp_macr_sva, filter=TRUE, convert="cpm",
                                      norm="quant", transform="log2"))
lp_macr_sva_norm_metr <- sm(graph_metrics(lp_macr_sva_norm))
## Try limma's method
lp_macr_limma_fcqlr <- sm(normalize_expt(lp_macr, batch="limma", filter=TRUE,
                                        batch2="snpcladev3", convert="cpm", norm="quant"))
```

## Start with a pca from sva

```{r sva_attempt}
lp_macr_sva_norm_metr$pcaplot
```

## Try limma's removebatcheffect

Another method might be to try using limmaresid to explicitly pull both columns.

```{r resid_attempt}
plot_pca(lp_macr_limma_fcqlr)$plot
```

Another method might be to try pca on two separate invocations.

```{r perform_residuals, fig.show=FALSE}
## frrpr rprrf
## raw(pca(raw(raw(filter(data)))))
## thus fcqsl would be log2(sva(quant(cpm(filter(data)))))
lp_macr_rprrf <- sm(normalize_expt(lp_macr, batch="pca", filter=TRUE))
lp_macr_rprrf <- set_expt_batches(expt=lp_macr_rprrf, fact="snpcladev3")
lp_macr_rprrf <- sm(normalize_expt(lp_macr_rprrf, batch="pca"))
lp_macr_lpqcf <- sm(normalize_expt(lp_macr_rprrf, convert="cpm",
                                   norm="quant", transform="log2",
                                   filter=TRUE))
lp_macr_lpqcf_metr <- sm(graph_metrics(lp_macr_lpqcf))
```

```{r pca_attempt}
lp_macr_lpqcf_metr$pcaplot
```

## Attempting count extraction from human surrogates

If indeed the human surrogates are a primary determinant in the parasite data, then we should
be able to see that in some sample estimations of the parasite data.  Let us try!

```{r parasite_estimation_human_surrogates, fig.show="hide"}
summary(hs_macr_se)
adjusts <- hs_macr_se$sva_supervised_adjust$model_adjust
## Remove surrogate estimates #1 and #9
new_adjusts <- as.matrix(adjusts[c(-1, -9), ])
## The question of course,  how do I use these estimations to adjust the data?

test_adjusted <- counts_from_surrogates(lp_macr, new_adjusts)
sv_adjusted <- lp_macr
Biobase::exprs(sv_adjusted$expressionset) <- test_adjusted

hs_adjusted <- subset_expt(sv_adjusted, subset="sampleid!='HPGL0248'")
hs_adjusted_fcqsl <- sm(normalize_expt(hs_adjusted, convert="cpm", filter=TRUE,
                                       batch="sva", norm="quant", transform="log2"))
hs_adjusted_fcqsl_metr <- sm(graph_metrics(hs_adjusted_fcqsl))

hs_sv_adjusted <- subset_expt(sv_adjusted, subset="sampleid!='HPGL0248'")
hs_sv_adjusted_fcqsl <- sm(normalize_expt(hs_sv_adjusted, convert="cpm", filter=TRUE,
                                          batch="sva", norm="quant", transform="log2"))
hs_sv_adjusted_fcqsl_metr <- sm(graph_metrics(hs_sv_adjusted_fcqsl))

lp_macr_rrrrr_metr <- sm(graph_metrics(lp_macr))
lp_macr_fcqrr <- sm(normalize_expt(lp_macr, convert="cpm", filter=TRUE, norm="quant"))
lp_macr_fcqrr_metr <- sm(graph_metrics(lp_macr_fcqrr))

lp_macr_fcqfl <- sm(normalize_expt(lp_macr, convert="cpm", filter=TRUE, norm="quant",
                                   batch="fsva", transform="log2"))
lp_macr_fcqfl_metr <- sm(graph_metrics(lp_macr_fcqfl))

lp_macr_fcqcl <- sm(normalize_expt(lp_macr, convert="cpm", filter=TRUE,
                                   norm="quant", batch="combat_scale", transform="log2"))
lp_macr_fcqcl_metr <- sm(graph_metrics(lp_macr_fcqcl))

lp_macr_fcrfl <- sm(normalize_expt(lp_macr, filter=TRUE, norm="quant",
                                   transform="log2", batch="fsva"))
lp_macr_fcrfl_metr <- sm(graph_metrics(lp_macr_fcrfl))
```

# Compare results

Now lets compare the results of using the human data to inform the parasite estimations.

## Before including the new estimates

Lets first print a few using different metrics before using the human SV estimates.
It looks like sva etc are not too bad.

```{r compare_pca_among_estimates}
lp_macr_fcqrr_metr$pcaplot
## Just normalizing shows the central concern.

hs_sv_adjusted_fcqsl_metr$pcaplot
## The human surrogates work pretty well

lp_macr_fcqcl_metr$pcaplot
## Combat cannot deal with this data very well.

lp_macr_fcrfl_metr$pcaplot
## fsva gets a little confused, but not bad.
```

# Parasite variance partition

## Condition + batch

Start out with just condition and batch

```{r varpart_condbatch}
## The default arguments are to query ~ condition + (1|batch)
## Which, because condition is categorical, will end badly.
lp_macr_vp_cb <- sm(varpart(lp_macr, predictor=NULL, factors=c("condition","batch")))
lp_macr_vp_cb$percent_plot
lp_macr_vp_cb$partition_plot

lp_macr_vp_lpqcf <- sm(varpart(lp_macr_lpqcf, predictor=NULL, factors=c("condition","batch")))
lp_macr_vp_lpqcf$percent_plot
lp_macr_vp_lpqcf$partition_plot

lp_macr_vp_hsfcqsl <- sm(varpart(hs_adjusted_fcqsl, predictor=NULL, factors=c("condition","batch")))
lp_macr_vp_hsfcqsl$percent_plot
lp_macr_vp_hsfcqsl$partition_plot
```

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
```
