This document turns to the infection of INFECTION cells with L.panamensis. This data is particularly strangely affected by the different strains used to infect the cells, and as a result is both useful and troubling.
“Changes during infection hpgl0630-0636 and hpgl0650-hpgl0663”
Start out by creating the expt and poking at it to see how well/badly behaved the data is.
## Reread the sample sheet because I am fiddling with other possible surrogates (like strain)
## In fact, copy it to a separate sheet because these samples are a mess
infection_human <- expt_subset(human_expt, subset="experimentname=='infection'")
chosen_colors <- c("#009900","#990000", "#000099")
names(chosen_colors) <- c("pbmc_nil","pbmc_ch","pbmc_sh")
uninf_human <- set_expt_colors(infection_human, colors=chosen_colors)
inf_human <- expt_subset(uninf_human, subset="condition!='pbmc_nil'")
##infection_model_test <- model_test(infection_human$design)
The following creates all the metric plots of the raw data.
uninf_human_metrics <- sm(graph_metrics(uninf_human))
inf_human_metrics <- sm(graph_metrics(inf_human))
Now visualize some relevant metrics.
uninf_human_metrics$libsize
## Wow there is a pretty big range in counts observed in this data!!
uninf_human_metrics$density
## Given the big split in counts, this is surprisingly clean
uninf_human_metrics$boxplot
## Ditto
uninf_data <- sm(write_expt(uninf_human,
excel=paste0("excel/infection_human_with_uninfected_data-v", ver, ".xlsx"),
violin=TRUE))
inf_data <- sm(write_expt(inf_human,
excel=paste0("excel/infection_human_without_uninfected_data-v", ver, ".xlsx"),
violin=TRUE))
Now perform the ‘default’ normalization we use in the lab and look again.
uninf_cqf <- sm(normalize_expt(uninf_human, convert="cpm", filter=TRUE, norm="quant"))
uninf_cff_met <- sm(graph_metrics(uninf_human))
inf_cqf <- sm(normalize_expt(inf_human, convert="cpm", filter=TRUE, norm="quant"))
inf_cqf_met <- sm(graph_metrics(inf_human))
And visualize some of the relevant resulting metrics.
The goal of the following 2 blocks is to explore a few different possiblities vis a vis condition/batch in the data. In addition, Maria Adelaida is interested in the differences introduced when the uninfected samples are/aren’t included.
I think for each of the resulting 6 data sets, we want to perform a simple differential expression analysis and see what happens.
In this first iteration, we will log2(cpm(quant(filter()))) the data and leave the experimental humanmeters as the default: condition == the 6 strains, 3 chronic 3 self-healing batch == the three patients p107
## Start with the non-uninfected, no batch correction
inf_lqcf <- sm(normalize_expt(inf_human, filter=TRUE, convert="cpm", transform="log2", norm="quant"))
pca_inf_lqcf <- plot_pca(inf_lqcf)
## 3 three patients are super obvious I think
## The patients 107/108 are on the left while 110 is on the right.
knitr::kable(pca_inf_lqcf$table)
sampleid | condition | batch | batch_int | PC1 | PC2 | colors | labels | |
---|---|---|---|---|---|---|---|---|
hpgl0631 | HPGL0631 | pbmc_ch | d108 | 2 | -0.1611 | -0.3489 | #990000 | HPGL0631 |
hpgl0632 | HPGL0632 | pbmc_ch | d108 | 2 | -0.1370 | -0.3164 | #990000 | HPGL0632 |
hpgl0633 | HPGL0633 | pbmc_ch | d108 | 2 | -0.0604 | -0.2326 | #990000 | HPGL0633 |
hpgl0634 | HPGL0634 | pbmc_sh | d108 | 2 | -0.2063 | -0.3506 | #000099 | HPGL0634 |
hpgl0635 | HPGL0635 | pbmc_sh | d108 | 2 | -0.0563 | -0.2715 | #000099 | HPGL0635 |
hpgl0636 | HPGL0636 | pbmc_sh | d108 | 2 | -0.0981 | -0.2975 | #000099 | HPGL0636 |
hpgl0651 | HPGL0651 | pbmc_ch | d110 | 3 | 0.2928 | -0.0198 | #990000 | HPGL0651 |
hpgl0652 | HPGL0652 | pbmc_ch | d110 | 3 | 0.3038 | 0.0114 | #990000 | HPGL0652 |
hpgl0653 | HPGL0653 | pbmc_ch | d110 | 3 | 0.3359 | 0.0550 | #990000 | HPGL0653 |
hpgl0654 | HPGL0654 | pbmc_sh | d110 | 3 | 0.3020 | 0.0295 | #000099 | HPGL0654 |
hpgl0655 | HPGL0655 | pbmc_sh | d110 | 3 | 0.3685 | 0.1066 | #000099 | HPGL0655 |
hpgl0656 | HPGL0656 | pbmc_sh | d110 | 3 | 0.3447 | 0.0639 | #000099 | HPGL0656 |
hpgl0658 | HPGL0658 | pbmc_ch | d107 | 1 | -0.2482 | 0.1907 | #990000 | HPGL0658 |
hpgl0659 | HPGL0659 | pbmc_ch | d107 | 1 | -0.2296 | 0.2646 | #990000 | HPGL0659 |
hpgl0660 | HPGL0660 | pbmc_ch | d107 | 1 | -0.1644 | 0.2975 | #990000 | HPGL0660 |
hpgl0661 | HPGL0661 | pbmc_sh | d107 | 1 | -0.2235 | 0.2505 | #000099 | HPGL0661 |
hpgl0662 | HPGL0662 | pbmc_sh | d107 | 1 | -0.1456 | 0.2711 | #000099 | HPGL0662 |
hpgl0663 | HPGL0663 | pbmc_sh | d107 | 1 | -0.2169 | 0.2965 | #000099 | HPGL0663 |
pca_inf_lqcf$variance
## [1] 50.59 22.59 5.22 4.36 3.62 2.86 1.57 1.43 1.35 1.23 0.95
## [12] 0.86 0.77 0.71 0.66 0.65 0.58
write.csv(pca_inf_lqcf$pcatable, file="csv/infection_nouninfected_no_batch.csv")
## This shows clean sehumantion by patient
## Therefore we will now add patient as a surrogate variable and minimize it
For the second iteration, use the same normalization, but add a combat correction in an attempt to minimize patient’s effect in the variance.
inf_lqcf_cbdonor <- sm(normalize_expt(inf_lqcf, batch="combat_scale"))
## Here the split is semi chronic/self-healing, but not quite
pca_inf_lqcf_cbdonor <- plot_pca(inf_lqcf_cbdonor)
pca_inf_lqcf_cbdonor$plot
## There are 2 sh and 1 chr on the right vs. 2 chr and 1 sh on the left
knitr::kable(pca_inf_lqcf_cbdonor$table)
sampleid | condition | batch | batch_int | PC1 | PC2 | colors | labels | |
---|---|---|---|---|---|---|---|---|
hpgl0631 | HPGL0631 | pbmc_ch | d108 | 2 | -0.1777 | 0.0481 | #990000 | HPGL0631 |
hpgl0632 | HPGL0632 | pbmc_ch | d108 | 2 | -0.2128 | 0.0873 | #990000 | HPGL0632 |
hpgl0633 | HPGL0633 | pbmc_ch | d108 | 2 | 0.3086 | -0.1935 | #990000 | HPGL0633 |
hpgl0634 | HPGL0634 | pbmc_sh | d108 | 2 | -0.1571 | 0.0483 | #000099 | HPGL0634 |
hpgl0635 | HPGL0635 | pbmc_sh | d108 | 2 | 0.1694 | -0.2950 | #000099 | HPGL0635 |
hpgl0636 | HPGL0636 | pbmc_sh | d108 | 2 | -0.0142 | 0.3201 | #000099 | HPGL0636 |
hpgl0651 | HPGL0651 | pbmc_ch | d110 | 3 | -0.3159 | 0.0953 | #990000 | HPGL0651 |
hpgl0652 | HPGL0652 | pbmc_ch | d110 | 3 | -0.3219 | -0.2135 | #990000 | HPGL0652 |
hpgl0653 | HPGL0653 | pbmc_ch | d110 | 3 | 0.1262 | -0.5327 | #990000 | HPGL0653 |
hpgl0654 | HPGL0654 | pbmc_sh | d110 | 3 | -0.0520 | 0.2060 | #000099 | HPGL0654 |
hpgl0655 | HPGL0655 | pbmc_sh | d110 | 3 | 0.4338 | 0.1549 | #000099 | HPGL0655 |
hpgl0656 | HPGL0656 | pbmc_sh | d110 | 3 | 0.2165 | 0.2696 | #000099 | HPGL0656 |
hpgl0658 | HPGL0658 | pbmc_ch | d107 | 1 | -0.3017 | -0.1341 | #990000 | HPGL0658 |
hpgl0659 | HPGL0659 | pbmc_ch | d107 | 1 | -0.1394 | -0.0908 | #990000 | HPGL0659 |
hpgl0660 | HPGL0660 | pbmc_ch | d107 | 1 | 0.1983 | -0.3629 | #990000 | HPGL0660 |
hpgl0661 | HPGL0661 | pbmc_sh | d107 | 1 | -0.2310 | 0.1243 | #000099 | HPGL0661 |
hpgl0662 | HPGL0662 | pbmc_sh | d107 | 1 | 0.2756 | 0.2341 | #000099 | HPGL0662 |
hpgl0663 | HPGL0663 | pbmc_sh | d107 | 1 | 0.1951 | 0.2347 | #000099 | HPGL0663 |
pca_inf_lqcf_cbdonor$variance
## [1] 22.56 16.20 12.69 11.00 5.57 4.77 4.59 4.25 3.21 2.91 2.68
## [12] 2.53 2.32 2.24 2.13 0.22 0.14
write.csv(pca_inf_lqcf_cbdonor$pcatable, file="csv/infection_nouninfected_batch_patient.csv")
testme <- pca_information(inf_human, expt_factors=c("condition","batch","pathogenstrain","state","donor","rnangul"), plot_pcas=TRUE)
## More shallow curves in these plots suggest more genes in this principle component.
## condition
## batch
## pathogenstrain
## state
## donor
## rnangul
## Warning: closing unused connection 11 (<-localhost:11540)
## Warning: closing unused connection 10 (<-localhost:11540)
## Warning: closing unused connection 9 (<-localhost:11540)
## Warning: closing unused connection 8 (<-localhost:11540)
## Warning: closing unused connection 7 (<-localhost:11540)
## Warning: closing unused connection 6 (<-localhost:11540)
testme$pca_cor
## PC1 PC2 PC3 PC4 PC5 PC6
## condition 0.1554 -0.10056 -0.1369 -0.09314 0.065202 -0.63636
## batch 0.1576 0.89625 -0.3762 -0.03606 -0.013765 0.05455
## pathogenstrain -0.1226 0.08304 0.1068 0.38144 -0.076040 0.52608
## state 0.1554 -0.10056 -0.1369 -0.09314 0.065202 -0.63636
## donor 0.1576 0.89625 -0.3762 -0.03606 -0.013765 0.05455
## rnangul 0.2730 -0.07793 -0.1544 0.25435 -0.006972 -0.22779
Here we will set the batch to the humansite strains and condition to a combination of the patient and state state; then perform the pca again.
new_condition <- paste0(inf_human$design$state, '_', inf_human$design$donor)
inf_hsstr <- set_expt_factors(inf_human, batch="pathogenstrain", condition=new_condition)
inf_lqcf_cbstr <- sm(normalize_expt(inf_hsstr, transform="log2", convert="cpm", norm="quant", filter=TRUE, batch="combat_scale"))
pca_inf_lqcf_cbstr <- plot_pca(inf_lqcf_cbstr)
pca_inf_lqcf_cbstr$plot
## Doing that kind of sucked the variance out of the data, but it did cause the samples to split by strain quite strongly
knitr::kable(pca_inf_lqcf_cbstr$table)
sampleid | condition | batch | batch_int | PC1 | PC2 | colors | labels | |
---|---|---|---|---|---|---|---|---|
hpgl0631 | HPGL0631 | chronic_d108 | s5430 | 6 | -0.2389 | -0.3389 | #1B9E77 | HPGL0631 |
hpgl0632 | HPGL0632 | chronic_d108 | s5397 | 5 | -0.2331 | -0.3252 | #1B9E77 | HPGL0632 |
hpgl0633 | HPGL0633 | chronic_d108 | s2504 | 4 | -0.2282 | -0.3041 | #1B9E77 | HPGL0633 |
hpgl0634 | HPGL0634 | self_heal_d108 | s2272 | 3 | 0.1600 | -0.3724 | #D95F02 | HPGL0634 |
hpgl0635 | HPGL0635 | self_heal_d108 | s1022 | 1 | 0.1783 | -0.3036 | #D95F02 | HPGL0635 |
hpgl0636 | HPGL0636 | self_heal_d108 | s2189 | 2 | 0.1795 | -0.3411 | #D95F02 | HPGL0636 |
hpgl0651 | HPGL0651 | chronic_d110 | s5430 | 6 | -0.0390 | 0.1866 | #7570B3 | HPGL0651 |
hpgl0652 | HPGL0652 | chronic_d110 | s5397 | 5 | -0.0410 | 0.1663 | #7570B3 | HPGL0652 |
hpgl0653 | HPGL0653 | chronic_d110 | s2504 | 4 | -0.0520 | 0.1531 | #7570B3 | HPGL0653 |
hpgl0654 | HPGL0654 | self_heal_d110 | s2272 | 3 | 0.3676 | 0.1360 | #E7298A | HPGL0654 |
hpgl0655 | HPGL0655 | self_heal_d110 | s1022 | 1 | 0.3540 | 0.1337 | #E7298A | HPGL0655 |
hpgl0656 | HPGL0656 | self_heal_d110 | s2189 | 2 | 0.3670 | 0.1323 | #E7298A | HPGL0656 |
hpgl0658 | HPGL0658 | chronic_d107 | s5430 | 6 | -0.3328 | 0.1979 | #66A61E | HPGL0658 |
hpgl0659 | HPGL0659 | chronic_d107 | s5397 | 5 | -0.3338 | 0.2137 | #66A61E | HPGL0659 |
hpgl0660 | HPGL0660 | chronic_d107 | s2504 | 4 | -0.3277 | 0.2162 | #66A61E | HPGL0660 |
hpgl0661 | HPGL0661 | self_heal_d107 | s2272 | 3 | 0.0812 | 0.1698 | #E6AB02 | HPGL0661 |
hpgl0662 | HPGL0662 | self_heal_d107 | s1022 | 1 | 0.0747 | 0.1251 | #E6AB02 | HPGL0662 |
hpgl0663 | HPGL0663 | self_heal_d107 | s2189 | 2 | 0.0642 | 0.1545 | #E6AB02 | HPGL0663 |
pca_inf_lqcf_cbstr$variance
## [1] 87.37 8.22 0.88 0.69 0.56 0.53 0.43 0.35 0.27 0.24 0.23
## [12] 0.21 0.01 0.00 0.00 0.00 0.00
write.csv(pca_inf_lqcf_cbstr$pcatable, file="csv/infection_nouninfected_batch_strain.csv")
Now change only the condition to self/chronic and make super-explicit the split in the samples.
inf_lqcf_cbstrv2 <- set_expt_condition(inf_lqcf_cbstr, fact="state")
inf_lqcf_cbstrv2 <- set_expt_colors(inf_lqcf_cbstrv2, colors=c("#880000","#000088"))
pca_inf_lqcf_cbstrv2 <- plot_pca(inf_lqcf_cbstrv2)
pca_inf_lqcf_cbstrv2$plot
## Thus 3 runs of chronic on the right and self-state on the left
knitr::kable(pca_inf_lqcf_cbstrv2$table)
sampleid | condition | batch | batch_int | PC1 | PC2 | colors | labels | |
---|---|---|---|---|---|---|---|---|
hpgl0631 | HPGL0631 | chronic | s5430 | 6 | -0.2389 | -0.3389 | #880000 | HPGL0631 |
hpgl0632 | HPGL0632 | chronic | s5397 | 5 | -0.2331 | -0.3252 | #880000 | HPGL0632 |
hpgl0633 | HPGL0633 | chronic | s2504 | 4 | -0.2282 | -0.3041 | #880000 | HPGL0633 |
hpgl0634 | HPGL0634 | self_heal | s2272 | 3 | 0.1600 | -0.3724 | #000088 | HPGL0634 |
hpgl0635 | HPGL0635 | self_heal | s1022 | 1 | 0.1783 | -0.3036 | #000088 | HPGL0635 |
hpgl0636 | HPGL0636 | self_heal | s2189 | 2 | 0.1795 | -0.3411 | #000088 | HPGL0636 |
hpgl0651 | HPGL0651 | chronic | s5430 | 6 | -0.0390 | 0.1866 | #880000 | HPGL0651 |
hpgl0652 | HPGL0652 | chronic | s5397 | 5 | -0.0410 | 0.1663 | #880000 | HPGL0652 |
hpgl0653 | HPGL0653 | chronic | s2504 | 4 | -0.0520 | 0.1531 | #880000 | HPGL0653 |
hpgl0654 | HPGL0654 | self_heal | s2272 | 3 | 0.3676 | 0.1360 | #000088 | HPGL0654 |
hpgl0655 | HPGL0655 | self_heal | s1022 | 1 | 0.3540 | 0.1337 | #000088 | HPGL0655 |
hpgl0656 | HPGL0656 | self_heal | s2189 | 2 | 0.3670 | 0.1323 | #000088 | HPGL0656 |
hpgl0658 | HPGL0658 | chronic | s5430 | 6 | -0.3328 | 0.1979 | #880000 | HPGL0658 |
hpgl0659 | HPGL0659 | chronic | s5397 | 5 | -0.3338 | 0.2137 | #880000 | HPGL0659 |
hpgl0660 | HPGL0660 | chronic | s2504 | 4 | -0.3277 | 0.2162 | #880000 | HPGL0660 |
hpgl0661 | HPGL0661 | self_heal | s2272 | 3 | 0.0812 | 0.1698 | #000088 | HPGL0661 |
hpgl0662 | HPGL0662 | self_heal | s1022 | 1 | 0.0747 | 0.1251 | #000088 | HPGL0662 |
hpgl0663 | HPGL0663 | self_heal | s2189 | 2 | 0.0642 | 0.1545 | #000088 | HPGL0663 |
tt <- normalize_expt(inf_human, filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## svaseq(cbcb(data))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 37484 low-count genes (13557 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values'; I think this
## means that the data has too many 0's and needs to have a better low-count filter applied.
## batch_counts: Before batch correction, 357 entries are >= 0.
## batch_counts: Using sva::svaseq for batch correction.
## Note to self: If you feed svaseq a data frame you will get an error like:
## data %*% (Id - mod %*% blah blah requires numeric/complex arguments.
## The number of elements which are < 0 after batch correction is: 269
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
plot_pca(tt)$plot
inf_lqcf_svstr <- sm(normalize_expt(inf_lqcf_svstr, transform="log2", convert="cpm", norm="quant"))
## Error in normalize_expt(inf_lqcf_svstr, transform = "log2", convert = "cpm", : object 'inf_lqcf_svstr' not found
plot_pca(inf_lqcf_svstr)$plot
## Error in plot_pca(inf_lqcf_svstr): object 'inf_lqcf_svstr' not found
For the next few blocks we will just repeat what we did but include the uninfected samples. Ideally doing so will have ~0 effect on the positions of the sample types.
In this first example, we see why the uninfected samples were initially removed from the analyses I think.
## Start with the non-uninfected, no batch correction
uninf_lqcf <- sm(normalize_expt(uninf_human, filter=TRUE, convert="cpm", transform="log2", norm="quant"))
pca_uninf_lqcf <- plot_pca(uninf_lqcf)
pca_uninf_lqcf$plot
## In this case, the uninfected samples cause the p107/p108 samples to smoosh together
knitr::kable(pca_uninf_lqcf$table)
sampleid | condition | batch | batch_int | PC1 | PC2 | colors | labels | |
---|---|---|---|---|---|---|---|---|
hpgl0630 | HPGL0630 | pbmc_nil | d108 | 2 | -0.1192 | -0.4892 | #009900 | HPGL0630 |
hpgl0631 | HPGL0631 | pbmc_ch | d108 | 2 | 0.1684 | 0.0236 | #990000 | HPGL0631 |
hpgl0632 | HPGL0632 | pbmc_ch | d108 | 2 | 0.1455 | 0.0326 | #990000 | HPGL0632 |
hpgl0633 | HPGL0633 | pbmc_ch | d108 | 2 | 0.0704 | 0.0260 | #990000 | HPGL0633 |
hpgl0634 | HPGL0634 | pbmc_sh | d108 | 2 | 0.1907 | -0.0507 | #000099 | HPGL0634 |
hpgl0635 | HPGL0635 | pbmc_sh | d108 | 2 | 0.0678 | 0.0404 | #000099 | HPGL0635 |
hpgl0636 | HPGL0636 | pbmc_sh | d108 | 2 | 0.1270 | 0.0982 | #000099 | HPGL0636 |
hpgl0650 | HPGL0650 | pbmc_nil | d110 | 3 | -0.2896 | -0.4798 | #009900 | HPGL0650 |
hpgl0651 | HPGL0651 | pbmc_ch | d110 | 3 | -0.2280 | 0.2430 | #990000 | HPGL0651 |
hpgl0652 | HPGL0652 | pbmc_ch | d110 | 3 | -0.2350 | 0.2643 | #990000 | HPGL0652 |
hpgl0653 | HPGL0653 | pbmc_ch | d110 | 3 | -0.3191 | 0.0913 | #990000 | HPGL0653 |
hpgl0654 | HPGL0654 | pbmc_sh | d110 | 3 | -0.2469 | 0.2080 | #000099 | HPGL0654 |
hpgl0655 | HPGL0655 | pbmc_sh | d110 | 3 | -0.3272 | 0.1504 | #000099 | HPGL0655 |
hpgl0656 | HPGL0656 | pbmc_sh | d110 | 3 | -0.2999 | 0.1656 | #000099 | HPGL0656 |
hpgl0657 | HPGL0657 | pbmc_nil | d107 | 1 | -0.0682 | -0.5184 | #009900 | HPGL0657 |
hpgl0658 | HPGL0658 | pbmc_ch | d107 | 1 | 0.2617 | 0.0056 | #990000 | HPGL0658 |
hpgl0659 | HPGL0659 | pbmc_ch | d107 | 1 | 0.2621 | 0.0748 | #990000 | HPGL0659 |
hpgl0660 | HPGL0660 | pbmc_ch | d107 | 1 | 0.1763 | -0.0055 | #990000 | HPGL0660 |
hpgl0661 | HPGL0661 | pbmc_sh | d107 | 1 | 0.2601 | 0.0889 | #000099 | HPGL0661 |
hpgl0662 | HPGL0662 | pbmc_sh | d107 | 1 | 0.1732 | 0.0381 | #000099 | HPGL0662 |
hpgl0663 | HPGL0663 | pbmc_sh | d107 | 1 | 0.2298 | -0.0072 | #000099 | HPGL0663 |
pca_uninf_lqcf$variance
## [1] 33.49 29.97 16.66 4.29 2.63 2.42 1.73 1.32 1.07 0.86 0.81
## [12] 0.70 0.64 0.59 0.58 0.50 0.48 0.45 0.43 0.39
write.csv(pca_uninf_lqcf$pcatable, file="csv/infection_withuninfected_with_batch.csv")
For the second iteration, use the same normalization, but add a combat correction in an attempt to minimize patient’s effect in the variance.
uninf_lqcf_cbdonor <- sm(normalize_expt(uninf_lqcf, batch="combat_scale"))
## Here the split is semi chronic/self-state, but not quite
pca_uninf_lqcf_cbdonor <- plot_pca(uninf_lqcf_cbdonor)
pca_uninf_lqcf_cbdonor$plot
## Now we have weak sehumantion between strains, I thought for a moment it might be few snps vs. many but that is not true.
## There are 2 sh and 1 chr on the right vs. 2 chr and 1 sh on the left
knitr::kable(pca_uninf_lqcf_cbdonor$table)
sampleid | condition | batch | batch_int | PC1 | PC2 | colors | labels | |
---|---|---|---|---|---|---|---|---|
hpgl0630 | HPGL0630 | pbmc_nil | d108 | 2 | -0.5259 | 0.0585 | #009900 | HPGL0630 |
hpgl0631 | HPGL0631 | pbmc_ch | d108 | 2 | 0.0965 | -0.1762 | #990000 | HPGL0631 |
hpgl0632 | HPGL0632 | pbmc_ch | d108 | 2 | 0.0997 | -0.1757 | #990000 | HPGL0632 |
hpgl0633 | HPGL0633 | pbmc_ch | d108 | 2 | 0.0614 | 0.2327 | #990000 | HPGL0633 |
hpgl0634 | HPGL0634 | pbmc_sh | d108 | 2 | 0.0313 | -0.3145 | #000099 | HPGL0634 |
hpgl0635 | HPGL0635 | pbmc_sh | d108 | 2 | 0.0775 | 0.1209 | #000099 | HPGL0635 |
hpgl0636 | HPGL0636 | pbmc_sh | d108 | 2 | 0.1569 | 0.0785 | #000099 | HPGL0636 |
hpgl0650 | HPGL0650 | pbmc_nil | d110 | 3 | -0.5018 | -0.4780 | #009900 | HPGL0650 |
hpgl0651 | HPGL0651 | pbmc_ch | d110 | 3 | 0.1536 | -0.0342 | #990000 | HPGL0651 |
hpgl0652 | HPGL0652 | pbmc_ch | d110 | 3 | 0.1715 | -0.0739 | #990000 | HPGL0652 |
hpgl0653 | HPGL0653 | pbmc_ch | d110 | 3 | -0.0155 | 0.0683 | #990000 | HPGL0653 |
hpgl0654 | HPGL0654 | pbmc_sh | d110 | 3 | 0.1141 | 0.0845 | #000099 | HPGL0654 |
hpgl0655 | HPGL0655 | pbmc_sh | d110 | 3 | 0.0261 | 0.3803 | #000099 | HPGL0655 |
hpgl0656 | HPGL0656 | pbmc_sh | d110 | 3 | 0.0522 | 0.2617 | #000099 | HPGL0656 |
hpgl0657 | HPGL0657 | pbmc_nil | d107 | 1 | -0.5365 | 0.3032 | #009900 | HPGL0657 |
hpgl0658 | HPGL0658 | pbmc_ch | d107 | 1 | 0.0806 | -0.2934 | #990000 | HPGL0658 |
hpgl0659 | HPGL0659 | pbmc_ch | d107 | 1 | 0.1409 | -0.1537 | #990000 | HPGL0659 |
hpgl0660 | HPGL0660 | pbmc_ch | d107 | 1 | 0.0357 | 0.0370 | #990000 | HPGL0660 |
hpgl0661 | HPGL0661 | pbmc_sh | d107 | 1 | 0.1525 | -0.1874 | #000099 | HPGL0661 |
hpgl0662 | HPGL0662 | pbmc_sh | d107 | 1 | 0.0788 | 0.2502 | #000099 | HPGL0662 |
hpgl0663 | HPGL0663 | pbmc_sh | d107 | 1 | 0.0503 | 0.0113 | #000099 | HPGL0663 |
pca_uninf_lqcf_cbdonor$variance
## [1] 57.64 11.13 5.88 4.82 3.47 2.63 1.79 1.66 1.52 1.27 1.25
## [12] 1.18 1.12 0.99 0.90 0.86 0.82 0.70 0.20 0.17
write.csv(pca_uninf_lqcf_cbdonor$pcatable, file="csv/infection_withuninfected_batch_patient.csv")
Including the uninfected samples and changing the condition should not much matter
## Here we will set the batch to the humansite strains and condition to a combination of the patient and state state; then perform the pca.
new_condition <- paste0(uninf_human$design$state, '_', uninf_human$design$donor)
uninf_humanv2 <- set_expt_factors(uninf_human, condition=new_condition, batch="pathogenstrain")
uninfv2_lqcf_cbstr <- sm(normalize_expt(uninf_humanv2, transform="log2", convert="cpm", norm="quant", filter=TRUE, batch="combat_scale"))
pca_uninfv2_lqcf_cbstr <- plot_pca(uninfv2_lqcf_cbstr)
pca_uninfv2_lqcf_cbstr$plot
## This is a surprise to me, I would have expected the uninfected to still push the other samples off to a side or somesuch.
knitr::kable(pca_uninfv2_lqcf_cbstr$table)
sampleid | condition | batch | batch_int | PC1 | PC2 | colors | labels | |
---|---|---|---|---|---|---|---|---|
hpgl0630 | HPGL0630 | uninfected_d108 | none | 1 | -0.3841 | -0.2182 | #1B9E77 | HPGL0630 |
hpgl0631 | HPGL0631 | chronic_d108 | s5430 | 7 | 0.2235 | -0.3225 | #C16610 | HPGL0631 |
hpgl0632 | HPGL0632 | chronic_d108 | s5397 | 6 | 0.2205 | -0.3165 | #C16610 | HPGL0632 |
hpgl0633 | HPGL0633 | chronic_d108 | s2504 | 5 | 0.2179 | -0.3007 | #C16610 | HPGL0633 |
hpgl0634 | HPGL0634 | self_heal_d108 | s2272 | 4 | -0.0450 | -0.3487 | #8D6B86 | HPGL0634 |
hpgl0635 | HPGL0635 | self_heal_d108 | s1022 | 2 | -0.0614 | -0.2824 | #8D6B86 | HPGL0635 |
hpgl0636 | HPGL0636 | self_heal_d108 | s2189 | 3 | -0.0613 | -0.3120 | #8D6B86 | HPGL0636 |
hpgl0650 | HPGL0650 | uninfected_d110 | none | 1 | -0.4535 | -0.0394 | #BC4399 | HPGL0650 |
hpgl0651 | HPGL0651 | chronic_d110 | s5430 | 7 | 0.0834 | 0.1983 | #A66753 | HPGL0651 |
hpgl0652 | HPGL0652 | chronic_d110 | s5397 | 6 | 0.0847 | 0.1813 | #A66753 | HPGL0652 |
hpgl0653 | HPGL0653 | chronic_d110 | s2504 | 5 | 0.0894 | 0.1639 | #A66753 | HPGL0653 |
hpgl0654 | HPGL0654 | self_heal_d110 | s2272 | 4 | -0.1892 | 0.1684 | #96A713 | HPGL0654 |
hpgl0655 | HPGL0655 | self_heal_d110 | s1022 | 2 | -0.1827 | 0.1791 | #96A713 | HPGL0655 |
hpgl0656 | HPGL0656 | self_heal_d110 | s2189 | 3 | -0.1912 | 0.1702 | #96A713 | HPGL0656 |
hpgl0657 | HPGL0657 | uninfected_d107 | none | 1 | -0.3103 | 0.1750 | #D59D08 | HPGL0657 |
hpgl0658 | HPGL0658 | chronic_d107 | s5430 | 7 | 0.2977 | 0.1540 | #9D7426 | HPGL0658 |
hpgl0659 | HPGL0659 | chronic_d107 | s5397 | 6 | 0.2994 | 0.1704 | #9D7426 | HPGL0659 |
hpgl0660 | HPGL0660 | chronic_d107 | s2504 | 5 | 0.2953 | 0.1784 | #9D7426 | HPGL0660 |
hpgl0661 | HPGL0661 | self_heal_d107 | s2272 | 4 | 0.0174 | 0.1420 | #666666 | HPGL0661 |
hpgl0662 | HPGL0662 | self_heal_d107 | s1022 | 2 | 0.0204 | 0.1237 | #666666 | HPGL0662 |
hpgl0663 | HPGL0663 | self_heal_d107 | s2189 | 3 | 0.0289 | 0.1357 | #666666 | HPGL0663 |
pca_uninfv2_lqcf_cbstr$variance
## [1] 90.94 5.44 0.79 0.58 0.35 0.34 0.29 0.27 0.20 0.18 0.16
## [12] 0.15 0.14 0.13 0.02 0.01 0.01 0.00 0.00 0.00
write.csv(pca_uninfv2_lqcf_cbstr$pcatable, file="csv/infection_withuninfected_batch_strain.csv")
uninfv2_lqcf_cbstr <- set_expt_condition(uninfv2_lqcf_cbstr, fact="state")
uninfv2_lqcf_cbstr <- set_expt_colors(uninfv2_lqcf_cbstr, colors=c("#008800","#880000","#000088"))
pca_uninfv2_lqcf_cbstr <- plot_pca(uninfv2_lqcf_cbstr)
pca_uninfv2_lqcf_cbstr$plot
knitr::kable(pca_uninfv2_lqcf_cbstr$table)
sampleid | condition | batch | batch_int | PC1 | PC2 | colors | labels | |
---|---|---|---|---|---|---|---|---|
hpgl0630 | HPGL0630 | uninfected | none | 1 | -0.3841 | -0.2182 | #000088 | HPGL0630 |
hpgl0631 | HPGL0631 | chronic | s5430 | 7 | 0.2235 | -0.3225 | #008800 | HPGL0631 |
hpgl0632 | HPGL0632 | chronic | s5397 | 6 | 0.2205 | -0.3165 | #008800 | HPGL0632 |
hpgl0633 | HPGL0633 | chronic | s2504 | 5 | 0.2179 | -0.3007 | #008800 | HPGL0633 |
hpgl0634 | HPGL0634 | self_heal | s2272 | 4 | -0.0450 | -0.3487 | #880000 | HPGL0634 |
hpgl0635 | HPGL0635 | self_heal | s1022 | 2 | -0.0614 | -0.2824 | #880000 | HPGL0635 |
hpgl0636 | HPGL0636 | self_heal | s2189 | 3 | -0.0613 | -0.3120 | #880000 | HPGL0636 |
hpgl0650 | HPGL0650 | uninfected | none | 1 | -0.4535 | -0.0394 | #000088 | HPGL0650 |
hpgl0651 | HPGL0651 | chronic | s5430 | 7 | 0.0834 | 0.1983 | #008800 | HPGL0651 |
hpgl0652 | HPGL0652 | chronic | s5397 | 6 | 0.0847 | 0.1813 | #008800 | HPGL0652 |
hpgl0653 | HPGL0653 | chronic | s2504 | 5 | 0.0894 | 0.1639 | #008800 | HPGL0653 |
hpgl0654 | HPGL0654 | self_heal | s2272 | 4 | -0.1892 | 0.1684 | #880000 | HPGL0654 |
hpgl0655 | HPGL0655 | self_heal | s1022 | 2 | -0.1827 | 0.1791 | #880000 | HPGL0655 |
hpgl0656 | HPGL0656 | self_heal | s2189 | 3 | -0.1912 | 0.1702 | #880000 | HPGL0656 |
hpgl0657 | HPGL0657 | uninfected | none | 1 | -0.3103 | 0.1750 | #000088 | HPGL0657 |
hpgl0658 | HPGL0658 | chronic | s5430 | 7 | 0.2977 | 0.1540 | #008800 | HPGL0658 |
hpgl0659 | HPGL0659 | chronic | s5397 | 6 | 0.2994 | 0.1704 | #008800 | HPGL0659 |
hpgl0660 | HPGL0660 | chronic | s2504 | 5 | 0.2953 | 0.1784 | #008800 | HPGL0660 |
hpgl0661 | HPGL0661 | self_heal | s2272 | 4 | 0.0174 | 0.1420 | #880000 | HPGL0661 |
hpgl0662 | HPGL0662 | self_heal | s1022 | 2 | 0.0204 | 0.1237 | #880000 | HPGL0662 |
hpgl0663 | HPGL0663 | self_heal | s2189 | 3 | 0.0289 | 0.1357 | #880000 | HPGL0663 |
As per a conversation with Maria Adelaida on skype, lets remove all samples except those for one patient, then see if some aspect of the data jumps out (strain:strain variation, for example)
single_patient <- expt_subset(inf_human, subset="donor=='d107'")
single_patient <- set_expt_batch(single_patient, fact="state")
single_norm <- sm(normalize_expt(single_patient, transform="log2", norm="quant", convert="cpm", filter=TRUE))
single_norm_pca <- plot_pca(single_norm)
single_norm_pca$plot
## hmmm so the answer is, no -- I think.
knitr::kable(single_norm_pca$table)
sampleid | condition | batch | batch_int | PC1 | PC2 | colors | labels | |
---|---|---|---|---|---|---|---|---|
hpgl0658 | HPGL0658 | pbmc_ch | chronic | 1 | -0.0880 | 0.8150 | #990000 | HPGL0658 |
hpgl0659 | HPGL0659 | pbmc_ch | chronic | 1 | 0.3821 | -0.0160 | #990000 | HPGL0659 |
hpgl0660 | HPGL0660 | pbmc_ch | chronic | 1 | -0.2252 | -0.1863 | #990000 | HPGL0660 |
hpgl0661 | HPGL0661 | pbmc_sh | self_heal | 2 | 0.5411 | 0.0270 | #000099 | HPGL0661 |
hpgl0662 | HPGL0662 | pbmc_sh | self_heal | 2 | -0.7029 | -0.1015 | #000099 | HPGL0662 |
hpgl0663 | HPGL0663 | pbmc_sh | self_heal | 2 | 0.0929 | -0.5383 | #000099 | HPGL0663 |
In our previous discussion, Hector suggested that sample ‘HPGL0635’ is sufficiently dis-similar to its cohort samples that it might actually be a member of strain ‘2504’ rather than ‘1022’. Let us look and see what happens if that is changed.
I am going to leave out the uninfected samples to avoid the confusion they generate.
lqcf_noswitch <- sm(normalize_expt(inf_human, transform="log2", convert="cpm", norm="quant", filter=TRUE))
plot_pca(lqcf_noswitch)$plot
## This is just to note that the original color for sample 635 was orange to match strain '1022'
##switch_one <- set_expt_condition(no_uninfected, ids=c("sHPGL0635"), fact="ch2504")
switch_one <- set_expt_condition(inf_human, ids=c("HPGL0635"), fact="pbmc_ch")
switcher <- list("HPGL0635" = "pink")
names(switcher) <- c("HPGL0635")
switch_one <- set_expt_colors(expt=switch_one, colors=switcher)
## Error in names(chosen_colors) <- sample_ids: 'names' attribute [18] must be the same length as the vector [0]
lqcf_switch <- sm(normalize_expt(switch_one, transform="log2", convert="cpm", norm="quant", filter=TRUE, batch="combat_scale"))
plot_pca(lqcf_switch)$plot
## Note that now it is pink, matching 'ch2504'
I may be biased, but I think this suggests that the samples were not switched.
combined_condition <- paste0(inf_human$design$state, '_', inf_human$design$donor)
##with_uninfected_combined <- set_expt_factors(with_uninfected, batch="pathogenstrain", condition=combined_condition)
inf_combined <- set_expt_factors(inf_human, batch="donor", condition="state")
head(Biobase::exprs(inf_combined$expressionset))
## HPGL0631 HPGL0632 HPGL0633 HPGL0634 HPGL0635 HPGL0636
## ENSG00000000003 5 12 9 13 2 6
## ENSG00000000005 0 0 0 0 0 0
## ENSG00000000419 198 213 233 228 186 183
## ENSG00000000457 258 337 309 342 245 287
## ENSG00000000460 138 123 108 119 96 84
## ENSG00000000938 5277 5829 5304 4648 4192 4467
## HPGL0651 HPGL0652 HPGL0653 HPGL0654 HPGL0655 HPGL0656
## ENSG00000000003 2 5 6 5 2 2
## ENSG00000000005 0 0 0 0 0 0
## ENSG00000000419 264 254 253 202 156 236
## ENSG00000000457 300 299 339 312 216 280
## ENSG00000000460 119 105 110 72 57 69
## ENSG00000000938 10135 8159 5546 7147 5875 6998
## HPGL0658 HPGL0659 HPGL0660 HPGL0661 HPGL0662 HPGL0663
## ENSG00000000003 4 9 16 7 4 14
## ENSG00000000005 0 0 0 0 0 0
## ENSG00000000419 143 189 174 280 161 446
## ENSG00000000457 241 332 316 483 271 786
## ENSG00000000460 87 108 73 108 82 194
## ENSG00000000938 4043 4236 3567 5929 4310 8393
head(Biobase::exprs(normalize_expt(inf_human, convert="cpm", filter=TRUE)$expressionset))
## This function will replace the expt$expressionset slot with:
## cpm(cbcb(data))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 37484 low-count genes (13557 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
## HPGL0631 HPGL0632 HPGL0633 HPGL0634 HPGL0635 HPGL0636
## ENSG00000000419 19.929 19.091 16.926 19.812 17.835 17.854
## ENSG00000000457 25.969 30.204 22.447 29.718 23.492 28.000
## ENSG00000000460 13.890 11.024 7.846 10.340 9.205 8.195
## ENSG00000000938 531.151 522.436 385.306 403.883 401.960 435.812
## ENSG00000000971 4.429 4.929 4.359 4.605 4.698 3.415
## ENSG00000001036 38.852 37.195 38.792 35.627 40.369 44.001
## HPGL0651 HPGL0652 HPGL0653 HPGL0654 HPGL0655 HPGL0656
## ENSG00000000419 16.113 17.212 15.027 13.011 12.678 15.247
## ENSG00000000457 18.311 20.261 20.135 20.096 17.554 18.090
## ENSG00000000460 7.263 7.115 6.534 4.638 4.632 4.458
## ENSG00000000938 618.598 552.883 329.412 460.340 477.464 452.119
## ENSG00000000971 2.564 1.762 3.029 2.834 1.544 2.972
## ENSG00000001036 46.021 39.303 35.935 37.358 35.596 35.921
## HPGL0658 HPGL0659 HPGL0660 HPGL0661 HPGL0662 HPGL0663
## ENSG00000000419 16.314 14.825 13.676 16.461 14.623 15.871
## ENSG00000000457 27.494 26.042 24.837 28.395 24.614 27.970
## ENSG00000000460 9.925 8.472 5.738 6.349 7.448 6.903
## ENSG00000000938 461.239 332.276 280.359 348.559 391.470 298.665
## ENSG00000000971 4.221 7.844 5.816 6.584 3.724 3.808
## ENSG00000001036 31.715 23.611 24.758 27.043 30.518 20.995
combined_pca1 <- normalize_expt(inf_combined, filter=TRUE, batch="pca", convert="cpm")
## This function will replace the expt$expressionset slot with:
## pca(cpm(cbcb(data)))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 37484 low-count genes (13557 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: not transforming the data.
## Step 5: doing batch correction with pca.
## Note to self: If you get an error like 'x contains missing values'; I think this
## means that the data has too many 0's and needs to have a better low-count filter applied.
## batch_counts: Before batch correction, 5111 entries 0<x<1.
## batch_counts: Before batch correction, 357 entries are >= 0.
## Passing the batch method to get_model_adjust().
## It understands a few additional batch methods.
## Not able to discern the state of the data.
## Going to use a simplistic metric to guess if it is log scale.
## The be method chose 2 surrogate variable(s).
## Attempting pca surrogate estimation.
## The number of elements which are < 0 after batch correction is: 169
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
combined_pca1 <- set_expt_colors(combined_pca1, colors=c("#880000", "#000088"))
plot_pca(normalize_expt(combined_pca1, filter=TRUE, transform="log2", convert="cpm", norm="quant"))$plot
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 344 low-count genes (13213 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 234 values less than 0.
## Warning in transform_counts(count_table, ...): NaNs produced
## Removing 146 NaN containing rows (13067 remaining).
## Step 5: not doing batch correction.
combined_pca2 <- set_expt_factors(combined_pca1, batch="pathogenstrain", condition="state")
combined_pca2 <- set_expt_colors(combined_pca2, colors=c("#880000", "#000088"))
combined_pca3 <- normalize_expt(combined_pca2, filter=TRUE, batch="pca")
## This function will replace the expt$expressionset slot with:
## pca(cbcb(data))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 344 low-count genes (13213 remaining).
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: not transforming the data.
## Step 5: doing batch correction with pca.
## Note to self: If you get an error like 'x contains missing values'; I think this
## means that the data has too many 0's and needs to have a better low-count filter applied.
## batch_counts: Before batch correction, 946 entries 0<x<1.
## batch_counts: Before batch correction, 132 entries are >= 0.
## Passing the batch method to get_model_adjust().
## It understands a few additional batch methods.
## Not able to discern the state of the data.
## Going to use a simplistic metric to guess if it is log scale.
## The be method chose 5 surrogate variable(s).
## Attempting pca surrogate estimation.
## The number of elements which are < 0 after batch correction is: 380
## The variable low_to_zero sets whether to change <0 values to 0 and is: FALSE
l2cq_combined_pca3 <- normalize_expt(combined_pca3, filter=TRUE, transform="log2", convert="cpm", norm="quant")
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 968 low-count genes (12161 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 18 values less than 0.
## Warning in transform_counts(count_table, ...): NaNs produced
## Removing 16 NaN containing rows (12145 remaining).
## Step 5: not doing batch correction.
plot_pca(l2cq_combined_pca3)$plot
donor_strain_varpart <- varpart(expt=inf_human, predictor=NULL, factors=c("condition","pathogenstrain","donor"))
## Attempting mixed linear model with: ~ (1|condition) + (1|pathogenstrain) + (1|donor)
## Fitting the expressionset to the model, this is slow.
## Error in varpart(expt = inf_human, predictor = NULL, factors = c("condition", : An error like 'vtv downdated' may be because there are too many 0s, try and filter the data and rerun.
donor_strain_varpart$percent_plot
## Error in eval(expr, envir, enclos): object 'donor_strain_varpart' not found
png(file="images/varpart_donor_strain.png")
donor_strain_varpart$partition_plot
## Error in eval(expr, envir, enclos): object 'donor_strain_varpart' not found
dev.off()
## png
## 2
png(file="images/varpart_donor_strain_pct.png")
replot_varpart_percent(donor_strain_varpart, n=40)
## Error in replot_varpart_percent(donor_strain_varpart, n = 40): object 'donor_strain_varpart' not found
dev.off()
## png
## 2
test_data <- normalize_expt(inf_combined, convert="cpm", norm="quant", filter=TRUE)
## This function will replace the expt$expressionset slot with:
## cpm(quant(cbcb(data)))
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data in its current base format, keep in mind that
## some metrics are easier to see when the data is log2 transformed, but
## EdgeR/DESeq do not accept transformed data.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 37484 low-count genes (13557 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: not transforming the data.
## Step 5: not doing batch correction.
head(Biobase::exprs(test_data$expressionset))
## HPGL0631 HPGL0632 HPGL0633 HPGL0634 HPGL0635 HPGL0636
## ENSG00000000419 18.366 17.777 16.072 17.753 16.24 17.35
## ENSG00000000457 24.279 28.343 21.537 27.325 21.90 27.22
## ENSG00000000460 12.787 10.246 7.519 9.192 8.36 8.09
## ENSG00000000938 549.144 534.558 392.130 419.440 416.15 454.66
## ENSG00000000971 4.077 4.582 4.146 4.152 4.25 3.51
## ENSG00000001036 36.656 35.197 37.625 33.154 37.72 42.72
## HPGL0651 HPGL0652 HPGL0653 HPGL0654 HPGL0655 HPGL0656
## ENSG00000000419 16.61 18.394 15.344 13.969 13.724 16.378
## ENSG00000000457 18.78 21.549 20.507 21.272 18.768 19.321
## ENSG00000000460 7.66 7.897 6.804 5.010 4.994 4.950
## ENSG00000000938 612.00 531.789 328.156 450.426 463.897 437.878
## ENSG00000000971 2.74 1.982 3.132 3.112 1.745 3.293
## ENSG00000001036 47.01 41.198 36.073 38.787 36.998 37.575
## HPGL0658 HPGL0659 HPGL0660 HPGL0661 HPGL0662 HPGL0663
## ENSG00000000419 15.15 15.033 13.547 16.495 14.910 16.141
## ENSG00000000457 26.25 26.628 24.927 28.944 25.037 28.389
## ENSG00000000460 9.10 8.557 5.390 6.277 7.491 6.739
## ENSG00000000938 466.66 334.237 281.135 348.528 396.802 295.643
## ENSG00000000971 3.80 7.893 5.461 6.538 3.784 3.707
## ENSG00000001036 30.54 24.129 24.866 27.319 30.727 21.211
query_model_string <- "~ condition:pathogenstrain + donor"
query_design <- inf_combined[["design"]]
query_conditions <- as.factor(query_design[["condition"]])
##query_batches <- as.factor(query_design[["anotherbatch"]])
query_batches <- as.factor(x=c("a","a","a","a","a","a","b","b","b","b","b","b","c","c","c","c","c","c"))
query_strains <- as.factor(query_design[["pathogenstrain"]])
query_donors <- as.factor(query_design[["donor"]])
data_mtrx <- as.data.frame(Biobase::exprs(test_data[["expressionset"]]))
query_model <- model.matrix(~ 0 + query_conditions + query_donors + query_strains, data=query_design)
combined_voom <- limma::voom(counts=data_mtrx, design=query_model, normalize.method="quantile")
## Coefficients not estimable: query_strainss5430
## Warning: Partial NA coefficients for 13557 probe(s)
combined_fit <- limma::lmFit(combined_voom, query_model, robust=TRUE)
## Coefficients not estimable: query_strainss5430
## Warning: Partial NA coefficients for 13557 probe(s)
combined_contrast <- limma::makeContrasts(
chsh=query_conditionschronic-query_conditionsself_heal,
levels=query_model)
combined_cfit <- limma::contrasts.fit(combined_fit, combined_contrast)
combined_bayes <- limma::eBayes(combined_cfit, robust=TRUE)
combined_table <- limma::topTable(combined_bayes, number=nrow(combined_bayes), resort.by="logFC")
hist(combined_table$adj.P.Val)
min(combined_table$adj.P.Val)
## [1] 1.331e-08
test_ma <- plot_ma_de(table=combined_table, expr_col="AveExpr", fc_col="logFC", p_col="adj.P.Val", logfc_cutoff=0.6)
test_ma$plot
head(combined_table)
## logFC AveExpr t P.Value adj.P.Val B
## ENSG00000125144 2.700 2.2519 13.756 2.366e-11 1.604e-07 14.1458
## ENSG00000122641 2.257 4.8639 11.096 1.584e-08 5.368e-05 9.8081
## ENSG00000166923 2.247 1.3688 5.079 2.397e-04 2.405e-02 0.7976
## ENSG00000137673 2.145 4.2251 6.272 5.686e-05 9.491e-03 2.2119
## ENSG00000205362 2.076 -0.3415 10.307 3.103e-09 1.402e-05 8.8525
## ENSG00000113657 1.934 5.6923 5.621 1.689e-04 2.009e-02 1.1213
head(Biobase::exprs(combined_pca3$expressionset))
## HPGL0631 HPGL0632 HPGL0633 HPGL0634 HPGL0635 HPGL0636
## ENSG00000000419 15.476 15.134 15.328 15.296 15.428 14.787
## ENSG00000000457 21.048 25.641 21.908 23.139 22.666 23.926
## ENSG00000000460 8.770 7.179 6.555 5.486 7.203 5.462
## ENSG00000000938 427.402 479.363 440.816 393.051 430.250 399.559
## ENSG00000000971 3.667 3.458 3.044 2.851 3.351 1.915
## ENSG00000001036 32.508 32.005 34.457 32.956 33.997 36.015
## HPGL0651 HPGL0652 HPGL0653 HPGL0654 HPGL0655 HPGL0656
## ENSG00000000419 15.259 16.612 15.652 13.219 14.401 16.305
## ENSG00000000457 21.962 23.959 23.525 23.709 23.111 22.404
## ENSG00000000460 6.890 7.881 7.854 6.021 6.023 6.078
## ENSG00000000938 440.917 439.247 424.036 382.273 444.110 410.876
## ENSG00000000971 3.602 1.915 2.629 3.222 3.111 3.835
## ENSG00000001036 35.603 30.906 33.638 33.868 33.832 33.959
## HPGL0658 HPGL0659 HPGL0660 HPGL0661 HPGL0662 HPGL0663
## ENSG00000000419 14.756 14.490 14.445 15.694 15.840 16.510
## ENSG00000000457 23.072 21.134 23.336 22.875 24.135 24.269
## ENSG00000000460 6.619 8.272 6.119 5.748 6.822 7.052
## ENSG00000000938 441.141 454.232 418.284 430.132 403.524 410.030
## ENSG00000000971 2.701 4.467 3.460 3.414 3.034 1.495
## ENSG00000001036 34.910 31.347 33.779 35.345 32.069 34.686
testing <- all_pairwise(combined_pca3, model_batch=FALSE, force=TRUE)
## Finished running DE analyses, collecting outputs.
## Comparing analyses 1/1: self_heal_vs_chronic
test_table <- combine_de_tables(testing, excel=FALSE)
## Writing a legend of columns.
## Working on table 1/1: self_heal_vs_chronic
## This can do comparisons among the following columns in the pairwise result:
## chronic, self_heal, self_heal_vs_chronic
## Actually comparing self_heal and chronic.
## This can do comparisons among the following columns in the pairwise result:
## chronic, self_heal
## Actually comparing self_heal and chronic.
## This can do comparisons among the following columns in the pairwise result:
## chronic, self_heal
## Actually comparing self_heal and chronic.
test_sig <- extract_significant_genes(test_table, excel=FALSE)
## Writing excel data sheet 0/1: self_heal_vs_chronic
## Assuming the fold changes are on the log scale and so taking >< 0
## After (adj)p filter, the up genes table has 1327 genes.
## After (adj)p filter, the down genes table has 695 genes.
## Assuming the fold changes are on the log scale and so taking -1.0 * fc
## After fold change filter, the up genes table has 8 genes.
## After fold change filter, the down genes table has 7 genes.
## Not printing excel sheets for the significant genes.
## Writing excel data sheet 1/1: self_heal_vs_chronic
## Assuming the fold changes are on the log scale and so taking >< 0
## After (adj)p filter, the up genes table has 11 genes.
## After (adj)p filter, the down genes table has 24 genes.
## Assuming the fold changes are on the log scale and so taking -1.0 * fc
## After fold change filter, the up genes table has 0 genes.
## After fold change filter, the down genes table has 0 genes.
## Not printing excel sheets for the significant genes.
## Writing excel data sheet 2/1: self_heal_vs_chronic
## Assuming the fold changes are on the log scale and so taking >< 0
## After (adj)p filter, the up genes table has 0 genes.
## After (adj)p filter, the down genes table has 4 genes.
## Assuming the fold changes are on the log scale and so taking -1.0 * fc
## After fold change filter, the up genes table has 0 genes.
## After fold change filter, the down genes table has 0 genes.
## Not printing excel sheets for the significant genes.
## Writing excel data sheet 3/1: self_heal_vs_chronic
## Assuming the fold changes are on the log scale and so taking >< 0
## After (adj)p filter, the up genes table has 1081 genes.
## After (adj)p filter, the down genes table has 601 genes.
## Assuming the fold changes are on the log scale and so taking -1.0 * fc
## After fold change filter, the up genes table has 15 genes.
## After fold change filter, the down genes table has 8 genes.
## Not printing excel sheets for the significant genes.
##test_ma <- extract_de_ma(test_sig, type="limma", table=1)
pair_ma <- plot_ma_de(table=test_table$data$self_heal_vs_chronic, fc_col="limma_logfc", p_col="limma_adjp", expr_col="limma_ave", logfc_cutoff=0.6)
pair_ma$plot
In our recent meeting, Hector expressed interest that I play with the variance partition package in an attempt to see if there are models which best explain the visible variance in the data. It is hoped/expected that doing so will allow us to properly model the data without the unfortunate expedient of using combat/sva in ways which may not be appropriate.
With that in mind, I have a TODO entry somewhere which tells me pretty much exactly the models Hector wanted to test. hmm ah check it, #1:
ok, with that digression, lets play with variancePartition, then try re-evaluating the sample metrics using these new groupings by snp clade. Also, include some of the other samples to see if clades become clearer.
So, on the left side of my screen I have this document while on the right I have the variancePartition help document. Note that I loaded the infection_estimation data set above, this provides me the pbmc_para raw data.
human_expt$notes
## [1] "New experimental design factors by snp added 2016-09-20"
colnames(human_expt$design)
## [1] "sampleid" "experimentname"
## [3] "tubelabel" "alias"
## [5] "condition" "batch"
## [7] "anotherbatch" "snpclade"
## [9] "snpcladev2" "snpcladev3"
## [11] "pathogenstrain" "donor"
## [13] "time" "pctmappedparasite"
## [15] "pctcategory" "state"
## [17] "sourcelab" "expperson"
## [19] "pathogen" "host"
## [21] "hostcelltype" "noofhostcells"
## [23] "infectionperiodhpitimeofharvest" "moiexposure"
## [25] "parasitespercell" "pctinf"
## [27] "rnangul" "rnaqcpassed"
## [29] "libraryconst" "libqcpassed"
## [31] "index" "descriptonandremarks"
## [33] "observation" "lowercaseid"
## [35] "humanfile" "parasitefile"
## [37] "file"
## make sure that I am getting the material from ~ 2016-09-20
infection_human <- expt_subset(human_expt, subset="experimentname=='infection'")
infection_human <- set_expt_colors(infection_human, colors=c("#005500","#880000","#0000FF"))
infection_human <- sm(normalize_expt(infection_human, filter=TRUE))
Hector kindly suggested appropriate model formulae for this data.
The variancePartition package introduces new (to me) syntax when making model matrices. A fixed effect factor in the experimental design is stated normally. categorical/random effect factors get the (1|factor) syntax. The example in the text therefore looks like:
form <- ~ Age + (1|Individual) + (1|Tissue) + (1|Batch)
So in my ignorant way of thinking, you put the factor you want to learn about first and without the (1|) and then the ones you don’t want following with (1|).
# specify formula to model within/between individual variance
# separately for each tissue
# Note that including +0 ensures each tissue is modeled explicitly
# Otherwise, the first tissue would be used as baseline
form <- ~ (Tissue+0|Individual) + Age + (1|Tissue) + (1|Batch)
# fit model and extract variance percents
varPart <- fitExtractVarPartModel(geneExpr, form, info)
# violin plot
plotVarPart(sortCols(varPart), label.angle=60)
So in Maria Adelaida’s case, we would ideally have something like:
~ healing + (1|donor) + (1|strain)
and therefore be able to discern across healing states. However, we know a priori that the healing is confounded with strain, so this might not be kosher… Attempting ~ healing + xxx results in an error: “Error in checkModelStatus(fitInit, showWarnings = showWarnings, colinearityCutoff) : Categorical variables modeled as fixed effect: healing The results will not behave as expected and may be very wrong!!”
So I will try (1|healing)
Start out just seeing what donor looks like vs. residuals:
vp <- varpart(infection_human, predictor=NULL, factors=c("condition", "batch")) ## Batch in this case is donor.
## Attempting mixed linear model with: ~ (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Error in varpart(infection_human, predictor = NULL, factors = c("condition", : An error like 'vtv downdated' may be because there are too many 0s, try and filter the data and rerun.
vp$percent_plot
## Error in eval(expr, envir, enclos): object 'vp' not found
vp$partition_plot
## Error in eval(expr, envir, enclos): object 'vp' not found
It looks to me that condition and batch(donor) have similar amounts of variance in this data and that there is a lot more unaccounted.
vp <- varpart(infection_human, predictor=NULL, factors=c("pathogenstrain"))
## Attempting mixed linear model with: ~ (1|pathogenstrain)
## Fitting the expressionset to the model, this is slow.
## Error in varpart(infection_human, predictor = NULL, factors = c("pathogenstrain")): An error like 'vtv downdated' may be because there are too many 0s, try and filter the data and rerun.
vp$percent_plot
## Error in eval(expr, envir, enclos): object 'vp' not found
vp$partition_plot
## Error in eval(expr, envir, enclos): object 'vp' not found
Looking at strain alone, we see some variance there.
vp <- varpart(infection_human, predictor=NULL, factors=c("condition","batch", "pathogenstrain"))
## Attempting mixed linear model with: ~ (1|condition) + (1|batch) + (1|pathogenstrain)
## Fitting the expressionset to the model, this is slow.
## Error in varpart(infection_human, predictor = NULL, factors = c("condition", : An error like 'vtv downdated' may be because there are too many 0s, try and filter the data and rerun.
vp$percent_plot
## Error in eval(expr, envir, enclos): object 'vp' not found
vp$partition_plot
## Error in eval(expr, envir, enclos): object 'vp' not found
In order of apparent variance: residuals > batch > condition > strain
vp <- varpart(infection_human, predictor=NULL, factors=c("condition","batch", "snpclade3"))
## Attempting mixed linear model with: ~ (1|condition) + (1|batch) + (1|snpclade3)
## Fitting the expressionset to the model, this is slow.
## Error in varpart(infection_human, predictor = NULL, factors = c("condition", : An error like 'vtv downdated' may be because there are too many 0s, try and filter the data and rerun.
vp$percent_plot
## Error in eval(expr, envir, enclos): object 'vp' not found
vp$partition_plot
## Error in eval(expr, envir, enclos): object 'vp' not found
residuals > batch > condition > snp3
vp <- varpart(infection_human, predictor=NULL, factors=c("condition","batch", "snpclade5"))
## Attempting mixed linear model with: ~ (1|condition) + (1|batch) + (1|snpclade5)
## Fitting the expressionset to the model, this is slow.
## Error in varpart(infection_human, predictor = NULL, factors = c("condition", : An error like 'vtv downdated' may be because there are too many 0s, try and filter the data and rerun.
vp$percent_plot
## Error in eval(expr, envir, enclos): object 'vp' not found
vp$partition_plot
## Error in eval(expr, envir, enclos): object 'vp' not found
residuals > batch > condition > snp5