First things first: get some human annotation data. My hpgltools package has a function for pulling that from Ensembl.
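The call that produced the messages below is not echoed; here is a sketch of what it probably looked like (relying on the defaults, and the variable name hs_biomart, are assumptions):

library(hpgltools)
## Sketch only: the exact arguments (species, host, etc.) are not shown in the
## messages below, so using the defaults here is an assumption.
hs_biomart <- load_biomart_annotations()
summary(hs_biomart)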
## The biomart annotations file already exists, loading from it.
## Length Class Mode
## annotation 12 data.frame list
## mart 1 -none- character
## host 1 -none- character
## mart_name 1 -none- character
## rows 1 -none- character
## dataset 1 -none- character
The result of load_biomart_annotations() is a list; the first element, the annotation data frame, is probably the only one of significant interest. The rest are useful if you want to load other data, or to figure out what happened when biomart is not responding as you expect.
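The table that follows is the head of that annotation data frame; pulling it out into hs_annot (the name handed to create_expt() later) would look roughly like this, with hs_biomart being the list sketched above:

## "annotation" matches the element name in the summary above.
hs_annot <- hs_biomart[["annotation"]]
head(hs_annot)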
## ensembl_transcript_id ensembl_gene_id version
## ENST00000000233 ENST00000000233 ENSG00000004059 11
## ENST00000000412 ENST00000000412 ENSG00000003056 8
## ENST00000000442 ENST00000000442 ENSG00000173153 15
## ENST00000001008 ENST00000001008 ENSG00000004478 8
## ENST00000001146 ENST00000001146 ENSG00000003137 8
## ENST00000002125 ENST00000002125 ENSG00000003509 16
## transcript_version hgnc_symbol
## ENST00000000233 10 ARF5
## ENST00000000412 8 M6PR
## ENST00000000442 11 ESRRA
## ENST00000001008 6 FKBP4
## ENST00000001146 6 CYP26B1
## ENST00000002125 9 NDUFAF7
## description
## ENST00000000233 ADP ribosylation factor 5 [Source:HGNC Symbol;Acc:HGNC:658]
## ENST00000000412 mannose-6-phosphate receptor, cation dependent [Source:HGNC Symbol;Acc:HGNC:6752]
## ENST00000000442 estrogen related receptor alpha [Source:HGNC Symbol;Acc:HGNC:3471]
## ENST00000001008 FKBP prolyl isomerase 4 [Source:HGNC Symbol;Acc:HGNC:3720]
## ENST00000001146 cytochrome P450 family 26 subfamily B member 1 [Source:HGNC Symbol;Acc:HGNC:20581]
## ENST00000002125 NADH:ubiquinone oxidoreductase complex assembly factor 7 [Source:HGNC Symbol;Acc:HGNC:28816]
## gene_biotype cds_length chromosome_name strand start_position
## ENST00000000233 protein_coding 543 7 + 127588386
## ENST00000000412 protein_coding 834 12 - 8940361
## ENST00000000442 protein_coding 1272 11 + 64305497
## ENST00000001008 protein_coding 1380 12 + 2794970
## ENST00000001146 protein_coding 1539 2 - 72129238
## ENST00000002125 protein_coding 1326 2 + 37231631
## end_position
## ENST00000000233 127591700
## ENST00000000412 8949761
## ENST00000000442 64316743
## ENST00000001008 2805423
## ENST00000001146 72148038
## ENST00000002125 37253403
Here are the first 6 rows of the human annotation data. The row names all start with ENST, so they are keyed off the transcript ID. This is sort of a problem, as we really want to use the gene IDs (the second column, ensembl_gene_id).
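The function used in the next step to build the expressionset handles that mapping for us, but to make the point concrete, collapsing the transcript-keyed table down to one row per gene could look like the following. This is purely illustrative, not necessarily what the package does internally:

## Illustrative only: keep the first transcript row seen for each gene and
## re-key the table by gene ID.
gene_annot <- hs_annot[!duplicated(hs_annot[["ensembl_gene_id"]]), ]
rownames(gene_annot) <- gene_annot[["ensembl_gene_id"]]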
All the likely tasks we will want to perform on the RNASeq data go through a data type called the ExpressionSet. A few things about them annoy me slightly, and creating them correctly is a bit more fiddly than I would like, so I wrote a function to hopefully make it easier.
hs_expt <- create_expt(metadata="sample_sheets/tc_samples.xlsx",
                       gene_info=hs_annot,
                       file_column="humanfile")
## Reading the sample metadata.
## The sample definitions comprises: 10 rows(samples) and 9 columns(metadata fields).
## Reading count tables.
## Reading count files with read.table().
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_fernanda_learning_2020/preprocessing/hpgl0475/outputs/tophat_hsapiens/accepted_paired.count.xz contains 51046 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_fernanda_learning_2020/preprocessing/hpgl0482/outputs/tophat_hsapiens/accepted_paired.count.xz contains 51046 rows and merges to 51046 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_fernanda_learning_2020/preprocessing/hpgl0476/outputs/tophat_hsapiens/accepted_paired.count.xz contains 51046 rows and merges to 51046 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_fernanda_learning_2020/preprocessing/hpgl0480/outputs/tophat_hsapiens/accepted_paired.count.xz contains 51046 rows and merges to 51046 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_fernanda_learning_2020/preprocessing/hpgl0484/outputs/tophat_hsapiens/accepted_paired.count.xz contains 51046 rows and merges to 51046 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_fernanda_learning_2020/preprocessing/hpgl0473/outputs/tophat_hsapiens/accepted_paired.count.xz contains 51046 rows and merges to 51046 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_fernanda_learning_2020/preprocessing/hpgl0107/outputs/tophat_hsapiens/accepted_paired.count.xz contains 51046 rows and merges to 51046 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_fernanda_learning_2020/preprocessing/hpgl0109/outputs/tophat_hsapiens/accepted_paired.count.xz contains 51046 rows and merges to 51046 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_fernanda_learning_2020/preprocessing/hpgl0122/outputs/tophat_hsapiens/accepted_paired.count.xz contains 51046 rows and merges to 51046 rows.
## /mnt/sshfs/cbcbsub/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/tcruzi_fernanda_learning_2020/preprocessing/hpgl0124/outputs/tophat_hsapiens/accepted_paired.count.xz contains 51046 rows and merges to 51046 rows.
## Finished reading count data.
## Matched 43573 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 51041 rows and 10 columns.
I like to think of an ExpressionSet as a three-dimensional, cube-like object.
Unlike a real cube with 6 sides, this one only has 3:
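Those three sides are the count matrix, the sample metadata, and the gene annotations, which are what print below. To poke at them directly with the Biobase accessors, something like this should work (assuming the expt behaves like a list and keeps its ExpressionSet in the expressionset slot, as the normalize_expt() messages later suggest):

library(Biobase)
## Assumption: the expt holds its ExpressionSet in the 'expressionset' slot.
eset <- hs_expt$expressionset
head(exprs(eset))  ## side 1: the count matrix (genes x samples)
head(pData(eset))  ## side 2: the sample metadata
head(fData(eset))  ## side 3: the gene annotations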
## HPGL0475 HPGL0482 HPGL0476 HPGL0480 HPGL0484 HPGL0473 HPGL0107
## ENSG00000000003 951 871 1005 501 539 607 875
## ENSG00000000005 3 1 1 3 6 7 0
## ENSG00000000419 1304 1097 1061 1004 953 1059 1066
## ENSG00000000457 447 358 394 315 336 390 345
## ENSG00000000460 234 209 203 153 149 163 112
## ENSG00000000938 39 25 18 52 58 63 3
## HPGL0109 HPGL0122 HPGL0124
## ENSG00000000003 790 1007 762
## ENSG00000000005 0 0 0
## ENSG00000000419 981 901 958
## ENSG00000000457 287 338 295
## ENSG00000000460 123 130 87
## ENSG00000000938 2 7 0
## sampleid type stage replicate batch condition originalcond
## HPGL0475 HPGL0475 CLBr A60 1 a CLBr.A60 clbr_a60
## HPGL0482 HPGL0482 CLBr A60 2 a CLBr.A60 clbr_a60
## HPGL0476 HPGL0476 CLBr A60 3 a CLBr.A60 clbr_a60
## HPGL0480 HPGL0480 CLBr A96 1 b CLBr.A96 clbr_a96
## HPGL0484 HPGL0484 CLBr A96 2 b CLBr.A96 clbr_a96
## HPGL0473 HPGL0473 CLBr A96 3 b CLBr.A96 clbr_a96
## humanfile
## HPGL0475 preprocessing/hpgl0475/outputs/tophat_hsapiens/accepted_paired.count.xz
## HPGL0482 preprocessing/hpgl0482/outputs/tophat_hsapiens/accepted_paired.count.xz
## HPGL0476 preprocessing/hpgl0476/outputs/tophat_hsapiens/accepted_paired.count.xz
## HPGL0480 preprocessing/hpgl0480/outputs/tophat_hsapiens/accepted_paired.count.xz
## HPGL0484 preprocessing/hpgl0484/outputs/tophat_hsapiens/accepted_paired.count.xz
## HPGL0473 preprocessing/hpgl0473/outputs/tophat_hsapiens/accepted_paired.count.xz
## file
## HPGL0475 preprocessing/parasite/clbr/hpgl0475/outputs/tophat_tcruzi_all/accepted_paired.count.xz
## HPGL0482 preprocessing/parasite/clbr/hpgl0482/outputs/tophat_tcruzi_all/accepted_paired.count.xz
## HPGL0476 preprocessing/parasite/clbr/hpgl0476/outputs/tophat_tcruzi_all/accepted_paired.count.xz
## HPGL0480 preprocessing/parasite/clbr/hpgl0480/outputs/tophat_tcruzi_all/accepted_paired.count.xz
## HPGL0484 preprocessing/parasite/clbr/hpgl0484/outputs/tophat_tcruzi_all/accepted_paired.count.xz
## HPGL0473 preprocessing/parasite/clbr/hpgl0473/outputs/tophat_tcruzi_all/accepted_paired.count.xz
## ensembl_transcript_id ensembl_gene_id version
## ENSG00000000003 ENST00000373020 ENSG00000000003 15
## ENSG00000000005 ENST00000373031 ENSG00000000005 6
## ENSG00000000419 ENST00000371582 ENSG00000000419 12
## ENSG00000000457 ENST00000367770 ENSG00000000457 14
## ENSG00000000460 ENST00000286031 ENSG00000000460 17
## ENSG00000000938 ENST00000374003 ENSG00000000938 13
## transcript_version hgnc_symbol
## ENSG00000000003 9 TSPAN6
## ENSG00000000005 5 TNMD
## ENSG00000000419 8 DPM1
## ENSG00000000457 5 SCYL3
## ENSG00000000460 10 C1orf112
## ENSG00000000938 7 FGR
## description
## ENSG00000000003 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## ENSG00000000005 tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
## ENSG00000000419 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## ENSG00000000457 SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## ENSG00000000460 chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
## ENSG00000000938 FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697]
## gene_biotype cds_length chromosome_name strand start_position
## ENSG00000000003 protein_coding 738 X - 100627108
## ENSG00000000005 protein_coding 954 X + 100584936
## ENSG00000000419 protein_coding 864 20 - 50934867
## ENSG00000000457 protein_coding 2229 1 - 169849631
## ENSG00000000460 protein_coding 2562 1 + 169662007
## ENSG00000000938 protein_coding 1590 1 - 27612064
## end_position
## ENSG00000000003 100639991
## ENSG00000000005 100599885
## ENSG00000000419 50958555
## ENSG00000000457 169894267
## ENSG00000000460 169854080
## ENSG00000000938 27635185
All the following analyses will use this as the starting point.
Here is a fun little thing we can do with this:
One of the columns in the annotation data is the transcript length. Keep in mind that we are working at the level of genes, so each gene in the ExpressionSet uses the annotation data from its first transcript; this does not generally matter, but we should remember it in case we decide to do something where it does matter…
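The lengths vector used below is not defined in this excerpt; it presumably came from the length column of the gene annotations, coerced to numeric, which would also explain the coercion warning. A sketch, assuming cds_length is the column of interest:

## Assumption: use the cds_length column from the gene annotations; coercing
## it to numeric turns non-numeric entries into NA, hence the warning below.
lengths <- as.numeric(fData(hs_expt$expressionset)[["cds_length"]])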
## Warning: NAs introduced by coercion
gene_lengths <- plot_histogram(lengths)
library(ggplot2)
gene_lengths +
  scale_x_continuous(limits=c(0, 10000))
## Warning: Removed 32464 rows containing non-finite values (stat_bin).
## Warning: Removed 32464 rows containing non-finite values (stat_density).
## Warning: Removed 2 rows containing missing values (geom_bar).
The distribution of transcript lengths in the human genome for the set of transcripts <= 10,000 nucleotides in length.
There are a few analyses which are nice to look at with relatively raw data.
I have a function named plot_libsize() which simply shows how many counts were observed for each sample in the data. This is useful because if one or more samples have waaaay fewer counts than the others, that will cause weird artifacts in the final result.
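The invocation is not echoed; presumably it just takes the expt, and the Length/Class/Mode table below is summary() of what it returns (hs_libsize is my name for it):

## Sketch; the variable name is an assumption.
hs_libsize <- plot_libsize(hs_expt)
summary(hs_libsize)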
## Length Class Mode
## plot 9 gg list
## table 4 data.table list
## summary 7 data.table list
As you can see, the result of plot_libsize() is a list with 3 elements: a plot, a table, and a summary.
## id sum condition colors
## 1: HPGL0475 39546773 CLBr.A60 #1B9E77
## 2: HPGL0482 30578474 CLBr.A60 #1B9E77
## 3: HPGL0476 33793748 CLBr.A60 #1B9E77
## 4: HPGL0480 23668540 CLBr.A96 #D95F02
## 5: HPGL0484 26022709 CLBr.A96 #D95F02
## 6: HPGL0473 29556535 CLBr.A96 #D95F02
## 7: HPGL0107 29918273 Uninfected #7570B3
## 8: HPGL0109 26274984 Uninfected #7570B3
## 9: HPGL0122 25498955 Uninfected #7570B3
## 10: HPGL0124 25778002 Uninfected #7570B3
## condition min 1st median mean 3rd max
## 1: CLBr.A60 30578474 32186111 33793748 34639665 36670260 39546773
## 2: CLBr.A96 23668540 24845624 26022709 26415928 27789622 29556535
## 3: Uninfected 25498955 25708240 26026493 26867554 27185806 29918273
We can see that the samples are reasonably similar in terms of number of counts, so that is good.
What about the overall distribution of counts per sample?
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 195051 zero count features.
## Length Class Mode
## plot 10 gg list
## condition_summary 7 data.table list
## batch_summary 7 data.table list
## sample_summary 7 data.table list
## table 5 data.table list
I suspect you are seeing a trend: most of the things I wrote return lists, and the functions which create plots usually have an element in that list named plot…
## condition min 1st median mean 3rd max
## 1: CLBr.A60 0.5 0.5 3.0 678.8 68 789636
## 2: CLBr.A96 0.5 1.0 5.0 517.7 65 322268
## 3: Uninfected 0.5 0.5 0.5 526.7 38 755811
We can see at a glance that the distributions of counts for the infected and uninfected samples are very different. I am thinking that should not be a big surprise.
How do the raw data compare against each other? We can use correlation/distance heatmaps to query that; however, we should keep in mind that the results can be misleading when the data have not been normalized (which they have not).
Like I said, this is probably not very meaningful, but at first glance the 60 hour and uninfected samples appear much more similar to each other than either is to the 96 hour samples.
Let's normalize the data.
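The call itself is not echoed; here is a sketch mirroring the svaseq invocation shown later. The norm="quant" argument is an assumption based on the 'normalizing the data with quant' step in the messages, and the name hs_norm matches hs_norm_plots further down:

## Sketch: norm="quant" and the variable name are assumptions.
hs_norm <- normalize_expt(hs_expt, transform="log2", convert="cpm",
                          norm="quant", filter=TRUE)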
## This function will replace the expt$expressionset slot with:
## log2(cpm(quant(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: performing count filter with option: cbcb
## Removing 37199 low-count genes (13842 remaining).
## Step 2: normalizing the data with quant.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## Step 5: not doing batch correction.
The above filters out low-count genes (filter=TRUE), quantile normalizes, converts to counts per million, and finally log2 transforms the data.
Now let us replot the data after normalization. Note that I have a shortcut function, graph_metrics(), which can generate every plot; some of the plots take a long time and so are not generated by default.
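Something like the following, with hs_norm being the normalized expt sketched above; the Length/Class/Mode table further down is summary() of the returned list:

## Sketch; the variable name hs_norm_plots matches its use below.
hs_norm_plots <- graph_metrics(hs_norm)
summary(hs_norm_plots)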
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.
## Printing a color to condition legend.
## Length Class Mode
## boxplot 9 gg list
## corheat 3 recordedplot list
## cvplot 9 gg list
## density 10 gg list
## density_table 5 data.table list
## disheat 3 recordedplot list
## gene_heatmap 0 -none- NULL
## legend 3 recordedplot list
## legend_colors 3 data.frame list
## libsize 9 gg list
## libsizes 4 data.table list
## libsize_summary 7 data.table list
## ma 0 -none- NULL
## nonzero 9 gg list
## nonzero_table 7 data.frame list
## pc_loadplot 3 recordedplot list
## pc_summary 4 data.frame list
## pc_propvar 9 -none- numeric
## pc_plot 9 gg list
## pc_table 17 data.frame list
## qqlog 0 -none- NULL
## qqrat 0 -none- NULL
## smc 9 gg list
## smd 9 gg list
## topnplot 9 gg list
## tsne_summary 4 data.frame list
## tsne_propvar 20 -none- numeric
## tsne_plot 9 gg list
## tsne_table 10 data.frame list
## And no samples have lots of null genes.
## Now for the bad news: normalization kills useful information in the data.
hs_norm_plots$boxplot
## All samples have been smooshed by quantile normalization into identical
## distributions. This is terrible for some things.
hs_norm_plots$cvplot
Let's see if anything changes when we apply sva to the data. Because we removed much of the data from the original experiment, we probably cannot viably add batch to the model; it might work, but I am thinking it would give faulty results.
It is worth noting that, over time, an increasingly large family of surrogate variable analysis methods has been created. I have wrappers for a few of them, but I mostly just stick with the default sva.
hs_norm_batch <- normalize_expt(hs_expt, transform="log2", convert="cpm",
                                filter=TRUE, batch="svaseq")
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 37199 low-count genes (13842 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 813 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 133006 entries are x>1: 96%.
## batch_counts: Before batch/surrogate estimation, 813 entries are x==0: 1%.
## batch_counts: Before batch/surrogate estimation, 4601 entries are 0<x<1: 3%.
## The be method chose 2 surrogate variable(s).
## Attempting svaseq estimation with 2 surrogates.
## There are 577 (0%) elements which are < 0 after batch correction.
## Setting low elements to zero.
We will do three versions of this: one with only condition in the model, one with condition and batch, and one using sva.
My all_pairwise() function invokes each of the differential expression tools (limma, DESeq2, EdgeR, EBSeq, plus my 'basic' method) on the data and then compares their results.
Thus, if you wish to learn how to invoke limma and friends, you may wish instead to look up the functions: limma_pairwise(), deseq_pairwise(), edger_pairwise(), ebseq_pairwise(), and basic_pairwise() in order to see the things I do to run them. Hopefully, you will find that I pretty carefully followed the suggestions from the authors of each tool.
The _pairwise() family of functions has a lot of options; I will only be using 'model_batch' to change one of them.
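The sva invocation is shown explicitly further down (with parallel processing turned off so it prints as it goes). The condition-only and condition-plus-batch versions that produced the first two blocks of messages are not echoed; they presumably looked something like this, where hs_cond at least matches the combine_de_tables() call later, while hs_batch and the use of FALSE/TRUE for model_batch are assumptions:

## Sketches of the first two invocations; only hs_cond is confirmed by later
## code, the rest is assumed.
hs_cond <- all_pairwise(hs_expt, model_batch=FALSE, filter=TRUE)
hs_batch <- all_pairwise(hs_expt, model_batch=TRUE, filter=TRUE)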
## Plotting a PCA before surrogates/batch inclusion.
## Not putting labels on the plot.
## Assuming no batch in model for testing pca.
## Not putting labels on the plot.
## Finished running DE analyses, collecting outputs.
## Comparing analyses.
## Length Class Mode
## basic 10 basic_result list
## deseq 16 deseq_result list
## ebseq 5 ebseq_result list
## edger 16 edger_result list
## limma 22 limma_result list
## batch_type 1 -none- character
## comparison 22 -none- list
## extra_contrasts 0 -none- NULL
## input 16 expt list
## model_cond 1 -none- logical
## model_batch 1 -none- logical
## original_pvalues 0 -none- NULL
## pre_batch 5 -none- list
## post_batch 5 -none- list
## Plotting a PCA before surrogates/batch inclusion.
## Not putting labels on the plot.
## Using limma's removeBatchEffect to visualize with(out) batch inclusion.
## Not putting labels on the plot.
## Starting basic_pairwise().
## Starting basic pairwise comparison.
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## Basic step 0/3: Transforming data.
## Basic step 1/3: Creating mean and variance tables.
## Basic step 2/3: Performing 6 comparisons.
## Basic step 3/3: Creating faux DE Tables.
## Basic: Returning tables.
## Starting deseq_pairwise().
## Starting DESeq2 pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Choosing the non-intercept containing model.
## DESeq2 step 1/5: Including batch and condition in the deseq model.
## converting counts to integer mode
## Error in checkFullRank(modelMatrix) :
## the model matrix is not full rank, so the model cannot be fit as specified.
## One or more variables or interaction terms in the design formula are linear
## combinations of the others and must be removed.
##
## Please read the vignette section 'Model matrix not full rank':
##
## vignette('DESeq2')
## Starting ebseq_pairwise().
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Starting EBSeq pairwise subset.
## Choosing the non-intercept containing model.
## Starting EBTest of CLBrA60 vs. CLBrA96.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting EBTest of CLBrA60 vs. Uninfected.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting EBTest of CLBrA96 vs. Uninfected.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting edger_pairwise().
## Starting edgeR pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Choosing the non-intercept containing model.
## EdgeR step 1/9: Importing and normalizing data.
## EdgeR step 2/9: Estimating the common dispersion.
## EdgeR step 3/9: Estimating dispersion across genes.
## EdgeR step 4/9: Estimating GLM Common dispersion.
## Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, :
## Design matrix not of full rank. The following coefficients not estimable:
## b c
## Warning in edger_pairwise(...): estimateGLMCommonDisp() failed. Trying again
## with estimateDisp().
## Warning in edger_pairwise(...): There was a failure when doing the estimations.
## There was a failure when doing the estimations, using estimateDisp().
## Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, :
## Design matrix not of full rank. The following coefficients not estimable:
## b c
## Starting limma_pairwise().
## Starting limma pairwise comparison.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$best_libsize.
## Limma step 1/6: choosing model.
## Choosing the non-intercept containing model.
## Limma step 2/6: running limma::voom(), switch with the argument 'which_voom'.
## Using normalize.method=quantile for voom.
## Coefficients not estimable: b c
## Warning: Partial NA coefficients for 13842 probe(s)
## Limma step 3/6: running lmFit with method: ls.
## Coefficients not estimable: b c
## Warning: Partial NA coefficients for 13842 probe(s)
## Limma step 4/6: making and fitting contrasts with no intercept. (~ 0 + factors)
## Limma step 5/6: Running eBayes with robust=FALSE and trend=FALSE.
## Limma step 6/6: Writing limma outputs.
## Limma step 6/6: 1/3: Creating table: CLBrA96_vs_CLBrA60. Adjust=BH
## Limma step 6/6: 2/3: Creating table: Uninfected_vs_CLBrA60. Adjust=BH
## Limma step 6/6: 3/3: Creating table: Uninfected_vs_CLBrA96. Adjust=BH
## Limma step 6/6: 1/3: Creating table: CLBrA60. Adjust=BH
## Limma step 6/6: 2/3: Creating table: CLBrA96. Adjust=BH
## Limma step 6/6: 3/3: Creating table: Uninfected. Adjust=BH
## Comparing analyses.
## It looks like I was correct: the experimental model does not support a batch
## factor, so this failed.
## For this last invocation, I will turn off parallel processing, so it will
## take longer but also print out what it is doing while it runs...
hs_sva <- all_pairwise(hs_expt, model_batch="svaseq", parallel=FALSE, filter=TRUE)
## batch_counts: Before batch/surrogate estimation, 137255 entries are x>1: 99%.
## batch_counts: Before batch/surrogate estimation, 813 entries are x==0: 1%.
## The be method chose 2 surrogate variable(s).
## Attempting svaseq estimation with 2 surrogates.
## Plotting a PCA before surrogates/batch inclusion.
## Not putting labels on the plot.
## Using svaseq to visualize before/after batch inclusion.
## Performing a test normalization with: raw
## This function will replace the expt$expressionset slot with:
## log2(svaseq(cpm(cbcb(data))))
## It will save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep libsizes in mind
## when invoking limma. The appropriate libsize is non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Step 1: performing count filter with option: cbcb
## Removing 0 low-count genes (13842 remaining).
## Step 2: not normalizing the data.
## Step 3: converting the data with cpm.
## Step 4: transforming the data with log2.
## transform_counts: Found 813 values equal to 0, adding 1 to the matrix.
## Step 5: doing batch correction with svaseq.
## Note to self: If you get an error like 'x contains missing values' The data has too many 0's and needs a stronger low-count filter applied.
## Passing off to all_adjusters.
## batch_counts: Before batch/surrogate estimation, 133006 entries are x>1: 96%.
## batch_counts: Before batch/surrogate estimation, 813 entries are x==0: 1%.
## batch_counts: Before batch/surrogate estimation, 4601 entries are 0<x<1: 3%.
## The be method chose 2 surrogate variable(s).
## Attempting svaseq estimation with 2 surrogates.
## There are 577 (0%) elements which are < 0 after batch correction.
## Setting low elements to zero.
## Not putting labels on the plot.
## Starting basic_pairwise().
## Starting basic pairwise comparison.
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## Basic step 0/3: Transforming data.
## Basic step 1/3: Creating mean and variance tables.
## Basic step 2/3: Performing 6 comparisons.
## Basic step 3/3: Creating faux DE Tables.
## Basic: Returning tables.
## Starting deseq_pairwise().
## Starting DESeq2 pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Including batch estimates from sva/ruv/pca in the model.
## Choosing the non-intercept containing model.
## DESeq2 step 1/5: Including a matrix of batch estimates in the deseq model.
## converting counts to integer mode
## DESeq2 step 2/5: Estimate size factors.
## DESeq2 step 3/5: Estimate dispersions.
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Using a parametric fitting seems to have worked.
## DESeq2 step 4/5: nbinomWaldTest.
## Starting ebseq_pairwise().
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Starting EBSeq pairwise subset.
## Choosing the non-intercept containing model.
## Starting EBTest of CLBrA60 vs. CLBrA96.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting EBTest of CLBrA60 vs. Uninfected.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting EBTest of CLBrA96 vs. Uninfected.
## Copying ppee values as ajusted p-values until I figure out how to deal with them.
## Starting edger_pairwise().
## Starting edgeR pairwise comparisons.
## The data should be suitable for EdgeR/DESeq/EBSeq. If they freak out, check the state of the count table and ensure that it is in integer counts.
## Including batch estimates from sva/ruv/pca in the model.
## Choosing the non-intercept containing model.
## EdgeR step 1/9: Importing and normalizing data.
## EdgeR step 2/9: Estimating the common dispersion.
## EdgeR step 3/9: Estimating dispersion across genes.
## EdgeR step 4/9: Estimating GLM Common dispersion.
## EdgeR step 5/9: Estimating GLM Trended dispersion.
## EdgeR step 6/9: Estimating GLM Tagged dispersion.
## EdgeR step 7/9: Running glmFit, switch to glmQLFit by changing the argument 'edger_test'.
## EdgeR step 8/9: Making pairwise contrasts.
## Starting limma_pairwise().
## Starting limma pairwise comparison.
## libsize was not specified, this parameter has profound effects on limma's result.
## Using the libsize from expt$best_libsize.
## Limma step 1/6: choosing model.
## Including batch estimates from sva/ruv/pca in the model.
## Choosing the non-intercept containing model.
## Limma step 2/6: running limma::voom(), switch with the argument 'which_voom'.
## Using normalize.method=quantile for voom.
## Limma step 3/6: running lmFit with method: ls.
## Limma step 4/6: making and fitting contrasts with no intercept. (~ 0 + factors)
## Limma step 5/6: Running eBayes with robust=FALSE and trend=FALSE.
## Limma step 6/6: Writing limma outputs.
## Limma step 6/6: 1/3: Creating table: CLBrA96_vs_CLBrA60. Adjust=BH
## Limma step 6/6: 2/3: Creating table: Uninfected_vs_CLBrA60. Adjust=BH
## Limma step 6/6: 3/3: Creating table: Uninfected_vs_CLBrA96. Adjust=BH
## Limma step 6/6: 1/3: Creating table: CLBrA60. Adjust=BH
## Limma step 6/6: 2/3: Creating table: CLBrA96. Adjust=BH
## Limma step 6/6: 3/3: Creating table: Uninfected. Adjust=BH
## Comparing analyses.
Let us see how similar the results are… First I need to combine the tables into one big table; I will tell it to write the results to two files. I will also create a variable 'keepers' which defines what I want the numerators and denominators to be. The all_pairwise() function has no sense of up and down; it simply takes every condition in the data and compares it to every other condition, so it will include comparisons in which we might not be interested…
keepers <- list(
  "t60" = c("CLBrA60", "Uninfected"),
  "t60t96" = c("CLBrA60", "CLBrA96"),
  "t96" = c("CLBrA96", "Uninfected"))
hs_cond_tables <- combine_de_tables(hs_cond, excel="condition.xlsx", keepers=keepers)
## Deleting the file condition.xlsx before writing the tables.
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/3: t60 which is: CLBrA60/Uninfected.
## Found inverse table with Uninfected_vs_CLBrA60
## Working on 2/3: t60t96 which is: CLBrA60/CLBrA96.
## Found inverse table with CLBrA96_vs_CLBrA60
## Working on 3/3: t96 which is: CLBrA96/Uninfected.
## Found inverse table with Uninfected_vs_CLBrA96
## Adding venn plots for t60.
## Limma expression coefficients for t60; R^2: 0.867; equation: y = 0.907x + 0.432
## Deseq expression coefficients for t60; R^2: 0.887; equation: y = 0.917x + 0.804
## Edger expression coefficients for t60; R^2: 0.887; equation: y = 0.917x + 0.526
## Adding venn plots for t60t96.
## Limma expression coefficients for t60t96; R^2: 0.867; equation: y = 0.907x + 0.432
## Deseq expression coefficients for t60t96; R^2: 0.887; equation: y = 0.917x + 0.804
## Edger expression coefficients for t60t96; R^2: 0.887; equation: y = 0.917x + 0.526
## Adding venn plots for t96.
## Limma expression coefficients for t96; R^2: 0.867; equation: y = 0.907x + 0.432
## Deseq expression coefficients for t96; R^2: 0.887; equation: y = 0.917x + 0.804
## Edger expression coefficients for t96; R^2: 0.887; equation: y = 0.917x + 0.526
## Writing summary information, compare_plot is: TRUE.
## Performing save of condition.xlsx.
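The block of messages that follows, ending with the sva.xlsx save, presumably came from the parallel call on the sva results; that call is not echoed, but it was probably something like this (the variable name is an assumption):

## Assumed parallel call for the sva results.
hs_sva_tables <- combine_de_tables(hs_sva, excel="sva.xlsx", keepers=keepers)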
## Writing a legend of columns.
## Printing a pca plot before/after surrogates/batch estimation.
## Working on 1/3: t60 which is: CLBrA60/Uninfected.
## Found inverse table with Uninfected_vs_CLBrA60
## Working on 2/3: t60t96 which is: CLBrA60/CLBrA96.
## Found inverse table with CLBrA96_vs_CLBrA60
## Working on 3/3: t96 which is: CLBrA96/Uninfected.
## Found inverse table with Uninfected_vs_CLBrA96
## Adding venn plots for t60.
## Limma expression coefficients for t60; R^2: 0.867; equation: y = 0.906x + 0.438
## Deseq expression coefficients for t60; R^2: 0.886; equation: y = 0.917x + 0.804
## Edger expression coefficients for t60; R^2: 0.885; equation: y = 0.915x + 0.555
## Adding venn plots for t60t96.
## Limma expression coefficients for t60t96; R^2: 0.867; equation: y = 0.906x + 0.438
## Deseq expression coefficients for t60t96; R^2: 0.886; equation: y = 0.917x + 0.804
## Edger expression coefficients for t60t96; R^2: 0.885; equation: y = 0.915x + 0.555
## Adding venn plots for t96.
## Limma expression coefficients for t96; R^2: 0.867; equation: y = 0.906x + 0.438
## Deseq expression coefficients for t96; R^2: 0.886; equation: y = 0.917x + 0.804
## Edger expression coefficients for t96; R^2: 0.885; equation: y = 0.915x + 0.555
## Writing summary information, compare_plot is: TRUE.
## Performing save of sva.xlsx.
## Testing method: limma.
## Adding method: limma to the set.
## Testing method: deseq.
## Adding method: deseq to the set.
## Testing method: edger.
## Adding method: edger to the set.
## Testing method: ebseq.
## Adding method: ebseq to the set.
## Testing method: basic.
## Adding method: basic to the set.
## $t60
## $t60$logfc
## [1] 0.9982
##
## $t60$p
## [1] 0.9241
##
## $t60$adjp
## [1] 0.9382
##
##
## $t60t96
## $t60t96$logfc
## [1] 0.9985
##
## $t60t96$p
## [1] 0.9606
##
## $t60t96$adjp
## [1] 0.9657
##
##
## $t96
## $t96$logfc
## [1] 0.9996
##
## $t96$p
## [1] 0.9657
##
## $t96$adjp
## [1] 0.9684
Let's grab the set of up/down genes.
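The call that produced the messages below is not echoed. Judging from the condition_sig.xlsx messages, it was the significance-extraction step applied to the combined tables; both the function name and its arguments in this sketch are assumptions:

## Assumption: the function name and arguments are guesses inferred from the
## condition_sig.xlsx messages that follow.
sig_genes <- extract_significant_genes(hs_cond_tables, excel="condition_sig.xlsx")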
## Writing a legend of columns.
## Writing excel data according to limma for t60: 1/15.
## After (adj)p filter, the up genes table has 2906 genes.
## After (adj)p filter, the down genes table has 4423 genes.
## After fold change filter, the up genes table has 754 genes.
## After fold change filter, the down genes table has 400 genes.
## Writing excel data according to limma for t60t96: 2/15.
## After (adj)p filter, the up genes table has 4726 genes.
## After (adj)p filter, the down genes table has 4669 genes.
## After fold change filter, the up genes table has 1503 genes.
## After fold change filter, the down genes table has 1467 genes.
## Writing excel data according to limma for t96: 3/15.
## After (adj)p filter, the up genes table has 4981 genes.
## After (adj)p filter, the down genes table has 5691 genes.
## After fold change filter, the up genes table has 2017 genes.
## After fold change filter, the down genes table has 2184 genes.
## Printing significant genes to the file: condition_sig.xlsx
## 1/3: Creating significant table up_limma_t60
## 2/3: Creating significant table up_limma_t60t96
## 3/3: Creating significant table up_limma_t96
## Writing excel data according to edger for t60: 1/15.
## After (adj)p filter, the up genes table has 3738 genes.
## After (adj)p filter, the down genes table has 3425 genes.
## After fold change filter, the up genes table has 1342 genes.
## After fold change filter, the down genes table has 103 genes.
## Writing excel data according to edger for t60t96: 2/15.
## After (adj)p filter, the up genes table has 4510 genes.
## After (adj)p filter, the down genes table has 4534 genes.
## After fold change filter, the up genes table has 1169 genes.
## After fold change filter, the down genes table has 1352 genes.
## Writing excel data according to edger for t96: 3/15.
## After (adj)p filter, the up genes table has 5358 genes.
## After (adj)p filter, the down genes table has 5263 genes.
## After fold change filter, the up genes table has 2388 genes.
## After fold change filter, the down genes table has 1795 genes.
## Printing significant genes to the file: condition_sig.xlsx
## 1/3: Creating significant table up_edger_t60
## 2/3: Creating significant table up_edger_t60t96
## 3/3: Creating significant table up_edger_t96
## Writing excel data according to deseq for t60: 1/15.
## After (adj)p filter, the up genes table has 3704 genes.
## After (adj)p filter, the down genes table has 3722 genes.
## After fold change filter, the up genes table has 1329 genes.
## After fold change filter, the down genes table has 106 genes.
## Writing excel data according to deseq for t60t96: 2/15.
## After (adj)p filter, the up genes table has 4392 genes.
## After (adj)p filter, the down genes table has 4707 genes.
## After fold change filter, the up genes table has 1131 genes.
## After fold change filter, the down genes table has 1387 genes.
## Writing excel data according to deseq for t96: 3/15.
## After (adj)p filter, the up genes table has 5495 genes.
## After (adj)p filter, the down genes table has 5203 genes.
## After fold change filter, the up genes table has 2425 genes.
## After fold change filter, the down genes table has 1759 genes.
## Printing significant genes to the file: condition_sig.xlsx
## 1/3: Creating significant table up_deseq_t60
## 2/3: Creating significant table up_deseq_t60t96
## 3/3: Creating significant table up_deseq_t96
## Writing excel data according to ebseq for t60: 1/15.
## After (adj)p filter, the up genes table has 2437 genes.
## After (adj)p filter, the down genes table has 3446 genes.
## After fold change filter, the up genes table has 1063 genes.
## After fold change filter, the down genes table has 107 genes.
## Writing excel data according to ebseq for t60t96: 2/15.
## After (adj)p filter, the up genes table has 3510 genes.
## After (adj)p filter, the down genes table has 4463 genes.
## After fold change filter, the up genes table has 984 genes.
## After fold change filter, the down genes table has 1361 genes.
## Writing excel data according to ebseq for t96: 3/15.
## After (adj)p filter, the up genes table has 4763 genes.
## After (adj)p filter, the down genes table has 4768 genes.
## After fold change filter, the up genes table has 2285 genes.
## After fold change filter, the down genes table has 1793 genes.
## Printing significant genes to the file: condition_sig.xlsx
## 1/3: Creating significant table up_ebseq_t60
## 2/3: Creating significant table up_ebseq_t60t96
## 3/3: Creating significant table up_ebseq_t96
## Writing excel data according to basic for t60: 1/15.
## After (adj)p filter, the up genes table has 1880 genes.
## After (adj)p filter, the down genes table has 2929 genes.
## After fold change filter, the up genes table has 615 genes.
## After fold change filter, the down genes table has 201 genes.
## Writing excel data according to basic for t60t96: 2/15.
## After (adj)p filter, the up genes table has 4047 genes.
## After (adj)p filter, the down genes table has 4025 genes.
## After fold change filter, the up genes table has 1318 genes.
## After fold change filter, the down genes table has 1280 genes.
## Writing excel data according to basic for t96: 3/15.
## After (adj)p filter, the up genes table has 4398 genes.
## After (adj)p filter, the down genes table has 4977 genes.
## After fold change filter, the up genes table has 1838 genes.
## After fold change filter, the down genes table has 2020 genes.
## Printing significant genes to the file: condition_sig.xlsx
## 1/3: Creating significant table up_basic_t60
## 2/3: Creating significant table up_basic_t60t96
## 3/3: Creating significant table up_basic_t96
## Adding significance bar plots.
a60_ups <- sig_genes[["deseq"]][["ups"]][["t60"]]
a60_downs <- sig_genes[["deseq"]][["downs"]][["t60"]]
a60_ont_ups <- simple_gprofiler(a60_ups)
## Performing gProfiler GO search of 1329 genes against hsapiens.
## GO search found 504 hits.
## Performing gProfiler KEGG search of 1329 genes against hsapiens.
## KEGG search found 32 hits.
## Performing gProfiler REAC search of 1329 genes against hsapiens.
## REAC search found 34 hits.
## Performing gProfiler MI search of 1329 genes against hsapiens.
## MI search found 0 hits.
## Performing gProfiler TF search of 1329 genes against hsapiens.
## TF search found 105 hits.
## Performing gProfiler CORUM search of 1329 genes against hsapiens.
## CORUM search found 2 hits.
## Performing gProfiler HP search of 1329 genes against hsapiens.
## HP search found 11 hits.
## Length Class Mode
## go 14 data.frame list
## kegg 14 data.frame list
## reac 14 data.frame list
## mi 14 data.frame list
## tf 14 data.frame list
## corum 14 data.frame list
## hp 14 data.frame list
## input 56 data.frame list
## pvalue_plots 18 -none- list
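The next block of gProfiler messages is for the 106 down genes; presumably the analogous call, with the variable name being an assumption:

## Presumed analogous search for the down genes (106 of them, matching the
## deseq t60 down table above).
a60_ont_downs <- simple_gprofiler(a60_downs)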
## Performing gProfiler GO search of 106 genes against hsapiens.
## GO search found 21 hits.
## Performing gProfiler KEGG search of 106 genes against hsapiens.
## KEGG search found 2 hits.
## Performing gProfiler REAC search of 106 genes against hsapiens.
## REAC search found 22 hits.
## Performing gProfiler MI search of 106 genes against hsapiens.
## MI search found 0 hits.
## Performing gProfiler TF search of 106 genes against hsapiens.
## TF search found 0 hits.
## Performing gProfiler CORUM search of 106 genes against hsapiens.
## CORUM search found 1 hits.
## Performing gProfiler HP search of 106 genes against hsapiens.
## HP search found 9 hits.
I will leave it as an exercise to the reader to find out where the plots are for the ontology searches.