1 Perform all S. cerevisiae analyses in a single report.

This document should evaluate and combine all the analyses used for our Saccharomyces RNAseq data.

The primary thing to note for those paying attention, is the skip_load variable above, its existence is checked in all of the following documents. If it is found, then they will not try to load any of my checkpoint saves and just continue running. Thus I can make 100% certain that no cross-analysis contamination is happening.

2 A couple conventions:

While I am writing down my thoughts, I should make explicit a couple of things I repeatedly do which might look weird:

pp. (printpicture) This is a small function which sets hopefully useful defaults when printing/saving images. I find that keeping track of R’s device list via dev.new()/dev.off() confusing. Thus if I give pp() a file name and a picture object, it will call for me the appropriate graphics device with options sufficient to give a presentable picture and remember to do the dev.off() at the appropriate time, which I often seem to forget.
tt. This is a trash variable which I reuse throughout the analyses.
sm. This is a short silencer function. A lot of my functions are very chatty so that I can watch their output for problems, but once I am confident that a function will work properly, I often wrap it in sm().

3 Collect annotation data

This first document uses data from yeastgenome.org via biomart, the sqlite annotation data from bioconductor, and IDs/annotations from the gff file used during the preprocessing steps in order to create a combined table of annotations. Along the way it also collects GO IDs and gene lengths (assuming we decide to use goseq later).

At the end, it creates an expressionset using the sample sheet and this annotation data, this expressionset is the base for all future analyses.

4 Annotation version: 20180310

4.1 Genome annotation input

There are a few methods of importing annotation data into R. I will attempt some of them in preparation for loading them into the S.cerevisiae RNASeq data.

4.1.1 AnnotationHub: loading OrgDb

AnnotationHub is a newer service and has promise to be an excellent top-level resource for gathering annotation data. Its organization is peculiar, one connects to the database with AnnotationHub() and downloads lists of sqlite databases via query(). The primary problem with AnnotationHub is that it seems difficult to keep up with the changes in their database – so one must double-check the downloaded data to be sure that it is still actually data and not just a message saying ‘Public’.

Once it does download data, one may use normal orgdb functions to query it.

tmp <- sm(library(AnnotationHub))
ah = sm(AnnotationHub())
orgdbs <- sm(query(ah, "OrgDb"))
sc_orgdb <- sm(query(ah, c("OrgDB", "Saccharomyces"))) ##   AH49589 | org.Sc.sgd.db.sqlite
sc_orgdb

## AnnotationHub with 7 records
## # snapshotDate(): 2017-10-27 
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Saccharomyces cerevisiae, Saccharomyces eubayanus, Schizosa...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH57980"]]' 
## 
##             title                                              
##   AH57980 | org.Sc.sgd.db.sqlite                               
##   AH59735 | org.Schizosaccharomyces_pombe.eg.sqlite            
##   AH59859 | org.Saccharomyces_eubayanus.eg.sqlite              
##   AH59874 | org.Schizosaccharomyces_cryophilus_OY26.eg.sqlite  
##   AH59893 | org.Schizosaccharomyces_octosporus_yFS286.eg.sqlite
##   AH59899 | org.Zygosaccharomyces_rouxii.eg.sqlite             
##   AH59913 | org.Schizosaccharomyces_japonicus_yFS275.eg.sqlite

sc_orgdb <- ah[["AH57980"]]

## loading from cache '/home/trey//.AnnotationHub/64726'

sc_orgdb

## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: YEAST_DB
## | ORGANISM: Saccharomyces cerevisiae
## | SPECIES: Yeast
## | YGSOURCENAME: Yeast Genome
## | YGSOURCEURL: http://downloads.yeastgenome.org/
## | YGSOURCEDATE: 14-Jan-2017
## | CENTRALID: ORF
## | TAXID: 559292
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
## | GOSOURCEDATE: 2017-Nov01
## | EGSOURCEDATE: 2017-Nov6
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | ENSOURCEDATE: 2017-Aug23
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.UniProt.org/
## | UPSOURCEDATE: Tue Nov  7 21:11:11 2017

## 
## Please see: help('select') for usage information

## Holy crap it worked!
sc_annotv1 <- load_orgdb_annotations(
  sc_orgdb,
  fields=c("alias", "description", "entrezid", "genename", "sgd"))

## Unable to find TYPE in the db, removing it.

## Unable to find CHR in the db, removing it.

## Unable to find TXSTRAND in the db, removing it.

## Unable to find TXSTART in the db, removing it.

## Unable to find TXEND in the db, removing it.

## Extracted all gene ids.

## 'select()' returned 1:many mapping between keys and columns

sc_annotv1 <- sc_annotv1[["genes"]]

4.1.2 TxDb

In yeast, the transcript database is not super-useful, given that there are only 80 some genes wit introns, but it does provide a quick way to get transcript lengths.

tt <- please_install("TxDb.Scerevisiae.UCSC.sacCer3.sgdGene")
tmp <- sm(library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene))
sc_txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene

5 Loading a genome

There is a non-zero chance we will want to use the actual genome sequence along with these annotations. The BSGenome packages provide that functionality.

tt <- sm(please_install("BSgenome.Scerevisiae.UCSC.sacCer3"))

6 Loading from biomart

A completely separate and competing annotation source is biomart. Biomart provides programmatic access to the tremendous quantity of ensembl data. Their data is organized in some weird ways, including tables with thousands of available columns; but man is it awesome to have so much available stuff to sift through and learn about.

sc_annotv2 <- sm(load_biomart_annotations("scerevisiae"))
sc_annotv2 <- sc_annotv2[["annotation"]]
head(sc_annotv2)

##           transcriptID   geneID
## X15S_rRNA     15S_rRNA 15S_rRNA
## X21S_rRNA     21S_rRNA 21S_rRNA
## HRA1              HRA1     HRA1
## ICR1              ICR1     ICR1
## LSR1              LSR1     LSR1
## NME1              NME1     NME1
##                                                                                                                                                                                                                                                                                    Description
## X15S_rRNA                                                                                                            Ribosomal RNA of the small mitochondrial ribosomal subunit; MSU1 allele suppresses ochre stop mutations in mitochondrial protein-coding genes [Source:SGD;Acc:S000007287]
## X21S_rRNA                                                                                                                                                                                       Mitochondrial 21S rRNA; intron encodes the I-SceI DNA endonuclease [Source:SGD;Acc:S000007288]
## HRA1                                                                                                         Non-protein-coding RNA; substrate of RNase P, possibly involved in rRNA processing, specifically maturation of 20S precursor into the mature 18S rRNA [Source:SGD;Acc:S000119380]
## ICR1      Long intergenic regulatory ncRNA; has a key role in regulating transcription of the nearby protein-coding ORF FLO11; initiated far upstream from FLO11 and transcribed across much of the large promoter of FLO11, repressing FLO11 transcription in cis [Source:SGD;Acc:S000132612]
## LSR1           U2 spliceosomal RNA (U2 snRNA), component of the spliceosome; pairs with the branchpoint sequence; functionally equivalent to mammalian U2 snRNA; stress-induced pseudouridylations at positions 56 and 93 may contribute to regulation of splicing [Source:SGD;Acc:S000006478]
## NME1                                                    RNA component of RNase MRP; RNase MRP cleaves pre-rRNA and has a role in cell cycle-regulated degradation of daughter cell-specific mRNAs; human ortholog is implicated in cartilage-hair hypoplasia (CHH) [Source:SGD;Acc:S000007436]
##             Type length chromosome strand  start    end
## X15S_rRNA   rRNA     NA       Mito      1   6546   8194
## X21S_rRNA   rRNA     NA       Mito      1  58009  62447
## HRA1       ncRNA     NA          I      1  99305  99868
## ICR1       ncRNA     NA         IX     -1 393884 397082
## LSR1       snRNA     NA         II     -1 680688 681862
## NME1      snoRNA     NA        XIV      1 585587 585926

sc_ontology <- sm(load_biomart_go("scerevisiae"))
sc_ontology <- sc_ontology[["go"]]
head(sc_ontology)

##        ID         GO
## 1 YHR055C GO:0046872
## 2 YHR055C GO:0005829
## 3 YHR055C GO:0016209
## 4 YHR055C GO:0004784
## 5 YHR055C GO:0019430
## 6 YHR055C GO:0005507

7 Read a gff file

In contrast, it is possible to load most annotations of interest directly from the gff files used in the alignments. The main problem with gff data is that the format is incredibly inconsistent; but it is often the most direct way to go from the IDs of the genome to some immediately useful data.

## The old way of getting genome/annotation data
sc_gff <- "reference/scerevisiae.gff.gz"
sc_gff_annotations <- load_gff_annotations(sc_gff, type="gene")

## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=TRUE)

## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo=FALSE)

## Trying attempt: rtracklayer::import.gff2(gff, sequenceRegionsAsSeqinfo=TRUE)

## Had a successful gff import with rtracklayer::import.gff2(gff, sequenceRegionsAsSeqinfo=TRUE)

## Returning a df with 18 columns and 7050 rows.

rownames(sc_gff_annotations) <- make.names(sc_gff_annotations$transcript_name, unique=TRUE)
head(sc_gff_annotations)

##           seqnames start   end width strand         source type score
## YAL069W          I   335   646   312      + protein_coding gene    NA
## YAL068W.A        I   538   789   252      + protein_coding gene    NA
## PAU8             I  1810  2169   360      - protein_coding gene    NA
## YAL067W.A        I  2480  2704   225      + protein_coding gene    NA
## SEO1             I  7238  9016  1779      - protein_coding gene    NA
## YAL066W          I 10091 10396   306      + protein_coding gene    NA
##           phase exon_number   gene_id        ID  p_id protein_id
## YAL069W       0           1   YAL069W   YAL069W P3633    YAL069W
## YAL068W.A     0           1 YAL068W-A YAL068W-A P5377  YAL068W-A
## PAU8          0           1   YAL068C      PAU8 P6023    YAL068C
## YAL067W.A     0           1 YAL067W-A YAL067W-A P4547  YAL067W-A
## SEO1          0           1   YAL067C      SEO1 P5747    YAL067C
## YAL066W       0           1   YAL066W   YAL066W P1766    YAL066W
##           transcript_id transcript_name  tss_id seqedit
## YAL069W         YAL069W         YAL069W TSS1128    <NA>
## YAL068W.A     YAL068W-A       YAL068W-A TSS5439    <NA>
## PAU8            YAL068C            PAU8  TSS249    <NA>
## YAL067W.A     YAL067W-A       YAL067W-A TSS1248    <NA>
## SEO1            YAL067C            SEO1 TSS5464    <NA>
## YAL066W         YAL066W         YAL066W TSS2674    <NA>

8 Putting the pieces together

In the following block we create an expressionset using the sample sheet and the annotations.

Annoyingly, the gff annotations are keyed in a peculiar fashion. Therefore I need to do a little work to merge them.

In the following block, I spend a little time setting up locations by chromosome/start/end and using those to merge the gff data and biomart data, thus solving the problem of inconsistent IDs. It is worth noting that I split this process between the genes on the + strand and those on the - strand because the definitions of ‘beginning of gene’ and ‘start’ mean different things: ‘beginning of gene’ refers to the location with respect to the start codon and is used by biomart, ‘start’ refers to the location with respect to the beginning of the chromosome and is used by the gff data.

## Start by making locations for the biomart data
sc_annotv2[["fwd_location"]] <- paste0(sc_annotv2[["chromosome"]], "_", sc_annotv2[["start"]])
sc_annotv2[["rev_location"]] <- paste0(sc_annotv2[["chromosome"]], "_", sc_annotv2[["end"]])
## Do the same for the gff annotations
sc_gff_annotations[["fwd_location"]] <- paste0(sc_gff_annotations[["seqnames"]], "_",
                                               sc_gff_annotations[["start"]])
sc_gff_annotations[["rev_location"]] <- paste0(sc_gff_annotations[["seqnames"]], "_",
                                               sc_gff_annotations[["end"]])
sc_gff_annotations[["gff_rowname"]] <- rownames(sc_gff_annotations)
## Now merge them.
sc_fwd_annotations <- merge(sc_annotv2, sc_gff_annotations, by="fwd_location")
sc_rev_annotations <- merge(sc_annotv2, sc_gff_annotations, by="rev_location")
colnames(sc_fwd_annotations) <- c("location","transcriptID","geneID", "Description",
                                  "Type", "length", "chromosome", "strand.x", "start.x",
                                  "end.x", "location.x", "seqnames",
                                  "start.y", "end.y", "width", "strand.y", "source", "type",
                                  "score", "phase", "exon_number", "gene_id", "ID", "p_id",
                                  "protein_id", "transcript_id", "transcript_name", "tss_id",
                                  "seqedit", "location.y", "gff_rowname")
colnames(sc_rev_annotations) <- colnames(sc_fwd_annotations)
sc_all_annotations <- rbind(sc_fwd_annotations, sc_rev_annotations)
rownames(sc_all_annotations) <- make.names(sc_all_annotations[["gff_rowname"]], unique=TRUE)
sc_all_annotations <- sc_all_annotations[, c("transcriptID", "geneID", "Description", "Type",
                                             "length", "chromosome", "strand.x", "start.x", "end.x",
                                             "tss_id")]
colnames(sc_all_annotations) <- c("transcriptID", "geneID", "Description", "Type", "length",
                                  "chromosome", "strand", "start", "end", "tss_id")
sc_all_annotations[["location"]] <- paste0(sc_all_annotations[["chromosome"]], "_",
                                           sc_all_annotations[["start"]], "_",
                                           sc_all_annotations[["end"]])

8.0.1 Make the expressionset

The function ’create_expt() gathers the annotation data, metadata, and counts, and invokes the various R functions to create an expressionset. This function is more complex than it should be, primarily because I go to some effort to accept a large array of inputs including: raw data frames of counts vs. sets of filenames of counts/hdf5/tpm/etc, data frames of metadata vs. excel sheets with varying columns, and arbitrary annotation data which may include all, some, or none of the included genes.

sc2_expt <- create_expt(
  metadata="sample_sheets/all_samples.xlsx",
  gene_info=sc_all_annotations,
  file_column="bt2file")

## Reading the sample metadata.

## The sample definitions comprises: 28, 19 rows, columns.

## Reading count tables.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0774/outputs/bowtie2_scerevisiae/hpgl0774_forward-trimmed.count.xz contains 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0775/outputs/bowtie2_scerevisiae/hpgl0775_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0776/outputs/bowtie2_scerevisiae/hpgl0776_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0777/outputs/bowtie2_scerevisiae/hpgl0777_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0778/outputs/bowtie2_scerevisiae/hpgl0778_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0779/outputs/bowtie2_scerevisiae/hpgl0779_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0780/outputs/bowtie2_scerevisiae/hpgl0780_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0781/outputs/bowtie2_scerevisiae/hpgl0781_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0782/outputs/bowtie2_scerevisiae/hpgl0782_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0783/outputs/bowtie2_scerevisiae/hpgl0783_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0784/outputs/bowtie2_scerevisiae/hpgl0784_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0785/outputs/bowtie2_scerevisiae/hpgl0785_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0786/outputs/bowtie2_scerevisiae/hpgl0786_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0787/outputs/bowtie2_scerevisiae/hpgl0787_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0788/outputs/bowtie2_scerevisiae/hpgl0788_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## /cbcb/nelsayed-scratch/atb/rnaseq/scerevisiae_cbf5_2017/preprocessing/v2/hpgl0789/outputs/bowtie2_scerevisiae/hpgl0789_forward-trimmed.count.xz contains 7131 rows and merges to 7131 rows.

## Finished reading count tables.

## Matched 6540 annotations and counts.

## Bringing together the count matrix and gene information.

## Some annotations were lost in merging, setting them to 'undefined'.

knitr::kable(head(exprs(sc2_expt$expressionset)))

	hpgl0774	hpgl0775	hpgl0776	hpgl0777	hpgl0778	hpgl0779	hpgl0780	hpgl0781	hpgl0782	hpgl0783	hpgl0784	hpgl0785	hpgl0786	hpgl0787	hpgl0788	hpgl0789
X15S_rRNA	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
X21S_rRNA	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
AAC1	536	477	743	443	634	188	763	414	175	145	140	237	124	142	141	181
AAC3	126	216	93	765	152	154	102	738	295	119	341	542	210	118	438	1071
AAD10	1784	1928	2327	3869	2172	994	2472	3551	365	589	1476	1593	352	542	1782	2082
AAD14	1054	901	1222	1863	1106	836	1307	1588	542	766	1580	1814	439	795	1924	2333

knitr::kable(head(fData(sc2_expt$expressionset)))

	transcriptID	geneID	Description	Type	length	chromosome	strand	start	end	tss_id	location
X15S_rRNA	undefined	undefined	undefined	undefined	undefined	undefined	undefined	undefined	undefined	undefined	undefined
X21S_rRNA	undefined	undefined	undefined	undefined	undefined	undefined	undefined	undefined	undefined	undefined	undefined
AAC1	YMR056C	YMR056C	Mitochondrial inner membrane ADP/ATP translocator; exchanges cytosolic ADP for mitochondrially synthesized ATP; phosphorylated; Aac1p is a minor isoform while Pet9p is the major ADP/ATP translocator; relocalizes from mitochondrion to cytoplasm upon DNA replication stress [Source:SGD;Acc:S000004660]	protein_coding	930	XIII	-1	387315	388244	TSS5132	XIII_387315_388244
AAC3	YBR085W	YBR085W	Mitochondrial inner membrane ADP/ATP translocator; exchanges cytosolic ADP for mitochondrially synthesized ATP; expressed under anaerobic conditions; similar to Aac1p; has roles in maintenance of viability and in respiration; AAC3 has a paralog, PET9, that arose from the whole genome duplication [Source:SGD;Acc:S000000289]	protein_coding	924	II	1	415983	416906	TSS1609	II_415983_416906
AAD10	YJR155W	YJR155W	Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000003916]	protein_coding	867	X	1	727405	728271	TSS5024	X_727405_728271
AAD14	YNL331C	YNL331C	Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000005275]	protein_coding	1131	XIV	-1	16118	17248	TSS6941	XIV_16118_17248

knitr::kable(head(pData(sc2_expt$expressionset)))

	sampleid	strain	condition	batch	originalbatch	tube	cbf5igv	upf1igv	incubationtime	genotype	conc	bttotalreads	bttotalmapped	btleftmapped	btrightmapped	bowtiefile	bt2file	intronfile	allfile	file
hpgl0774	hpgl0774	yJD1524	WT	E2B1	E2B1	f	wt	wt	18h	wt ade2-1 can1-100 his3-11 leu2-3, 112 trp1-1 ura3-1 cbf5::TRP1 + CBF5 on pRS313	NA	NA	NA	NA	NA	NA	preprocessing/v2/hpgl0774/outputs/bowtie2_scerevisiae/hpgl0774_forward-trimmed.count.xz	preprocessing/v2/hpgl0774/outputs/bowtie2_scerevisiae/introns.count.xz	preprocessing/v2/hpgl0774/outputs/bowtie2_scerevisiae/hpgl0774_forward-trimmed.count.xz	null
hpgl0775	hpgl0775	yJD1525	cbf5_D95A	E2B1	E2B1	f	mut	wt	18h	d95a ade2-1 can1-100 his3-11 leu2-3, 112 trp1-1 ura3-1 cbf5::TRP1 + CBF5 D95A on pRS313	NA	NA	NA	NA	NA	NA	preprocessing/v2/hpgl0775/outputs/bowtie2_scerevisiae/hpgl0775_forward-trimmed.count.xz	preprocessing/v2/hpgl0775/outputs/bowtie2_scerevisiae/introns.count.xz	preprocessing/v2/hpgl0775/outputs/bowtie2_scerevisiae/hpgl0775_forward-trimmed.count.xz	null
hpgl0776	hpgl0776	yJD1745	upf1d	E2B1	E2B1	f	wt	mut	18h	wt ade2-1 can1-100 his3-11 leu2-3, 112 trp1-1 ura3-1 cbf5::TRP1 upf1::LEU2 + CBF5 on pRS313 (yJD1524 upf1Δ)	NA	NA	NA	NA	NA	NA	preprocessing/v2/hpgl0776/outputs/bowtie2_scerevisiae/hpgl0776_forward-trimmed.count.xz	preprocessing/v2/hpgl0776/outputs/bowtie2_scerevisiae/introns.count.xz	preprocessing/v2/hpgl0776/outputs/bowtie2_scerevisiae/hpgl0776_forward-trimmed.count.xz	null
hpgl0777	hpgl0777	yJD1746	cbf5_D95A upf1d	E2B1	E2B1	f	mut	mut	18h	d95a ade2-1 can1-100 his3-11 leu2-3, 112 trp1-1 ura3-1 cbf5::TRP1 upf1::LEU2 + CBF5 D95A on pRS313 (yJD1525 upf1Δ)	NA	NA	NA	NA	NA	NA	preprocessing/v2/hpgl0777/outputs/bowtie2_scerevisiae/hpgl0777_forward-trimmed.count.xz	preprocessing/v2/hpgl0777/outputs/bowtie2_scerevisiae/introns.count.xz	preprocessing/v2/hpgl0777/outputs/bowtie2_scerevisiae/hpgl0777_forward-trimmed.count.xz	null
hpgl0778	hpgl0778	yJD1524	WT	E2B1	E2B2	g	wt	wt	18h	wt ade2-1 can1-100 his3-11 leu2-3, 112 trp1-1 ura3-1 cbf5::TRP1 + CBF5 on pRS313	NA	NA	NA	NA	NA	NA	preprocessing/v2/hpgl0778/outputs/bowtie2_scerevisiae/hpgl0778_forward-trimmed.count.xz	preprocessing/v2/hpgl0778/outputs/bowtie2_scerevisiae/introns.count.xz	preprocessing/v2/hpgl0778/outputs/bowtie2_scerevisiae/hpgl0778_forward-trimmed.count.xz	null
hpgl0779	hpgl0779	yJD1525	cbf5_D95A	E2B1	E2B2	g	mut	wt	18h	d95a ade2-1 can1-100 his3-11 leu2-3, 112 trp1-1 ura3-1 cbf5::TRP1 + CBF5 D95A on pRS313	NA	NA	NA	NA	NA	NA	preprocessing/v2/hpgl0779/outputs/bowtie2_scerevisiae/hpgl0779_forward-trimmed.count.xz	preprocessing/v2/hpgl0779/outputs/bowtie2_scerevisiae/introns.count.xz	preprocessing/v2/hpgl0779/outputs/bowtie2_scerevisiae/hpgl0779_forward-trimmed.count.xz	null

if (!isTRUE(get0("skip_load"))) {
  pander::pander(sessionInfo())
  message(paste0("This is hpgltools commit: ", get_git_commit()))
  this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
  tmp <- sm(saveme(filename=this_save))
}

9 Sample Estimation (Lots of plots!)

The following document performs the most subjective portion of the analysis. It attempts to present the data so that we can see its strengths and weakesses. In pretty much every analysis (and this one), the primary weakness is batch effect. Thus it performs a series of batch/surrogate estimations and re-plots to see if any patterns emerge which ‘make sense’. That of course is where the subjective danger rears its head.

10 S. cerevisiae sample estimation of all samples (old and new)

This document should make clear the suitability of our yeast data for differential expression analyses. It should also give some ideas about the depth and distribution of the data.

11 Gathering samples

Later, we will likely choose to exclude one of the three experimental batches in the data. Therefore, I am making three expressionsets, one with all data, and one the various subsets.

merged_expt <- sm(create_expt(metadata="sample_sheets/all_samples.xlsx",
                              gene_info=sc_all_annotations,
                              file_column="allfile"))
## Don't forget, I need to change the condition names.
only_old <- subset_expt(merged_expt, subset="batch=='E1'")
merged_new <- subset_expt(merged_expt, subset="batch!='E1'")
merged_nor <- subset_expt(merged_expt, subset="batch!='E2B1'")
merged_nos <- subset_expt(merged_expt, subset="batch!='E2B2'")

12 Visualizing raw data

There are lots of methods we have to examine raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’, when invoked it provides a list of plots including: library sizes, the number of non-zero genes, density/box plots by sample, distance/correlation heatmaps, standard median correlation/distance, PCA/TSNE clustering, top-n genes, a legend describing the colors/symbols used, and some tables describing the data. Optionally, it can also perform quantile-quantile plots showing the distribution of each sample vs. the median of all samples and MA plots of the same.

Caveat: some plots do not work well with gene IDs that are all-0, thus I first filter the data to remove them.

I also added a neat function ‘plot_libsize_prepost()’ which shows how many genes are poorly represented before/after filtering the data.

The plots printed here are all metrics which are useful when considering raw data.

merged_filt <- sm(normalize_expt(merged_expt, filter=TRUE))
merged_metrics <- sm(graph_metrics(merged_expt))

pp(file="illustrator_input/01_legend.pdf", image=merged_metrics$legend)

## Wrote the image to: illustrator_input/01_legend.pdf

pp(file="illustrator_input/02_raw_libsize.pdf", image=merged_metrics$libsize)

## Wrote the image to: illustrator_input/02_raw_libsize.pdf

prepost <- plot_libsize_prepost(merged_expt)
pp(file="illustrator_input/03_libsize_changed_lowgenes.pdf", image=prepost$lowgene_plot)

## Wrote the image to: illustrator_input/03_libsize_changed_lowgenes.pdf

pp(file="illustrator_input/04_nonzero_genes.pdf", image=merged_metrics$nonzero)

## Wrote the image to: illustrator_input/04_nonzero_genes.pdf

pp(file="illustrator_input/05_raw_boxplot.pdf", image=merged_metrics$boxplot)

## Wrote the image to: illustrator_input/05_raw_boxplot.pdf

pp(file="illustrator_input/06_raw_density.pdf", image=merged_metrics$density)

## Wrote the image to: illustrator_input/06_raw_density.pdf

pp(file="illustrator_input/07_raw_boxplot.pdf", image=merged_metrics$boxplot)

## Wrote the image to: illustrator_input/07_raw_boxplot.pdf

pp(file="illustrator_input/08_topn.pdf", image=merged_metrics$topnplot)

## Wrote the image to: illustrator_input/08_topn.pdf

13 Normalize and visualize

Other metrics are more useful when used with data on the log scale and normalized by number of reads/library and/or by quantile.

merged_norm <- normalize_expt(merged_expt, transform="log2", convert="cpm", filter=TRUE)

## This function will replace the expt$expressionset slot with:

## log2(cpm(hpgl(data)))

## It backs up the current data into a slot named:
##  expt$backup_expressionset. It will also save copies of each step along the way
##  in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
##  when invoking limma.  The appropriate libsize is the non-log(cpm(normalized)).
##  This is most likely kept at:
##  'new_expt$normalized$intermediate_counts$normalization$libsizes'
##  A copy of this may also be found at:
##  new_expt$best_libsize

## Leaving the data unnormalized.  This is necessary for DESeq, but
##  EdgeR/limma might benefit from normalization.  Good choices include quantile,
##  size-factor, tmm, etc.

## Not correcting the count-data for batch effects.  If batch is
##  included in EdgerR/limma's model, then this is probably wise; but in extreme
##  batch effects this is a good parameter to play with.

## Step 1: performing count filter with option: hpgl

## Removing 1030 low-count genes (6095 remaining).

## Step 2: not normalizing the data.

## Step 3: converting the data with cpm.

## Step 4: transforming the data with log2.

## transform_counts: Found 1152 values equal to 0, adding 1 to the matrix.

## Step 5: not doing batch correction.

merged_metrics <- graph_metrics(merged_norm)

## Graphing number of non-zero genes with respect to CPM by library.

## Graphing library sizes.

## Graphing a boxplot.

## Graphing a correlation heatmap.

## Graphing a standard median correlation.

## Performing correlation.

## Graphing a distance heatmap.

## Graphing a standard median distance.

## Performing distance.

## Graphing a PCA plot.

## Graphing a T-SNE plot.

## Plotting a density plot.

## Plotting the representation of the top-n genes.

## Printing a color to condition legend.

The data should now be normalized, lets view some metrics post-facto.

pp(file="illustrator_input/09_norm_corheat.pdf", image=merged_metrics$corheat)

## Wrote the image to: illustrator_input/09_norm_corheat.pdf

pp(file="illustrator_input/10_norm_disheat.pdf", image=merged_metrics$disheat)

## Wrote the image to: illustrator_input/10_norm_disheat.pdf

## It appears that just the normalization is sufficient to split the samples completely by type and deeply separate them from the heterologous samples

pp(file="illustrator_input/11_norm_smc.pdf", image=merged_metrics$smc)

## Wrote the image to: illustrator_input/11_norm_smc.pdf

pp(file="illustrator_input/12_norm_smd.pdf", image=merged_metrics$smd)

## Wrote the image to: illustrator_input/12_norm_smd.pdf

## The samples are very well behaved, none fall below the red line.

pp(file="illustrator_input/13_norm_pca.pdf", image=merged_metrics$pcaplot)

## Wrote the image to: illustrator_input/13_norm_pca.pdf

pp(file="illustrator_input/14_norm_tsne.pdf", image=merged_metrics$tsneplot)

## Wrote the image to: illustrator_input/14_norm_tsne.pdf

## The homogeneous wt/mutant are nicely separated, and what is more, the exogeneous samples also split wt/mutant, that might prove to be quite useful.

13.1 Print some variance partition information

This is an interesting aside which came up last week for some other data. It might be good to include the % variance correlated with ‘condition’ as a column in the annotation data for the expressionset. Thus, when we do the differential expression analysis later, we can look and see if a ‘significant’ gene has variance which is actually correlated with condition.

Since writing these analyses, I implemented this idea and so am including it here.

vp <- varpart(merged_expt, predictor=NULL, factors=c("condition", "batch"))

## Attempting mixed linear model with: ~  (1|condition) + (1|batch)

## Fitting the expressionset to the model, this is slow.

## Projected run time: ~ 0.7 min

## Placing factor: condition at the beginning of the model.

pp(file="illustrator_input/15_merged_varpart_partition.pdf", image=vp$partition_plot)

## Wrote the image to: illustrator_input/15_merged_varpart_partition.pdf

## Check out the last two columns!
head(fData(vp$modified_expt))

##       transcriptID  geneID
## AAC1       YMR056C YMR056C
## AAC3       YBR085W YBR085W
## AAD10      YJR155W YJR155W
## AAD14      YNL331C YNL331C
## AAD15      YOL165C YOL165C
## AAD16      YFL057C YFL057C
##                                                                                                                                                                                                                                                                                                                                                                                                                                                Description
## AAC1                                                                                                                                           Mitochondrial inner membrane ADP/ATP translocator; exchanges cytosolic ADP for mitochondrially synthesized ATP; phosphorylated; Aac1p is a minor isoform while Pet9p is the major ADP/ATP translocator; relocalizes from mitochondrion to cytoplasm upon DNA replication stress [Source:SGD;Acc:S000004660]
## AAC3                                                                                                                  Mitochondrial inner membrane ADP/ATP translocator; exchanges cytosolic ADP for mitochondrially synthesized ATP; expressed under anaerobic conditions; similar to Aac1p; has roles in maintenance of viability and in respiration; AAC3 has a paralog, PET9, that arose from the whole genome duplication [Source:SGD;Acc:S000000289]
## AAD10                                                                     Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000003916]
## AAD14                                                                     Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000005275]
## AAD15 Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; AAD15 has a paralog, AAD3, that arose from a segmental duplication; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000005525]
## AAD16                                    Putative aryl alcohol dehydrogenase; similar to Phanerochaete chrysosporium aryl alcohol dehydrogenase; ORFs AAD6/YFL056C and AAD16/YFL057C are displaced from one another by -1 frameshift; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000001837]
##                 Type length chromosome strand  start    end  tss_id
## AAC1  protein_coding    930       XIII     -1 387315 388244 TSS5132
## AAC3  protein_coding    924         II      1 415983 416906 TSS1609
## AAD10 protein_coding    867          X      1 727405 728271 TSS5024
## AAD14 protein_coding   1131        XIV     -1  16118  17248 TSS6941
## AAD15 protein_coding    432         XV     -1   1647   2078  TSS108
## AAD16 protein_coding    459         VI     -1  14305  14763 TSS2145
##                 location variance_batch variance_condition
## AAC1  XIII_387315_388244        0.60247             0.0000
## AAC3    II_415983_416906        0.01024             0.7903
## AAD10    X_727405_728271        0.44833             0.4727
## AAD14    XIV_16118_17248        0.18558             0.6522
## AAD15       XV_1647_2078        0.02458             0.8443
## AAD16     VI_14305_14763        0.06128             0.8080
##       variance_Residuals
## AAC1             0.39753
## AAC3             0.19950
## AAD10            0.07901
## AAD14            0.16217
## AAD15            0.13115
## AAD16            0.13069

merged_sva <- sm(normalize_expt(merged_expt, transform="log2",
                                filter=TRUE, batch="sva", low_to_zero=TRUE))
vpsva <- varpart(merged_sva, predictor=NULL, factors=c("condition", "batch"))

## Attempting mixed linear model with: ~  (1|condition) + (1|batch) 
## Fitting the expressionset to the model, this is slow.

## Projected run time: ~ 0.7 min

## Placing factor: condition at the beginning of the model.

pp(file="illustrator_input/16_merged_varpart_sva_partition.pdf", image=vpsva$partition_plot)

## Wrote the image to: illustrator_input/16_merged_varpart_sva_partition.pdf

head(fData(vpsva$modified_expt))

##       transcriptID  geneID
## AAC1       YMR056C YMR056C
## AAC3       YBR085W YBR085W
## AAD10      YJR155W YJR155W
## AAD14      YNL331C YNL331C
## AAD15      YOL165C YOL165C
## AAD16      YFL057C YFL057C
##                                                                                                                                                                                                                                                                                                                                                                                                                                                Description
## AAC1                                                                                                                                           Mitochondrial inner membrane ADP/ATP translocator; exchanges cytosolic ADP for mitochondrially synthesized ATP; phosphorylated; Aac1p is a minor isoform while Pet9p is the major ADP/ATP translocator; relocalizes from mitochondrion to cytoplasm upon DNA replication stress [Source:SGD;Acc:S000004660]
## AAC3                                                                                                                  Mitochondrial inner membrane ADP/ATP translocator; exchanges cytosolic ADP for mitochondrially synthesized ATP; expressed under anaerobic conditions; similar to Aac1p; has roles in maintenance of viability and in respiration; AAC3 has a paralog, PET9, that arose from the whole genome duplication [Source:SGD;Acc:S000000289]
## AAD10                                                                     Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000003916]
## AAD14                                                                     Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000005275]
## AAD15 Putative aryl-alcohol dehydrogenase; similar to P. chrysosporium aryl-alcohol dehydrogenase; mutational analysis has not yet revealed a physiological role; AAD15 has a paralog, AAD3, that arose from a segmental duplication; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000005525]
## AAD16                                    Putative aryl alcohol dehydrogenase; similar to Phanerochaete chrysosporium aryl alcohol dehydrogenase; ORFs AAD6/YFL056C and AAD16/YFL057C are displaced from one another by -1 frameshift; members of the AAD gene family comprise three pairs (AAD3 + AAD15, AAD6/AAD16 + AAD4, AAD10 + AAD14) whose two genes are more related to one another than to other members of the family [Source:SGD;Acc:S000001837]
##                 Type length chromosome strand  start    end  tss_id
## AAC1  protein_coding    930       XIII     -1 387315 388244 TSS5132
## AAC3  protein_coding    924         II      1 415983 416906 TSS1609
## AAD10 protein_coding    867          X      1 727405 728271 TSS5024
## AAD14 protein_coding   1131        XIV     -1  16118  17248 TSS6941
## AAD15 protein_coding    432         XV     -1   1647   2078  TSS108
## AAD16 protein_coding    459         VI     -1  14305  14763 TSS2145
##                 location variance_condition variance_batch
## AAC1  XIII_387315_388244             0.2121              0
## AAC3    II_415983_416906             0.8471              0
## AAD10    X_727405_728271             0.9067              0
## AAD14    XIV_16118_17248             0.8387              0
## AAD15       XV_1647_2078             0.9147              0
## AAD16     VI_14305_14763             0.9024              0
##       variance_Residuals
## AAC1             0.78786
## AAC3             0.15286
## AAD10            0.09335
## AAD14            0.16130
## AAD15            0.08532
## AAD16            0.09764

14 Now look at the data without batch ‘E2B1’

nor_norm <- sm(normalize_expt(merged_nor, transform="log2", convert="cpm", norm="quant", filter=TRUE))
nor_plots <- sm(graph_metrics(nor_norm))

nor_batchnorm <- sm(normalize_expt(merged_nor, transform="log2", convert="cpm", norm="quant", filter=TRUE,
                                batch="fsva"))
nor_batchplots <- sm(graph_metrics(nor_batchnorm))

14.1 Print the smaller set of data

## Before fsva, correlation heatmap without 'r'
pp(file="illustrator_input/17_nor_corheat.pdf", image=nor_plots$corheat)

## Wrote the image to: illustrator_input/17_nor_corheat.pdf

## Before fsva, pca without 'r'
pp(file="illustrator_input/18_nor_pca.pdf", image=nor_plots$pcaplot)

## Too few points to calculate an ellipse

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Wrote the image to: illustrator_input/18_nor_pca.pdf

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## After fsva, correlation heatmap without 'r'
pp(file="illustrator_input/19_nor_fsva_corheat.pdf", image=nor_batchplots$corheat)

## Wrote the image to: illustrator_input/19_nor_fsva_corheat.pdf

## After fsva, pca without 'r'
pp(file="illustrator_input/20_nor_fsva_pca.pdf", image=nor_batchplots$pcaplot)

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Wrote the image to: illustrator_input/20_nor_fsva_pca.pdf

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

15 Try many batch estimation methods to see what is up.

I do not think any of this information is currently being used, so I am going to stop running it for the moment but keep it here in case we decide to revive it.

merged_pcabatch <- sm(normalize_expt(merged_expt, transform="log2",
                                filter=TRUE, batch="pca", low_to_zero=TRUE))
merged_pcabatch_metrics <- sm(graph_metrics(merged_pcabatch))

merged_nor_pcabatch <- sm(normalize_expt(merged_nor, transform="log2",
                                filter=TRUE, batch="pca", low_to_zero=TRUE))
merged_nor_pcabatch_metrics <- sm(graph_metrics(merged_nor_pcabatch))

merged_nos_pcabatch <- sm(normalize_expt(merged_nos, transform="log2",
                                filter=TRUE, batch="pca", low_to_zero=TRUE))
merged_nos_pcabatch_metrics <- sm(graph_metrics(merged_nos_pcabatch))

merged_sva <- sm(normalize_expt(merged_expt, transform="log2",
                                filter=TRUE, batch="sva", low_to_zero=TRUE))
merged_sva_metrics <- sm(graph_metrics(merged_sva))

merged_nor_sva <- sm(normalize_expt(merged_nor, transform="log2",
                                filter=TRUE, batch="sva", low_to_zero=TRUE))
merged_nor_sva_metrics <- sm(graph_metrics(merged_nor_sva))

merged_nos_sva <- sm(normalize_expt(merged_nos, transform="log2",
                                filter=TRUE, batch="sva", low_to_zero=TRUE))
merged_nos_sva_metrics <- sm(graph_metrics(merged_nos_sva))

merged_combat <- sm(normalize_expt(merged_expt, transform="log2",
                                   filter=TRUE, batch="combat_scale", low_to_zero=TRUE))
merged_combat_metrics <- sm(graph_metrics(merged_combat))

merged_nor_combat <- sm(normalize_expt(merged_nor, transform="log2",
                                   filter=TRUE, batch="combat_scale", low_to_zero=TRUE))
merged_nor_combat_metrics <- sm(graph_metrics(merged_nor_combat))

merged_nos_combat <- sm(normalize_expt(merged_nos, transform="log2",
                                   filter=TRUE, batch="combat_scale", low_to_zero=TRUE))
merged_nos_combat_metrics <- sm(graph_metrics(merged_nos_combat))

merged_limma <- sm(normalize_expt(merged_expt, transform="log2",
                                   filter=TRUE, batch="limmaresid", low_to_zero=TRUE))
merged_limma_metrics <- sm(graph_metrics(merged_limma))

merged_nor_limma <- sm(normalize_expt(merged_nor, transform="log2",
                                   filter=TRUE, batch="limmaresid", low_to_zero=TRUE))
merged_nor_limma_metrics <- sm(graph_metrics(merged_nor_limma))

merged_nos_limma <- sm(normalize_expt(merged_nos, transform="log2",
                                   filter=TRUE, batch="limmaresid", low_to_zero=TRUE))
merged_nos_limma_metrics <- sm(graph_metrics(merged_nos_limma))

merged_fsva <- sm(normalize_expt(merged_expt, transform="log2",
                              filter=TRUE, batch="fsva", low_to_zero=TRUE))
merged_fsva_metrics <- sm(graph_metrics(merged_fsva))

merged_nor_fsva <- sm(normalize_expt(merged_nor, transform="log2",
                              filter=TRUE, batch="fsva", low_to_zero=TRUE))
merged_nor_fsva_metrics <- sm(graph_metrics(merged_nor_fsva))

merged_nos_fsva <- sm(normalize_expt(merged_nos, transform="log2",
                              filter=TRUE, batch="ssva", low_to_zero=TRUE))
merged_nos_fsva_metrics <- sm(graph_metrics(merged_nos_fsva))

15.0.1 The resulting plots

Why is is suddenly printing stuff now and not before?

merged_pcabatch_metrics$pcaplot

merged_nor_pcabatch_metrics$pcaplot

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

merged_nos_pcabatch_metrics$pcaplot

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

merged_sva_metrics$pcaplot

merged_nor_sva_metrics$pcaplot

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

merged_nos_sva_metrics$pcaplot

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

merged_combat_metrics$pcaplot

merged_nor_combat_metrics$pcaplot

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

merged_nos_combat_metrics$pcaplot

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

merged_limma_metrics$pcaplot

merged_nor_limma_metrics$pcaplot

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

merged_nos_limma_metrics$pcaplot

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

merged_fsva_metrics$pcaplot

merged_nor_fsva_metrics$pcaplot

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

merged_nos_fsva_metrics$pcaplot

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

16 Write the expt

## hmm I was going to silence this line, but looking at the report it seems to be missing some pictures.
fun_nor <- write_expt(merged_nor, excel=paste0("excel/samples_written_merged_nor-v", ver, ".xlsx"),
                      filter=TRUE, norm="raw", convert="cpm", batch="fsva", transform="log2")

## Writing the legend.

## Writing the raw reads.

## Graphing the raw reads.

## Too few points to calculate an ellipse

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Writing the normalized reads.

## Graphing the normalized reads.

## Too few points to calculate an ellipse

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Writing the median reads by factor.

## The factor WT has 6 rows.

## The factor cbf5_D95A has 8 rows.

## The factor upf1d has 4 rows.

## The factor cbf5_D95A upf1d has 2 rows.

17 Differential Expression (This takes forever)

Once we choose a dataset and set of batch factors to add to the statistical model, we pass them to deseq/edger/limma and see what happens. The following document handles these tasks. Though the shortest document, this is the most computationally intensive task.

18 Differential Expression: 20180310

19 Filter the raw data

In sample_estimation, I created sc_filt which is precisely what I want.

20 Start with batch in the model

I am going to leave these running without silence as I want to make sure they are running without troubles.

20.1 Set up contrasts

I use the variable ‘keepers’ to define the numerators/denominators of interest and give their contrasts names which are appropriate. I do this because my all_pairwise() function performs all possible pairwise comparisons in the specific order of: a:b, a:c, a:d, … b:c, b:d, …, c:d, … which is not necessarily what is biologically interesting/intuitive. Thus when we specifically set the numerators/denominators here, it will make sure that the result of the contrast is in the chosen orientation.

keepers <- list(
  "D95A_vs_WT" = c("cbf5_D95A", "WT"),
  "upf1d_vs_WT" = c("upf1d", "WT"),
  "double_vs_D95A" = c("cbf5_D95Aupf1d", "cbf5_D95A"),
  "double_vs_upf1d" = c("cbf5_D95Aupf1d", "upf1d"),
  "double_vs_wt" = c("cbf5_D95Aupf1d", "WT"))

20.2 Perform search

As I suspect you know, all_pairwise() is the work horse. It runs deseq/edger/limma/basic differential expression searches and attempts to use similar/compatible statistical models for each search (except basic which is intended as a diagnostic/negative control).

combine_de_tables() does most of the work. It takes the data from all_pairwise(), combines them, and writes them out.

I should add some more pictures here.

fsva <- sm(all_pairwise(input=merged_filt, model_batch="fsva", parallel=FALSE))

fsva_write <- combine_de_tables(
  all_pairwise_result=fsva, keepers=keepers,
  excel=paste0("excel/sva_in_model_differential_merged-v", ver, ".xlsx"),
  sig_excel=paste0("excel/sva_in_model_significant-v", ver, ".xlsx"),
  abundant_excel=paste0("excel/sva_in_model_abundance-v", ver, ".xlsx"))

## Deleting the file excel/sva_in_model_differential_merged-v20180310.xlsx before writing the tables.

## Writing a legend of columns.

## Printing a pca plot before/after surrogates/batch estimation.

## Working on 1/5: D95A_vs_WT

## Found inverse table with WT_vs_cbf5_D95A

## Working on 2/5: upf1d_vs_WT

## Found inverse table with WT_vs_upf1d

## Working on 3/5: double_vs_D95A

## Found table with cbf5_D95Aupf1d_vs_cbf5_D95A

## Working on 4/5: double_vs_upf1d

## Found inverse table with upf1d_vs_cbf5_D95Aupf1d

## Working on 5/5: double_vs_wt

## Found inverse table with WT_vs_cbf5_D95Aupf1d

## Adding venn plots for cbf5_D95A_vs_WT.

## Limma expression coefficients for cbf5_D95A_vs_WT; R^2: 0.94; equation: y = 1.01x - 0.0273

## Edger expression coefficients for cbf5_D95A_vs_WT; R^2: 0.938; equation: y = 1.01x - 0.0115

## DESeq2 expression coefficients for cbf5_D95A_vs_WT; R^2: 0.939; equation: y = 1.01x - 0.0269

## Adding venn plots for upf1d_vs_WT.

## Limma expression coefficients for upf1d_vs_WT; R^2: 0.925; equation: y = 0.94x + 0.385

## Edger expression coefficients for upf1d_vs_WT; R^2: 0.929; equation: y = 0.916x + 1.16

## DESeq2 expression coefficients for upf1d_vs_WT; R^2: 0.929; equation: y = 0.916x + 0.776

## Adding venn plots for cbf5_D95Aupf1d_vs_cbf5_D95A.

## Limma expression coefficients for cbf5_D95Aupf1d_vs_cbf5_D95A; R^2: 0.894; equation: y = 0.903x + 0.577

## Edger expression coefficients for cbf5_D95Aupf1d_vs_cbf5_D95A; R^2: 0.892; equation: y = 0.843x + 1.43

## DESeq2 expression coefficients for cbf5_D95Aupf1d_vs_cbf5_D95A; R^2: 0.892; equation: y = 0.843x + 1.5

## Adding venn plots for cbf5_D95Aupf1d_vs_upf1d.

## Limma expression coefficients for cbf5_D95Aupf1d_vs_upf1d; R^2: 0.965; equation: y = 1.02x - 0.111

## Edger expression coefficients for cbf5_D95Aupf1d_vs_upf1d; R^2: 0.95; equation: y = 0.974x + 0.371

## DESeq2 expression coefficients for cbf5_D95Aupf1d_vs_upf1d; R^2: 0.957; equation: y = 0.996x + 0.0755

## Adding venn plots for cbf5_D95Aupf1d_vs_WT.

## Limma expression coefficients for cbf5_D95Aupf1d_vs_WT; R^2: 0.863; equation: y = 0.968x + 0.234

## Edger expression coefficients for cbf5_D95Aupf1d_vs_WT; R^2: 0.874; equation: y = 0.903x + 0.958

## DESeq2 expression coefficients for cbf5_D95Aupf1d_vs_WT; R^2: 0.872; equation: y = 0.897x + 1.01

## Writing summary information.

## Attempting to add the comparison plot to pairwise_summary at row: 23 and column: 1

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=================================================================| 100%

## Performing save of the workbook.

## Invoking extract_significant_genes().

## Writing a legend of columns.

## Writing excel data sheet 0/5: cbf5_D95A_vs_WT

## After (adj)p filter, the up genes table has 1515 genes.

## After (adj)p filter, the down genes table has 1385 genes.

## After fold change filter, the up genes table has 353 genes.

## After fold change filter, the down genes table has 149 genes.

## Writing excel data sheet 1/5: upf1d_vs_WT

## After (adj)p filter, the up genes table has 1448 genes.

## After (adj)p filter, the down genes table has 1159 genes.

## After fold change filter, the up genes table has 307 genes.

## After fold change filter, the down genes table has 196 genes.

## Writing excel data sheet 2/5: cbf5_D95Aupf1d_vs_cbf5_D95A

## After (adj)p filter, the up genes table has 1747 genes.

## After (adj)p filter, the down genes table has 1861 genes.

## After fold change filter, the up genes table has 558 genes.

## After fold change filter, the down genes table has 488 genes.

## Writing excel data sheet 3/5: cbf5_D95Aupf1d_vs_upf1d

## After (adj)p filter, the up genes table has 577 genes.

## After (adj)p filter, the down genes table has 545 genes.

## After fold change filter, the up genes table has 169 genes.

## After fold change filter, the down genes table has 64 genes.

## Writing excel data sheet 4/5: cbf5_D95Aupf1d_vs_WT

## After (adj)p filter, the up genes table has 2011 genes.

## After (adj)p filter, the down genes table has 1758 genes.

## After fold change filter, the up genes table has 703 genes.

## After fold change filter, the down genes table has 587 genes.

## Printing significant genes to the file: excel/sva_in_model_significant-v20180310.xlsx

## 1/5: Creating significant table up_1limma_cbf5_D95A_vs_WT

## 2/5: Creating significant table up_2limma_upf1d_vs_WT

## 3/5: Creating significant table up_3limma_cbf5_D95Aupf1d_vs_cbf5_D95A

## The sheet name was too long for Excel, truncating it by removing vowels.

## hmph, still too long, truncating to 30 characters.

## The sheet name was too long for Excel, truncating it by removing vowels.

## hmph, still too long, truncating to 30 characters.

## 4/5: Creating significant table up_4limma_cbf5_D95Aupf1d_vs_upf1d

## The sheet name was too long for Excel, truncating it by removing vowels.
## The sheet name was too long for Excel, truncating it by removing vowels.

## 5/5: Creating significant table up_5limma_cbf5_D95Aupf1d_vs_WT

## The sheet name was too long for Excel, truncating it by removing vowels.

## Writing excel data sheet 5/5: cbf5_D95A_vs_WT

## After (adj)p filter, the up genes table has 1864 genes.

## After (adj)p filter, the down genes table has 1076 genes.

## After fold change filter, the up genes table has 440 genes.

## After fold change filter, the down genes table has 119 genes.

## Writing excel data sheet 6/5: upf1d_vs_WT

## After (adj)p filter, the up genes table has 1537 genes.

## After (adj)p filter, the down genes table has 987 genes.

## After fold change filter, the up genes table has 297 genes.

## After fold change filter, the down genes table has 194 genes.

## Writing excel data sheet 7/5: cbf5_D95Aupf1d_vs_cbf5_D95A

## After (adj)p filter, the up genes table has 1646 genes.

## After (adj)p filter, the down genes table has 1848 genes.

## After fold change filter, the up genes table has 557 genes.

## After fold change filter, the down genes table has 557 genes.

## Writing excel data sheet 8/5: cbf5_D95Aupf1d_vs_upf1d

## After (adj)p filter, the up genes table has 633 genes.

## After (adj)p filter, the down genes table has 568 genes.

## After fold change filter, the up genes table has 193 genes.

## After fold change filter, the down genes table has 71 genes.

## Writing excel data sheet 9/5: cbf5_D95Aupf1d_vs_WT

## After (adj)p filter, the up genes table has 2067 genes.

## After (adj)p filter, the down genes table has 1416 genes.

## After fold change filter, the up genes table has 692 genes.

## After fold change filter, the down genes table has 426 genes.

## Printing significant genes to the file: excel/sva_in_model_significant-v20180310.xlsx

## 1/5: Creating significant table up_1edger_cbf5_D95A_vs_WT

## 2/5: Creating significant table up_2edger_upf1d_vs_WT

## 3/5: Creating significant table up_3edger_cbf5_D95Aupf1d_vs_cbf5_D95A

## The sheet name was too long for Excel, truncating it by removing vowels.

## hmph, still too long, truncating to 30 characters.

## The sheet name was too long for Excel, truncating it by removing vowels.

## hmph, still too long, truncating to 30 characters.

## 4/5: Creating significant table up_4edger_cbf5_D95Aupf1d_vs_upf1d

## The sheet name was too long for Excel, truncating it by removing vowels.
## The sheet name was too long for Excel, truncating it by removing vowels.

## 5/5: Creating significant table up_5edger_cbf5_D95Aupf1d_vs_WT

## The sheet name was too long for Excel, truncating it by removing vowels.

## Writing excel data sheet 10/5: cbf5_D95A_vs_WT

## After (adj)p filter, the up genes table has 1769 genes.

## After (adj)p filter, the down genes table has 1355 genes.

## After fold change filter, the up genes table has 381 genes.

## After fold change filter, the down genes table has 142 genes.

## Writing excel data sheet 11/5: upf1d_vs_WT

## After (adj)p filter, the up genes table has 1049 genes.

## After (adj)p filter, the down genes table has 1654 genes.

## After fold change filter, the up genes table has 179 genes.

## After fold change filter, the down genes table has 323 genes.

## Writing excel data sheet 12/5: cbf5_D95Aupf1d_vs_cbf5_D95A

## After (adj)p filter, the up genes table has 1377 genes.

## After (adj)p filter, the down genes table has 2302 genes.

## After fold change filter, the up genes table has 464 genes.

## After fold change filter, the down genes table has 703 genes.

## Writing excel data sheet 13/5: cbf5_D95Aupf1d_vs_upf1d

## After (adj)p filter, the up genes table has 748 genes.

## After (adj)p filter, the down genes table has 691 genes.

## After fold change filter, the up genes table has 186 genes.

## After fold change filter, the down genes table has 70 genes.

## Writing excel data sheet 14/5: cbf5_D95Aupf1d_vs_WT

## After (adj)p filter, the up genes table has 1660 genes.

## After (adj)p filter, the down genes table has 1951 genes.

## After fold change filter, the up genes table has 496 genes.

## After fold change filter, the down genes table has 627 genes.

## Printing significant genes to the file: excel/sva_in_model_significant-v20180310.xlsx

## 1/5: Creating significant table up_1deseq_cbf5_D95A_vs_WT

## 2/5: Creating significant table up_2deseq_upf1d_vs_WT

## 3/5: Creating significant table up_3deseq_cbf5_D95Aupf1d_vs_cbf5_D95A

## The sheet name was too long for Excel, truncating it by removing vowels.

## hmph, still too long, truncating to 30 characters.

## The sheet name was too long for Excel, truncating it by removing vowels.

## hmph, still too long, truncating to 30 characters.

## 4/5: Creating significant table up_4deseq_cbf5_D95Aupf1d_vs_upf1d

## The sheet name was too long for Excel, truncating it by removing vowels.
## The sheet name was too long for Excel, truncating it by removing vowels.

## 5/5: Creating significant table up_5deseq_cbf5_D95Aupf1d_vs_WT

## The sheet name was too long for Excel, truncating it by removing vowels.

## Writing excel data sheet 15/5: cbf5_D95A_vs_WT

## After (adj)p filter, the up genes table has 6 genes.

## After (adj)p filter, the down genes table has 14 genes.

## After fold change filter, the up genes table has 5 genes.

## After fold change filter, the down genes table has 8 genes.

## Writing excel data sheet 16/5: upf1d_vs_WT

## After (adj)p filter, the up genes table has 21 genes.

## After (adj)p filter, the down genes table has 8 genes.

## After fold change filter, the up genes table has 17 genes.

## After fold change filter, the down genes table has 5 genes.

## Writing excel data sheet 17/5: cbf5_D95Aupf1d_vs_cbf5_D95A

## After (adj)p filter, the up genes table has 152 genes.

## After (adj)p filter, the down genes table has 57 genes.

## After fold change filter, the up genes table has 119 genes.

## After fold change filter, the down genes table has 22 genes.

## Writing excel data sheet 18/5: cbf5_D95Aupf1d_vs_upf1d

## After (adj)p filter, the up genes table has 0 genes.

## After (adj)p filter, the down genes table has 0 genes.

## After fold change filter, the up genes table has 0 genes.

## After fold change filter, the down genes table has 0 genes.

## Writing excel data sheet 19/5: cbf5_D95Aupf1d_vs_WT

## After (adj)p filter, the up genes table has 95 genes.

## After (adj)p filter, the down genes table has 151 genes.

## After fold change filter, the up genes table has 76 genes.

## After fold change filter, the down genes table has 108 genes.

## Printing significant genes to the file: excel/sva_in_model_significant-v20180310.xlsx

## 1/5: Creating significant table up_1basic_cbf5_D95A_vs_WT

## 2/5: Creating significant table up_2basic_upf1d_vs_WT

## 3/5: Creating significant table up_3basic_cbf5_D95Aupf1d_vs_cbf5_D95A

## The sheet name was too long for Excel, truncating it by removing vowels.

## hmph, still too long, truncating to 30 characters.

## The sheet name was too long for Excel, truncating it by removing vowels.

## hmph, still too long, truncating to 30 characters.

## 4/5: Creating significant table up_4basic_cbf5_D95Aupf1d_vs_upf1d

## The sheet name was too long for Excel, truncating it by removing vowels.
## The sheet name was too long for Excel, truncating it by removing vowels.

## 5/5: Creating significant table up_5basic_cbf5_D95Aupf1d_vs_WT

## The sheet name was too long for Excel, truncating it by removing vowels.

## Adding significance bar plots.

## Invoking extract_abundant_genes().

## Writing a legend of columns.

strict_sig_write <- sm(extract_significant_genes(
  fsva_write, lfc=2,
  excel=paste0("excel/sva_in_model_sig2lfc-v", ver, ".xlsx")))

new_colors <- c("#008000", "#4CA64C", "#7FBF7F", "#FF0000", "#FF4C4C", "#FF9999")
new_bars <- significant_barplots(fsva_write, color_list=new_colors)

cbf5_de_plots <- extract_de_plots(fsva_write, type="deseq", table="cbf5_D95A_vs_WT")
pp(file="illustrator_input/21_cbf5_deseq_ma.pdf", image=cbf5_de_plots$ma$plot)

## Wrote the image to: illustrator_input/21_cbf5_deseq_ma.pdf

pp(file="illustrator_input/22_cbf5_deseq_vol.pdf", image=cbf5_de_plots$volcano$plot)

## Wrote the image to: illustrator_input/22_cbf5_deseq_vol.pdf

pp(file="illustrator_input/23_redgreen_sigbars.pdf", image=new_bars$deseq)

## Wrote the image to: illustrator_input/23_redgreen_sigbars.pdf

20.3 Repeat pairwise searches without the batch ‘r’

merged_nor <- subset_expt(merged_filt, subset="batch!='E2B1'")
fsva_nor <- sm(all_pairwise(input=merged_nor, model_batch="fsva", parallel=FALSE))

fsva_nor_write <- sm(combine_de_tables(
  all_pairwise_result=fsva_nor, keepers=keepers,
  excel=paste0("excel/nor_sva_in_model_differential_merged-v", ver, ".xlsx"),
  abundant_excel=paste0("excel/nor_sva_in_model_abundance-v", ver, ".xlsx"),
  sig_excel=paste0("excel/nor_sva_in_model_significant-v", ver, ".xlsx")))

fsva_nor_sig <- fsva_nor_write[["significant"]]
strict_fsva_nor_write <- sm(extract_significant_genes(
  fsva_nor_write, lfc=2,
  excel=paste0("excel/nor_sva_in_model_sig2lfc-v", ver, ".xlsx")))

new_colors <- c("#008000", "#4CA64C", "#7FBF7F", "#FF0000", "#FF4C4C", "#FF9999")
new_bars <- significant_barplots(fsva_nor_write, color_list=new_colors)

norcbf5_de_plots <- extract_de_plots(fsva_nor_write, type="deseq", table="cbf5_D95A_vs_WT")
pp(file="illustrator_input/24_norcbf5_deseq_ma.pdf", image=norcbf5_de_plots$ma$plot)

## Wrote the image to: illustrator_input/24_norcbf5_deseq_ma.pdf

pp(file="illustrator_input/25_norcbf5_deseq_vol.pdf", image=norcbf5_de_plots$volcano$plot)

## Wrote the image to: illustrator_input/25_norcbf5_deseq_vol.pdf

pp(file="illustrator_input/26_norredgreen_sigbars.pdf", image=new_bars$deseq)

## Wrote the image to: illustrator_input/26_norredgreen_sigbars.pdf

pp(file="illustrator_input/27_nor_agreement.pdf", image=fsva_nor$comparison$heat)

## Wrote the image to: illustrator_input/27_nor_agreement.pdf

21 Gene Set Enrichment searches

I always think of this as gene ontology searching, but that is not correct, as the following document should perform searches of gene enrichment across a number of databases with multiple methods. Currently I think it is limited to goseq/gProfileR and some pathway analyses. One of my long-term goals (exceedingly unlikely to happen for this data) is to figure out a reliable way to combine/compare these searches similar to what I do with all_pairwise() in the previous document.

22 Sample Estimation version: 20180310

23 Ontology searches with RNA sequencing data of Saccharomyces cerevisiae wt/mutant cbf5.

This document will pull together the various annotations and results from the differential expression analyses and attempt to use them for gene ontology searches.

24 Ontology searching is weird

Sadly, the authors of the various ontology tools I use (goseq/clusterprofiler/topgo/gostats/gprofiler) keep changing the input requirements and make it hard for me to keep up. Lets see how well I did.

24.1 Set up some other data sets, 1/2 fold and 4 fold.

## The strict subset was generated in 03 merged.
summary(strict_fsva_nor_write)

##               Length Class  Mode
## limma          7     -none- list
## edger          7     -none- list
## deseq          7     -none- list
## basic          7     -none- list
## sig_bar_plots 11     -none- list

## I didn't make a 1.5 fold set yet, make it now.
fsva_nor_onehalf_sig <- sm(extract_significant_genes(
  combined=fsva_nor_write, lfc=0.585,
  excel=paste0("excel/nor_sva_in_model_onehalf_significant_merged-v", ver, ".xlsx")))

24.2 Searching with gprofiler

g:ProfileR is a web-services which handles a large variety of ontology/enrichment searches for a limited set of species, including S.cerevisiae. It is probably my favorite because of its simplicity and inclusive set of searches.

## I was getting weird errors:
## "external pointer is not valid", so I am reloading the txdb here.
please_install("TxDb.Scerevisiae.UCSC.sacCer3.sgdGene")

## [1] 0

tmp <- sm(library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene))
sc_txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene

nor_twofold_merged_up <- fsva_nor_sig$deseq$ups[[1]]
nor_twofold_merged_down <- fsva_nor_sig$deseq$downs[[1]]

nor_two_gprofiler_written <- sm(sig_ontologies(
  significant_result=fsva_nor_sig, search_by="deseq",
  excel_prefix="excel/ontology_nor_two", excel_suffix=paste0("-v", ver, ".xlsx"),
  type="gprofiler", species="scerevisiae"))

## I think these are duplicates of the above search.
## yeah, why did I do that?
##nor_twoup_gprofiler <- sm(simple_gprofiler(
##  sig_genes=nor_twofold_merged_up, species="scerevisiae",
##  excel=paste0("excel/nor_twoup_gprofiler-v", ver, ".xlsx")))
##nor_twodown_gprofiler <- sm(simple_gprofiler(
##  sig_genes=nor_twofold_merged_down, species="scerevisiae",
##  excel=paste0("excel/nor_twodown_gprofiler-v", ver, ".xlsx")))
##nor_two_written <- sm(sig_ontologies(
##  significant_result=fsva_nor_sig, search_by="deseq",
##  excel_prefix="excel/ontology_two", excel_suffix=paste0("v", ver),
##  type="gprofiler", species="scerevisiae"))

pp(file="illustrator_input/28_gprofiler_upbars_mf.pdf",
   image=nor_two_gprofiler_written[["ups"]][["cbf5_D95A_vs_WT"]][["pvalue_plots"]][["mfp_plot_over"]])

## Wrote the image to: illustrator_input/28_gprofiler_upbars_mf.pdf

pp(file="illustrator_input/29_gprofiler_upbars_bp.pdf",
   image=nor_two_gprofiler_written[["ups"]][["cbf5_D95A_vs_WT"]][["pvalue_plots"]][["bpp_plot_over"]])

## Wrote the image to: illustrator_input/29_gprofiler_upbars_bp.pdf

pp(file="illustrator_input/30_gprofiler_upbars_kegg.pdf",
   image=nor_two_gprofiler_written[["ups"]][["cbf5_D95A_vs_WT"]][["pvalue_plots"]][["kegg_plot_over"]])

## Wrote the image to: illustrator_input/30_gprofiler_upbars_kegg.pdf

pp(file="illustrator_input/31_gprofiler_downbars_kegg.pdf",
   image=nor_two_gprofiler_written[["downs"]][["cbf5_D95A_vs_WT"]][["pvalue_plots"]][["kegg_plot_over"]])

## Wrote the image to: illustrator_input/31_gprofiler_downbars_kegg.pdf

24.2.1 Repeat using a more/less restrictive set of genes

Let us repeat the ontology searches using a more restrictive fold-change (2^0.6 -> 1.516), so basically 1.5 fold. Perhaps I should change it to 0.585, hmm yeah I guess so, ok now it is exactly 1.5 fold.

24.2.1.1 Less strict first

Do the 1.5 fold cutoff first.

nor_onehalf_merged_up <- fsva_nor_onehalf_sig$limma$ups[[1]]
nor_onehalf_merged_down <- fsva_nor_onehalf_sig$limma$downs[[1]]

nor_onehalf_written <- sm(sig_ontologies(
  significant_result=fsva_nor_onehalf_sig, search_by="deseq",
  excel_prefix="excel/ontology_nor_onehalf", excel_suffix=paste0("-v", ver, ".xlsx"),
  type="gprofiler", species="scerevisiae"))

24.2.1.2 4 fold cutoff

nor_four_written_gprofiler <- sm(sig_ontologies(
  significant_result=strict_fsva_nor_write, search_by="deseq",
  excel_prefix="excel/ontology_nor_four", excel_suffix=paste0("-v", ver, ".xlsx"),
  type="gprofiler", species="scerevisiae"))

24.3 Searching with goseq

goseq is one of the most commonly used ontology search tools in R. It makes some interesting statements about the over-representation of longer/shorter genes in differential expression analyses and therefore performs some pre-processing steps to attempt to adjust these (and/or other) systematic biases.

Whenever I go back over their paper I find myself thinking I should do some sort of similar probability weighting to seek out biases in the differential expression results and do a similar plot to describe biases in the data/results – but I have not yet.

nor_twoup_merged_goseq <- sm(simple_goseq(
  sig_genes=nor_twofold_merged_up, go_db=sc_ontology,
  length_db=sc_txdb, excel=paste0("excel/nor_twoup_goseq-v", ver, ".xlsx")))

nor_twodown_merged_goseq <- sm(simple_goseq(
  sig_genes=nor_twofold_merged_down, go_db=sc_ontology,
  length_db=sc_txdb, excel=paste0("excel/nor_twodown_goseq-v", ver, ".xlsx")))

nor_onehalfup_goseq <- sm(simple_goseq(
  sig_genes=nor_onehalf_merged_up, go_db=sc_ontology,
  length_db=sc_txdb, excel=paste0("excel/nor_onehalfup_goseq-v", ver, ".xlsx")))

nor_onehalfdown_goseq <- sm(simple_goseq(
  sig_genes=nor_onehalf_merged_down, go_db=sc_ontology,
  length_db=sc_txdb, excel=paste0("excel/nor_onehalfdown_goseq-v", ver, ".xlsx")))

## Ideally, I can choose any ontology tool via sig_ontologies().
## In practice, the more brittle tools still fail because I haven't worked out
## all the corner cases yet.
nor_four_written_goseq <- sm(sig_ontologies(
  significant_result=strict_fsva_nor_write, search_by="deseq",
  excel_prefix="excel/ontology_nor_four_goseq", excel_suffix=paste0("-v", ver, ".xlsx"),
  type="goseq", species="scerevisiae"))

24.4 Maybe try clusterprofiler for this?

tmp <- sm(library(AnnotationHub))
ah = sm(AnnotationHub())
orgdbs <- sm(query(ah, "OrgDb"))
sc_orgdb <- sm(query(ah, c("OrgDB", "Saccharomyces"))) ##   AH49589 | org.Sc.sgd.db.sqlite
sc_orgdb <- ah[["AH49589"]]

nor_onehalfup_cp <- simple_clusterprofiler(sig_genes=nor_onehalf_merged_up,
                                           orgdb=sc_orgdb,
                                           excel=paste0("excel/nor_onehalfup_cp-v", ver, ".xlsx"))
nor_onehalfdown_cp <- simple_clusterprofiler(sig_genes=nor_onehalf_merged_down,
                                             orgdb=sc_orgdb,
                                             excel=paste0("excel/nor_onehalfup_cp-v", ver, ".xlsx"))

It has been quite a while since last I used KEGG in an efficient fashion, lets see what happens!

try_species <- kegg_get_orgn("Saccharomyces cerevisiae")
pathview_data <- batch_write$data[[1]]
rownames(pathview_data) <- make.names(pathview_data$transcriptid, unique=TRUE)
## If I read the yeast xml files correctly, they all follow the traditional chromosomal location naming scheme...
pathview_result <- simple_pathview(pathview_data, species=try_species)

message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
pander::pander(sessionInfo())

LS0tCnRpdGxlOiAiUy5jZXJldmlzaWFlIDIwMTc6IENvbWJpbmVkIFJOQVNlcSBhbmFseXNlcy4iCmF1dGhvcjogImF0YiBhYmVsZXdAZ21haWwuY29tIgpkYXRlOiAiYHIgU3lzLkRhdGUoKWAiCm91dHB1dDoKIGh0bWxfZG9jdW1lbnQ6CiAgY29kZV9kb3dubG9hZDogdHJ1ZQogIGNvZGVfZm9sZGluZzogc2hvdwogIGZpZ19jYXB0aW9uOiB0cnVlCiAgZmlnX2hlaWdodDogNwogIGZpZ193aWR0aDogNwogIGhpZ2hsaWdodDogdGFuZ28KICBrZWVwX21kOiBmYWxzZQogIG1vZGU6IHNlbGZjb250YWluZWQKICBudW1iZXJfc2VjdGlvbnM6IHRydWUKICBzZWxmX2NvbnRhaW5lZDogdHJ1ZQogIHRoZW1lOiBjb3NtbwogIHRvYzogdHJ1ZQogIHRvY19mbG9hdDoKICAgIGNvbGxhcHNlZDogZmFsc2UKICAgIHNtb290aF9zY3JvbGw6IGZhbHNlCi0tLQoKPHN0eWxlPgogIGJvZHkgLm1haW4tY29udGFpbmVyIHsKICAgIG1heC13aWR0aDogMTYwMHB4Owp9Cjwvc3R5bGU+CgpgYGB7ciBvcHRpb25zLCBpbmNsdWRlPUZBTFNFfQpsaWJyYXJ5KCJocGdsdG9vbHMiKQp0dCA8LSBkZXZ0b29sczo6bG9hZF9hbGwoIn4vaHBnbHRvb2xzIikKa25pdHI6Om9wdHNfa25pdCRzZXQocHJvZ3Jlc3M9VFJVRSwKICAgICAgICAgICAgICAgICAgICAgdmVyYm9zZT1UUlVFLAogICAgICAgICAgICAgICAgICAgICB3aWR0aD05MCwKICAgICAgICAgICAgICAgICAgICAgZWNobz1UUlVFKQprbml0cjo6b3B0c19jaHVuayRzZXQoZXJyb3I9VFJVRSwKICAgICAgICAgICAgICAgICAgICAgIGZpZy53aWR0aD04LAogICAgICAgICAgICAgICAgICAgICAgZmlnLmhlaWdodD04LAogICAgICAgICAgICAgICAgICAgICAgZHBpPTk2KQpvcHRpb25zKGRpZ2l0cz00LAogICAgICAgIHN0cmluZ3NBc0ZhY3RvcnM9RkFMU0UsCiAgICAgICAga25pdHIuZHVwbGljYXRlLmxhYmVsPSJhbGxvdyIpCmdncGxvdDI6OnRoZW1lX3NldChnZ3Bsb3QyOjp0aGVtZV9idyhiYXNlX3NpemU9MTApKQpzZXQuc2VlZCgxKQp2ZXIgPC0gIjIwMTgwMzEwIgpwcmV2aW91c19maWxlIDwtICJpbmRleC5SbWQiCgp0bXAgPC0gdHJ5KHNtKGxvYWRtZShmaWxlbmFtZT1wYXN0ZTAoZ3N1YihwYXR0ZXJuPSJcXC5SbWQiLCByZXBsYWNlPSIiLCB4PXByZXZpb3VzX2ZpbGUpLCAiLXYiLCB2ZXIsICIucmRhLnh6IikpKSkKcm1kX2ZpbGUgPC0gImNvbWJpbmVkLlJtZCIKc2tpcF9sb2FkIDwtIFRSVUUKYGBgCgpQZXJmb3JtIGFsbCBTLiBjZXJldmlzaWFlIGFuYWx5c2VzIGluIGEgc2luZ2xlIHJlcG9ydC4KPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09CgpUaGlzIGRvY3VtZW50IHNob3VsZCBldmFsdWF0ZSBhbmQgY29tYmluZSBhbGwgdGhlIGFuYWx5c2VzIHVzZWQgZm9yIG91cgpTYWNjaGFyb215Y2VzIFJOQXNlcSBkYXRhLgoKVGhlIHByaW1hcnkgdGhpbmcgdG8gbm90ZSBmb3IgdGhvc2UgcGF5aW5nIGF0dGVudGlvbiwgaXMgdGhlIHNraXBfbG9hZCB2YXJpYWJsZQphYm92ZSwgaXRzIGV4aXN0ZW5jZSBpcyBjaGVja2VkIGluIGFsbCBvZiB0aGUgZm9sbG93aW5nIGRvY3VtZW50cy4gIElmIGl0IGlzCmZvdW5kLCB0aGVuIHRoZXkgd2lsbCBub3QgdHJ5IHRvIGxvYWQgYW55IG9mIG15IGNoZWNrcG9pbnQgc2F2ZXMgYW5kIGp1c3QKY29udGludWUgcnVubmluZy4gIFRodXMgSSBjYW4gbWFrZSAxMDAlIGNlcnRhaW4gdGhhdCBubyBjcm9zcy1hbmFseXNpcyBjb250YW1pbmF0aW9uCmlzIGhhcHBlbmluZy4KCiMgQSBjb3VwbGUgY29udmVudGlvbnM6CgpXaGlsZSBJIGFtIHdyaXRpbmcgZG93biBteSB0aG91Z2h0cywgSSBzaG91bGQgbWFrZSBleHBsaWNpdCBhIGNvdXBsZSBvZiB0aGluZ3MgSQpyZXBlYXRlZGx5IGRvIHdoaWNoIG1pZ2h0IGxvb2sgd2VpcmQ6CgoxLiAgcHAuIChwcmludHBpY3R1cmUpICBUaGlzIGlzIGEgc21hbGwgZnVuY3Rpb24gd2hpY2ggc2V0cyBob3BlZnVsbHkgdXNlZnVsCiAgICBkZWZhdWx0cyB3aGVuIHByaW50aW5nL3NhdmluZyBpbWFnZXMuICBJIGZpbmQgdGhhdCBrZWVwaW5nIHRyYWNrIG9mIFIncwogICAgZGV2aWNlIGxpc3QgdmlhIGRldi5uZXcoKS9kZXYub2ZmKCkgY29uZnVzaW5nLiAgVGh1cyBpZiBJIGdpdmUgcHAoKSBhIGZpbGUKICAgIG5hbWUgYW5kIGEgcGljdHVyZSBvYmplY3QsIGl0IHdpbGwgY2FsbCBmb3IgbWUgdGhlIGFwcHJvcHJpYXRlIGdyYXBoaWNzCiAgICBkZXZpY2Ugd2l0aCBvcHRpb25zIHN1ZmZpY2llbnQgdG8gZ2l2ZSBhIHByZXNlbnRhYmxlIHBpY3R1cmUgYW5kIHJlbWVtYmVyIHRvCiAgICBkbyB0aGUgZGV2Lm9mZigpIGF0IHRoZSBhcHByb3ByaWF0ZSB0aW1lLCB3aGljaCBJIG9mdGVuIHNlZW0gdG8gZm9yZ2V0LgoyLiAgdHQuICBUaGlzIGlzIGEgdHJhc2ggdmFyaWFibGUgd2hpY2ggSSByZXVzZSB0aHJvdWdob3V0IHRoZSBhbmFseXNlcy4KMy4gIHNtLiAgVGhpcyBpcyBhIHNob3J0IHNpbGVuY2VyIGZ1bmN0aW9uLiAgQSBsb3Qgb2YgbXkgZnVuY3Rpb25zIGFyZSB2ZXJ5CiAgICBjaGF0dHkgc28gdGhhdCBJIGNhbiB3YXRjaCB0aGVpciBvdXRwdXQgZm9yIHByb2JsZW1zLCBidXQgb25jZSBJIGFtCiAgICBjb25maWRlbnQgdGhhdCBhIGZ1bmN0aW9uIHdpbGwgd29yayBwcm9wZXJseSwgSSBvZnRlbiB3cmFwIGl0IGluIHNtKCkuCgojIENvbGxlY3QgYW5ub3RhdGlvbiBkYXRhCgpUaGlzIGZpcnN0IGRvY3VtZW50IHVzZXMgZGF0YSBmcm9tIHllYXN0Z2Vub21lLm9yZyB2aWEgYmlvbWFydCwgdGhlIHNxbGl0ZQphbm5vdGF0aW9uIGRhdGEgZnJvbSBiaW9jb25kdWN0b3IsIGFuZCBJRHMvYW5ub3RhdGlvbnMgZnJvbSB0aGUgZ2ZmIGZpbGUgdXNlZApkdXJpbmcgdGhlIHByZXByb2Nlc3Npbmcgc3RlcHMgaW4gb3JkZXIgdG8gY3JlYXRlIGEgY29tYmluZWQgdGFibGUgb2YKYW5ub3RhdGlvbnMuICBBbG9uZyB0aGUgd2F5IGl0IGFsc28gY29sbGVjdHMgR08gSURzIGFuZCBnZW5lIGxlbmd0aHMgKGFzc3VtaW5nCndlIGRlY2lkZSB0byB1c2UgZ29zZXEgbGF0ZXIpLgoKQXQgdGhlIGVuZCwgaXQgY3JlYXRlcyBhbiBleHByZXNzaW9uc2V0IHVzaW5nIHRoZSBzYW1wbGUgc2hlZXQgYW5kIHRoaXMKYW5ub3RhdGlvbiBkYXRhLCB0aGlzIGV4cHJlc3Npb25zZXQgaXMgdGhlIGJhc2UgZm9yIGFsbCBmdXR1cmUgYW5hbHlzZXMuCgpgYGB7ciBhbm5vdGF0aW9uLCBjaGlsZD0nMDFfYW5ub3RhdGlvbi5SbWQnfQpgYGAKCiMgU2FtcGxlIEVzdGltYXRpb24gKExvdHMgb2YgcGxvdHMhKQoKVGhlIGZvbGxvd2luZyBkb2N1bWVudCBwZXJmb3JtcyB0aGUgbW9zdCBzdWJqZWN0aXZlIHBvcnRpb24gb2YgdGhlIGFuYWx5c2lzLiAgSXQKYXR0ZW1wdHMgdG8gcHJlc2VudCB0aGUgZGF0YSBzbyB0aGF0IHdlIGNhbiBzZWUgaXRzIHN0cmVuZ3RocyBhbmQgd2Vha2Vzc2VzLiAgSW4KcHJldHR5IG11Y2ggZXZlcnkgYW5hbHlzaXMgKGFuZCB0aGlzIG9uZSksIHRoZSBwcmltYXJ5IHdlYWtuZXNzIGlzIGJhdGNoCmVmZmVjdC4gIFRodXMgaXQgcGVyZm9ybXMgYSBzZXJpZXMgb2YgYmF0Y2gvc3Vycm9nYXRlIGVzdGltYXRpb25zIGFuZCByZS1wbG90cwp0byBzZWUgaWYgYW55IHBhdHRlcm5zIGVtZXJnZSB3aGljaCAnbWFrZSBzZW5zZScuICBUaGF0IG9mIGNvdXJzZSBpcyB3aGVyZSB0aGUKc3ViamVjdGl2ZSBkYW5nZXIgcmVhcnMgaXRzIGhlYWQuCgpgYGB7ciBzYW1wbGVfZXN0aW1hdGlvbiwgY2hpbGQ9JzAyX3NhbXBsZV9lc3RpbWF0aW9uX21lcmdlZC5SbWQnfQpgYGAKCiMgRGlmZmVyZW50aWFsIEV4cHJlc3Npb24gKFRoaXMgdGFrZXMgZm9yZXZlcikKCk9uY2Ugd2UgY2hvb3NlIGEgZGF0YXNldCBhbmQgc2V0IG9mIGJhdGNoIGZhY3RvcnMgdG8gYWRkIHRvIHRoZSBzdGF0aXN0aWNhbAptb2RlbCwgd2UgcGFzcyB0aGVtIHRvIGRlc2VxL2VkZ2VyL2xpbW1hIGFuZCBzZWUgd2hhdCBoYXBwZW5zLiAgVGhlIGZvbGxvd2luZwpkb2N1bWVudCBoYW5kbGVzIHRoZXNlIHRhc2tzLiAgVGhvdWdoIHRoZSBzaG9ydGVzdCBkb2N1bWVudCwgdGhpcyBpcyB0aGUgbW9zdApjb21wdXRhdGlvbmFsbHkgaW50ZW5zaXZlIHRhc2suCgpgYGB7ciBkaWZmZXJlbnRpYWxfZXhwcmVzc2lvbiwgY2hpbGQ9JzAzX2RpZmZlcmVudGlhbF9leHByZXNzaW9uX21lcmdlZC5SbWQnfQpgYGAKCiMgR2VuZSBTZXQgRW5yaWNobWVudCBzZWFyY2hlcwoKSSBhbHdheXMgdGhpbmsgb2YgdGhpcyBhcyBnZW5lIG9udG9sb2d5IHNlYXJjaGluZywgYnV0IHRoYXQgaXMgbm90IGNvcnJlY3QsIGFzCnRoZSBmb2xsb3dpbmcgZG9jdW1lbnQgc2hvdWxkIHBlcmZvcm0gc2VhcmNoZXMgb2YgZ2VuZSBlbnJpY2htZW50IGFjcm9zcyBhIG51bWJlciBvZgpkYXRhYmFzZXMgd2l0aCBtdWx0aXBsZSBtZXRob2RzLiAgQ3VycmVudGx5IEkgdGhpbmsgaXQgaXMgbGltaXRlZCB0bwpnb3NlcS9nUHJvZmlsZVIgYW5kIHNvbWUgcGF0aHdheSBhbmFseXNlcy4gIE9uZSBvZiBteSBsb25nLXRlcm0gZ29hbHMKKGV4Y2VlZGluZ2x5IHVubGlrZWx5IHRvIGhhcHBlbiBmb3IgdGhpcyBkYXRhKSBpcyB0byBmaWd1cmUgb3V0IGEgcmVsaWFibGUgd2F5CnRvIGNvbWJpbmUvY29tcGFyZSB0aGVzZSBzZWFyY2hlcyBzaW1pbGFyIHRvIHdoYXQgSSBkbyB3aXRoIGFsbF9wYWlyd2lzZSgpIGluCnRoZSBwcmV2aW91cyBkb2N1bWVudC4KCmBgYHtyIGdlbmVfb250b2xvZ3ksIGNoaWxkPScwNF9nZW5lX29udG9sb2d5X21lcmdlZC5SbWQnfQpgYGAKCmBgYHtyIHNhdmVtZSwgZXZhbD1GQUxTRX0KbWVzc2FnZShwYXN0ZTAoIlRoaXMgaXMgaHBnbHRvb2xzIGNvbW1pdDogIiwgZ2V0X2dpdF9jb21taXQoKSkpCnRoaXNfc2F2ZSA8LSBwYXN0ZTAoZ3N1YihwYXR0ZXJuPSJcXC5SbWQiLCByZXBsYWNlPSIiLCB4PXJtZF9maWxlKSwgIi12IiwgdmVyLCAiLnJkYS54eiIpCm1lc3NhZ2UocGFzdGUwKCJTYXZpbmcgdG8gIiwgdGhpc19zYXZlKSkKdG1wIDwtIHNtKHNhdmVtZShmaWxlbmFtZT10aGlzX3NhdmUpKQpwYW5kZXI6OnBhbmRlcihzZXNzaW9uSW5mbygpKQpgYGAKCmBgYHtyIGxvYWRtZSwgZXZhbD1GQUxTRSwgaW5jbHVkZT1GQUxTRX0KIyMgU28gSSBjYW4gcmVsb2FkIGFuZCBsb29rIGZvciBlcnJvcnMuCnByZXZpb3VzX2ZpbGUgPC0gImNvbWJpbmVkLlJtZCIKdG1wIDwtIHRyeShzbShsb2FkbWUoZmlsZW5hbWU9cGFzdGUwKGdzdWIocGF0dGVybj0iXFwuUm1kIiwgcmVwbGFjZT0iIiwgeD1wcmV2aW91c19maWxlKSwgIi12IiwgdmVyLCAiLnJkYS54eiIpKSkpCmBgYAo=

S.cerevisiae 2017: Combined RNASeq analyses.

atb abelew@gmail.com

2018-03-26