I want to use this document to examine our first round of persistence samples. I checked my email from Najib and did not find a sample sheet but did find an explanation of the three sample types we expect.
In preparation for this, I downloaded a new hg38 genome. Since the panamensis asembly has not significantly changed (excepting the putative long read genome which I have not yet seen), I am just using the same one.
The hg38 genome I got is brand new (202405), so do not use the archive for a while.
## Ok, so useast.ensembl is failing today, let us use the jan2024 archive?
#hs_annot <- load_biomart_annotations(archive = FALSE, species = "hsapiens")
## Seems like the 202401 archive is a good choice, it is explicitly the hg38_111 release.
## and it is waaaaay faster (like 100x) than useast right now.
hs_annot <- load_biomart_annotations(archive = TRUE, species = "hsapiens",
year = 2024, month = "01")## The biomart annotations file already exists, loading from it.
panamensis_orgdb_idx <- grep(pattern = "^org.+panamen.+MHOM.+db$",
x = rownames(installed.packages()))
panamensis_orgdb <- tail(rownames(installed.packages())[panamensis_orgdb_idx], n = 1)
lp_annot <- load_orgdb_annotations(panamensis_orgdb, keytype = "gid")## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff, setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:hpgltools':
##
## annotation<-, conditions, conditions<-
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for
## packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:hpgltools':
##
## notes
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:hpgltools':
##
## trim
##
## Unable to find CDSNAME, setting it to ANNOT_EXTERNAL_DB_NAME.
## Unable to find CDSCHROM in the db, removing it.
## Unable to find CDSSTRAND in the db, removing it.
## Unable to find CDSSTART in the db, removing it.
## Unable to find CDSEND in the db, removing it.
## Extracted all gene ids.
## Attempting to select: ANNOT_EXTERNAL_DB_NAME, GENE_TYPE
## 'select()' returned 1:1 mapping between keys and columns
This is a little silly, but I am going to reload the annotations using the previous invocation to extract the annotation table without having to think. The previous block loads the orgdb for me, so I can just use that to get the fun annotations.
## Unable to find CDSNAME, setting it to ANNOT_EXTERNAL_DB_NAME.
## Unable to find CDSCHROM in the db, removing it.
## Unable to find CDSSTRAND in the db, removing it.
## Unable to find CDSSTART in the db, removing it.
## Unable to find CDSEND in the db, removing it.
## Extracted all gene ids.
## Attempting to select: ANNOT_EXTERNAL_DB_NAME, GENE_TYPE, ANNOT_AA_SEQUENCE_ID, ANNOT_ANNOTATED_GO_COMPONENT, ANNOT_ANNOTATED_GO_FUNCTION, ANNOT_ANNOTATED_GO_ID_COMPONENT, ANNOT_ANNOTATED_GO_ID_FUNCTION, ANNOT_ANNOTATED_GO_ID_PROCESS, ANNOT_ANNOTATED_GO_PROCESS, ANNOT_ANTICODON, ANNOT_APOLLO_LINK_OUT, ANNOT_APOLLO_TRANSCRIPT_DESCRIPTION, ANNOT_CDS, ANNOT_CDS_LENGTH, ANNOT_CHROMOSOME, ANNOT_CODING_END, ANNOT_CODING_START, ANNOT_EC_NUMBERS, ANNOT_EC_NUMBERS_DERIVED, ANNOT_END_MAX, ANNOT_EXON_COUNT, ANNOT_EXTERNAL_DB_NAME, ANNOT_EXTERNAL_DB_VERSION, ANNOT_FIVE_PRIME_UTR_LENGTH, ANNOT_GENE_CONTEXT_END, ANNOT_GENE_CONTEXT_START, ANNOT_GENE_END_MAX, ANNOT_GENE_END_MAX_TEXT, ANNOT_GENE_ENTREZ_ID, ANNOT_GENE_ENTREZ_LINK, ANNOT_GENE_EXON_COUNT, ANNOT_GENE_HTS_NONCODING_SNPS, ANNOT_GENE_HTS_NONSYN_SYN_RATIO, ANNOT_GENE_HTS_NONSYNONYMOUS_SNPS, ANNOT_GENE_HTS_STOP_CODON_SNPS, ANNOT_GENE_HTS_SYNONYMOUS_SNPS, ANNOT_GENE_LOCATION_TEXT, ANNOT_GENE_NAME, ANNOT_GENE_ORTHOLOG_NUMBER, ANNOT_GENE_ORTHOMCL_NAME, ANNOT_GENE_PARALOG_NUMBER, ANNOT_GENE_PREVIOUS_IDS, ANNOT_GENE_PRODUCT, ANNOT_GENE_START_MIN, ANNOT_GENE_START_MIN_TEXT, ANNOT_GENE_TOTAL_HTS_SNPS, ANNOT_GENE_TRANSCRIPT_COUNT, ANNOT_GENE_TYPE, ANNOT_GENOMIC_SEQUENCE_LENGTH, ANNOT_GENUS_SPECIES, ANNOT_HAS_MISSING_TRANSCRIPTS, ANNOT_INTERPRO_DESCRIPTION, ANNOT_INTERPRO_ID, ANNOT_IS_DEPRECATED, ANNOT_IS_PSEUDO, ANNOT_ISOELECTRIC_POINT, ANNOT_LOCATION_TEXT, ANNOT_MAP_LOCATION, ANNOT_MCMC_LOCATION, ANNOT_MOLECULAR_WEIGHT, ANNOT_NCBI_TAX_ID, ANNOT_ORTHOMCL_LINK, ANNOT_OVERVIEW, ANNOT_PFAM_DESCRIPTION, ANNOT_PFAM_ID, ANNOT_PIRSF_DESCRIPTION, ANNOT_PIRSF_ID, ANNOT_PREDICTED_GO_COMPONENT, ANNOT_PREDICTED_GO_FUNCTION, ANNOT_PREDICTED_GO_ID_COMPONENT, ANNOT_PREDICTED_GO_ID_FUNCTION, ANNOT_PREDICTED_GO_ID_PROCESS, ANNOT_PREDICTED_GO_PROCESS, ANNOT_PRIMARY_KEY, ANNOT_PROB_MAP, ANNOT_PROB_MCMC, ANNOT_PROSITEPROFILES_DESCRIPTION, ANNOT_PROSITEPROFILES_ID, ANNOT_PROTEIN_LENGTH, ANNOT_PROTEIN_SEQUENCE, ANNOT_PROTEIN_SOURCE_ID, ANNOT_PSEUDO_STRING, ANNOT_SEQUENCE_DATABASE_NAME, ANNOT_SEQUENCE_ID, ANNOT_SIGNALP_PEPTIDE, ANNOT_SMART_DESCRIPTION, ANNOT_SMART_ID, ANNOT_SNPOVERVIEW, ANNOT_SO_ID, ANNOT_SO_TERM_DEFINITION, ANNOT_SO_TERM_NAME, ANNOT_SO_VERSION, ANNOT_START_MIN, ANNOT_STRAND, ANNOT_STRAND_PLUS_MINUS, ANNOT_SUPERFAMILY_DESCRIPTION, ANNOT_SUPERFAMILY_ID, ANNOT_THREE_PRIME_UTR_LENGTH, ANNOT_TIGRFAM_DESCRIPTION, ANNOT_TIGRFAM_ID, ANNOT_TM_COUNT, ANNOT_TRANS_FOUND_PER_GENE_INTERNAL, ANNOT_TRANSCRIPT_INDEX_PER_GENE, ANNOT_TRANSCRIPT_LENGTH, ANNOT_TRANSCRIPT_LINK, ANNOT_TRANSCRIPT_PRODUCT, ANNOT_TRANSCRIPT_SEQUENCE, ANNOT_TRANSCRIPTS_FOUND_PER_GENE, ANNOT_UNIPROT_IDS, ANNOT_UNIPROT_LINKS
## 'select()' returned 1:1 mapping between keys and columns
first_spec <- make_rnaseq_spec()
second_spec <- make_dnaseq_spec()
pre_meta <- gather_preprocessing_metadata(
starting_metadata = starting_sheet, id_column = "hpgl_identifier",
specification = first_spec,
basedir = "preprocessing", species = "lpanamensis_v68",
new_metadata = "sample_sheets/cs_samples_lpanamensis.xlsx")## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## Writing new metadata to: sample_sheets/cs_samples_lpanamensis.xlsx
## Deleting the file sample_sheets/cs_samples_lpanamensis.xlsx before writing the tables.
post_meta <- gather_preprocessing_metadata(
starting_metadata = pre_meta[["new_meta"]], id_column = "hpgl_identifier",
specification = second_spec, basedir = "preprocessing", species = "lpanamensis_v68")## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: trimomatic_input already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: trimomatic_output already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: trimomatic_percent already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: fastqc_pct_gc already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: hisat_genome_single_concordant already exists,
## replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: hisat_genome_multi_concordant already exists,
## replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: hisat_genome_single_all already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: hisat_genome_multi_all already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: hisat_genome_percent_log already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: hisat_genome_percent_log already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: hisat_alignment already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: input_r1 already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: input_r2 already exists, replacing it.
## Warning in gather_preprocessing_metadata(starting_metadata = pre_meta[["new_meta"]], : Column: hisat_count_table already exists, replacing it.
## Not writing new metadata.
I should have all my load_xyz_annotation functions return some of the same elements in their retlists.
cs_se <- create_se(post_meta[["new_meta"]], gene_info = lp_genes,
species = NULL, file_column = "hisat_count_table") %>%
set_conditions(fact = "initial_recurrent") %>%
set_batches(fact = "participant_code")## Reading the sample metadata.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## The sample definitions comprises: 23 rows(samples) and 111 columns(metadata fields).
## Matched 8665 annotations and counts.
## Saving the summarized experiment to 'se.rda'.
## The final summarized experiment has 8665 rows and 111 columns.
## The numbers of samples by condition are:
##
## I R
## 16 7
## The number of samples by batch are:
##
## PP1002 PP1003 PP1004 PP1007 PP1012 PP1013 PP1014 PP1015 PP1017 PP1018 PP2004 PP2008 PP2012 PP2015 PP2016 PP2019
## 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## The colors used in the expressionset are: #1B9E77, #7570B3.
## Library sizes of 23 samples,
## ranging from 3,530,715 to 7,324,639.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the hpgltools package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## A non-zero genes plot of 23 samples.
## These samples have an average 5.022 CPM coverage and 8558 genes observed, ranging from 8534 to
## 8587.
## The numbers of samples by condition are:
##
## PP1002 PP1003 PP1004 PP1007 PP1012 PP1013 PP1014 PP1015 PP1017 PP1018 PP2004 PP2008 PP2012 PP2015 PP2016 PP2019
## 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
cs_paired <- subset_se(tmp_cs, min_replicates = 2) %>%
set_conditions(fact = "initial_recurrent") %>%
subset_se(subset = "batch!='PP1012'")## Removing samples with less than 2 replicates.
## Removed: PRCS0015, PRCS0016, PRCS0017, PRCS0018, PRCS0019, PRCS0020, PRCS0021, PRCS0022, PRCS0023.
## The numbers of samples by condition are:
##
## I R
## 7 7
cs_paired_norm <- normalize(cs_paired, transform = "log2", convert = "cpm",
norm = "quant", filter = TRUE)## Removing 145 low-count genes (8520 remaining).
## transform_counts: Found 12 values equal to 0, adding 1 to the matrix.
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by I, R
## Shapes are defined by PP1002, PP1003, PP1004, PP1007, PP1013, PP1014.
cs_paired_combat <- normalize(cs_paired, transform = "log2", convert = "cpm",
filter = TRUE, batch = "combat")## Removing 145 low-count genes (8520 remaining).
## transform_counts: Found 45 values less than 0.
## transform_counts: Found 100 values equal to 0, adding 1 to the matrix.
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by I, R
## Shapes are defined by PP1002, PP1003, PP1004, PP1007, PP1013, PP1014.
cs_paired_nb <- normalize(cs_paired, transform = "log2", convert = "cpm",
filter = TRUE, batch = "sva")## Removing 145 low-count genes (8520 remaining).
## transform_counts: Found 10 values less than 0.
## transform_counts: Found 10 values equal to 0, adding 1 to the matrix.
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by I, R
## Shapes are defined by PP1002, PP1003, PP1004, PP1007, PP1013, PP1014.
## Writing the first sheet, containing a legend and some summary data.
## Warning in as.data.frame.DataFrame(pData(se), strinsAsFactors = FALSE): arguments in '...' ignored
## 1344 entries are 0. We are on a log scale, adding 1 to the data.
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the directlabels package.
## Please report the issue at <https://github.com/tdhock/directlabels/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plot describing the gene distribution from a dataset.
## `geom_smooth()` using formula = 'y ~ x'
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## Removing 145 low-count genes (8520 remaining).
## transform_counts: Found 78 values equal to 0, adding 1 to the matrix.
## Plot describing the gene distribution from a dataset.
## `geom_smooth()` using formula = 'y ~ x'The factor I has 6 rows.
## The factor R has 6 rows.
cs_paired_de <- all_pairwise(cs_paired, model_svs = "sva", model_fstring = "~ 0 + condition",
filter = TRUE)## I R
## 6 6
## Removing 145 low-count genes (8520 remaining).
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 238 entries to zero.
## This received a matrix of SVs.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## I R
## 6 6
## conditions
## I R
## 6 6
## conditions
## I R
## 6 6
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: sva.
## The primary analysis performed 1 comparisons.
## The logFC agreement among the methods follows:
## R_vs_I
## basic_vs_deseq 0.5107
## basic_vs_dream 0.4496
## basic_vs_ebseq 0.8951
## basic_vs_edger 0.5122
## basic_vs_limma 0.4956
## basic_vs_noiseq 0.9554
## deseq_vs_dream 0.8761
## deseq_vs_ebseq 0.5691
## deseq_vs_edger 0.9982
## deseq_vs_limma 0.8689
## deseq_vs_noiseq 0.5606
## dream_vs_ebseq 0.4742
## dream_vs_edger 0.8845
## dream_vs_limma 0.9741
## dream_vs_noiseq 0.4768
## ebseq_vs_edger 0.5722
## ebseq_vs_limma 0.4933
## ebseq_vs_noiseq 0.9343
## edger_vs_limma 0.8756
## edger_vs_noiseq 0.5620
## limma_vs_noiseq 0.4942
cs_paired_table <- combine_de_tables(
cs_paired_de,
excel = "excel/lp_cs_paired_initial_vs_recurrent_table.xlsx")## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup limma_sigdown
## 1 R_vs_I 0 0 0 0 0 0
## Only has information, cannot create an UpSet.
## Plot describing unique/shared genes in a differential expression table.
## NULL
cs_paired_sig <- extract_significant_genes(
cs_paired_table,
excel = "excel/cs_paired_initial_vs_recurrent_sig.xlsx")## Deleting the file excel/cs_paired_initial_vs_recurrent_sig.xlsx before writing the tables.
## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down ebseq_up ebseq_down basic_up basic_down
## R_vs_I 0 0 0 0 0 0 0 0 0 0
## I R
## 6 6
## PP1002 PP1003 PP1004 PP1007 PP1013 PP1014
## 2 2 2 2 2 2
## Warning: attributes are not identical across measure variables; they will be dropped
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the ggsankey package.
## Please report the issue at <https://github.com/davidsjoberg/ggsankey/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Removing 145 low-count genes (8520 remaining).
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## Warning in ggplot2::guide_legend(overwrite.aes = list(size = plot_size)): Arguments in `...` must be used.
## ✖ Problematic argument:
## • overwrite.aes = list(size = plot_size)
## ℹ Did you misspell an argument name?
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 238 entries to zero.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## I R
## 6 6
## conditions
## I R
## 6 6
## conditions
## I R
## 6 6
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: Existing surrogate matrix.
## The primary analysis performed 1 comparisons.
## The logFC agreement among the methods follows:
## R_vs_I
## basic_vs_deseq 0.9575
## basic_vs_dream 0.9472
## basic_vs_ebseq 0.8951
## basic_vs_edger 0.9614
## basic_vs_limma 0.9664
## basic_vs_noiseq 0.9554
## deseq_vs_dream 0.9682
## deseq_vs_ebseq 0.9280
## deseq_vs_edger 0.9994
## deseq_vs_limma 0.9465
## deseq_vs_noiseq 0.9782
## dream_vs_ebseq 0.9313
## dream_vs_edger 0.9703
## dream_vs_limma 0.9751
## dream_vs_noiseq 0.9554
## ebseq_vs_edger 0.9320
## ebseq_vs_limma 0.9220
## ebseq_vs_noiseq 0.9343
## edger_vs_limma 0.9496
## edger_vs_noiseq 0.9799
## limma_vs_noiseq 0.9385
## Deleting the file excel/lp_cs_paired_de_batch_in_model.xlsx before writing the tables.
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup limma_sigdown
## 1 R_vs_I 0 0 0 0 0 0
## Only has information, cannot create an UpSet.
## Plot describing unique/shared genes in a differential expression table.
## NULL
## classifications <- classify_variants(colData(cs_se), coverage_column = NULL)
round1_variant_se <- count_snps(cs_se, annot_column = "freebayes_variants_table",
snp_column = "PAIRED")## Using the snp column: PAIRED from the sample annotations.
## Warning: NAs introduced by coercion
colData(round1_variant_se)[["variant_number"]] <- "low"
high_variant_samples <- colData(round1_variant_se)[["freebayes_observed"]] >= 20000
colData(round1_variant_se)[high_variant_samples, "variant_number"] <- "high"
colData(round1_variant_se)[["person_time"]] <- paste0(
colData(round1_variant_se)[["participant_code"]], "_",
colData(round1_variant_se)[["initial_recurrent"]])
round1_variant_se <- set_batches(round1_variant_se, fact = "initial_recurrent") %>%
set_conditions(fact = "participant_code")## The number of samples by batch are:
##
## I R
## 16 7
## The numbers of samples by condition are:
##
## PP1002 PP1003 PP1004 PP1007 PP1012 PP1013 PP1014 PP1015 PP1017 PP1018 PP2004 PP2008 PP2012 PP2015 PP2016 PP2019
## 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
## Removing 0 low-count genes (146420 remaining).
## transform_counts: Found 2390744 values equal to 0, adding 1 to the matrix.
variant_pca <- plot_pca(variant_norm)
## variant_pca <- plot_pca(variant_norm, plot_labels = TRUE, max_overlaps = 1000)
pp(file = "images/variants_pca.png")
variant_pca$plot
dev.off()## png
## 2
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by PP1002, PP1003, PP1004, PP1007, PP1012, PP1013, PP1014, PP1015, PP1017, PP1018, PP2004, PP2008, PP2012, PP2015, PP2016, PP2019
## Shapes are defined by I, R.
## Removing samples with less than 2 replicates.
## Removed: PRCS0015, PRCS0016, PRCS0017, PRCS0018, PRCS0019, PRCS0020, PRCS0021, PRCS0022, PRCS0023.
## Removing 0 low-count genes (146420 remaining).
## transform_counts: Found 1532229 values equal to 0, adding 1 to the matrix.
paired_pca <- plot_pca(paired_norm, plot_labels = TRUE)
pp(file = "images/paired_pca.png")
paired_pca$plot
dev.off()## png
## 2
## Error:
## ! object 'paried_pca' not found
## The numbers of samples by condition are:
##
## PP1002 PP1003 PP1004 PP1007 PP1012 PP1013 PP1014 PP1015 PP1017 PP1018 PP2004 PP2008 PP2012 PP2015 PP2016 PP2019
## 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
paired <- c("PP1002", "PP1003", "PP1007", "PP1012", "PP1013", "PP1014")
paired_z23 <- c("PP1013", "PP1007")
paired_z22 <- c("PP1014", "PP1003", "PP1004")
paired_z21 <- c("PP1002")
paired_z23_samples <- colData(round1_variant_se)[["participant_code"]] %in% paired_z23
paired_z23_variant <- round1_variant_se[, paired_z23_samples]
paired_z22_samples <- colData(round1_variant_se)[["participant_code"]] %in% paired_z22
paired_z22_variant <- round1_variant_se[, paired_z22_samples]
paired_z21_samples <- colData(round1_variant_se)[["participant_code"]] %in% paired_z21
paired_z21_variant <- round1_variant_se[, paired_z21_samples]## The samples represent the following categories:
##
## PP1007_I PP1007_R PP1013_I PP1013_R
## 1 1 1 1
## Using a proportion of observed variants, converting the data to binary observations.
## The factor PP1007_I has only 1 row.
## The factor PP1007_R has only 1 row.
## The factor PP1013_I has only 1 row.
## The factor PP1013_R has only 1 row.
## Finished iterating over the chromosomes.
## $`0000`
## [1] ""
##
## $`1000`
## [1] "PP1007_I"
##
## $`0100`
## [1] "PP1007_R"
##
## $`1100`
## [1] "PP1007_I, PP1007_R"
##
## $`0010`
## [1] "PP1013_I"
##
## $`1010`
## [1] "PP1007_I, PP1013_I"
##
## $`0110`
## [1] "PP1007_R, PP1013_I"
##
## $`1110`
## [1] "PP1007_I, PP1007_R, PP1013_I"
##
## $`0001`
## [1] "PP1013_R"
##
## $`1001`
## [1] "PP1007_I, PP1013_R"
##
## $`0101`
## [1] "PP1007_R, PP1013_R"
##
## $`1101`
## [1] "PP1007_I, PP1007_R, PP1013_R"
##
## $`0011`
## [1] "PP1013_I, PP1013_R"
##
## $`1011`
## [1] "PP1007_I, PP1013_I, PP1013_R"
##
## $`0111`
## [1] "PP1007_R, PP1013_I, PP1013_R"
##
## $`1111`
## [1] "PP1007_I, PP1007_R, PP1013_I, PP1013_R"
## 1010 are the two initial samples and 0101 the recurrent
z23_shared_initial <- paired_z23_snp_sets[["intersections"]][["1010"]]
length(z23_shared_initial)## [1] 1252
z23_shared_recurrent <- paired_z23_snp_sets[["intersections"]][["0101"]]
length(z23_shared_recurrent)## [1] 1060
person_1107_vars <- paired_z23_snp_sets[["intersections"]][["1100"]]
person_1013_vars <- paired_z23_snp_sets[["intersections"]][["0011"]]
z23_proportion <- get_proportion_snp_sets(paired_z23_variant, factor = "person_time")## The samples represent the following categories:
##
## PP1007_I PP1007_R PP1013_I PP1013_R
## 1 1 1 1
## Using a proportion column to recast each position as a categorical, 0 (ref), 1 (heter), 2 (homo)
## The factor PP1007_I has only 1 row.
## The factor PP1007_R has only 1 row.
## The factor PP1013_I has only 1 row.
## The factor PP1013_R has only 1 row.
## Gathering the set of exclusive homozygous and heterozygous positions.
## Counting observed positions across 690 chromosomes/contigs.
## Counting homozygous positions by chromosome/contig.
## Counting heterozygous positions by chromosome/contig.
## Counting exclusive homozygous positions by chromosome/contig.
## Counting exclusive heterozygous positions by chromosome/contig.
## Counting exclusive notzygous positions by chromosome/contig.
z23_intersections <- snps_intersections(cs_se, paired_z23_snp_sets, start_column = "annot_coding_start",
end_column = "annot_coding_end", chr_column = "annot_sequence_id")
## The top few genes with variants shared exclusively among the initial samples:
head(z23_intersections$gene_summaries[["PP1007_I, PP1013_I"]])## LPAL13_270011800 LPAL13_090013100 LPAL13_250017600 LPAL13_270011600 LPAL13_310018100 LPAL13_060013800
## 11 6 6 6 5 4
## LPAL13_000024200 LPAL13_100010100 LPAL13_260025500 LPAL13_260026900 LPAL13_330040200 LPAL13_340010300
## 3 2 2 2 2 2
z23_vs_genes <- snps_vs_genes(cs_se, paired_z23_snp_sets, start_column = "annot_coding_start",
end_column = "annot_coding_end", chr_column = "annot_sequence_id")## The snp grange data has 146420 elements.
## The first few snp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000002, LPAL13_SCAF000003, LPAL13_SCAF000005, LPAL13_SCAF000007, LPAL13_SCAF000009
## The first few exp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000010, LPAL13_SCAF000011, LPAL13_SCAF000017, LPAL13_SCAF000021, LPAL13_SCAF000025
## There are 52510 overlapping variants and genes.
## LPAL13_090013100 LPAL13_340060500 LPAL13_200008400 LPAL13_350020500 LPAL13_330040200 LPAL13_270011800
## 97 86 85 73 71 70
## The samples represent the following categories:
##
## PP1003_I PP1003_R PP1004_I PP1004_R PP1014_I PP1014_R
## 1 1 1 1 1 1
## Using a proportion of observed variants, converting the data to binary observations.
## The factor PP1003_I has only 1 row.
## The factor PP1003_R has only 1 row.
## The factor PP1004_I has only 1 row.
## The factor PP1004_R has only 1 row.
## The factor PP1014_I has only 1 row.
## The factor PP1014_R has only 1 row.
## Finished iterating over the chromosomes.
## $`000000`
## [1] ""
##
## $`100000`
## [1] "PP1003_I"
##
## $`010000`
## [1] "PP1003_R"
##
## $`110000`
## [1] "PP1003_I, PP1003_R"
##
## $`001000`
## [1] "PP1004_I"
##
## $`101000`
## [1] "PP1003_I, PP1004_I"
##
## $`011000`
## [1] "PP1003_R, PP1004_I"
##
## $`111000`
## [1] "PP1003_I, PP1003_R, PP1004_I"
##
## $`000100`
## [1] "PP1004_R"
##
## $`100100`
## [1] "PP1003_I, PP1004_R"
##
## $`010100`
## [1] "PP1003_R, PP1004_R"
##
## $`110100`
## [1] "PP1003_I, PP1003_R, PP1004_R"
##
## $`001100`
## [1] "PP1004_I, PP1004_R"
##
## $`101100`
## [1] "PP1003_I, PP1004_I, PP1004_R"
##
## $`011100`
## [1] "PP1003_R, PP1004_I, PP1004_R"
##
## $`111100`
## [1] "PP1003_I, PP1003_R, PP1004_I, PP1004_R"
##
## $`000010`
## [1] "PP1014_I"
##
## $`100010`
## [1] "PP1003_I, PP1014_I"
##
## $`010010`
## [1] "PP1003_R, PP1014_I"
##
## $`110010`
## [1] "PP1003_I, PP1003_R, PP1014_I"
##
## $`001010`
## [1] "PP1004_I, PP1014_I"
##
## $`101010`
## [1] "PP1003_I, PP1004_I, PP1014_I"
##
## $`011010`
## [1] "PP1003_R, PP1004_I, PP1014_I"
##
## $`111010`
## [1] "PP1003_I, PP1003_R, PP1004_I, PP1014_I"
##
## $`000110`
## [1] "PP1004_R, PP1014_I"
##
## $`100110`
## [1] "PP1003_I, PP1004_R, PP1014_I"
##
## $`010110`
## [1] "PP1003_R, PP1004_R, PP1014_I"
##
## $`110110`
## [1] "PP1003_I, PP1003_R, PP1004_R, PP1014_I"
##
## $`001110`
## [1] "PP1004_I, PP1004_R, PP1014_I"
##
## $`101110`
## [1] "PP1003_I, PP1004_I, PP1004_R, PP1014_I"
##
## $`011110`
## [1] "PP1003_R, PP1004_I, PP1004_R, PP1014_I"
##
## $`111110`
## [1] "PP1003_I, PP1003_R, PP1004_I, PP1004_R, PP1014_I"
##
## $`000001`
## [1] "PP1014_R"
##
## $`100001`
## [1] "PP1003_I, PP1014_R"
##
## $`010001`
## [1] "PP1003_R, PP1014_R"
##
## $`110001`
## [1] "PP1003_I, PP1003_R, PP1014_R"
##
## $`001001`
## [1] "PP1004_I, PP1014_R"
##
## $`101001`
## [1] "PP1003_I, PP1004_I, PP1014_R"
##
## $`011001`
## [1] "PP1003_R, PP1004_I, PP1014_R"
##
## $`111001`
## [1] "PP1003_I, PP1003_R, PP1004_I, PP1014_R"
##
## $`000101`
## [1] "PP1004_R, PP1014_R"
##
## $`100101`
## [1] "PP1003_I, PP1004_R, PP1014_R"
##
## $`010101`
## [1] "PP1003_R, PP1004_R, PP1014_R"
##
## $`110101`
## [1] "PP1003_I, PP1003_R, PP1004_R, PP1014_R"
##
## $`001101`
## [1] "PP1004_I, PP1004_R, PP1014_R"
##
## $`101101`
## [1] "PP1003_I, PP1004_I, PP1004_R, PP1014_R"
##
## $`011101`
## [1] "PP1003_R, PP1004_I, PP1004_R, PP1014_R"
##
## $`111101`
## [1] "PP1003_I, PP1003_R, PP1004_I, PP1004_R, PP1014_R"
##
## $`000011`
## [1] "PP1014_I, PP1014_R"
##
## $`100011`
## [1] "PP1003_I, PP1014_I, PP1014_R"
##
## $`010011`
## [1] "PP1003_R, PP1014_I, PP1014_R"
##
## $`110011`
## [1] "PP1003_I, PP1003_R, PP1014_I, PP1014_R"
##
## $`001011`
## [1] "PP1004_I, PP1014_I, PP1014_R"
##
## $`101011`
## [1] "PP1003_I, PP1004_I, PP1014_I, PP1014_R"
##
## $`011011`
## [1] "PP1003_R, PP1004_I, PP1014_I, PP1014_R"
##
## $`111011`
## [1] "PP1003_I, PP1003_R, PP1004_I, PP1014_I, PP1014_R"
##
## $`000111`
## [1] "PP1004_R, PP1014_I, PP1014_R"
##
## $`100111`
## [1] "PP1003_I, PP1004_R, PP1014_I, PP1014_R"
##
## $`010111`
## [1] "PP1003_R, PP1004_R, PP1014_I, PP1014_R"
##
## $`110111`
## [1] "PP1003_I, PP1003_R, PP1004_R, PP1014_I, PP1014_R"
##
## $`001111`
## [1] "PP1004_I, PP1004_R, PP1014_I, PP1014_R"
##
## $`101111`
## [1] "PP1003_I, PP1004_I, PP1004_R, PP1014_I, PP1014_R"
##
## $`011111`
## [1] "PP1003_R, PP1004_I, PP1004_R, PP1014_I, PP1014_R"
##
## $`111111`
## [1] "PP1003_I, PP1003_R, PP1004_I, PP1004_R, PP1014_I, PP1014_R"
## 101010 are the initials once again
## 010101 are the recurrent samples once again
z22_shared_initial <- paired_z22_snp_sets[["intersections"]][["101010"]]
length(z22_shared_initial)## [1] 25
z22_shared_recurrent <- paired_z22_snp_sets[["intersections"]][["010101"]]
length(z22_shared_recurrent)## [1] 28
## The samples represent the following categories:
##
## PP1003_I PP1003_R PP1004_I PP1004_R PP1014_I PP1014_R
## 1 1 1 1 1 1
## Using a proportion column to recast each position as a categorical, 0 (ref), 1 (heter), 2 (homo)
## The factor PP1003_I has only 1 row.
## The factor PP1003_R has only 1 row.
## The factor PP1004_I has only 1 row.
## The factor PP1004_R has only 1 row.
## The factor PP1014_I has only 1 row.
## The factor PP1014_R has only 1 row.
## Gathering the set of exclusive homozygous and heterozygous positions.
## Counting observed positions across 690 chromosomes/contigs.
## Counting homozygous positions by chromosome/contig.
## Counting heterozygous positions by chromosome/contig.
## Counting exclusive homozygous positions by chromosome/contig.
## Counting exclusive heterozygous positions by chromosome/contig.
## Counting exclusive notzygous positions by chromosome/contig.
z22_intersections <- snps_intersections(cs_se, paired_z22_snp_sets, start_column = "annot_coding_start",
end_column = "annot_coding_end", chr_column = "annot_sequence_id")
z22_vs_genes <- snps_vs_genes(cs_se, paired_z22_snp_sets, start_column = "annot_coding_start",
end_column = "annot_coding_end", chr_column = "annot_sequence_id")## The snp grange data has 146420 elements.
## The first few snp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000002, LPAL13_SCAF000003, LPAL13_SCAF000005, LPAL13_SCAF000007, LPAL13_SCAF000009
## The first few exp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000010, LPAL13_SCAF000011, LPAL13_SCAF000017, LPAL13_SCAF000021, LPAL13_SCAF000025
## There are 52510 overlapping variants and genes.
## The samples represent the following categories:
##
## PP1002_I PP1002_R
## 1 1
## Using a proportion column to recast each position as a categorical, 0 (ref), 1 (heter), 2 (homo)
## The factor PP1002_I has only 1 row.
## The factor PP1002_R has only 1 row.
## Gathering the set of exclusive homozygous and heterozygous positions.
## Counting observed positions across 690 chromosomes/contigs.
## Counting homozygous positions by chromosome/contig.
## Counting heterozygous positions by chromosome/contig.
## Counting exclusive homozygous positions by chromosome/contig.
## Counting exclusive heterozygous positions by chromosome/contig.
## Counting exclusive notzygous positions by chromosome/contig.
In the following I hope to perform comparisons of the initial->recurrent comparisons of each individual against his/her own initial strain. I therefore created a series of files with high/low stringency filters for the AB tag produced by freebayes.
The relevant columns are ‘controlled_strain_high_filter’ and ‘controlled_strain_low_filter’.
## Note to self, 'high_filter' is the low stringency.
round2_high_variants <- count_snps(cs_se, annot_column = "controlled_strain_high_filter",
snp_column = "AB", scale = c(1,0)) %>%
set_batches(fact = "pca_zymodeme") %>%
set_conditions(fact = "initial_recurrent")## Using the snp column: AB from the sample annotations.
## The number of samples by batch are:
##
## Z2.1 Z2.2 Z2.3
## 3 6 5
## The numbers of samples by condition are:
##
## I R
## 7 7
## Removing 0 low-count genes (34504 remaining).
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by I, R
## Shapes are defined by Z2.1, Z2.2, Z2.3.
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable convergence failure
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable convergence failure
## The samples represent the following categories:
##
## I R
## 7 7
## Using a proportion of observed variants, converting the data to binary observations.
## The factor I has 7 rows.
## The factor R has 7 rows.
## Finished iterating over the chromosomes.
Ok, I have a problem, the AB column is from 0->1 where 0 is a high-confidence hit. I think, therefore, that I must do a 1-x to the result of count_snps.
round2_low_variants <- count_snps(cs_se, annot_column = "controlled_strain_low_filter",
snp_column = "AB", scale = c(1,0)) %>%
set_batches(fact = "pca_zymodeme")## Using the snp column: AB from the sample annotations.
## The number of samples by batch are:
##
## Z2.1 Z2.2 Z2.3
## 3 6 5
## Library sizes of 14 samples,
## ranging from 335.2 to 6,007.
round2_norm <- normalize(round2_low_variants, convert = "cpm", filter = TRUE, plot_labels = TRUE,
expt_names = "participant_code", max_overlaps = 1000)## Removing 0 low-count genes (18764 remaining).
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by I, R
## Shapes are defined by Z2.1, Z2.2, Z2.3.
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable convergence failure
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable convergence failure
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable convergence failure
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable convergence failure
Collect the variants by initial/recurrent
## The samples represent the following categories:
##
## I R
## 7 7
## Using a proportion of observed variants, converting the data to binary observations.
## The factor I has 7 rows.
## The factor R has 7 rows.
## Finished iterating over the chromosomes.
## $`00`
## [1] ""
##
## $`10`
## [1] "I"
##
## $`01`
## [1] "R"
##
## $`11`
## [1] "I, R"
## [1] "chr_LpaL13-20.1_pos_347915_ref_G_alt_T" "chr_LpaL13-20.1_pos_347924_ref_G_alt_T" "chr_LpaL13-34_pos_1237188_ref_T_alt_G"
## NULL
## [1] "chr_LpaL13-26_pos_938857_ref_C_alt_A"
low_sets provides a set of coordinates and their state with respect to if they were observed in only initial/recurrent/both.
get_proportion_snp_sets() should give us a catalog of variants by chromosome/contig with respect to sample type. There is a bit of a problem here: usually the tag I use is higher for more confident values, but AB is lower for more confident values; I need to do something to fix this…
## The samples represent the following categories:
##
## I R
## 7 7
## Using a proportion column to recast each position as a categorical, 0 (ref), 1 (heter), 2 (homo)
## The factor I has 7 rows.
## The factor R has 7 rows.
## Gathering the set of exclusive homozygous and heterozygous positions.
## Counting observed positions across 468 chromosomes/contigs.
## Counting homozygous positions by chromosome/contig.
## Counting heterozygous positions by chromosome/contig.
## Counting exclusive homozygous positions by chromosome/contig.
## Counting exclusive heterozygous positions by chromosome/contig.
## Counting exclusive notzygous positions by chromosome/contig.
low_intersection <- snps_intersections(
cs_se, low_sets, start_column = "annot_coding_start",
end_column = "annot_coding_end", chr_column = "annot_sequence_id")
low_intersection## The combinations of variants, chromosomes, and genes which are unique to every factor
## and combination of factors in the data.
low_vs_genes <- snps_vs_genes(cs_se, low_sets, start_column = "annot_coding_start",
end_column = "annot_coding_end", chr_column = "annot_sequence_id")## The snp grange data has 18764 elements.
## The first few snp chromosomes are: LPAL13_SCAF000002, LPAL13_SCAF000003, LPAL13_SCAF000005, LPAL13_SCAF000009, LPAL13_SCAF000010, LPAL13_SCAF000011
## The first few exp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000010, LPAL13_SCAF000011, LPAL13_SCAF000017, LPAL13_SCAF000021, LPAL13_SCAF000025
## There are 5598 overlapping variants and genes.