Did the stuff on this morning’s TODO which came out of this morning’s meeting: do a PCA without the oddball strains (already done in the worksheet), highlight reference strains, and add L.major IDs and Descriptions (done by appending a collapsed version of the ortholog data to the all_lp_annot data).
Fixed human IDs for the macrophage data.
Changed input metadata sheets: primarily because I only remembered yesterday to finish the SL search for samples >TMRC20095. They are running now and will be added momentarily (I will have to redownload the sheet).
Setting up to make a hclust/phylogenetic tree of strains, use these are reference: 2168(2.3), 2272(2.2), for other 2.x choose arbitrarily (lower numbers are better).
Added another sanitize columns call for Antimony vs. antimony and None vs. none in the TMRC2 macrophage samples.
This document is intended to create the data structures used to evaluate our TMRC2 samples. In some cases, this includes only those samples starting in 2019; in other instances I am including our previous (2015-2016) samples.
In all cases the processing performed was:
I am thinking that this meeting will bring Maria Adelaida fully back into the analyses of the parasite data, and therefore may focus primarily on the goals rather than the analyses?
In a couple of important ways the TMRC2 data is much more complex than the TMRC3:
Our shared online sample sheet is nearly static at the time of this writing (202209), I expect at this point the only likely updates will be to annotate some strains as more or less susceptible to drug treatment.
sample_sheet <- glue("sample_sheets/tmrc2_samples_{ver}.xlsx")
macrophage_sheet <- glue("sample_sheets/tmrc2_macrophage_samples_{ver}.xlsx"
## Error: <text>:3:0: unexpected end of input
## 1: sample_sheet <- glue("sample_sheets/tmrc2_samples_{ver}.xlsx")
## 2: macrophage_sheet <- glue("sample_sheets/tmrc2_macrophage_samples_{ver}.xlsx"
## ^
Everything which follows depends on the Existing TriTrypDB annotations revision 46, circa 2019. The following block loads a database of these annotations and turns it into a matrix where the rows are genes and columns are all the annotation types provided by TriTrypDB.
The same database was used to create a matrix of orthologous genes between L.panamensis and all of the other species in the TriTrypDB.
The same database of annotations also provides mappings to the set of annotated GO categories for the L.panamensis genome along with gene lengths.
tt <- sm(library(EuPathDB))
orgdb <- "org.Lpanamensis.MHOMCOL81L13.v46.eg.db"
tt <- sm(library(orgdb, character.only = TRUE))
pan_db <- org.Lpanamensis.MHOMCOL81L13.v46.eg.db
all_fields <- columns(pan_db)
all_lp_annot <- sm(load_orgdb_annotations(
pan_db,
keytype = "gid",
fields = c("annot_gene_entrez_id", "annot_gene_name",
"annot_strand", "annot_chromosome", "annot_cds_length",
"annot_gene_product")))$genes
lp_go <- sm(load_orgdb_go(pan_db))
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths) <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
orthos <- sm(EuPathDB::extract_eupath_orthologs(db = pan_db))
Recently there was a request to include the Leishmania major gene IDs and descriptions. Thus I will extract them along with the orthologs and append that to the annotations used.
Having spent the time to run the following code, I realized that the orthologs data structure above actually already has the gene IDs and descriptions.
Thus I will leave my query in place to extract the major annotations, but follow it up with a collapse of the major orthologs and appending of that to the panamensis annotations.
orgdb <- "org.Lmajor.Friedlin.v49.eg.db"
tt <- sm(library(orgdb, character.only = TRUE))
major_db <- org.Lmajor.Friedlin.v49.eg.db
all_fields <- columns(pan_db)
all_lm_annot <- sm(load_orgdb_annotations(
major_db,
keytype = "gid",
fields = c("annot_gene_entrez_id", "annot_gene_name",
"annot_strand", "annot_chromosome", "annot_cds_length",
"annot_gene_product")))$genes
wanted_orthos_idx <- orthos[["ORTHOLOGS_SPECIES"]] == "Leishmania major strain Friedlin"
sum(wanted_orthos_idx)
## [1] 10796
wanted_orthos <- orthos[wanted_orthos_idx, ]
wanted_orthos <- wanted_orthos[, c("GID", "ORTHOLOGS_ID", "ORTHOLOGS_NAME")]
collapsed_orthos <- wanted_orthos %>%
group_by(GID) %>%
summarise(collapsed_id = stringr::str_c(ORTHOLOGS_ID, collapse=" ; "),
collapsed_name = stringr::str_c(ORTHOLOGS_NAME, collapse=" ; "))
all_lp_annot <- merge(all_lp_annot, collapsed_orthos, by.x = "row.names",
by.y = "GID", all.x = TRUE)
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot")
The following block loads the full genome sequence for panamensis. We may use this later to attempt to estimate PCR primers to discern strains.
meta <- sm(EuPathDB::download_eupath_metadata(webservice = "tritrypdb"))
lp_entry <- EuPathDB::get_eupath_entry(species = "Leishmania panamensis", metadata = meta)
## Found the following hits: Leishmania panamensis MHOM/COL/81/L13, Leishmania panamensis strain MHOM/PA/94/PSC-1, choosing the first.
## Using: Leishmania panamensis MHOM/COL/81/L13.
colnames(lp_entry)
## [1] "AnnotationVersion" "AnnotationSource" "BiocVersion"
## [4] "DataProvider" "Genome" "GenomeSource"
## [7] "GenomeVersion" "NumArrayGene" "NumChipChipGene"
## [10] "NumChromosome" "NumCodingGene" "NumCommunity"
## [13] "NumContig" "NumEC" "NumEST"
## [16] "NumGene" "NumGO" "NumOrtholog"
## [19] "NumOtherGene" "NumPopSet" "NumProteomics"
## [22] "NumPseudogene" "NumRNASeq" "NumRTPCR"
## [25] "NumSNP" "NumTFBS" "Organellar"
## [28] "ReferenceStrain" "MegaBP" "PrimaryKey"
## [31] "ProjectID" "RecordClassName" "SourceID"
## [34] "SourceVersion" "TaxonomyID" "TaxonomyName"
## [37] "URLGenome" "URLGFF" "URLProtein"
## [40] "Coordinate_1_based" "Maintainer" "SourceUrl"
## [43] "Tags" "BsgenomePkg" "GrangesPkg"
## [46] "OrganismdbiPkg" "OrgdbPkg" "TxdbPkg"
## [49] "Taxon" "Genus" "Species"
## [52] "Strain" "BsgenomeFile" "GrangesFile"
## [55] "OrganismdbiFile" "OrgdbFile" "TxdbFile"
## [58] "GenusSpecies" "TaxonUnmodified" "TaxonCanonical"
## [61] "TaxonXref"
testing_panamensis <- "BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53"
## testing_panamensis <- EuPathDB::make_eupath_bsgenome(entry = lp_entry, eu_version = "v46")
library(as.character(testing_panamensis), character.only = TRUE)
## Loading required package: BSgenome
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Loading required package: rtracklayer
lp_genome <- get0(as.character(testing_panamensis))
data_structures <- c(data_structures, "lp_genome")
The process of sample estimation takes two primary inputs:
An expressionSet(or summarizedExperiment) is a data structure used in R to examine RNASeq data. It is comprised of annotations, metadata, and expression data. In the case of our processing pipeline, the location of the expression data is provided by the filenames in the metadata.
The following samples are much lower coverage:
The following list contains the colors we have chosen to use when plotting the various ways of discerning the data.
color_choices <- list(
"strain" = list(
## "z1.0" = "#333333", ## Changed this to 'braz' to make it easier to find them.
"z2.0" = "#555555",
"z3.0" = "#777777",
"z2.1" = "#874400",
"z2.2" = "#0000cc",
"z2.3" = "#cc0000",
"z2.4" = "#df7000",
"z3.2" = "#888888",
"z1.0" = "#cc00cc",
"z1.5" = "#cc00cc",
"b2904" = "#cc00cc",
"unknown" = "#cbcbcb"),
## "null" = "#000000"),
"cf" = list(
"cure" = "#006f00",
"fail" = "#9dffa0",
"unknown" = "#cbcbcb",
"notapplicable" = "#000000"),
"susceptibility" = list(
"resistant" = "#8563a7",
"sensitive" = "#8d0000",
"ambiguous" = "#cbcbcb",
"unknown" = "#555555"))
data_structures <- c(data_structures, "color_choices")
The data structure ‘lp_expt’ contains the data for all samples which have hisat2 count tables, and which pass a few initial quality tests (e.g. they must have more than 8550 genes with >0 counts and >5e6 reads which mapped to a gene); genes which are annotated with a few key redundant categories (leishmanolysin for example) are also culled.
There are a few metadata columns which we really want to make certain are standardized.
sanitize_columns <- c("passagenumber", "clinicalresponse", "clinicalcategorical",
"zymodemecategorical")
lp_expt <- create_expt(sample_sheet,
gene_info = all_lp_annot,
annotation_name = orgdb,
savefile = glue::glue("rda/tmrc2_lp_expt_all_raw-v{ver}.rda"),
id_column = "hpglidentifier",
file_column = "lpanamensisv36hisatfile") %>%
set_expt_conditions(fact = "zymodemecategorical") %>%
subset_expt(nonzero = 8550) %>%
set_expt_colors(color_choices[["strain"]]) %>%
subset_expt(coverage = 5000000) %>%
set_expt_colors(color_choices[["strain"]]) %>%
semantic_expt_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
semantic_column = "annot_gene_product") %>%
sanitize_expt_metadata(columns = sanitize_columns) %>%
set_expt_factors(columns = sanitize_columns, class = "factor")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'fData': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'exprs': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'expt' in selecting a method for function 'subset_expt': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'expt' in selecting a method for function 'subset_expt': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'sample_sheet' not found
data_structures <- c(data_structures, "lp_expt")
save(list = "lp_expt", file = glue::glue("rda/tmrc2_lp_expt_all_sanitized-v{ver}.rda"))
## Error in save(list = "lp_expt", file = glue::glue("rda/tmrc2_lp_expt_all_sanitized-v{ver}.rda")): object 'lp_expt' not found
table(pData(lp_expt)[["zymodemecategorical"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_expt' not found
table(pData(lp_expt)[["clinicalresponse"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_expt' not found
ncol(exprs(lp_expt))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'ncol': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'lp_expt' not found
All the following data will derive from this starting point.
Column ‘Q’ in the sample sheet, make a categorical version of it with these parameters:
Note that these cutoffs are only valid for the historical data. The newer susceptibility data uses a cutoff of 0.78 for sensitive. I will set ambiguous to 0.5 to 0.78?
max_resist_historical <- 0.35
min_sensitive_historical <- 0.49
max_resist_current <- 0.50
min_sensitive_current <- 0.77
The sanitize_percent() function seeks to make the percentage values recorded by excel more reliable. Unfortunately, sometimes excel displays the value ‘49%’ when the information recorded in the worksheet is any one of the following:
Thus, the following block will sanitize these percentage values into a single decimal number and make a categorical variable from it using pre-defined values for resistant/ambiguous/sensitive. This categorical variable will be stored in a new column: ‘sus_category_historical’.
st <- pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvhistoricaldata"]]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_expt' not found
starting <- sanitize_percent(st)
## Error in is.factor(x): object 'st' not found
st
## Error in eval(expr, envir, enclos): object 'st' not found
starting
## Error in eval(expr, envir, enclos): object 'starting' not found
sus_categorical <- starting
## Error in eval(expr, envir, enclos): object 'starting' not found
na_idx <- is.na(starting)
## Error in eval(expr, envir, enclos): object 'starting' not found
sum(na_idx)
## Error in eval(expr, envir, enclos): object 'na_idx' not found
sus_categorical[na_idx] <- "unknown"
## Error in sus_categorical[na_idx] <- "unknown": object 'sus_categorical' not found
resist_idx <- starting <= max_resist_historical
## Error in eval(expr, envir, enclos): object 'starting' not found
sus_categorical[resist_idx] <- "resistant"
## Error in sus_categorical[resist_idx] <- "resistant": object 'sus_categorical' not found
indeterminant_idx <- starting > max_resist_historical &
starting < min_sensitive_historical
## Error in eval(expr, envir, enclos): object 'starting' not found
sus_categorical[indeterminant_idx] <- "ambiguous"
## Error in sus_categorical[indeterminant_idx] <- "ambiguous": object 'sus_categorical' not found
susceptible_idx <- starting >= min_sensitive_historical
## Error in eval(expr, envir, enclos): object 'starting' not found
sus_categorical[susceptible_idx] <- "sensitive"
## Error in sus_categorical[susceptible_idx] <- "sensitive": object 'sus_categorical' not found
sus_categorical <- as.factor(sus_categorical)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.factor': object 'sus_categorical' not found
pData(lp_expt)[["sus_category_historical"]] <- sus_categorical
## Error in eval(expr, envir, enclos): object 'sus_categorical' not found
table(sus_categorical)
## Error in eval(quote(list(...)), env): object 'sus_categorical' not found
The same process will be repeated for the current iteration of the sensitivity assay and stored in the ‘sus_category_current’ column.
starting_current <- sanitize_percent(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvcurrentdata"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_expt' not found
sus_categorical_current <- starting_current
## Error in eval(expr, envir, enclos): object 'starting_current' not found
na_idx <- is.na(starting_current)
## Error in eval(expr, envir, enclos): object 'starting_current' not found
sum(na_idx)
## Error in eval(expr, envir, enclos): object 'na_idx' not found
sus_categorical_current[na_idx] <- "unknown"
## Error in sus_categorical_current[na_idx] <- "unknown": object 'sus_categorical_current' not found
resist_idx <- starting_current <= max_resist_current
## Error in eval(expr, envir, enclos): object 'starting_current' not found
sus_categorical_current[resist_idx] <- "resistant"
## Error in sus_categorical_current[resist_idx] <- "resistant": object 'sus_categorical_current' not found
indeterminant_idx <- starting_current > max_resist_current &
starting_current < min_sensitive_current
## Error in eval(expr, envir, enclos): object 'starting_current' not found
sus_categorical_current[indeterminant_idx] <- "ambiguous"
## Error in sus_categorical_current[indeterminant_idx] <- "ambiguous": object 'sus_categorical_current' not found
susceptible_idx <- starting_current >= min_sensitive_current
## Error in eval(expr, envir, enclos): object 'starting_current' not found
sus_categorical_current[susceptible_idx] <- "sensitive"
## Error in sus_categorical_current[susceptible_idx] <- "sensitive": object 'sus_categorical_current' not found
sus_categorical_current <- as.factor(sus_categorical_current)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.factor': object 'sus_categorical_current' not found
pData(lp_expt)[["sus_category_current"]] <- sus_categorical_current
## Error in eval(expr, envir, enclos): object 'sus_categorical_current' not found
table(sus_categorical_current)
## Error in eval(quote(list(...)), env): object 'sus_categorical_current' not found
Here is a table of my current classifier’s interpretation of the strains.
table(pData(lp_expt)[["knnv2classification"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_expt' not found
In many queries, we will seek to compare only the two primary strains, zymodeme 2.2 and 2.3. The following block will extract only those samples.
lp_strain <- lp_expt %>%
set_expt_batches(fact = sus_categorical_current) %>%
set_expt_colors(color_choices[["strain"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_expt' not found
table(pData(lp_strain)[["condition"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_strain' not found
save(list = "lp_strain", file = glue::glue("rda/tmrc2_lp_strain-v{ver}.rda"))
## Error in save(list = "lp_strain", file = glue::glue("rda/tmrc2_lp_strain-v{ver}.rda")): object 'lp_strain' not found
data_structures <- c(data_structures, "lp_strain")
lp_two_strains <- subset_expt(lp_strain, subset = "condition=='z2.3'|condition=='z2.2'")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'subset_expt': object 'lp_strain' not found
save(list = "lp_two_strains",
file = glue::glue("rda/tmrc2_lp_two_strains-v{ver}.rda"))
## Error in save(list = "lp_two_strains", file = glue::glue("rda/tmrc2_lp_two_strains-v{ver}.rda")): object 'lp_two_strains' not found
data_structures <- c(data_structures, "lp_two_strains")
Clinical outcome is by far the most problematic comparison in this data, but here is the recategorization of the data using it:
lp_cf <- set_expt_conditions(lp_expt, fact = "clinicalcategorical") %>%
set_expt_batches(fact = sus_categorical_current) %>%
set_expt_colors(color_choices[["cf"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_expt' not found
table(pData(lp_cf)[["condition"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_cf' not found
data_structures <- c(data_structures, "lp_cf")
save(list = "lp_cf",
file = glue::glue("rda/tmrc2_lp_cf-v{ver}.rda"))
## Error in save(list = "lp_cf", file = glue::glue("rda/tmrc2_lp_cf-v{ver}.rda")): object 'lp_cf' not found
lp_cf_known <- subset_expt(lp_cf, subset="condition!='unknown'")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'subset_expt': object 'lp_cf' not found
data_structures <- c(data_structures, "lp_cf_known")
save(list = "lp_cf_known",
file = glue::glue("rda/tmrc2_lp_cf_known-v{ver}.rda"))
## Error in save(list = "lp_cf_known", file = glue::glue("rda/tmrc2_lp_cf_known-v{ver}.rda")): object 'lp_cf_known' not found
Use the factorized version of susceptibility to categorize the samples by the historical data.
lp_susceptibility_historical <- set_expt_conditions(lp_expt, fact = "sus_category_historical") %>%
set_expt_batches(fact = "clinicalcategorical") %>%
set_expt_colors(colors = color_choices[["susceptibility"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_expt' not found
save(list = "lp_susceptibility_historical",
file = glue::glue("rda/tmrc2_lp_susceptibility_historical-v{ver}.rda"))
## Error in save(list = "lp_susceptibility_historical", file = glue::glue("rda/tmrc2_lp_susceptibility_historical-v{ver}.rda")): object 'lp_susceptibility_historical' not found
data_structures <- c(data_structures, "lp_susceptibility_historical")
Use the factorized version of susceptibility to categorize the samples by the historical data.
This will likely be our canonical susceptibility dataset, so I will remove the suffix and just call it ‘lp_susceptibility’.
lp_susceptibility <- set_expt_conditions(lp_expt, fact = "sus_category_current") %>%
set_expt_batches(fact = "clinicalcategorical") %>%
set_expt_colors(colors = color_choices[["susceptibility"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_expt' not found
save(list = "lp_susceptibility",
file = glue::glue("rda/tmrc2_lp_susceptibility-v{ver}.rda"))
## Error in save(list = "lp_susceptibility", file = glue::glue("rda/tmrc2_lp_susceptibility-v{ver}.rda")): object 'lp_susceptibility' not found
data_structures <- c(data_structures, "lp_susceptibility")
I think this is redundant with a previous block, but I am leaving it until I am certain that it is not required in a following document.
lp_zymo <- subset_expt(lp_expt, subset = "condition=='z2.2'|condition=='z2.3'")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'subset_expt': object 'lp_expt' not found
data_structures <- c(data_structures, "lp_zymo")
save(list = "lp_zymo",
file = glue::glue("rda/tmrc2_lp_zymo-v{ver}.rda"))
## Error in save(list = "lp_zymo", file = glue::glue("rda/tmrc2_lp_zymo-v{ver}.rda")): object 'lp_zymo' not found
The following section will create some initial data structures of the observed variants in the parasite samples. This will include some of our 2016 samples for some classification queries.
I changed and improved the mapping and variant detection methods from what we used for the 2016 data. So some small changes will be required to merge them.
lp_previous <- create_expt("sample_sheets/tmrc2_samples_20191203.xlsx",
file_column = "tophat2file",
savefile = glue::glue("rda/lp_previous-v{ver}.rda"))
## Reading the sample metadata.
## Dropped 13 rows from the sample metadata because the sample ID is blank.
## The sample definitions comprises: 50 rows(samples) and 38 columns(metadata fields).
## Warning in create_expt("sample_sheets/tmrc2_samples_20191203.xlsx", file_column
## = "tophat2file", : Some samples were removed when cross referencing the samples
## against the count data.
## Matched 8841 annotations and counts.
## Bringing together the count matrix and gene information.
## The final expressionset has 8841 features and 33 samples.
tt <- lp_previous$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\-1$", replacement = "", x = rownames(tt))
lp_previous$expressionset <- tt
rm(tt)
data_structures <- c(data_structures, "lp_previous")
The count_expt_snps() function uses our expressionset data and a metadata column in order to extract the mpileup or freebayes-based variant calls and create matrices of the likelihood that each position-per-sample is in fact a variant.
There is an important caveat here which changed on 202301: I was interpreting using the PAIRED tag, which is only used for, unsurprisingly, paired-end samples. A couple samples are not paired and so were failing silently. The QA tag looks like it is more appropriate and should work across both types. One way to find out, I am setting it here and will look to see if the results make more sense for my test samples (TMRC2001, TMRC2005, TMRC2007).
## The next line drops the samples which are missing the SNP pipeline.
lp_snp <- subset_expt(lp_expt, subset = "!is.na(pData(lp_expt)[['freebayessummary']])")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'subset_expt': object 'lp_expt' not found
new_snps <- count_expt_snps(lp_snp, annot_column = "freebayessummary", snp_column = "QA")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_snp' not found
## Lets see if we get numbers which make sense.
summary(exprs(new_snps)[["tmrc20001"]]) ## My weirdo sample
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'new_snps' not found
summary(exprs(new_snps)[["tmrc20072"]]) ## Another sample chosen at random
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'new_snps' not found
summary(exprs(new_snps)[["tmrc20021"]]) ## Another sample chosen at random
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'new_snps' not found
## Now that we are reasonably confident that things make more sense, lets save and move on...
data_structures <- c(data_structures, "new_snps")
tt <- normalize_expt(new_snps, transform = "log2")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'normalize_expt': object 'new_snps' not found
plot_boxplot(tt)
## Error in plot_boxplot(tt): object 'tt' not found
Now let us pull in the 2016 data.
old_snps <- count_expt_snps(lp_previous, annot_column = "bcftable", snp_column = 2)
## The rownames are missing the chromosome identifier,
## they probably came from an older version of this method.
data_structures <- c(data_structures, "old_snps")
save(list = "lp_snp",
file = glue::glue("rda/lp_snp-v{ver}.rda"))
## Error in save(list = "lp_snp", file = glue::glue("rda/lp_snp-v{ver}.rda")): object 'lp_snp' not found
data_structures <- c(data_structures, "lp_snp")
save(list = "new_snps",
file = glue::glue("rda/new_snps-v{ver}.rda"))
## Error in save(list = "new_snps", file = glue::glue("rda/new_snps-v{ver}.rda")): object 'new_snps' not found
data_structures <- c(data_structures, "new_snps")
save(list = "old_snps",
file = glue::glue("rda/old_snps-v{ver}.rda"))
data_structures <- c(data_structures, "old_snps")
nonzero_snps <- exprs(new_snps) != 0
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'new_snps' not found
colSums(nonzero_snps)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colSums': object 'nonzero_snps' not found
As far as I can tell, freebayes and mpileup are reasonably similar in their sensitivity/specificity; so combining the two datasets like this is expected to work with minimal problems. The most likely problem is that my mpileup-based pipeline is unable to handle indels.
## My old_snps is using an older annotation incorrectly, so fix it here:
Biobase::annotation(old_snps$expressionset) <- Biobase::annotation(new_snps$expressionset)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'annotation': object 'new_snps' not found
both_snps <- combine_expts(new_snps, old_snps)
## Error in combine_expts(new_snps, old_snps): object 'new_snps' not found
save(list = "both_snps",
file = glue::glue("rda/both_snps-v{ver}.rda"))
## Error in save(list = "both_snps", file = glue::glue("rda/both_snps-v{ver}.rda")): object 'both_snps' not found
data_structures <- c(data_structures, "both_snps")
I am taking a heatmap from our variant data and manually identifying sample groups.
All of the above focused entire on the parasite samples, now let us pull up the macrophage infected samples. This will comprise two datasets, one of the human and one of the parasite.
The metadata for the macrophage samples contains a couple of columns for mapped human and parasite reads. We will therefore use them separately to create two expressionsets, one for each species.
hs_annot <- load_biomart_annotations(year="2020")
## The biomart annotations file already exists, loading from it.
hs_annot <- hs_annot[["annotation"]]
hs_annot[["transcript"]] <- paste0(rownames(hs_annot), ".", hs_annot[["transcript_version"]])
rownames(hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique=TRUE)
rownames(hs_annot) <- paste0("gene:", rownames(hs_annot))
tx_gene_map <- hs_annot[, c("transcript", "ensembl_gene_id")]
sanitize_columns <- c("drug", "macrophagetreatment", "macrophagezymodeme")
hs_macrophage <- create_expt(
macrophage_sheet,
gene_info = hs_annot,
file_column = "hg38100hisatfile") %>%
set_expt_conditions(fact = "macrophagetreatment") %>%
set_expt_batches(fact = "macrophagezymodeme") %>%
sanitize_expt_metadata(columns = sanitize_columns) %>%
subset_expt(nonzero = 12000)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'subset_expt': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'macrophage_sheet' not found
fixed_genenames <- gsub(x = rownames(exprs(hs_macrophage)), pattern = "^gene:",
replacement = "")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'gsub': error in evaluating the argument 'x' in selecting a method for function 'rownames': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'hs_macrophage' not found
hs_macrophage <- set_expt_genenames(hs_macrophage, ids = fixed_genenames)
## Error in set_expt_genenames(hs_macrophage, ids = fixed_genenames): object 'hs_macrophage' not found
table(pData(hs_macrophage)$condition)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hs_macrophage' not found
## The following 3 lines were copy/pasted to datastructures and should be removed soon.
no_strain_idx <- pData(hs_macrophage)[["strainid"]] == "none"
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hs_macrophage' not found
pData(hs_macrophage)[["strainid"]] <- paste0("s", pData(hs_macrophage)[["strainid"]],
"_", pData(hs_macrophage)[["macrophagezymodeme"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hs_macrophage' not found
pData(hs_macrophage)[no_strain_idx, "strainid"] <- "none"
## Error in pData(hs_macrophage)[no_strain_idx, "strainid"] <- "none": object 'hs_macrophage' not found
data_structures <- c(data_structures, "hs_macrophage")
Finally, split off the U937 samples.
hs_u937 <- subset_expt(hs_macrophage, subset = "typeofcells!='Macrophages'")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'subset_expt': object 'hs_macrophage' not found
data_structures <- c(data_structures, "hs_u937")
In the previous block, we used a new invocation of ensembl-derived annotation data, this time we can just use our existing parasite gene annotations.
lp_macrophage <- create_expt(macrophage_sheet,
file_column="lpanamensisv36hisatfile",
gene_info=all_lp_annot,
savefile = glue::glue("rda/lp_macrophage-v{ver}.rda"),
annotation="org.Lpanamensis.MHOMCOL81L13.v46.eg.db") %>%
set_expt_conditions(fact="macrophagezymodeme") %>%
set_expt_batches(fact="macrophagetreatment") %>%
subset_expt(nonzero = 8550) %>%
semantic_expt_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
semantic_column = "annot_gene_product")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'exprs': error in evaluating the argument 'expt' in selecting a method for function 'subset_expt': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'macrophage_sheet' not found
data_structures <- c(data_structures, "lp_macrophage")
lp_macrophage_nosb <- subset_expt(lp_macrophage, subset="batch!='inf_sb'")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt' in selecting a method for function 'subset_expt': object 'lp_macrophage' not found
data_structures <- c(data_structures, "lp_macrophage_nosb")
save(list = data_structures, file = glue::glue("rda/tmrc2_data_structures-v{ver}.rda"))
## Error in save(list = data_structures, file = glue::glue("rda/tmrc2_data_structures-v{ver}.rda")): objects 'lp_expt', 'lp_strain', 'lp_two_strains', 'lp_cf', 'lp_cf_known', 'lp_susceptibility_historical', 'lp_susceptibility', 'lp_zymo', 'new_snps', 'lp_snp', 'new_snps', 'both_snps', 'hs_macrophage', 'hs_u937', 'lp_macrophage', 'lp_macrophage_nosb' not found
pander::pander(sessionInfo())
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53(v.2021.07), BSgenome(v.1.66.3), rtracklayer(v.1.58.0), Biostrings(v.2.66.0), XVector(v.0.38.0), futile.logger(v.1.4.3), org.Lmajor.Friedlin.v49.eg.db(v.2020.11), org.Lpanamensis.MHOMCOL81L13.v46.eg.db(v.2020.07), AnnotationDbi(v.1.60.2), EuPathDB(v.1.6.0), GenomeInfoDbData(v.1.2.9), Heatplus(v.3.6.0), hpgltools(v.1.0), testthat(v.3.1.7), SummarizedExperiment(v.1.28.0), GenomicRanges(v.1.50.2), GenomeInfoDb(v.1.34.9), IRanges(v.2.32.0), S4Vectors(v.0.36.2), MatrixGenerics(v.1.10.0), matrixStats(v.0.63.0), Biobase(v.2.58.0) and BiocGenerics(v.0.44.0)
loaded via a namespace (and not attached): rappdirs(v.0.3.3), AnnotationForge(v.1.40.2), tidyr(v.1.3.0), ggplot2(v.3.4.2), clusterGeneration(v.1.3.7), bit64(v.4.0.5), knitr(v.1.42), DelayedArray(v.0.24.0), data.table(v.1.14.8), KEGGREST(v.1.38.0), RCurl(v.1.98-1.12), doParallel(v.1.0.17), generics(v.0.1.3), GenomicFeatures(v.1.50.4), lambda.r(v.1.2.4), callr(v.3.7.3), RhpcBLASctl(v.0.23-42), cowplot(v.1.1.1), usethis(v.2.1.6), RSQLite(v.2.3.1), shadowtext(v.0.1.2), tzdb(v.0.3.0), bit(v.4.0.5), enrichplot(v.1.18.3), xml2(v.1.3.3), httpuv(v.1.6.9), viridis(v.0.6.2), xfun(v.0.38), hms(v.1.1.3), jquerylib(v.0.1.4), evaluate(v.0.20), promises(v.1.2.0.1), fansi(v.1.0.4), restfulr(v.0.0.15), progress(v.1.2.2), caTools(v.1.18.2), dbplyr(v.2.3.2), igraph(v.1.4.1), DBI(v.1.1.3), htmlwidgets(v.1.6.2), stringdist(v.0.9.10), purrr(v.1.0.1), ellipsis(v.0.3.2), dplyr(v.1.1.1), backports(v.1.4.1), annotate(v.1.76.0), aod(v.1.3.2), biomaRt(v.2.54.1), vctrs(v.0.6.1), remotes(v.2.4.2), cachem(v.1.0.7), withr(v.2.5.0), ggforce(v.0.4.1), HDO.db(v.0.99.1), vroom(v.1.6.1), AnnotationHubData(v.1.28.0), GenomicAlignments(v.1.34.1), treeio(v.1.22.0), prettyunits(v.1.1.1), DOSE(v.3.24.2), ape(v.5.7-1), lazyeval(v.0.2.2), crayon(v.1.5.2), genefilter(v.1.80.3), edgeR(v.3.40.2), pkgconfig(v.2.0.3), tweenr(v.2.0.2), nlme(v.3.1-162), pkgload(v.1.3.2), devtools(v.2.4.5), rlang(v.1.1.0), lifecycle(v.1.0.3), miniUI(v.0.1.1.1), downloader(v.0.4), filelock(v.1.0.2), BiocFileCache(v.2.6.1), AnnotationHub(v.3.6.0), rprojroot(v.2.0.3), polyclip(v.1.10-4), graph(v.1.76.0), Matrix(v.1.5-4), aplot(v.0.1.10), boot(v.1.3-28.1), processx(v.3.8.0), png(v.0.1-8), viridisLite(v.0.4.1), rjson(v.0.2.21), bitops(v.1.0-7), gson(v.0.1.0), KernSmooth(v.2.23-20), pander(v.0.6.5), blob(v.1.2.4), stringr(v.1.5.0), qvalue(v.2.30.0), readr(v.2.1.4), remaCor(v.0.0.11), gridGraphics(v.0.5-1), scales(v.1.2.1), memoise(v.2.0.1), GSEABase(v.1.60.0), magrittr(v.2.0.3), plyr(v.1.8.8), gplots(v.3.1.3), zlibbioc(v.1.44.0), compiler(v.4.2.0), scatterpie(v.0.1.8), BiocIO(v.1.8.0), RColorBrewer(v.1.1-3), lme4(v.1.1-32), Rsamtools(v.2.14.0), cli(v.3.6.1), urlchecker(v.1.0.1), patchwork(v.1.1.2), ps(v.1.7.4), formatR(v.1.14), MASS(v.7.3-58.3), mgcv(v.1.8-41), tidyselect(v.1.2.0), stringi(v.1.7.12), yaml(v.2.3.7), GOSemSim(v.2.24.0), locfit(v.1.5-9.7), ggrepel(v.0.9.3), biocViews(v.1.66.3), grid(v.4.2.0), sass(v.0.4.5), fastmatch(v.1.1-3), tools(v.4.2.0), parallel(v.4.2.0), rstudioapi(v.0.14), foreach(v.1.5.2), gridExtra(v.2.3), farver(v.2.1.1), ggraph(v.2.1.0), digest(v.0.6.31), BiocManager(v.1.30.20), shiny(v.1.7.4), Rcpp(v.1.0.10), broom(v.1.0.4), BiocVersion(v.3.16.0), later(v.1.3.0), OrganismDbi(v.1.40.0), httr(v.1.4.5), Rdpack(v.2.4), colorspace(v.2.1-0), rvest(v.1.0.3), brio(v.1.1.3), XML(v.3.99-0.14), fs(v.1.6.1), splines(v.4.2.0), BiocCheck(v.1.34.3), RBGL(v.1.74.0), yulab.utils(v.0.0.6), PROPER(v.1.30.0), tidytree(v.0.4.2), graphlayouts(v.0.8.4), ggplotify(v.0.1.0), plotly(v.4.10.1), sessioninfo(v.1.2.2), xtable(v.1.8-4), futile.options(v.1.0.1), jsonlite(v.1.8.4), nloptr(v.2.0.3), ggtree(v.3.6.2), tidygraph(v.1.2.3), ggfun(v.0.0.9), R6(v.2.5.1), RUnit(v.0.4.32), profvis(v.0.3.7), pillar(v.1.9.0), htmltools(v.0.5.5), mime(v.0.12), glue(v.1.6.2), fastmap(v.1.1.1), minqa(v.1.2.5), clusterProfiler(v.4.6.2), BiocParallel(v.1.32.6), interactiveDisplayBase(v.1.36.0), codetools(v.0.2-19), fgsea(v.1.24.0), pkgbuild(v.1.4.0), mvtnorm(v.1.1-3), utf8(v.1.2.3), lattice(v.0.20-45), bslib(v.0.4.2), tibble(v.3.2.1), sva(v.3.46.0), pbkrtest(v.0.5.2), curl(v.5.0.0), gtools(v.3.9.4), zip(v.2.2.2), openxlsx(v.4.2.5.2), GO.db(v.3.16.0), survival(v.3.5-5), limma(v.3.54.2), rmarkdown(v.2.21), desc(v.1.4.2), munsell(v.0.5.0), iterators(v.1.0.14), variancePartition(v.1.28.9), reshape2(v.1.4.4), gtable(v.0.3.3) and rbibutils(v.2.2.13)
message("This is hpgltools commit: ", get_git_commit())
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 993bbd9339296a17d626955e59f4e7c44bddb719
## This is hpgltools commit: Wed Apr 12 14:21:24 2023 -0400: 993bbd9339296a17d626955e59f4e7c44bddb719
message("Saving to ", savefile)
## Saving to tmrc2_datasets_202304.rda.xz
tmp <- sm(saveme(filename = savefile))
tmp <- loadme(filename = savefile)