This document is intended to create the data structures used to evaluate our TMRC2 samples. In some cases, this includes only those samples starting in 2019; in other instances I am including our previous (2015-2016) samples.
In all cases the processing performed was:
I am thinking that this meeting will bring Maria Adelaida fully back into the analyses of the parasite data, and therefore may focus primarily on the goals rather than the analyses?
In a couple of important ways the TMRC2 data is much more complex than the TMRC3:
Our shared online sample sheet is nearly static at the time of this writing (202209), I expect at this point the only likely updates will be to annotate some strains as more or less susceptible to drug treatment.
sample_sheet <- glue::glue("sample_sheets/tmrc2_samples_202301.xlsx")
Everything which follows depends on the Existing TriTrypDB annotations revision 46, circa 2019. The following block loads a database of these annotations and turns it into a matrix where the rows are genes and columns are all the annotation types provided by TriTrypDB.
The same database was used to create a matrix of orthologous genes between L.panamensis and all of the other species in the TriTrypDB.
The same database of annotations also provides mappings to the set of annotated GO categories for the L.panamensis genome along with gene lengths.
tt <- sm(library(EuPathDB))
orgdb <- "org.Lpanamensis.MHOMCOL81L13.v46.eg.db"
tt <- sm(library(orgdb, character.only=TRUE))
pan_db <- org.Lpanamensis.MHOMCOL81L13.v46.eg.db
all_fields <- columns(pan_db)
all_lp_annot <- sm(load_orgdb_annotations(
pan_db,
keytype = "gid",
fields = c("annot_gene_entrez_id", "annot_gene_name",
"annot_strand", "annot_chromosome", "annot_cds_length",
"annot_gene_product")))$genes
lp_go <- sm(load_orgdb_go(pan_db))
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths) <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
orthos <- sm(EuPathDB::extract_eupath_orthologs(db = pan_db))
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot")
The following block loads the full genome sequence for panamensis. We may use this later to attempt to estimate PCR primers to discern strains.
meta <- sm(EuPathDB::download_eupath_metadata(webservice="tritrypdb"))
lp_entry <- EuPathDB::get_eupath_entry(species="Leishmania panamensis", metadata=meta)
## Found the following hits: Leishmania panamensis MHOM/COL/81/L13, Leishmania panamensis strain MHOM/PA/94/PSC-1, choosing the first.
## Using: Leishmania panamensis MHOM/COL/81/L13.
colnames(lp_entry)
## [1] "AnnotationVersion" "AnnotationSource" "BiocVersion"
## [4] "DataProvider" "Genome" "GenomeSource"
## [7] "GenomeVersion" "NumArrayGene" "NumChipChipGene"
## [10] "NumChromosome" "NumCodingGene" "NumCommunity"
## [13] "NumContig" "NumEC" "NumEST"
## [16] "NumGene" "NumGO" "NumOrtholog"
## [19] "NumOtherGene" "NumPopSet" "NumProteomics"
## [22] "NumPseudogene" "NumRNASeq" "NumRTPCR"
## [25] "NumSNP" "NumTFBS" "Organellar"
## [28] "ReferenceStrain" "MegaBP" "PrimaryKey"
## [31] "ProjectID" "RecordClassName" "SourceID"
## [34] "SourceVersion" "TaxonomyID" "TaxonomyName"
## [37] "URLGenome" "URLGFF" "URLProtein"
## [40] "Coordinate_1_based" "Maintainer" "SourceUrl"
## [43] "Tags" "BsgenomePkg" "GrangesPkg"
## [46] "OrganismdbiPkg" "OrgdbPkg" "TxdbPkg"
## [49] "Taxon" "Genus" "Species"
## [52] "Strain" "BsgenomeFile" "GrangesFile"
## [55] "OrganismdbiFile" "OrgdbFile" "TxdbFile"
## [58] "GenusSpecies" "TaxonUnmodified" "TaxonCanonical"
## [61] "TaxonXref"
testing_panamensis <- "BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53"
## testing_panamensis <- EuPathDB::make_eupath_bsgenome(entry=lp_entry, eu_version="v46")
library(as.character(testing_panamensis), character.only=TRUE)
## Loading required package: BSgenome
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Loading required package: rtracklayer
lp_genome <- get0(as.character(testing_panamensis))
data_structures <- c(data_structures, "lp_genome")
The process of sample estimation takes two primary inputs:
An expressionSet(or summarizedExperiment) is a data structure used in R to examine RNASeq data. It is comprised of annotations, metadata, and expression data. In the case of our processing pipeline, the location of the expression data is provided by the filenames in the metadata.
The following samples are much lower coverage:
The following list contains the colors we have chosen to use when plotting the various ways of discerning the data.
color_choices <- list(
"strain" = list(
## "z1.0" = "#333333", ## Changed this to 'braz' to make it easier to find them.
"z2.0" = "#555555",
"z3.0" = "#777777",
"z2.1" = "#874400",
"z2.2" = "#0000cc",
"z2.3" = "#cc0000",
"z2.4" = "#df7000",
"z3.2" = "#888888",
"z1.0" = "#cc00cc",
"z1.5" = "#cc00cc",
"b2904" = "#cc00cc",
"unknown" = "#cbcbcb"),
## "null" = "#000000"),
"cf" = list(
"cure" = "#006f00",
"fail" = "#9dffa0",
"unknown" = "#cbcbcb",
"notapplicable" = "#000000"),
"susceptibility" = list(
"resistant" = "#8563a7",
"sensitive" = "#8d0000",
"ambiguous" = "#cbcbcb",
"unknown" = "#555555"))
data_structures <- c(data_structures, "color_choices")
The data structure ‘lp_expt’ contains the data for all samples which have hisat2 count tables, and which pass a few initial quality tests (e.g. they must have more than 8550 genes with >0 counts and >5e6 reads which mapped to a gene); genes which are annotated with a few key redundant categories (leishmanolysin for example) are also culled.
There are a few metadata columns which we really want to make certain are standardized.
sanitize_columns <- c("passagenumber", "clinicalresponse", "clinicalcategorical",
"zymodemecategorical")
lp_expt <- create_expt(sample_sheet,
gene_info = all_lp_annot,
annotation_name = orgdb,
savefile = glue::glue("rda/tmrc2_lp_expt_all_raw-v{ver}.rda"),
id_column = "hpglidentifier",
file_column = "lpanamensisv36hisatfile") %>%
set_expt_conditions(fact = "zymodemecategorical") %>%
subset_expt(nonzero = 8550) %>%
set_expt_colors(color_choices[["strain"]]) %>%
subset_expt(coverage = 5000000) %>%
set_expt_colors(color_choices[["strain"]]) %>%
semantic_expt_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
semantic_column = "annot_gene_product") %>%
sanitize_expt_metadata(columns = sanitize_columns) %>%
set_expt_factors(columns = sanitize_columns, class = "factor")
## Reading the sample metadata.
## Dropped 11 rows from the sample metadata because the sample ID is blank.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## The sample definitions comprises: 110 rows(samples) and 71 columns(metadata fields).
## Warning in create_expt(sample_sheet, gene_info = all_lp_annot, annotation_name
## = orgdb, : Some samples were removed when cross referencing the samples against
## the count data.
## Matched 8778 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 8778 features and 105 samples.
## The samples (and read coverage) removed when filtering 8550 non-zero genes are:
## TMRC20002 TMRC20006
## 11681227 6670348
## subset_expt(): There were 105, now there are 103 samples.
## The samples removed (and read coverage) when filtering samples with less than 5e+06 reads are:
## TMRC20004 TMRC20029
## 564812 1658096
## subset_expt(): There were 103, now there are 101 samples.
## semantic_expt_filter(): Removed 68 genes.
data_structures <- c(data_structures, "lp_expt")
save(list = "lp_expt", file = glue::glue("rda/tmrc2_lp_expt_all_sanitized-v{ver}.rda"))
table(pData(lp_expt)[["zymodemecategorical"]])
##
## b2904 unknown z10 z15 z20 z21 z22 z23 z24 z30
## 1 2 1 1 1 7 43 41 2 1
## z32
## 1
table(pData(lp_expt)[["clinicalresponse"]])
##
## cure failure
## 38 38
## laboratory line laboratory line miltefosine resistant
## 1 1
## nd reference strain
## 19 4
ncol(exprs(lp_expt))
## [1] 101
All the following data will derive from this starting point.
Column ‘Q’ in the sample sheet, make a categorical version of it with these parameters:
max_resist <- 0.35
min_sensitive <- 0.49
The sanitize_percent() function seeks to make the percentage values recorded by excel more reliable. Unfortunately, sometimes excel displays the value ‘49%’ when the information recorded in the worksheet is any one of the following:
Thus, the following block will sanitize these percentage values into a single decimal number and make a categorical variable from it using out pre-defined values for resistant/ambiguous/sensitive. This categorical variable will be stored in a new column: ‘sus_category_historical’.
st <- pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvhistoricaldata"]]
starting <- sanitize_percent(st)
st
## [1] "0.45" "0.14000000000000001" "0.97"
## [4] NA NA NA
## [7] NA NA NA
## [10] "0" "0.97" "0"
## [13] "0" "0.46" "0.45"
## [16] "0.97" "0.56000000000000005" "0.99"
## [19] "0.46" "0.7" "0.99"
## [22] "0.99" "0.45" "0.98"
## [25] "0.99" "0.49" "No data"
## [28] "No data" "0.99" "0.66"
## [31] "0.99" "No data" "0.99"
## [34] "1" "1" "0.94"
## [37] "0.94" "No data" "No data"
## [40] "No data" "No data" "No data"
## [43] "No data" "No data" "No data"
## [46] "No data" "No data" "No data"
## [49] "No data" "No data" "No data"
## [52] "0.99" "0.99" "No data"
## [55] "0.98" "0.97" "0.96"
## [58] "0.96" "0" "0"
## [61] "0" "0.06" "0.94"
## [64] "0.94" "0.03" "0.94"
## [67] "0" "0.25" "0.95"
## [70] "0.27" "No data" "No data"
## [73] "No data" "No data" "No data"
## [76] "No data" "No data" "No data"
## [79] "No data" "No data" "No data"
## [82] "No data" "No data" "No data"
## [85] "No data" "No data" "No data"
## [88] "No data" "No data" "No data"
## [91] "No data" "No data" "No data"
## [94] "No data" "No data" "No data"
## [97] "No data" "No data" "No data"
## [100] "No data" "No data"
starting
## [1] 0.45 0.14 0.97 NA NA NA NA NA NA 0.00 0.97 0.00 0.00 0.46 0.45
## [16] 0.97 0.56 0.99 0.46 0.70 0.99 0.99 0.45 0.98 0.99 0.49 NA NA 0.99 0.66
## [31] 0.99 NA 0.99 1.00 1.00 0.94 0.94 NA NA NA NA NA NA NA NA
## [46] NA NA NA NA NA NA 0.99 0.99 NA 0.98 0.97 0.96 0.96 0.00 0.00
## [61] 0.00 0.06 0.94 0.94 0.03 0.94 0.00 0.25 0.95 0.27 NA NA NA NA NA
## [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [91] NA NA NA NA NA NA NA NA NA NA NA
sus_categorical <- starting
na_idx <- is.na(starting)
sum(na_idx)
## [1] 55
sus_categorical[na_idx] <- "unknown"
resist_idx <- starting <= max_resist
sus_categorical[resist_idx] <- "resistant"
indeterminant_idx <- starting > max_resist & starting < min_sensitive
sus_categorical[indeterminant_idx] <- "ambiguous"
susceptible_idx <- starting >= min_sensitive
sus_categorical[susceptible_idx] <- "sensitive"
sus_categorical <- as.factor(sus_categorical)
pData(lp_expt)[["sus_category_historical"]] <- sus_categorical
table(sus_categorical)
## sus_categorical
## ambiguous resistant sensitive unknown
## 5 12 29 55
The same process will be repeated for the current iteration of the sensitivity assay and stored in the ‘sus_category_current’ column.
starting_current <- sanitize_percent(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvcurrentdata"]])
## Warning in
## sanitize_percent(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvcurrentdata"]]):
## NAs introduced by coercion
## Warning in
## sanitize_percent(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvcurrentdata"]]):
## NAs introduced by coercion
## Warning in
## sanitize_percent(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvcurrentdata"]]):
## NAs introduced by coercion
## Warning in
## sanitize_percent(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvcurrentdata"]]):
## NAs introduced by coercion
## Warning in
## sanitize_percent(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvcurrentdata"]]):
## NAs introduced by coercion
sus_categorical_current <- starting_current
na_idx <- is.na(starting_current)
sum(na_idx)
## [1] 24
sus_categorical_current[na_idx] <- "unknown"
resist_idx <- starting_current <= max_resist
sus_categorical_current[resist_idx] <- "resistant"
indeterminant_idx <- starting_current > max_resist & starting_current < min_sensitive
sus_categorical_current[indeterminant_idx] <- "ambiguous"
susceptible_idx <- starting_current >= min_sensitive
sus_categorical_current[susceptible_idx] <- "sensitive"
sus_categorical_current <- as.factor(sus_categorical_current)
pData(lp_expt)[["sus_category_current"]] <- sus_categorical_current
table(sus_categorical_current)
## sus_categorical_current
## ambiguous resistant sensitive unknown
## 13 12 52 24
In many queries, we will seek to compare only the two primary strains, zymodeme 2.2 and 2.3. The following block will extract only those samples.
lp_strain <- lp_expt %>%
set_expt_batches(fact = sus_categorical_current) %>%
set_expt_colors(color_choices[["strain"]])
table(pData(lp_strain)[["condition"]])
##
## b2904 unknown z1.0 z1.5 z2.0 z2.1 z2.2 z2.3 z2.4 z3.0
## 1 2 1 1 1 7 43 41 2 1
## z3.2
## 1
save(list = "lp_strain", file = glue::glue("rda/tmrc2_lp_strain-v{ver}.rda"))
data_structures <- c(data_structures, "lp_strain")
lp_two_strains <- subset_expt(lp_strain, subset = "condition=='z2.3'|condition=='z2.2'")
## subset_expt(): There were 101, now there are 84 samples.
save(list = "lp_two_strains",
file = glue::glue("rda/tmrc2_lp_two_strains-v{ver}.rda"))
data_structures <- c(data_structures, "lp_two_strains")
Clinical outcome is by far the most problematic comparison in this data, but here is the recategorization of the data using it:
lp_cf <- set_expt_conditions(lp_expt, fact = "clinicalcategorical") %>%
set_expt_batches(fact = sus_categorical_current) %>%
set_expt_colors(color_choices[["cf"]])
## Warning in set_expt_colors(., color_choices[["cf"]]): Colors for the following
## categories are not being used: notapplicable.
table(pData(lp_cf)[["condition"]])
##
## cure fail unknown
## 38 38 25
data_structures <- c(data_structures, "lp_cf")
save(list = "lp_cf",
file = glue::glue("rda/tmrc2_lp_cf-v{ver}.rda"))
lp_cf_known <- subset_expt(lp_cf, subset="condition!='unknown'")
## subset_expt(): There were 101, now there are 76 samples.
data_structures <- c(data_structures, "lp_cf_known")
save(list = "lp_cf_known",
file = glue::glue("rda/tmrc2_lp_cf_known-v{ver}.rda"))
Use the factorized version of susceptibility to categorize the samples by the historical data.
lp_susceptibility_historical <- set_expt_conditions(lp_expt, fact = "sus_category_historical") %>%
set_expt_batches(fact = "clinicalcategorical") %>%
set_expt_colors(colors = color_choices[["susceptibility"]])
save(list = "lp_susceptibility_historical",
file = glue::glue("rda/tmrc2_lp_susceptibility_historical-v{ver}.rda"))
data_structures <- c(data_structures, "lp_susceptibility_historical")
Use the factorized version of susceptibility to categorize the samples by the historical data.
This will likely be our canonical susceptibility dataset, so I will remove the suffix and just call it ‘lp_susceptibility’.
lp_susceptibility <- set_expt_conditions(lp_expt, fact = "sus_category_current") %>%
set_expt_batches(fact = "clinicalcategorical") %>%
set_expt_colors(colors = color_choices[["susceptibility"]])
save(list = "lp_susceptibility",
file = glue::glue("rda/tmrc2_lp_susceptibility-v{ver}.rda"))
data_structures <- c(data_structures, "lp_susceptibility")
I think this is redundant with a previous block, but I am leaving it until I am certain that it is not required in a following document.
lp_zymo <- subset_expt(lp_expt, subset = "condition=='z2.2'|condition=='z2.3'")
## subset_expt(): There were 101, now there are 84 samples.
data_structures <- c(data_structures, "lp_zymo")
save(list = "lp_zymo",
file = glue::glue("rda/tmrc2_lp_zymo-v{ver}.rda"))
The following section will create some initial data structures of the observed variants in the parasite samples. This will include some of our 2016 samples for some classification queries.
I changed and improved the mapping and variant detection methods from what we used for the 2016 data. So some small changes will be required to merge them.
lp_previous <- create_expt("sample_sheets/tmrc2_samples_20191203.xlsx",
file_column = "tophat2file",
savefile = glue::glue("rda/lp_previous-v{ver}.rda"))
## Reading the sample metadata.
## Dropped 13 rows from the sample metadata because the sample ID is blank.
## The sample definitions comprises: 50 rows(samples) and 38 columns(metadata fields).
## Warning in create_expt("sample_sheets/tmrc2_samples_20191203.xlsx", file_column
## = "tophat2file", : Some samples were removed when cross referencing the samples
## against the count data.
## Matched 8841 annotations and counts.
## Bringing together the count matrix and gene information.
## The final expressionset has 8841 features and 33 samples.
data_structures <- c(data_structures, "lp_previous")
tt <- lp_previous$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
lp_previous$expressionset <- tt
rm(tt)
The count_expt_snps() function uses our expressionset data and a metadata column in order to extract the mpileup or freebayes-based variant calls and create matrices of the likelihood that each position-per-sample is in fact a variant.
## The next line drops the samples which are missing the SNP pipeline.
lp_snp <- subset_expt(lp_expt, subset="!is.na(pData(lp_expt)[['freebayessummary']])")
## subset_expt(): There were 101, now there are 101 samples.
new_snps <- count_expt_snps(lp_snp, annot_column = "freebayessummary", snp_column="PAIRED")
## New names:
## • `DP` -> `DP...3`
## • `RO` -> `RO...8`
## • `AO` -> `AO...9`
## • `QR` -> `QR...12`
## • `QA` -> `QA...13`
## • `DP` -> `DP...42`
## • `RO` -> `RO...43`
## • `QR` -> `QR...44`
## • `AO` -> `AO...45`
## • `QA` -> `QA...46`
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## • `DP` -> `DP...3`
## • `RO` -> `RO...8`
## • `AO` -> `AO...9`
## • `QR` -> `QR...12`
## • `QA` -> `QA...13`
## • `DP` -> `DP...42`
## • `RO` -> `RO...43`
## • `QR` -> `QR...44`
## • `AO` -> `AO...45`
## • `QA` -> `QA...46`
## Warning: NAs introduced by coercion
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## • `DP` -> `DP...3`
## • `RO` -> `RO...8`
## • `AO` -> `AO...9`
## • `QR` -> `QR...12`
## • `QA` -> `QA...13`
## • `DP` -> `DP...42`
## • `RO` -> `RO...43`
## • `QR` -> `QR...44`
## • `AO` -> `AO...45`
## • `QA` -> `QA...46`
data_structures <- c(data_structures, "new_snps")
old_snps <- count_expt_snps(lp_previous, annot_column = "bcftable", snp_column = 2)
## The rownames are missing the chromosome identifier,
## they probably came from an older version of this method.
data_structures <- c(data_structures, "old_snps")
save(list = "lp_snp",
file = glue::glue("rda/lp_snp-v{ver}.rda"))
data_structures <- c(data_structures, "lp_snp")
save(list = "new_snps",
file = glue::glue("rda/new_snps-v{ver}.rda"))
data_structures <- c(data_structures, "new_snps")
save(list = "old_snps",
file = glue::glue("rda/old_snps-v{ver}.rda"))
data_structures <- c(data_structures, "old_snps")
nonzero_snps <- exprs(new_snps) != 0
colSums(nonzero_snps)
## tmrc20001 tmrc20065 tmrc20005 tmrc20007 tmrc20008 tmrc20027 tmrc20028 tmrc20032
## 0 93649 0 0 0 351343 338580 146302
## tmrc20040 tmrc20066 tmrc20039 tmrc20037 tmrc20038 tmrc20067 tmrc20068 tmrc20041
## 58753 93615 25115 98958 97676 93954 96583 53184
## tmrc20015 tmrc20009 tmrc20010 tmrc20016 tmrc20011 tmrc20012 tmrc20013 tmrc20017
## 96398 15890 93816 146124 13914 456 94766 48288
## tmrc20014 tmrc20018 tmrc20019 tmrc20070 tmrc20020 tmrc20021 tmrc20022 tmrc20025
## 17245 140438 14829 97336 15484 101127 18143 364240
## tmrc20024 tmrc20036 tmrc20069 tmrc20033 tmrc20026 tmrc20031 tmrc20076 tmrc20073
## 18471 60087 18792 33663 15074 19139 18385 96169
## tmrc20055 tmrc20079 tmrc20071 tmrc20078 tmrc20094 tmrc20042 tmrc20058 tmrc20072
## 22246 96224 94353 18836 87878 19734 94524 50292
## tmrc20059 tmrc20048 tmrc20057 tmrc20088 tmrc20056 tmrc20060 tmrc20077 tmrc20074
## 94091 97164 48944 15594 22683 21506 18773 22132
## tmrc20063 tmrc20053 tmrc20052 tmrc20064 tmrc20075 tmrc20051 tmrc20050 tmrc20049
## 28254 20181 100709 93173 97982 94125 17200 16168
## tmrc20062 tmrc20110 tmrc20080 tmrc20043 tmrc20083 tmrc20054 tmrc20085 tmrc20046
## 93677 16997 96528 95623 21167 93603 89765 48608
## tmrc20093 tmrc20089 tmrc20047 tmrc20090 tmrc20044 tmrc20045 tmrc20061 tmrc20105
## 48254 90421 92637 91564 14861 50403 116906 86758
## tmrc20108 tmrc20109 tmrc20098 tmrc20096 tmrc20097 tmrc20101 tmrc20092 tmrc20082
## 97005 17932 92927 17534 46863 17753 16578 108121
## tmrc20102 tmrc20099 tmrc20100 tmrc20091 tmrc20084 tmrc20087 tmrc20103 tmrc20104
## 92380 91383 94381 15059 46548 14947 49368 94237
## tmrc20086 tmrc20107 tmrc20081 tmrc20106 tmrc20095
## 15813 95370 19533 18830 81200
As far as I can tell, freebayes and mpileup are reasonably similar in their sensitivity/specificity; so combining the two datasets like this is expected to work with minimal problems. The most likely problem is that my mpileup-based pipeline is unable to handle indels.
## My old_snps is using an older annotation incorrectly, so fix it here:
Biobase::annotation(old_snps$expressionset) <- Biobase::annotation(new_snps$expressionset)
both_snps <- combine_expts(new_snps, old_snps)
save(list = "both_snps",
file = glue::glue("rda/both_snps-v{ver}.rda"))
data_structures <- c(data_structures, "both_snps")
I am taking a heatmap from our variant data and manually identifying sample groups.
All of the above focused entire on the parasite samples, now let us pull up the macrophage infected samples. This will comprise two datasets, one of the human and one of the parasite.
The metadata for the macrophage samples contains a couple of columns for mapped human and parasite reads. We will therefore use them separately to create two expressionsets, one for each species.
macrophage_sheet <- "sample_sheets/tmrc2_macrophage_samples_202212.xlsx"
hs_annot <- load_biomart_annotations(year="2020")
## The biomart annotations file already exists, loading from it.
hs_annot <- hs_annot[["annotation"]]
hs_annot[["transcript"]] <- paste0(rownames(hs_annot), ".", hs_annot[["transcript_version"]])
rownames(hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique=TRUE)
rownames(hs_annot) <- paste0("gene:", rownames(hs_annot))
tx_gene_map <- hs_annot[, c("transcript", "ensembl_gene_id")]
hs_macrophage <- create_expt(
macrophage_sheet,
gene_info = hs_annot,
file_column = "hg38100hisatfile") %>%
set_expt_conditions(fact="macrophagetreatment") %>%
set_expt_batches(fact="macrophagezymodeme") %>%
subset_expt(nonzero = 12000)
## Reading the sample metadata.
## The sample definitions comprises: 70 rows(samples) and 69 columns(metadata fields).
## Matched 21447 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 21481 features and 70 samples.
## The samples (and read coverage) removed when filtering 12000 non-zero genes are:
## TMRC30162 TMRC30329
## 521145 1253
## subset_expt(): There were 70, now there are 68 samples.
fixed_genenames <- gsub(x = rownames(exprs(hs_macrophage)), pattern = "^gene:",
replacement = "")
hs_macrophage <- set_expt_genenames(hs_macrophage, ids = fixed_genenames)
table(pData(hs_macrophage)$condition)
##
## inf inf_sb uninf uninf_sb
## 29 29 5 5
data_structures <- c(data_structures, "hs_macrophage")
Finally, split off the U937 samples.
hs_u937 <- subset_expt(hs_macrophage, subset = "typeofcells!='Macrophages'")
## subset_expt(): There were 68, now there are 14 samples.
data_structures <- c(data_structures, "hs_u937")
In the previous block, we used a new invocation of ensembl-derived annotation data, this time we can just use our existing parasite gene annotations.
lp_macrophage <- create_expt(macrophage_sheet,
file_column="lpanamensisv36hisatfile",
gene_info=all_lp_annot,
savefile = glue::glue("rda/lp_macrophage-v{ver}.rda"),
annotation="org.Lpanamensis.MHOMCOL81L13.v46.eg.db") %>%
set_expt_conditions(fact="macrophagezymodeme") %>%
set_expt_batches(fact="macrophagetreatment") %>%
subset_expt(nonzero = 8550) %>%
semantic_expt_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
semantic_column = "annot_gene_product")
## Reading the sample metadata.
## The sample definitions comprises: 70 rows(samples) and 69 columns(metadata fields).
## Warning in create_expt(macrophage_sheet, file_column =
## "lpanamensisv36hisatfile", : Some samples were removed when cross referencing
## the samples against the count data.
## Matched 8778 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 8778 features and 67 samples.
## The samples (and read coverage) removed when filtering 8550 non-zero genes are:
## TMRC30051 TMRC30061 TMRC30062 TMRC30065 TMRC30066 TMRC30069 TMRC30117 TMRC30244
## 190577 636331 38270 34155 3080 424761 1147 1662
## TMRC30246 TMRC30249 TMRC30251 TMRC30252 TMRC30266 TMRC30268 TMRC30326 TMRC30323
## 2834 14507 84934 634958 822 3444 375 84
## TMRC30328 TMRC30329 TMRC30319 TMRC30325 TMRC30321 TMRC30327 TMRC30312 TMRC30297
## 934205 15 374 356 549353 129 76 720922
## TMRC30298 TMRC30300 TMRC30295 TMRC30296 TMRC30303 TMRC30304 TMRC30301 TMRC30302
## 183737 21418 366317 59403 676036 289 286954 9297
## TMRC30314 TMRC30315 TMRC30313 TMRC30309 TMRC30293 TMRC30294 TMRC30292 TMRC30307
## 653444 115279 96 188 782860 307275 10780 288441
## TMRC30308 TMRC30331 TMRC30311 TMRC30332 TMRC30305 TMRC30306 TMRC30330
## 90798 8324 158193 5974 409241 122620 181
## subset_expt(): There were 67, now there are 20 samples.
## semantic_expt_filter(): Removed 68 genes.
data_structures <- c(data_structures, "lp_macrophage")
lp_macrophage_nosb <- subset_expt(lp_macrophage, subset="batch!='inf_sb'")
## subset_expt(): There were 20, now there are 18 samples.
data_structures <- c(data_structures, "lp_macrophage_nosb")
save(list = data_structures, file = glue::glue("rda/tmrc2_data_structures-v{ver}.rda"))
pander::pander(sessionInfo())
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53(v.2021.07), BSgenome(v.1.66.1), rtracklayer(v.1.58.0), Biostrings(v.2.66.0), XVector(v.0.38.0), futile.logger(v.1.4.3), org.Lpanamensis.MHOMCOL81L13.v46.eg.db(v.2020.07), AnnotationDbi(v.1.60.0), EuPathDB(v.1.6.0), GenomeInfoDbData(v.1.2.9), Heatplus(v.3.6.0), hpgltools(v.1.0), testthat(v.3.1.6), SummarizedExperiment(v.1.28.0), GenomicRanges(v.1.50.2), GenomeInfoDb(v.1.34.6), IRanges(v.2.32.0), S4Vectors(v.0.36.1), MatrixGenerics(v.1.10.0), matrixStats(v.0.63.0), Biobase(v.2.58.0) and BiocGenerics(v.0.44.0)
loaded via a namespace (and not attached): rappdirs(v.0.3.3), AnnotationForge(v.1.40.0), tidyr(v.1.2.1), ggplot2(v.3.4.0), bit64(v.4.0.5), knitr(v.1.41), DelayedArray(v.0.24.0), data.table(v.1.14.6), KEGGREST(v.1.38.0), RCurl(v.1.98-1.9), doParallel(v.1.0.17), generics(v.0.1.3), GenomicFeatures(v.1.50.3), lambda.r(v.1.2.4), callr(v.3.7.3), RhpcBLASctl(v.0.21-247.1), cowplot(v.1.1.1), usethis(v.2.1.6), RSQLite(v.2.2.20), shadowtext(v.0.1.2), tzdb(v.0.3.0), bit(v.4.0.5), enrichplot(v.1.18.3), xml2(v.1.3.3), httpuv(v.1.6.7), assertthat(v.0.2.1), viridis(v.0.6.2), xfun(v.0.36), hms(v.1.1.2), jquerylib(v.0.1.4), evaluate(v.0.19), promises(v.1.2.0.1), fansi(v.1.0.3), restfulr(v.0.0.15), progress(v.1.2.2), caTools(v.1.18.2), dbplyr(v.2.2.1), igraph(v.1.3.5), DBI(v.1.1.3), htmlwidgets(v.1.6.0), stringdist(v.0.9.10), purrr(v.1.0.0), ellipsis(v.0.3.2), dplyr(v.1.0.10), backports(v.1.4.1), annotate(v.1.76.0), aod(v.1.3.2), biomaRt(v.2.54.0), vctrs(v.0.5.1), remotes(v.2.4.2), cachem(v.1.0.6), withr(v.2.5.0), ggforce(v.0.4.1), HDO.db(v.0.99.1), vroom(v.1.6.0), AnnotationHubData(v.1.28.0), GenomicAlignments(v.1.34.0), treeio(v.1.22.0), prettyunits(v.1.1.1), DOSE(v.3.24.2), ape(v.5.6-2), lazyeval(v.0.2.2), crayon(v.1.5.2), genefilter(v.1.80.2), edgeR(v.3.40.1), pkgconfig(v.2.0.3), tweenr(v.2.0.2), nlme(v.3.1-161), pkgload(v.1.3.2), devtools(v.2.4.5), rlang(v.1.0.6), lifecycle(v.1.0.3), miniUI(v.0.1.1.1), downloader(v.0.4), filelock(v.1.0.2), BiocFileCache(v.2.6.0), AnnotationHub(v.3.6.0), rprojroot(v.2.0.3), polyclip(v.1.10-4), graph(v.1.76.0), Matrix(v.1.5-3), aplot(v.0.1.9), boot(v.1.3-28.1), processx(v.3.8.0), png(v.0.1-8), viridisLite(v.0.4.1), rjson(v.0.2.21), bitops(v.1.0-7), gson(v.0.0.9), KernSmooth(v.2.23-20), pander(v.0.6.5), blob(v.1.2.3), stringr(v.1.5.0), qvalue(v.2.30.0), readr(v.2.1.3), gridGraphics(v.0.5-1), scales(v.1.2.1), memoise(v.2.0.1), GSEABase(v.1.60.0), magrittr(v.2.0.3), plyr(v.1.8.8), gplots(v.3.1.3), zlibbioc(v.1.44.0), compiler(v.4.2.0), scatterpie(v.0.1.8), BiocIO(v.1.8.0), RColorBrewer(v.1.1-3), lme4(v.1.1-31), Rsamtools(v.2.14.0), cli(v.3.5.0), urlchecker(v.1.0.1), patchwork(v.1.1.2), ps(v.1.7.2), formatR(v.1.13), MASS(v.7.3-58.1), mgcv(v.1.8-41), tidyselect(v.1.2.0), stringi(v.1.7.8), yaml(v.2.3.6), GOSemSim(v.2.24.0), locfit(v.1.5-9.7), ggrepel(v.0.9.2), biocViews(v.1.66.2), grid(v.4.2.0), sass(v.0.4.4), fastmatch(v.1.1-3), tools(v.4.2.0), parallel(v.4.2.0), rstudioapi(v.0.14), foreach(v.1.5.2), gridExtra(v.2.3), farver(v.2.1.1), ggraph(v.2.1.0), digest(v.0.6.31), BiocManager(v.1.30.19), shiny(v.1.7.4), Rcpp(v.1.0.9), broom(v.1.0.2), BiocVersion(v.3.16.0), later(v.1.3.0), OrganismDbi(v.1.40.0), httr(v.1.4.4), Rdpack(v.2.4), colorspace(v.2.0-3), rvest(v.1.0.3), brio(v.1.1.3), XML(v.3.99-0.13), fs(v.1.5.2), splines(v.4.2.0), BiocCheck(v.1.34.2), RBGL(v.1.74.0), yulab.utils(v.0.0.6), PROPER(v.1.30.0), tidytree(v.0.4.2), graphlayouts(v.0.8.4), ggplotify(v.0.1.0), plotly(v.4.10.1), sessioninfo(v.1.2.2), xtable(v.1.8-4), futile.options(v.1.0.1), jsonlite(v.1.8.4), nloptr(v.2.0.3), ggtree(v.3.6.2), tidygraph(v.1.2.2), ggfun(v.0.0.9), RUnit(v.0.4.32), R6(v.2.5.1), profvis(v.0.3.7), pillar(v.1.8.1), htmltools(v.0.5.4), mime(v.0.12), glue(v.1.6.2), fastmap(v.1.1.0), minqa(v.1.2.5), clusterProfiler(v.4.6.0), BiocParallel(v.1.32.5), interactiveDisplayBase(v.1.36.0), codetools(v.0.2-18), fgsea(v.1.24.0), pkgbuild(v.1.4.0), utf8(v.1.2.2), lattice(v.0.20-45), bslib(v.0.4.2), tibble(v.3.1.8), sva(v.3.46.0), pbkrtest(v.0.5.1), curl(v.4.3.3), gtools(v.3.9.4), zip(v.2.2.2), openxlsx(v.4.2.5.1), GO.db(v.3.16.0), survival(v.3.4-0), limma(v.3.54.0), rmarkdown(v.2.19), desc(v.1.4.2), munsell(v.0.5.0), iterators(v.1.0.14), variancePartition(v.1.28.0), reshape2(v.1.4.4), gtable(v.0.3.1) and rbibutils(v.2.2.11)
message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 911e7d4beebdc73267ec4be631a305574289efd3
## This is hpgltools commit: Tue Jan 17 10:36:44 2023 -0500: 911e7d4beebdc73267ec4be631a305574289efd3
message(paste0("Saving to ", savefile))
## Saving to tmrc2_datasets_202301.rda.xz
tmp <- sm(saveme(filename = savefile))
tmp <- loadme(filename = savefile)