1 TODO

  1. Have a set where we merge 2.1/2.2, 2.3/2.4.
  2. Represent the transition from a global view of the samples without any classification, then state the subpopulations/zymodemes, then add sensitivity/resistance, then cure/fail.
  3. Perhaps make an explicit plot where all samples are one color except for a relatively small set of previously assayed samples? The samples which would be colored in this view would be from Olga’s 2014 paper I think.
  4. Make a flow diagram going from s/r -> subpopulation -> c/f/u. (sankey)
  5. Make a table similar to the TMRC3 containing the statuses of the samples.
  6. Explicitly consider metadata column ‘P’ for reference strains – make an all grey plot with a few samples colored taken from this column.

2 Changelog

2.1 20230410

  • Updating the version number due to some moderately intrusive changes I made in order to more carefully create plots of the differential expression data. I don’t think anything I did should actually change any of the data, but some of the analyses are definitely affected (note that the only change in results is due to a mistake I made in defining one of the contrasts; all other changes are just plot aesthetic improvements).

2.2 20230205

  • Did the stuff on this morning’s TODO which came out of this morning’s meeting: do a PCA without the oddball strains (already done in the worksheet), highlight reference strains, and add L.major IDs and Descriptions (done by appending a collapsed version of the ortholog data to the all_lp_annot data).

  • Fixed human IDs for the macrophage data.

  • Changed input metadata sheets: primarily because I only remembered yesterday to finish the SL search for samples >TMRC20095. They are running now and will be added momentarily (I will have to redownload the sheet).

  • Setting up to make a hclust/phylogenetic tree of strains, using these as references: 2168(2.3), 2272(2.2); for other 2.x choose arbitrarily (lower numbers are better).

  • Added another sanitize columns call for Antimony vs. antimony and None vs. none in the TMRC2 macrophage samples.

3 Introduction

This document is intended to create the data structures used to evaluate our TMRC2 samples. In some cases, this includes only those samples starting in 2019; in other instances I am including our previous (2015-2016) samples.

In all cases the processing performed was:

  1. Default trimming was performed.
  2. Hisat2 was used to map the remaining reads against the Leishmania panamensis genome revision 36.
  3. The alignments from hisat2 were used to count reads/gene against the revision 36 annotations with htseq.
  4. These alignments were also passed to the pileup functionality of samtools and the vcf/bcf utilities in order to make a matrix of all observed differences between each sample with respect to the reference.
  5. The freebayes variant estimation tool was used in addition to #4 to search for variant positions in a more robust fashion.
  6. The trimmed reads were passed to kraken2 using a viral database in order to look for samples with potential LRV sequence.
  7. An explicit, grep-based search for spliced leader reads was used against all human-derived samples. The results from this were copy/pasted into the sample sheet.

4 Notes 20221206 meeting

I am thinking that this meeting will bring Maria Adelaida fully back into the analyses of the parasite data, and therefore may focus primarily on the goals rather than the analyses?

  • Maria Adelaida meeting with Olga/Mariana: integrating transcriptomics/genomics question.
  • Paper on relationship btwn primary metadata factors via transcriptome/genome.
  • Second on drug susceptibility without those factors (I think this means the macrophages)
  • Definition of species? MAG: Define consensus sequences for various strains/species. We effectively have this on hand, though the quality may be a little less good for 2.3.
  • Resulting goal: Create a tree of the strains (I am just going to call zymodemes strains from now on).
    • What organisms would we include in a tree to describe these relationships: guyanensis, braziliensis 2904, 2.2, 2.3, 2.1, 2.4, panamensis reference, peruviania(sp? I have not seen this genome), panama, 2903; actually this may be tricky because we have always done this with a specific reference strain (panamensis col) which is one of the strains in the comparison. hmm…
    • Check the most variant strains for identity (Luc)
    • Methods for creating tree, traditional phylogeny vs. variant hclust?
  • PCR queries, works well if one performs sanger sequencing.
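
As a point of reference for the 'variant hclust' option raised above, here is a minimal sketch in base R. The presence/absence matrix, the sample names, and the choice of binary distance with average linkage are all assumptions made for illustration; none of them is taken from the real variant data.

```r
## Toy presence/absence matrix: rows are samples, columns are variant positions.
## All names and values here are invented for illustration.
variants <- rbind(
  s1 = c(1, 1, 0, 0, 1, 0),
  s2 = c(1, 1, 0, 0, 0, 0),
  s3 = c(0, 0, 1, 1, 0, 1),
  s4 = c(0, 0, 1, 1, 1, 1))
colnames(variants) <- paste0("pos", seq_len(ncol(variants)))

## Binary (Jaccard-style) distance on the calls, then average-linkage clustering.
variant_dist <- dist(variants, method = "binary")
variant_tree <- hclust(variant_dist, method = "average")

## Cut the tree into two groups to see which samples cluster together.
groups <- cutree(variant_tree, k = 2)
groups
```

A traditional phylogeny would instead start from aligned consensus sequences, so this sketch only covers the clustering half of that comparison.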

4.1 Multiple datasets

In a couple of important ways the TMRC2 data is much more complex than the TMRC3:

  1. It comprises multiple, completely separate queries:
    1. Sequencing the parasite samples
    2. Sequencing a set of human macrophage samples which were infected with specific parasite samples.
  2. The parasite transcriptomic samples comprise multiple different types of queries:
    1. Differential expression to look at strain, susceptibility, and clinical outcomes.
    2. Individual variant searches to look for potentially useful SNPs for classification of parasite samples.
  3. The human macrophage samples may be used to query both the host and parasite transcriptomes because (at least when not drug treated) there is a tremendous population of parasite reads in them.

4.2 Sample sheet(s)

Our shared online sample sheet is nearly static at the time of this writing (202209); I expect at this point the only likely updates will be to annotate some strains as more or less susceptible to drug treatment.

sample_sheet <- glue("sample_sheets/tmrc2_samples_{ver}.xlsx")
macrophage_sheet <- glue("sample_sheets/tmrc2_macrophage_samples_{ver}_modified.xlsx")

5 Annotations

Everything which follows depends on the existing TriTrypDB annotations revision 46, circa 2019. The following block loads a database of these annotations and turns it into a matrix where the rows are genes and columns are all the annotation types provided by TriTrypDB.

The same database was used to create a matrix of orthologous genes between L.panamensis and all of the other species in the TriTrypDB.

The same database of annotations also provides mappings to the set of annotated GO categories for the L.panamensis genome along with gene lengths.

tt <- sm(library(EuPathDB))
orgdb <- "org.Lpanamensis.MHOMCOL81L13.v46.eg.db"
tt <- sm(library(orgdb, character.only = TRUE))
pan_db <- org.Lpanamensis.MHOMCOL81L13.v46.eg.db
all_fields <- columns(pan_db)
all_lp_annot <- sm(load_orgdb_annotations(
    pan_db,
    keytype = "gid",
    fields = c("annot_gene_entrez_id", "annot_gene_name",
               "annot_strand", "annot_chromosome", "annot_cds_length",
               "annot_gene_product")))$genes

testing <- load_orgdb_annotations(pan_db, keytype = "gid", fields = "all")
## Selecting the following fields, this might be too many: 
## ANNOT_BFD3_CDS, ANNOT_BFD3_MODEL, ANNOT_BFD6_CDS, ANNOT_BFD6_MODEL, ANNOT_CDS, ANNOT_CDS_LENGTH, ANNOT_CHROMOSOME, ANNOT_DIF_CDS, ANNOT_DIF_MODEL, ANNOT_EC_NUMBERS, ANNOT_EC_NUMBERS_DERIVED, ANNOT_EXON_COUNT, ANNOT_FC_BFD3_CDS, ANNOT_FC_BFD3_MODEL, ANNOT_FC_BFD6_CDS, ANNOT_FC_BFD6_MODEL, ANNOT_FC_DIF_CDS, ANNOT_FC_DIF_MODEL, ANNOT_FC_PF_CDS, ANNOT_FC_PF_MODEL, ANNOT_FIVE_PRIME_UTR_LENGTH, ANNOT_GENE_ENTREZ_ID, ANNOT_GENE_EXON_COUNT, ANNOT_GENE_HTS_NONCODING_SNPS, ANNOT_GENE_HTS_NONSYN_SYN_RATIO, ANNOT_GENE_HTS_NONSYNONYMOUS_SNPS, ANNOT_GENE_HTS_STOP_CODON_SNPS, ANNOT_GENE_HTS_SYNONYMOUS_SNPS, ANNOT_GENE_LOCATION_TEXT, ANNOT_GENE_NAME, ANNOT_GENE_ORTHOLOG_NUMBER, ANNOT_GENE_ORTHOMCL_NAME, ANNOT_GENE_PARALOG_NUMBER, ANNOT_GENE_PREVIOUS_IDS, ANNOT_GENE_PRODUCT, ANNOT_GENE_SOURCE_ID, ANNOT_GENE_TOTAL_HTS_SNPS, ANNOT_GENE_TRANSCRIPT_COUNT, ANNOT_GENE_TYPE, ANNOT_GO_COMPONENT, ANNOT_GO_FUNCTION, ANNOT_GO_ID_COMPONENT, ANNOT_GO_ID_FUNCTION, ANNOT_GO_ID_PROCESS, ANNOT_GO_PROCESS, ANNOT_HAS_MISSING_TRANSCRIPTS, ANNOT_INTERPRO_DESCRIPTION, ANNOT_INTERPRO_ID, ANNOT_IS_PSEUDO, ANNOT_ISOELECTRIC_POINT, ANNOT_LOCATION_TEXT, ANNOT_MATCHED_RESULT, ANNOT_MOLECULAR_WEIGHT, ANNOT_NO_TET_CDS, ANNOT_NO_TET_MODEL, ANNOT_ORGANISM, ANNOT_PF_CDS, ANNOT_PF_MODEL, ANNOT_PFAM_DESCRIPTION, ANNOT_PFAM_ID, ANNOT_PIRSF_DESCRIPTION, ANNOT_PIRSF_ID, ANNOT_PREDICTED_GO_COMPONENT, ANNOT_PREDICTED_GO_FUNCTION, ANNOT_PREDICTED_GO_ID_COMPONENT, ANNOT_PREDICTED_GO_ID_FUNCTION, ANNOT_PREDICTED_GO_ID_PROCESS, ANNOT_PREDICTED_GO_PROCESS, ANNOT_PROJECT_ID, ANNOT_PROSITEPROFILES_DESCRIPTION, ANNOT_PROSITEPROFILES_ID, ANNOT_PROTEIN_LENGTH, ANNOT_PROTEIN_SEQUENCE, ANNOT_SEQUENCE_ID, ANNOT_SIGNALP_PEPTIDE, ANNOT_SIGNALP_SCORES, ANNOT_SMART_DESCRIPTION, ANNOT_SMART_ID, ANNOT_SOURCE_ID, ANNOT_STRAND, ANNOT_SUPERFAMILY_DESCRIPTION, ANNOT_SUPERFAMILY_ID, ANNOT_THREE_PRIME_UTR_LENGTH, ANNOT_TIGRFAM_DESCRIPTION, ANNOT_TIGRFAM_ID, ANNOT_TM_COUNT, ANNOT_TRANS_FOUND_PER_GENE_INTERNAL, ANNOT_TRANSCRIPT_INDEX_PER_GENE, 
ANNOT_TRANSCRIPT_LENGTH, ANNOT_TRANSCRIPT_LINK, ANNOT_TRANSCRIPT_PRODUCT, ANNOT_TRANSCRIPT_SEQUENCE, ANNOT_TRANSCRIPTS_FOUND_PER_GENE, ANNOT_UNIPROT_ID, ANNOT_URI, ANNOT_WDK_WEIGHT
## Extracted all gene ids.
## Attempting to select: ANNOT_BFD3_CDS, ANNOT_BFD3_MODEL, ANNOT_BFD6_CDS, ANNOT_BFD6_MODEL, ANNOT_CDS, ANNOT_CDS_LENGTH, ANNOT_CHROMOSOME, ANNOT_DIF_CDS, ANNOT_DIF_MODEL, ANNOT_EC_NUMBERS, ANNOT_EC_NUMBERS_DERIVED, ANNOT_EXON_COUNT, ANNOT_FC_BFD3_CDS, ANNOT_FC_BFD3_MODEL, ANNOT_FC_BFD6_CDS, ANNOT_FC_BFD6_MODEL, ANNOT_FC_DIF_CDS, ANNOT_FC_DIF_MODEL, ANNOT_FC_PF_CDS, ANNOT_FC_PF_MODEL, ANNOT_FIVE_PRIME_UTR_LENGTH, ANNOT_GENE_ENTREZ_ID, ANNOT_GENE_EXON_COUNT, ANNOT_GENE_HTS_NONCODING_SNPS, ANNOT_GENE_HTS_NONSYN_SYN_RATIO, ANNOT_GENE_HTS_NONSYNONYMOUS_SNPS, ANNOT_GENE_HTS_STOP_CODON_SNPS, ANNOT_GENE_HTS_SYNONYMOUS_SNPS, ANNOT_GENE_LOCATION_TEXT, ANNOT_GENE_NAME, ANNOT_GENE_ORTHOLOG_NUMBER, ANNOT_GENE_ORTHOMCL_NAME, ANNOT_GENE_PARALOG_NUMBER, ANNOT_GENE_PREVIOUS_IDS, ANNOT_GENE_PRODUCT, ANNOT_GENE_SOURCE_ID, ANNOT_GENE_TOTAL_HTS_SNPS, ANNOT_GENE_TRANSCRIPT_COUNT, ANNOT_GENE_TYPE, ANNOT_GO_COMPONENT, ANNOT_GO_FUNCTION, ANNOT_GO_ID_COMPONENT, ANNOT_GO_ID_FUNCTION, ANNOT_GO_ID_PROCESS, ANNOT_GO_PROCESS, ANNOT_HAS_MISSING_TRANSCRIPTS, ANNOT_INTERPRO_DESCRIPTION, ANNOT_INTERPRO_ID, ANNOT_IS_PSEUDO, ANNOT_ISOELECTRIC_POINT, ANNOT_LOCATION_TEXT, ANNOT_MATCHED_RESULT, ANNOT_MOLECULAR_WEIGHT, ANNOT_NO_TET_CDS, ANNOT_NO_TET_MODEL, ANNOT_ORGANISM, ANNOT_PF_CDS, ANNOT_PF_MODEL, ANNOT_PFAM_DESCRIPTION, ANNOT_PFAM_ID, ANNOT_PIRSF_DESCRIPTION, ANNOT_PIRSF_ID, ANNOT_PREDICTED_GO_COMPONENT, ANNOT_PREDICTED_GO_FUNCTION, ANNOT_PREDICTED_GO_ID_COMPONENT, ANNOT_PREDICTED_GO_ID_FUNCTION, ANNOT_PREDICTED_GO_ID_PROCESS, ANNOT_PREDICTED_GO_PROCESS, ANNOT_PROJECT_ID, ANNOT_PROSITEPROFILES_DESCRIPTION, ANNOT_PROSITEPROFILES_ID, ANNOT_PROTEIN_LENGTH, ANNOT_PROTEIN_SEQUENCE, ANNOT_SEQUENCE_ID, ANNOT_SIGNALP_PEPTIDE, ANNOT_SIGNALP_SCORES, ANNOT_SMART_DESCRIPTION, ANNOT_SMART_ID, ANNOT_SOURCE_ID, ANNOT_STRAND, ANNOT_SUPERFAMILY_DESCRIPTION, ANNOT_SUPERFAMILY_ID, ANNOT_THREE_PRIME_UTR_LENGTH, ANNOT_TIGRFAM_DESCRIPTION, ANNOT_TIGRFAM_ID, ANNOT_TM_COUNT, ANNOT_TRANS_FOUND_PER_GENE_INTERNAL, 
ANNOT_TRANSCRIPT_INDEX_PER_GENE, ANNOT_TRANSCRIPT_LENGTH, ANNOT_TRANSCRIPT_LINK, ANNOT_TRANSCRIPT_PRODUCT, ANNOT_TRANSCRIPT_SEQUENCE, ANNOT_TRANSCRIPTS_FOUND_PER_GENE, ANNOT_UNIPROT_ID, ANNOT_URI, ANNOT_WDK_WEIGHT
## 'select()' returned 1:1 mapping between keys and columns
lp_go <- load_orgdb_go(pan_db)
lp_go <- lp_go[, c("GID", "GO")]
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths)  <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
orthos <- sm(EuPathDB::extract_eupath_orthologs(db = pan_db))

5.1 Repeat for the L.major annotations

Recently there was a request to include the Leishmania major gene IDs and descriptions. Thus I will extract them along with the orthologs and append that to the annotations used.

Having spent the time to run the following code, I realized that the orthologs data structure above actually already has the gene IDs and descriptions.

Thus I will leave my query in place to extract the major annotations, but follow it up with a collapse of the major orthologs and appending of that to the panamensis annotations.

orgdb <- "org.Lmajor.Friedlin.v49.eg.db"
tt <- sm(library(orgdb, character.only = TRUE))
major_db <- org.Lmajor.Friedlin.v49.eg.db
all_fields <- columns(major_db)
all_lm_annot <- sm(load_orgdb_annotations(
    major_db,
    keytype = "gid",
    fields = c("annot_gene_entrez_id", "annot_gene_name",
               "annot_strand", "annot_chromosome", "annot_cds_length",
               "annot_gene_product")))$genes

wanted_orthos_idx <- orthos[["ORTHOLOGS_SPECIES"]] == "Leishmania major strain Friedlin"
sum(wanted_orthos_idx)
## [1] 10796
wanted_orthos <- orthos[wanted_orthos_idx, ]
wanted_orthos <- wanted_orthos[, c("GID", "ORTHOLOGS_ID", "ORTHOLOGS_NAME")]

collapsed_orthos <- wanted_orthos %>%
  group_by(GID) %>%
  summarise(collapsed_id = stringr::str_c(ORTHOLOGS_ID, collapse=" ; "),
            collapsed_name = stringr::str_c(ORTHOLOGS_NAME, collapse=" ; "))
all_lp_annot <- merge(all_lp_annot, collapsed_orthos, by.x = "row.names",
                      by.y = "GID", all.x = TRUE)
rownames(all_lp_annot) <- all_lp_annot[["Row.names"]]
all_lp_annot[["Row.names"]] <- NULL
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot")
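
The collapse-and-append step above can be sketched on toy data with base R alone. This is an illustration under stated assumptions: the gene IDs, ortholog names, and the toy_ objects are invented, and aggregate() stands in for the group_by()/summarise() calls used in the real block.

```r
## Toy ortholog table: one row per (panamensis gene, L. major ortholog) pair.
## Gene IDs and names are invented placeholders.
toy_orthos <- data.frame(
  GID = c("geneA", "geneA", "geneB"),
  ORTHOLOGS_ID = c("major1", "major2", "major3"),
  ORTHOLOGS_NAME = c("kinase", "kinase-like", "transporter"),
  stringsAsFactors = FALSE)
toy_annot <- data.frame(
  annot_gene_product = c("protein a", "protein b", "protein c"),
  row.names = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE)

## Collapse to one row per GID, joining multiple orthologs with " ; ".
collapse <- function(x) paste(x, collapse = " ; ")
toy_collapsed <- aggregate(
  toy_orthos[, c("ORTHOLOGS_ID", "ORTHOLOGS_NAME")],
  by = list(GID = toy_orthos[["GID"]]),
  FUN = collapse)
colnames(toy_collapsed) <- c("GID", "collapsed_id", "collapsed_name")

## Left join onto the annotations; genes without an ortholog get NA.
toy_merged <- merge(toy_annot, toy_collapsed, by.x = "row.names",
                    by.y = "GID", all.x = TRUE)
rownames(toy_merged) <- toy_merged[["Row.names"]]
toy_merged[["Row.names"]] <- NULL
toy_merged
```

The all.x = TRUE argument is what keeps every panamensis gene in the result, including those with no L.major ortholog.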

6 Load a genome

The following block loads the full genome sequence for panamensis. We may use this later to attempt to estimate PCR primers to discern strains.

meta <- sm(EuPathDB::download_eupath_metadata(webservice = "tritrypdb"))
lp_entry <- EuPathDB::get_eupath_entry(species = "Leishmania panamensis", metadata = meta)
## Found the following hits: Leishmania panamensis MHOM/COL/81/L13, Leishmania panamensis strain MHOM/PA/94/PSC-1, choosing the first.
## Using: Leishmania panamensis MHOM/COL/81/L13.
colnames(lp_entry)
##  [1] "AnnotationVersion"  "AnnotationSource"   "BiocVersion"        "DataProvider"       "Genome"             "GenomeSource"       "GenomeVersion"     
##  [8] "NumArrayGene"       "NumChipChipGene"    "NumChromosome"      "NumCodingGene"      "NumCommunity"       "NumContig"          "NumEC"             
## [15] "NumEST"             "NumGene"            "NumGO"              "NumOrtholog"        "NumOtherGene"       "NumPopSet"          "NumProteomics"     
## [22] "NumPseudogene"      "NumRNASeq"          "NumRTPCR"           "NumSNP"             "NumTFBS"            "Organellar"         "ReferenceStrain"   
## [29] "MegaBP"             "PrimaryKey"         "ProjectID"          "RecordClassName"    "SourceID"           "SourceVersion"      "TaxonomyID"        
## [36] "TaxonomyName"       "URLGenome"          "URLGFF"             "URLProtein"         "Coordinate_1_based" "Maintainer"         "SourceUrl"         
## [43] "Tags"               "BsgenomePkg"        "GrangesPkg"         "OrganismdbiPkg"     "OrgdbPkg"           "TxdbPkg"            "Taxon"             
## [50] "Genus"              "Species"            "Strain"             "BsgenomeFile"       "GrangesFile"        "OrganismdbiFile"    "OrgdbFile"         
## [57] "TxdbFile"           "GenusSpecies"       "TaxonUnmodified"    "TaxonCanonical"     "TaxonXref"
testing_panamensis <- "BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53"
## testing_panamensis <- EuPathDB::make_eupath_bsgenome(entry = lp_entry, eu_version = "v46")
library(as.character(testing_panamensis), character.only = TRUE)
## Loading required package: BSgenome
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## Loading required package: rtracklayer
lp_genome <- get0(as.character(testing_panamensis))
data_structures <- c(data_structures, "lp_genome")

7 Generate Expressionsets and Sample Estimation

The process of sample estimation takes two primary inputs:

  1. The sample sheet, which contains all the metadata we currently have on hand, including filenames for the outputs of #3 and #4 above.
  2. The gene annotations.

An expressionSet (or summarizedExperiment) is a data structure used in R to examine RNASeq data. It comprises annotations, metadata, and expression data. In the case of our processing pipeline, the location of the expression data is provided by the filenames in the metadata.
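
To make the three parts concrete, here is a small stand-in built from plain base R objects rather than a real expressionSet. The sample names, gene names, and values are invented; the point is only that the count matrix is keyed by gene IDs on the rows and sample IDs on the columns, and that both must line up with the annotations and metadata.

```r
## Stand-in for the three parts of an expressionset, using plain base R objects.
## Sample names, gene names, and counts are invented.
toy_counts <- matrix(c(10, 0, 5, 200, 3, 40), nrow = 3,
                     dimnames = list(c("gene1", "gene2", "gene3"),
                                     c("sampleA", "sampleB")))
toy_metadata <- data.frame(
  condition = c("z2.2", "z2.3"),
  row.names = c("sampleA", "sampleB"))
toy_annotations <- data.frame(
  annot_gene_product = c("product 1", "product 2", "product 3"),
  row.names = c("gene1", "gene2", "gene3"))

## The pieces stay consistent only if the IDs line up in both directions.
samples_match <- identical(colnames(toy_counts), rownames(toy_metadata))
genes_match <- identical(rownames(toy_counts), rownames(toy_annotations))
c(samples = samples_match, genes = genes_match)
```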

7.1 Notes

The following samples are much lower coverage:

  • TMRC20002
  • TMRC20006
  • TMRC20007
  • TMRC20008

There is a set of strains which acquired resistance in vitro. These are included in the dataset, but there are not likely enough of them to query that question explicitly.

7.2 Define colors

The following list contains the colors we have chosen to use when plotting the various ways of discerning the data.

color_choices <- list(
    "strain" = list(
        ## "z1.0" = "#333333", ## Changed this to 'braz' to make it easier to find them.
        "z2.0" = "#555555",
        "z3.0" = "#777777",
        "z2.1" = "#874400",
        "z2.2" = "#0000cc",
        "z2.3" = "#cc0000",
        "z2.4" = "#df7000",
        "z3.2" = "#888888",
        "z1.0" = "#cc00cc",
        "z1.5" = "#cc00cc",
        "b2904" = "#cc00cc",
        "unknown" = "#cbcbcb"),
    ## "null" = "#000000"),
    "cf" = list(
        "cure" = "#006f00",
        "fail" = "#9dffa0",
        "unknown" = "#cbcbcb",
        "notapplicable" = "#000000"),
    "susceptibility" = list(
        "resistant" = "#8563a7",
        "sensitive" = "#8d0000",
        "ambiguous" = "#cbcbcb",
        "unknown" = "#555555"))
data_structures <- c(data_structures, "color_choices")

8 Parasite-only data structure

The data structure ‘lp_expt’ contains the data for all samples which have hisat2 count tables, and which pass a few initial quality tests (e.g. they must have more than 8550 genes with >0 counts and >5e6 reads which mapped to a gene); genes which are annotated with a few key redundant categories (leishmanolysin for example) are also culled.
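
The two sample-level quality tests described above amount to a pair of column sums. The following is a minimal sketch on an invented count matrix, with thresholds scaled down to fit the toy data; whether the real filters use strict or inclusive comparisons is an assumption here, taken from the wording 'more than'.

```r
## Toy count matrix: rows are genes, columns are samples (all values invented).
toy_counts <- cbind(
  good = c(900, 1200, 300, 50),
  sparse = c(4000, 0, 0, 0),
  shallow = c(3, 1, 2, 1))
rownames(toy_counts) <- paste0("gene", 1:4)

## Thresholds scaled down for the toy data; the text above uses 8550 and 5e6.
min_nonzero_genes <- 3
min_reads <- 1000

## Genes with at least one read, and total reads assigned to genes, per sample.
nonzero_genes <- colSums(toy_counts > 0)
total_reads <- colSums(toy_counts)

keep <- nonzero_genes > min_nonzero_genes & total_reads > min_reads
filtered <- toy_counts[, keep, drop = FALSE]
colnames(filtered)
```

In this toy case the 'sparse' sample fails the non-zero gene test and the 'shallow' sample fails the coverage test, so only 'good' survives.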

8.1 All (almost) samples

There are a few metadata columns which we really want to make certain are standardized.

sanitize_columns <- c("passagenumber", "clinicalresponse", "clinicalcategorical",
                      "zymodemecategorical")
lp_expt <- create_expt(sample_sheet,
                       gene_info = all_lp_annot,
                       annotation_name = orgdb,
                       savefile = glue("rda/tmrc2_lp_expt_all_raw-v{ver}.rda"),
                       id_column = "hpglidentifier",
                       file_column = "lpanamensisv36hisatfile") %>%
  set_expt_conditions(fact = "zymodemecategorical") %>%
  subset_expt(nonzero = 8550) %>%
  set_expt_colors(color_choices[["strain"]]) %>%
  subset_expt(coverage = 5000000) %>%
  semantic_expt_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
                       semantic_column = "annot_gene_product") %>%
  sanitize_expt_metadata(columns = sanitize_columns) %>%
  set_expt_factors(columns = sanitize_columns, class = "factor")
## Reading the sample metadata.
## Dropped 11 rows from the sample metadata because the sample ID is blank.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## The sample definitions comprises: 110 rows(samples) and 71 columns(metadata fields).
## Warning in create_expt(sample_sheet, gene_info = all_lp_annot, annotation_name = orgdb, : Some samples were removed when cross referencing the samples
## against the count data.
## Matched 8778 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 8778 features and 105 samples.
## 
##   b2904 unknown    z1.0    z1.5    z2.0    z2.1    z2.2    z2.3    z2.4    z3.0    z3.2 
##       1       4       1       1       1       7      45      41       2       1       1
## The samples (and read coverage) removed when filtering 8550 non-zero genes are:
## subset_expt(): There were 105, now there are 103 samples.
## The samples removed (and read coverage) when filtering samples with less than 5e+06 reads are:
## TMRC20004 TMRC20029 
##    564812   1658096
## subset_expt(): There were 103, now there are 101 samples.
## semantic_expt_filter(): Removed 68 genes.
data_structures <- c(data_structures, "lp_expt")
save(list = "lp_expt", file = glue("rda/tmrc2_lp_expt_all_sanitized-v{ver}.rda"))

table(pData(lp_expt)[["zymodemecategorical"]])
## 
##   b2904 unknown     z10     z15     z20     z21     z22     z23     z24     z30     z32 
##       1       2       1       1       1       7      43      41       2       1       1
table(pData(lp_expt)[["clinicalresponse"]])
## 
##                                  cure                               failure                       laboratory line laboratory line miltefosine resistant 
##                                    38                                    38                                     1                                     1 
##                                    nd                      reference strain 
##                                    19                                     4
ncol(exprs(lp_expt))
## [1] 101

All the following data will derive from this starting point.

8.2 Extract historical susceptibility data

Column ‘Q’ in the sample sheet records the historical susceptibility as a percentage (x); make a categorical version of it with these parameters:

  • 0 <= x <= 35 is resistant
  • 36 <= x <= 48 is ambiguous
  • 49 <= x is sensitive

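Note that these cutoffs are only valid for the historical data. The newer susceptibility data uses a cutoff of 0.78 for sensitive, and I considered setting ambiguous to 0.5 to 0.78; note, however, that the values defined below use 0.76 and 0.77, and the ambiguous category was removed for the current set.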

max_resist_historical <- 0.35
min_sensitive_historical <- 0.49

## 202305: Removed ambiguous category for the current set.
max_resist_current <- 0.76
min_sensitive_current <- 0.77

The sanitize_percent() function seeks to make the percentage values recorded by Excel more reliable. Unfortunately, sometimes Excel displays the value ‘49%’ when the information recorded in the worksheet is any one of the following:

  • ’49%
  • 0.49
  • “0.49”

Thus, the following block will sanitize these percentage values into a single decimal number and make a categorical variable from it using pre-defined values for resistant/ambiguous/sensitive. This categorical variable will be stored in a new column: ‘sus_category_historical’.

st <- pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvhistoricaldata"]]
starting <- sanitize_percent(st)
st
##   [1] "0.45"                "0.14000000000000001" "0.97"                NA                    NA                    NA                   
##   [7] NA                    NA                    NA                    "0"                   "0.97"                "0"                  
##  [13] "0"                   "0.46"                "0.45"                "0.97"                "0.56000000000000005" "0.99"               
##  [19] "0.46"                "0.7"                 "0.99"                "0.99"                "0.45"                "0.98"               
##  [25] "0.99"                "0.49"                "No data"             "No data"             "0.99"                "0.66"               
##  [31] "0.99"                "No data"             "0.99"                "1"                   "1"                   "0.94"               
##  [37] "0.94"                "No data"             "No data"             "No data"             "No data"             "No data"            
##  [43] "No data"             "No data"             "No data"             "No data"             "No data"             "No data"            
##  [49] "No data"             "No data"             "No data"             "0.99"                "0.99"                "No data"            
##  [55] "0.98"                "0.97"                "0.96"                "0.96"                "0"                   "0"                  
##  [61] "0"                   "0.06"                "0.94"                "0.94"                "0.03"                "0.94"               
##  [67] "0"                   "0.25"                "0.95"                "0.27"                "No data"             "No data"            
##  [73] "No data"             "No data"             "No data"             "No data"             "No data"             "No data"            
##  [79] "No data"             "No data"             "No data"             "No data"             "No data"             "No data"            
##  [85] "No data"             "No data"             "No data"             "No data"             "No data"             "No data"            
##  [91] "No data"             "No data"             "No data"             "No data"             "No data"             "No data"            
##  [97] "No data"             "No data"             "No data"             "No data"             "No data"
starting
##   [1] 0.45 0.14 0.97   NA   NA   NA   NA   NA   NA 0.00 0.97 0.00 0.00 0.46 0.45 0.97 0.56 0.99 0.46 0.70 0.99 0.99 0.45 0.98 0.99 0.49   NA   NA 0.99 0.66
##  [31] 0.99   NA 0.99 1.00 1.00 0.94 0.94   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA 0.99 0.99   NA 0.98 0.97 0.96 0.96 0.00 0.00
##  [61] 0.00 0.06 0.94 0.94 0.03 0.94 0.00 0.25 0.95 0.27   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [91]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
sus_categorical <- starting
na_idx <- is.na(starting)
sum(na_idx)
## [1] 55
sus_categorical[na_idx] <- "unknown"

resist_idx <- starting <= max_resist_historical
sus_categorical[resist_idx] <- "resistant"
indeterminant_idx <- starting > max_resist_historical &
  starting < min_sensitive_historical
sus_categorical[indeterminant_idx] <- "ambiguous"
susceptible_idx <- starting >= min_sensitive_historical
sus_categorical[susceptible_idx] <- "sensitive"

sus_categorical <- as.factor(sus_categorical)
pData(lp_expt)[["sus_category_historical"]] <- sus_categorical
table(sus_categorical)
## sus_categorical
## ambiguous resistant sensitive   unknown 
##         5        12        29        55
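
The sanitize-then-categorize logic above can be sketched as two small functions. Both are hypothetical stand-ins written for illustration: toy_sanitize_percent() is not the sanitize_percent() used in this document, and its handling of quotes and percent signs is an assumption based on the three input forms listed earlier.

```r
## Hypothetical helper, not the sanitize_percent() used in this document.
## Turns "'49%", "0.49", and "\"0.49\"" into 0.49; anything else becomes NA.
toy_sanitize_percent <- function(x) {
  x <- trimws(as.character(x))
  x <- gsub("[\"']", "", x)            # drop stray ASCII quote characters
  has_pct <- grepl("%$", x)            # remember which entries were percentages
  x <- sub("%$", "", x)
  value <- suppressWarnings(as.numeric(x))
  value[has_pct] <- value[has_pct] / 100
  value
}

## Same three-way binning as above, with the thresholds passed in.
categorize_susceptibility <- function(x, max_resist, min_sensitive) {
  category <- rep("ambiguous", length(x))
  category[is.na(x)] <- "unknown"
  category[!is.na(x) & x <= max_resist] <- "resistant"
  category[!is.na(x) & x >= min_sensitive] <- "sensitive"
  factor(category, levels = c("ambiguous", "resistant", "sensitive", "unknown"))
}

toy_raw <- c("'49%", "0.49", "\"0.49\"", "No data", NA, "0.14000000000000001", "0.45")
toy_values <- toy_sanitize_percent(toy_raw)
toy_categories <- categorize_susceptibility(toy_values,
                                            max_resist = 0.35,
                                            min_sensitive = 0.49)
table(toy_categories)
```

Unlike the as.factor() call above, this sketch fixes the factor levels up front, so a category with zero samples still appears in the table.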

8.3 Extract current susceptibility data

The same process will be repeated for the current iteration of the sensitivity assay and stored in the ‘sus_category_current’ column.

starting_current <- sanitize_percent(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvcurrentdata"]])
sus_categorical_current <- starting_current
na_idx <- is.na(starting_current)
sum(na_idx)
## [1] 5
sus_categorical_current[na_idx] <- "unknown"

resist_idx <- starting_current <= max_resist_current
sus_categorical_current[resist_idx] <- "resistant"
indeterminant_idx <- starting_current > max_resist_current &
  starting_current < min_sensitive_current
sus_categorical_current[indeterminant_idx] <- "ambiguous"
susceptible_idx <- starting_current >= min_sensitive_current
sus_categorical_current[susceptible_idx] <- "sensitive"
sus_categorical_current <- as.factor(sus_categorical_current)

pData(lp_expt)[["sus_category_current"]] <- sus_categorical_current
table(sus_categorical_current)
## sus_categorical_current
## resistant sensitive   unknown 
##        47        49         5

8.4 Extract samples from only the two ‘canonical’ strains

8.4.1 Quick divergence

Here is a table of my current classifier’s interpretation of the strains.

table(pData(lp_expt)[["knnv2classification"]])
## 
## z10 z21 z22 z23 z24 z32 
##   3   6  47  41   2   2

In many queries, we will seek to compare only the two primary strains, zymodeme 2.2 and 2.3. The following block will extract only those samples.

lp_strain <- lp_expt %>%
  set_expt_batches(fact = sus_categorical_current) %>%
  set_expt_colors(color_choices[["strain"]])
## 
## resistant sensitive   unknown 
##        47        49         5
table(pData(lp_strain)[["condition"]])
## 
##   b2904 unknown    z1.0    z1.5    z2.0    z2.1    z2.2    z2.3    z2.4    z3.0    z3.2 
##       1       2       1       1       1       7      43      41       2       1       1
save(list = "lp_strain", file = glue("rda/tmrc2_lp_strain-v{ver}.rda"))
data_structures <- c(data_structures, "lp_strain")

lp_two_strains <- subset_expt(lp_strain, subset = "condition=='z2.3'|condition=='z2.2'")
## subset_expt(): There were 101, now there are 84 samples.
save(list = "lp_two_strains",
     file = glue("rda/tmrc2_lp_two_strains-v{ver}.rda"))
data_structures <- c(data_structures, "lp_two_strains")
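
The subset string passed to subset_expt() above selects samples whose condition is z2.2 or z2.3 (43 + 41 = 84 of the 101 samples). A minimal base R sketch of the same selection on an invented metadata table, assuming nothing about how subset_expt() evaluates the string internally:

```r
## Invented metadata table; only the 'condition' column matters here.
toy_metadata <- data.frame(
  condition = c("z2.2", "z2.3", "z2.1", "unknown", "z2.2"),
  row.names = paste0("sample", 1:5))

## Equivalent of "condition=='z2.3'|condition=='z2.2'".
keep <- toy_metadata[["condition"]] %in% c("z2.2", "z2.3")
toy_two_strains <- toy_metadata[keep, , drop = FALSE]
table(toy_two_strains[["condition"]])
```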

8.5 Clinical outcome

Clinical outcome is by far the most problematic comparison in this data, but here is the recategorization of the data using it:

lp_cf <- set_expt_conditions(lp_expt, fact = "clinicalcategorical") %>%
  set_expt_batches(fact = sus_categorical_current) %>%
  set_expt_colors(color_choices[["cf"]])
## 
##    cure    fail unknown 
##      38      38      25 
## 
## resistant sensitive   unknown 
##        47        49         5
## Warning in set_expt_colors(., color_choices[["cf"]]): Colors for the following categories are not being used: notapplicable.
table(pData(lp_cf)[["condition"]])
## 
##    cure    fail unknown 
##      38      38      25
data_structures <- c(data_structures, "lp_cf")
save(list = "lp_cf",
     file = glue("rda/tmrc2_lp_cf-v{ver}.rda"))

lp_cf_known <- subset_expt(lp_cf, subset="condition!='unknown'")
## subset_expt(): There were 101, now there are 76 samples.
data_structures <- c(data_structures, "lp_cf_known")
save(list = "lp_cf_known",
     file = glue("rda/tmrc2_lp_cf_known-v{ver}.rda"))

8.6 Create a historical susceptibility dataset

Use the factorized version of susceptibility to categorize the samples by the historical data.

lp_susceptibility_historical <- set_expt_conditions(lp_expt, fact = "sus_category_historical") %>%
  set_expt_batches(fact = "clinicalcategorical") %>%
  set_expt_colors(colors = color_choices[["susceptibility"]])
## 
## ambiguous resistant sensitive   unknown 
##         5        12        29        55 
## 
##    cure    fail unknown 
##      38      38      25
save(list = "lp_susceptibility_historical",
     file = glue("rda/tmrc2_lp_susceptibility_historical-v{ver}.rda"))
data_structures <- c(data_structures, "lp_susceptibility_historical")

8.7 Create a current susceptibility dataset

Use the factorized version of susceptibility to categorize the samples by the current data.

This will likely be our canonical susceptibility dataset, so I will remove the suffix and just call it ‘lp_susceptibility’.

lp_susceptibility <- set_expt_conditions(lp_expt, fact = "sus_category_current") %>%
  set_expt_batches(fact = "clinicalcategorical") %>%
  set_expt_colors(colors = color_choices[["susceptibility"]])
## 
## resistant sensitive   unknown 
##        47        49         5 
## 
##    cure    fail unknown 
##      38      38      25
## Warning in set_expt_colors(., colors = color_choices[["susceptibility"]]): Colors for the following categories are not being used: ambiguous.
save(list = "lp_susceptibility",
     file = glue("rda/tmrc2_lp_susceptibility-v{ver}.rda"))
data_structures <- c(data_structures, "lp_susceptibility")

8.8 Pull out only the samples with two zymodemes

I think this is redundant with a previous block, but I am leaving it until I am certain that it is not required in a following document.

lp_zymo <- subset_expt(lp_expt, subset = "condition=='z2.2'|condition=='z2.3'")
## subset_expt(): There were 101, now there are 84 samples.
data_structures <- c(data_structures, "lp_zymo")
save(list = "lp_zymo",
     file = glue("rda/tmrc2_lp_zymo-v{ver}.rda"))

9 Variant data using parasite RNASeq reads

The following section will create some initial data structures of the observed variants in the parasite samples. This will include some of our 2016 samples for some classification queries.

9.1 The 2016 variant data

I changed and improved the mapping and variant detection methods from what we used for the 2016 data. So some small changes will be required to merge them.

lp_previous <- create_expt("sample_sheets/tmrc2_samples_20191203.xlsx",
                           file_column = "tophat2file",
                           savefile = glue("rda/lp_previous-v{ver}.rda"))
## Reading the sample metadata.
## Dropped 13 rows from the sample metadata because the sample ID is blank.
## The sample definitions comprises: 50 rows(samples) and 38 columns(metadata fields).
## Warning in create_expt("sample_sheets/tmrc2_samples_20191203.xlsx", file_column = "tophat2file", : Some samples were removed when cross referencing the
## samples against the count data.
## Matched 8841 annotations and counts.
## Bringing together the count matrix and gene information.
## The final expressionset has 8841 features and 33 samples.
tt <- lp_previous$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\-1$", replacement = "", x = rownames(tt))
lp_previous$expressionset <- tt
rm(tt)
data_structures <- c(data_structures, "lp_previous")
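
The three gsub() calls above strip an 'exon_' prefix and a trailing '.1' or '-1' from the gene IDs. Here is a minimal base R sketch of the same cleanup on a toy vector. The helper name clean_gene_ids() and the example IDs are made up for illustration; they are not part of hpgltools or of the real data.

```r
## Hypothetical helper which applies the same three substitutions in the same order.
clean_gene_ids <- function(ids) {
  ids <- gsub(pattern = "^exon_", replacement = "", x = ids)  ## leading 'exon_'
  ids <- gsub(pattern = "\\.1$", replacement = "", x = ids)   ## trailing '.1'
  ids <- gsub(pattern = "\\-1$", replacement = "", x = ids)   ## trailing '-1'
  return(ids)
}

## Invented IDs, only here to show each pattern firing once.
old_ids <- c("exon_LPAL13_000005000.1", "exon_LPAL13_000005100-1", "LPAL13_000005200")
clean_ids <- clean_gene_ids(old_ids)
clean_ids
```

IDs which are already clean pass through unchanged, so the helper is safe to run on a mixed set.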

9.2 Create the SNP expressionset

The count_expt_snps() function uses our expressionset data and a metadata column in order to extract the mpileup or freebayes-based variant calls and create matrices of the likelihood that each position-per-sample is in fact a variant.

There is an important caveat here, which changed in 202301: I had been using the PAIRED tag, which is (unsurprisingly) only populated for paired-end samples. A couple of samples are not paired, so they were failing silently. The QA tag looks more appropriate and should work for both types. To find out, I am setting it here and will check whether the results make more sense for my test samples (TMRC20001, TMRC20005, TMRC20007).

## The next line drops the samples which are missing the SNP pipeline.
lp_snp <- subset_expt(lp_expt, subset = "!is.na(pData(lp_expt)[['freebayessummary']])")
## subset_expt(): There were 101, now there are 101 samples.
new_snps <- count_expt_snps(lp_snp, annot_column = "freebayessummary", snp_column = "QA")
## New names:
## • `DP` -> `DP...3`
## • `RO` -> `RO...8`
## • `AO` -> `AO...9`
## • `QR` -> `QR...12`
## • `QA` -> `QA...13`
## • `DP` -> `DP...42`
## • `RO` -> `RO...43`
## • `QR` -> `QR...44`
## • `AO` -> `AO...45`
## • `QA` -> `QA...46`
## Let's see if we get numbers which make sense.
summary(exprs(new_snps)[["tmrc20001"]])  ## My weirdo sample
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0    13.9     0.0  2217.0
summary(exprs(new_snps)[["tmrc20072"]])  ## Another sample chosen at random
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0      64       0  247568
summary(exprs(new_snps)[["tmrc20021"]])  ## Another sample chosen at random
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0     686       0 1708458
## Now that we are reasonably confident that things make more sense, let's save and move on...
data_structures <- c(data_structures, "new_snps")

tt <- normalize_expt(new_snps, transform = "log2")
## transform_counts: Found 146144951 values equal to 0, adding 1 to the matrix.
plot_boxplot(tt)

Now let us pull in the 2016 data.

old_snps <- count_expt_snps(lp_previous, annot_column = "bcftable", snp_column = 2)
## The rownames are missing the chromosome identifier,
## they probably came from an older version of this method.
data_structures <- c(data_structures, "old_snps")

save(list = "lp_snp",
     file = glue("rda/lp_snp-v{ver}.rda"))
data_structures <- c(data_structures, "lp_snp")
## new_snps and old_snps were already appended to data_structures above.
save(list = "new_snps",
     file = glue("rda/new_snps-v{ver}.rda"))
save(list = "old_snps",
     file = glue("rda/old_snps-v{ver}.rda"))

nonzero_snps <- exprs(new_snps) != 0
colSums(nonzero_snps)
## tmrc20001 tmrc20065 tmrc20005 tmrc20007 tmrc20008 tmrc20027 tmrc20028 tmrc20032 tmrc20040 tmrc20066 tmrc20039 tmrc20037 tmrc20038 tmrc20067 tmrc20068 
##     74798     93649     12389      8675       975    351343    338580    146302     58753     93615     25115     98958     97676     93954     96583 
## tmrc20041 tmrc20015 tmrc20009 tmrc20010 tmrc20016 tmrc20011 tmrc20012 tmrc20013 tmrc20017 tmrc20014 tmrc20018 tmrc20019 tmrc20070 tmrc20020 tmrc20021 
##     53184     96398     16031     94079    146124     14055       456     95040     48288     17245    140438     14829     97336     15484    101127 
## tmrc20022 tmrc20025 tmrc20024 tmrc20036 tmrc20069 tmrc20033 tmrc20026 tmrc20031 tmrc20076 tmrc20073 tmrc20055 tmrc20079 tmrc20071 tmrc20078 tmrc20094 
##     18143    364240     18471     60087     18792     33663     15074     19139     17559     96169     22246     96224     94353     18836     87878 
## tmrc20042 tmrc20058 tmrc20072 tmrc20059 tmrc20048 tmrc20057 tmrc20088 tmrc20056 tmrc20060 tmrc20077 tmrc20074 tmrc20063 tmrc20053 tmrc20052 tmrc20064 
##     19734     94524     50292     94091     97164     48944     15594     22683     21506     17925     21015     28254     20181    100709     93173 
## tmrc20075 tmrc20051 tmrc20050 tmrc20049 tmrc20062 tmrc20110 tmrc20080 tmrc20043 tmrc20083 tmrc20054 tmrc20085 tmrc20046 tmrc20093 tmrc20089 tmrc20047 
##     96625     94125     17200     16168     93677     15487     95775     95623     21167     93603     89765     48608     48254     90421     92637 
## tmrc20090 tmrc20044 tmrc20045 tmrc20061 tmrc20105 tmrc20108 tmrc20109 tmrc20098 tmrc20096 tmrc20097 tmrc20101 tmrc20092 tmrc20082 tmrc20102 tmrc20099 
##     91564     14861     50403    116906     86758     96831     17709     92927     17534     46863     17753     16578    109909     92380     91383 
## tmrc20100 tmrc20091 tmrc20084 tmrc20087 tmrc20103 tmrc20104 tmrc20086 tmrc20107 tmrc20081 tmrc20106 tmrc20095 
##     94381     15059     46548     14947     49091     93979     15813     95144     19533     18545     81200
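
The counts above span almost three orders of magnitude, so it may be worth flagging the samples with very few non-zero positions before clustering. This is a minimal sketch using five of the values printed above, copied by hand. The cutoff of 1000 is an arbitrary assumption for illustration, not a threshold used anywhere in this analysis.

```r
## Five of the per-sample counts from the colSums() output above.
nonzero_counts <- c(tmrc20001 = 74798, tmrc20065 = 93649, tmrc20008 = 975,
                    tmrc20027 = 351343, tmrc20012 = 456)
## Hypothetical cutoff; pick one which suits the coverage of the real data.
min_positions <- 1000
low_samples <- names(nonzero_counts)[nonzero_counts < min_positions]
low_samples
```

With the real object, the same test could presumably be applied directly to colSums(nonzero_snps).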

9.3 Combine the previous and current data

As far as I can tell, freebayes and mpileup have reasonably similar sensitivity and specificity, so combining the two datasets this way should work with minimal problems. The most likely problem is that my mpileup-based pipeline cannot handle indels.
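
Before combining, it can help to see how many variant positions the two datasets actually share. This is a minimal base R sketch with two toy matrices standing in for exprs(new_snps) and exprs(old_snps). The row and column names are invented, and this is not how combine_expts() does its own check.

```r
## Toy stand-ins for the two variant matrices (rows = positions, columns = samples).
new_mtrx <- matrix(1, nrow = 4, ncol = 2,
                   dimnames = list(c("pos_106", "pos_1019", "pos_1023", "pos_1092"),
                                   c("s1", "s2")))
old_mtrx <- matrix(1, nrow = 3, ncol = 1,
                   dimnames = list(c("pos_106", "pos_1092", "pos_1138"), "s3"))

## Positions present in both, and those unique to each dataset.
shared <- intersect(rownames(new_mtrx), rownames(old_mtrx))
only_new <- setdiff(rownames(new_mtrx), rownames(old_mtrx))
only_old <- setdiff(rownames(old_mtrx), rownames(new_mtrx))
c(shared = length(shared), only_new = length(only_new), only_old = length(only_old))
```

A small shared fraction would suggest that the two callers, or the two ID schemes, disagree more than expected.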

## My old_snps is using an older annotation incorrectly, so fix it here:
Biobase::annotation(old_snps$expressionset) <- Biobase::annotation(new_snps$expressionset)
both_snps <- combine_expts(new_snps, old_snps)
## Warning in combine_expts(new_snps, old_snps): There are many gene IDs which are not shared among the two datasets, this may fail.
## There are many gene IDs which are not shared among the two datasets.
## Here are some from the first: chr_LPAL13-SCAF000001_pos_1019_ref_G_alt_A, chr_LPAL13-SCAF000001_pos_1023_ref_C_alt_G, chr_LPAL13-SCAF000001_pos_1038_ref_C_alt_T, chr_LPAL13-SCAF000001_pos_1066_ref_T_alt_C, chr_LPAL13-SCAF000001_pos_106_ref_A_alt_G, chr_LPAL13-SCAF000001_pos_1092_ref_A_alt_G
## Here are some from the second: chr_LPAL13-SCAF000001_pos_1019_ref_G_alt_A, chr_LPAL13-SCAF000001_pos_106_ref_A_alt_G, chr_LPAL13-SCAF000001_pos_1092_ref_A_alt_G, chr_LPAL13-SCAF000001_pos_1138_ref_C_alt_A, chr_LPAL13-SCAF000001_pos_1149_ref_G_alt_A, chr_LPAL13-SCAF000001_pos_1183_ref_C_alt_T
## 
##    z2.3    z2.2 unknown    z1.0   b2904    z3.0    z2.0    z1.5    z2.1    z2.4    z3.2   undef      sh     chr     inf 
##      41      43       2       1       1       1       1       1       7       2       1       0      13      14       6
## Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'exprs' for signature '"function"'
save(list = "both_snps",
     file = glue("rda/both_snps-v{ver}.rda"))
## Error in save(list = "both_snps", file = glue("rda/both_snps-v{ver}.rda")): object 'both_snps' not found
data_structures <- c(data_structures, "both_snps")
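
Because combine_expts() failed above, both_snps was never created, yet its name is still appended to data_structures. Any later save(list = data_structures, ...) will therefore stop on the missing object. Here is a minimal base R sketch of one way to guard against that, using only exists(), Filter() and save(). The object names and the temporary file are hypothetical stand-ins, not the real data structures.

```r
## Hypothetical stand-ins: 'new_snps_demo' exists, 'both_snps_demo' was never created.
new_snps_demo <- 1
wanted <- c("new_snps_demo", "both_snps_demo")

## Keep only the names which resolve to an object visible from here.
env <- environment()
present <- Filter(function(nm) exists(nm, envir = env), wanted)
absent <- setdiff(wanted, present)
if (length(absent) > 0) {
  message("Skipping objects which do not exist: ", toString(absent))
}

## Save whatever is available instead of failing on the first missing name.
out_file <- tempfile(fileext = ".rda")
save(list = present, file = out_file)
```

The message keeps the failure visible, so a missing dataset is still noticed rather than silently dropped.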

10 Subclade manual interpretation

I am taking a heatmap from our variant data and manually identifying sample groups.

  • A: TMRC20025, TMRC20027, TMRC20028
  • B: hpgl0641, hpgl0247, hpgl0631, hpgl0658, close to A
  • C: TMRC20008, TMRC20007, TMRC20001, TMRC20005, hpgl0318, TMRC20012
  • D: hpgl0643, hpgl0316, hpgl0320, hpgl0641, close to C
  • E: TMRC20032, TMRC20061
  • F: TMRC20040, TMRC20036, hpgl0245, TMRC20103, TMRC20093, TMRC20045, TMRC20041, TMRC20072, TMRC20046, TMRC20057, TMRC20097, TMRC20084, close to E
  • G: hpgl0632, hpgl0652, hpgl0248, hpgl0659
  • H: hpgl0654, hpgl0634, hpgl0243, hpgl0243, closest to G
  • I: hpgl0242, hpgl0322, hpgl0636, hpgl0663, hpgl0638, close to H
  • J: TMRC20017, TMRC20033, TMRC20053, TMRC20063, TMRC20056, TMRC20074, TMRC20055, TMRC20022, TMRC20026, TMRC20083, TMRC20077, TMRC20060
  • K: TMRC20050, TMRC20042, TMRC20078, TMRC20049, TMRC20069, TMRC20044, close to J
  • L: TMRC20076, TMRC20024, TMRC20009
  • M: TMRC20019, TMRC20020, TMRC20031, TMRC20014, TMRC20011, close to L
  • N: TMRC20096, TMRC20081, TMRC20110, TMRC20092, TMRC20088, TMRC20101, TMRC20106, TMRC20091, TMRC20109, TMRC20087, TMRC20086, closeish to M
  • O: TMRC20095, TMRC20016, TMRC20018, quite far from everyone
  • P: TMRC20082, TMRC20075, pretty separate too
  • Q: hpgl0246, hpgl0653, hpgl0633, hpgl0244, hpgl0635, hpgl0655, hpgl0639, hpgl0662
  • R: TMRC20059, TMRC20089, TMRC20021, TMRC20048, TMRC20067
  • S: TMRC20013, TMRC20010, TMRC20037, TMRC20066, TMRC20062, TMRC20038, close to R
  • T: TMRC20015, TMRC20108, TMRC20099, TMRC20102, TMRC20085, TMRC20090, TMRC20104, TMRC20098, TMRC20100, TMRC20107
  • U: TMRC20047, TMRC20068, TMRC20080, TMRC20105, TMRC20094, TMRC20065, TMRC20071, TMRC20064, TMRC20043, TMRC20070, TMRC20062, TMRC20051, TMRC20079, TMRC20073, TMRC20058, TMRC20054
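
Since these groups were assigned by hand, a quick check for samples which appear more than once may be useful. This is a minimal base R sketch using three of the groups exactly as listed above (B, D and H); the variable names are my own.

```r
## Groups B, D and H, transcribed from the list above.
groups <- list(
  B = c("hpgl0641", "hpgl0247", "hpgl0631", "hpgl0658"),
  D = c("hpgl0643", "hpgl0316", "hpgl0320", "hpgl0641"),
  H = c("hpgl0654", "hpgl0634", "hpgl0243", "hpgl0243"))

## Flatten to one sample vector with a parallel vector of group labels.
samples <- unlist(groups, use.names = FALSE)
group_of <- rep(names(groups), lengths(groups))

## Any sample seen more than once, along with the groups it was listed under.
dup_samples <- unique(samples[duplicated(samples)])
dup_table <- data.frame(sample = samples, group = group_of)[samples %in% dup_samples, ]
dup_table
```

On these three groups it reports hpgl0641 (listed under both B and D) and hpgl0243 (listed twice under H).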

11 Macrophage data

All of the above focused entirely on the parasite samples; now let us pull up the infected macrophage samples. These will comprise two datasets, one for the human host and one for the parasite.

11.1 Macrophage host data

The metadata for the macrophage samples contains a couple of columns for mapped human and parasite reads. We will therefore use them separately to create two expressionsets, one for each species.

hs_annot <- load_biomart_annotations(year = "2020")
## The biomart annotations file already exists, loading from it.
hs_annot <- hs_annot[["annotation"]]
hs_annot[["transcript"]] <- paste0(rownames(hs_annot), ".", hs_annot[["transcript_version"]])
rownames(hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique = TRUE)
rownames(hs_annot) <- paste0("gene:", rownames(hs_annot))
tx_gene_map <- hs_annot[, c("transcript", "ensembl_gene_id")]

sanitize_columns <- c("drug", "macrophagetreatment", "macrophagezymodeme")
macr_annot <- hs_annot
rownames(macr_annot) <- gsub(x = rownames(macr_annot),
                             pattern = "^gene:",
                             replacement = "")
hs_macrophage <- create_expt(
    macrophage_sheet,
    gene_info = hs_annot,
    file_column = "hg38100hisatfile") %>%
  set_expt_conditions(fact = "macrophagetreatment") %>%
  set_expt_batches(fact = "macrophagezymodeme") %>%
  sanitize_expt_metadata(columns = sanitize_columns) %>%
  subset_expt(nonzero = 12000)
## Reading the sample metadata.
## Dropped 1 rows from the sample metadata because the sample ID is blank.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## The sample definitions comprises: 69 rows(samples) and 80 columns(metadata fields).
## Warning in create_expt(macrophage_sheet, gene_info = hs_annot, file_column = "hg38100hisatfile"): Even after changing the rownames in gene info, they do not
## match the count table.
## Even after changing the rownames in gene info, they do not match the count table.
## Here are the first few rownames from the count tables:
## ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938
## Here are the first few rownames from the gene information table:
## gene:ENSG00000004059, gene:ENSG00000003056, gene:ENSG00000173153, gene:ENSG00000004478, gene:ENSG00000003137, gene:ENSG00000003509
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 21481 features and 69 samples.
## 
##      inf   inf_sb    uninf uninf_sb 
##       30       29        5        5 
## 
## none z2.2 z2.3 
##   10   30   29
## The samples (and read coverage) removed when filtering 12000 non-zero genes are:
## subset_expt(): There were 69, now there are 68 samples.
fixed_genenames <- gsub(x = rownames(exprs(hs_macrophage)), pattern = "^gene:",
                        replacement = "")
hs_macrophage <- set_expt_genenames(hs_macrophage, ids = fixed_genenames)
table(pData(hs_macrophage)$condition)
## 
##      inf   inf_sb    uninf uninf_sb 
##       29       29        5        5
## The following 3 lines were copy/pasted to datastructures and should be removed soon.
nostrain <- is.na(pData(hs_macrophage)[["strainid"]])
pData(hs_macrophage)[nostrain, "strainid"] <- "none"

pData(hs_macrophage)[["strain_zymo"]] <- paste0("s", pData(hs_macrophage)[["strainid"]],
                                                "_", pData(hs_macrophage)[["macrophagezymodeme"]])
uninfected <- pData(hs_macrophage)[["strain_zymo"]] == "snone_none"
pData(hs_macrophage)[uninfected, "strain_zymo"] <- "uninfected"

data_structures <- c(data_structures, "hs_macrophage")

Finally, split off the U937 samples.

hs_u937 <- subset_expt(hs_macrophage, subset = "typeofcells!='Macrophages'")
## subset_expt(): There were 68, now there are 14 samples.
data_structures <- c(data_structures, "hs_u937")

11.2 Macrophage parasite data

In the previous block we used a new invocation of Ensembl-derived annotation data; this time we can just use our existing parasite gene annotations.

lp_macrophage <- create_expt(macrophage_sheet,
                             file_column = "lpanamensisv36hisatfile",
                             gene_info = all_lp_annot,
                             savefile = glue("rda/lp_macrophage-v{ver}.rda"),
                             annotation = "org.Lpanamensis.MHOMCOL81L13.v46.eg.db") %>%
  set_expt_conditions(fact = "macrophagezymodeme") %>%
  set_expt_batches(fact = "macrophagetreatment")
## Reading the sample metadata.
## Dropped 1 rows from the sample metadata because the sample ID is blank.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## The sample definitions comprises: 69 rows(samples) and 80 columns(metadata fields).
## Warning in create_expt(macrophage_sheet, file_column = "lpanamensisv36hisatfile", : Some samples were removed when cross referencing the samples against the
## count data.
## Matched 8778 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final expressionset has 8778 features and 66 samples.
## 
## none z2.2 z2.3 
##    8   29   29 
## 
##      inf   inf_sb    uninf uninf_sb 
##       29       29        4        4
unfilt_written <- write_expt(
  lp_macrophage,
  excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_reads_unfiltered-v{ver}.xlsx"))
## Writing the first sheet, containing a legend and some summary data.
## The following samples have less than 5705.7 genes.
##  [1] "TMRC30066" "TMRC30117" "TMRC30244" "TMRC30246" "TMRC30249" "TMRC30266" "TMRC30268" "TMRC30326" "TMRC30323" "TMRC30319" "TMRC30325" "TMRC30327"
## [13] "TMRC30312" "TMRC30300" "TMRC30304" "TMRC30302" "TMRC30313" "TMRC30309" "TMRC30292" "TMRC30331" "TMRC30332" "TMRC30330"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Not putting labels on the plot.
## 
## 175550 entries are 0.  We are on a log scale, adding 1 to the data.
## 
## Changed 175550 zero count features.
## 
## Naively calculating coefficient of variation/dispersion with respect to condition.
## 
## Finished calculating dispersion estimates.
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure
## `geom_smooth()` using formula = 'y ~ x'
## Error in checkModelStatus(fit, showWarnings = showWarnings, colinearityCutoff = colinearityCutoff,  : 
##   The variables specified in this model are redundant,
## so the design matrix is not full rank
## Retrying with only condition in the model.
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## 
## 
## Total:68 s
## Error in density.default(x, adjust = adj) : 'x' contains missing values
## Error in density.default(x, adjust = adj) : 'x' contains missing values
## `geom_smooth()` using formula = 'y ~ x'
## Error in checkModelStatus(fit, showWarnings = showWarnings, colinearityCutoff = colinearityCutoff,  : 
##   The variables specified in this model are redundant,
## so the design matrix is not full rank
## Retrying with only condition in the model.
## 
## Total:65 s
## Error : rownames(fData(object)) not equal to rownames(value).
## Lengths differ: 8778 is not 8691
lp_macrophage_filt <- subset_expt(lp_macrophage, nonzero = 2500) %>%
  semantic_expt_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
                       semantic_column = "annot_gene_product")
## The samples (and read coverage) removed when filtering 2500 non-zero genes are: 
## subset_expt(): There were 66, now there are 50 samples.
## semantic_expt_filter(): Removed 68 genes.
data_structures <- c(data_structures, "lp_macrophage", "lp_macrophage_filt")
## Pass the filtered expt explicitly; without it write_expt() fails because 'expt' is missing.
filt_written <- write_expt(
  lp_macrophage_filt,
  excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_reads_filtered-v{ver}.xlsx"))
lp_macrophage <- lp_macrophage_filt


lp_macrophage_nosb <- subset_expt(lp_macrophage, subset = "batch!='inf_sb'")
## subset_expt(): There were 50, now there are 29 samples.
lp_nosb_write <- write_expt(
  lp_macrophage_nosb,
  excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_nosb_reads-v{ver}.xlsx"))
## Writing the first sheet, containing a legend and some summary data.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Not putting labels on the plot.
## 6396 entries are 0.  We are on a log scale, adding 1 to the data.
## Changed 6396 zero count features.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure
## `geom_smooth()` using formula = 'y ~ x'
## varpart sees only 1 batch, adjusting the model accordingly.
## Error in .fitExtractVarPartModel(exprObj, formula, data, REML = REML,  : 
##   Variable in formula not found in data: condition
## Retrying with only condition in the model.
## 
## Total:59 s
## `geom_smooth()` using formula = 'y ~ x'
## varpart sees only 1 batch, adjusting the model accordingly.
## Error in .fitExtractVarPartModel(exprObj, formula, data, REML = REML,  : 
##   Variable in formula not found in data: condition
## Retrying with only condition in the model.
## 
## Total:53 s
data_structures <- c(data_structures, "lp_macrophage_nosb")

spec <- make_rnaseq_spec()
test <- gather_preprocessing_metadata(macrophage_sheet, specification = spec)
## Using provided specification
## Dropped 1 rows from the sample metadata because the sample ID is blank.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Starting trimomatic_input.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*trimomatic/*-trimomatic.stderr.
## Example filename: preprocessing/TMRC30051/outputs/*trimomatic/*-trimomatic.stderr.
## Starting trimomatic_output.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*trimomatic/*-trimomatic.stderr.
## Example filename: preprocessing/TMRC30051/outputs/*trimomatic/*-trimomatic.stderr.
## Starting trimomatic_ratio.
## Checking input_file_spec: .
## The numerator column is: trimomatic_output.
## The denominator column is: trimomatic_input.
## Starting fastqc_pct_gc.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*fastqc/*_fastqc/fastqc_data.txt.
## Example filename: preprocessing/TMRC30051/outputs/*fastqc/*_fastqc/fastqc_data.txt.
## Starting fastqc_most_overrepresented.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*fastqc/*_fastqc/fastqc_data.txt.
## Skipping for now
## Not including new entries for: fastqc_most_overrepresented.
## Starting hisat_rrna_single_concordant.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat2_{species}/hisat2_*rRNA*.stderr.
## Example filename: preprocessing/TMRC30051/outputs/*hisat2_*/hisat2_*rRNA*.stderr.
## Starting hisat_rrna_multi_concordant.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat2_{species}/hisat2_*rRNA*.stderr.
## Example filename: preprocessing/TMRC30051/outputs/*hisat2_*/hisat2_*rRNA*.stderr.
## Starting hisat_rrna_percent.
## Checking input_file_spec: .
## The numerator column is: hisat_rrna_multi_concordant.
## The denominator column is: trimomatic_output.
## Starting hisat_genome_single_concordant.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat2_{species}/hisat2_*genome*.stderr.
## Example filename: preprocessing/TMRC30051/outputs/*hisat2_*/hisat2_*genome*.stderr.
## Starting hisat_genome_multi_concordant.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat2_{species}/hisat2_*genome*.stderr.
## Example filename: preprocessing/TMRC30051/outputs/*hisat2_*/hisat2_*genome*.stderr.
## Starting hisat_genome_single_all.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat2_{species}/hisat2_*genome*.stderr.
## Example filename: preprocessing/TMRC30051/outputs/*hisat2_*/hisat2_*genome*.stderr.
## Starting hisat_genome_multi_all.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat2_{species}/hisat2_*genome*.stderr.
## Example filename: preprocessing/TMRC30051/outputs/*hisat2_*/hisat2_*genome*.stderr.
## Starting hisat_genome_percent.
## Checking input_file_spec: .
## The numerator column is: hisat_genome_single_concordant.
## The denominator column is: trimomatic_output.
## Starting hisat_count_table.
## Checking input_file_spec: {basedir}/{meta[['sampleid']]}/outputs/*hisat2_{species}/{species}_{type}*.count.xz.
## Example filename: preprocessing/TMRC30051/outputs/*hisat2_*/*_genome*.count.xz.
## Writing new metadata to: sample_sheets/tmrc2_macrophage_samples_202304_modified_modified.xlsx

12 Plot SL Reads on a per condition basis

lp_meta <- pData(lp_macrophage)
lp_meta[["slvsreads_log"]] <- log10(lp_meta[["slvsreads"]])
inf_values <- is.infinite(lp_meta[["slvsreads_log"]])
lp_meta[inf_values, "slvsreads_log"] <- -10

color_vector <- as.character(color_choices[["strain"]])
names(color_vector) <- names(color_choices[["strain"]])
color_vector <- color_vector[c("z2.2", "z2.3", "unknown")]
names(color_vector) <- c("z2.2", "z2.3", "none")
## ggplot2 is loaded but not attached in this session, so call its functions with the ggplot2:: prefix.
sl_violin <- ggplot2::ggplot(lp_meta,
                             ggplot2::aes(x = .data[["condition"]], y = .data[["slvsreads_log"]],
                                          fill = .data[["condition"]])) +
  ggplot2::geom_violin() +
  ggplot2::geom_point() +
  ggplot2::scale_fill_manual(values = color_vector)
sl_violin
ggstatsplot::ggbetweenstats(lp_meta, x = "condition", y = "slvsreads_log")
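
The log10 transform at the top of this section turns any zero into -Inf, which is why the infinite entries are reset to -10 before plotting. Here is a minimal base R sketch of that step on a made-up vector standing in for the slvsreads column. The values are invented; the floor of -10 is the one used above.

```r
## Invented values; the zero stands for a sample with no SL-containing reads.
sl_fraction <- c(0.02, 0, 0.0005)
sl_log <- log10(sl_fraction)  ## log10(0) is -Inf
## Replace the infinite entries with the same floor value used above.
sl_log[is.infinite(sl_log)] <- -10
sl_log
```

A floor that far below the observed values keeps the zero samples on the plot, but it will also stretch the y-axis and pull down any group mean or median that includes them.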

13 Save all data structures into one rda

save(list = data_structures, file = glue("rda/tmrc2_data_structures-v{ver}.rda"))
## Error in save(list = data_structures, file = glue("rda/tmrc2_data_structures-v{ver}.rda")): object 'both_snps' not found
pander::pander(sessionInfo())

R version 4.2.0 (2022-04-22)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: splines, stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: ruv(v.0.9.7.1), lme4(v.1.1-33), Matrix(v.1.5-4), BiocParallel(v.1.32.6), variancePartition(v.1.28.9), BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53(v.2021.07), BSgenome(v.1.66.3), rtracklayer(v.1.58.0), Biostrings(v.2.66.0), XVector(v.0.38.0), futile.logger(v.1.4.3), org.Lmajor.Friedlin.v49.eg.db(v.2020.11), org.Lpanamensis.MHOMCOL81L13.v46.eg.db(v.2020.07), AnnotationDbi(v.1.60.2), EuPathDB(v.1.6.0), GenomeInfoDbData(v.1.2.9), hpgltools(v.1.0), testthat(v.3.1.8), Heatplus(v.3.6.0), glue(v.1.6.2), SummarizedExperiment(v.1.28.0), GenomicRanges(v.1.50.2), GenomeInfoDb(v.1.34.9), IRanges(v.2.32.0), S4Vectors(v.0.36.2), MatrixGenerics(v.1.10.0), matrixStats(v.0.63.0), Biobase(v.2.58.0) and BiocGenerics(v.0.44.0)

loaded via a namespace (and not attached): stringdist(v.0.9.10), corpcor(v.1.6.10), ps(v.1.7.5), Rsamtools(v.2.14.0), foreach(v.1.5.2), rprojroot(v.2.0.3), crayon(v.1.5.2), rbibutils(v.2.2.13), MASS(v.7.3-60), nlme(v.3.1-162), backports(v.1.4.1), sva(v.3.46.0), GOSemSim(v.2.24.0), rlang(v.1.1.1), HDO.db(v.0.99.1), nloptr(v.2.0.3), callr(v.3.7.3), limma(v.3.54.2), filelock(v.1.0.2), rjson(v.0.2.21), bit64(v.4.0.5), pbkrtest(v.0.5.2), parallel(v.4.2.0), processx(v.3.8.1), ggstatsplot(v.0.11.1), DOSE(v.3.24.2), tidyselect(v.1.2.0), usethis(v.2.1.6), BiocCheck(v.1.34.3), XML(v.3.99-0.14), tidyr(v.1.3.0), zoo(v.1.8-12), GenomicAlignments(v.1.34.1), MatrixModels(v.0.5-1), xtable(v.1.8-4), magrittr(v.2.0.3), evaluate(v.0.21), ggplot2(v.3.4.2), Rdpack(v.2.4), cli(v.3.6.1), zlibbioc(v.1.44.0), rstudioapi(v.0.14), miniUI(v.0.1.1.1), bslib(v.0.4.2), fastmatch(v.1.1-3), aod(v.1.3.2), lambda.r(v.1.2.4), treeio(v.1.22.0), shiny(v.1.7.4), xfun(v.0.39), parameters(v.0.21.0), pkgbuild(v.1.4.0), gson(v.0.1.0), caTools(v.1.18.2), tidygraph(v.1.2.3), AnnotationHubData(v.1.28.0), KEGGREST(v.1.38.0), clusterGeneration(v.1.3.7), tibble(v.3.2.1), interactiveDisplayBase(v.1.36.0), ggrepel(v.0.9.3), ape(v.5.7-1), png(v.0.1-8), zeallot(v.0.1.0), withr(v.2.5.0), bitops(v.1.0-7), ggforce(v.0.4.1), RBGL(v.1.74.0), plyr(v.1.8.8), GSEABase(v.1.60.0), coda(v.0.19-4), pillar(v.1.9.0), biocViews(v.1.66.3), gplots(v.3.1.3), cachem(v.1.0.8), GenomicFeatures(v.1.50.4), multcomp(v.1.4-23), fs(v.1.6.2), paletteer(v.1.5.0), clusterProfiler(v.4.6.2), RUnit(v.0.4.32), vctrs(v.0.6.2), ellipsis(v.0.3.2), generics(v.0.1.3), devtools(v.2.4.5), tools(v.4.2.0), remaCor(v.0.0.11), munsell(v.0.5.0), tweenr(v.2.0.2), fgsea(v.1.24.0), emmeans(v.1.8.5), DelayedArray(v.0.24.0), fastmap(v.1.1.1), compiler(v.4.2.0), pkgload(v.1.3.2), httpuv(v.1.6.10), sessioninfo(v.1.2.2), plotly(v.4.10.1), gridExtra(v.2.3), edgeR(v.3.40.2), lattice(v.0.21-8), AnnotationForge(v.1.40.2), utf8(v.1.2.3), later(v.1.3.1), dplyr(v.1.1.2), 
prismatic(v.1.1.1), BiocFileCache(v.2.6.1), jsonlite(v.1.8.4), scales(v.1.2.1), graph(v.1.76.0), pbapply(v.1.7-0), tidytree(v.0.4.2), estimability(v.1.4.1), genefilter(v.1.80.3), lazyeval(v.0.2.2), promises(v.1.2.0.1), doParallel(v.1.0.17), effectsize(v.0.8.3), rmarkdown(v.2.21), openxlsx(v.4.2.5.2), sandwich(v.3.0-2), cowplot(v.1.1.1), Rtsne(v.0.16), pander(v.0.6.5), downloader(v.0.4), igraph(v.1.4.2), survival(v.3.5-5), yaml(v.2.3.7), htmltools(v.0.5.5), memoise(v.2.0.1), profvis(v.0.3.8), BiocIO(v.1.8.0), locfit(v.1.5-9.7), graphlayouts(v.1.0.0), quadprog(v.1.5-8), viridisLite(v.0.4.2), digest(v.0.6.31), RhpcBLASctl(v.0.23-42), mime(v.0.12), rappdirs(v.0.3.3), futile.options(v.1.0.1), bayestestR(v.0.13.1), RSQLite(v.2.3.1), yulab.utils(v.0.0.6), remotes(v.2.4.2), data.table(v.1.14.8), urlchecker(v.1.0.1), blob(v.1.2.4), labeling(v.0.4.2), rematch2(v.2.1.2), AnnotationHub(v.3.6.0), OrganismDbi(v.1.40.0), RCurl(v.1.98-1.12), broom(v.1.0.4), hms(v.1.1.3), colorspace(v.2.1-0), BiocManager(v.1.30.20), aplot(v.0.1.10), sass(v.0.4.6), Rcpp(v.1.0.10), mvtnorm(v.1.1-3), enrichplot(v.1.18.4), fansi(v.1.0.4), tzdb(v.0.3.0), brio(v.1.1.3), R6(v.2.5.1), grid(v.4.2.0), lifecycle(v.1.0.3), formatR(v.1.14), statsExpressions(v.1.5.0), BayesFactor(v.0.9.12-4.4), zip(v.2.3.0), datawizard(v.0.7.1), curl(v.5.0.0), minqa(v.1.2.5), jquerylib(v.0.1.4), fastcluster(v.1.2.3), PROPER(v.1.30.0), qvalue(v.2.30.0), TH.data(v.1.1-2), desc(v.1.4.2), RColorBrewer(v.1.1-3), iterators(v.1.0.14), stringr(v.1.5.0), directlabels(v.2021.1.13), htmlwidgets(v.1.6.2), polyclip(v.1.10-4), biomaRt(v.2.54.1), purrr(v.1.0.1), shadowtext(v.0.1.2), gridGraphics(v.0.5-1), rvest(v.1.0.3), mgcv(v.1.8-42), insight(v.0.19.1), patchwork(v.1.1.2), codetools(v.0.2-19), GO.db(v.3.16.0), gtools(v.3.9.4), prettyunits(v.1.1.1), dbplyr(v.2.3.2), correlation(v.0.8.4), gtable(v.0.3.3), DBI(v.1.1.3), ggfun(v.0.0.9), httr(v.1.4.6), highr(v.0.10), KernSmooth(v.2.23-21), stringi(v.1.7.12), vroom(v.1.6.3), progress(v.1.2.2), 
reshape2(v.1.4.4), farver(v.2.1.1), annotate(v.1.76.0), viridis(v.0.6.3), ggtree(v.3.6.2), xml2(v.1.3.4), boot(v.1.3-28.1), restfulr(v.0.0.15), readr(v.2.1.4), ggplotify(v.0.1.0), BiocVersion(v.3.16.0), bit(v.4.0.5), scatterpie(v.0.1.9), ggraph(v.2.1.0), pkgconfig(v.2.0.3) and knitr(v.1.42)

message("This is hpgltools commit: ", get_git_commit())
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 3e0de6383ee41f791dac5025f8e6790ddc8072bd
## This is hpgltools commit: Mon May 15 12:43:27 2023 -0400: 3e0de6383ee41f791dac5025f8e6790ddc8072bd
message("Saving to ", savefile)
## Saving to tmrc2_datasets_202304.rda.xz
tmp <- sm(saveme(filename = savefile))
tmp <- loadme(filename = savefile)
