Let us check out some new cruzi infections following the deletion of a specific gene.
I thought I also did the interrogation of the CLBrener transcriptome, but that appears untrue. I think I may have forgotten to copy the genome in place…
a pROCK plasmid containing CAS9 followed by GFP and GAPDH waws linearized in order to integrate the CAS9 into a specific location in the cruzi genome. Tc tubulin is flanking a NotI RE site, so I would assume the integration is at one of the tubulin loci. This plasmid has both M13 fwd and M13 rev; M13 rev is pointing toward the GAPDH and m13 forward is pointing to the bacterial origin of replication and AmpR. (This is a streptococcus CAS9)
We received an email flagging the following genes as CRISPR/Cas9 targets for the knockouts. I therefore would like to have screenshots of each of these regions to show what differences are observable between the three strains. Note that the lower coverage of the last few samples may mean that we need to stick to the first group.
expected_lower <- c("TcCLB.508173.120", "TcCLB.509495.30", "TcCLB.510055.20", "TcCLB.506961.25",
"TcCLB.510787.10", "TcCLB.511667.30", "TcCLB.507085.30",
"TcCLB.507427.10", "TcCLB.508913.25", "TcCLB.508857.30",
"TcCLB.503993.10", "TcCLB.511323.10", "TcCLB.508089.10",
"TcCLB.508717.60", "TcCLB.506975.80", "TcCLB.505931.30",
"TcCLB.507979.30", "TcCLB.509817.50", "TcCLB.506841.20")Note: I am remapping these samples with slightly different parameters which may make this more sensitive for multi gene families, but I do not think it will change anything.
I therefore opened up the freebayes output sorted by CDS and looked for nonsense mutations introduced in one ko and one AB sample.
I found 43 in the KO and 79 in the AB.
I have a pretty new genome downloaded (202509), so I will (for now) just let my annotation function grab whatever it thinks is reasonable. It chose the 202410 set. Seems good to me.
## The biomart annotations file already exists, loading from it.
tc_annot <- load_gff_annotations("~/libraries/genome/gff/tcruzi_all.gff",
type = "mRNA", id_col = "Parent")## Returning a df with 24 columns and 23305 rows.
rownames(tc_annot) <- gsub(x = make.names(tc_annot[["Name"]], unique = TRUE),
pattern = "\\.\\d+$", replacement = "")
esmer_db <- "org.Tcruzi.CL.Brener.Esmeraldo.like.v68.eg.db"
library(esmer_db, character.only = TRUE)## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following object is masked from 'package:dplyr':
##
## explain
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:hpgltools':
##
## annotation<-, conditions, conditions<-, IQR, mad, sd, var, xtabs
## The following object is masked from 'package:dplyr':
##
## combine
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, is.unsorted, lapply, Map, mapply, match, mget, order, paste,
## pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
## rownames, sapply, saveRDS, table, tapply, unique, unsplit, which.max,
## which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with 'browseVignettes()'.
## To cite Bioconductor, see 'citation("Biobase")', and for packages
## 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:hpgltools':
##
## notes
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:tidyr':
##
## expand
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:hpgltools':
##
## trim
## The following object is masked from 'package:glue':
##
## trim
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
##
esmer_db <- get0(esmer_db)
all_keytypes <- keytypes(esmer_db)
wanted_idx <- grepl(x = all_keytypes, pattern = "^ANNOT_")
wanted_fields <- all_keytypes[wanted_idx]
nonesmer_db <- "org.Tcruzi.CL.Brener.Non.Esmeraldo.like.v68.eg.db"
unas_db <- "org.Tcruzi.CL.Brener.v68.eg.db"
tc_esmer <- load_orgdb_annotations(esmer_db, keytype = "gid", fields = wanted_fields)## Unable to find CDSNAME, setting it to ANNOT_EXTERNAL_DB_NAME.
## Unable to find CDSCHROM in the db, removing it.
## Unable to find CDSSTRAND in the db, removing it.
## Unable to find CDSSTART in the db, removing it.
## Unable to find CDSEND in the db, removing it.
## Extracted all gene ids.
## Attempting to select: ANNOT_EXTERNAL_DB_NAME, GENE_TYPE, ANNOT_AA_SEQUENCE_ID, ANNOT_ANNOTATED_GO_COMPONENT, ANNOT_ANNOTATED_GO_FUNCTION, ANNOT_ANNOTATED_GO_ID_COMPONENT, ANNOT_ANNOTATED_GO_ID_FUNCTION, ANNOT_ANNOTATED_GO_ID_PROCESS, ANNOT_ANNOTATED_GO_PROCESS, ANNOT_ANTICODON, ANNOT_APOLLO_LINK_OUT, ANNOT_APOLLO_TRANSCRIPT_DESCRIPTION, ANNOT_CDS, ANNOT_CDS_LENGTH, ANNOT_CHROMOSOME, ANNOT_CODING_END, ANNOT_CODING_START, ANNOT_EC_NUMBERS, ANNOT_EC_NUMBERS_DERIVED, ANNOT_END_MAX, ANNOT_EXON_COUNT, ANNOT_EXTERNAL_DB_NAME, ANNOT_EXTERNAL_DB_VERSION, ANNOT_FIVE_PRIME_UTR_LENGTH, ANNOT_GENE_CONTEXT_END, ANNOT_GENE_CONTEXT_START, ANNOT_GENE_END_MAX, ANNOT_GENE_END_MAX_TEXT, ANNOT_GENE_ENTREZ_ID, ANNOT_GENE_ENTREZ_LINK_DISPLAYTEXT, ANNOT_GENE_ENTREZ_LINK_URL, ANNOT_GENE_EXON_COUNT, ANNOT_GENE_HTS_NONCODING_SNPS, ANNOT_GENE_HTS_NONSYN_SYN_RATIO, ANNOT_GENE_HTS_NONSYNONYMOUS_SNPS, ANNOT_GENE_HTS_STOP_CODON_SNPS, ANNOT_GENE_HTS_SYNONYMOUS_SNPS, ANNOT_GENE_LOCATION_TEXT, ANNOT_GENE_NAME, ANNOT_GENE_ORTHOLOG_NUMBER, ANNOT_GENE_ORTHOMCL_NAME, ANNOT_GENE_PARALOG_NUMBER, ANNOT_GENE_PREVIOUS_IDS, ANNOT_GENE_PRODUCT, ANNOT_GENE_START_MIN, ANNOT_GENE_START_MIN_TEXT, ANNOT_GENE_TOTAL_HTS_SNPS, ANNOT_GENE_TRANSCRIPT_COUNT, ANNOT_GENE_TYPE, ANNOT_GENOMIC_SEQUENCE_LENGTH, ANNOT_GENUS_SPECIES, ANNOT_HAS_MISSING_TRANSCRIPTS, ANNOT_INTERPRO_DESCRIPTION, ANNOT_INTERPRO_ID, ANNOT_IS_DEPRECATED, ANNOT_IS_PSEUDO, ANNOT_ISOELECTRIC_POINT, ANNOT_LOCATION_TEXT, ANNOT_MAP_LOCATION, ANNOT_MCMC_LOCATION, ANNOT_MOLECULAR_WEIGHT, ANNOT_NCBI_TAX_ID, ANNOT_ORTHOMCL_LINK, ANNOT_OVERVIEW, ANNOT_PFAM_DESCRIPTION, ANNOT_PFAM_ID, ANNOT_PIRSF_DESCRIPTION, ANNOT_PIRSF_ID, ANNOT_PREDICTED_GO_COMPONENT, ANNOT_PREDICTED_GO_FUNCTION, ANNOT_PREDICTED_GO_ID_COMPONENT, ANNOT_PREDICTED_GO_ID_FUNCTION, ANNOT_PREDICTED_GO_ID_PROCESS, ANNOT_PREDICTED_GO_PROCESS, ANNOT_PRIMARY_KEY, ANNOT_PROB_MAP, ANNOT_PROB_MCMC, ANNOT_PROSITEPROFILES_DESCRIPTION, ANNOT_PROSITEPROFILES_ID, ANNOT_PROTEIN_LENGTH, ANNOT_PROTEIN_SEQUENCE, ANNOT_PROTEIN_SOURCE_ID, ANNOT_PSEUDO_STRING, ANNOT_SEQUENCE_DATABASE_NAME, ANNOT_SEQUENCE_ID, ANNOT_SIGNALP_PEPTIDE, ANNOT_SMART_DESCRIPTION, ANNOT_SMART_ID, ANNOT_SNPOVERVIEW, ANNOT_SO_ID, ANNOT_SO_TERM_DEFINITION, ANNOT_SO_TERM_NAME, ANNOT_SO_VERSION, ANNOT_START_MIN, ANNOT_STRAND, ANNOT_STRAND_PLUS_MINUS, ANNOT_SUPERFAMILY_DESCRIPTION, ANNOT_SUPERFAMILY_ID, ANNOT_THREE_PRIME_UTR_LENGTH, ANNOT_TIGRFAM_DESCRIPTION, ANNOT_TIGRFAM_ID, ANNOT_TM_COUNT, ANNOT_TRANS_FOUND_PER_GENE_INTERNAL, ANNOT_TRANSCRIPT_INDEX_PER_GENE, ANNOT_TRANSCRIPT_LENGTH, ANNOT_TRANSCRIPT_LINK, ANNOT_TRANSCRIPT_PRODUCT, ANNOT_TRANSCRIPT_SEQUENCE, ANNOT_TRANSCRIPTS_FOUND_PER_GENE, ANNOT_UNIPROT_IDS, ANNOT_UNIPROT_LINKS
## 'select()' returned 1:1 mapping between keys and columns
##
## Unable to find CDSNAME, setting it to ANNOT_EXTERNAL_DB_NAME.
## Unable to find CDSCHROM in the db, removing it.
## Unable to find CDSSTRAND in the db, removing it.
## Unable to find CDSSTART in the db, removing it.
## Unable to find CDSEND in the db, removing it.
## Extracted all gene ids.
## Attempting to select: ANNOT_EXTERNAL_DB_NAME, GENE_TYPE, ANNOT_AA_SEQUENCE_ID, ANNOT_ANNOTATED_GO_COMPONENT, ANNOT_ANNOTATED_GO_FUNCTION, ANNOT_ANNOTATED_GO_ID_COMPONENT, ANNOT_ANNOTATED_GO_ID_FUNCTION, ANNOT_ANNOTATED_GO_ID_PROCESS, ANNOT_ANNOTATED_GO_PROCESS, ANNOT_ANTICODON, ANNOT_APOLLO_LINK_OUT, ANNOT_APOLLO_TRANSCRIPT_DESCRIPTION, ANNOT_CDS, ANNOT_CDS_LENGTH, ANNOT_CHROMOSOME, ANNOT_CODING_END, ANNOT_CODING_START, ANNOT_EC_NUMBERS, ANNOT_EC_NUMBERS_DERIVED, ANNOT_END_MAX, ANNOT_EXON_COUNT, ANNOT_EXTERNAL_DB_NAME, ANNOT_EXTERNAL_DB_VERSION, ANNOT_FIVE_PRIME_UTR_LENGTH, ANNOT_GENE_CONTEXT_END, ANNOT_GENE_CONTEXT_START, ANNOT_GENE_END_MAX, ANNOT_GENE_END_MAX_TEXT, ANNOT_GENE_ENTREZ_ID, ANNOT_GENE_ENTREZ_LINK_DISPLAYTEXT, ANNOT_GENE_ENTREZ_LINK_URL, ANNOT_GENE_EXON_COUNT, ANNOT_GENE_HTS_NONCODING_SNPS, ANNOT_GENE_HTS_NONSYN_SYN_RATIO, ANNOT_GENE_HTS_NONSYNONYMOUS_SNPS, ANNOT_GENE_HTS_STOP_CODON_SNPS, ANNOT_GENE_HTS_SYNONYMOUS_SNPS, ANNOT_GENE_LOCATION_TEXT, ANNOT_GENE_NAME, ANNOT_GENE_ORTHOLOG_NUMBER, ANNOT_GENE_ORTHOMCL_NAME, ANNOT_GENE_PARALOG_NUMBER, ANNOT_GENE_PREVIOUS_IDS, ANNOT_GENE_PRODUCT, ANNOT_GENE_START_MIN, ANNOT_GENE_START_MIN_TEXT, ANNOT_GENE_TOTAL_HTS_SNPS, ANNOT_GENE_TRANSCRIPT_COUNT, ANNOT_GENE_TYPE, ANNOT_GENOMIC_SEQUENCE_LENGTH, ANNOT_GENUS_SPECIES, ANNOT_HAS_MISSING_TRANSCRIPTS, ANNOT_INTERPRO_DESCRIPTION, ANNOT_INTERPRO_ID, ANNOT_IS_DEPRECATED, ANNOT_IS_PSEUDO, ANNOT_ISOELECTRIC_POINT, ANNOT_LOCATION_TEXT, ANNOT_MAP_LOCATION, ANNOT_MCMC_LOCATION, ANNOT_MOLECULAR_WEIGHT, ANNOT_NCBI_TAX_ID, ANNOT_ORTHOMCL_LINK, ANNOT_OVERVIEW, ANNOT_PFAM_DESCRIPTION, ANNOT_PFAM_ID, ANNOT_PIRSF_DESCRIPTION, ANNOT_PIRSF_ID, ANNOT_PREDICTED_GO_COMPONENT, ANNOT_PREDICTED_GO_FUNCTION, ANNOT_PREDICTED_GO_ID_COMPONENT, ANNOT_PREDICTED_GO_ID_FUNCTION, ANNOT_PREDICTED_GO_ID_PROCESS, ANNOT_PREDICTED_GO_PROCESS, ANNOT_PRIMARY_KEY, ANNOT_PROB_MAP, ANNOT_PROB_MCMC, ANNOT_PROSITEPROFILES_DESCRIPTION, ANNOT_PROSITEPROFILES_ID, ANNOT_PROTEIN_LENGTH, ANNOT_PROTEIN_SEQUENCE, ANNOT_PROTEIN_SOURCE_ID, ANNOT_PSEUDO_STRING, ANNOT_SEQUENCE_DATABASE_NAME, ANNOT_SEQUENCE_ID, ANNOT_SIGNALP_PEPTIDE, ANNOT_SMART_DESCRIPTION, ANNOT_SMART_ID, ANNOT_SNPOVERVIEW, ANNOT_SO_ID, ANNOT_SO_TERM_DEFINITION, ANNOT_SO_TERM_NAME, ANNOT_SO_VERSION, ANNOT_START_MIN, ANNOT_STRAND, ANNOT_STRAND_PLUS_MINUS, ANNOT_SUPERFAMILY_DESCRIPTION, ANNOT_SUPERFAMILY_ID, ANNOT_THREE_PRIME_UTR_LENGTH, ANNOT_TIGRFAM_DESCRIPTION, ANNOT_TIGRFAM_ID, ANNOT_TM_COUNT, ANNOT_TRANS_FOUND_PER_GENE_INTERNAL, ANNOT_TRANSCRIPT_INDEX_PER_GENE, ANNOT_TRANSCRIPT_LENGTH, ANNOT_TRANSCRIPT_LINK, ANNOT_TRANSCRIPT_PRODUCT, ANNOT_TRANSCRIPT_SEQUENCE, ANNOT_TRANSCRIPTS_FOUND_PER_GENE, ANNOT_UNIPROT_IDS, ANNOT_UNIPROT_LINKS
## 'select()' returned 1:1 mapping between keys and columns
##
## Unable to find CDSNAME, setting it to ANNOT_EXTERNAL_DB_NAME.
## Unable to find CDSCHROM in the db, removing it.
## Unable to find CDSSTRAND in the db, removing it.
## Unable to find CDSSTART in the db, removing it.
## Unable to find CDSEND in the db, removing it.
## Extracted all gene ids.
## Attempting to select: ANNOT_EXTERNAL_DB_NAME, GENE_TYPE, ANNOT_AA_SEQUENCE_ID, ANNOT_ANNOTATED_GO_COMPONENT, ANNOT_ANNOTATED_GO_FUNCTION, ANNOT_ANNOTATED_GO_ID_COMPONENT, ANNOT_ANNOTATED_GO_ID_FUNCTION, ANNOT_ANNOTATED_GO_ID_PROCESS, ANNOT_ANNOTATED_GO_PROCESS, ANNOT_ANTICODON, ANNOT_APOLLO_LINK_OUT, ANNOT_APOLLO_TRANSCRIPT_DESCRIPTION, ANNOT_CDS, ANNOT_CDS_LENGTH, ANNOT_CHROMOSOME, ANNOT_CODING_END, ANNOT_CODING_START, ANNOT_EC_NUMBERS, ANNOT_EC_NUMBERS_DERIVED, ANNOT_END_MAX, ANNOT_EXON_COUNT, ANNOT_EXTERNAL_DB_NAME, ANNOT_EXTERNAL_DB_VERSION, ANNOT_FIVE_PRIME_UTR_LENGTH, ANNOT_GENE_CONTEXT_END, ANNOT_GENE_CONTEXT_START, ANNOT_GENE_END_MAX, ANNOT_GENE_END_MAX_TEXT, ANNOT_GENE_ENTREZ_ID, ANNOT_GENE_ENTREZ_LINK_DISPLAYTEXT, ANNOT_GENE_ENTREZ_LINK_URL, ANNOT_GENE_EXON_COUNT, ANNOT_GENE_HTS_NONCODING_SNPS, ANNOT_GENE_HTS_NONSYN_SYN_RATIO, ANNOT_GENE_HTS_NONSYNONYMOUS_SNPS, ANNOT_GENE_HTS_STOP_CODON_SNPS, ANNOT_GENE_HTS_SYNONYMOUS_SNPS, ANNOT_GENE_LOCATION_TEXT, ANNOT_GENE_NAME, ANNOT_GENE_ORTHOLOG_NUMBER, ANNOT_GENE_ORTHOMCL_NAME, ANNOT_GENE_PARALOG_NUMBER, ANNOT_GENE_PREVIOUS_IDS, ANNOT_GENE_PRODUCT, ANNOT_GENE_START_MIN, ANNOT_GENE_START_MIN_TEXT, ANNOT_GENE_TOTAL_HTS_SNPS, ANNOT_GENE_TRANSCRIPT_COUNT, ANNOT_GENE_TYPE, ANNOT_GENOMIC_SEQUENCE_LENGTH, ANNOT_GENUS_SPECIES, ANNOT_HAS_MISSING_TRANSCRIPTS, ANNOT_INTERPRO_DESCRIPTION, ANNOT_INTERPRO_ID, ANNOT_IS_DEPRECATED, ANNOT_IS_PSEUDO, ANNOT_ISOELECTRIC_POINT, ANNOT_LOCATION_TEXT, ANNOT_MAP_LOCATION, ANNOT_MCMC_LOCATION, ANNOT_MOLECULAR_WEIGHT, ANNOT_NCBI_TAX_ID, ANNOT_ORTHOMCL_LINK, ANNOT_OVERVIEW, ANNOT_PFAM_DESCRIPTION, ANNOT_PFAM_ID, ANNOT_PIRSF_DESCRIPTION, ANNOT_PIRSF_ID, ANNOT_PREDICTED_GO_COMPONENT, ANNOT_PREDICTED_GO_FUNCTION, ANNOT_PREDICTED_GO_ID_COMPONENT, ANNOT_PREDICTED_GO_ID_FUNCTION, ANNOT_PREDICTED_GO_ID_PROCESS, ANNOT_PREDICTED_GO_PROCESS, ANNOT_PRIMARY_KEY, ANNOT_PROB_MAP, ANNOT_PROB_MCMC, ANNOT_PROSITEPROFILES_DESCRIPTION, ANNOT_PROSITEPROFILES_ID, ANNOT_PROTEIN_LENGTH, ANNOT_PROTEIN_SEQUENCE, ANNOT_PROTEIN_SOURCE_ID, ANNOT_PSEUDO_STRING, ANNOT_SEQUENCE_DATABASE_NAME, ANNOT_SEQUENCE_ID, ANNOT_SIGNALP_PEPTIDE, ANNOT_SMART_DESCRIPTION, ANNOT_SMART_ID, ANNOT_SNPOVERVIEW, ANNOT_SO_ID, ANNOT_SO_TERM_DEFINITION, ANNOT_SO_TERM_NAME, ANNOT_SO_VERSION, ANNOT_START_MIN, ANNOT_STRAND, ANNOT_STRAND_PLUS_MINUS, ANNOT_SUPERFAMILY_DESCRIPTION, ANNOT_SUPERFAMILY_ID, ANNOT_THREE_PRIME_UTR_LENGTH, ANNOT_TIGRFAM_DESCRIPTION, ANNOT_TIGRFAM_ID, ANNOT_TM_COUNT, ANNOT_TRANS_FOUND_PER_GENE_INTERNAL, ANNOT_TRANSCRIPT_INDEX_PER_GENE, ANNOT_TRANSCRIPT_LENGTH, ANNOT_TRANSCRIPT_LINK, ANNOT_TRANSCRIPT_PRODUCT, ANNOT_TRANSCRIPT_SEQUENCE, ANNOT_TRANSCRIPTS_FOUND_PER_GENE, ANNOT_UNIPROT_IDS, ANNOT_UNIPROT_LINKS
## 'select()' returned 1:1 mapping between keys and columns
tc_more <- rbind(tc_esmer$genes, tc_nonesmer$genes, tc_unas$genes)
tc_annot <- merge(tc_annot, tc_more, by = "row.names")
rownames(tc_annot) <- tc_annot[["gid"]]
tc_annot[["gid"]] <- NULL
dim(tc_annot)## [1] 23304 135
## This is an orgdb, good.
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## This is an orgdb, good.
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## This is an orgdb, good.
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
I asked for one from Najib/Amalie but unless I am mistaken it has not arrived. That is not a problem, given two helpful things: April provides one, I also named the directories so that the sample IDs are built in; so I will just make a fake one for now and then merge in whatever I get from them…
sample_sheet <- "sample_sheets/all_samples.xlsx"
meta_sankey <- plot_meta_sankey(as.data.frame(extract_metadata(sample_sheet)),
factors = c("background", "exp_number"))## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## Warning: attributes are not identical across measure variables; they will be dropped
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the ggsankey package.
## Please report the issue at <https://github.com/davidsjoberg/ggsankey/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Let us see how well my preprocess gatherer does…
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## Warning in dispatch_regex_search(meta, search, replace, input_file_spec, : NAs
## introduced by coercion
## Warning in dispatch_regex_search(meta, search, replace, input_file_spec, : NAs
## introduced by coercion
## Writing new metadata to: sample_sheets/all_samples_modified.xlsx
## Deleting the file sample_sheets/all_samples_modified.xlsx before writing the tables.
## sampleid samplenumber celltype background hpi
## X02_HeLa_control_60h 02_HeLa_control_60h 2 HeLa control t60h
## X04_HeLa_WT_60hpi 04_HeLa_WT_60hpi 4 HeLa wt t60h
## X06_HeLa_KO7_60hpi 06_HeLa_KO7_60hpi 6 HeLa ko7 t60h
## X08_HeLa_Cas_60hpi 08_HeLa_Cas_60hpi 8 HeLa cas t60h
## X18_HeLa_control_60h 18_HeLa_control_60h 18 HeLa control t60h
## X20_HeLa_WT_60hpi 20_HeLa_WT_60hpi 20 HeLa wt t60h
## exp_number round amount_in_10ul amount_fact
## X02_HeLa_control_60h e1 r1 183 low
## X04_HeLa_WT_60hpi e1 r1 304 mid
## X06_HeLa_KO7_60hpi e1 r1 298 mid
## X08_HeLa_Cas_60hpi e1 r1 284 mid
## X18_HeLa_control_60h e2 r2 62 low
## X20_HeLa_WT_60hpi e2 r2 228 mid
## freebayes_table
## X02_HeLa_control_60h preprocessing/02_HeLa_control_60h/outputs/20251031freebayes_tcruzi_all/all_tags_q-10_c-2_m0.5_M-1.0_ctag-DP_mtag-AB.txt.xz
## X04_HeLa_WT_60hpi preprocessing/04_HeLa_WT_60hpi/outputs/20251031freebayes_tcruzi_all/all_tags_q-10_c-2_m0.5_M-1.0_ctag-DP_mtag-AB.txt.xz
## X06_HeLa_KO7_60hpi preprocessing/06_HeLa_KO7_60hpi/outputs/20251031freebayes_tcruzi_all/all_tags_q-10_c-2_m0.5_M-1.0_ctag-DP_mtag-AB.txt.xz
## X08_HeLa_Cas_60hpi preprocessing/08_HeLa_Cas_60hpi/outputs/20251031freebayes_tcruzi_all/all_tags_q-10_c-2_m0.5_M-1.0_ctag-DP_mtag-AB.txt.xz
## X18_HeLa_control_60h preprocessing/18_HeLa_control_60h/outputs/20251031freebayes_tcruzi_all/all_tags_q-10_c-2_m0.5_M-1.0_ctag-DP_mtag-AB.txt.xz
## X20_HeLa_WT_60hpi preprocessing/20_HeLa_WT_60hpi/outputs/20251031freebayes_tcruzi_all/all_tags_q-10_c-2_m0.5_M-1.0_ctag-DP_mtag-AB.txt.xz
## condition batch sampleid_backup trimomatic_input
## X02_HeLa_control_60h undefined undefined 02_HeLa_control_60h 34421670
## X04_HeLa_WT_60hpi undefined undefined 04_HeLa_WT_60hpi 33338315
## X06_HeLa_KO7_60hpi undefined undefined 06_HeLa_KO7_60hpi 36904955
## X08_HeLa_Cas_60hpi undefined undefined 08_HeLa_Cas_60hpi 34230672
## X18_HeLa_control_60h undefined undefined 18_HeLa_control_60h 31154298
## X20_HeLa_WT_60hpi undefined undefined 20_HeLa_WT_60hpi 35726918
## trimomatic_output trimomatic_percent fastqc_pct_gc
## X02_HeLa_control_60h 31723102 0.922 52
## X04_HeLa_WT_60hpi 30831462 0.925 50
## X06_HeLa_KO7_60hpi 34168992 0.926 50
## X08_HeLa_Cas_60hpi 30953413 0.904 50
## X18_HeLa_control_60h 28104898 0.902 51
## X20_HeLa_WT_60hpi 32916331 0.921 50
## kraken_bacterial_classified kraken_bacterial_unclassified
## X02_HeLa_control_60h 147699 418871
## X04_HeLa_WT_60hpi 285754 6263711
## X06_HeLa_KO7_60hpi 414463 8109109
## X08_HeLa_Cas_60hpi 309973 7277804
## X18_HeLa_control_60h 147359 374703
## X20_HeLa_WT_60hpi 323491 8424975
## kraken_first_bacterial_species
## X02_HeLa_control_60h Porphyrobacter sp. GA68
## X04_HeLa_WT_60hpi Mycoplasmopsis arginini
## X06_HeLa_KO7_60hpi Mycoplasmopsis arginini
## X08_HeLa_Cas_60hpi Mycoplasmopsis arginini
## X18_HeLa_control_60h Porphyrobacter sp. GA68
## X20_HeLa_WT_60hpi Klebsiella pneumoniae
## kraken_first_bacterial_species_reads
## X02_HeLa_control_60h 34515
## X04_HeLa_WT_60hpi 20649
## X06_HeLa_KO7_60hpi 95574
## X08_HeLa_Cas_60hpi 22086
## X18_HeLa_control_60h 22324
## X20_HeLa_WT_60hpi 4599
## kraken_matrix_bacterial
## X02_HeLa_control_60h preprocessing/02_HeLa_control_60h/outputs/20251031kraken_bacteria/kraken_report_matrix.tsv
## X04_HeLa_WT_60hpi preprocessing/04_HeLa_WT_60hpi/outputs/20251031kraken_bacteria/kraken_report_matrix.tsv
## X06_HeLa_KO7_60hpi preprocessing/06_HeLa_KO7_60hpi/outputs/20251031kraken_bacteria/kraken_report_matrix.tsv
## X08_HeLa_Cas_60hpi preprocessing/08_HeLa_Cas_60hpi/outputs/20251031kraken_bacteria/kraken_report_matrix.tsv
## X18_HeLa_control_60h preprocessing/18_HeLa_control_60h/outputs/20251031kraken_bacteria/kraken_report_matrix.tsv
## X20_HeLa_WT_60hpi preprocessing/20_HeLa_WT_60hpi/outputs/20251031kraken_bacteria/kraken_report_matrix.tsv
## hisat_genome_input_reads_hg38_115
## X02_HeLa_control_60h 31723102
## X04_HeLa_WT_60hpi 30831462
## X06_HeLa_KO7_60hpi NA
## X08_HeLa_Cas_60hpi 30953413
## X18_HeLa_control_60h 28104898
## X20_HeLa_WT_60hpi 32916331
## hisat_genome_input_reads_tcruzi_all
## X02_HeLa_control_60h 31723102
## X04_HeLa_WT_60hpi 30831462
## X06_HeLa_KO7_60hpi NA
## X08_HeLa_Cas_60hpi 30953413
## X18_HeLa_control_60h 28104898
## X20_HeLa_WT_60hpi 32916331
## hisat_genome_single_concordant_hg38_115
## X02_HeLa_control_60h 27374698
## X04_HeLa_WT_60hpi 21550886
## X06_HeLa_KO7_60hpi NA
## X08_HeLa_Cas_60hpi 20831115
## X18_HeLa_control_60h 24646849
## X20_HeLa_WT_60hpi 21560373
## hisat_genome_single_concordant_tcruzi_all
## X02_HeLa_control_60h 5363
## X04_HeLa_WT_60hpi 3984432
## X06_HeLa_KO7_60hpi NA
## X08_HeLa_Cas_60hpi 4602984
## X18_HeLa_control_60h 9351
## X20_HeLa_WT_60hpi 5394425
## hisat_genome_multi_concordant_hg38_115
## X02_HeLa_control_60h 3781834
## X04_HeLa_WT_60hpi 2731111
## X06_HeLa_KO7_60hpi NA
## X08_HeLa_Cas_60hpi 2534521
## X18_HeLa_control_60h 2935987
## X20_HeLa_WT_60hpi 2607492
## hisat_genome_multi_concordant_tcruzi_all
## X02_HeLa_control_60h 3176
## X04_HeLa_WT_60hpi 1739149
## X06_HeLa_KO7_60hpi NA
## X08_HeLa_Cas_60hpi 2063574
## X18_HeLa_control_60h 6690
## X20_HeLa_WT_60hpi 2363417
## hisat_genome_single_all_hg38_115
## X02_HeLa_control_60h 393579
## X04_HeLa_WT_60hpi 386791
## X06_HeLa_KO7_60hpi NA
## X08_HeLa_Cas_60hpi 370232
## X18_HeLa_control_60h 371885
## X20_HeLa_WT_60hpi 394781
## hisat_genome_single_all_tcruzi_all
## X02_HeLa_control_60h 66941
## X04_HeLa_WT_60hpi 223361
## X06_HeLa_KO7_60hpi NA
## X08_HeLa_Cas_60hpi 232208
## X18_HeLa_control_60h 77290
## X20_HeLa_WT_60hpi 288620
## hisat_genome_multi_all_hg38_115
## X02_HeLa_control_60h 147888
## X04_HeLa_WT_60hpi 125185
## X06_HeLa_KO7_60hpi NA
## X08_HeLa_Cas_60hpi 118754
## X18_HeLa_control_60h 118560
## X20_HeLa_WT_60hpi 124747
## hisat_genome_multi_all_tcruzi_all hisat_unmapped_hg38_115
## X02_HeLa_control_60h 41174 485321
## X04_HeLa_WT_60hpi 110555 12501300
## X06_HeLa_KO7_60hpi NA NA
## X08_HeLa_Cas_60hpi 116543 14599664
## X18_HeLa_control_60h 38204 474391
## X20_HeLa_WT_60hpi 132039 16893802
## hisat_unmapped_tcruzi_all hisat_genome_percent_log_hg38_115
## X02_HeLa_control_60h 63320953 99.24
## X04_HeLa_WT_60hpi 49859944 79.73
## X06_HeLa_KO7_60hpi NA NA
## X08_HeLa_Cas_60hpi 48200809 76.42
## X18_HeLa_control_60h 56062102 99.16
## X20_HeLa_WT_60hpi 49864471 74.34
## hisat_genome_percent_log_tcruzi_all
## X02_HeLa_control_60h 0.20
## X04_HeLa_WT_60hpi 19.14
## X06_HeLa_KO7_60hpi NA
## X08_HeLa_Cas_60hpi 22.14
## X18_HeLa_control_60h 0.26
## X20_HeLa_WT_60hpi 24.26
## hisat_alignment_hg38_115
## X02_HeLa_control_60h preprocessing/02_HeLa_control_60h/outputs/20251031hisat_hg38_115/hg38_115_genome.bam
## X04_HeLa_WT_60hpi preprocessing/04_HeLa_WT_60hpi/outputs/20251031hisat_hg38_115/hg38_115_genome.bam
## X06_HeLa_KO7_60hpi preprocessing/06_HeLa_KO7_60hpi/outputs/20251031hisat_hg38_115/hg38_115_genome.bam
## X08_HeLa_Cas_60hpi preprocessing/08_HeLa_Cas_60hpi/outputs/20251031hisat_hg38_115/hg38_115_genome.bam
## X18_HeLa_control_60h preprocessing/18_HeLa_control_60h/outputs/20251031hisat_hg38_115/hg38_115_genome.bam
## X20_HeLa_WT_60hpi preprocessing/20_HeLa_WT_60hpi/outputs/20251031hisat_hg38_115/hg38_115_genome.bam
## hisat_alignment_tcruzi_all
## X02_HeLa_control_60h preprocessing/02_HeLa_control_60h/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome.bam
## X04_HeLa_WT_60hpi preprocessing/04_HeLa_WT_60hpi/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome.bam
## X06_HeLa_KO7_60hpi preprocessing/06_HeLa_KO7_60hpi/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome.bam
## X08_HeLa_Cas_60hpi preprocessing/08_HeLa_Cas_60hpi/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome.bam
## X18_HeLa_control_60h preprocessing/18_HeLa_control_60h/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome.bam
## X20_HeLa_WT_60hpi preprocessing/20_HeLa_WT_60hpi/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome.bam
## salmon_percent_hg38_115 salmon_percent_tcruzi_all
## X02_HeLa_control_60h 45.40 0.008861
## X04_HeLa_WT_60hpi 35.05 9.584910
## X06_HeLa_KO7_60hpi 33.15 11.219200
## X08_HeLa_Cas_60hpi 33.47 10.873800
## X18_HeLa_control_60h 43.10 0.009966
## X20_HeLa_WT_60hpi 34.33 12.117900
## salmon_observed_genes_hg38_115 salmon_observed_genes_tcruzi_all
## X02_HeLa_control_60h 47839 121
## X04_HeLa_WT_60hpi 46509 19145
## X06_HeLa_KO7_60hpi 48117 19177
## X08_HeLa_Cas_60hpi 46351 19153
## X18_HeLa_control_60h 47978 654
## X20_HeLa_WT_60hpi 47985 19270
## input_r1
## X02_HeLa_control_60h unprocessed/02_HeLa_control_60h_2_S1_R1_001.fastq.gz
## X04_HeLa_WT_60hpi unprocessed/04_HeLa_WT_60hpi_2_S2_R1_001.fastq.gz
## X06_HeLa_KO7_60hpi unprocessed/06_HeLa_KO7_60hpi_2_S3_R1_001.fastq.gz
## X08_HeLa_Cas_60hpi unprocessed/08_HeLa_Cas_60hpi_2_S4_R1_001.fastq.gz
## X18_HeLa_control_60h unprocessed/18_HeLa_control_60h_2_S5_R1_001.fastq.gz
## X20_HeLa_WT_60hpi unprocessed/20_HeLa_WT_60hpi_2_S6_R1_001.fastq.gz
## input_r2
## X02_HeLa_control_60h unprocessed/02_HeLa_control_60h_2_S1_R2_001.fastq.gz
## X04_HeLa_WT_60hpi unprocessed/04_HeLa_WT_60hpi_2_S2_R2_001.fastq.gz
## X06_HeLa_KO7_60hpi unprocessed/06_HeLa_KO7_60hpi_2_S3_R2_001.fastq.gz
## X08_HeLa_Cas_60hpi unprocessed/08_HeLa_Cas_60hpi_2_S4_R2_001.fastq.gz
## X18_HeLa_control_60h unprocessed/18_HeLa_control_60h_2_S5_R2_001.fastq.gz
## X20_HeLa_WT_60hpi unprocessed/20_HeLa_WT_60hpi_2_S6_R2_001.fastq.gz
## hisat_count_table_hg38_115
## X02_HeLa_control_60h preprocessing/02_HeLa_control_60h/outputs/20251031hisat_hg38_115/hg38_115_genome-paired_s2_gene_ID_fcounts.csv.xz
## X04_HeLa_WT_60hpi preprocessing/04_HeLa_WT_60hpi/outputs/20251031hisat_hg38_115/hg38_115_genome-paired_s2_gene_ID_fcounts.csv.xz
## X06_HeLa_KO7_60hpi preprocessing/06_HeLa_KO7_60hpi/outputs/20251031hisat_hg38_115/hg38_115_genome-paired_s2_gene_ID_fcounts.csv.xz
## X08_HeLa_Cas_60hpi preprocessing/08_HeLa_Cas_60hpi/outputs/20251031hisat_hg38_115/hg38_115_genome-paired_s2_gene_ID_fcounts.csv.xz
## X18_HeLa_control_60h preprocessing/18_HeLa_control_60h/outputs/20251031hisat_hg38_115/hg38_115_genome-paired_s2_gene_ID_fcounts.csv.xz
## X20_HeLa_WT_60hpi preprocessing/20_HeLa_WT_60hpi/outputs/20251031hisat_hg38_115/hg38_115_genome-paired_s2_gene_ID_fcounts.csv.xz
## hisat_count_table_tcruzi_all
## X02_HeLa_control_60h preprocessing/02_HeLa_control_60h/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome-paired_s2_gene_ID_fcounts.csv.xz
## X04_HeLa_WT_60hpi preprocessing/04_HeLa_WT_60hpi/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome-paired_s2_gene_ID_fcounts.csv.xz
## X06_HeLa_KO7_60hpi preprocessing/06_HeLa_KO7_60hpi/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome-paired_s2_gene_ID_fcounts.csv.xz
## X08_HeLa_Cas_60hpi preprocessing/08_HeLa_Cas_60hpi/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome-paired_s2_gene_ID_fcounts.csv.xz
## X18_HeLa_control_60h preprocessing/18_HeLa_control_60h/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome-paired_s2_gene_ID_fcounts.csv.xz
## X20_HeLa_WT_60hpi preprocessing/20_HeLa_WT_60hpi/outputs/20251031hisat_tcruzi_all/tcruzi_all_genome-paired_s2_gene_ID_fcounts.csv.xz
## salmon_count_table_hg38_115
## X02_HeLa_control_60h preprocessing/02_HeLa_control_60h/outputs/20251031salmon_hg38_115_CDS/quant.sf
## X04_HeLa_WT_60hpi preprocessing/04_HeLa_WT_60hpi/outputs/20251031salmon_hg38_115_CDS/quant.sf
## X06_HeLa_KO7_60hpi preprocessing/06_HeLa_KO7_60hpi/outputs/20251031salmon_hg38_115_CDS/quant.sf
## X08_HeLa_Cas_60hpi preprocessing/08_HeLa_Cas_60hpi/outputs/20251031salmon_hg38_115_CDS/quant.sf
## X18_HeLa_control_60h preprocessing/18_HeLa_control_60h/outputs/20251031salmon_hg38_115_CDS/quant.sf
## X20_HeLa_WT_60hpi preprocessing/20_HeLa_WT_60hpi/outputs/20251031salmon_hg38_115_CDS/quant.sf
## salmon_count_table_tcruzi_all
## X02_HeLa_control_60h preprocessing/02_HeLa_control_60h/outputs/20251031salmon_tcruzi_all_CDS/quant.sf
## X04_HeLa_WT_60hpi preprocessing/04_HeLa_WT_60hpi/outputs/20251031salmon_tcruzi_all_CDS/quant.sf
## X06_HeLa_KO7_60hpi preprocessing/06_HeLa_KO7_60hpi/outputs/20251031salmon_tcruzi_all_CDS/quant.sf
## X08_HeLa_Cas_60hpi preprocessing/08_HeLa_Cas_60hpi/outputs/20251031salmon_tcruzi_all_CDS/quant.sf
## X18_HeLa_control_60h preprocessing/18_HeLa_control_60h/outputs/20251031salmon_tcruzi_all_CDS/quant.sf
## X20_HeLa_WT_60hpi preprocessing/20_HeLa_WT_60hpi/outputs/20251031salmon_tcruzi_all_CDS/quant.sf
Strangely, this did not pick up the freebayes outputs. I will add them manually to the original sheet. Possibly because I ran it twice with different parameters, my code gets confused when multiple files match the same rule.
color_choices <- list(
"hs" = list(
"AB10" = "#086448",
"cas" = "#702601",
"control" = "#454178",
"ko7" = "#870649",
"positive" = "#46060E",
"wt" = "#785C01"),
"tc" = list(
"AB10" = "#0DA877",
"cas" = "#BA3F01",
"control" = "#7771D1",
"ko7" = "#BF086A",
"positive" = "#8F0C1E",
"wt" = "#AF8401"))These colors are bad, the human are too dark and lose their contrast with respect to each other. I should get Najib/April/Amalie to help define better.
hs_se <- create_se(new_meta[["new_meta"]], gene_info = hs_annot[["gene_annotations"]],
file_column = "hisat_count_table_hg38_115") %>%
set_conditions(fact = "background") %>%
set_batches(fact = "exp_number") %>%
set_colors(color_choices[["hs"]])## Reading the sample metadata.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## The sample definitions comprises: 14 rows(samples) and 48 columns(metadata fields).
## Matched 21562 annotations and counts.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the summarized experiment to 'se.rda'.
## The final summarized experiment has 21571 rows and 48 columns.
## The numbers of samples by condition are:
##
## AB10 cas control ko7 positive wt
## 3 1 3 3 1 3
## The number of samples by batch are:
##
## e1 e2 e3 undef
## 4 4 5 1
## Deleting the file excel/hs_expression_data.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## Warning in .local(x, row.names, optional, ...): arguments in '...' ignored
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the hpgltools package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## 85373 entries are 0. We are on a log scale, adding 1 to the data.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the directlabels package.
## Please report the issue at <https://github.com/tdhock/directlabels/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plot describing the gene distribution from a dataset.
## `geom_smooth()` using formula = 'y ~ x'This dataset does not support lmer with condition+batch
## Removing 9704 low-count genes (11867 remaining).
## transform_counts: Found 100 values equal to 0, adding 1 to the matrix.
## Plot describing the gene distribution from a dataset.
## `geom_smooth()` using formula = 'y ~ x'The factor AB10 has 3 rows.
## The factor cas has only 1 row.
## The factor control has 3 rows.
## The factor ko7 has 3 rows.
## The factor positive has only 1 row.
## The factor wt has 3 rows.
tc_se <- create_se(new_meta[["new_meta"]], gene_info = tc_annot,
file_column = "hisat_count_table_tcruzi_all") %>%
set_conditions(fact = "background") %>%
set_batches(fact = "exp_number") %>%
set_colors(color_choices[["tc"]])## Reading the sample metadata.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## The sample definitions comprises: 14 rows(samples) and 48 columns(metadata fields).
## Matched 23304 annotations and counts.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the summarized experiment to 'se.rda'.
## The final summarized experiment has 25100 rows and 48 columns.
## The numbers of samples by condition are:
##
## AB10 cas control ko7 positive wt
## 3 1 3 3 1 3
## The number of samples by batch are:
##
## e1 e2 e3 undef
## 4 4 5 1
## Deleting the file excel/tc_expression_data.xlsx before writing the tables.
## Writing the first sheet, containing a legend and some summary data.
## Warning in .local(x, row.names, optional, ...): arguments in '...' ignored
## The following samples have less than 16315 genes.
## [1] "02_HL_c_60" "18_HL_c_60" "34_HL_c_60" "pos_ctrl"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 132845 entries are 0. We are on a log scale, adding 1 to the data.
##
## Naively calculating coefficient of variation/dispersion with respect to condition.
##
## Finished calculating dispersion estimates.
## Error in (function (side, at = NULL, labels = TRUE, tick = TRUE, line = NA, :
## no locations are finite
## Hey, you merged the annotation data and did not reset the column names!
## Plot describing the gene distribution from a dataset.
## `geom_smooth()` using formula = 'y ~ x'This dataset does not support lmer with condition+batch
## Removing 0 low-count genes (25100 remaining).
## transform_counts: Found 132845 values equal to 0, adding 1 to the matrix.
## Plot describing the gene distribution from a dataset.
## `geom_smooth()` using formula = 'y ~ x'The factor AB10 has 3 rows.
## The factor cas has only 1 row.
## The factor control has 3 rows.
## The factor ko7 has 3 rows.
## The factor positive has only 1 row.
## The factor wt has 3 rows.
One of my concerns surrounds the fate of the various trans-sialidase genes and the ability to discern the efficaciousness of adding stop codons to them. I therefore quantified the samples with salmon which I think is more sensitive to multi gene families.
salmon_annot <- tc_annot
rownames(salmon_annot) <- paste0(rownames(salmon_annot), ":mRNA")
tc_salmon <- create_se(new_meta[["new_meta"]], gene_info = salmon_annot,
file_column = "salmon_count_table_tcruzi_all") %>%
set_conditions(fact = "background") %>%
set_batches(fact = "exp_number") %>%
set_colors(color_choices[["tc"]])## Reading the sample metadata.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## The sample definitions comprises: 14 rows(samples) and 48 columns(metadata fields).
## Matched 19476 annotations and counts.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the summarized experiment to 'se.rda'.
## The final summarized experiment has 19533 rows and 48 columns.
## The numbers of samples by condition are:
##
## AB10 cas control ko7 positive wt
## 3 1 3 3 1 3
## The number of samples by batch are:
##
## e1 e2 e3 undef
## 4 4 5 1
hs_mapped <- plot_metadata_factors(hs_se, column = "hisat_genome_percent_log_hg38_115")
pp(file = "images/hs_hisat_mapping_percent.png", image = hs_mapped)## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
tc_mapped <- plot_metadata_factors(tc_se, column = "hisat_genome_percent_log_tcruzi_all")
pp(file = "images/tc_hisat_mapping_percent.png", image = tc_mapped)## Warning: Removed 1 row containing non-finite outside the scale range (`stat_ydensity()`).
## Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
I picked off another TODO in here, I changed plot_nonzero to more intelligently set the text annotation.
## The following samples have less than 16315 genes.
## [1] "02_HL_c_60" "18_HL_c_60" "34_HL_c_60" "pos_ctrl"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## The samples (and read coverage) removed when filtering 5000 non-zero genes are:
## 02_HeLa_control_60h 18_HeLa_control_60h 34_HeLa_control_60h pos_ctrl
## 1615 1957 1661 301
## 02_HeLa_control_60h 18_HeLa_control_60h 34_HeLa_control_60h pos_ctrl
## 80 529 96 18
## Samples removed: 80, 529, 96, 18
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 85373 entries are 0. We are on a log scale, adding 1 to the data.
## Plot describing the gene distribution from a dataset.
## Removing samples with less than 3 replicates.
## Removed: 08_HeLa_Cas_60hpi, pos_ctrl.
tc_replicated <- subset_se(tc_se, min_replicates = 3, fact = "condition") %>%
subset_se(nonzero = 10000)## Removing samples with less than 3 replicates.
## Removed: 08_HeLa_Cas_60hpi.
## No samples have fewer than 10000 observed genes.
tcsal_replicated <- subset_se(tc_salmon, min_replicates = 3, fact = "condition") %>%
subset_se(nonzero = 10000)## Removing samples with less than 3 replicates.
## Removed: 08_HeLa_Cas_60hpi, pos_ctrl.
## The samples (and read coverage) removed when filtering 10000 non-zero genes are:
## 02_HeLa_control_60h 18_HeLa_control_60h 34_HeLa_control_60h
## 2811 2801 2201
## 02_HeLa_control_60h 18_HeLa_control_60h 34_HeLa_control_60h
## 120 653 143
## Samples removed: 120, 653, 143
## Using the snp column: PAIRED from the sample annotations.
Reminder to self: count_snps reads the freebayes table, pass that to get_snp_sets() to cross reference against the experimental design, then pass that to snps_intersections() and snps_vs_genes(). I should change that to be able to directly take the output from count_snps()
var_norm <- normalize(tc_variants, convert = "cpm", norm = "quant",
filter = TRUE, transform = "log2")## Removing 0 low-count genes (51349 remaining).
## transform_counts: Found 298249 values equal to 0, adding 1 to the matrix.
tc_variant_pca <- plot_pca(var_norm)
pp(file = "images/tc_variant_pca.png", image = tc_variant_pca[["plot"]])## The samples represent the following categories:
##
## AB10 ko7 wt
## 3 3 3
## Using a proportion of observed variants, converting the data to binary observations.
## The factor AB10 has 3 rows.
## The factor ko7 has 3 rows.
## The factor wt has 3 rows.
## Finished iterating over the chromosomes.
## A set of variants observed when cross referencing all variants against
## the samples associated with each metadata factor: condition. 3
## categories and 51349 variants were observed with 7
## combinations among them. 1796 chromosomes/scaffolds were observed with a
## density of variants ranging from 2.25022502250225e-05 to 0.1.
snp_intersections <- snps_intersections(tc_se, tc_sets, start_column = "start",
end_column = "end", chr_column = "annot_sequence_id")
snp_intersections## The combinations of variants, chromosomes, and genes which are unique to every factor
## and combination of factors in the data.
snps_vs_genes <- snps_vs_genes(tc_se, tc_sets, start_column = "start",
end_column = "end", chr_column = "seqnames")## The snp grange data has 51349 elements.
## The first few snp chromosomes are: TcChr10_P, TcChr10_S, TcChr11_P, TcChr11_S, TcChr12_P, TcChr12_S
## The first few exp chromosomes are: TcChr1-P, TcChr1-S, TcChr10-P, TcChr10-S, TcChr11-P, TcChr11-S
## There are 543 overlapping variants and genes.
## When the variants observed were cross referenced against annotated genes,
## 239 genes were observed with at least 1 variant.
## TcCLB.402863.9 had the most variants, with 48.
Here are a couple of variance partition invocations. Once we have other metadata, this will be more useful. Note, this will only work once the non-replicated conditions are removed (control and cas).
## The model of ~ condition + batch has 6 levels and rank 6
## The model of ~ condition + batch has 5 levels and rank 5
## Warning in .fitExtractVarPartModel(exprObj, formula, data, REML = REML, : Model failed for 9 responses.
## See errors with attr(., 'errors')
I think we probably should not be surprised at the amount of variance attributed to the batch due to the very large difference in coverage between experiment #3 and 1/2.
Perform our default PCA plot along with a combat version.
hs_norm <- normalize(hs_replicated, transform = "log2", convert = "cpm",
norm = "quant", filter = TRUE)## Removing 9809 low-count genes (11762 remaining).
## Setting 1672 entries to zero.
hs_disheat <- plot_disheat(hs_norm)
pp(file = "images/hs_distance_heatmap.png", image = hs_disheat2)## Error:
## ! object 'hs_disheat2' not found
hs_corheat <- plot_corheat(hs_norm)
pp(file = "images/hs_correlation_heatmap.png", image = hs_corheat[["plot"]])hs_norm_pca <- plot_pca(hs_norm, plot_labels = TRUE)
pp(file = "images/hs_norm_pca.png", image = hs_norm_pca[["plot"]])hs_nb <- normalize(hs_replicated, transform = "log2", convert = "cpm",
filter = TRUE, batch = "svaseq")## Removing 9809 low-count genes (11762 remaining).
## transform_counts: Found 25 values less than 0.
## transform_counts: Found 25 values equal to 0, adding 1 to the matrix.
hs_cb <- normalize(hs_replicated, transform = "log2", convert = "cpm",
filter = TRUE, batch = "combat")## Removing 9809 low-count genes (11762 remaining).
## transform_counts: Found 84 values less than 0.
## transform_counts: Found 89 values equal to 0, adding 1 to the matrix.
hs_combat_pca <- plot_pca(hs_cb)
pp(file = "images/hs_norm_combat_pca.png", image = hs_combat_pca[["plot"]])## Removing 3731 low-count genes (21369 remaining).
## transform_counts: Found 48 values equal to 0, adding 1 to the matrix.
tc_corheat <- plot_corheat(tc_norm)
pp(file = "images/tc_correlation_heatmap.png", image = tc_corheat)A little bit of fun, extract the genes which are high-outliers in each sample and print what they are.
high_genes <- unique(as.character(unlist(norm_boxplot[["high_outlier_genes"]])))
unique(rowData(tc_se)[high_genes, ][["annot_transcript_product"]])## [1] "alpha tubulin, putative"
## [2] "undefined"
## [3] "Voltage-dependent calcium channel subunit, putative"
## [4] "zinc finger CCCH domain containing protein 11"
## [5] "60S ribosomal protein L6, putative"
## [6] "clathrin heavy chain, putative"
## [7] "ribosomal protein l35a, putative"
## [8] "cyclophilin a, putative"
## [9] "hypothetical protein, conserved"
## [10] "ABC transporter, putative"
## [11] "dynein heavy chain, putative"
## [12] "mitochondrial RNA binding complex 1 subunit, putative"
## [13] "hypothetical protein"
## [14] "cell division cycle protein 20, putative"
## [15] "conserved protein"
## [16] "metallo-peptidase, Clan MF, Family M17"
## [17] "ubiquitin-protein ligase-like, putative"
## [18] "60S ribosomal protein L23a, putative"
## [19] "RNA-binding protein 42 (RNA-binding motif protein 42), putative"
## [20] "60S ribosomal protein L10a, putative"
## [21] "40S ribosomal protein S6, putative"
## [22] "trans-sialidase, Group II, putative"
## [23] "enolase"
## [24] "polyadenylate-binding protein, putative"
## [25] "CCR4-NOT transcription complex subunit 1"
tc_norm_pca <- plot_pca(tc_norm, plot_labels = TRUE)
pp(file = "images/tc_norm_pca.png", image = tc_norm_pca)tc_rnorm <- normalize(tc_replicated, transform = "log2", convert = "cpm",
norm = "quant", filter = TRUE)## Removing 3797 low-count genes (21303 remaining).
## transform_counts: Found 44 values equal to 0, adding 1 to the matrix.
tc_rnorm_disheat <- plot_disheat(tc_rnorm)
pp(file = "images/tc_rnorm_disheat.png", image = tc_rnorm_disheat)tc_rbnorm <- normalize(tc_replicated, transform = "log2", convert = "cpm",
filter = TRUE, batch = "svaseq")## Removing 3797 low-count genes (21303 remaining).
## transform_counts: Found 1419 values less than 0.
## transform_counts: Found 1419 values equal to 0, adding 1 to the matrix.
tc_cbnorm <- normalize(tc_replicated, transform = "log2", convert = "cpm",
filter = TRUE, batch = "combat")## Removing 3797 low-count genes (21303 remaining).
## transform_counts: Found 1671 values less than 0.
## transform_counts: Found 1812 values equal to 0, adding 1 to the matrix.
I am not thinking we will see many genes of interest.
hs_keepers <- list(
"ab_vs_control" = c("AB10", "control"),
"ko_vs_control" = c("ko7", "control"),
"ko_vs_wt" = c("ko7", "wt"),
"ab_vs_wt" = c("AB10", "wt"),
"ab_vs_ko" = c("AB10", "ko7"))
hs_de <- all_pairwise(hs_replicated, filter = TRUE, model_fstring = "~ 0 + condition",
model_svs = "svaseq")## AB10 control ko7 wt
## 3 3 3 3
## Removing 9809 low-count genes (11762 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 1603 entries to zero.
## This received a matrix of SVs.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## AB10 control ko7 wt
## 3 3 3 3
## conditions
## AB10 control ko7 wt
## 3 3 3 3
## conditions
## AB10 control ko7 wt
## 3 3 3 3
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 6 comparisons.
## Deleting the file excel/hs_tables.xlsx before writing the tables.
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown
## 1 control_vs_AB10-inverted 16 29 25 43
## 2 ko7_vs_control 82 20 104 28
## 3 wt_vs_ko7-inverted 0 3 0 4
## 4 wt_vs_AB10-inverted 0 1 0 1
## 5 ko7_vs_AB10-inverted 0 9 0 16
## limma_sigup limma_sigdown
## 1 2 1
## 2 4 0
## 3 0 0
## 4 0 0
## 5 0 0
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Plot describing unique/shared genes in a differential expression table.
## Deleting the file excel/hs_sig.xlsx before writing the tables.
## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down ebseq_up
## ab_vs_control 2 1 25 43 16 29 12
## ko_vs_control 4 0 104 28 82 20 31
## ko_vs_wt 0 0 0 4 0 3 0
## ab_vs_wt 0 0 0 1 0 1 0
## ab_vs_ko 0 0 0 16 0 9 0
## ebseq_down basic_up basic_down
## ab_vs_control 0 0 0
## ko_vs_control 1 0 0
## ko_vs_wt 1 0 0
## ab_vs_wt 0 0 0
## ab_vs_ko 0 0 0
While it is true there are not a tremendous number of genes, at least some of the groups are interesting.
## BP CC CORUM HP HPA KEGG MIRNA MF REAC TF WP
## ab_vs_control_up 0 5 0 0 0 0 1 1 1 0 0
## ab_vs_control_down 4 1 0 0 0 0 0 4 0 0 0
## ko_vs_control_up 64 2 0 0 0 2 1 15 5 21 4
## ko_vs_control_down 12 5 1 0 0 1 1 8 2 0 0
## Error:
## ! unable to find an inherited method for function 'conditions' for signature 'object = "SummarizedExperiment"'
tc_keepers <- list(
"ab_vs_wt" = c("AB10", "wt"),
"ko_vs_wt" = c("ko7", "wt"),
"ab_vs_ko" = c("AB10", "ko7"))
tc_de <- all_pairwise(tc_replicated, filter = TRUE, model_fstring = "~ 0 + condition",
model_svs = "svaseq")## AB10 ko7 wt
## 3 3 3
## Removing 3797 low-count genes (21303 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 7241 entries to zero.
## This received a matrix of SVs.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## AB10 ko7 wt
## 3 3 3
## conditions
## AB10 ko7 wt
## 3 3 3
## conditions
## AB10 ko7 wt
## 3 3 3
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 3 comparisons.
## Deleting the file excel/tc_tables.xlsx before writing the tables.
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown
## 1 wt_vs_AB10-inverted 32 377 99 621
## 2 wt_vs_ko7-inverted 42 49 103 165
## 3 ko7_vs_AB10-inverted 11 286 38 465
## limma_sigup limma_sigdown
## 1 74 341
## 2 70 67
## 3 20 245
## Plot describing unique/shared genes in a differential expression table.
## Deleting the file excel/tc_sig.xlsx before writing the tables.
## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down ebseq_up
## ab_vs_wt 74 341 99 621 32 377 62
## ko_vs_wt 70 67 103 165 42 49 117
## ab_vs_ko 20 245 38 465 11 286 10
## ebseq_down basic_up basic_down
## ab_vs_wt 247 0 0
## ko_vs_wt 47 0 0
## ab_vs_ko 194 0 0
I ought to be able to use my semantic filter to extract anything with sialidase and/or trans-sialidase group I and look directly at the expression of these genes. My hypothesis is that if the CRISPR experiment worked as intended, these genes should all have decreased expression.
all_ts <- semantic_filter(tc_replicated, invert = TRUE, semantic = c("trans-sialidase"),
semantic_column = "annot_transcript_product")## Hit 1523 genes for term trans-sialidase.
## semantic_filter(): kept, 23577 genes.
## transform_counts: Found 1833 values equal to 0, adding 1 to the matrix.
all_ts_norm_heat <- plot_sample_heatmap(all_ts_norm)
pp(file = "images/all_ts_norm_hisat_heatmap.png")
all_ts_norm_heat
dev.off()## png
## 2
all_ts_sal <- semantic_filter(tcsal_replicated, invert = TRUE, semantic = c("trans-sialidase"),
semantic_column = "annot_transcript_product")## Hit 730 genes for term trans-sialidase.
## semantic_filter(): kept, 18803 genes.
## transform_counts: Found 451 values equal to 0, adding 1 to the matrix.
all_ts_sal_norm_heat <- plot_sample_heatmap(all_ts_sal_norm)
pp(file = "images/all_ts_norm_salmon_heatmap.png")
all_ts_sal_norm_heat
dev.off()## png
## 2
The group-I TS genes are not obvious in this group, let us yank them out explicitly and see.
Note, the following is a little bit wrong in thinking because searching for ‘Group I’ will pick up all genes from Group I, II, III, and IV. The next stanza will extract just the IDs of interest.
g1_ts <- semantic_filter(all_ts, invert = TRUE, semantic = c("Group I"),
semantic_column = "annot_transcript_product")## Hit 175 genes for term Group I.
## semantic_filter(): kept, 1348 genes.
g1_ts_sal <- semantic_filter(all_ts_sal, invert = TRUE, semantic = c("Group I"),
semantic_column = "annot_transcript_product")## Hit 171 genes for term Group I.
## semantic_filter(): kept, 559 genes.
There is a pretty significant increase in a few AB samples, perhaps those are in the list of 19 specific genes? Let us find out.
## subset_genes(), before removal, there were 175 genes, now there are 18.
## There are 9 samples which kept less than 90 percent counts.
## 04_HeLa_WT_60hpi 06_HeLa_KO7_60hpi 20_HeLa_WT_60hpi 22_HeLa_KO7_60hpi
## 16.731 10.844 21.933 9.256
## 23_HeLa_AB10_60hpi 36_HeLa_WT_60hpi 38_HeLa_KO7_60hpi 39_HeLa_AB10_60hpi
## 11.236 10.072 7.398 6.906
## 40_HeLa_AB10_60hpi
## 7.433
## transform_counts: Found 14 values equal to 0, adding 1 to the matrix.
g1_ts_hisat_norm_heat <- plot_sample_heatmap(expected_norm)
pp(file = "images/g1_ts_hisat_norm_heat.png")
g1_ts_hisat_norm_heat
dev.off()## png
## 2
sal_expected <- paste0(expected_lower, ":mRNA")
expected_ts_sal <- subset_genes(g1_ts_sal, ids = sal_expected, method = "keep")## subset_genes(), before removal, there were 171 genes, now there are 18.
## There are 9 samples which kept less than 90 percent counts.
## 04_HeLa_WT_60hpi 06_HeLa_KO7_60hpi 20_HeLa_WT_60hpi 22_HeLa_KO7_60hpi
## 16.868 10.714 22.443 9.488
## 23_HeLa_AB10_60hpi 36_HeLa_WT_60hpi 38_HeLa_KO7_60hpi 39_HeLa_AB10_60hpi
## 10.798 11.991 8.080 7.421
## 40_HeLa_AB10_60hpi
## 7.671
## transform_counts: Found 2 values equal to 0, adding 1 to the matrix.
g1_ts_salmon_norm_heat <- plot_sample_heatmap(expected_sal_norm)
pp(file = "images/g1_ts_salmon_norm_heat.png")
g1_ts_salmon_norm_heat
dev.off()## png
## 2
We cannot use gProfiler2 with the parasite because it is not a reference species; but other ontology methods are not constrained thus. In the case of clusterProfiler, there is another constraint, I do not have a single orgDB object which comprises Esmer/NonEsmer/Unassigned; as a result I must attempt the ontology search on the haplotypes separately.
ko_wt_up <- tc_sig[["deseq"]][["ups"]][["ko_vs_wt"]]
ko_wt_down <- tc_sig[["deseq"]][["downs"]][["ko_vs_wt"]]
ko_wt_all <- tc_tables[["data"]][["ko_vs_wt"]]
ab_ko_up <- tc_sig[["deseq"]][["ups"]][["ab_vs_ko"]]
ab_ko_down <- tc_sig[["deseq"]][["downs"]][["ab_vs_ko"]]
ab_ko_all <- tc_tables[["data"]][["ab_vs_ko"]]
tc_esmer_up_cp <- simple_clusterprofiler(
ko_wt_up, de_table = ko_wt_all, orgdb = esmer_db, orgdb_to = "GID",
organism = "tcruzi", excel = "excel/ko_wt_up_cp_esmer.xlsx")## Error in `cp_msigdb_loaded()`:
## ! object 'signature_df' not found
tc_nonesmer_up_cp <- simple_clusterprofiler(
ko_wt_up, de_table = ko_wt_all, orgdb = nonesmer_db, orgdb_to = "GID",
organism = "tcruzi", excel = "excel/ko_wt_up_cp_nonesmer.xlsx")## Error in `cp_msigdb_loaded()`:
## ! object 'signature_df' not found
tc_unas_up_cp <- simple_clusterprofiler(
ko_wt_up, de_table = ko_wt_all, orgdb = unas_db, orgdb_to = "GID",
organism = "tcruzi")## Error in `cp_msigdb_loaded()`:
## ! object 'signature_df' not found
## Error:
## ! object 'tc_esmer_up_cp' not found
tc_esmer_down_cp <- simple_clusterprofiler(
ko_wt_down, de_table = ko_wt_all, orgdb = esmer_db, orgdb_to = "GID",
organism = "tcruzi", excel = "excel/ko_wt_down_cp_esmer.xlsx")## Error in `cp_msigdb_loaded()`:
## ! object 'signature_df' not found
tc_nonesmer_down_cp <- simple_clusterprofiler(
ko_wt_down, de_table = ko_wt_all, orgdb = nonesmer_db, orgdb_to = "GID",
organism = "tcruzi", excel = "excel/ko_wt_down_cp_nonesmer.xlsx")## Error in `cp_msigdb_loaded()`:
## ! object 'signature_df' not found
tc_unas_down_cp <- simple_clusterprofiler(
ko_wt_down, de_table = ko_wt_all, orgdb = unas_db, orgdb_to = "GID",
organism = "tcruzi")## Error in `cp_msigdb_loaded()`:
## ! object 'signature_df' not found
## Error:
## ! object 'tc_esmer_down_cp' not found
length_db <- as.data.frame(rowData(tc_se))
length_db[["gid"]] <- rownames(length_db)
length_db <- length_db[, c("gid", "width")]
tc_up_gs <- simple_goseq(ko_wt_up, go_db = tc_go, length_db = length_db, min_xref = 10)## Found 27 go_db genes and 42 length_db genes out of 42.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## Warning in (function (model, data, ...) : Arguments in `...` must be used.
## ✖ Problematic argument:
## • by = "Count"
## ℹ Did you misspell an argument name?
## Warning in (function (model, data, ...) : Arguments in `...` must be used.
## ✖ Problematic argument:
## • by = "Count"
## ℹ Did you misspell an argument name?
## dimensionality reduction failed with provided drfun; falling back to stats::cmdscale.
Now check the position of the expected lower expression genes in the context of all genes compared to wt.
## Pull the ko_wt_all table and see where expected_lower compares.
hs_duplicate <- subset_se(hs_replicated, subset = "round!='r3'")
tc_duplicate <- subset_se(tc_replicated, subset = "round!='r3'")
hs_dup_de <- all_pairwise(hs_duplicate, filter = TRUE,
model_fstring = "~ 0 + condition + batch", model_svs = FALSE)## AB10 control ko7 wt
## 2 2 2 2
## e1 e2 e3
## 3 4 1
## Removing 9950 low-count genes (11621 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 495 entries to zero.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## AB10 control ko7 wt
## 2 2 2 2
## conditions
## AB10 control ko7 wt
## 2 2 2 2
## conditions
## AB10 control ko7 wt
## 2 2 2 2
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: Existing surrogate matrix.
## The primary analysis performed 6 comparisons.
## Deleting the file excel/hs_dup_de_table-v202604.xlsx before writing the tables.
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
hs_dup_sig <- extract_significant_genes(hs_dup_table, excel = glue("excel/hs_dup_de_sig-v{ver}.xlsx"))## Deleting the file excel/hs_dup_de_sig-v202604.xlsx before writing the tables.
## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down ebseq_up
## control_vs_AB10 0 0 3 38 2 11 11
## ko7_vs_AB10 0 0 0 1 0 0 43
## wt_vs_AB10 0 0 1 2 0 0 17
## ko7_vs_control 0 0 323 83 206 44 316
## wt_vs_control 0 0 297 75 172 37 176
## wt_vs_ko7 0 0 3 1 3 0 7
## ebseq_down basic_up basic_down
## control_vs_AB10 53 0 0
## ko7_vs_AB10 12 0 0
## wt_vs_AB10 8 0 0
## ko7_vs_control 56 0 0
## wt_vs_control 28 0 0
## wt_vs_ko7 1 0 0
tc_dup_de <- all_pairwise(tc_duplicate, filter = TRUE,
model_fstring = "~ 0 + condition", model_svs = "svaseq")## AB10 ko7 wt
## 2 2 2
## Removing 4457 low-count genes (20643 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 2091 entries to zero.
## This received a matrix of SVs.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## AB10 ko7 wt
## 2 2 2
## conditions
## AB10 ko7 wt
## 2 2 2
## conditions
## AB10 ko7 wt
## 2 2 2
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 3 comparisons.
## Deleting the file excel/tc_dup_de_table-v202604.xlsx before writing the tables.
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 ko7_vs_AB10 125 10 239 60 118
## 2 wt_vs_AB10 123 28 358 306 129
## 3 wt_vs_ko7 22 56 238 253 119
## limma_sigdown
## 1 20
## 2 39
## 3 89
## Plot describing unique/shared genes in a differential expression table.
tc_dup_sig <- extract_significant_genes(tc_dup_table, excel = glue("excel/tc_dup_de_sig-v{ver}.xlsx"))## Deleting the file excel/tc_dup_de_sig-v202604.xlsx before writing the tables.
## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down ebseq_up
## ko7_vs_AB10 118 20 239 60 125 10 171
## wt_vs_AB10 129 39 358 306 123 28 205
## wt_vs_ko7 119 89 238 253 22 56 63
## ebseq_down basic_up basic_down
## ko7_vs_AB10 24 0 0
## wt_vs_AB10 465 0 0
## wt_vs_ko7 194 0 0
Invoke goseq/clusterprofiler on these genes.
## Found 95 go_db genes and 125 length_db genes out of 125.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## Found 5 go_db genes and 10 length_db genes out of 10.
## Found 98 go_db genes and 123 length_db genes out of 123.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## Found 18 go_db genes and 28 length_db genes out of 28.
## Found 17 go_db genes and 22 length_db genes out of 22.
## Found 41 go_db genes and 56 length_db genes out of 56.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
Check expression of genes expected to be lower
A nice detail came out today, the PTCs introduced by CRISPR included M13; I unfortunately did not think to ask which primer, but I should be able to figure that out trivially:
Start by checking an arbitrary ko sample, I should see a bunch of reads with at least one of the above.
cd preprocessing/06_HeLa_KO7_60hpi
xzgrep GTAAAACGACGGCCAGTG outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R1_001-trimmed.fastq.xz | wc
## M13 forward -20 vs. R1: 0 hits
xzgrep CACTGGCCGTCGTTTTAC outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R1_001-trimmed.fastq.xz | wc
## M13 forward -20 RC vs. R1: 20 hits
xzgrep GTAAAACGACGGCCAGTG outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R2_001-trimmed.fastq.xz | wc
## M13 forward -20 vs R2: 75 hits
xzgrep CACTGGCCGTCGTTTTAC outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R2_001-trimmed.fastq.xz | wc
## M13 forward -20 RC vs R2: 0 hits
xzgrep GGTTTTCCCAGTCACGAC outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R1_001-trimmed.fastq.xz | wc
## M13 forward -41 vs R1: 11 hits
xzgrep GTCGTGACTGGGAAAACC outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R1_001-trimmed.fastq.xz | wc
## M13 forward RC -41 vs R1:
xzgrep GGTTTTCCCAGTCACGAC outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R2_001-trimmed.fastq.xz | wc
## M13 forward -41 vs R2: 12
xzgrep GTCGTGACTGGGAAAACC outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R2_001-trimmed.fastq.xz | wc
## M13 forward -41 RC vs R2: 8
xzgrep GGAAACAGCTATGACCATG outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R1_001-trimmed.fastq.xz | wc
## M13 reverse -27 vs R1: 54
xzgrep CATGGTCATAGCTGTTTCC outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R1_001-trimmed.fastq.xz | wc
## M13 reverse -27 RC vs R1: 0
xzgrep GGAAACAGCTATGACCATG outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R2_001-trimmed.fastq.xz | wc
## M13 reverse -27 vs R1: 0
xzgrep CATGGTCATAGCTGTTTCC outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R2_001-trimmed.fastq.xz | wc
## M13 reverse -27 RC vs R1: 104
xzgrep AGCGGATAACAATTTCACAC outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R1_001-trimmed.fastq.xz | wc
## M13 reverse -48 vs R1: 286
xzgrep GTGTGAAATTGTTATCCGCT outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R1_001-trimmed.fastq.xz | wc
## M13 reverse -48 RC vs R1: 0
xzgrep AGCGGATAACAATTTCACAC outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R2_001-trimmed.fastq.xz | wc
## M13 reverse -48 vs R2: 0
xzgrep GTGTGAAATTGTTATCCGCT outputs/20251031trimomatic/06_HeLa_KO7_60hpi_2_S3_R2_001-trimmed.fastq.xz | wc
## M13 reverse -48 RC vs R2: 90 hitsCodify the above: I wrote a quick target in cyoa to seek out these sequences and extract the other read, e.g. if R1 has one of these sequences, it will pull out R2 and write it to a separate fastq file.
sequences="GTAAAACGACGGCCAGTG:GGTTTTCCCAGTCACGAC:GGAAACAGCTATGACCATG:AGCGGATAACAATTTCACAC"
samples=$(/bin/ls -d [0-9]*)
for s in ${samples}; do
pushd $s
input=$(/bin/ls outputs/*trimomatic/*_R1*-trimmed.fastq.xz)
library=$(/bin/ls outputs/*trimomatic/*_R2*-trimmed.fastq.xz)
cyoa --method getother --input $input --library $library --query $sequences
popd
doneI ran the above and was pleased to see that only the KO and AB samples contain any M13 sequence. I then did a little arbitrary BLASTing of the other reads. Weirdly, most of the hits were to GAPDH, but the second read I pulled aligned to Tc00.1047053509065.50, which is a synonym for TcCLB.509065.50 (~ 800,000 on TcChr32-P)
I then started searching through the set of reads extracted to see if I can find where the M13 sequences live. I have a screenshot from IGV suggesting that many/most/all of them are adjacent to GAPDH on chromosome 32P.
I decided to check and see the degree to which these genes should(not) be expected to map cleanly due to being members of a sprawling multi-gene family. I therefore extracted all genes annotated with ‘sialidase’ and from them extracted the group I members. In images/groupI_sialidase_phyML_tree.svg resides the resulting tree. They are not so similar as I feared.
Let us take a moment and look at a kmer tree of the 1524 groupx trans-sialidase genes.
## Reading kmer/sialidase.fasta