This document is intended to contain analyses which logically follow our transcriptomic analyses. It is therefore a bit of a grab bag; it may eventually comprise the variant search, but currently that is intertwined with the DE results. As a result, this and the ‘pre_visualization’ document are currently quite redundant.
I have a few methods of creating (phylogenetic) trees describing the relationship of the strains of interest in this data.
my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
"LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300",
"other")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI", "other")
The following block was generated as follows:
The hisat2-based alignments were passed to freebayes[ref], which generated a set of high-confidence, transcriptome-based vcf files for each sample of sufficient coverage. These were sorted, compressed, and indexed via bcftools[ref]; high-confidence variants (supported by more than 80% of reads at positions with coverage greater than 5 reads) were extracted into a table of variants and also used to modify the reference genome so that it represents the filtered variants. These modified genomes were then indexed and used to create a k-mer index and distance matrix; those distances were passed to ape[ref] to create a neighbor-joining tree.
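In outline, the R end of that process looks something like the following sketch. This is not the code that was actually run; it assumes the modified genomes live in compare_strains/ as fasta files, that Biostrings and ape are available, and it uses an arbitrary 6-mer size for illustration.

library(Biostrings)
library(ape)
fasta_files <- list.files("compare_strains", pattern = "\\.fasta$", full.names = TRUE)
## Count 6-mers per genome (summed over contigs) and normalize to frequencies.
kmer_freqs <- t(sapply(fasta_files, function(f) {
  contigs <- readDNAStringSet(f)
  counts <- colSums(oligonucleotideFrequency(contigs, width = 6))
  counts / sum(counts)
}))
rownames(kmer_freqs) <- gsub("\\.fasta$", "", basename(fasta_files))
## Euclidean distances between the k-mer profiles, then a neighbor-joining tree.
kmer_dist <- dist(kmer_freqs)
nj_tree <- nj(kmer_dist)
plot(nj_tree)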
I wrote a little function which, in theory, should make the above a bit simpler and more robust for future analyses. Let’s see if it works. It currently takes a directory containing the fasta files of the sequences to compare and an optional root.
strain_tree <- genomic_sequence_phylo("compare_strains")
## Reading compare_strains/lbraziliensis_2904_v46.fasta
## Reading compare_strains/leishmania_guyanensis_202209.fasta
## Reading compare_strains/lpanamensis_cds.fasta
## Reading compare_strains/lpanamensis_col_zymodeme_genes.fasta
## Reading compare_strains/lpanamensis_psc1_v46.fasta
## Reading compare_strains/lpanamensis_v36.fasta
## Reading compare_strains/lpanamensis_z21_cds.fasta
## Reading compare_strains/lpanamensis_z22_cds.fasta
## Reading compare_strains/lpanamensis_z23_cds.fasta
## Reading compare_strains/lpanamensis_z24_cds.fasta
## Reading compare_strains/strain_12444_modified_z24.fasta
## Reading compare_strains/strain_12588_modified_z21.fasta
## Reading compare_strains/strain_2168_modified_z23.fasta
## Reading compare_strains/strain_2272_modified_z22.fasta
plot(strain_tree$phylo)
In order to perform this, I will use the same fasta files, but extract the zymodeme genes from them and write out a set of fasta files containing their sequences. I therefore wrote a function which takes the annotation data and a fasta file and extracts the entries of interest.
Sadly, I will need to read in separate annotations for braziliensis, panamensis Panama, and any other sequences; but for the sequences extracted directly from panamensis Colombia, I can use the same annotations.
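For reference, here is a minimal sketch of what such an extraction might look like. This is not the actual write_cds_entries() implementation; the chromosome/start/end/strand column names, and gene IDs as row names of the annotation table, are assumptions made purely for illustration.

library(Biostrings)
## Hypothetical CDS extractor: pull the regions for a set of gene IDs out of a
## genome fasta and write them to a new fasta file.
extract_cds <- function(fasta, annot, ids, output, name_prefix = NULL) {
  genome <- readDNAStringSet(fasta)
  names(genome) <- sub(" .*$", "", names(genome))  ## keep only the sequence IDs
  wanted <- annot[ids, c("chromosome", "start", "end", "strand")]
  cds <- vapply(ids, function(id) {
    region <- subseq(genome[[wanted[id, "chromosome"]]],
                     start = as.numeric(wanted[id, "start"]),
                     end = as.numeric(wanted[id, "end"]))
    if (wanted[id, "strand"] == "-") {
      region <- reverseComplement(region)
    }
    as.character(region)
  }, character(1))
  cds <- DNAStringSet(cds)
  names(cds) <- if (is.null(name_prefix)) ids else paste(name_prefix, ids, sep = "_")
  writeXStringSet(cds, filepath = output)
  invisible(cds)
}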
wanted_ids <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
"LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300")
reference <- write_cds_entries("compare_strains/lpanamensis_v36.fasta", all_lp_annot,
ids = wanted_ids, output = "compare_strains/lpanamensis_cds.fasta")
modified_12588 <- write_cds_entries("compare_strains/strain_12588_modified_z21.fasta", all_lp_annot,
name_prefix = "z21", ids = wanted_ids, output = "compare_strains/lpanamensis_z21_cds.fasta")
modified_2272 <- write_cds_entries("compare_strains/strain_2272_modified_z22.fasta", all_lp_annot,
name_prefix = "z22", ids = wanted_ids, output = "compare_strains/lpanamensis_z22_cds.fasta")
modified_2168 <- write_cds_entries("compare_strains/strain_2168_modified_z23.fasta", all_lp_annot,
name_prefix = "z23", ids = wanted_ids, output = "compare_strains/lpanamensis_z23_cds.fasta")
modified_12444 <- write_cds_entries("compare_strains/strain_12444_modified_z24.fasta", all_lp_annot,
name_prefix = "z24", ids = wanted_ids, output = "compare_strains/lpanamensis_z24_cds.fasta")
Having written these files, I concatenated the zymodeme CDS sequences into a single sequence per strain and performed an MSA and maximum-likelihood tree using clustalo[ref] and PhyML[ref] via seaview[ref]. Sadly, there were only 12 informative sites across the 6 zymodeme-defining genes. Happily, the resulting tree looks pretty much exactly like my genome-based tree. I did not bother to add the other genomes because, with only 12 variant positions, it did not seem interesting.
PhyML based tree of the zymodeme sequences
Over the last couple of weeks, I redid all the variant searches with a newer (and, I think, more sensitive and more specific) variant tool. I also changed the script which interprets the results so that it can extract any tags from the output, instead of just the one or two my previous script handled. In addition, at least in theory, it can now provide the set of amino acid substitutions for every gene in species with or without introns (the intron handling is not really relevant for Leishmania panamensis).
However, as of this writing, I have not re-performed the same tasks with the 2016 data, primarily because it will require remapping all of the samples. As a result, for the moment I cannot combine the older and newer samples. Thus, any of the following blocks which use the 2016 data are currently disabled.
Note that the creation of the old_snps and new_snps data structures has been moved to the ‘datastructures’ document.
both_norm <- set_expt_conditions(both_snps, fact = "knnv2classification")
## The numbers of samples by condition are:
##
## null z10 z21 z22 z23 z24 z32
## 33 3 6 47 41 2 2
## strains <- both_norm[["design"]][["strain"]]
both_strain <- set_expt_conditions(both_norm, fact = "strain")
## The numbers of samples by condition are:
##
## Lp001 null s1022 s1320 s2189 s2271 s2272 s2504 s5397 s5430 s5433
## 6 101 4 1 4 1 4 4 4 4 1
The data structure ‘both_norm’ now contains our 2016 data along with the newer data collected since 2019.
The following plot shows the SNP profiles of all samples (old and new), where the colors at the top show the 2.2 strains (orange), the 2.3 strains (green), the previous samples (purple), or the various lab strains (pink, etc.).
new_variant_heatmap <- plot_corheat(new_snps)
dev <- pp(file = "images/raw_snp_corheat.png", height = 12, width = 12)
new_variant_heatmap$plot
closed <- dev.off()
new_variant_heatmap$plot
The function get_snp_sets() takes the provided metadata factor (in this case the knnhclusttogethercall classification) and looks for variants which are exclusive to each level of that factor, as well as the set shared among them. Here that primarily means looking for differences between the 2.2 and 2.3 groups.
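Conceptually, the exclusivity logic reduces to set operations over variant identifiers. The following is only a toy sketch of that idea, not the actual get_snp_sets() internals; the variant IDs below are made up for illustration.

## Hypothetical per-condition variant ID lists.
variant_ids <- list(
  "z22" = c("chr_LpaL13-34_pos_100_ref_A_alt_G", "chr_LpaL13-05_pos_200_ref_C_alt_T"),
  "z23" = c("chr_LpaL13-34_pos_100_ref_A_alt_G", "chr_LpaL13-12_pos_300_ref_G_alt_A"))
## Variants observed in every condition.
shared <- Reduce(intersect, variant_ids)
## Variants observed in exactly one condition and no other.
exclusive <- lapply(names(variant_ids), function(cond) {
  setdiff(variant_ids[[cond]], unlist(variant_ids[names(variant_ids) != cond]))
})
names(exclusive) <- names(variant_ids)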
snp_sets <- get_snp_sets(both_snps, factor = "knnhclusttogethercall")
## The samples represent the following categories:
##
## z10 z21 z22 z23 z32
## 3 11 40 42 2
## Using a proportion of observed variants, converting the data to binary observations.
## The factor unknown has 36 rows.
## The factor z10 has 3 rows.
## The factor z21 has 11 rows.
## The factor z22 has 40 rows.
## The factor z23 has 42 rows.
## The factor z32 has 2 rows.
## Finished iterating over the chromosomes.
Biobase::annotation(lp_previous$expressionset) <- Biobase::annotation(lp_expt$expressionset)
lp_knn <- set_expt_conditions(lp_expt, fact = "knnhclusttogethercall")
## The numbers of samples by condition are:
##
## null z10 z21 z22 z23 z32
## 3 3 11 40 42 2
both_expt <- combine_expts(lp_knn, lp_previous)
## The numbers of samples by condition are:
##
## z23 z22 z10 z32 z21 null undef sh chr inf
## 42 40 3 2 11 3 0 13 14 6
snp_genes <- snps_vs_genes(both_expt, snp_sets, expt_name_col = "chromosome")
## The snp grange data has 1611844 elements.
## There are 717876 overlapping variants and genes.
## I think we have some metrics here we can plot...s
snp_subset <- snp_subset_genes(
both_expt, both_snps,
genes = c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
"LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300"))
## Note, I renamed this to subset_genes().
## remove_genes_expt(), before removal, there were 1514127 genes, now there are 179.
## There are 134 samples which kept less than 90 percent counts.
## tmrc20001 tmrc20065 tmrc20005 tmrc20007 tmrc20008 tmrc20027 tmrc20028 tmrc20032 tmrc20040 tmrc20066 tmrc20039 tmrc20037 tmrc20038 tmrc20067 tmrc20068
## 0.0363994 0.0284342 0.0446300 0.0678958 0.0000000 0.0594351 0.0753021 0.0353544 0.0204152 0.0244539 0.0218095 0.0228205 0.0244650 0.0259861 0.0275633
## tmrc20041 tmrc20015 tmrc20009 tmrc20010 tmrc20016 tmrc20011 tmrc20012 tmrc20013 tmrc20017 tmrc20014 tmrc20018 tmrc20019 tmrc20070 tmrc20020 tmrc20021
## 0.0084708 0.0249880 0.0000000 0.0278667 0.0232143 0.0243409 0.0778398 0.0294979 0.0102837 0.0191370 0.0239034 0.0282985 0.0274939 0.0235432 0.0286477
## tmrc20022 tmrc20025 tmrc20024 tmrc20036 tmrc20069 tmrc20033 tmrc20026 tmrc20031 tmrc20076 tmrc20073 tmrc20055 tmrc20079 tmrc20071 tmrc20078 tmrc20094
## 0.0000000 0.0624989 0.0212984 0.0089603 0.0270318 0.0019682 0.0352553 0.0199682 0.0270505 0.0282364 0.0395199 0.0280768 0.0247556 0.0177539 0.0279169
## tmrc20042 tmrc20058 tmrc20072 tmrc20059 tmrc20048 tmrc20057 tmrc20088 tmrc20056 tmrc20060 tmrc20077 tmrc20074 tmrc20063 tmrc20053 tmrc20052 tmrc20064
## 0.0398656 0.0256906 0.0158464 0.0251221 0.0238737 0.0062348 0.0349161 0.0003009 0.0294943 0.0340198 0.0282781 0.0017690 0.0202252 0.0274156 0.0280939
## tmrc20075 tmrc20051 tmrc20050 tmrc20049 tmrc20062 tmrc20110 tmrc20080 tmrc20043 tmrc20083 tmrc20054 tmrc20085 tmrc20046 tmrc20093 tmrc20089 tmrc20047
## 0.0236363 0.0297070 0.0324800 0.0338522 0.0314505 0.0350400 0.0288995 0.0273341 0.0121428 0.0298779 0.0251593 0.0054459 0.0065508 0.0269888 0.0299476
## tmrc20090 tmrc20044 tmrc20045 tmrc20061 tmrc20105 tmrc20108 tmrc20109 tmrc20098 tmrc20096 tmrc20097 tmrc20101 tmrc20092 tmrc20082 tmrc20102 tmrc20099
## 0.0273432 0.0316706 0.0052405 0.0263012 0.0303013 0.0267368 0.0184037 0.0265627 0.0202590 0.0107502 0.0282691 0.0024627 0.0012027 0.0250142 0.0291960
## tmrc20100 tmrc20091 tmrc20084 tmrc20087 tmrc20103 tmrc20104 tmrc20086 tmrc20107 tmrc20081 tmrc20106 tmrc20095 hpgl0242 hpgl0243 hpgl0244 hpgl0245
## 0.0266043 0.0274472 0.0063494 0.0291851 0.0065616 0.0265140 0.0251956 0.0200527 0.0133293 0.0134896 0.0136510 0.0000000 0.0291184 0.0277722 0.0092573
## hpgl0246 hpgl0247 hpgl0248 hpgl0316 hpgl0318 hpgl0320 hpgl0322 hpgl0631 hpgl0632 hpgl0633 hpgl0634 hpgl0635 hpgl0636 hpgl0638 hpgl0639
## 0.0281687 0.0690201 0.0000000 0.0135501 0.1068376 0.0581666 0.0520414 0.0838203 0.0000000 0.0320164 0.0482116 0.0307934 0.0000000 0.0000000 0.0296412
## hpgl0641 hpgl0643 hpgl0651 hpgl0652 hpgl0653 hpgl0654 hpgl0655 hpgl0656 hpgl0658 hpgl0659 hpgl0660 hpgl0661 hpgl0662 hpgl0663
## 0.0249169 0.1094691 0.0864779 0.0000000 0.0367420 0.0405186 0.0353871 0.0000000 0.0849825 0.0000000 0.0381134 0.0333671 0.0289599 0.0000000
zymo_heat <- plot_sample_heatmap(
normalize_expt(snp_subset, transform = "log2"),
row_label = rownames(exprs(snp_subset)))
## transform_counts: Found 22797 values equal to 0, adding 1 to the matrix.
zymo_heat
most_variant <- head(snp_genes$summary_by_gene, n = 100)
least_variant <- tail(snp_genes$summary_by_gene, n = 100)
## test <- simple_goseq(names(most_variant), go_db = lp_go, length_db = lp_lengths)
Najib has asked a few times about the relationship between variants and DE genes. In subsequent conversations I figured out that what he really wants to know is whether variants in the UTRs (most likely the 5’) might affect the expression of genes. The following explicitly does not answer that question, but asks an analogous one: is there a relationship between variants in the CDS and differential expression?
In order to do this comparison, we need to reload some of the DE results.
rda <- glue::glue("rda/zymo_tables_nobatch-v{ver}.rda")
varname <- gsub(x = basename(rda), pattern = "\\.rda", replacement = "")
loaded <- load(file = rda)
zy_df <- get0(varname)[["data"]][["zymodeme"]]
rda <- glue::glue("rda/sus_tables_nobatch-v{ver}.rda")
varname <- gsub(x = basename(rda), pattern = "\\.rda", replacement = "")
loaded <- load(file = rda)
## Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file 'rda/sus_tables_nobatch-v202305.rda', probable reason 'No such file or directory'
## Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
sus_df <- get0(varname)[["data"]][["resistant_sensitive"]]
vars_df <- data.frame(ID = names(snp_genes[["count_by_gene"]]),
variants = as.numeric(snp_genes[["count_by_gene"]]))
vars_df <- merge(vars_df, lp_lengths, by = "ID")
vars_df[["length"]] <- as.numeric(vars_df[["length"]])
vars_df[["var_len"]] <- vars_df[["variants"]] / vars_df[["length"]]
vars_by_de_gene <- merge(zy_df, vars_df, by.x="row.names", by.y="ID")
rownames(vars_by_de_gene) <- vars_by_de_gene[["Row.names"]]
vars_by_de_gene[["Row.names"]] <- NULL
cor.test(vars_by_de_gene$deseq_logfc, vars_by_de_gene$var_len)
##
## Pearson's product-moment correlation
##
## data: vars_by_de_gene$deseq_logfc and vars_by_de_gene$var_len
## t = -5, df = 8533, p-value = 5e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07571 -0.03341
## sample estimates:
## cor
## -0.05458
variants_wrt_logfc <- plot_linear_scatter(vars_by_de_gene, xcol = "deseq_logfc",
ycol = "var_len", text_col = "annotgeneproduct")
scatter <- variants_wrt_logfc$scatter
plotly::ggplotly(scatter, tooltip = c("x", "y", "text"))
## It looks like there might be some genes of interest, even though this is not actually
## the question of interest.
Ok, I think I can do this on a UTR basis.
snp_utrs <- snps_vs_genes_padded(both_expt, snp_sets, expt_name_col = "chromosome")
## There are 29 genes with less than 200 nt. before the start of the chromosome on the plus strand
## There are 4485 plus strand features and 4112 minus strand features.
## Warning in .merge_two_Seqinfo_objects(x, y): Each of the 2 combined objects has sequence levels not in the other:
## - in 'x': LPAL13_SCAF000001, LPAL13_SCAF000011, LPAL13_SCAF000027, LPAL13_SCAF000028, LPAL13_SCAF000033, LPAL13_SCAF000048, LPAL13_SCAF000051, LPAL13_SCAF000065, LPAL13_SCAF000068, LPAL13_SCAF000071, LPAL13_SCAF000074, LPAL13_SCAF000075, LPAL13_SCAF000110, LPAL13_SCAF000114, LPAL13_SCAF000115, LPAL13_SCAF000119, LPAL13_SCAF000121, LPAL13_SCAF000124, LPAL13_SCAF000132, LPAL13_SCAF000136, LPAL13_SCAF000137, LPAL13_SCAF000142, LPAL13_SCAF000146, LPAL13_SCAF000153, LPAL13_SCAF000165, LPAL13_SCAF000166, LPAL13_SCAF000171, LPAL13_SCAF000172, LPAL13_SCAF000174, LPAL13_SCAF000176, LPAL13_SCAF000200, LPAL13_SCAF000201, LPAL13_SCAF000202, LPAL13_SCAF000203, LPAL13_SCAF000227, LPAL13_SCAF000245, LPAL13_SCAF000265, LPAL13_SCAF000266, LPAL13_SCAF000267, LPAL13_SCAF000271, LPAL13_SCAF000286, LPAL13_SCAF000287, LPAL13_SCAF000288, LPAL13_SCAF000292, LPAL13_SCAF000319, LPAL13_SCAF000347, LPAL13_SCAF000358, LPAL13_SCAF000364, LPAL13_SCAF000368, LPAL13_SCAF000421, LPAL13_SCAF000424, LPAL13_SCAF000446, LPAL13_SCAF000447, LPAL13_SCAF000463, LPAL13_SCAF000465, LPAL13_SCAF000492, LPAL13_SCAF000554, LPAL13_SCAF000555, LPAL13_SCAF000569, LPAL13_SCAF000573, LPAL13_SCAF000575, LPAL13_SCAF000576, LPAL13_SCAF000588, LPAL13_SCAF000590, LPAL13_SCAF000591, LPAL13_SCAF000593, LPAL13_SCAF000598, LPAL13_SCAF000599, LPAL13_SCAF000601, LPAL13_SCAF000619, LPAL13_SCAF000682, LPAL13_SCAF000700, LPAL13_SCAF000707, LPAL13_SCAF000713, LPAL13_SCAF000734, LPAL13_SCAF000735, LPAL13_SCAF000737, LPAL13_SCAF000760, LPAL13_SCAF000761, LPAL13_SCAF000762, LPAL13_SCAF000772, LPAL13_SCAF000779, LPAL13_SCAF000780, LPAL13_SCAF000805, LPAL13_SCAF000806, LPAL13_SCAF000812, LPAL13_SCAF000813, LPAL13_SCAF000815, LPAL13_SCAF000818
## - in 'y': LPAL13_SCAF000017, LPAL13_SCAF000021, LPAL13_SCAF000025, LPAL13_SCAF000039, LPAL13_SCAF000041, LPAL13_SCAF000044, LPAL13_SCAF000053, LPAL13_SCAF000062, LPAL13_SCAF000063, LPAL13_SCAF000076, LPAL13_SCAF000078, LPAL13_SCAF000080, LPAL13_SCAF000084, LPAL13_SCAF000089, LPAL13_SCAF000096, LPAL13_SCAF000097, LPAL13_SCAF000117, LPAL13_SCAF000188, LPAL13_SCAF000191, LPAL13_SCAF000194, LPAL13_SCAF000197, LPAL13_SCAF000206, LPAL13_SCAF000209, LPAL13_SCAF000211, LPAL13_SCAF000229, LPAL13_SCAF000230, LPAL13_SCAF000233, LPAL13_SCAF000235, LPAL13_SCAF000237, LPAL13_SCAF000239, LPAL13_SCAF000248, LPAL13_SCAF000250, LPAL13_SCAF000253, LPAL13_SCAF000295, LPAL13_SCAF000301, LPAL13_SCAF000302, LPAL13_SCAF000303, LPAL13_SCAF000314, LPAL13_SCAF000316, LPAL13_SCAF000328, LPAL13_SCAF000400, LPAL13_SCAF000403, LPAL13_SCAF000417, LPAL13_SCAF000420, LPAL13_SCAF000426, LPAL13_SCAF000434, LPAL13_SCAF000459, LPAL13_SCAF000461, LPAL13_SCAF000483, LPAL13_SCAF000496, LPAL13_SCAF000500, LPAL13_SCAF000503, LPAL13_SCAF000511, LPAL13_SCAF000512, LPAL13_SCAF000515, LPAL13_SCAF000522, LPAL13_SCAF000532, LPAL13_SCAF000539, LPAL13_SCAF000541, LPAL13_SCAF000544, LPAL13_SCAF000577, LPAL13_SCAF000582, LPAL13_SCAF000585, LPAL13_SCAF000603, LPAL13_SCAF000605, LPAL13_SCAF000610, LPAL13_SCAF000611, LPAL13_SCAF000618, LPAL13_SCAF000627, LPAL13_SCAF000637, LPAL13_SCAF000644, LPAL13_SCAF000645, LPAL13_SCAF000649, LPAL13_SCAF000650, LPAL13_SCAF000652, LPAL13_SCAF000653, LPAL13_SCAF000654, LPAL13_SCAF000656, LPAL13_SCAF000666, LPAL13_SCAF000697, LPAL13_SCAF000717, LPAL13_SCAF000720, LPAL13_SCAF000722, LPAL13_SCAF000723, LPAL13_SCAF000727, LPAL13_SCAF000738, LPAL13_SCAF000781, LPAL13_SCAF000783, LPAL13_SCAF000789, LPAL13_SCAF000797, LPAL13_SCAF000798, LPAL13_SCAF000799, LPAL13_SCAF000804, LPAL13_SCAF000816
## Make sure to always combine/compare objects based on the same reference
## genome (use suppressWarnings() to suppress this warning).
## Warning in .merge_two_Seqinfo_objects(x, y): Each of the 2 combined objects has sequence levels not in the other:
## - in 'x': LPAL13_SCAF000001, LPAL13_SCAF000011, LPAL13_SCAF000027, LPAL13_SCAF000028, LPAL13_SCAF000033, LPAL13_SCAF000048, LPAL13_SCAF000051, LPAL13_SCAF000065, LPAL13_SCAF000068, LPAL13_SCAF000071, LPAL13_SCAF000074, LPAL13_SCAF000075, LPAL13_SCAF000110, LPAL13_SCAF000114, LPAL13_SCAF000115, LPAL13_SCAF000119, LPAL13_SCAF000121, LPAL13_SCAF000124, LPAL13_SCAF000132, LPAL13_SCAF000136, LPAL13_SCAF000137, LPAL13_SCAF000142, LPAL13_SCAF000146, LPAL13_SCAF000153, LPAL13_SCAF000165, LPAL13_SCAF000166, LPAL13_SCAF000171, LPAL13_SCAF000172, LPAL13_SCAF000174, LPAL13_SCAF000176, LPAL13_SCAF000200, LPAL13_SCAF000201, LPAL13_SCAF000202, LPAL13_SCAF000203, LPAL13_SCAF000227, LPAL13_SCAF000245, LPAL13_SCAF000265, LPAL13_SCAF000266, LPAL13_SCAF000267, LPAL13_SCAF000271, LPAL13_SCAF000286, LPAL13_SCAF000287, LPAL13_SCAF000288, LPAL13_SCAF000292, LPAL13_SCAF000319, LPAL13_SCAF000347, LPAL13_SCAF000358, LPAL13_SCAF000364, LPAL13_SCAF000368, LPAL13_SCAF000421, LPAL13_SCAF000424, LPAL13_SCAF000446, LPAL13_SCAF000447, LPAL13_SCAF000463, LPAL13_SCAF000465, LPAL13_SCAF000492, LPAL13_SCAF000554, LPAL13_SCAF000555, LPAL13_SCAF000569, LPAL13_SCAF000573, LPAL13_SCAF000575, LPAL13_SCAF000576, LPAL13_SCAF000588, LPAL13_SCAF000590, LPAL13_SCAF000591, LPAL13_SCAF000593, LPAL13_SCAF000598, LPAL13_SCAF000599, LPAL13_SCAF000601, LPAL13_SCAF000619, LPAL13_SCAF000682, LPAL13_SCAF000700, LPAL13_SCAF000707, LPAL13_SCAF000713, LPAL13_SCAF000734, LPAL13_SCAF000735, LPAL13_SCAF000737, LPAL13_SCAF000760, LPAL13_SCAF000761, LPAL13_SCAF000762, LPAL13_SCAF000772, LPAL13_SCAF000779, LPAL13_SCAF000780, LPAL13_SCAF000805, LPAL13_SCAF000806, LPAL13_SCAF000812, LPAL13_SCAF000813, LPAL13_SCAF000815, LPAL13_SCAF000818
## - in 'y': LPAL13_SCAF000017, LPAL13_SCAF000021, LPAL13_SCAF000025, LPAL13_SCAF000039, LPAL13_SCAF000041, LPAL13_SCAF000044, LPAL13_SCAF000053, LPAL13_SCAF000062, LPAL13_SCAF000063, LPAL13_SCAF000076, LPAL13_SCAF000078, LPAL13_SCAF000080, LPAL13_SCAF000084, LPAL13_SCAF000089, LPAL13_SCAF000096, LPAL13_SCAF000097, LPAL13_SCAF000117, LPAL13_SCAF000188, LPAL13_SCAF000191, LPAL13_SCAF000194, LPAL13_SCAF000197, LPAL13_SCAF000206, LPAL13_SCAF000209, LPAL13_SCAF000211, LPAL13_SCAF000229, LPAL13_SCAF000230, LPAL13_SCAF000233, LPAL13_SCAF000235, LPAL13_SCAF000237, LPAL13_SCAF000239, LPAL13_SCAF000248, LPAL13_SCAF000250, LPAL13_SCAF000253, LPAL13_SCAF000295, LPAL13_SCAF000301, LPAL13_SCAF000302, LPAL13_SCAF000303, LPAL13_SCAF000314, LPAL13_SCAF000316, LPAL13_SCAF000328, LPAL13_SCAF000400, LPAL13_SCAF000403, LPAL13_SCAF000417, LPAL13_SCAF000420, LPAL13_SCAF000426, LPAL13_SCAF000434, LPAL13_SCAF000459, LPAL13_SCAF000461, LPAL13_SCAF000483, LPAL13_SCAF000496, LPAL13_SCAF000500, LPAL13_SCAF000503, LPAL13_SCAF000511, LPAL13_SCAF000512, LPAL13_SCAF000515, LPAL13_SCAF000522, LPAL13_SCAF000532, LPAL13_SCAF000539, LPAL13_SCAF000541, LPAL13_SCAF000544, LPAL13_SCAF000577, LPAL13_SCAF000582, LPAL13_SCAF000585, LPAL13_SCAF000603, LPAL13_SCAF000605, LPAL13_SCAF000610, LPAL13_SCAF000611, LPAL13_SCAF000618, LPAL13_SCAF000627, LPAL13_SCAF000637, LPAL13_SCAF000644, LPAL13_SCAF000645, LPAL13_SCAF000649, LPAL13_SCAF000650, LPAL13_SCAF000652, LPAL13_SCAF000653, LPAL13_SCAF000654, LPAL13_SCAF000656, LPAL13_SCAF000666, LPAL13_SCAF000697, LPAL13_SCAF000717, LPAL13_SCAF000720, LPAL13_SCAF000722, LPAL13_SCAF000723, LPAL13_SCAF000727, LPAL13_SCAF000738, LPAL13_SCAF000781, LPAL13_SCAF000783, LPAL13_SCAF000789, LPAL13_SCAF000797, LPAL13_SCAF000798, LPAL13_SCAF000799, LPAL13_SCAF000804, LPAL13_SCAF000816
## Make sure to always combine/compare objects based on the same reference
## genome (use suppressWarnings() to suppress this warning).
## The snp grange data has 1611844 elements.
## There are 92755 overlapping variants and 5' padded UTRs.
## There are 433499 overlapping variants and 3' padded UTRs.
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## Warning in Ops.factor(left, right): '/' not meaningful for factors
fivep_vars_df <- data.frame(ID = names(snp_utrs[["count_fivep_by_gene"]]),
variants = as.numeric(snp_utrs[["count_fivep_by_gene"]]))
fivep_vars_by_de_gene <- merge(zy_df, fivep_vars_df, by.x="row.names", by.y="ID")
rownames(fivep_vars_by_de_gene) <- fivep_vars_by_de_gene[["Row.names"]]
fivep_vars_by_de_gene[["Row.names"]] <- NULL
cor.test(fivep_vars_by_de_gene$deseq_logfc, fivep_vars_by_de_gene[["variants"]])
##
## Pearson's product-moment correlation
##
## data: fivep_vars_by_de_gene$deseq_logfc and fivep_vars_by_de_gene[["variants"]]
## t = 0.47, df = 8533, p-value = 0.6
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01611 0.02632
## sample estimates:
## cor
## 0.00511
fivep_variants_wrt_logfc <- plot_linear_scatter(fivep_vars_by_de_gene, xcol = "deseq_logfc",
ycol = "variants", text_col = "annotgeneproduct")
scatter <- fivep_variants_wrt_logfc$scatter
plotly::ggplotly(scatter, tooltip = c("x", "y", "text"))
threep_vars_df <- data.frame(ID = names(snp_utrs[["count_threep_by_gene"]]),
variants = as.numeric(snp_utrs[["count_threep_by_gene"]]))
threep_vars_by_de_gene <- merge(zy_df, threep_vars_df, by.x="row.names", by.y="ID")
rownames(threep_vars_by_de_gene) <- threep_vars_by_de_gene[["Row.names"]]
threep_vars_by_de_gene[["Row.names"]] <- NULL
cor.test(threep_vars_by_de_gene$deseq_logfc, threep_vars_by_de_gene[["variants"]])
##
## Pearson's product-moment correlation
##
## data: threep_vars_by_de_gene$deseq_logfc and threep_vars_by_de_gene[["variants"]]
## t = 2.1, df = 8533, p-value = 0.03
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.001659 0.044069
## sample estimates:
## cor
## 0.02287
threep_variants_wrt_logfc <- plot_linear_scatter(threep_vars_by_de_gene, xcol = "deseq_logfc",
ycol = "variants", text_col = "annotgeneproduct")
scatter <- threep_variants_wrt_logfc$scatter
plotly::ggplotly(scatter, tooltip = c("x", "y", "text"))
Didn’t I create a set of densities by chromosome? Oh, I think they come from get_snp_sets().
clinical_sets <- get_snp_sets(new_snps, factor = "clinicalresponse")
## The samples represent the following categories:
##
## cure failure failure miltefosine laboratory line
## 40 36 1 1
## laboratory line miltefosine resistant nd reference strain
## 1 18 4
## Using a proportion of observed variants, converting the data to binary observations.
## The factor cure has 40 rows.
## The factor failure has 36 rows.
## The factor failure miltefosine has only 1 row.
## The factor laboratory line has only 1 row.
## The factor laboratory line miltefosine resistant has only 1 row.
## The factor nd has 18 rows.
## The factor reference strain has 4 rows.
## Finished iterating over the chromosomes.
density_vec <- clinical_sets[["density"]]
chromosome_idx <- grep(pattern = "LpaL", x = names(density_vec))
density_df <- as.data.frame(density_vec[chromosome_idx])
density_df[["chr"]] <- rownames(density_df)
colnames(density_df) <- c("density_vec", "chr")
ggplot2::ggplot(density_df, ggplot2::aes(x = chr, y = density_vec)) +
ggplot2::geom_col() +
ggplot2::theme(axis.text = ggplot2::element_text(size = 10, colour = "black"),
axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))
## clinical_written <- write_variants(new_snps)
clinical_genes <- snps_vs_genes(lp_expt, clinical_sets, expt_name_col = "chromosome")
## The snp grange data has 1514127 elements.
## There are 678723 overlapping variants and genes.
snp_density <- merge(as.data.frame(clinical_genes[["count_by_gene"]]),
as.data.frame(fData(lp_expt)),
by = "row.names")
snp_density <- snp_density[, c(1, 2, 4, 15)]
colnames(snp_density) <- c("name", "snps", "product", "length")
snp_density[["product"]] <- tolower(snp_density[["product"]])
snp_density[["length"]] <- as.numeric(snp_density[["length"]])
snp_density[["density"]] <- snp_density[["snps"]] / snp_density[["length"]]
snp_idx <- order(snp_density[["density"]], decreasing = TRUE)
snp_density <- snp_density[snp_idx, ]
removers <- c("amastin", "gp63", "leishmanolysin")
for (r in removers) {
drop_idx <- grepl(pattern = r, x = snp_density[["product"]])
snp_density <- snp_density[!drop_idx, ]
}
## Filter these for [A|a]mastin gp63 Leishmanolysin
clinical_snps <- snps_intersections(lp_expt, clinical_sets, chr_column = "chromosome")
fail_ref_snps <- as.data.frame(clinical_snps[["inters"]][["failure, reference strain"]])
fail_ref_snps <- rbind(fail_ref_snps,
as.data.frame(clinical_snps[["inters"]][["failure"]]))
cure_snps <- as.data.frame(clinical_snps[["inters"]][["cure"]])
head(fail_ref_snps)
## [1] seqnames start end width strand
## <0 rows> (or 0-length row.names)
head(cure_snps)
## seqnames start end width strand
## chr_LpaL13-34_pos_1370748_ref_C_alt_G LpaL13-34 1370748 1370749 2 +
write.csv(file="csv/cure_variants.txt", x=rownames(cure_snps))
write.csv(file="csv/fail_variants.txt", x=rownames(fail_ref_snps))
annot <- fData(lp_expt)
clinical_interest <- as.data.frame(clinical_snps[["gene_summaries"]][["cure"]])
clinical_interest <- merge(clinical_interest,
as.data.frame(clinical_snps[["gene_summaries"]][["failure, reference strain"]]),
by = "row.names")
rownames(clinical_interest) <- clinical_interest[["Row.names"]]
clinical_interest[["Row.names"]] <- NULL
colnames(clinical_interest) <- c("cure_snps","fail_snps")
annot <- merge(annot, clinical_interest, by = "row.names")
rownames(annot) <- annot[["Row.names"]]
annot[["Row.names"]] <- NULL
fData(lp_expt$expressionset) <- annot
The heatmap produced here should show the variants only for the zymodeme genes.
I am thinking that if we find clusters of variant locations, they might provide some PCR testing possibilities.
## Drop the 2.1, 2.4, unknown, and null
pruned_snps <- subset_expt(new_snps, subset="condition=='z2.2'|condition=='z2.3'")
new_sets <- get_snp_sets(pruned_snps, factor = "zymodemecategorical")
summary(new_sets)
## 1000000: 2.2
## 0100000: 2.3
summary(new_sets[["intersections"]][["10"]])
write.csv(file="csv/variants_22.csv", x=new_sets[["intersections"]][["10"]])
summary(new_sets[["intersections"]][["01"]])
write.csv(file="csv/variants_23.csv", x=new_sets[["intersections"]][["01"]])
Thus we see that there are 3,553 variants associated with 2.2 and 81,589 associated with 2.3.
The following function uses the positional data to look for sequential mismatches associated with zymodeme in the hopes that there will be some regions which would provide good potential targets for a PCR-based assay.
sequential_variants <- function(snp_sets, conditions = NULL, minimum = 3, maximum_separation = 3) {
if (is.null(conditions)) {
conditions <- 1
}
intersection_sets <- snp_sets[["intersections"]]
intersection_names <- snp_sets[["set_names"]]
chosen_intersection <- 1
if (is.numeric(conditions)) {
chosen_intersection <- conditions
} else {
intersection_idx <- intersection_names == conditions
chosen_intersection <- names(intersection_names)[intersection_idx]
}
possible_positions <- intersection_sets[[chosen_intersection]]
position_table <- data.frame(row.names = possible_positions)
pat <- "^chr_(.+)_pos_(.+)_ref_.*$"
position_table[["chr"]] <- gsub(pattern = pat, replacement = "\\1", x = rownames(position_table))
position_table[["pos"]] <- as.numeric(gsub(pattern = pat, replacement = "\\2", x = rownames(position_table)))
position_idx <- order(position_table[, "chr"], position_table[, "pos"])
position_table <- position_table[position_idx, ]
position_table[["dist"]] <- 0
last_chr <- ""
for (r in 1:nrow(position_table)) {
this_chr <- position_table[r, "chr"]
if (r == 1) {
position_table[r, "dist"] <- position_table[r, "pos"]
last_chr <- this_chr
next
}
if (this_chr == last_chr) {
position_table[r, "dist"] <- position_table[r, "pos"] - position_table[r - 1, "pos"]
} else {
position_table[r, "dist"] <- position_table[r, "pos"]
}
last_chr <- this_chr
}
## Working interactively here.
doubles <- position_table[["dist"]] == 1
doubles <- position_table[doubles, ]
write.csv(doubles, "doubles.csv")
one_away <- position_table[["dist"]] == 2
one_away <- position_table[one_away, ]
write.csv(one_away, "one_away.csv")
two_away <- position_table[["dist"]] == 3
two_away <- position_table[two_away, ]
write.csv(two_away, "two_away.csv")
combined <- rbind(doubles, one_away)
combined <- rbind(combined, two_away)
position_idx <- order(combined[, "chr"], combined[, "pos"])
combined <- combined[position_idx, ]
last_chr <- ""
for (r in 1:nrow(combined)) {
this_chr <- combined[r, "chr"]
if (r == 1) {
combined[r, "dist_pair"] <- combined[r, "pos"]
last_chr <- this_chr
next
}
if (this_chr == last_chr) {
combined[r, "dist_pair"] <- combined[r, "pos"] - combined[r - 1, "pos"]
} else {
combined[r, "dist_pair"] <- combined[r, "pos"]
}
last_chr <- this_chr
}
dist_pair_maximum <- 1000
dist_pair_minimum <- 200
dist_pair_idx <- combined[["dist_pair"]] <= dist_pair_maximum &
combined[["dist_pair"]] >= dist_pair_minimum
remaining <- combined[dist_pair_idx, ]
no_weak_idx <- grepl(pattern="ref_(G|C)", x=rownames(remaining))
remaining <- remaining[no_weak_idx, ]
print(head(table(position_table[["dist"]])))
sequentials <- position_table[["dist"]] <= maximum_separation
message("There are ", sum(sequentials), " candidate regions.")
## The following can tell me how many runs of each length occurred, that is not quite what I want.
## Now use run length encoding to find the set of sequential sequentials!
rle_result <- rle(sequentials)
rle_values <- rle_result[["values"]]
## The following line is equivalent to just leaving values alone:
## true_values <- rle_result[["values"]] == TRUE
rle_lengths <- rle_result[["lengths"]]
true_sequentials <- rle_lengths[rle_values]
rle_idx <- cumsum(rle_lengths)[which(rle_values)]
position_table[["last_sequential"]] <- 0
count <- 0
for (r in rle_idx) {
count <- count + 1
position_table[r, "last_sequential"] <- true_sequentials[count]
}
message("The maximum sequential set is: ", max(position_table[["last_sequential"]]), ".")
wanted_idx <- position_table[["last_sequential"]] >= minimum
wanted <- position_table[wanted_idx, c("chr", "pos")]
return(wanted)
}
zymo22_sequentials <- sequential_variants(new_sets, conditions = "z22", minimum=1, maximum_separation=2)
dim(zymo22_sequentials)
## 7 candidate regions for zymodeme 2.2 -- thus I am betting that the reference strain is a 2.2
zymo23_sequentials <- sequential_variants(new_sets, conditions = "z23",
minimum = 2, maximum_separation = 2)
dim(zymo23_sequentials)
## In contrast, there are lots (587) of interesting regions for 2.3!
The first 4 candidate regions from my set of remaining:

* Chr Pos. Distance
* LpaL13-15 238433 448
* LpaL13-18 142844 613
* LpaL13-29 830342 252
* LpaL13-33 1331507 843
Let’s define a couple of terms:

* Third: each of the 4 positions above.
* Second: Third - Distance
* End: Third + PrimerLen
* Start: Second - PrimerLen
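That arithmetic is easy to wrap in a small helper. The following is only a sketch (candidate_region() is not a function used elsewhere in this document); it assumes genome is a named list of chromosome sequences, as it is used below, and defaults to the 22 nt primer length from the manual blocks.

candidate_region <- function(genome, chr, third, amplicon_length, primer_length = 22) {
  second <- third - amplicon_length
  start <- second - primer_length
  end <- third + primer_length
  chr_seq <- genome[[chr]]
  list(
    ## The full region: 5' primer + amplicon + 3' primer.
    region = subseq(chr_seq, start, end),
    ## The 5' primer sequence, read off the plus strand.
    fivep = as.character(subseq(chr_seq, start, second)),
    ## The 3' primer, reverse-complemented so it reads 5' -> 3'.
    threep = spgs::reverseComplement(as.character(subseq(chr_seq, third, end))))
}
## e.g.: candidate_region(genome, "LpaL13_15", third = 238433, amplicon_length = 448)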
In each instance, these are the last positions of their runs, so for each candidate we want to grab three things: the full region, the 5’ primer sequence, and the reverse complement of the 3’ primer sequence.
## * LpaL13-15 238433 448
first_candidate_chr <- genome[["LpaL13_15"]]
primer_length <- 22
amplicon_length <- 448
first_candidate_third <- 238433
first_candidate_second <- first_candidate_third - amplicon_length
first_candidate_start <- first_candidate_second - primer_length
first_candidate_end <- first_candidate_third + primer_length
first_candidate_region <- subseq(first_candidate_chr, first_candidate_start, first_candidate_end)
first_candidate_region
first_candidate_5p <- subseq(first_candidate_chr, first_candidate_start, first_candidate_second)
as.character(first_candidate_5p)
first_candidate_3p <- spgs::reverseComplement(subseq(first_candidate_chr, first_candidate_third, first_candidate_end))
first_candidate_3p
## * LpaL13-18 142844 613
second_candidate_chr <- genome[["LpaL13_18"]]
primer_length <- 22
amplicon_length <- 613
second_candidate_third <- 142844
second_candidate_second <- second_candidate_third - amplicon_length
second_candidate_start <- second_candidate_second - primer_length
second_candidate_end <- second_candidate_third + primer_length
second_candidate_region <- subseq(second_candidate_chr, second_candidate_start, second_candidate_end)
second_candidate_region
second_candidate_5p <- subseq(second_candidate_chr, second_candidate_start, second_candidate_second)
as.character(second_candidate_5p)
second_candidate_3p <- spgs::reverseComplement(subseq(second_candidate_chr, second_candidate_third, second_candidate_end))
second_candidate_3p
## * LpaL13-29 830342 252
third_candidate_chr <- genome[["LpaL13_29"]]
primer_length <- 22
amplicon_length <- 252
third_candidate_third <- 830342
third_candidate_second <- third_candidate_third - amplicon_length
third_candidate_start <- third_candidate_second - primer_length
third_candidate_end <- third_candidate_third + primer_length
third_candidate_region <- subseq(third_candidate_chr, third_candidate_start, third_candidate_end)
third_candidate_region
third_candidate_5p <- subseq(third_candidate_chr, third_candidate_start, third_candidate_second)
as.character(third_candidate_5p)
third_candidate_3p <- spgs::reverseComplement(subseq(third_candidate_chr, third_candidate_third, third_candidate_end))
third_candidate_3p
## You are a garbage polypyrimidine tract.
## Which is actually interesting if the mutations mess it up.
## * LpaL13-33 1331507 843
fourth_candidate_chr <- genome[["LpaL13_33"]]
primer_length <- 22
amplicon_length <- 843
fourth_candidate_third <- 1331507
fourth_candidate_second <- fourth_candidate_third - amplicon_length
fourth_candidate_start <- fourth_candidate_second - primer_length
fourth_candidate_end <- fourth_candidate_third + primer_length
fourth_candidate_region <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_end)
fourth_candidate_region
fourth_candidate_5p <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_second)
as.character(fourth_candidate_5p)
fourth_candidate_3p <- spgs::reverseComplement(subseq(fourth_candidate_chr, fourth_candidate_third, fourth_candidate_end))
fourth_candidate_3p
I made a fun little function which should find regions that have lots of variants associated with a given experimental factor.
pheno <- subset_expt(lp_expt, subset = "condition=='z2.2'|condition=='z2.3'")
pheno <- subset_expt(pheno, subset = "!is.na(pData(pheno)[['bcftable']])")
pheno_snps <- sm(count_expt_snps(pheno, annot_column = "bcftable"))
fun_stuff <- snp_density_primers(
pheno_snps,
bsgenome = "BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53",
gff = "reference/TriTrypDB-53_LpanamensisMHOMCOL81L13.gff")
drop_scaffolds <- grepl(x = rownames(fun_stuff$favorites), pattern = "SCAF")
favorite_primer_regions <- fun_stuff[["favorites"]][!drop_scaffolds, ]
favorite_primer_regions[["bin"]] <- rownames(favorite_primer_regions)
library(dplyr)
favorite_primer_regions <- favorite_primer_regions %>%
relocate(bin)
Here is my note from our meeting:
Cross-reference the primers to the DE genes of 2.2/2.3 and/or resistant/susceptible, and add a column to the primer spreadsheet with the DE genes (in retrospect, I am guessing this actually means to put the logFC in a column).
One nice thing: I did a semantic removal on the lp_expt, so the set of logFC/p-values should not contain any of the offending gene types; thus I should be able to automagically get rid of them in the merge.
logfc_columns <- zy_df[, c("deseq_logfc", "deseq_adjp")]
colnames(logfc_columns) <- c("z23_logfc", "z23_adjp")
new_table <- merge(favorite_primer_regions, logfc_columns,
by.x = "closest_gene_before_id", by.y = "row.names")
sus_columns <- sus_df[, c("deseq_logfc", "deseq_adjp")]
colnames(sus_columns) <- c("sus_logfc", "sus_adjp")
new_table <- merge(new_table, sus_columns,
by.x = "closest_gene_before_id", by.y = "row.names") %>%
relocate(bin)
written <- write_xlsx(data=new_table,
excel="excel/favorite_primers_xref_zy_sus.xlsx")
We can cross-reference the variants against the zymodeme status, plot a heatmap of the results, and hopefully see how they separate.
snp_genes <- snps_vs_genes(lp_expt, new_sets, expt_name_col = "chromosome")
## Error in snps_vs_genes(lp_expt, new_sets, expt_name_col = "chromosome"): object 'new_sets' not found
clinical_colors_v2 <- list(
"z22" = "#0000cc",
"z23" = "#cc0000")
new_zymo_norm <- normalize_expt(pruned_snps, norm = "quant") %>%
set_expt_conditions(fact = "zymodemecategorical") %>%
set_expt_colors(clinical_colors_v2)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'expt' in selecting a method for function 'state': object 'pruned_snps' not found
zymo_heat <- plot_disheat(new_zymo_norm)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'expt_data' in selecting a method for function 'plot_heatmap': object 'new_zymo_norm' not found
pp(file = "images/onlyz22_z23_snp_heatmap.png", width=12, height=12)
zymo_heat$plot
## NULL
closed <- dev.off()
zymo_heat[["plot"]]
## NULL
Now let us try to make a heatmap which includes some of the annotation data.
des <- pData(both_norm)
zymo_column <- "zymodemecategorical"
undef_idx <- is.na(des[[zymo_column]])
des[undef_idx, "strain"] <- "unknown"
##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(exprs(both_norm))
na_idx <- is.na(correlations)
correlations[na_idx] <- 0
zymo_missing_idx <- is.na(des[[zymo_column]])
des[[zymo_column]] <- as.character(des[[zymo_column]])
des[["clinicalcategorical"]] <- as.character(des[["clinicalcategorical"]])
des[zymo_missing_idx, zymo_column] <- "unknown"
mydendro <- list(
"clustfun" = hclust,
"lwd" = 2.0)
col_data <- as.data.frame(des[, c(zymo_column, "clinicalcategorical")])
colnames(col_data) <- c("zymodeme", "outcome")
unknown_clinical <- is.na(col_data[["outcome"]])
col_data[unknown_clinical, "outcome"] <- "undefined"
row_data <- as.data.frame(des[, c("sus_category_current", "knnv2classification")])
colnames(row_data) <- c("susceptibility", "mlclass")
myannot <- list(
"Col" = list("data" = col_data),
"Row" = list("data" = row_data))
myclust <- list("cuth" = 1.0,
"col" = BrewerClusterCol)
mylabs <- list(
"Row" = list("nrow" = 4),
"Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(380)
zymo_annot_heat <- annHeatmap2(
correlations,
dendrogram = mydendro,
annotation = myannot,
cluster = myclust,
labels = mylabs,
## The following controls if the picture is symmetric
scale = "none",
col = hmcols)
dev <- pp(file = "images/dendro_heatmap.png", height = 20, width = 20)
plot(zymo_annot_heat)
closed <- dev.off()
plot(zymo_annot_heat)
Print the larger heatmap so that all the labels appear. Keep in mind that as we get more samples, this image needs to continue getting bigger.
big heatmap
xref_prop <- table(pheno_snps[["conditions"]])
pheno_snps$conditions
idx_tbl <- exprs(pheno_snps) > 5
new_tbl <- data.frame(row.names = rownames(exprs(pheno_snps)))
for (n in names(xref_prop)) {
new_tbl[[n]] <- 0
idx_cols <- which(pheno_snps[["conditions"]] == n)
prop_col <- rowSums(idx_tbl[, idx_cols]) / xref_prop[n]
new_tbl[n] <- prop_col
}
keepers <- grepl(x = rownames(new_tbl), pattern = "LpaL13")
new_tbl <- new_tbl[keepers, ]
new_tbl[["strong22"]] <- 1.001 - new_tbl[["z2.2"]]
new_tbl[["strong23"]] <- 1.001 - new_tbl[["z2.3"]]
s22_na <- new_tbl[["strong22"]] > 1
new_tbl[s22_na, "strong22"] <- 1
s23_na <- new_tbl[["strong23"]] > 1
new_tbl[s23_na, "strong23"] <- 1
new_tbl[["SNP"]] <- rownames(new_tbl)
new_tbl[["Chromosome"]] <- gsub(x = new_tbl[["SNP"]], pattern = "chr_(.*)_pos_.*", replacement = "\\1")
new_tbl[["Position"]] <- gsub(x = new_tbl[["SNP"]], pattern = ".*_pos_(\\d+)_.*", replacement = "\\1")
new_tbl <- new_tbl[, c("SNP", "Chromosome", "Position", "strong22", "strong23")]
library(CMplot)
simplify <- new_tbl
simplify[["strong22"]] <- NULL
CMplot(simplify, bin.size = 100000)
CMplot(new_tbl, plot.type="m", multracks=TRUE, threshold = c(0.01, 0.05),
threshold.lwd=c(1,1), threshold.col=c("black","grey"),
amplify=TRUE, bin.size=10000,
chr.den.col=c("darkgreen", "yellow", "red"),
signal.col=c("red", "green", "blue"),
signal.cex=1, file="jpg", memo="", dpi=300, file.output=TRUE, verbose=TRUE)
This tool looks a little opaque, but it provides sample data that makes sense to me and should be pretty easy to recapitulate with our data.
## For this, let us use the 'new_snps' data structure.
## Caveat here: these need to be coerced to numbers.
my_covariates <- pData(new_snps)[, c(zymo_column, "clinicalcategorical")]
for (col in colnames(my_covariates)) {
my_covariates[[col]] <- as.numeric(as.factor(my_covariates[[col]]))
}
my_covariates <- t(my_covariates)
my_geneloc <- fData(lp_expt)[, c("gid", "chromosome", "start", "end")]
colnames(my_geneloc) <- c("geneid", "chr", "left", "right")
my_ge <- exprs(normalize_expt(lp_expt, transform = "log2", filter = TRUE, convert = "cpm"))
used_samples <- tolower(colnames(my_ge)) %in% colnames(exprs(new_snps))
my_ge <- my_ge[, used_samples]
## MatrixEQTL wants a snpsloc table with columns: snpid, chr, pos.
my_snpsloc <- data.frame(snpid = rownames(exprs(new_snps)))
## Oh, caveat here: Because of the way I stored the data,
## I could have duplicate rows which presumably will make matrixEQTL sad
my_snpsloc[["chr"]] <- gsub(pattern = "^chr_(.+)_pos_(.+)_ref_.*$", replacement = "\\1",
x = my_snpsloc[["snpid"]])
my_snpsloc[["pos"]] <- gsub(pattern = "^chr_(.+)_pos_(.+)_ref_.*$", replacement = "\\2",
x = my_snpsloc[["snpid"]])
test <- duplicated(my_snpsloc)
## Each duplicated row would be another variant at that position;
## so in theory we would do a rle to number them I am guessing
## However, I do not have different variants so I think I can ignore this for the moment
## but will need to make my matrix either 0 or 1.
if (sum(test) > 0) {
message("There are: ", sum(duplicated), " duplicated entries.")
keep_idx <- ! test
my_snpsloc <- my_snpsloc[keep_idx, ]
}
my_snps <- exprs(new_snps)
one_idx <- my_snps > 0
my_snps[one_idx] <- 1
## Ok, at this point I think I have all the pieces which this method wants...
## Oh, no I guess not; it actually wants the data as a set of filenames...
library(MatrixEQTL)
write.table(my_snps, "eqtl/snps.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snps, "eqtl/snps.tsv", )
write.table(my_snpsloc, "eqtl/snpsloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snpsloc, "eqtl/snpsloc.tsv")
write.table(as.data.frame(my_ge), "eqtl/ge.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_ge), "eqtl/ge.tsv")
write.table(as.data.frame(my_geneloc), "eqtl/geneloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_geneloc), "eqtl/geneloc.tsv")
write.table(as.data.frame(my_covariates), "eqtl/covariates.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_covariates), "eqtl/covariates.tsv")
useModel = modelLINEAR # modelANOVA, modelLINEAR, or modelLINEAR_CROSS
# Genotype file name
SNP_file_name = "eqtl/snps.tsv"
snps_location_file_name = "eqtl/snpsloc.tsv"
expression_file_name = "eqtl/ge.tsv"
gene_location_file_name = "eqtl/geneloc.tsv"
covariates_file_name = "eqtl/covariates.tsv"
# Output file name
output_file_name_cis = tempfile()
output_file_name_tra = tempfile()
# Only associations significant at this level will be saved
pvOutputThreshold_cis = 0.1
pvOutputThreshold_tra = 0.1
# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric()
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");
# Distance for local gene-SNP pairs
cisDist = 1e6
## Load genotype data
snps = SlicedData$new()
snps$fileDelimiter = "\t" # the TAB character
snps$fileOmitCharacters = "NA" # denote missing values;
snps$fileSkipRows = 1 # one row of column labels
snps$fileSkipColumns = 1 # one column of row labels
snps$fileSliceSize = 2000 # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name)
## Load gene expression data
gene = SlicedData$new()
gene$fileDelimiter = "\t" # the TAB character
gene$fileOmitCharacters = "NA" # denote missing values;
gene$fileSkipRows = 1 # one row of column labels
gene$fileSkipColumns = 1 # one column of row labels
gene$fileSliceSize = 2000 # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name)
## Load covariates
cvrt = SlicedData$new()
cvrt$fileDelimiter = "\t" # the TAB character
cvrt$fileOmitCharacters = "NA" # denote missing values;
cvrt$fileSkipRows = 1 # one row of column labels
cvrt$fileSkipColumns = 1 # one column of row labels
if(length(covariates_file_name) > 0) {
cvrt$LoadFile(covariates_file_name)
}
## Run the analysis
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE)
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE)
me = Matrix_eQTL_main(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name_tra,
pvOutputThreshold = pvOutputThreshold_tra,
useModel = useModel,
errorCovariance = errorCovariance,
verbose = TRUE,
output_file_name.cis = output_file_name_cis,
pvOutputThreshold.cis = pvOutputThreshold_cis,
snpspos = snpspos,
genepos = genepos,
cisDist = cisDist,
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE);
if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
message("This is hpgltools commit: ", get_git_commit())
message("Saving to ", savefile)
## tmp <- sm(saveme(filename = savefile))
}
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 4671781c252308412d318cdcf1c4d111e9b757d7
## This is hpgltools commit: Mon Jun 26 15:16:57 2023 -0400: 4671781c252308412d318cdcf1c4d111e9b757d7
## Saving to tmrc2_post_visualization_202305.rda.xz
tmp <- loadme(filename = savefile)