A quick introduction. I received 5 samples of DNA sequencing data with a question: are there some shared/unique variants compared to the reference genome. As a reminder, these are orn deletion strains of pseudomonas.
I did the following:
The agreement is quite nice between the tools, which is encouraging. My tool is perhaps too conservative and suggests that more coverage is needed. Snippy suggests to me that the coverage is sufficient. I think I agree with snippy at this time.
There are a series of images in the igv/ directory which in theory back up these suppositions. I took one picture for each variant observed by snippy. Now I am going to read the results from it and look to see how many of these variants are shared among all samples vs.ย are unique.
Depending on what these results look like, I may do some venn or other shenanigans.
Note to self, I keep forgetting where orn lives: 5,827,671
Names of transposons and their locations: Samples 3, 7: yciB: 1,980,222 Samples 2, 6: yciI: 1,979,921 Samples 4, 8: fleQ: 4,460,872 Samples 5, 9: PA14_27420: 2,379,170 Sample 12: dnR: 595,954 Sample 13: potB: 1,514,941
samples <- c("070919_21", "070919_32", "070919_54",
"070919_65", "070919_76")
csv_files <- paste0("preprocessing/201907/", samples, "/outputs/snippy_paeruginosa_pa14/snps.csv")
## We got some new samples!!
new_samples <- c("080519_76", "080519_77", "080519_78",
"190808_01", "190808_02", "190808_03",
"190808_04", "190808_05")
new_csv_files <- paste0("preprocessing/201908/", new_samples, "/outputs/snippy_paeruginosa_pa14/snps.csv")
csv_files <- c(csv_files, new_csv_files)
samples <- c(samples, new_samples)
snp_df <- data.frame()
for (f in 1:length(new_csv_files)) {
name <- new_samples[f]
file <- new_csv_files[f]
tmpdf <- readr::read_csv(file)
tmpdf[["encoded"]] <- glue::glue("{tmpdf[['POS']]}_{tmpdf[['TYPE']]}_{tmpdf[['REF']]}_{tmpdf[['ALT']]}")
tmpdf[["count"]] <- as.numeric(gsub(x=tmpdf[["EVIDENCE"]], pattern="^\\w+:(\\d+) .*$", replacement="\\1"))
tmpdf <- tmpdf[, c("encoded", "count")]
colnames(tmpdf) <- c("encoded", name)
if (f == 1) {
snp_df <- tmpdf
} else {
snp_df <- merge(snp_df, tmpdf, by="encoded", all=TRUE)
}
}
## Parsed with column specification:
## cols(
## CHROM = col_character(),
## POS = col_double(),
## TYPE = col_character(),
## REF = col_character(),
## ALT = col_character(),
## EVIDENCE = col_character(),
## FTYPE = col_character(),
## STRAND = col_character(),
## NT_POS = col_character(),
## AA_POS = col_character(),
## EFFECT = col_character(),
## LOCUS_TAG = col_character(),
## GENE = col_character(),
## PRODUCT = col_character()
## )
## Warning: 185 parsing failures.
## row col expected actual file
## 1 -- 14 columns 6 columns 'preprocessing/201908/080519_76/outputs/snippy_paeruginosa_pa14/snps.csv'
## 2 -- 14 columns 6 columns 'preprocessing/201908/080519_76/outputs/snippy_paeruginosa_pa14/snps.csv'
## 3 -- 14 columns 6 columns 'preprocessing/201908/080519_76/outputs/snippy_paeruginosa_pa14/snps.csv'
## 4 -- 14 columns 6 columns 'preprocessing/201908/080519_76/outputs/snippy_paeruginosa_pa14/snps.csv'
## 5 -- 14 columns 6 columns 'preprocessing/201908/080519_76/outputs/snippy_paeruginosa_pa14/snps.csv'
## ... ... .......... ......... .........................................................................
## See problems(...) for more details.
## Parsed with column specification:
## cols(
## CHROM = col_character(),
## POS = col_double(),
## TYPE = col_character(),
## REF = col_character(),
## ALT = col_character(),
## EVIDENCE = col_character(),
## FTYPE = col_character(),
## STRAND = col_character(),
## NT_POS = col_character(),
## AA_POS = col_character(),
## EFFECT = col_character(),
## LOCUS_TAG = col_character(),
## GENE = col_character(),
## PRODUCT = col_character()
## )
## Warning: 189 parsing failures.
## row col expected actual file
## 1 -- 14 columns 6 columns 'preprocessing/201908/080519_77/outputs/snippy_paeruginosa_pa14/snps.csv'
## 2 -- 14 columns 6 columns 'preprocessing/201908/080519_77/outputs/snippy_paeruginosa_pa14/snps.csv'
## 3 -- 14 columns 6 columns 'preprocessing/201908/080519_77/outputs/snippy_paeruginosa_pa14/snps.csv'
## 4 -- 14 columns 6 columns 'preprocessing/201908/080519_77/outputs/snippy_paeruginosa_pa14/snps.csv'
## 5 -- 14 columns 6 columns 'preprocessing/201908/080519_77/outputs/snippy_paeruginosa_pa14/snps.csv'
## ... ... .......... ......... .........................................................................
## See problems(...) for more details.
## Parsed with column specification:
## cols(
## CHROM = col_character(),
## POS = col_double(),
## TYPE = col_character(),
## REF = col_character(),
## ALT = col_character(),
## EVIDENCE = col_character(),
## FTYPE = col_character(),
## STRAND = col_character(),
## NT_POS = col_character(),
## AA_POS = col_character(),
## EFFECT = col_character(),
## LOCUS_TAG = col_character(),
## GENE = col_character(),
## PRODUCT = col_character()
## )
## Warning: 164 parsing failures.
## row col expected actual file
## 1 -- 14 columns 6 columns 'preprocessing/201908/080519_78/outputs/snippy_paeruginosa_pa14/snps.csv'
## 2 -- 14 columns 6 columns 'preprocessing/201908/080519_78/outputs/snippy_paeruginosa_pa14/snps.csv'
## 3 -- 14 columns 6 columns 'preprocessing/201908/080519_78/outputs/snippy_paeruginosa_pa14/snps.csv'
## 4 -- 14 columns 6 columns 'preprocessing/201908/080519_78/outputs/snippy_paeruginosa_pa14/snps.csv'
## 5 -- 14 columns 6 columns 'preprocessing/201908/080519_78/outputs/snippy_paeruginosa_pa14/snps.csv'
## ... ... .......... ......... .........................................................................
## See problems(...) for more details.
## Parsed with column specification:
## cols(
## CHROM = col_character(),
## POS = col_double(),
## TYPE = col_character(),
## REF = col_character(),
## ALT = col_character(),
## EVIDENCE = col_character(),
## FTYPE = col_character(),
## STRAND = col_character(),
## NT_POS = col_character(),
## AA_POS = col_character(),
## EFFECT = col_character(),
## LOCUS_TAG = col_character(),
## GENE = col_character(),
## PRODUCT = col_character()
## )
## Warning: 179 parsing failures.
## row col expected actual file
## 1 -- 14 columns 6 columns 'preprocessing/201908/190808_01/outputs/snippy_paeruginosa_pa14/snps.csv'
## 2 -- 14 columns 6 columns 'preprocessing/201908/190808_01/outputs/snippy_paeruginosa_pa14/snps.csv'
## 3 -- 14 columns 6 columns 'preprocessing/201908/190808_01/outputs/snippy_paeruginosa_pa14/snps.csv'
## 4 -- 14 columns 6 columns 'preprocessing/201908/190808_01/outputs/snippy_paeruginosa_pa14/snps.csv'
## 5 -- 14 columns 6 columns 'preprocessing/201908/190808_01/outputs/snippy_paeruginosa_pa14/snps.csv'
## ... ... .......... ......... .........................................................................
## See problems(...) for more details.
## Parsed with column specification:
## cols(
## CHROM = col_character(),
## POS = col_double(),
## TYPE = col_character(),
## REF = col_character(),
## ALT = col_character(),
## EVIDENCE = col_character(),
## FTYPE = col_character(),
## STRAND = col_character(),
## NT_POS = col_character(),
## AA_POS = col_character(),
## EFFECT = col_character(),
## LOCUS_TAG = col_character(),
## GENE = col_character(),
## PRODUCT = col_character()
## )
## Warning: 179 parsing failures.
## row col expected actual file
## 1 -- 14 columns 6 columns 'preprocessing/201908/190808_02/outputs/snippy_paeruginosa_pa14/snps.csv'
## 2 -- 14 columns 6 columns 'preprocessing/201908/190808_02/outputs/snippy_paeruginosa_pa14/snps.csv'
## 3 -- 14 columns 6 columns 'preprocessing/201908/190808_02/outputs/snippy_paeruginosa_pa14/snps.csv'
## 4 -- 14 columns 6 columns 'preprocessing/201908/190808_02/outputs/snippy_paeruginosa_pa14/snps.csv'
## 5 -- 14 columns 6 columns 'preprocessing/201908/190808_02/outputs/snippy_paeruginosa_pa14/snps.csv'
## ... ... .......... ......... .........................................................................
## See problems(...) for more details.
## Parsed with column specification:
## cols(
## CHROM = col_character(),
## POS = col_double(),
## TYPE = col_character(),
## REF = col_character(),
## ALT = col_character(),
## EVIDENCE = col_character(),
## FTYPE = col_character(),
## STRAND = col_character(),
## NT_POS = col_character(),
## AA_POS = col_character(),
## EFFECT = col_character(),
## LOCUS_TAG = col_character(),
## GENE = col_character(),
## PRODUCT = col_character()
## )
## Warning: 183 parsing failures.
## row col expected actual file
## 1 -- 14 columns 6 columns 'preprocessing/201908/190808_03/outputs/snippy_paeruginosa_pa14/snps.csv'
## 2 -- 14 columns 6 columns 'preprocessing/201908/190808_03/outputs/snippy_paeruginosa_pa14/snps.csv'
## 3 -- 14 columns 6 columns 'preprocessing/201908/190808_03/outputs/snippy_paeruginosa_pa14/snps.csv'
## 4 -- 14 columns 6 columns 'preprocessing/201908/190808_03/outputs/snippy_paeruginosa_pa14/snps.csv'
## 5 -- 14 columns 6 columns 'preprocessing/201908/190808_03/outputs/snippy_paeruginosa_pa14/snps.csv'
## ... ... .......... ......... .........................................................................
## See problems(...) for more details.
## Parsed with column specification:
## cols(
## CHROM = col_character(),
## POS = col_double(),
## TYPE = col_character(),
## REF = col_character(),
## ALT = col_character(),
## EVIDENCE = col_character(),
## FTYPE = col_character(),
## STRAND = col_character(),
## NT_POS = col_character(),
## AA_POS = col_character(),
## EFFECT = col_character(),
## LOCUS_TAG = col_character(),
## GENE = col_character(),
## PRODUCT = col_character()
## )
## Warning: 176 parsing failures.
## row col expected actual file
## 1 -- 14 columns 6 columns 'preprocessing/201908/190808_04/outputs/snippy_paeruginosa_pa14/snps.csv'
## 2 -- 14 columns 6 columns 'preprocessing/201908/190808_04/outputs/snippy_paeruginosa_pa14/snps.csv'
## 3 -- 14 columns 6 columns 'preprocessing/201908/190808_04/outputs/snippy_paeruginosa_pa14/snps.csv'
## 4 -- 14 columns 6 columns 'preprocessing/201908/190808_04/outputs/snippy_paeruginosa_pa14/snps.csv'
## 5 -- 14 columns 6 columns 'preprocessing/201908/190808_04/outputs/snippy_paeruginosa_pa14/snps.csv'
## ... ... .......... ......... .........................................................................
## See problems(...) for more details.
## Parsed with column specification:
## cols(
## CHROM = col_character(),
## POS = col_double(),
## TYPE = col_character(),
## REF = col_character(),
## ALT = col_character(),
## EVIDENCE = col_character(),
## FTYPE = col_character(),
## STRAND = col_character(),
## NT_POS = col_character(),
## AA_POS = col_character(),
## EFFECT = col_character(),
## LOCUS_TAG = col_character(),
## GENE = col_character(),
## PRODUCT = col_character()
## )
## Warning: 180 parsing failures.
## row col expected actual file
## 1 -- 14 columns 6 columns 'preprocessing/201908/190808_05/outputs/snippy_paeruginosa_pa14/snps.csv'
## 2 -- 14 columns 6 columns 'preprocessing/201908/190808_05/outputs/snippy_paeruginosa_pa14/snps.csv'
## 3 -- 14 columns 6 columns 'preprocessing/201908/190808_05/outputs/snippy_paeruginosa_pa14/snps.csv'
## 4 -- 14 columns 6 columns 'preprocessing/201908/190808_05/outputs/snippy_paeruginosa_pa14/snps.csv'
## 5 -- 14 columns 6 columns 'preprocessing/201908/190808_05/outputs/snippy_paeruginosa_pa14/snps.csv'
## ... ... .......... ......... .........................................................................
## See problems(...) for more details.
## [1] 202 8
## 080519_76 080519_77 080519_78 190808_01 190808_02
## Min. : 10.0 Min. : 11 Min. :10.0 Min. : 10.0 Min. : 10.0
## 1st Qu.: 50.0 1st Qu.: 80 1st Qu.:19.0 1st Qu.: 44.0 1st Qu.: 41.0
## Median : 77.0 Median :109 Median :24.0 Median : 59.0 Median : 64.0
## Mean : 82.8 Mean :119 Mean :28.1 Mean : 63.5 Mean : 69.1
## 3rd Qu.:102.0 3rd Qu.:149 3rd Qu.:34.0 3rd Qu.: 81.0 3rd Qu.: 86.0
## Max. :292.0 Max. :485 Max. :91.0 Max. :220.0 Max. :304.0
## NA's :17 NA's :13 NA's :38 NA's :23 NA's :23
## 190808_03 190808_04 190808_05
## Min. : 11.0 Min. : 10.0 Min. : 10.0
## 1st Qu.: 67.5 1st Qu.: 43.0 1st Qu.: 44.0
## Median : 93.0 Median : 62.5 Median : 65.0
## Mean :105.7 Mean : 71.6 Mean : 72.7
## 3rd Qu.:130.0 3rd Qu.: 87.0 3rd Qu.: 94.5
## Max. :535.0 Max. :308.0 Max. :296.0
## NA's :19 NA's :26 NA's :22
write.csv(file="snp_df.csv", x=snp_df)
binary <- snp_df
idx <- binary > 0
binary[idx] <- 1
idx <- is.na(binary)
binary[idx] <- 0
binary_list <- list()
for (e in colnames(binary)) {
idx <- binary[, e] == 1
positions <- rownames(binary)[idx]
binary_list[[e]] <- positions
}
library(Vennerable)
test <- Vennerable::Venn(Sets=binary_list)
##Vennerable::plot(test, doWeights=FALSE)
Sample Order: Orn5,Orn6,Orn3,Orn2,Wt,Mona,Orn10,Orn11
ideally our intersection is therefore: 11110011: e.g the things shared among all the transposon mutants but not the wt nor mona mutant.
An important caveat:
Orn10 and Orn11 are not suppressors.
## NULL
## [1] "945882_del_CG_C" "945892_del_GC_G"
## [1] "2354158_del_CG_C"
## [1] "5358509_del_GT_G"
## [1] "1980865_snp_A_T"
## Confirmed in the first two samples, the third sample is very interesting in this region too.
## This is just before the yciB locus.
test@IntersectionSets[["01110000"]]
## [1] "2354158_del_CG_C"
## [1] "4947631_complex_ATGCGCACTACACAGGACAACATTT_CTGGCGAACAGCCACGACGGCCTGG"
## [2] "5910265_snp_G_C"
## [1] "2033089_del_AT_A"