I am looking to learn about unannotated short CDSes in 5448. In reality, I think I want to learn about all short transcripts which are likely residing in the inter-CDS regions and which have some evidence of being actively transcribed. The short CDS sequences are thus a (potentially small) subset thereof; but they are a subset for which there are quick and easy tools to hunt.
With that in mind, I compiled a new copy of prodigal in which I changed its hard-coded settings for the lower limits of its search range.
If anyone reads this who is interested, I changed the constants in node.h:
I went through a few iterations of messing with these parameters.
I will just run the following commands here, I just run prodigal (default) and progial_small (modified) separately on the 5448 genome.
ok, so bizarrely my first run gave me (I think) 1778 CDS sequences, which seemed rather lower than I expected. So I decreased the inter-operon distance with the reasonable assumption that I would find a few more, but now prodigal is detecting 1773.
module add prodigal
cd reference
prodigal -i spyogenes_5448.fasta -f gff > spyogenes_5448.gff
prodigal_small -i spyogenes_5448.fasta -f gff > spyogenes_5448_small.gff
wc -l spyogenes_5448.gff
wc -l spyogenes_5448_small.gff
## bash: line 1: module: command not found
## -------------------------------------
## PRODIGAL v2.6.3 [February, 2016]
## Univ of Tenn / Oak Ridge National Lab
## Doug Hyatt, Loren Hauser, et al.
## -------------------------------------
## Request: Single Genome, Phase: Training
## Reading in the sequence(s) to train...1829516 bp seq created, 38.50 pct GC
## Locating all potential starts and stops...70248 nodes
## Looking for GC bias in different frames...frame bias scores: 2.57 0.15 0.28
## Building initial set of genes to train from...done!
## Creating coding model and scoring nodes...done!
## Examining upstream regions and training starts...done!
## -------------------------------------
## Request: Single Genome, Phase: Gene Finding
## Finding genes in sequence #1 (1829516 bp)...done!
## -------------------------------------
## PRODIGAL v2.6.3 [February, 2016]
## Univ of Tenn / Oak Ridge National Lab
## Doug Hyatt, Loren Hauser, et al.
## -------------------------------------
## Request: Single Genome, Phase: Training
## Reading in the sequence(s) to train...1829516 bp seq created, 38.50 pct GC
## Locating all potential starts and stops...193604 nodes
## Looking for GC bias in different frames...frame bias scores: 2.30 0.22 0.48
## Building initial set of genes to train from...done!
## Creating coding model and scoring nodes...done!
## Examining upstream regions and training starts...done!
## -------------------------------------
## Request: Single Genome, Phase: Gene Finding
## Finding genes in sequence #1 (1829516 bp)...done!
## 1769 spyogenes_5448.gff
## 1773 spyogenes_5448_small.gff
I would expect that most of the resulting entries are identical, but I can easily imagine that the different parameters might cause different start codons to be chosen (which might in turn affect the number of new entries in the small set and be why I got the unexpectedly smaller number). Either way, the + strand ends and - strand starts should remain identical (start and end in this context are divorced from AUG/GUG/UUT(I thought pyogenes also had a couple CUG?) vs.ย stop codon).
load_gff_annotations("reference/spyogenes_5448.gff") normal <-
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = TRUE)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = FALSE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = FALSE)
## Returning a df with 22 columns and 1766 rows.
summary(as.factor(normal[["start_type"]]))
## ATG GTG TTG
## 1599 86 81
load_gff_annotations("reference/spyogenes_5448_small.gff") small <-
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = TRUE)
## Trying attempt: rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = FALSE)
## Had a successful gff import with rtracklayer::import.gff3(gff, sequenceRegionsAsSeqinfo = FALSE)
## Returning a df with 22 columns and 1770 rows.
summary(as.factor(small[["start_type"]]))
## ATG GTG TTG
## 1601 90 79
These summaries make clear that a few start codons did in fact get moved. The question therefore is, did that shift the putative CDS sizes up or down (with the caveat that the minimum went down by definition).
summary(normal[["width"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 90 453 765 892 1187 6180
summary(small[["width"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 63 447 759 888 1182 6180
Interestingly, even though I explicitly said the minimum is 24, the smallest observed putative CDS is 63 nucleotides.
Alrighty, let us find the intersection and (more interestingly) unique to the small (and hopefully null unique to the large) sets. I will assume the stop codons are static and split the strands +/-.
normal[["strand"]] == "+"
normal_plus_idx <- normal[normal_plus_idx, ]
normal_plus <- normal[!normal_plus_idx, ]
normal_minus <-
small[["strand"]] == "+"
small_plus_idx <- small[small_plus_idx, ]
small_plus <- small[!small_plus_idx, ] small_minus <-
dim(normal_plus)
## [1] 952 22
dim(small_plus)
## [1] 953 22
merge(normal_plus, small_plus, by="end", all.x=TRUE, all.y=TRUE)
all_plus <-
## These ought to be the set of babies which are in the small set but _not_ the large set.
is.na(all_plus[["width.x"]])
missing_normal_idx <- all_plus[missing_normal_idx, ]
missing_normal <- missing_normal
## end seqnames.x start.x width.x strand.x source.x type.x score.x phase.x
## 261 344192 <NA> NA NA <NA> <NA> <NA> NA NA
## 488 619685 <NA> NA NA <NA> <NA> <NA> NA NA
## 536 668181 <NA> NA NA <NA> <NA> <NA> NA NA
## 570 689910 <NA> NA NA <NA> <NA> <NA> NA NA
## 660 822606 <NA> NA NA <NA> <NA> <NA> NA NA
## 863 1439115 <NA> NA NA <NA> <NA> <NA> NA NA
## 873 1475677 <NA> NA NA <NA> <NA> <NA> NA NA
## 885 1528425 <NA> NA NA <NA> <NA> <NA> NA NA
## ID.x partial.x start_type.x rbs_motif.x rbs_spacer.x gc_cont.x conf.x
## 261 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 488 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 536 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 570 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 660 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 863 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 873 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 885 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## score.1.x cscore.x sscore.x rscore.x uscore.x tscore.x seqnames.y start.y
## 261 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 344109
## 488 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 619566
## 536 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 668119
## 570 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 689800
## 660 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 822523
## 863 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 1439047
## 873 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 1475570
## 885 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 1528339
## width.y strand.y source.y type.y score.y phase.y ID.y partial.y
## 261 84 + Prodigal_v2.6.3 CDS 2.6 0 1_319 00
## 488 120 + Prodigal_v2.6.3 CDS 0.1 0 1_585 00
## 536 63 + Prodigal_v2.6.3 CDS 0.6 0 1_643 00
## 570 111 + Prodigal_v2.6.3 CDS 1.6 0 1_676 00
## 660 84 + Prodigal_v2.6.3 CDS 10.7 0 1_799 00
## 863 69 + Prodigal_v2.6.3 CDS 5.3 0 1_1402 00
## 873 108 + Prodigal_v2.6.3 CDS 0.4 0 1_1448 00
## 885 87 + Prodigal_v2.6.3 CDS 2.3 0 1_1503 00
## start_type.y rbs_motif.y rbs_spacer.y gc_cont.y conf.y score.1.y cscore.y
## 261 ATG GGAG/GAGG 5-10bp 0.369 64.26 2.55 -2.95
## 488 ATG AGxAGG/AGGxGG 3-4bp 0.300 50.40 0.07 -5.28
## 536 ATG GGA/GAG/AGG 5-10bp 0.333 53.42 0.60 -1.20
## 570 ATG AGGAG 5-10bp 0.378 59.31 1.64 -5.80
## 660 ATG AGGAG 5-10bp 0.298 92.14 10.71 2.94
## 863 ATG AGGAG 5-10bp 0.290 77.01 5.26 -0.60
## 873 ATG GGAG/GAGG 5-10bp 0.333 52.51 0.44 -5.03
## 885 ATG GGAG/GAGG 5-10bp 0.379 62.71 2.26 -3.42
## sscore.y rscore.y uscore.y tscore.y
## 261 5.50 3.17 0.50 1.41
## 488 5.35 2.89 0.93 2.03
## 536 1.79 -3.30 -10.03 1.04
## 570 7.44 6.84 -1.48 1.87
## 660 7.77 5.13 1.24 1.41
## 863 5.86 4.18 1.03 1.15
## 873 5.47 4.11 0.69 1.82
## 885 5.69 3.28 0.52 1.46
## Conversely these are the ones which are lost in the 'small' dataset.
is.na(all_plus[["width.y"]])
missing_small_idx <- all_plus[missing_small_idx, ]
missing_small <- missing_small
## end seqnames.x start.x width.x strand.x source.x type.x score.x
## 84 99385 CP008776 98951 435 + Prodigal_v2.6.3 CDS 15.4
## 533 667770 CP008776 667537 234 + Prodigal_v2.6.3 CDS 15.2
## 557 681475 CP008776 681137 339 + Prodigal_v2.6.3 CDS 45.6
## 590 717868 CP008776 717641 228 + Prodigal_v2.6.3 CDS 9.8
## 601 732661 CP008776 732527 135 + Prodigal_v2.6.3 CDS 0.0
## 694 840756 CP008776 840298 459 + Prodigal_v2.6.3 CDS 26.4
## 917 1707641 CP008776 1707450 192 + Prodigal_v2.6.3 CDS 28.4
## phase.x ID.x partial.x start_type.x rbs_motif.x rbs_spacer.x gc_cont.x
## 84 0 1_88 00 TTG AGGAG/GGAGG 11-12bp 0.352
## 533 0 1_641 00 ATG GGA/GAG/AGG 11-12bp 0.346
## 557 0 1_664 00 ATG AGGAG/GGAGG 11-12bp 0.378
## 590 0 1_701 00 GTG AGGAG 5-10bp 0.373
## 601 0 1_714 00 ATG AGGA/GGAG/GAGG 11-12bp 0.370
## 694 0 1_835 00 TTG None None 0.412
## 917 0 1_1660 00 ATG AGGA/GGAG/GAGG 11-12bp 0.401
## conf.x score.1.x cscore.x sscore.x rscore.x uscore.x tscore.x seqnames.y
## 84 97.17 15.38 17.03 -1.66 8.25 -0.84 -10.32 <NA>
## 533 97.07 15.23 15.05 0.18 -3.24 -0.17 3.59 <NA>
## 557 100.00 45.63 31.70 13.93 8.25 1.78 3.89 <NA>
## 590 90.39 9.75 1.67 8.08 12.18 2.78 -6.88 <NA>
## 601 50.06 0.01 -3.65 3.67 1.46 1.31 2.05 <NA>
## 694 99.77 26.42 46.84 -20.42 -8.48 -1.62 -10.32 <NA>
## 917 99.86 28.44 22.17 6.26 2.08 1.24 2.94 <NA>
## start.y width.y strand.y source.y type.y score.y phase.y ID.y partial.y
## 84 NA NA <NA> <NA> <NA> NA NA <NA> <NA>
## 533 NA NA <NA> <NA> <NA> NA NA <NA> <NA>
## 557 NA NA <NA> <NA> <NA> NA NA <NA> <NA>
## 590 NA NA <NA> <NA> <NA> NA NA <NA> <NA>
## 601 NA NA <NA> <NA> <NA> NA NA <NA> <NA>
## 694 NA NA <NA> <NA> <NA> NA NA <NA> <NA>
## 917 NA NA <NA> <NA> <NA> NA NA <NA> <NA>
## start_type.y rbs_motif.y rbs_spacer.y gc_cont.y conf.y score.1.y cscore.y
## 84 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 533 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 557 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 590 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 601 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 694 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 917 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## sscore.y rscore.y uscore.y tscore.y
## 84 <NA> <NA> <NA> <NA>
## 533 <NA> <NA> <NA> <NA>
## 557 <NA> <NA> <NA> <NA>
## 590 <NA> <NA> <NA> <NA>
## 601 <NA> <NA> <NA> <NA>
## 694 <NA> <NA> <NA> <NA>
## 917 <NA> <NA> <NA> <NA>
dim(normal_minus)
## [1] 814 22
dim(small_minus)
## [1] 817 22
merge(normal_minus, small_minus, by="start", all.x=TRUE, all.y=TRUE)
all_minus <-
## These ought to be the set of babies which are in the small set but _not_ the large set.
is.na(all_minus[["width.x"]])
missing_normal_idx <- all_minus[missing_normal_idx, ]
missing_normal <- missing_normal
## start seqnames.x end.x width.x strand.x source.x type.x score.x phase.x
## 522 1401665 <NA> NA NA <NA> <NA> <NA> NA NA
## 655 1564785 <NA> NA NA <NA> <NA> <NA> NA NA
## 796 1789077 <NA> NA NA <NA> <NA> <NA> NA NA
## 803 1797083 <NA> NA NA <NA> <NA> <NA> NA NA
## ID.x partial.x start_type.x rbs_motif.x rbs_spacer.x gc_cont.x conf.x
## 522 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 655 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 796 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 803 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## score.1.x cscore.x sscore.x rscore.x uscore.x tscore.x seqnames.y end.y
## 522 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 1401736
## 655 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 1564868
## 796 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 1789160
## 803 <NA> <NA> <NA> <NA> <NA> <NA> CP008776 1797157
## width.y strand.y source.y type.y score.y phase.y ID.y partial.y
## 522 72 - Prodigal_v2.6.3 CDS 5.6 0 1_1369 00
## 655 84 - Prodigal_v2.6.3 CDS 3.1 0 1_1539 00
## 796 84 - Prodigal_v2.6.3 CDS 2.6 0 1_1736 00
## 803 75 - Prodigal_v2.6.3 CDS 5.5 0 1_1743 00
## start_type.y rbs_motif.y rbs_spacer.y gc_cont.y conf.y score.1.y cscore.y
## 522 ATG AGGA 5-10bp 0.222 78.31 5.58 1.21
## 655 ATG GGAG/GAGG 5-10bp 0.357 66.86 3.05 -2.44
## 796 ATG GGAG/GAGG 5-10bp 0.357 64.26 2.55 -2.95
## 803 ATG None None 0.373 78.15 5.54 3.04
## sscore.y rscore.y uscore.y tscore.y
## 522 4.38 2.13 1.70 1.20
## 655 5.50 3.17 0.50 1.41
## 796 5.50 3.17 0.50 1.41
## 803 2.50 -30.03 -2.18 1.25
## Conversely these are the ones which are lost in the 'small' dataset.
is.na(all_minus[["width.y"]])
missing_small_idx <- all_minus[missing_small_idx, ]
missing_small <- missing_small
## start seqnames.x end.x width.x strand.x source.x type.x score.x
## 12 118120 CP008776 118347 228 - Prodigal_v2.6.3 CDS 39.1
## phase.x ID.x partial.x start_type.x rbs_motif.x rbs_spacer.x gc_cont.x
## 12 0 1_108 00 ATG AGGAGG 5-10bp 0.412
## conf.x score.1.x cscore.x sscore.x rscore.x uscore.x tscore.x seqnames.y
## 12 99.99 39.10 19.72 19.38 12.99 2.16 3.50 <NA>
## end.y width.y strand.y source.y type.y score.y phase.y ID.y partial.y
## 12 NA NA <NA> <NA> <NA> NA NA <NA> <NA>
## start_type.y rbs_motif.y rbs_spacer.y gc_cont.y conf.y score.1.y cscore.y
## 12 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## sscore.y rscore.y uscore.y tscore.y
## 12 <NA> <NA> <NA> <NA>
::pander(sessionInfo())
pandermessage(paste0("This is hpgltools commit: ", get_git_commit()))
paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
this_save <-message(paste0("Saving to ", this_save))
sm(saveme(filename=this_save)) tmp <-