We always arbitrarily say 1.5 kb…
Pull some code from annotation.Rmd and check that assumption.
## microbes_ids
## Looks like it is taxon ID 208963
paeruginosa_annotations <- load_microbesonline_annotations(id=208963)
## The species being downloaded is: Pseudomonas aeruginosa UCBPP-PA14
locusId | accession | GI | scaffoldId | start | stop | strand | sysName | name | desc | COG | COGFun | COGDesc | TIGRFam | TIGRRoles | GO | EC | ECDesc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2194572 | YP_788156.1 | 116053721 | 4582 | 483 | 2027 | + | PA14_00010 | dnaA | chromosomal replication initiator protein DnaA (NCBI) | COG593 | L | ATPase involved in DNA replication initiation | TIGR00362 chromosomal replication initiator protein DnaA [dnaA] | DNA metabolism:DNA replication, recombination, and repair | GO:0006270,GO:0006275,GO:0003688,GO:0017111,GO:0005524 | ||
2194573 | YP_788157.1 | 116053722 | 4582 | 2056 | 3159 | + | PA14_00020 | dnaN | DNA polymerase III, beta chain (NCBI) | COG592 | L | DNA polymerase sliding clamp subunit (PCNA homolog) | TIGR00663 DNA polymerase III, beta subunit [dnaN] | DNA metabolism:DNA replication, recombination, and repair | GO:0006260,GO:0003677,GO:0003893,GO:0008408,GO:0016449,GO:0019984,GO:0003889,GO:0003894,GO:0015999,GO:0016450,GO:0003890,GO:0003895,GO:0016000,GO:0016451,GO:0003891,GO:0016448,GO:0016452 | 2.7.7.7 | DNA-directed DNA polymerase. |
2194574 | YP_788158.1 | 116053723 | 4582 | 3169 | 4278 | + | PA14_00030 | recF | DNA replication and repair protein RecF (NCBI) | COG1195 | L | Recombinational DNA repair ATPase (RecF pathway) | TIGR00611 DNA replication and repair protein RecF [recF] | DNA metabolism:DNA replication, recombination, and repair | GO:0006281,GO:0005694,GO:0005524,GO:0017111,GO:0003697 | ||
2194575 | YP_788159.1 | 116053724 | 4582 | 4275 | 6695 | + | PA14_00050 | gyrB | DNA gyrase subunit B (NCBI) | COG187 | L | Type IIA topoisomerase (DNA gyrase/topo II, topoisomerase IV), B subunit | TIGR01059 DNA gyrase, B subunit [gyrB] | DNA metabolism:DNA replication, recombination, and repair | GO:0006304,GO:0006265,GO:0005694,GO:0003918,GO:0005524 | 5.99.1.3 | DNA topoisomerase (ATP-hydrolyzing). |
2194576 | YP_788160.1 | 116053725 | 4582 | 7791 | 7018 | - | PA14_00060 | PA14_00060 | putative acyltransferase (NCBI) | COG204 | I | 1-acyl-sn-glycerol-3-phosphate acyltransferase | GO:0008152,GO:0003841 | 2.3.1.51 | 1-acylglycerol-3-phosphate O-acyltransferase. | ||
2194577 | YP_788161.1 | 116053726 | 4582 | 8339 | 7803 | - | PA14_00070 | PA14_00070 | putative histidinol-phosphatase (NCBI) | COG241 | E | Histidinol phosphatase and related phosphatases | TIGR01656 histidinol-phosphate phosphatase domain,TIGR01662 HAD hydrolase, family IIIA | Unknown function:Enzymes of unknown specificity | GO:0000105,GO:0004401 | 3.1.3.- |
pa <- new.env()
test <- try(load("pa14.rda", envir=pa))
pa14 <- NULL
if (class(test) == "try-error") {
pa14 <- sm(gbk2txdb(accession="NC_008463.1"))
## save a copy of this data structure to avoid having to redownload it
save(list=c("pa14"), file="pa14.rda")
} else {
pa14 <- pa$pa14
rm(pa)
}
summary(pa14$cds)
## [1] "GRanges object with 5890 ranges and 20 metadata columns"
## 6069 cds regions.
cds_regions <- as.data.frame(pa14$cds)
length_summary <- summary(cds_regions$width)
length_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 72 525 858 986 1254 15639
## Mean: 971.3 nt / cds over 6069 cds regions: 5,984,820 nt of cds sequence.
IRanges::width(pa14$seq)
## [1] 6537648
I would like to back these up with some numbers from Yoann’s RNASeq. But until I do, the ‘HiSeq_2500_1500_Sequencing_systems_specsheet.pdf’ Najib sent me shows the following:
2 x 50 nt reads: 25-30 Gb. in high-throughput mode. (Call it 12 Gb for 50 nt single)
‘Up to 1.5 billion single reads passing quality filters.’
So let us assume that the genome is completely and equally expressed.
We therefore expect some number of reads per nt.
The following is assuming one uses a normal read-mode, divide all numbers by 5 for rapid run.
## [1] 229.4
## [1] 11472
## [1] 247158
## [1] 254.5
Caveat, the genome is only 1.8E6 nt rather than your 6.5E6.
Also, lets jump straight to the number of reads aligned to cds sequences.
These were run in a single sequencing run, I do not know if rapid or regular, but 100 nt single. Note that I am only counting the successfully mapped reads to the cds, including reads which map randomly to more than 1 place (but are mapped only once.)
Sample: Total counts to cds with strict mapping parameters (1,814 cds entries)
spyogenes_summary <- data.frame(
ids=c("hpgl0531","hpgl0532","hpgl0533","hpgl0534","hpgl0535","hpgl0536","hpgl0537","hpgl0538","hpgl0539",
"hpgl0540","hpgl0541","hpgl0542","hpgl0543","hpgl0544","hpgl0545","hpgl0546","hpgl0547","hpgl0548","hpgl0549",
"hpgl0550","hpgl0551","hpgl0552","hpgl0553","hpgl0554"),
reads=c(5729423, 6032465, 5554651, 5946316, 5711986, 6451776, 3356330, 4022852,
5203054, 5655144, 5324956, 5953598, 5556916, 5203653, 5076034, 5475249,
6150507, 4712725, 5176689, 5788482, 6052472, 5555723, 5655540, 6403623))
knitr::kable(spyogenes_summary)
ids | reads |
---|---|
hpgl0531 | 5729423 |
hpgl0532 | 6032465 |
hpgl0533 | 5554651 |
hpgl0534 | 5946316 |
hpgl0535 | 5711986 |
hpgl0536 | 6451776 |
hpgl0537 | 3356330 |
hpgl0538 | 4022852 |
hpgl0539 | 5203054 |
hpgl0540 | 5655144 |
hpgl0541 | 5324956 |
hpgl0542 | 5953598 |
hpgl0543 | 5556916 |
hpgl0544 | 5203653 |
hpgl0545 | 5076034 |
hpgl0546 | 5475249 |
hpgl0547 | 6150507 |
hpgl0548 | 4712725 |
hpgl0549 | 5176689 |
hpgl0550 | 5788482 |
hpgl0551 | 6052472 |
hpgl0552 | 5555723 |
hpgl0553 | 5655540 |
hpgl0554 | 6403623 |
## the range of reads / sample mapped goes from ~ 3,360,000 to 6,450,000.
summary(spyogenes_summary$reads)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3356330 5203503 5606030 5489590 5948136 6451776
## [1] "1.318e+08"
So I think we can assume that they loaded 1 lane of a normal 8 lane run?
But the main piece of interesting information in my mind: the numbers of reads/sample varied by about a factor of 2.