In contrast, it is possible to load most annotations of interest directly from the gff files used in the alignments. More in-depth information for the human transcriptome may be extracted from biomart.
## The old way of getting genome/annotation data
mtb_gff <- "reference/mycobacterium_tuberculosis_h37rv_2.gff.gz"
mtb_genome <- "reference/mtuberculosis_h37rv_genbank.fasta"
mtb_cds <- "reference/mtb_cds.fasta"
mtb_annotations <- sm(load_gff_annotations(mtb_gff, type="gene"))
rownames(mtb_annotations) <- mtb_annotations[["ID"]]
Microbesonline is my favorite resource for bacterial data. This is probably because of Yoann.
## First figure out the ID for the Mtb genome:
ids <- get_microbesonline_ids("37")
knitr::kable(head(ids))
taxonomyId | shortName |
---|---|
83332 | Mycobacterium tuberculosis H37Rv |
243273 | Mycoplasma genitalium G37 |
316274 | Herpetosiphon aurantiacus ATCC 23779 |
331111 | Escherichia coli E24377A |
338966 | Pelobacter propionicus DSM 2379 |
350704 | Pseudomonas aeruginosa C3719 |
## Mycobacterium tuberculosis H37Rv is 83332
mtb_id <- 83332
mtb_microbes <- load_microbesonline_annotations(ids=mtb_id)[[1]]
## Querying microbesonline for: Mycobacterium tuberculosis H37Rv.
## Warning in if (grepl(pattern = "Time-out", x = data[3, ])) {: the condition
## has length > 1 and only the first element will be used
knitr::kable(head(mtb_microbes))
locusId | accession | GI | scaffoldId | start | stop | strand | sysName | name | desc | COG | COGFun | COGDesc | TIGRFam | TIGRRoles | GO | EC | ECDesc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
31772 | NP_214515.1 | 15607143 | 7022 | 1 | 1524 | + | Rv0001 | dnaA | chromosomal replication initiation protein (NCBI) | COG593 | L | ATPase involved in DNA replication initiation | TIGR00362 chromosomal replication initiator protein DnaA [dnaA] | DNA metabolism:DNA replication, recombination, and repair | GO:0006270,GO:0006275,GO:0003688,GO:0017111,GO:0005524 | ||
31773 | NP_214516.1 | 15607144 | 7022 | 2052 | 3260 | + | Rv0002 | dnaN | DNA polymerase III subunit beta (NCBI) | COG592 | L | DNA polymerase sliding clamp subunit (PCNA homolog) | TIGR00663 DNA polymerase III, beta subunit [dnaN] | DNA metabolism:DNA replication, recombination, and repair | GO:0006260,GO:0003890,GO:0003895,GO:0016000,GO:0016451,GO:0003891,GO:0016448,GO:0016452,GO:0003677,GO:0003893,GO:0008408,GO:0016449,GO:0019984,GO:0003889,GO:0003894,GO:0015999,GO:0016450 | 2.7.7.7 | DNA-directed DNA polymerase. |
31774 | NP_214517.1 | 15607145 | 7022 | 3280 | 4437 | + | Rv0003 | recF | recombination protein F (NCBI) | COG1195 | L | Recombinational DNA repair ATPase (RecF pathway) | TIGR00611 DNA replication and repair protein RecF [recF] | DNA metabolism:DNA replication, recombination, and repair | GO:0006281,GO:0005694,GO:0003697,GO:0005524 | ||
31775 | NP_214518.1 | 15607146 | 7022 | 4434 | 4997 | + | Rv0004 | Rv0004 | hypothetical protein (NCBI) | COG5512 | R | Zn-ribbon-containing, possibly RNA-binding protein and truncated derivatives | |||||
31776 | NP_214519.1 | 15607147 | 7022 | 5123 | 7267 | + | Rv0005 | gyrB | DNA topoisomerase IV subunit B (NCBI) | COG187 | L | Type IIA topoisomerase (DNA gyrase/topo II, topoisomerase IV), B subunit | TIGR01059 DNA gyrase, B subunit [gyrB] | DNA metabolism:DNA replication, recombination, and repair | GO:0006304,GO:0006265,GO:0005694,GO:0003918,GO:0005524 | 5.99.1.3 | DNA topoisomerase (ATP-hydrolyzing). |
31777 | NP_214520.1 | 15607148 | 7022 | 7302 | 9818 | + | Rv0006 | gyrA | DNA gyrase subunit A (NCBI) | COG188 | L | Type IIA topoisomerase (DNA gyrase/topo II, topoisomerase IV), A subunit | TIGR01063 DNA gyrase, A subunit [gyrA] | DNA metabolism:DNA replication, recombination, and repair | GO:0006265,GO:0006268,GO:0005694,GO:0003918,GO:0005509,GO:0005524 | 5.99.1.3 | DNA topoisomerase (ATP-hydrolyzing). |
Biomart is obnoxious to use, but I think far and away the best metazoan information resource.
human_annot <- load_biomart_annotations()
## The biomart annotations file already exists, loading from it.
hs_annot <- human_annot[["annotation"]]
knitr::kable(head(hs_annot))
ensembl_transcript_id | ensembl_gene_id | version | transcript_version | hgnc_symbol | description | gene_biotype | cds_length | chromosome_name | strand | start_position | end_position | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ENST00000000233 | ENST00000000233 | ENSG00000004059 | 10 | 9 | ARF5 | ADP ribosylation factor 5 [Source:HGNC Symbol;Acc:HGNC:658] | protein_coding | 543 | 7 | + | 127588345 | 127591705 |
ENST00000000412 | ENST00000000412 | ENSG00000003056 | 7 | 7 | M6PR | mannose-6-phosphate receptor, cation dependent [Source:HGNC Symbol;Acc:HGNC:6752] | protein_coding | 834 | 12 | - | 8940363 | 8949955 |
ENST00000000442 | ENST00000000442 | ENSG00000173153 | 13 | 10 | ESRRA | estrogen related receptor alpha [Source:HGNC Symbol;Acc:HGNC:3471] | protein_coding | 1272 | 11 | + | 64305572 | 64316743 |
ENST00000001008 | ENST00000001008 | ENSG00000004478 | 7 | 5 | FKBP4 | FK506 binding protein 4 [Source:HGNC Symbol;Acc:HGNC:3720] | protein_coding | 1380 | 12 | + | 2794953 | 2805423 |
ENST00000001146 | ENST00000001146 | ENSG00000003137 | 8 | 6 | CYP26B1 | cytochrome P450 family 26 subfamily B member 1 [Source:HGNC Symbol;Acc:HGNC:20581] | protein_coding | 1539 | 2 | - | 72129238 | 72148038 |
ENST00000002125 | ENST00000002125 | ENSG00000003509 | 15 | 8 | NDUFAF7 | NADH:ubiquinone oxidoreductase complex assembly factor 7 [Source:HGNC Symbol;Acc:HGNC:28816] | protein_coding | 1326 | 2 | + | 37231631 | 37253403 |
There are a few ways to get information directly from genbank. None of them are particularly satisfactory. Here is one attempt.
tt <- sm(library(UniProt.ws))
colnames(availableUniprotSpecies())
found <- availableUniprotSpecies(pattern="Mycobacterium tuberculosis")
knitr::kable(found)
## Lets pull data for H37Ra which is not a perfect match, but I am thinking a good one.
mtb_uniprot <- UniProt.ws(419947)
species(mtb_uniprot)
available_keytypes <- keytypes(mtb_uniprot)
available_columns <- columns(mtb_uniprot)
mtb_keys <- keys(x=mtb_uniprot, keytype="ENTREZ_GENE")
head(mtb_keys)
##hitlst <- list()
##for (k in available_keytypes) {
## test_keys <- sm(try(keys(x=mtb_uniprot, keytype=k)))
## hitlst[[k]] <- length(test_keys)
##}
##hitlst
## `GI_NUMBER*` is the winner!
keytype <- "GI_NUMBER*"
mtb_keys <- keys(x=mtb_uniprot, keytype=keytype)
wanted_columns <- c("ENSEMBL", "ENTREZ_GENE", "TUBERCULIST", "KEGG", "INTERPRO", "HGNC",
mtb_stuff <- select(x=mtb_uniprot, keys=mtb_keys, columns=wanted_columns, keytype=keytype)
knitr::kable(head(mtb_stuff))
##length(mtb_keys)
##for (col in available_columns) {
## if (col == keytype) {
## next
## }
## hits <- sm(try(select(x=mtb_uniprot, keys=mtb_keys, columns=col, keytype="ENTREZ_GENE")))
## num_hits <- sum(!is.na(hits[[2]]))
## hitlst[[col]] <- num_hits
##}
## Error: <text>:27:1: unexpected symbol
## 26: mtb_stuff <- select(x=mtb_uniprot, keys=mtb_keys, columns=wanted_columns, keytype=keytype)
## 27: knitr
## ^
mtb_go <- load_microbesonline_go(id=mtb_id)
## Collecting go data for: Mycobacterium tuberculosis H37Rv.
I want to be able to cross reference some work from Keith. His gene IDs are MTBxxxx
all_de <- read.table("limma_result.csv", header=TRUE, sep="\t")
## Warning in file(file, "rt"): cannot open file 'limma_result.csv': No such
## file or directory
## Error in file(file, "rt"): cannot open the connection
pander::pander(sessionInfo())
R version 3.4.4 (2018-03-15)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: stats4, parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: hpgltools(v.2018.03), Homo.sapiens(v.1.3.1), TxDb.Hsapiens.UCSC.hg19.knownGene(v.3.2.2), org.Hs.eg.db(v.3.5.0), GO.db(v.3.5.0), OrganismDbi(v.1.20.0), GenomicFeatures(v.1.30.3), GenomicRanges(v.1.30.3), GenomeInfoDb(v.1.14.0), AnnotationDbi(v.1.40.0), IRanges(v.2.12.0), S4Vectors(v.0.16.0), Biobase(v.2.38.0), ggrepel(v.0.7.0), ggplot2(v.2.2.1), UniProt.ws(v.2.18.0), BiocGenerics(v.0.24.0), RCurl(v.1.95-4.10), bitops(v.1.0-6) and RSQLite(v.2.0)
loaded via a namespace (and not attached): httr(v.1.3.1), RMySQL(v.0.10.14), bit64(v.0.9-7), foreach(v.1.4.4), assertthat(v.0.2.0), highr(v.0.6), pander(v.0.6.1), RBGL(v.1.54.0), blob(v.1.1.0), GenomeInfoDbData(v.1.0.0), Rsamtools(v.1.30.0), yaml(v.2.1.18), progress(v.1.1.2), pillar(v.1.2.1), backports(v.1.1.2), lattice(v.0.20-35), digest(v.0.6.15), XVector(v.0.18.0), colorspace(v.1.3-2), htmltools(v.0.3.6), Matrix(v.1.2-12), plyr(v.1.8.4), XML(v.3.98-1.10), pkgconfig(v.2.0.1), devtools(v.1.13.5), biomaRt(v.2.34.2), zlibbioc(v.1.24.0), scales(v.0.5.0), BiocParallel(v.1.12.0), tibble(v.1.4.2), withr(v.2.1.2), SummarizedExperiment(v.1.8.1), lazyeval(v.0.2.1), magrittr(v.1.5), memoise(v.1.1.0), evaluate(v.0.10.1), xml2(v.1.2.0), graph(v.1.56.0), BiocInstaller(v.1.28.0), tools(v.3.4.4), data.table(v.1.10.4-3), prettyunits(v.1.0.2), matrixStats(v.0.53.1), stringr(v.1.3.0), munsell(v.0.4.3), DelayedArray(v.0.4.1), Biostrings(v.2.46.0), compiler(v.3.4.4), rlang(v.0.2.0), grid(v.3.4.4), iterators(v.1.0.9), base64enc(v.0.1-3), rmarkdown(v.1.9), gtable(v.0.2.0), codetools(v.0.2-15), DBI(v.0.8), roxygen2(v.6.0.1), R6(v.2.2.2), GenomicAlignments(v.1.14.1), knitr(v.1.20), rtracklayer(v.1.38.3), bit(v.1.1-12), commonmark(v.1.4), rprojroot(v.1.3-2), stringi(v.1.1.7) and Rcpp(v.0.12.16)
message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 5d8c266e48bb9f73cdac8300e5c7c9f5baf003dc
## R> packrat::restore()
## This is hpgltools commit: Wed Mar 21 15:55:32 2018 -0400: 5d8c266e48bb9f73cdac8300e5c7c9f5baf003dc
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
## Saving to 01_annotation-v20180306.rda.xz
tmp <- sm(saveme(filename=this_save))