1 A fresh run of all proteomics tasks

I think I have finally worked out all (most?) of the kinks in the processing of DIA data, so I want a fresh run of all the tasks required to interpret the results.

2 Annotation version: 20190801

2.1 Genome annotation input

2.1.2 Download from microbesonline

Apparently I queried microbesonline too often, and now I get an error whenever I try to use it. This disappoints me.

## The species being downloaded is: Mycobacterium tuberculosis H37Rv
## Downloading: http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=83332;export=tab
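One way to avoid querying the server repeatedly is to keep a local copy and download only when it is missing. The following is a minimal sketch under that assumption, not what the download step above actually does; `fetch_cached` and its `downloader` argument are hypothetical names introduced here for illustration.

```r
## Hypothetical helper: download a file only when no local copy exists.
## The downloader argument defaults to utils::download.file and can be
## swapped out, e.g. for testing without touching the network.
fetch_cached <- function(url, destfile, downloader=utils::download.file) {
  if (file.exists(destfile)) {
    message("Using cached copy: ", destfile)
    return(invisible(destfile))
  }
  downloader(url, destfile)
  invisible(destfile)
}

## Possible usage (not run here, it would contact the server):
## fetch_cached("http://www.microbesonline.org/cgi-bin/genomeInfo.cgi?tId=83332;export=tab",
##              "83332.tab")
```

With this in place, rerunning the document would only contact the server the first time, assuming the cached file is still valid.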

2.2 Getting ontology data

## The species being downloaded is: Mycobacterium tuberculosis H37Rv and is being downloaded as 83332.tab.

2.3 Some pattern matching

This little block is intended to seek out peptide sequences matching the highly degenerate pattern Y(E|D), with a varying number of amino acids between the Y and the E or D.
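To make the pattern concrete, here is a small sketch on a toy peptide. The sequence is invented for illustration and `find_starts` is a hypothetical helper; it only shows how `gregexpr()` reports match start positions and how those become fractional positions along the sequence.

```r
## Toy peptide, invented for illustration only (14 residues).
peptide <- "MYAAEKLYGGGDRR"

## Hypothetical helper: return the start positions of every match,
## or an empty vector when gregexpr() reports -1 (no match).
find_starts <- function(pattern, sequence) {
  hits <- as.integer(gregexpr(pattern=pattern, text=sequence)[[1]])
  if (hits[1] == -1) integer(0) else hits
}

two <- find_starts("Y..(E|D)", peptide)    ## Y, two residues, then E or D
three <- find_starts("Y...(E|D)", peptide) ## Y, three residues, then E or D

## Fractional position of each hit along the peptide.
two / nchar(peptide)
three / nchar(peptide)
```

Here the first pattern matches at position 2 (YAAE) and the second at position 8 (YGGGD), so the hits fall at 2/14 and 8/14 of the peptide length.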

get_hits <- function(patterns, peptide_file, pct_limit=0.25) {
  pep_seq <- Biostrings::readAAStringSet(peptide_file)
  pep_lst <- as.data.frame(pep_seq)[["x"]]
  pos_df <- data.frame(stringsAsFactors=FALSE)
  pep_names <- names(pep_seq)
  rowname <- ""
  for (r in seq_along(pep_names)) {
    rowname <- pep_names[r]
    new_row <- vector()
    found_hits <- FALSE
    for (p in seq_along(patterns)) {
      pat <- patterns[[p]]
      pname <- names(patterns)[p]
      ## Match only the current peptide rather than rescanning every sequence on each iteration.
      column <- gregexpr(pattern=pat, text=pep_lst[r])
      row <- column[[1]]
      len <- nchar(pep_lst[r])
      pct <- as.numeric(row) / len
      cdist <- len - as.numeric(row)
      pos_name <- glue::glue("pos_{pname}")
      num_name <- glue::glue("number_{pname}")
      nt_name <- glue::glue("nt_{pname}")
      ct_name <- glue::glue("ct_{pname}")
      pct_name <- glue::glue("pct_{pname}")
      ## An important caveat here: there may be more than one hit per peptide,
      ## so positions are stored as a comma-separated string.
      new_row["rowname"] <- rowname
      number_hits <- length(as.numeric(row))
      if (as.numeric(row)[1] == -1) {
        number_hits <- "0"
      }
      new_row[num_name] <- number_hits
      new_row[pos_name] <- toString(as.numeric(row))
      if (new_row[pos_name] == "-1") {
        new_row[pos_name] <- "0"
      } else {
        found_hits <- TRUE
      }
      new_row["length"] <- len
      op <- options(warn=2)
      if (pct[1] < 0) {
        new_row[pct_name] <- "0"
      } else {
        new_row[pct_name] <- toString(pct)
      }
      options(op)
      nt_str <- ""
      ct_str <- ""
      max_pct <- 1 - pct_limit
      for (i in pct) {
        if (i < 0) {
          next
        }
        if (i < pct_limit) {
          nt_str <- paste0(nt_str, ", ", i)
        }
        if (i > max_pct) {
          ct_str <- paste0(ct_str, ", ", i)
        }
      }
      nt_str <- gsub(pattern="^\\, ", replacement="", x=nt_str)
      ct_str <- gsub(pattern="^\\, ", replacement="", x=ct_str)
      new_row[nt_name] <- nt_str
      new_row[ct_name] <- ct_str
    }
    if (isTRUE(found_hits)) {
      pos_df <- rbind(pos_df, new_row, stringsAsFactors=FALSE)
      colnames(pos_df) <- names(new_row)
    }
  } ## End for every gene
  rownames(pos_df) <- pos_df[["rowname"]]
  ## Caution: this drops the first row; confirm it is a placeholder and not the first real hit.
  pos_df <- pos_df[-1, ]

  return(pos_df)
}
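The N-terminal/C-terminal bookkeeping inside the loop can also be expressed in a vectorized form. This is a sketch of the same thresholding idea, assuming hit positions are already numeric; `classify_terminal` is a hypothetical name and is not part of `get_hits()`.

```r
## Hypothetical helper: split hit positions into those near the N terminus
## (fraction below pct_limit) and those near the C terminus
## (fraction above 1 - pct_limit), mirroring the loop in get_hits().
classify_terminal <- function(positions, seq_length, pct_limit=0.25) {
  frac <- positions / seq_length
  list(
    nt = frac[frac < pct_limit],
    ct = frac[frac > 1 - pct_limit]
  )
}

## Example: hits at residues 2, 8 and 13 of a 14-residue peptide.
res <- classify_terminal(c(2, 8, 13), 14)
res$nt
res$ct
```

In this example only the hit at residue 2 counts as N-terminal and only the hit at residue 13 as C-terminal; the hit at residue 8 falls in the middle and is ignored.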

##patterns <- c("Y(E|D)", "Y.(E|D)", "Y..(E|D)", "Y...(E|D)", "Y....(E|D)", "Y.....(E|D)")
mtb_cds <- "reference/mtb_cds.fasta"
patterns <- c("Y..(E|D)", "Y...(E|D)")
names(patterns) <- c("two", "three")
yed_patterns <- get_hits(patterns, mtb_cds)
written <- write_xls(yed_patterns, excel="positions_by_patterns.xlsx")
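Rather than typing each pattern by hand, the set of gapped patterns could be generated from the desired gap sizes. This is a minimal sketch; `make_yed_patterns` and the `gap_N` naming scheme are hypothetical and differ from the "two"/"three" names used above.

```r
## Hypothetical helper: build "Y", n wildcard residues, then "(E|D)"
## for each requested gap size n.
make_yed_patterns <- function(gaps) {
  pats <- vapply(gaps,
                 function(n) paste0("Y", strrep(".", n), "(E|D)"),
                 character(1))
  names(pats) <- paste0("gap_", gaps)
  pats
}

## Gaps of 0 through 5 reproduce the commented-out pattern list above.
all_patterns <- make_yed_patterns(0:5)
all_patterns
```

The resulting named vector could be passed to `get_hits()` in place of the hand-written `patterns`, with the column names following the `gap_N` names.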

2.6 Compare subset of interesting proteins vs. all

Najib and Volker requested a comparison of the distribution of pattern hits across all proteins vs. the distribution within a specific subset of proteins. I don’t actually know the subset desired, so I will assume the

## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 083922869a37724ece10beed7b0bb758a179fdfb
## This is hpgltools commit: Thu Oct 17 11:43:00 2019 -0400: 083922869a37724ece10beed7b0bb758a179fdfb
## Saving to 01_annotation_20190801-v20190801.rda.xz
