I am hoping to learn some tasks one may perform for high-throughput population genetics. This is a realm in which I have very limited understanding, so for the moment I am relying upon the ANGSD and PLINK documentation and hoping that I will understand more as I continue through them.
One important caveat about the following. In this iteration I am only including samples for which we have known annotations of the zymodeme. At the time of this writing, that limits the scope of this document to 35 samples (16 2.2 and 19 2.3 samples). I do not really know if this is sufficient for a valid analysis, but it does not seem unreasonable for a first pass.
My understanding: the first task to perform is to read in the alignments against the reference genome and filter them. Thus I created a series of text files with the relative locations of the relevant bamfiles. I initially separated them by zymodeme, but came to a realization that it is better to keep them together.
In addition, I used this ‘lst’ file to write out a ‘set’ file in the format expected by PLINK. Here is a short bash block showing what I did:
## I symlinked the preprocessing directory from my main tree.
ls -ld preprocessing
head all_samples.lst
head zymodeme_plink_sets.txt## lrwxrwxrwx 1 trey hpgl 16 Aug 3 11:30 preprocessing -> ../preprocessing
## preprocessing/TMRC20002/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20005/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20009/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20011/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20012/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20014/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20017/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20039/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20041/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20049/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## ZY22
## preprocessing/TMRC20002/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20005/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20009/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20011/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20012/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20014/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20017/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20039/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
## preprocessing/TMRC20041/outputs/hisat2_lpanamensis_v36/r1_trimmed.bam
Given these text inputs, I followed the snpFilters section of the ANGSD manual:
http://popgen.dk/angsd/index.php/SnpFilters
in order to perform what I think are appropriate filters and initial statistics as per the following block. I chose not to do genotype calling because the authors suggested against it:
“We really don’t recommend doing analysis based on called genotypes, but incorporate the uncertainty directly into the analysis you want to perform. But we recognise that many methods are still relying on called genotypes, and have therefore implemented a basic genotype caller into angsd.”
module add angsd
angsd -b all_samples.lst \
-doHWE 1 -GL 1 \
-doMajorMinor 1 -doMaf 2 \
-snp_pval 1e-2 -P 5 \
-dosnpstat 1 -out all_filtered
## Some other possible likely options:
## -doGeno performs genotype calling. It takes the sum of:
## 1: major/minor calls,
## 2: write out called genotype encoded as -1,0,1,2...
## 4: write called genotype directly as AA,AC,...
## 8: write posterior probabilities of all genotypes
## 16,32: other posteriors
## It looks like a likely value is therefore -doGeno 3 or 55
##
## -doPost I think is responsible for doing the posterior
## probabilities, the documentation is a little sparse here.The resulting output files are here:
ls -al all_filtered*
## Frequencies of each variant in all samples and some stats about them,
## more to the point, these are Hardy Weinberg equilibrium test results.
## This file was created because of the 'doHWE' argument above.
less all_filtered.hwe.gz | head
## Relative frequency of each allele for every site in the data.
## I think that this should be closely related to the HWE above.
## This file was created because of the 'doMaf' argument above.
less all_filtered.mafs.gz | head
## Summary statistics for every variant.
## This file was created because of the 'doSnpStat' above.
less all_filtered.snpStat.gz | head## -rw-rw-r-- 1 trey 10186 8192 Aug 2 14:27 all_filtered.arg
## -rw-rw-r-- 1 trey 10186 2082575 Aug 2 19:10 all_filtered.hwe.gz
## -rw-rw-r-- 1 trey 10186 986600 Aug 2 19:10 all_filtered.mafs.gz
## -rw-rw-r-- 1 trey 10186 5466471 Aug 2 19:15 all_filtered.snpStat.gz
## Chromo Position Major Minor hweFreq Freq F LRT p-value
## LpaL13_01 44 C T 0.031864 0.059486 0.999106 1.211539e+01 5.000728e-04
## LpaL13_01 63 T C 0.095021 0.202719 0.999691 1.712653e+01 3.497047e-05
## LpaL13_01 66 T C 0.091056 0.186261 0.999670 1.527911e+01 9.273653e-05
## LpaL13_01 160 C A 0.101253 0.091189 0.999397 1.791205e+01 2.313519e-05
## LpaL13_01 258 G A 0.028581 0.028755 -0.026026 4.757321e-02 8.273412e-01
## LpaL13_01 260 A G 0.028582 0.028756 -0.026027 4.757476e-02 8.273384e-01
## LpaL13_01 262 A G 0.028642 0.028818 -0.026101 4.780990e-02 8.269190e-01
## LpaL13_01 295 C T 0.206553 0.208134 -0.262234 3.614116e+00 5.729116e-02
## LpaL13_01 444 A C 0.021996 0.030348 0.998301 7.336812e+00 6.755654e-03
## chromo position major minor unknownEM pu-EM nInd
## LpaL13_01 44 C T 0.031862 5.854110e-05 34
## LpaL13_01 63 T C 0.095021 0.000000e+00 33
## LpaL13_01 66 T C 0.091057 0.000000e+00 33
## LpaL13_01 160 C A 0.101254 0.000000e+00 33
## LpaL13_01 258 G A 0.028577 1.901286e-04 32
## LpaL13_01 260 A G 0.028575 3.826105e-04 32
## LpaL13_01 262 A G 0.028635 3.811637e-04 32
## LpaL13_01 295 C T 0.206554 0.000000e+00 33
## LpaL13_01 444 A C 0.021995 3.330669e-16 33
## Chromo Position +Major +Minor -Major -Minor SB1:SB2:SB3 HWE_LRT:HWE_pval baseQ_Z:baseQ_pval mapQ_Z:mapQ_pval edge_z:edge_pval
## LpaL13_01 44 634 9 125 0 1.194401:0.014163:0.368423 12.115394:5.000727e-04 -1.975427:4.821962e-02 -3.989394:6.627263e-05 -3.040978:2.358248e-03
## LpaL13_01 63 413 23 260 0 1.596330:0.054555:0.000025 17.126526:3.497047e-05 -1.694273:9.021327e-02 -8.113846:4.440892e-16 -1.390011:1.645257e-01
## LpaL13_01 66 361 23 301 0 1.783854:0.061977:0.000001 15.279108:9.273653e-05 -1.452362:1.464010e-01 -8.123046:4.440892e-16 -2.426678:1.523779e-02
## LpaL13_01 160 1012 120 854 70 0.327331:0.107952:0.021459 17.912048:2.313519e-05 -6.342156:2.275872e-10 -4.224533:2.395913e-05 -6.494033:8.397893e-11
## LpaL13_01 258 488 3 16 0 1.032587:0.006146:1.000000 0.047573:8.273412e-01 -1.588945:1.120728e-01 -0.515814:6.059841e-01 -2.440731:1.465759e-02
## LpaL13_01 260 483 3 16 0 1.032922:0.006210:1.000000 0.047575:8.273384e-01 -1.485042:1.375328e-01 -0.520962:6.023931e-01 -2.530958:1.137518e-02
## LpaL13_01 262 481 3 17 0 1.035124:0.006236:1.000000 0.047810:8.269190e-01 -1.464012:1.431908e-01 -0.516004:6.058515e-01 -2.696022:7.017410e-03
## LpaL13_01 295 1053 106 348 103 -1.054767:0.238448:0.000000 3.614116:5.729116e-02 -3.947932:7.886299e-05 -4.633368:3.601517e-06 -8.565749:0.000000e+00
## LpaL13_01 444 97 3 992 3 4.924749:0.030074:0.012061 7.336812:6.755654e-03 -2.679576:7.371634e-03 0.000000:1.000000e+00 -0.678308:4.975762e-01
This provides a distribution of alleles at polymorphic sites. As the authors put it, “inferring the SFS from low coverage sequencing data in a straightforward manner by using genotype calls can lead to significant bias.”
Again, I am relatively ignorant of population genetics, so I am paraphrasing the authors’ paper; but it looks like these statistics provide proxies for measurements of various processes which may be uncovered by a population genetics study: expansions, bottlenecks, migration. It also appears to be the starting statistic for later analyses.
In order to perform this, one must have an ancestral genome and the set of alignments. In our case, the ancestral must be the panamensis reference, which I copied to reference/lpanamensis.fasta
I will run it once with and once without filtering.
module add angsd
## Index the reference
samtools faidx reference/lpanamensis.fasta
## The doSaf argument may be 1-4:
## 1: assume HWE
## 2: include inbreeding coefficients (which we cannot)
## 3: Use a prior distribution for all sites to calculate posterior
## probabilities, potentially two pass from an existing -doSaf 1
## 4: Calculate posterior probabilities from genotype probabilities,
## potentially two pass after using the '-beagle' options.
## Do no filtering
angsd -b all_samples.lst \
-gl 1 -anc reference/lpanamensis.fasta \
-doSaf 1 -out all_saf_unfiltered
realSFS all_saf_unfiltered.saf.idx > all_saf_unfiltered_realsfs.txt
## Filter for high quality hits
angsd -b all_samples.lst \
-gl 1 -anc reference/lpanamensis.fasta \
-doSaf 1 -baq 1 \
-C 50 -minMapQ 30 \
-minQ 20 -out all_saf_filtered
realSFS all_saf_filtered.saf.idx > all_saf_filtered_realsfs.txtpander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))