Examining some R.norvegicus scRNASeq data.
Table of Contents
1. Introduction
This data set is with Dr. Li's lab and was initially examined by April; my current goal is to poke around and recapitulate her observations.
I therefore downloaded new versions of the various cellranger suite tools and am now going to her home directory to see what was done and what I can observe.
export BASE=/scratch/atb/scrnaseq/rattus_norvegicus_202602
1.1. Indices
Being lazy, I copied April's reference material to /scratch/atb/scrnaseq/rattus_norvegicus_2026/reference and reorganized it slightly so that I do not get confused (e.g. genus_species_method).
Thus I now have $BASE/reference/rnorvegicus_atac and rnorvegicus_3p which were created by April for the ATAC and 'normal' libraries respectively. April has one important question: are these libraries the correct species for the animals used in Phoebe's experiments. These are specific to the mRatBN7.2 assembly: The BN/NHsdMcwi strain of the norway rat.
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_015227675.2/
We should check in with Phoebe but I am betting this assembly is better than close enough.
Thus, let us take a moment and record April's commands just in case we need to regenerate these with a different assembly (we won't).
1.1.1. ATAC reference
She wrote the rat.config with the following:
{ organism: "Norway Rat" genome: ["mRatBN7-2"] input_fasta: ["../Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa"] input_gtf: ["../Rattus_norvegicus.mRatBN7.2.113.gtf"] non_nuclear_contigs: ["MT"] localcores: "16" }and invoked
/opt/cellranger-atac-2.2.0/bin/cellranger-atac mkref --config=rat.config
an optional step is to filter the rnorvegicus gtf file via cellranger mkgtf; I am going to ignore this step for the moment.
1.1.2. ARC reference
We don't have an example right now because ARC requires the 3p data to run; so leave this aside for the moment. The caveat here is that we should only be using ARC/ATAC if the combined 3' and ATAC libraries do not work as well as we want.
1.1.3. 3p reference
It looks like this does not have a config file and just uses the 'normal' cellranger mkref command; no wait, April is saying this is potentially only a reference valid for normal cellranger, so let us take a moment to look for the arc.
/opt/cellranger-10.0.0/bin/cellranger mkref --ref-version="2024-A" --genome="mRatBN7-2" --fasta=Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa --genes=Rattus_norvegicus.mRatBN7.2.113.filtered.gtf
1.2. Running the ATAC counter
Given the references above, April ran the following:
/opt/cellranger-atac-2.2.0/bin/cellranger-atac count --id 1_saline --reference rattus_norvegicus/mRatBN7-2/ --fastqs ~/nas-read-only/260117_VL00136_113_AAHNVMMM5/Analysis/1/Data/fastq/ --sample 1_saline_ATAC --localmem 90 --localcores 30 /opt/cellranger-atac-2.2.0/bin/cellranger-atac count --id 2_oxycodone --reference rattus_norvegicus/mRatBN7-2/ --fastqs ~/nas-read-only/260117_VL00136_113_AAHNVMMM5/Analysis/1/Data/fastq/ --sample 2_oxycodone_ATAC --localmem 90 --localcores 30 /opt/cellranger-atac-2.2.0/bin/cellranger-atac count --id 3_saline --reference rattus_norvegicus/mRatBN7-2/ --fastqs ~/nas-read-only/260117_VL00136_113_AAHNVMMM5/Analysis/1/Data/fastq/ --sample 3_saline_ATAC --localmem 90 --localcores 30 /opt/cellranger-atac-2.2.0/bin/cellranger-atac count --id 4_oxycodone --reference rattus_norvegicus/mRatBN7-2/ --fastqs ~/nas-read-only/260117_VL00136_113_AAHNVMMM5/Analysis/1/Data/fastq/ --sample 4_oxycodone_ATAC --localmem 90 --localcores 30
Since I moved this around and installed via modules, I will be slightly different, I moved the raw files into separate directories and moved the indexes a little.
Note, I created a test sample using the first 100,000 sequences from the 1_saline sample; so let us try out arguments etc there.
Note: 10x provides different kits, this was done with the 'epi multiome kit' which does 3p+ATAC; other kits do each step separately. When using the multiome kit, one must specify –chemistry=ARC-v1 presumably if I go poking around at 10xgenomics I will figure this out…
module add cellranger_atac ref="${BASE}/reference/rnorvegicus_atac/mRatBN7-2/" pre="${BASE}/preprocessing" ids="test" chemistry="--chemistry=ARC-v1" for id in $ids; do echo $id cmd="cellranger-atac count ${chemistry} --id ${id}_out --reference $ref --fastqs ${pre}/${id}/ --sample ${id}" ls -la $pre/$id echo "I would run $cmd" $cmd done
test total 20099 drwxrwxr-x 2 abelew hpgl 6 Feb 2 13:17 . drwxrwxr-x 9 abelew hpgl 9 Feb 2 13:16 .. -rw-rw-r-- 1 abelew hpgl 10256496 Feb 2 13:17 test_I1.fastq -rw-rw-r-- 1 abelew hpgl 13456496 Feb 2 13:17 test_I2.fastq -rw-rw-r-- 1 abelew hpgl 19256496 Feb 2 13:17 test_R1.fastq -rw-rw-r-- 1 abelew hpgl 19256496 Feb 2 13:17 test_R2.fastq I would run cellranger-atac count --chemistry=ARC-v1 --id test_out --reference /scratch/atb/scrnaseq/rattus_norvegicus_202602/reference/rnorvegicus_atac/mRatBN7-2/ --fastqs /scratch/atb/scrnaseq/rattus_norvegicus_202602/preprocessing/test/ --sample test bash: cellranger-atac: command not found
module add cellranger_atac cd $BASE ref="${BASE}/reference/rnorvegicus_atac/mRatBN7-2/" pre="${BASE}/preprocessing" ids="1_saline 2_oxycodone 3_saline 4_oxycodone" for id in $ids; do echo $id cmd="cellranger-atac count ${chemistry} --id ${id} --reference $ref --fastqs ${pre}/${id}/ --sample ${id}_ATAC" ls -la $pre/$id echo "I would run $cmd" done
1_saline total 4629096 drwxrwxr-x 2 abelew hpgl 6 Feb 2 13:08 . drwxrwxr-x 9 abelew hpgl 9 Feb 2 13:16 .. -rw-rw---- 1 abelew hpgl 741393405 Jan 18 04:50 1_saline_ATAC_S1_I1_001.fastq.gz -rw-rw---- 1 abelew hpgl 897325116 Jan 18 04:50 1_saline_ATAC_S1_I2_001.fastq.gz -rw-rw---- 1 abelew hpgl 1567315767 Jan 18 04:50 1_saline_ATAC_S1_R1_001.fastq.gz -rw-rw---- 1 abelew hpgl 1533341528 Jan 18 04:50 1_saline_ATAC_S1_R2_001.fastq.gz I would run cellranger-atac count --chemistry=ARC-v1 --id 1_saline --reference /scratch/atb/scrnaseq/rattus_norvegicus_202602/reference/rnorvegicus_atac/mRatBN7-2/ --fastqs /scratch/atb/scrnaseq/rattus_norvegicus_202602/preprocessing/1_saline/ --sample 1_saline_ATAC 2_oxycodone total 7101971 drwxrwxr-x 2 abelew hpgl 6 Feb 2 13:08 . drwxrwxr-x 9 abelew hpgl 9 Feb 2 13:16 .. -rw-rw---- 1 abelew hpgl 1129819923 Jan 18 04:53 2_oxycodone_ATAC_S2_I1_001.fastq.gz -rw-rw---- 1 abelew hpgl 1352714185 Jan 18 04:51 2_oxycodone_ATAC_S2_I2_001.fastq.gz -rw-rw---- 1 abelew hpgl 2400197776 Jan 18 04:52 2_oxycodone_ATAC_S2_R1_001.fastq.gz -rw-rw---- 1 abelew hpgl 2388428710 Jan 18 04:53 2_oxycodone_ATAC_S2_R2_001.fastq.gz I would run cellranger-atac count --chemistry=ARC-v1 --id 2_oxycodone --reference /scratch/atb/scrnaseq/rattus_norvegicus_202602/reference/rnorvegicus_atac/mRatBN7-2/ --fastqs /scratch/atb/scrnaseq/rattus_norvegicus_202602/preprocessing/2_oxycodone/ --sample 2_oxycodone_ATAC 3_saline total 7501859 drwxrwxr-x 2 abelew hpgl 6 Feb 2 13:08 . drwxrwxr-x 9 abelew hpgl 9 Feb 2 13:16 .. -rw-rw---- 1 abelew hpgl 1160021614 Jan 18 04:51 3_saline_ATAC_S3_I1_001.fastq.gz -rw-rw---- 1 abelew hpgl 1425810317 Jan 18 04:53 3_saline_ATAC_S3_I2_001.fastq.gz -rw-rw---- 1 abelew hpgl 2539689802 Jan 18 04:51 3_saline_ATAC_S3_R1_001.fastq.gz -rw-rw---- 1 abelew hpgl 2555076313 Jan 18 04:51 3_saline_ATAC_S3_R2_001.fastq.gz I would run cellranger-atac count --chemistry=ARC-v1 --id 3_saline --reference /scratch/atb/scrnaseq/rattus_norvegicus_202602/reference/rnorvegicus_atac/mRatBN7-2/ --fastqs /scratch/atb/scrnaseq/rattus_norvegicus_202602/preprocessing/3_saline/ --sample 3_saline_ATAC 4_oxycodone total 7015235 drwxrwxr-x 2 abelew hpgl 6 Feb 2 13:08 . drwxrwxr-x 9 abelew hpgl 9 Feb 2 13:16 .. -rw-rw---- 1 abelew hpgl 1097052924 Jan 18 04:51 4_oxycodone_ATAC_S4_I1_001.fastq.gz -rw-rw---- 1 abelew hpgl 1330497254 Jan 18 04:52 4_oxycodone_ATAC_S4_I2_001.fastq.gz -rw-rw---- 1 abelew hpgl 2374261791 Jan 18 04:50 4_oxycodone_ATAC_S4_R1_001.fastq.gz -rw-rw---- 1 abelew hpgl 2380564230 Jan 18 04:52 4_oxycodone_ATAC_S4_R2_001.fastq.gz I would run cellranger-atac count --chemistry=ARC-v1 --id 4_oxycodone --reference /scratch/atb/scrnaseq/rattus_norvegicus_202602/reference/rnorvegicus_atac/mRatBN7-2/ --fastqs /scratch/atb/scrnaseq/rattus_norvegicus_202602/preprocessing/4_oxycodone/ --sample 4_oxycodone_ATAC
pwd
/z1/scratch/atb/scrnaseq/rattus_norvegicus_202602
2. Manual processing as if these were bulk samples
I am going to treat these samples in an identical fashion to how I am treating some L.donovani samples; my current understanding of this process is in /scratch/atb/atacseq/ldonovani_2021
Here is a copy/paste of the steps outlined in that paper: (1)
- Trim via trimomatic/trimgalore/fastp/cutadapt/etc
- Align against mtDNA
- Align against target species without mtDNA aligned reads a. Run kraken using the these reads
- Collect fragment sizes via picard/gatk (plot via R ATACseqQC)
- Mark duplicates via gatk
- Adjust Tn5 cut sites via alignmentSieve from deepTools (–ATACshift)
- Call peaks with MACS2
- FriP score calculation via plotEnrichment from deepTools (BPM normalized)
- Take fragments <100bp as nucleosome free; use fragments mapped to genes for TSS enrichment plots (Done via deepTools: computeMatrix and plotProfile)
- Annotate peaks via HOMER: annotatePeaks.pl
- Find consensus peaks across samples via multiintersect from deepTools
Visual inspection via IGV
module add cyoa/202410 species=rnorvegicus_norway_grcr8
2.1. Organize reads
cd preprocessing start=$(pwd) samples=$(/bin/ls -d [0-9]*) for i in $samples; do pushd $i mkdir -p unprocessed mv *.fastq.gz unprocessed/ popd done
2.2. Trim reads
The ATAC protocols I have been reading are for bulk reads and as a result do not take into account the cell barcodes; I may therefore need to use UMItools or similar in order to properly perform this step. I did poke around in the cellranger output directories and it looks to me like they did not, so I assume a default trimomatic run should work fine.
for i in ${samples}; do pushd ${start}/${i} inputs=$(/bin/ls unprocessed/*_R[1-2]_* | tr '\n' ':') cyoa --method trim --input ${inputs} popd done
2.3. Remove mtDNA reads
There are a couple of other goals here: I would like a sense of the percentage of reads which are mitochondrial in origin. I also intend to query my bacterial kraken database and would to make sure the mitochondrial reads are not skewing the results.
for i in ${samples}; do pushd ${start}/${i} inputs=$(/bin/ls outputs/01trimomatic/*-trimmed.fastq.xz | tr '\n' ':') cyoa --method hisat --introns 0 --libtype mt --input ${inputs} --species rnorvegicus \ --gff_type gene --gff_tag gene popd done
2.4. Query the R.norvegicus genome
I am using the NCBI norway rat BN/NHsdMcwi assembly: GCA_036323735.1 and named it: rnorvegicus_norway_grcr8
echo "Making sure start is defined: ${start}." echo "Making sure samples are defined: ${samples}." for i in ${samples}; do pushd ${start}/${i} inputs=$(/bin/ls outputs/40hisat_rnorvegicus/unalcon*.fastq.xz | tr '\n' ':') cyoa --method hisat --introns 0 --input ${inputs} --species $species \ --gff_type gene --gff_tag ID popd done
2.5. Query unmapped reads via kraken
Note, I ran this initially while some jobs were still actively compressing the outputs from hisat2, as a result it is likely that one sample (#1 is wrong), I will rerun it later, but I am super curious to see what these unmapped reads are.
echo "Making sure start is defined: ${start}." echo "Making sure samples are defined: ${samples}." for i in ${samples}; do pushd ${start}/${i} inputs=$(/bin/ls outputs/40hisat_rnorvegicus_norway_grcr8/unalcon*.fastq.xz | tr '\n' ':') echo $inputs cyoa --method kraken --input ${inputs} --library bacteria popd done
2.6. Deduplication
gatk may be used to pretty easily check for identical reads.
echo "Making sure start is defined: ${start}." echo "Making sure samples are defined: ${samples}." for i in ${samples}; do pushd ${start}/${i} input="outputs/40hisat_rnorvegicus_norway_grcr8/rnorvegicus_norway_grcr8_genome.bam" echo $input ls -ld $input cyoa --method gatkdedup --input ${input} popd done
2.7. Peak Detection
echo "Checking variables:" echo "$start" echo "$samples" echo "$species" for i in ${samples}; do pushd ${start}/${i} input="outputs/50gatk_dedup/deduplicated.bam" echo $input cyoa --method atac --input ${input} --species ${species} popd done
3. Bibliography
4. PostScript
Download the original document text.