dnacount.go

A quick script in Go that counts base pair frequencies in genomes, parallelized across the named FASTA labels/regions. With dnacount.go, I computed that the human genome has a GC bias of 41%. See the results at the bottom of the page.

The only dependency is https://github.com/schollz/progressbar, which shows a progress bar for very long sequences.

Build

go build

Execute

./dnacount data/repeat_GCF_000863945.3_ViralProj15505_genomic.fna

returns

Loaded 'data/GCF_000863945.3_ViralProj15505_genomic.fna' into RAM
 100% |████████████████████████████████████████| [0s:0s]
Total length 8006
G 19.12%
C 17.42%
T 30.60%
A 32.86%
GC bias of 36.54%

Entire Human Genome

Download reference human genome https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/

wget -O human_genome.zip "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCF_000001405.40/download?include_annotation_type=GENOME_FASTA" && unzip human_genome.zip -d data/human_genome && rm -fr human_genome.zip

Execute

time ./dnacount data/human_genome/ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna

returns

Loaded 'data/human_genome/ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna' into RAM
 100% |████████████████████████████████████████| [32s:0s]
Total length 3339662079
G 20.57%
C 20.48%
T 29.52%
A 29.43%
GC bias of 41.05%
./dnacount   291.64s user 3.66s system 733% cpu 40.242 total
