Using Codes in Place of DNA Samples in Databases to Reduce Storage
Abstract
Biological data mainly comprises deoxyribonucleic acid (DNA) and protein sequences. These are
biomolecules that are present in all human cells. Because DNA is self-replicating, it is a key
constituent of the genetic material that exists in all living organisms. DNA carries the genetic
information required for the functioning and development of all living beings. Storing the DNA data
of a single person requires 10 CD-ROMs. In this paper, a lossless three-phase
compression algorithm is presented for DNA sequences. In the first phase, the dataset is segmented
into tetra groups (groups of four bases), and each resulting group is replaced by a unique
number (e.g., an array index). In the second phase, binary code is generated on the basis of the
array index numbers. In the last phase, a modified version of Run Length Encoding (RLE) is applied
to the dataset.
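The three phases can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper, and it rests on several assumptions that the abstract does not state: the alphabet is limited to A, C, G and T, the lookup table lists all 4^4 = 256 tetra groups in lexicographic order, each index is written as an 8-bit code, leftover bases that do not fill a tetra group are kept as a plain tail, and a plain RLE stands in for the modified RLE. All function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"  # assumption: only the four standard bases occur
# Lookup table of all 4**4 = 256 tetra groups; a group's position is its code.
TETRAS = ["".join(p) for p in product(BASES, repeat=4)]
INDEX = {tetra: i for i, tetra in enumerate(TETRAS)}


def to_indices(seq):
    """Phase 1: split into groups of four bases and map each to its array index."""
    cut = len(seq) - len(seq) % 4  # bases past `cut` do not fill a group
    indices = [INDEX[seq[i:i + 4]] for i in range(0, cut, 4)]
    return indices, seq[cut:]


def to_binary(indices):
    """Phase 2: write each array index as an 8-bit binary code."""
    return [format(i, "08b") for i in indices]


def rle_encode(codes):
    """Phase 3: plain RLE (a stand-in for the modified RLE) as (code, run) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [(code, count) for code, count in runs]


def compress(seq):
    indices, tail = to_indices(seq)
    return rle_encode(to_binary(indices)), tail


def decompress(runs, tail):
    """Invert all three phases to check that the scheme is lossless."""
    return "".join(TETRAS[int(code, 2)] * count for code, count in runs) + tail
```

For example, compress("ACGTACGTACGTTTTTGA") groups the input as ACGT, ACGT, ACGT, TTTT with the tail GA, and decompress restores the original string from the result. A sequence containing other symbols, such as N, raises a KeyError in this sketch and would need an extended table.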
The proposed technique has been implemented and its performance measured on samples.
After storing different DNA samples, it achieved the best average compression ratio.