DNA sonification refers to the use of audio to convey the information content of DNA sequence data. It provides an interesting adjunct to standard visualization of DNA sequence data. The 4 bases (namely G, A, T and C) that make up the DNA sequence are processed from left to right in a linear fashion. To achieve this, a dynamic web tool has been created in which DNA sequences are processed to produce audio output.
Recently there has been much interest in DNA sonification in light of advancements in DNA sequencing technology and the benefits thereof. Gene coding regions of the genome are essentially highly ordered sequences of DNA, whereby the genetic code relates the coding sequence of DNA to the amino acid residues of a protein. However, according to our current knowledge base, much of the DNA outside of gene coding regions has a lower sequence complexity.
Two vastly different approaches have previously been taken to sonify DNA, aimed at outcomes pertinent to either the arts or the sciences. One approach essentially treats DNA as a random sequence for the purpose of generative music synthesis, whereas the other assumes a non-random sequence and therefore takes into account basic chemical or biological properties during sonification. We have focused on the latter approach in this work.
From a scientific perspective the basic challenge of DNA sonification is to use audio cues to distinguish a DNA sequence that is a highly ordered gene coding region from one of low complexity. Towards achieving this, various algorithms have been established to map nucleotide motifs to musical notes. In the most rudimentary algorithm, each of the 4 individual nucleotide bases (G, A, T or C) is considered to be a motif and is mapped to one of four musical notes. However, given the complexity of DNA sequences, this mapping is ineffectual and is included only for the sake of completeness.
Treating pairs of nucleotides as motifs provides for 16 notes, which again does not do justice to the complexity of most DNA sequences. The most useful approach is to mirror the genetic code and treat each group of three nucleotide bases as a motif to map to a note. In theory a total of 64 codons exist; in biology, however, these typically give rise to only 20 amino acid residues of proteins. This approach to note assignment could clearly be extended to map larger groupings of nucleotides to an ever increasing range of notes. For instance, 4 or 5 nucleotide motifs could theoretically be mapped to 256 or 1024 notes, respectively. Whilst this has no basis in biology, it is an interesting proposition for generative music aficionados. Given a typical hearing range and the number of discrete notes on musical instruments, this provides for more notes than can be sounded. One solution could be to map these motifs to micro-tonal scales using intervals smaller than semitones; however, this approach was not pursued at this stage.
|Number of motifs||Motif identifier (Motif ID)|
|16 (4 x 4)||AT = #7|
|64 (4 x 16)||GCT = #7|
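The numbering of motifs can be illustrated with a minimal sketch that treats a motif as a base-4 number. The base ordering `GATC` is an assumption taken from the order in which the bases are listed in the text, and the function names are hypothetical. With this ordering the dinucleotide AT receives ID 7, as in the example above; the trinucleotide example (GCT = #7) evidently relies on a different ordering, which is not specified here.

```python
from itertools import product

# Assumed base ordering (G, A, T, C), following the order used in the text.
BASES = "GATC"


def motif_count(k: int) -> int:
    """Number of distinct motifs of length k (4 ** k)."""
    return len(BASES) ** k


def motif_id(motif: str) -> int:
    """1-based motif ID obtained by reading the motif as a base-4 number."""
    value = 0
    for base in motif.upper():
        value = value * len(BASES) + BASES.index(base)
    return value + 1


def all_motifs(k: int) -> list[str]:
    """All motifs of length k, listed in motif ID order."""
    return ["".join(p) for p in product(BASES, repeat=k)]


if __name__ == "__main__":
    for k in range(1, 6):
        print(k, motif_count(k))
    print(motif_id("AT"))
```

Motif lengths of 1 to 5 give 4, 16, 64, 256 and 1024 motifs, which is where the note counts quoted above come from.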
Six DNA sonification algorithms have been scripted to associate a DNA motif with a specific motif identifier. Each of these is further processed to produce a distinct mix of instrument and note identifiers to be assigned to musical notes. The motif identifiers are numbered 1-4, 1-16 or 1-64 depending on the algorithm. These motif identifiers are further processed using additional parameters to establish a musical key, note intervals, note length, note timing and tempo. The notes are then assigned to an octave suitable for the selected instrument. All audio is generated dynamically and the audio output is streamed in real time.
Irrespective of the algorithm used, Motif ID 1 is always assigned to the root note of a musical key, and the octave is set by the lower pitch range of the assigned musical instrument. These assignments are made using MIDI note numbers. For each instrument there are 128 MIDI note numbers (0-127), spanning a little over 10 octaves. The interval between notes is governed by the scale used to sonify the motifs. For instance, the repeating semitone intervals of the natural minor scale (2, 1, 2, 2, 1, 2, 2) or the blues scale (3, 2, 1, 1, 3, 2) are used to assign sequential motif numbers to musical notes. Clearly the choice of key and scale determines the actual notes used in DNA sequence sonification.
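The assignment of sequential motif IDs to notes of a scale can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the function name and the choice of root notes are hypothetical, while the interval patterns are the ones quoted above.

```python
from itertools import cycle, islice

# Repeating semitone interval patterns quoted in the text.
NATURAL_MINOR = (2, 1, 2, 2, 1, 2, 2)
BLUES = (3, 2, 1, 1, 3, 2)


def motif_to_midi(motif_id: int, root: int, intervals) -> int:
    """Map a 1-based motif ID to a MIDI note number.

    Motif ID 1 is the root note; each further ID moves up by the next
    interval of the (cyclically repeated) scale pattern.
    """
    if motif_id < 1:
        raise ValueError("motif IDs start at 1")
    note = root + sum(islice(cycle(intervals), motif_id - 1))
    if not 0 <= note <= 127:
        raise ValueError(f"MIDI note {note} is outside the range 0-127")
    return note


if __name__ == "__main__":
    # A natural minor starting on MIDI note 57 (an assumed root).
    print([motif_to_midi(i, 57, NATURAL_MINOR) for i in range(1, 9)])
```

Walking 64 motif IDs up a seven-note scale covers nine octaves, so with most root notes the upper IDs fall outside the MIDI range; the sketch raises an error in that case rather than guessing how the tool handles it.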
Whilst each of the algorithms produces an audio output with interesting characteristics, the most useful algorithm for DNA sequence analysis uses codons (motifs of three nucleotides) mapped to 21 musical notes. In this approach tri-nucleotides are processed in a way analogous to the biological rules of the genetic code (in which a codon consists of three consecutive bases coding for one of 20 amino acid building blocks of a protein). Each of the 64 possible codons is mapped to one of 21 musical notes: 20 notes stand in for the amino acids and one further note for the Stop codons. Each of the three possible reading frames is mapped to a separate instrument. In the absence of further DNA sequence annotation to indicate the actual reading frame of the sequence, each reading frame (instrument) is voiced sequentially with equal bias.
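A minimal sketch of the codon-based mapping is given below. The codon table is the standard genetic code; the numbering of the 21 notes (amino acids in alphabetical order of their one-letter symbols, Stop last) is an arbitrary assumption made for illustration, and the names `CODON_TABLE`, `NOTE_INDEX` and `codon_to_note_index` are hypothetical.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
# "*" marks the three Stop codons.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(_BASES, repeat=3), _AMINO)
}

# Assumed note numbering: 20 amino acids alphabetically, then Stop as note 21.
SYMBOLS = sorted(set(_AMINO) - {"*"}) + ["*"]
NOTE_INDEX = {symbol: i + 1 for i, symbol in enumerate(SYMBOLS)}


def codon_to_note_index(codon: str) -> int:
    """Map a codon (e.g. 'atg') to a note index in the range 1-21."""
    return NOTE_INDEX[CODON_TABLE[codon.upper()]]


if __name__ == "__main__":
    for codon in ("atg", "gct", "tga"):
        print(codon, CODON_TABLE[codon.upper()], codon_to_note_index(codon))
```

Because several codons code for the same amino acid, synonymous codons such as GCT and GCC sound the same note in this scheme.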
The information content of the DNA sequence was further sonified using two approaches. Firstly, Start and Stop codons were assigned loud and quiet volumes, respectively. This volume manipulation affects not only the specific codon but also the following notes for a period of time. It effectively silences a reading frame if a Stop codon occurs, or makes a reading frame containing a Start codon louder for a period of time. Secondly, specific sequences of DNA are used to trigger percussion instruments upon their detection in the sequence; this is applied to transcription factor binding motifs, promoter elements, and Start and Stop codons. These methods are effective at distinguishing cDNA sequences from random DNA sequences, or AT-rich DNA from GC-rich DNA.
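The volume manipulation can be sketched as a simple envelope over the codons of one reading frame. The velocity values and the length of the affected period (`hold`, counted in codons and including the triggering codon) are assumptions chosen for illustration; the text does not state the values the tool uses.

```python
START = "atg"
STOPS = {"taa", "tag", "tga"}


def velocities(codons, normal=64, loud=110, quiet=0, hold=4):
    """Return one MIDI velocity per codon for a single reading frame.

    A Start codon switches to the loud level and a Stop codon to the
    quiet level; the level persists for `hold` codons before returning
    to normal, unless another Start or Stop codon intervenes.
    """
    levels = []
    level, remaining = normal, 0
    for codon in codons:
        codon = codon.lower()
        if codon == START:
            level, remaining = loud, hold
        elif codon in STOPS:
            level, remaining = quiet, hold
        elif remaining == 0:
            level = normal
        levels.append(level)
        remaining = max(remaining - 1, 0)
    return levels


if __name__ == "__main__":
    frame = ["act", "atg", "cac", "cct", "gaa", "gtt", "tga", "agt"]
    print(velocities(frame, hold=3))
```

Setting `quiet` to 0 silences the frame outright after a Stop codon; a small non-zero value would keep it faintly audible instead.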
The human genome consists of approx. 3 billion base pairs.
|Consider approx. 1000 base pairs of DNA sequence:|
The above sequence contains a segment of the promoter region and coding region of the beta globin gene.
|Consider the beginning of this sequence:|
In a biological context, the information content of this sequence can be read in one of three reading frames according to the rules of the genetic code, whereby three nucleotide bases code for a specific amino acid residue in a protein. This single sequence can therefore be written and processed in three ways.
Frame 1: act-cac-cct-gaa-gtt-ctc-agg...
Frame 2: a-ctc-acc-ctg-aag-ttc-tca-gga...
Frame 3: ac-tca-ccc-tga-agt-tct-cag-gat...
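Splitting a sequence into its three reading frames can be sketched as below. The function name is hypothetical, and the example sequence is simply the stretch of bases shown in the three frames above.

```python
def reading_frames(seq: str) -> list[list[str]]:
    """Split a DNA sequence into codons for each of the three frames.

    Frame 1 starts at the first base, frame 2 at the second and frame 3
    at the third; trailing bases that do not fill a codon are dropped.
    """
    seq = seq.lower()
    return [
        [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        for offset in range(3)
    ]


if __name__ == "__main__":
    for number, codons in enumerate(reading_frames("actcaccctgaagttctcaggat"), 1):
        print(f"Frame {number}: {'-'.join(codons)}")
```

Each of the three codon lists can then be passed to a separate instrument, so that the frames are heard side by side.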
Sonifying the first frame would read:
However, sonifying all frames would read act:
Only one of these reading frames is processed by the cell to make a protein. This is determined by recognition of landmarks or motifs in the sequence, such as an in-frame "atg" start codon or other codons, such as "tga", that mark the end of the coding sequence. In addition, other motifs such as 5'-tataaa-3' mark protein binding sites approximately 25 base pairs upstream of the transcription start.
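Locating such landmarks in a sequence can be sketched with a simple scan. The helper names are hypothetical, and the sketch only reports where a motif occurs and in which frame; how a detection is turned into a percussion event is left open.

```python
import re


def find_motif(seq: str, motif: str) -> list[int]:
    """Return the 0-based start positions of every occurrence of a motif.

    A lookahead pattern is used so that overlapping occurrences are
    reported as well.
    """
    pattern = f"(?={re.escape(motif.lower())})"
    return [match.start() for match in re.finditer(pattern, seq.lower())]


def frame_of(position: int) -> int:
    """Reading frame (1, 2 or 3) in which a codon at this position lies."""
    return position % 3 + 1


if __name__ == "__main__":
    seq = "actcaccctgaagttctcaggat"
    for position in find_motif(seq, "tga"):
        print(f"tga at position {position}, frame {frame_of(position)}")
```

In the example sequence "tga" occurs once, in the third frame, which agrees with the frames written out above.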
A biological relationship (referred to as the genetic code) exists to convert each of the 64 codons to a specific amino acid residue or to a stop signal (through the biological processes of transcription and translation). Also included is an arbitrary association with a number used to reference a musical note in the MIDI file.
Table to convert number to MIDI note (C scale)
|codon to number||three octaves||midi note numbers|
MIDI note numbers
The following table lists the numbers corresponding to notes for use in note on and note off commands in the MIDI file.
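As a hedged illustration of how a note number is used, the sketch below builds the three-byte note on and note off channel messages defined by the MIDI standard. The helper names are hypothetical, the octave labels follow the common convention in which note 60 is C4, and the delta times that precede each event in a MIDI file are not shown.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_name(note: int) -> str:
    """Name of a MIDI note number, using the convention that 60 is C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"


def note_on(note: int, velocity: int = 64, channel: int = 0) -> bytes:
    """Note on message: status byte 0x90 plus channel, then note and velocity."""
    if not (0 <= note <= 127 and 0 <= velocity <= 127 and 0 <= channel <= 15):
        raise ValueError("note/velocity must be 0-127 and channel 0-15")
    return bytes([0x90 | channel, note, velocity])


def note_off(note: int, channel: int = 0) -> bytes:
    """Note off message: status byte 0x80 plus channel, then note and velocity 0."""
    if not (0 <= note <= 127 and 0 <= channel <= 15):
        raise ValueError("note must be 0-127 and channel 0-15")
    return bytes([0x80 | channel, note, 0])


if __name__ == "__main__":
    print(note_name(60), note_on(60, 100).hex(), note_off(60).hex())
```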
Codon usage table
|Codon number||Codon||Amino acid||Note number|