Introduction


DNA sonification refers to the use of audio to convey the information content of DNA sequence data. It provides an interesting adjunct to standard visualization of DNA sequence data. The 4 bases (namely G, A, T and C) that make up the DNA sequence are processed from left to right in a linear fashion. To achieve this, a dynamic web tool has been created in which DNA sequences are processed to produce audio output.

Recently there has been much interest in DNA sonification in light of advancements in DNA sequencing technology and the benefits thereof. Gene coding regions of the genome are essentially highly ordered sequences of DNA, whereby the genetic code relates the coding sequence of DNA to the amino acid residues of a protein. However, much of the DNA outside of gene coding regions has a lower sequence complexity according to our current knowledge base.

Two vastly different approaches have previously been taken to sonify DNA, to achieve outcomes pertinent to either the art or science disciplines. One approach essentially treats DNA as a random sequence for the purpose of generative music synthesis, whereas the other assumes a non-random sequence and therefore takes into account basic chemical or biological properties during sonification. We have focused on the latter approach in this work.

From a scientific perspective the basic challenge of DNA sonification is to use audio cues to distinguish a DNA sequence that is a highly ordered gene coding region from one of low complexity. Towards achieving this, various algorithms have been established to map groups of nucleotide bases (motifs) to musical notes. In the most rudimentary algorithm, each of the 4 individual nucleotide bases (G, A, T, or C) is considered to be a motif and is mapped to one of four musical notes. However, given the complexity of DNA sequences, this mapping is ineffectual and is included only for the sake of completeness.

The consideration of pairs of nucleotides as motifs provides for 16 notes and again does not do justice to the complexity of most DNA sequences. The most useful approach is to mirror the genetic code and treat each group of three nucleotide bases as a motif to map to a note. In theory a total of 64 codons exist; however, in biology these typically give rise to only 20 amino acid residues of proteins. This approach to note assignment could clearly be extended to map larger groupings of nucleotides to an ever increasing range of notes; for instance, 4 or 5 nucleotide motifs could theoretically be mapped to 256 or 1024 notes, respectively. Whilst this has no basis in biology, it is an interesting proposition for generative music aficionados. Given a typical hearing range and the number of discrete notes on musical instruments, however, this provides for more notes than can be sounded. One solution could be to map these motifs to micro-tonal scales using intervals smaller than semitones; however, this approach was not pursued at this stage.

Motif           Number of motifs   Motif identifier (Motif ID)
1 bp            4 (4 x 1)          G = #1, A = #2, T = #3, C = #4
2 bp            16 (4 x 4)         GG = #1, GA = #2, GT = #3, GC = #4, AG = #5, AA = #6, AT = #7, etc.
3 bp (codon)    64 (4 x 16)        GGG = #1, GGA = #2, GGT = #3, GGC = #4, GAG = #5, GTG = #6, GCT = #7, etc.
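
The numbering in the 1 bp and 2 bp rows is consistent with reading each motif as a base-4 number whose digits are ordered G, A, T, C. The following is a minimal Python sketch under that assumption only; the names `BASE_ORDER`, `motif_id` and `motif_count` are illustrative, and the numbering the tool actually uses for longer motifs may differ.

```python
BASE_ORDER = "GATC"  # assumed digit order: G=0, A=1, T=2, C=3


def motif_count(length: int) -> int:
    """Number of distinct motifs of the given length (4, 16, 64, ...)."""
    return 4 ** length


def motif_id(motif: str) -> int:
    """Return a 1-based identifier by reading the motif as a base-4 number."""
    value = 0
    for base in motif.upper():
        # str.index raises ValueError for any character other than G, A, T, C
        value = value * 4 + BASE_ORDER.index(base)
    return value + 1


if __name__ == "__main__":
    for motif in ("G", "A", "T", "C", "GG", "GA", "AG", "AT"):
        print(motif, motif_id(motif))
```

Under this scheme a motif of length k always receives an identifier between 1 and 4^k, so the same function covers the 1-4, 1-16 and 1-64 ranges.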

Six DNA sonification algorithms have been scripted to associate a DNA motif with a specific motif identifier. Each of these is further processed to produce a distinct mix of instrument and note identifiers to be assigned to musical notes. The motif identifiers are numbered from 1-4, 1-16 or 1-64 depending on the algorithm. These motif identifiers are further processed using additional parameters to establish a musical key, note intervals, note length, note timing and tempo. The notes are then assigned to an octave suitable for the selected instrument. All audio is generated dynamically and the audio output is streamed in real time.

Irrespective of the algorithm used, in each case Motif ID 1 is assigned to the root note of a musical key and the octave is set by the lower pitch range of the assigned musical instrument. These assignments are made using MIDI note numbers. For each instrument there are 128 MIDI note numbers (0-127), representing a range of just over 10 octaves. The interval between notes is governed by the scale used to sonify the motifs. For instance, the repeating semitone intervals of the natural minor scale (2, 1, 2, 2, 1, 2, 2) or the blues scale (3, 2, 1, 1, 3, 2) are used to assign sequential motif numbers to musical notes. Clearly the choice of key and scale determines the actual notes used in DNA sequence sonification.
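
The scale-based assignment can be sketched as follows. This is an illustration rather than the tool's own code: the function name `motif_to_midi` and the choice of root note are assumptions, while the interval patterns are those quoted above. Motif ID 1 lands on the root, and each subsequent ID climbs by the next interval in the repeating pattern.

```python
NATURAL_MINOR = (2, 1, 2, 2, 1, 2, 2)  # semitone steps, repeating every octave
BLUES = (3, 2, 1, 1, 3, 2)


def motif_to_midi(motif_id: int, root: int, intervals=NATURAL_MINOR) -> int:
    """Map a 1-based motif ID to a MIDI note number by walking up a scale."""
    if motif_id < 1:
        raise ValueError("motif IDs start at 1")
    note = root
    for step in range(motif_id - 1):
        note += intervals[step % len(intervals)]
    if not 0 <= note <= 127:
        raise ValueError("note falls outside the MIDI range 0-127")
    return note


if __name__ == "__main__":
    # Root 57 (an A) is a hypothetical choice for demonstration.
    print([motif_to_midi(i, 57) for i in range(1, 9)])
    print([motif_to_midi(i, 60, BLUES) for i in range(1, 8)])
```

With a root of 57 and the natural minor pattern, the first 21 motif IDs give the same MIDI numbers as the C scale conversion table later in this document (57, 59, 60, ... 91).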

Whilst each of the algorithms produces an audio output with interesting characteristics, the most useful algorithm for DNA sequence analysis uses codons (motifs of three nucleotides) mapped to 21 musical notes. In this approach tri-nucleotides are processed in a way analogous to the biological rules of the genetic code (in which a codon consists of three consecutive bases coding for one of 20 amino acid building blocks of a protein). Each of the 64 possible codons is mapped to one of 21 musical notes: one note for each of the 20 amino acids and one for the Stop codons. Each of the three possible reading frames is mapped to a separate instrument. In the absence of further DNA sequence annotation to indicate the actual reading frame of the sequence, each reading frame (instrument) is voiced sequentially with equal bias.

The information content of the DNA sequence was further sonified using two approaches. Firstly, Start and Stop codons were assigned loud and quiet volumes, respectively. This volume manipulation affects not only the specific codon but also the following notes for a period of time. It effectively silences a reading frame if a Stop codon occurs, or makes the reading frame containing a Start codon louder for a period of time. Secondly, specific sequences of DNA trigger percussion instruments upon their detection in the sequence; this is applied to transcription factor binding motifs, promoter elements, and Start and Stop codons (the latter silencing the reading frame). These methods are effective at distinguishing cDNA sequences from random DNA sequences, or AT rich DNA from GC rich DNA.
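
A minimal sketch of the volume manipulation for one reading frame is given below. The velocity values, the length of the hold period and the function name `frame_velocities` are all hypothetical, since the text does not state them; MIDI velocities range from 0 to 127.

```python
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frame_velocities(codons, normal=80, loud=120, quiet=0, hold=8):
    """Return one MIDI velocity per codon for a single reading frame.

    A Start codon switches the frame to `loud` and a Stop codon to `quiet`.
    The change covers `hold` codons (the trigger included), after which the
    frame returns to `normal`. All numeric defaults are assumed values.
    """
    velocities = []
    current, remaining = normal, 0
    for codon in codons:
        codon = codon.upper()
        if codon in START_CODONS:
            current, remaining = loud, hold
        elif codon in STOP_CODONS:
            current, remaining = quiet, hold
        elif remaining == 0:
            current = normal
        velocities.append(current)
        if remaining > 0:
            remaining -= 1
    return velocities


if __name__ == "__main__":
    demo = ["act", "atg", "gtg", "cat", "tga", "aaa"]
    print(frame_velocities(demo, hold=2))
```

Because each reading frame is voiced by its own instrument, applying this per frame lets a frame interrupted by Stop codons drop out of the mix while a frame containing a Start codon stands out.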

The human genome consists of approx. 3 billion base pairs (about 6 billion in a diploid cell).

Consider approx. 1000 base pairs of DNA sequence:

actcaccctgaagttctcaggatccacgtgcagcttgtcacagtgcagctcactcagtgtggcaaaggtgcccttgaggttgtccaggtgagccaggccatcactaaaggcaccgagcactttcttgccatgagccttcaccttagggttgcccataacagcatcaggagtggacagatccccaaaggactcaaagaacctctgggtccaagggtagaccaccagcagcctaagggtgggaaaatagaccaataggcagagagagtcagtgcctatcagaaacccaagagtcttctctgtctccacatgcccagtttctattggtctccttaaacctgtcttgtaaccttgataccaacctgcccagggcctcaccaccaacttcatccacgttcaccttgccccacagggcagtaacggcagacttctcctcaggagtcagatgcaccatggtgtctgtttgaggttgctagtgaacacagttgtgtcagaagcaaatgtaagcaatagatggctctgccctgacttttatgcccagccctggctcctgccctccctgctcctgggagtagattggccaaccctagggtgtggctccacagggtgaggtctaagtgatgacagccgtacctgtccttggctcttctggcactggcttaggagttggacttcaaaccctcagccctccctctaagatatatctcttggccccataccatcagtacaaattgctactaaaaacatcctcctttgcaagtgtatttacgtaatatttggaatcacagcttggtaagcatattgaagatcgttttcccaattttcttattacacaaataagaagttgatgcactaaaagtggaagagttttgtctaccataattcagctttgggatatgtagatggatctcttcctgcgtctccagaatatgcaaaatacttacaggacagaatggatgaaaa

The above sequence contains a segment of the promoter region and coding region of the beta globin gene.

Consider the beginning of this sequence:

actcaccctgaagttctcaggatccacgtgcagcttgtcacagtgcagctcactcagtgt

In a biological context, the information content of this can be read in one of three reading frames according to the rules of the genetic code, whereby three nucleotide bases code for a specific amino acid residue in a protein. So this single sequence can be written and processed in three ways.

Frame 1: act-cac-cct-gaa-gtt-ctc-agg...

Frame 2: a-ctc-acc-ctg-aag-ttc-tca-gga...

Frame 3: ac-tca-ccc-tga-agt-tct-cag-gat...

Sonifying the first frame would read:
act-cac-cct-gaa-gtt-ctc...

However, sonifying all frames would read:
act-ctc-tca-cac-acc-ccc...
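
The frame splitting and interleaving shown above can be sketched in Python as follows. The function names are illustrative; bases that do not complete a codon at either end of a frame are simply dropped in this sketch.

```python
from itertools import chain


def reading_frame(sequence: str, offset: int) -> list:
    """Codons of the reading frame that starts at `offset` (0, 1 or 2)."""
    seq = sequence[offset:]
    # Stop at len(seq) - 2 so that a trailing partial codon is ignored.
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def interleave_frames(sequence: str) -> list:
    """Order codons as frame 1, frame 2, frame 3, frame 1, ..."""
    frames = [reading_frame(sequence, k) for k in range(3)]
    # zip stops at the shortest frame, so every group holds three codons.
    return list(chain.from_iterable(zip(*frames)))


if __name__ == "__main__":
    start = "actcaccctgaagttctcagg"
    print("-".join(reading_frame(start, 0)))
    print("-".join(interleave_frames(start)[:6]))
```

For the beginning of the sequence above, the first call yields act-cac-cct-gaa-gtt-ctc-agg and the interleaved order begins act-ctc-tca-cac-acc-ccc.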

Only one of these reading frames is processed by the cell to make a protein. The frame is determined by recognition of landmarks or motifs in the sequence, such as an in-frame "atg" start codon, or other codons such as "tga" that determine the end of a gene. In addition, other motifs such as 5'-tataaa-3' determine protein binding sites approximately 25 base pairs upstream of the transcription start.
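
Locating such landmarks is a plain substring search. The sketch below is one way to do it, with illustrative function names; a regular-expression lookahead is used so that overlapping occurrences are also reported, and `in_frame` checks whether a hit lines up with a given reading frame.

```python
import re


def find_motif(sequence: str, motif: str) -> list:
    """0-based start positions of every occurrence, overlapping ones included."""
    pattern = re.compile("(?=" + re.escape(motif.lower()) + ")")
    return [match.start() for match in pattern.finditer(sequence.lower())]


def in_frame(position: int, frame_offset: int) -> bool:
    """True if a codon starting at `position` belongs to the given frame."""
    return (position - frame_offset) % 3 == 0


if __name__ == "__main__":
    demo = "ccatggtataaaatg"  # made-up sequence for demonstration only
    for motif in ("atg", "tga", "tataaa"):
        print(motif, find_motif(demo, motif))
```

The returned positions could then be used as trigger points for percussion or volume changes, although how the tool schedules those events is not specified here.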

A biological relationship (referred to as the genetic code) exists to convert each of the 64 codons to a specific amino acid residue (through the biological processes of transcription and translation). The codon usage table below also includes an arbitrary association of each codon with a number, which is used to reference a musical note in the MIDI file.

Table to convert number to MIDI note (C scale)

Number (from codon)   Note (three octaves)   MIDI note number
1                     A                      57
2                     B                      59
3                     C                      60
4                     D                      62
5                     E                      64
6                     F                      65
7                     G                      67
8                     A                      69
9                     B                      71
10                    C                      72
11                    D                      74
12                    E                      76
13                    F                      77
14                    G                      79
15                    A                      81
16                    B                      83
17                    C                      84
18                    D                      86
19                    E                      88
20                    F                      89
21                    G                      91

MIDI note numbers

The following table lists the numbers corresponding to notes for use in note on and note off commands in the MIDI file.

Octave   C     C#    D     D#    E     F     F#    G     G#    A     A#    B
0        0     1     2     3     4     5     6     7     8     9     10    11
1        12    13    14    15    16    17    18    19    20    21    22    23
2        24    25    26    27    28    29    30    31    32    33    34    35
3        36    37    38    39    40    41    42    43    44    45    46    47
4        48    49    50    51    52    53    54    55    56    57    58    59
5        60    61    62    63    64    65    66    67    68    69    70    71
6        72    73    74    75    76    77    78    79    80    81    82    83
7        84    85    86    87    88    89    90    91    92    93    94    95
8        96    97    98    99    100   101   102   103   104   105   106   107
9        108   109   110   111   112   113   114   115   116   117   118   119
10       120   121   122   123   124   125   126   127
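
The layout of this table follows directly from integer division by 12: the row is the quotient and the column is the remainder. A short sketch, with illustrative function names, is given below. Note that the octave index here is the row number used in the table, which may differ from the octave labels used by some software.

```python
NOTE_NAMES = ("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")


def midi_to_name(number: int) -> tuple:
    """Return (note name, table row) for a MIDI note number 0-127."""
    if not 0 <= number <= 127:
        raise ValueError("MIDI note numbers run from 0 to 127")
    octave, index = divmod(number, 12)
    return NOTE_NAMES[index], octave


def name_to_midi(name: str, octave: int) -> int:
    """Inverse lookup: note name and table row to MIDI note number."""
    number = octave * 12 + NOTE_NAMES.index(name)
    if not 0 <= number <= 127:
        raise ValueError("no such note in the MIDI range")
    return number


if __name__ == "__main__":
    for n in (57, 60, 91, 127):
        print(n, midi_to_name(n))
```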

Codon usage table

Codon number Codon Amino acid Note number
1 GCA Ala 1
2 GCC Ala 1
3 GCG Ala 1
4 GCT Ala 1
5 AGA Arg 2
6 AGG Arg 2
7 CGA Arg 2
8 CGC Arg 2
9 CGG Arg 2
10 CGT Arg 2
11 AAC Asn 3
12 AAT Asn 3
13 GAC Asp 4
14 GAT Asp 4
15 TGC Cys 5
16 TGT Cys 5
17 CAA Gln 6
18 CAG Gln 6
19 GAA Glu 7
20 GAG Glu 7
21 GGA Gly 8
22 GGC Gly 8
23 GGG Gly 8
24 GGT Gly 8
25 CAC His 9
26 CAT His 9
27 ATA Ile 10
28 ATC Ile 10
29 ATT Ile 10
30 CTA Leu 11
31 CTC Leu 11
32 CTG Leu 11
33 CTT Leu 11
34 TTA Leu 11
35 TTG Leu 11
36 AAA Lys 12
37 AAG Lys 12
38 ATG Mt* 13
39 TTC Phe 14
40 TTT Phe 14
41 CCA Pro 15
42 CCC Pro 15
43 CCG Pro 15
44 CCT Pro 15
45 AGC Ser 16
46 AGT Ser 16
47 TCA Ser 16
48 TCC Ser 16
49 TCG Ser 16
50 TCT Ser 16
51 TAA ST* 17
52 TAG ST* 17
53 TGA ST* 17
54 ACA Thr 18
55 ACC Thr 18
56 ACG Thr 18
57 ACT Thr 18
58 TGG Trp 19
59 TAC Tyr 20
60 TAT Tyr 20
61 GTA Val 21
62 GTC Val 21
63 GTG Val 21
64 GTT Val 21