Sonification


This site is based on the BMC Bioinformatics journal article "An auditory display tool for DNA sequence analysis", BMC Bioinformatics 2017, 18:221.

The original pages listed in the BMC article are provided here.


This website has been updated:

  • It now uses a responsive design to display properly on screens of all sizes.
  • The audio is now generated in the browser without the need for a plugin.
  • The pages now work across all modern browsers.
  • The tool now includes an animated view which is in sync with the audio.
  • The site now includes a welcome screen.
  • Embedded videos have been added to show how the tools work.
  • Minor bugs have been caught and squashed.
  • Social media links have been added on the home page.

Introduction


DNA sonification refers to the use of audio to convey the information content of DNA sequence data. It provides an interesting adjunct to standard visualization of DNA sequence data. The 4 bases (namely G, A, T and C) that make up the DNA sequence are processed from left to right in a linear fashion. To achieve this, a dynamic web tool has been created in which DNA sequences are processed to produce audio output.

Recently there has been much interest in DNA sonification in light of advancements in DNA sequencing technology and the benefits thereof. Gene coding regions of the genome are essentially highly ordered sequences of DNA, whereby the genetic code relates the coding sequence of DNA to the amino acid residues of a protein. However, much of the information content of DNA outside of gene coding regions has a lower sequence complexity according to our current knowledge base.

Two vastly different approaches have previously been taken to sonify DNA, to achieve outcomes pertinent to either the art or science disciplines. One approach essentially treats DNA as a random sequence for the purpose of generative music synthesis, whereas the other assumes a non-random sequence and therefore takes into account basic chemical or biological properties during sonification. We have focused on the latter approach in this work.

From a scientific perspective, the basic challenge of DNA sonification is to use audio cues to distinguish a DNA sequence that is a highly ordered gene coding region from one of low complexity. Towards achieving this, various algorithms have been established to map groups of nucleotide bases (motifs) to musical notes. In the most rudimentary algorithm, each of the 4 individual nucleotide bases (G, A, T or C) is considered to be a motif and is mapped to one of four musical notes. However, given the complexity of DNA sequences, this mapping is ineffectual and is included only for the sake of completeness.

The consideration of pairs of nucleotides as motifs provides for 16 notes and again does not do justice to the complexity of most DNA sequences. The most useful approach is to mirror the genetic code and treat each group of three nucleotide bases as a motif to map to a note. In theory a total of 64 codons exist; however, in biology these typically give rise to only 20 amino acid residues of proteins. This approach of note assignment could clearly be extended to map larger groupings of nucleotides to an ever increasing range of notes; for instance, 4 or 5 nucleotide motifs could theoretically be mapped to 256 or 1024 notes, respectively. Whilst this has no basis in biology, it is an interesting proposition for generative music aficionados. Given a typical hearing range and the number of discrete notes on musical instruments, this provides for more notes than can be sounded. One solution could be to map these motifs to micro-tonal scales using intervals smaller than semitones; however, this approach was not pursued at this stage.

Motif          Number of motifs   Motif identifier (Motif ID)
1 bp           4 (4 x 1)          G = #1, A = #2, T = #3, C = #4
2 bp           16 (4 x 4)         GG = #1, GA = #2, GT = #3, GC = #4, AG = #5, AA = #6, AT = #7, etc.
3 bp (codon)   64 (4 x 16)        GGG = #1, GGA = #2, GGT = #3, GGC = #4, GAG = #5, GTG = #6, GCT = #7, etc.
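
The numbering in the table can be sketched in Python as a base-4 enumeration. This is a minimal sketch, assuming the bases are ranked G, A, T, C and the rightmost base varies fastest. That ordering reproduces the 1 bp and 2 bp rows and the first five 3 bp entries, but it is an assumption and may not match the site's scripts for every codon. The names `motif_id` and `motif_count` are hypothetical.

```python
BASE_ORDER = "GATC"  # assumed ranking of the bases, taken from the 1 bp row


def motif_id(motif: str) -> int:
    """Return the 1-based identifier of a motif under the assumed ordering."""
    value = 0
    for base in motif.upper():
        # Treat the motif as a base-4 number with digits G=0, A=1, T=2, C=3.
        value = value * 4 + BASE_ORDER.index(base)
    return value + 1


def motif_count(length: int) -> int:
    """Number of distinct motifs of the given length (4, 16, 64, 256, ...)."""
    return 4 ** length


if __name__ == "__main__":
    for motif in ("G", "AT", "GAG", "CCC"):
        print(motif, motif_id(motif))
```

Under this scheme a motif of n bases always receives an identifier between 1 and 4 to the power n, which is where the counts 4, 16, 64, 256 and 1024 come from.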

Six DNA sonification algorithms have been scripted to associate a DNA motif with a specific motif identifier. The motif identifiers are numbered from 1-4, 1-16 or 1-64 depending on the algorithm. Each identifier is further processed to produce a distinct mix of instrument and note identifiers, using additional parameters to establish a musical key, note intervals, note length, note timing and tempo. The resulting notes are then assigned to an octave suitable for the selected instrument. All audio is generated dynamically and the audio output is streamed in real time.

Irrespective of the algorithm used, in each case Motif ID 1 is assigned to the root note of a musical key and the octave is set by the lower pitch range of the assigned musical instrument. These assignments are made using MIDI note numbers. For each instrument there are 128 MIDI note numbers (0 to 127), spanning just over 10 octaves. The interval between notes is governed by the scale used to sonify the motifs. For instance, the repeating semitone intervals of the natural minor scale (2, 1, 2, 2, 1, 2, 2) or the blues scale (3, 2, 1, 1, 3, 2) are used to assign sequential motif numbers to musical notes. Clearly the choice of key and scale determines the actual notes used in DNA sequence sonification.
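
The interval scheme described above can be sketched as follows. This is an illustrative sketch, not the site's implementation: the function name `motif_id_to_midi` and the example root notes are assumptions, while the interval patterns are the ones quoted in the text.

```python
NATURAL_MINOR = (2, 1, 2, 2, 1, 2, 2)  # semitone steps, repeating every octave
BLUES = (3, 2, 1, 1, 3, 2)


def motif_id_to_midi(motif_id: int, root_note: int, intervals: tuple) -> int:
    """Map a 1-based motif ID to a MIDI note number.

    Motif ID 1 is the root note; each later ID moves up by the next
    interval of the scale, cycling through the pattern as needed.
    """
    if motif_id < 1:
        raise ValueError("motif IDs start at 1")
    note = root_note
    for step in range(motif_id - 1):
        note += intervals[step % len(intervals)]
    if note > 127:
        raise ValueError("note exceeds the MIDI range 0-127")
    return note


if __name__ == "__main__":
    # Example root: A at MIDI note 57 (an assumed choice for illustration).
    print([motif_id_to_midi(i, 57, NATURAL_MINOR) for i in range(1, 9)])
```

With a root of 57 and the natural minor pattern, motif IDs 1 to 8 give 57, 59, 60, 62, 64, 65, 67, 69, and motif ID 21 gives 91.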

Whilst each of the algorithms produces an audio output with interesting characteristics, the most useful algorithm for DNA sequence analysis uses codons (motifs of three nucleotides) mapped to 21 musical notes. In this approach tri-nucleotides are processed in a way analogous to the biological rules of the genetic code (in which a codon consists of three consecutive bases coding for one of 20 amino acid building blocks of a protein). Each of the 64 possible codons is mapped to one of 21 musical notes: one note for each of the 20 amino acids and one for the STOP codons. Each of the three possible reading frames is mapped to a separate instrument. In the absence of further DNA sequence annotation to indicate the actual reading frame of the sequence, each reading frame (instrument) is voiced sequentially with equal bias.
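
A minimal sketch of the codon-to-note idea follows. The translation uses the standard genetic code; the note numbering (amino acids in alphabetical order of their one-letter codes, STOP last) is purely an assumption for illustration and is not the assignment used by the site. The names `translate` and `codon_note_number` are hypothetical.

```python
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Assumed note order: 20 amino acids alphabetically, then "*" for STOP.
NOTE_ORDER = sorted(set(AMINO_ACIDS) - {"*"}) + ["*"]


def translate(codon: str) -> str:
    """Return the one-letter amino acid for a codon, or '*' for a STOP codon."""
    c = codon.upper()
    index = 16 * BASES.index(c[0]) + 4 * BASES.index(c[1]) + BASES.index(c[2])
    return AMINO_ACIDS[index]


def codon_note_number(codon: str) -> int:
    """Return a note number from 1 to 21 for any of the 64 codons."""
    return NOTE_ORDER.index(translate(codon)) + 1


if __name__ == "__main__":
    for codon in ("atg", "tga", "gct"):
        print(codon, translate(codon), codon_note_number(codon))
```

Because synonymous codons share a note, a coding sequence draws on at most 21 distinct pitches regardless of which of the 64 codons it uses.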

The information content of the DNA sequence was further sonified using two approaches. Firstly, Start and Stop codons were assigned loud and quiet volumes, respectively. This volume manipulation affects not only the specific codon but also the following notes for a period of time. This effectively silences a reading frame if a Stop codon occurs, or makes a reading frame containing a Start codon louder for a period of time. Secondly, specific sequences of DNA trigger percussion instruments when they are detected in the sequence. This is applied to transcription factor binding motifs, promoter elements, and Start and Stop codons. These methods are effective at distinguishing cDNA sequences from random DNA sequences, or AT-rich DNA from GC-rich DNA.
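
The volume manipulation can be sketched as a simple per-frame envelope. This is a sketch under stated assumptions: the volume levels, the number of notes for which a change persists (`hold`), and the function name `frame_volumes` are all hypothetical and are not taken from the site.

```python
START_CODON = "atg"
STOP_CODONS = {"taa", "tag", "tga"}


def frame_volumes(codons, normal=80, loud=110, quiet=20, hold=4):
    """Return one volume value per codon of a single reading frame.

    A Start codon raises the volume and a Stop codon lowers it; the change
    lasts for `hold` notes (including the triggering codon), after which
    the volume returns to `normal`.
    """
    volumes = []
    level, remaining = normal, 0
    for codon in codons:
        c = codon.lower()
        if c == START_CODON:
            level, remaining = loud, hold
        elif c in STOP_CODONS:
            level, remaining = quiet, hold
        elif remaining == 0:
            level = normal
        volumes.append(level)
        remaining = max(remaining - 1, 0)
    return volumes


if __name__ == "__main__":
    print(frame_volumes(["act", "atg", "cac", "cct", "gaa", "tga", "gtt"], hold=2))
```

Applying the envelope to each of the three frames separately means a frame that runs into a Stop codon drops out of the mix while the others continue.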

The human genome consists of approx. 3 billion base pairs (about 6 billion in a diploid cell).

Consider approx. 1000 base pairs of DNA sequence:

actcaccctgaagttctcaggatccacgtgcagcttgtcacagtgcagctcactcagtgtggcaaaggtgcccttgaggttgtccaggtgagccaggccatcactaaaggcaccgagcactttcttgccatgagccttcaccttagggttgcccataacagcatcaggagtggacagatccccaaaggactcaaagaacctctgggtccaagggtagaccaccagcagcctaagggtgggaaaatagaccaataggcagagagagtcagtgcctatcagaaacccaagagtcttctctgtctccacatgcccagtttctattggtctccttaaacctgtcttgtaaccttgataccaacctgcccagggcctcaccaccaacttcatccacgttcaccttgccccacagggcagtaacggcagacttctcctcaggagtcagatgcaccatggtgtctgtttgaggttgctagtgaacacagttgtgtcagaagcaaatgtaagcaatagatggctctgccctgacttttatgcccagccctggctcctgccctccctgctcctgggagtagattggccaaccctagggtgtggctccacagggtgaggtctaagtgatgacagccgtacctgtccttggctcttctggcactggcttaggagttggacttcaaaccctcagccctccctctaagatatatctcttggccccataccatcagtacaaattgctactaaaaacatcctcctttgcaagtgtatttacgtaatatttggaatcacagcttggtaagcatattgaagatcgttttcccaattttcttattacacaaataagaagttgatgcactaaaagtggaagagttttgtctaccataattcagctttgggatatgtagatggatctcttcctgcgtctccagaatatgcaaaatacttacaggacagaatggatgaaaa

The above sequence contains a segment of the promoter region and coding region of the beta globin gene.

Consider the beginning of this sequence:

actcaccctgaagttctcaggatccacgtgcagcttgtcacagtgcagctcactcagtgt

In a biological context, the information content of this can be read in one of three reading frames according to the rules of the genetic code, whereby three nucleotide bases code for a specific amino acid residue in a protein. So this single sequence can be written and processed in three ways.

Frame 1: act-cac-cct-gaa-gtt-ctc-agg...

Frame 2: a-ctc-acc-ctg-aag-ttc-tca-gga...

Frame 3: ac-tca-ccc-tga-agt-tct-cag-gat...

Sonifying the first frame would read:
act-cac-cct-gaa-gtt-ctc...

However, sonifying all three frames in turn would read:
act-ctc-tca-cac-acc-ccc...
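
The three framings and the interleaved reading order shown above can be reproduced with a short sketch. The function names `split_frame` and `interleave_frames` are hypothetical; incomplete codons at the end of a frame are simply dropped, which is an assumption about how trailing bases are handled.

```python
def split_frame(sequence: str, offset: int) -> list:
    """Return the complete codons of the frame starting at offset 0, 1 or 2."""
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


def interleave_frames(sequence: str) -> list:
    """Return codons in the order frame 1, frame 2, frame 3, frame 1, ...

    zip() stops at the shortest frame, so the output ends once any frame
    runs out of complete codons.
    """
    frames = [split_frame(sequence, offset) for offset in range(3)]
    interleaved = []
    for codons in zip(*frames):
        interleaved.extend(codons)
    return interleaved


if __name__ == "__main__":
    start = "actcaccctgaagttctcagg"
    print("-".join(split_frame(start, 0)))
    print("-".join(interleave_frames(start)[:6]))
```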

Only one of these reading frames is processed by the cell to make a protein. This is determined by recognition of landmarks or motifs in the sequence, such as an in-frame "atg" start codon, or other codons such as "tga" that mark the end of the protein coding region. In addition, other motifs such as 5'-tataaa-3' mark protein binding sites approximately 25 base pairs upstream of the transcription start.
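
Detecting such landmarks amounts to scanning the sequence for each motif. The following is a minimal sketch using plain string search; the function names `find_motif` and `in_frame` are hypothetical, and the site may well use a different detection method.

```python
def find_motif(sequence: str, motif: str) -> list:
    """Return the 0-based start of every occurrence, including overlaps."""
    sequence, motif = sequence.lower(), motif.lower()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def in_frame(position: int, frame_offset: int) -> bool:
    """True if a motif at `position` is aligned with the given reading frame."""
    return (position - frame_offset) % 3 == 0


if __name__ == "__main__":
    example = "ccatggatgtataaag"  # made-up sequence for illustration only
    print(find_motif(example, "atg"), find_motif(example, "tataaa"))
```

Each reported position could then be used to trigger a percussion event at the corresponding point in the audio.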

A biological relationship (referred to as the genetic code) exists to convert each of the 64 codons to a specific amino acid residue (through the biological processes of transcription and translation). Also included is an arbitrary association with a number, which is used to reference a musical note in the MIDI file.

Table to convert number to MIDI note (C scale)

Codon number   Note (three octaves)   MIDI note number
1              A                      57
2              B                      59
3              C                      60
4              D                      62
5              E                      64
6              F                      65
7              G                      67
8              A                      69
9              B                      71
10             C                      72
11             D                      74
12             E                      76
13             F                      77
14             G                      79
15             A                      81
16             B                      83
17             C                      84
18             D                      86
19             E                      88
20             F                      89
21             G                      91

MIDI note numbers

The following table lists the numbers corresponding to notes for use in note on and note off commands in the MIDI file.

Octave  C    C#   D    D#   E    F    F#   G    G#   A    A#   B
0       0    1    2    3    4    5    6    7    8    9    10   11
1       12   13   14   15   16   17   18   19   20   21   22   23
2       24   25   26   27   28   29   30   31   32   33   34   35
3       36   37   38   39   40   41   42   43   44   45   46   47
4       48   49   50   51   52   53   54   55   56   57   58   59
5       60   61   62   63   64   65   66   67   68   69   70   71
6       72   73   74   75   76   77   78   79   80   81   82   83
7       84   85   86   87   88   89   90   91   92   93   94   95
8       96   97   98   99   100  101  102  103  104  105  106  107
9       108  109  110  111  112  113  114  115  116  117  118  119
10      120  121  122  123  124  125  126  127
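
The table follows a simple rule: the MIDI note number is the octave row multiplied by 12 plus the position of the note within the octave. A short sketch of the conversion in both directions is given below; the function names are hypothetical, and the octave index is the row number used in this table (0 to 10) rather than any particular octave-naming convention.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_note(number: int) -> tuple:
    """Return (note name, octave row) for a MIDI note number from 0 to 127."""
    if not 0 <= number <= 127:
        raise ValueError("MIDI note numbers run from 0 to 127")
    octave, pitch_class = divmod(number, 12)
    return NOTE_NAMES[pitch_class], octave


def note_to_midi(name: str, octave: int) -> int:
    """Return the MIDI note number for a note name and octave row."""
    number = octave * 12 + NOTE_NAMES.index(name)
    if not 0 <= number <= 127:
        raise ValueError("note lies outside the MIDI range 0-127")
    return number


if __name__ == "__main__":
    print(midi_to_note(57), note_to_midi("G", 7))
```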

Codon usage table

Codon number Codon Amino acid Note number