Clustering by Compression Rudi Cilibrasi (CWI), Paul Vitanyi (CWI/UvA)


1 Clustering by Compression Rudi Cilibrasi (CWI), Paul Vitanyi (CWI/UvA)

2 Overview Input to the software is a set of files. Output is a hierarchical clustering shown as an unrooted binary tree. This is a case of unsupervised learning (example follows).

3 Process Overview 1. File translations, if necessary, for example from MIDI to “player-piano” type format. 2. Calculation of Normalized Compression Distance, or NCD. 3. Representation as an unrooted binary tree.

4 What’s Unique? This clustering system is unique in that it can be described as feature-free. There are no parameters to tune, and no domain-specific knowledge went into it. Using general-purpose data compressors gives us a parameterized family of features automatically for each domain.

5 Featureless Clustering Having no parameters and no customized features makes it convenient to develop as well as use. Since it is based on information-theoretic foundations, it tends to be less brittle than other methods that make considerably more domain-specific assumptions. So how does it work?

6 Midi Translation In order to restrict information entering the algorithm, we removed undesirable MIDI fields such as artist or composer name, headers, and other non-musical data. We keep only the basic MIDI-track decomposition as well as note timing and duration events. We throw away individual note volume.

7 Gene sequence translation Genetic sequences are represented in ASCII in a four-letter alphabet: A, T, G, C. Almost no translation at all.

8 Image Translation Black and white images are converted to ASCII using spaces for black and # for white. Newlines are used to separate rows.
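The image translation described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `image_to_ascii` and the input representation (rows of 0 for black, 1 for white) are assumptions, not part of the original software.

```python
def image_to_ascii(pixels):
    """Convert a black-and-white image to ASCII text.

    `pixels` is assumed to be a list of rows, each row a list of
    0 (black) or 1 (white). Black becomes a space, white becomes '#',
    and rows are separated by newlines, as the slide describes.
    """
    return "\n".join(
        "".join("#" if p else " " for p in row)
        for row in pixels
    )


if __name__ == "__main__":
    # A tiny 3x3 example image with a white diagonal.
    image = [
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
    ]
    print(image_to_ascii(image))
```

The resulting text can then be handed to a general-purpose compressor like any other file.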

9 NCD Once a group of songs has been acquired and translated, a quantity is computed on each pair in the group. Normalized Compression Distance measures how different two files are from one another.

10 NCD NCD is based on an earlier idea called Normalized Information Distance (NID). NID uses as its compressor a mathematical abstraction called Kolmogorov Complexity, often abbreviated K. K represents a perfect data compressor, and is therefore uncomputable.

11 NCD Since we cannot compute K, we approximate it using real general-purpose file compressors like gzip, bzip2, winzip, ppmz, and others. NCD depends on a particular compressor, and NCD with different compressors may give different results for the same pair of objects.

12 NCD C(x) means “the compressed size of x”. C(xy) means “the compressed size of x and y concatenated”. 0 <= NCD(x,y) <= 1 (roughly).
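The transcript does not preserve the formula itself. As published in the Clustering by Compression paper, it is NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The sketch below computes it in Python, assuming zlib as the compressor; that choice, and the function names `compressed_size` and `ncd`, are illustrative rather than taken from the authors' software.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """C(x): size in bytes of `data` after compression (zlib is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where xy is the concatenation of x and y.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 50
    b = bytes(range(256)) * 8
    print("NCD(a, a) =", ncd(a, a))  # expected to be close to 0
    print("NCD(a, b) =", ncd(a, b))  # expected to be noticeably larger
```

With a real compressor the value for identical inputs is small but usually not exactly 0, and values slightly above 1 can occur, which is why the bounds hold only roughly.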

13 NCD NCD measures how similar or different two strings (or equivalently, files) are. NCD(x,x) = 0, because nothing is different from itself. NCD(x,y) = 1 means that x and y are completely unrelated. Real cases often give less extreme values.

14 NCD Computing the NCD of every song with every other song yields a two-dimensional symmetric distance matrix. The next step is transforming this array of distances into something easier to grasp. We use the Quartet Method to construct an unrooted binary tree from the NCD matrix.
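A minimal sketch of building such a distance matrix follows. It assumes zlib as the compressor and uses hypothetical helper names (`ncd`, `ncd_matrix`). It also simplifies in two ways: each unordered pair is computed once and mirrored, and the diagonal is set to 0, whereas a real compressor gives values that are only approximately symmetric and approximately 0 on the diagonal.

```python
import zlib
from itertools import combinations


def ncd(x: bytes, y: bytes) -> float:
    """NCD with zlib standing in for the compressor (an assumption)."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(items):
    """Return an n-by-n list-of-lists matrix of pairwise NCD values.

    Each unordered pair is computed once and mirrored, so the result
    is exactly symmetric; the diagonal is left at 0.0.
    """
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = ncd(items[i], items[j])
        matrix[i][j] = d
        matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    files = [b"abcabcabc" * 40, b"abcabcabd" * 40, bytes(range(256)) * 4]
    for row in ncd_matrix(files):
        print(" ".join(f"{d:.3f}" for d in row))
```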

15 Quartet Method Our algorithm is a slight enhancement of the standard quartet method of tree reconstruction, popular for the last 30 years. The input is a matrix of distances (NCD). The output is an unrooted binary tree topology where each song is at a leaf and each non-leaf node has exactly three connections. The tree is just one visualization of the NCD matrix.
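The basic building block of a quartet method can be illustrated as follows. For any four items u, v, w, x there are three ways to split them into two pairs, and a common choice of cost for the split uv|wx is d(u,v) + d(w,x). The sketch below picks the cheapest split for a single quartet. It is not the authors' full algorithm, which searches for a whole tree that agrees well with the quartets; the function name `best_quartet_topology` is hypothetical.

```python
def best_quartet_topology(dist, u, v, w, x):
    """Return the pairing of four items with the lowest summed distance.

    `dist` is a symmetric matrix (list of lists) of distances such as NCD.
    The result is a tuple of two pairs, e.g. ((u, v), (w, x)), meaning
    u and v sit on one side of the internal edge and w and x on the other.
    """
    pairings = [
        ((u, v), (w, x)),
        ((u, w), (v, x)),
        ((u, x), (v, w)),
    ]

    def cost(pairing):
        (a, b), (c, d) = pairing
        return dist[a][b] + dist[c][d]

    return min(pairings, key=cost)


if __name__ == "__main__":
    # Items 0 and 1 are close to each other, as are items 2 and 3.
    dist = [
        [0.0, 0.1, 0.9, 0.8],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.2],
        [0.8, 0.9, 0.2, 0.0],
    ]
    print(best_quartet_topology(dist, 0, 1, 2, 3))
```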

16 Newer developments Since the original Algorithmic Clustering of Music paper, we have further developed the mathematical formalisms underlying the method in a new paper, Clustering by Compression. We’ve included experiments from many other areas: biology, astronomy, images…

17 Current and future work This year, we’ve begun experimenting with automatic conversion from .mp3 (and most other audio formats) to MIDI. This enables us to participate in new emerging spaces. We’re investigating alternatives for all stages of this process, to try to understand more about this apparently general machine learning algorithm.

18 New directions Combination of NCD and Support Vector Machine (SVM) learning for providing scalable generalization in a wide class of domains, both musical and otherwise. Application of our techniques to real outstanding questions within the musical community.

19 Contact and more info Related papers and information: http://www.cwi.nl/~cilibrar Software: http://complearn.sourceforge.net/ Rudi.Cilibrasi@cwi.nl Paul.Vitanyi@cwi.nl Ronald.de.Wolf@cwi.nl

