t-sne dimension reduction on Spotify mp3 samples

01-31 18:10


Not long ago I was reading about t-Distributed Stochastic Neighbor Embedding (t-sne), a very interesting dimension reduction technique, and about the Mel frequency cepstrum, a sound processing technique. Details of both techniques can be found here and here. Can we combine the two in a data analysis exercise? Yes, and with not too much R code you can quickly create some visuals to get ‘musical’ insights.

Spotify Data

Where can you get some sample audio files? Spotify! There is a Spotify API which allows you to get information on playlists, artists, tracks, etc. Moreover, for many songs (though not all) Spotify provides downloadable preview mp3’s of 30 seconds. The link to the preview mp3 can be retrieved from the API. I am going to use some of these mp3’s for analysis.

In the web interface of Spotify you can look for interesting playlists. In the search field, type for example ‘Bach’ (my favorite classical composer). In the search results, go to the playlists tab; you’ll find many ‘Bach’ playlists from different users, including the ‘user’ Spotify itself. Now, given the user_id (spotify) and the specific playlist_id (37i9dQZF1DWZnzwzLBft6A for the Bach playlist from Spotify), we can extract all the songs using the API:

GET https://api.spotify.com/v1/users/{user_id}/playlists/{playlist_id}

You will get the 50 Bach songs from the playlist, most of which have a preview mp3. Let’s also get the songs from a Heavy Metal playlist and a Michael Jackson playlist. In total I have 146 songs with preview mp3’s in three ‘categories’:

  • Bach,
  • Heavy Metal,
  • Michael Jackson.
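
The playlist request above can be sketched in a few lines of Python (the language librosa itself lives in, and callable from R through reticulate). This is only a minimal sketch: the helper names playlist_url, fetch_playlist and preview_tracks are hypothetical, the token argument stands for an OAuth access token you have to obtain yourself, and the response layout (tracks, items, track, preview_url) is assumed to follow Spotify's playlist object.

```python
import json
import urllib.request

API_ROOT = "https://api.spotify.com/v1"


def playlist_url(user_id, playlist_id):
    """Build the playlist endpoint used in the GET request above."""
    return f"{API_ROOT}/users/{user_id}/playlists/{playlist_id}"


def fetch_playlist(user_id, playlist_id, token):
    """Call the API and return the decoded JSON (token: OAuth access token)."""
    request = urllib.request.Request(
        playlist_url(user_id, playlist_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def preview_tracks(playlist):
    """Return (artists, title, preview_url) for every track that has a preview.

    Assumes the playlist JSON has the shape tracks -> items -> track.
    """
    result = []
    for item in playlist["tracks"]["items"]:
        track = item.get("track") or {}
        url = track.get("preview_url")
        if url:  # not every song comes with a preview mp3
            artists = ", ".join(a["name"] for a in track.get("artists", []))
            result.append((artists, track["name"], url))
    return result
```

Calling preview_tracks(fetch_playlist("spotify", "37i9dQZF1DWZnzwzLBft6A", token)) would then give the list of preview links to download, skipping the songs without one.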

Transforming audio mp3’s to features

The mp3 files need to be transformed into data that I can use for machine learning. I am going to use the Python librosa package to do this; it is easy to call from R using the reticulate package.

library(reticulate)

#### python environment with librosa module installed
#### (select it before importing any Python module)
use_python(python = "/usr/bin/python3")
librosa = import("librosa")

Loaded with librosa’s default settings, the preview mp3’s have a sample rate of 22,050 Hz (librosa resamples to this rate unless told otherwise). So a 30 second audio file has 661,500 raw audio data points in total.

onemp3 = librosa$load("mp3songs/bach1.mp3")

length(onemp3[[1]])/onemp3[[2]]  # ~30 seconds sound

## 5 seconds plot
pp = 5*onemp3[[2]]
plot(onemp3[[1]][1:pp], type="l")

The result is a line plot of the first 5 seconds of raw audio values.

In sound processing, feature extraction is often applied to the raw audio signal first. A commonly used feature extraction method is Mel-Frequency Cepstral Coefficients (MFCC). We can calculate the MFCC for a song with librosa.

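ff = librosa$feature
# Assumed completion of the truncated call: a mel spectrogram with 96 bands
# (matching the 96 x 1292 matrix), log-scaled. librosa 0.6+ uses power_to_db.
mel = librosa$logamplitude(
    ff$melspectrogram(onemp3[[1]], sr = onemp3[[2]], n_mels = 96L)
)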

Each mp3 is now a matrix of MFC coefficients, as shown in the figure above. We have fewer data points than the original 661,500, but still quite a lot. In our example the MFCC form a 96 by 1292 matrix, so 124,032 values. We apply the t-sne dimension reduction to these MFCC values.
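
The sizes quoted above can be checked with a little arithmetic. The sketch below assumes librosa's default hop length of 512 samples and centered framing, which gives one frame per hop plus one; the post itself does not state the hop length, so treat that value as an assumption.

```python
SAMPLE_RATE = 22050  # Hz, sample rate after loading
DURATION_S = 30      # length of a preview mp3 in seconds
HOP_LENGTH = 512     # assumption: librosa's default hop length
N_MELS = 96          # number of rows in the feature matrix

n_samples = SAMPLE_RATE * DURATION_S
n_frames = 1 + n_samples // HOP_LENGTH  # centered framing: one extra frame
n_values = N_MELS * n_frames

print(n_samples, n_frames, n_values)  # 661500 1292 124032
```

Under these assumptions the feature matrix holds about 19% of the raw sample count, which is why a further reduction step is still needed.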

Calculating t-sne

I take a simple approach: each matrix is just flattened, so a song becomes a vector of length 124,032. The data set on which we apply t-sne consists of 146 records with 124,032 columns, which we will reduce to 3 columns with the Rtsne package:

library(Rtsne)
tsne_out = Rtsne(AllSongsMFCCMatrix, dims = 3)
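
How the per-song matrices end up as rows of one data set can be sketched as follows. This is an illustration in plain Python, not the code used for the post: the helper names flatten and stack_songs are hypothetical, and truncating every song to the shortest frame count is an assumption about how previews of slightly different lengths could be handled.

```python
def flatten(matrix):
    """Concatenate the rows of a 2-D list into one flat list (row-major)."""
    return [value for row in matrix for value in row]


def stack_songs(matrices):
    """Turn a list of feature matrices into one row per song.

    Assumption: songs are truncated to the shortest frame count so that
    every row ends up with the same number of columns.
    """
    n_frames = min(len(m[0]) for m in matrices)
    return [flatten([row[:n_frames] for row in m]) for m in matrices]


# Two toy "songs" with 2 coefficients each and 3 or 4 frames.
song_a = [[1, 2, 3], [4, 5, 6]]
song_b = [[7, 8, 9, 10], [11, 12, 13, 14]]

data = stack_songs([song_a, song_b])
print(data)  # [[1, 2, 3, 4, 5, 6], [7, 8, 9, 11, 12, 13]]
```

R would flatten column by column rather than row by row, but the order does not matter for t-sne as long as it is the same for every song, because distances between rows are unaffected by a consistent reordering of columns.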

The output object contains the 3 columns. I have joined it back with the artist and song name data so that I can create an interactive 3D scatter plot with R plotly. Below is a screenshot; the interactive version can be found here.


It is obvious that Bach music, heavy metal and Michael Jackson are different; you don’t need machine learning to hear that. So, as expected, a straightforward dimension reduction on these songs with MFCC and t-sne clearly shows the differences in a 3D space. Some Michael Jackson songs are very close to heavy metal :-) The complete R code can be found here.

Cheers, Longhow
