Not long ago I was reading about t-Distributed Stochastic Neighbor Embedding (t-SNE), a very interesting dimension reduction technique, and about the Mel-frequency cepstrum, a sound processing technique. Details of both techniques can be found here and here. Can we combine the two in a data analysis exercise? Yes, and with not too much R code you can quickly create some visuals to get ‘musical’ insights.
Where can you get some sample audio files? Spotify! There is a Spotify API which allows you to get information on playlists, artists, tracks, etc. Moreover, for many songs (though not all) Spotify provides downloadable 30-second preview mp3’s. The link to the preview mp3 can be retrieved from the API. I am going to use some of these mp3’s for the analysis.
In the web interface of Spotify you can look for interesting playlists. In the search field, type for example ‘Bach’ (my favorite classical composer). In the search results, go to the playlists tab and you’ll find many ‘Bach’ playlists from different users, including the ‘user’ Spotify itself. Now, given the user_id (spotify) and the specific playlist_id (37i9dQZF1DWZnzwzLBft6A for the Bach playlist from Spotify), we can extract all the songs using the API:
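The snippet that originally followed is not reproduced here, so below is a minimal sketch of how such a call could look, not the author's own code. It assumes the httr package is installed and that you already hold a valid OAuth access token. The helper names `playlist_tracks_url`, `extract_tracks` and `get_playlist_tracks` are made up for illustration, and the response fields used (`items`, `track`, `artists`, `preview_url`) follow the Spotify Web API track listing format.

```r
# Hypothetical helper: build the playlist-tracks endpoint URL.
# With a user_id it uses the older /users/{user_id}/playlists/... form.
playlist_tracks_url <- function(playlist_id, user_id = NULL) {
  base <- "https://api.spotify.com/v1"
  if (is.null(user_id)) {
    sprintf("%s/playlists/%s/tracks", base, playlist_id)
  } else {
    sprintf("%s/users/%s/playlists/%s/tracks", base, user_id, playlist_id)
  }
}

# Hypothetical helper: turn a parsed JSON response (nested list) into a
# data frame with the first artist, the song name and the preview link.
extract_tracks <- function(parsed) {
  rows <- lapply(parsed$items, function(item) {
    tr <- item$track
    data.frame(
      artist = tr$artists[[1]]$name,
      song = tr$name,
      # not every track has a preview mp3
      preview_url = if (is.null(tr$preview_url)) NA_character_ else tr$preview_url,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# The network call itself; `token` is assumed to be a valid access token.
get_playlist_tracks <- function(playlist_id, token, user_id = NULL) {
  resp <- httr::GET(
    playlist_tracks_url(playlist_id, user_id),
    httr::add_headers(Authorization = paste("Bearer", token))
  )
  httr::stop_for_status(resp)
  extract_tracks(httr::content(resp, as = "parsed"))
}
```

Each non-missing `preview_url` could then be saved with base R's `download.file(url, destfile, mode = "wb")`.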
You will get the 50 Bach songs from the playlist; most of them have a preview mp3. Let’s also get the songs from a heavy metal playlist and a Michael Jackson playlist. In total I have 146 songs with preview mp3’s in three ‘categories’: Bach, heavy metal and Michael Jackson.
The mp3 files need to be transformed into data that I can use for machine learning. I am going to use the Python librosa package to do this; it is easy to call from R using the reticulate package.
library(reticulate)

#### python environment with librosa module installed
#### (select the interpreter before importing any Python module)
use_python(python = "/usr/bin/python3")
librosa = import("librosa")
Loaded with librosa’s default settings, the preview mp3’s have a sample rate of 22,050 Hz, so a 30-second audio file has 661,500 raw audio data points in total.
onemp3 = librosa$load("mp3songs/bach1.mp3")   # list: [[1]] audio samples, [[2]] sample rate

length(onemp3[[1]])
length(onemp3[[1]]) / onemp3[[2]]   # ~30 seconds sound

## 5 seconds plot
pp = 5 * onemp3[[2]]
plot(onemp3[[1]][1:pp], type = "l")
The result is a line plot of the raw audio values for the first 5 seconds.
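For sound processing, feature extraction on the raw audio signal is often applied first. A commonly used feature extraction method is Mel-Frequency Cepstral Coefficients (MFCC). We can calculate these Mel-scale features for a song with librosa. Strictly speaking, melspectrogram followed by a log scaling gives a log-scaled Mel spectrogram, the representation from which the cepstral coefficients are derived; for simplicity I keep calling the values MFCC.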
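ff = librosa$feature

# librosa$logamplitude exists only in older librosa releases;
# newer releases provide librosa$power_to_db instead
mel = librosa$logamplitude(
  ff$melspectrogram(onemp3[[1]], sr = onemp3[[2]], n_mels = 96L),   # 96L: pass an integer to Python
  ref_power = 1.0
)
image(mel)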
Each mp3 is now a matrix of MFC coefficients, as shown in the figure above. We have fewer data points than the original 661,500, but still quite a lot. In our example the MFCC form a 96 by 1292 matrix, so 124,032 values. We apply the t-SNE dimension reduction to the MFCC values.
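The sizes quoted above can be checked with a few lines of arithmetic. This is only a sanity check, and it assumes librosa's default hop length of 512 samples with centered frames, which gives 1 + floor(n_samples / hop_length) frames. The text does not state the hop length explicitly.

```r
sr <- 22050                 # sample rate in Hz
seconds <- 30               # length of a preview
n_samples <- sr * seconds   # raw audio data points

hop_length <- 512           # assumption: librosa default
n_frames <- 1 + n_samples %/% hop_length

n_mels <- 96
n_values <- n_mels * n_frames

c(n_samples = n_samples, n_frames = n_frames, n_values = n_values)
```

Under these assumptions the numbers match the 661,500 samples, 1292 frames and 124,032 values mentioned in the text.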
I take a simple and easy approach: each matrix is just flattened, so a song becomes a vector of length 124,032. The data set on which we apply t-SNE consists of 146 records with 124,032 columns, which we will reduce to 3 columns with the Rtsne package:
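As a rough illustration of the flattening step, here is a small sketch on toy matrices. The helper `flatten_songs` is hypothetical and not taken from the original code. It also crops every matrix to the shortest number of frames, on the assumption that previews may differ slightly in length.

```r
# Hypothetical helper: flatten a list of (n_mels x n_frames) matrices
# into one matrix with a row per song.
flatten_songs <- function(mel_list) {
  # crop to the shortest song so that all rows have the same length
  n_frames <- min(vapply(mel_list, ncol, integer(1)))
  n_mels <- nrow(mel_list[[1]])
  flat <- vapply(
    mel_list,
    function(m) as.numeric(m[, seq_len(n_frames), drop = FALSE]),
    numeric(n_mels * n_frames)
  )
  t(flat)   # vapply returns one column per song, so transpose
}

# Toy data standing in for the real 96 x 1292 matrices
set.seed(1)
toy <- list(matrix(rnorm(12), nrow = 3), matrix(rnorm(15), nrow = 3))
X <- flatten_songs(toy)
dim(X)
```

With the real data, the same call on a list of 146 matrices would give the 146 by 124,032 input for t-SNE.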
library(Rtsne)

tsne_out = Rtsne(AllSongsMFCCMatrix, dims=3)
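One practical detail with so few records: Rtsne rejects a perplexity that is too large relative to the number of rows. The sketch below uses a deliberately conservative bound (the helper `max_perplexity` is hypothetical) and, if the Rtsne package is available, runs it on random data of the same row count. The random matrix only stands in for the real song data.

```r
# Hypothetical helper: a conservative upper bound for the perplexity,
# chosen so that 3 * perplexity stays below nrow - 1.
max_perplexity <- function(n_rows) floor((n_rows - 2) / 3)

max_perplexity(146)   # the default perplexity of 30 is well below this

if (requireNamespace("Rtsne", quietly = TRUE)) {
  set.seed(42)
  toy <- matrix(rnorm(146 * 50), nrow = 146)   # stand-in for the song matrix
  out <- Rtsne::Rtsne(toy, dims = 3, perplexity = 30, check_duplicates = FALSE)
  dim(out$Y)   # one row per song, one column per embedding dimension
}
```

The embedding itself is in the `Y` element of the returned object.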
The output object contains the 3 columns. I have joined it back with the artist and song name data so that I can create an interactive 3D scatter plot with R plotly. Below is a screenshot; the interactive one can be found here.
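A minimal sketch of that join and plot is shown below. It uses made-up song metadata and a random matrix in place of `tsne_out$Y`, so the column names and values are illustrative only. The plot is built only if the plotly package is installed.

```r
# Made-up metadata standing in for the real artist and song data
songs <- data.frame(
  artist = c("Bach", "Metallica", "Michael Jackson"),
  song = c("Song A", "Song B", "Song C"),
  category = c("Bach", "Heavy Metal", "Michael Jackson"),
  stringsAsFactors = FALSE
)

# Stand-in for tsne_out$Y: one row per song, three columns
set.seed(7)
embedding <- matrix(rnorm(nrow(songs) * 3), ncol = 3)

# Rows are in the same order, so a column bind is enough
plot_data <- cbind(songs, setNames(as.data.frame(embedding), c("x", "y", "z")))

if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(
    plot_data,
    x = ~x, y = ~y, z = ~z,
    color = ~category,
    text = ~paste(artist, song, sep = " - "),   # shown on hover
    type = "scatter3d", mode = "markers"
  )
  # print(p) opens the interactive plot
}
```

Coloring by category makes it easy to see whether the three groups separate in the reduced space.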
It is obvious that Bach music, heavy metal and Michael Jackson are different; you don’t need machine learning to hear that. So, as expected, a straightforward dimension reduction on these songs with MFCC and t-SNE clearly shows the differences in a 3D space. Some Michael Jackson songs are very close to heavy metal. The complete R code can be found here.