Poudro's blog
CTO / Data Scientist / Problem Solver - Consultant

Building a personalized radio using curated data

Building a personalized radio using curated data - poudro.com développeur web freelance

In this post I will describe the inner workings of skeep.io. It is a personal project I have been working on recently in an attempt to apply some ideas I had towards music recommendations and personalized radio.

When it comes to building such a system, there are a few things one needs, data-wise : tracks and user behavior, as well as a model to map the latter onto the former to construct a personalized sequence of tracks that “makes sense” to the user. Structuring such a sequence of tracks without prior knowledge of how each track relates to its predecessor and/or follower is a very hard task, and, in my opinion, not necessarily worth the “risk” in many circumstances. That’s where the curated data comes in. On all major streaming platforms, users can create and maintain custom playlists. Some of these are carefully tailored, and it isn’t that difficult to build a set of rules to select these “quality” playlists. Once we have the first half of the data, we need to build a profile for our users. For musical recommendation, this is usually done by taking into account a user’s listening history as well as their library and applying different transformations to build a snapshot of the user. Finally, we need to mix a user’s profile with the curated track data and build a model to generate the sequence.

(NB : For now, skeep.io only uses Deezer’s API as it is the only API, I know of, that grants the amount of access I needed to build what I had in mind. I hope to be able to open it to more platforms as they develop their APIs. The details of this API can be found here.
In turn, this means you need an active Deezer account for skeep.io to work for you, otherwise there won’t be anything to recommend when you go there).

1/ Fetching and filtering the curated data

What makes a playlist interesting in the current context? Well, we’re trying to build something that is intended to be used as a radio. So, what are the rules that streamed radios need to follow? In the US, at least, there are a set of rules that the DMCA (Digital Millennium Copyright Act) established in 1998 to regulate how music is streamed online. There is much discussion about these rules and it is not my intent to get into this here. However, within the whole text are outlined three elementary restrictions that are, in fact, a useful inspiration for measuring the relevancy of a playlist towards our goal (source):

  • No more than 3 songs from one album;
    no more than 2 played consecutively;
  • No more than 4 songs from a set/compilation;
    no more than 3 played consecutively;
  • No more than 4 recorded songs by the same artist
    (live studio appearances are okay)

Simply put, we will filter, amongst all the playlists we can fetch through the API, the ones that follow a set of rules very similar to those defined by the DMCA. After a little tinkering, these are the rules we ended up using to select the playlists :

  • Minimum duration : 2 hours;
  • Maximum number of tracks from same album : 3;
  • Maximum number of tracks from same album in a row : 2;
  • Maximum number of tracks from same artist : 4;
  • Maximum number of tracks from same artist in a row : 3;
  • Minimum number of differents artists : 7.

Once fetched and filtered, we extract what is needed from the playlists and store this information in an Elasticsearch index locally. I will discuss the details of the index in the third paragraph. By fetching new playlists regularly and adding them to the system we have access to new artists and new trends through the activity of the curators of these playlists.

2/ Building a user’s profile

As mentionned previously, Deezer’s API gives access to one crucial element here : the user’s listening history. It also makes the user’s library of favorite tracks, albums and artists available. Naturally, people’s tastes evolve over time, one gets bored with a specific album, new artists hit the charts… The profile needs to take this into account. This is done by reducing the weight of older events compared to newer events.

Another important part of building the profile is choosing a good projection space. This is needed because we are comparing things that are not necessarily comparable directly. Indeed, on one side we have the playlists, which are lists of tracks, on the other, we have a mix of tracks, albums and artists that represent the raw user’s history. Also, it is important that the final model has a certain “flexibility”, i.e., we don’t want the resulting profile to be too precise and prevent any discovery for example. Finally, there is a natural structure in the data we are dealing with here which is that every track and album has an artist. Furthermore, the artist is a common grouping element in music in general, thus, using the artist space as the pivotal vector space onto which projections are made makes sense both structurally and computationally as it entails no overhead in its generation.

We now have our raw data and destination vector space. The final elements we need to sort out are the various weights we will be giving the different events and the rate of descent. The act of listening to a track or an album, like anything else, is something we generally forget over time. When listening to a radio, we often encounter tracks that we may have forgotten but are happy to remember. To try to model this in skeep.io, the idea was to use a rate of descent that does better, on average, than what is called the “forgetting rate”.

The “forgetting rate” in humans has been studied since the late 19th century. Although, no real consensus seams to emerge, the forgetting curve seems to well modeled by a function called Wickelgren’s Power Law (see this paper for more): m = \lambda (1 + \beta t)^{- \Psi}, where m is memory strength, and t is time (i.e., the retention interval). The equation has three parameters: \lambda is the state of long-term memory at t = 0 (i.e., the degree of learning), \Psi is the rate of forgetting, and \beta is a scaling parameter. Although this seems very simplistic compared to the complexities of something like memory, it captures the essence of what we need. We won’t go into the details of the mathematical formulation we have used, but a gross representation can be seen in the following figure.

The differences have been exaggerated, but the idea is there. Overall, the hope is to be able to remember what a user liked in the past better than them so as to maybe “surprise” them with something they haven’t listened to in a while.

To wrap up the profile generation, we apply our forgetting curve to all our events based on when they occurred: the date a track was streamed and the date a track/album/artist was added to favorites. We previously applied different weights to each event, the values of these weights are the result of various experimental results of parameter optimization. Weights for tracks and albums are mapped onto the corresponding artist and all is summed and normalized. The final output is a vector in the artist space were each coordinate w_a is the weight of artist a in the user’s profile.

3/ Combining the data

The next step towards building our radio system is matching the correct playlists with each user. This is were the Elasticsearch index previously mentioned comes in handy. Indeed, as we are projecting on the artist space, it becomes clear that we index the artists of the tracks present in each playlist. The index is then queried with the user’s profile to fetch playlists that match to the user.

No need to come up with a brilliant plan to conquer the world at this point, the list of artists in each playlist is simply indexed as a string of ids separated by spaces. The built-in whitespace analyzer is used to tokenize the string into the ids internally and each occurrence of the same artist changes the term frequency in the document. This naturally assigns a weight to each artist present in the playlist depending on how frequent it is.

Querying is also quite straight forward using Elasticsearch’s query_string. It allows us to set a boosting parameter for each individual term in the query very easily. That means we can directly query the index using the user’s profile vector where the weights w_a are the boosting parameters. It is important to notice at this point that there is no direct link between the user using skeep.io as a radio and the people who created and/or curated the playlists, the relationship is solely within the query matching or not what is in the index.

How can we be sure we are getting the correct results for our needs? Internally, Lucene uses the TFIDFSimilarity class to compute scoring information between query and documents. It is, basically, using a cosine similarity function to score the documents in order. Perfect!

At the end of this step we now have a number of lists of curated tracks that all match, with a certain score, the profile of the user using the radio.

4/ Building the model

From now on, all we need are these lists of tracks. The whole point of this approach is to take advantage of the way the ordering of the tracks was curated. That means we need to select a probabilistic model that takes this ordering into account.

From a past life, I have had some experience with hidden Markov models (aka HMM), and, it is their function to model transitional probabilities between states of latent variables. The observed sequence is itself obtained through a probability over the latent variable at the same step. A schematic representation of a temporal HMM is shown below (source).

At time t, the state of the latent variable sequence is x(t) and depends solely on the state x(t-1) at time t-1 and the transition probability P \left( x(t) | x(t-1) \right). These latent variables are not directly “observed” as is, it is the sequence y(t) which is seen. Each y(t) is selected with an emission probability P \left( y(t) | x(t) \right).

To make a long story short, we build a transition matrix that holds the transition probabilities between all the artists in all the ordered lists we obtained at the previous step. Then, for each artist, we also construct an emission vector that holds the emission probabilities from artist to track. That way we model how frequently an artist A_i follows any other artist A_j within our lists of tracks and how frequently the tracks seen for each artist appear. We also need to compute the probability of each artist independently to bootstrap the generated sequence or exit closed loops.

5/ Generating a sequence

Here we are at the final step. We have a matrix of transition probabilities between artists T, a vector of start probabilities for all artists S and a vector of emission probabilities for each artist E_A. Building a sequence of tracks from this model is then simply a matter of selecting elements from a list with probabilities.

This is quite classically done using the cumulative distribution function associated to the discrete probability distribution function that we are using. This is called Pseudo-random number sampling on finite discrete distributions for which we use a binary search algorithm.

Our first step goes as follows. We first build the cdf C(S) from S, draw an X at random between 0 and 1, then select the index of i of C(S) such that C(S)_{i-1} \leq X < C(S)_i. The corresponding index i is the index of the artist A_i we will start with in S. We use the same procedure to select the first track from E_{A_i}. We repeat the same procedure over the column T_i of the transitions from A_i, etc...

From there it is just a matter of repeating the same process until we get the number of tracks we want, and there it is, we have a personalized radio! Generating a sequence with a thousand tracks takes less than a quarter of a second, on average. It is done server side for now, but it is easy to imagine packaging the model and generating a sequence client side.

6/ What’s next

The stats on the version described in this post are quite promising, i.e. 76% of tracks streamed are listened to in full and 12% of tracks streamed get added to favorites by the user. The original version of skeep.io, which didn’t go beyond step 3 and would simply push a playlist matching the listener, had 54% of streamed tracks streamed in full and 7% of streamed tracks added to favorites.

Although we have a pretty decent system already, it doesn’t take into account the user’s feedback at all (except when tracks are favorited, which influences the profile). This would be the first evolution I can think of. That is, rerank tracks and/or artists based on if the user skipped or listened to whole track, for example.

Another evolution would be to add genre information to generate sequences of tracks with the same or a similar genre. Either by generating several models using subsets of the matched playlists by genre, or simply assigning the tracks a genre and rewriting part of the selection process to take this into account.

To wrap up the post, I just wanted to emphasize on the fact that this whole system works and holds on a single low key webserver. We are working with data, sometimes there can be a lot. The advantage of working with generative models like the one we use is that once the model is built, we need nothing else, and each sequence of tracks is quick to build.

07 Nov 2014