A Thought Process For A Personal Data Science Project

An attempt to explain the thought process before and during the making of a project.

Gregory Janesch
Jul 30, 2020

I see a lot of examples of data science projects and their technical underpinnings. I also see plenty of posts giving broad but ultimately vague advice on how to do such a project. But few, if any, seem to actually walk through a thought process for making a project. Since I recently finished a moderately complicated project myself — an RShiny app designed to make recommendations from one Spotify playlist based on the content of another — I thought I would try to fill that gap by outlining my thought process and my attempts to reason through the project, particularly with regard to the issues and frustrations that I came across.

The code is available on GitHub here; I won’t go into many specifics in this post since this is more about the thought process behind it.

Background

“What do I want to do?” is the first question to ask in this scenario. The advice that you should work on something you find interesting is a cliché, but it’s true enough.

For my part, I wanted to do a project that hit a few different goals:

  • Something that could be shown to others (ideally as part of a portfolio), not just something that resided in a GitHub repository since that can be a little dull.
  • Something that would combine a bunch of different pieces or technologies, since that sort of broad knowledge seems all but required these days.
  • Something that partially existed in familiar territory, since trying to delve into something totally foreign would probably exhaust me pretty quickly.

These goals ended up rolling around in my head for a while, but I eventually settled on making something in RShiny (which I sort of knew) and trying to deploy it somewhere with Docker (which I didn’t know). It was another little while before I saw some visualizations using song data grabbed from Spotify, and I started wondering if, instead of the typical recommender systems based on user data, it was possible to just make song recommendations from one playlist (a target playlist) based on the content of another (a reference playlist). That piqued my interest enough, so I got to work on it.

Part 0: First Outline

I recently had to go through Patricia Goodson’s book Becoming an Academic Writer for a class, and one recurring idea throughout it is that it’s important to use a brain dump to get all of your ideas down and in front of you. This is especially important for me, since I always have difficulties when trying to hold too many pieces in my head all at once. So I’ve been trying to practice it as much as possible.

So writing it down (I seem to work better with paper for brain dumping) was my first step:

First outline of the project using some engineering paper. Source: Myself.

My handwriting’s not great, and the pencil doesn’t show up as well as I’d hoped, but hopefully you get what I’m going for: outlines of what seemed to be the major components and possible stumbling blocks, a rough layout of the app itself, notes about how to write it up once I was done, and other questions that came to mind. There’s some stuff that’s pretty wrong in retrospect, particularly with regard to how to host and deploy the Docker image. But this was just a first foray into the topic, so that’s okay.

Part 1: How To Make Recommendations

Next on the list was the methodology of the app: how to get the data, determine the recommendations, and create the visual outputs.

The first one isn’t much of a problem — with Spotify being as well-known as it is and having an API, it’s reasonable to guess that someone’s built an R library for that. Sure enough, there’s the spotifyr library which makes the API access fairly painless.

…Except that when I was trying to code this up, spotifyr wasn’t available for download through CRAN. Apparently this was due to one of its dependencies (a library for accessing lyrics from the website Genius) being removed from CRAN. Regardless, a normal install wasn’t an option. The library is available on GitHub, though, and since I didn’t need the lyrics functionality, I just copied the code for the functions needed to download playlists and track information from the API.
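For reference, grabbing a playlist’s tracks and their audio features through spotifyr looks roughly like this (a sketch based on the GitHub version of the library; the playlist ID is a placeholder, and the track-ID column name may differ slightly between versions):

library(spotifyr)

# Credentials are read from the SPOTIFY_CLIENT_ID and
# SPOTIFY_CLIENT_SECRET environment variables
access_token <- get_spotify_access_token()

# Track listing for a playlist, then the audio features
# (energy, danceability, valence, etc.) for those tracks
tracks <- get_playlist_tracks("<reference_playlist_id>",
                              authorization = access_token)
features <- get_track_audio_features(tracks$track.id,
                                     authorization = access_token)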

For the second item on the list, getting the recommendations was a little trickier. Spotify provides several metrics for describing songs, but typical recommendation metrics like cosine similarity didn’t seem like they’d work here, mainly due to the rather low-dimensional nature of the data.

For instance, “energy,” “danceability,” and “valence” (positivity, more or less) are three of the metrics that Spotify will return, and all are continuous-valued variables with values from 0 to 1. If you had one song that scored 0.1 on all three metrics and another that scored 0.9 on all three, those two songs obviously wouldn’t be similar at all, but they would have a cosine similarity of 1 since they lie on the same line through the origin. That’s less of a problem in higher dimensions, since there are more dimensions in which two points can differ, but it’s not going to work well in this situation.
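To make that concrete, here’s the arithmetic for those two hypothetical songs:

song_a <- c(energy = 0.1, danceability = 0.1, valence = 0.1)
song_b <- c(energy = 0.9, danceability = 0.9, valence = 0.9)

# Cosine similarity: dot product divided by the product of the norms
sum(song_a * song_b) / (sqrt(sum(song_a^2)) * sqrt(sum(song_b^2)))
# Returns exactly 1, even though the songs are nothing alike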

I ended up solving this in parallel with the visualization problem, actually. Since any visualization would require compressing the data down to two dimensions anyway, I figured I could take the dimensionally-reduced data and just use the Euclidean distances between the points. Specifically, I made the recommendations based on the sum of the five shortest distances from a target playlist track to the reference playlist tracks. It’s an unsophisticated method, but it’s easy to code (and it seemed to work well enough, but I’ll come to that in a bit).
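A minimal sketch of that scoring, assuming the tracks have already been reduced to two-column coordinate matrices (the names here are illustrative, not necessarily what the repository uses):

# For each target track, sum the k smallest Euclidean distances to the
# reference tracks; smaller sums mean stronger recommendations
recommendation_scores <- function(target_pcs, ref_pcs, k = 5) {
  apply(target_pcs, 1, function(track) {
    dists <- sqrt(rowSums(sweep(ref_pcs, 2, track)^2))
    sum(sort(dists)[seq_len(k)])
  })
}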

This just leaves the question of which dimensionality reduction technique to use. Something like t-SNE might have been worth experimenting with since I didn’t have any experience with it, but I settled on principal component analysis (PCA). There was a slight twist, though: I ran PCA only on the reference playlist, and used that model to predict the values for the songs in the target playlist. This was for two reasons:

  1. The fact that you can easily describe the weights for each principal component means that it makes a good secondary visualization of the reference playlist.
  2. If some feature in the reference playlist doesn’t vary much over the tracks, the normalization fit on that playlist will amplify the feature, which pushes target tracks that do differ on it farther away.

It actually took some time and testing with various playlists to convince myself that the second point above actually held. That’s something in favor of development strategies that incorporate experimentation, I suppose.
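In R, that fit-on-reference, project-the-target pattern is only a couple of lines (a sketch, with ref_features and target_features standing in for data frames of the Spotify metrics):

# Fit PCA on the reference playlist only, normalizing the features
pca <- prcomp(ref_features, center = TRUE, scale. = TRUE)

ref_pcs <- pca$x[, 1:2]  # reference tracks in PC space
# Run the target playlist through the same centering, scaling, and rotation
target_pcs <- predict(pca, newdata = target_features)[, 1:2]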

The other part of this that I had to work out was selecting the PCA function to use. R actually provides two slightly different functions — princomp() and prcomp() — in its base libraries, so there’s some care to be taken. I ended up using the latter, since it was easier to get it to normalize the data and there is a note in the documentation that says:

Unlike princomp, variances are computed with the usual divisor N - 1.
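That divisor difference is easy to verify directly; the two functions’ standard deviations differ by a constant factor of sqrt(n / (n - 1)):

set.seed(1)
x <- matrix(rnorm(100), ncol = 2)  # n = 50 observations

prcomp(x)$sdev / princomp(x)$sdev
# Both ratios equal sqrt(50 / 49), about 1.01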

Part 2: Test What I Have So Far

Even without the uncertainty I had in the previous section, the methodology should be given some testing, as a sanity check if nothing else. The code for this makes up the “prediction_example_script.R” file in the GitHub repository.

For a quick demonstration, consider two playlists, one covering jazz pianist Bill Evans and one for British power metal band DragonForce. We start by running PCA on the former playlist, and the breakdown of the first two principal components looks like this:

Breakdown of the first two principal components. Source: Myself.

The first component has relatively large positive weights for valence and energy, suggesting that DragonForce’s more active songs would end up offset from the Bill Evans playlist by quite a bit. Sure enough:

Songs after being transformed by PCA; the Bill Evans playlist (reference) is in red, DragonForce (target) in green. Source: Myself.

The recommended songs aren’t unreasonable — the point with the largest value on PC2 is a short and relatively tame instrumental track called “Avant La Tempête,” with the other recommendations including an acoustic track and a cover of Celine Dion’s “My Heart Will Go On.” So it looks like this recommendation method isn’t totally off the mark.

The next step, then, is moving this into a Shiny app. But first…

Part 3: Updated App Sketch

The project has changed somewhat by this point, mostly through the addition of information about the PCA model. As a result, I thought it would be good to make another sketch of the app’s UI to incorporate the new pieces:

Second sketch of the app, plus some implementation notes. Source: Myself.

There’s more detail on the recommendation table here, plus the inclusion of the PCA info, along with other wants and goals for the app itself. This was still something of a wish list, since I didn’t know how difficult some of the ideas would be to implement, but it’s based on more specific ideas and a better understanding of the problem than the first outline.

Part 4: Coding the App

With the app layout now reestablished, it was time to code up the app. I didn’t want to get too experimental with the design, as that wasn’t a goal for this project, so I drew on some experience that I had with the shinydashboard library. As the name implies, it’s designed to make it easy to produce a dashboard with RShiny.
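For anyone who hasn’t used shinydashboard, the skeleton looks something like this (placeholder inputs and outputs, not the app’s actual layout):

library(shiny)
library(shinydashboard)
library(plotly)

ui <- dashboardPage(
  dashboardHeader(title = "Playlist Recommender"),
  dashboardSidebar(
    textInput("ref_playlist", "Reference playlist ID"),
    textInput("target_playlist", "Target playlist ID"),
    actionButton("go", "Recommend")
  ),
  dashboardBody(
    box(plotlyOutput("pca_plot"), width = 8),
    box(tableOutput("recommendations"), width = 4)
  )
)

server <- function(input, output) {
  # downloading the playlists, running PCA, and scoring get wired up here
}

shinyApp(ui, server)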

This was the easiest part in one sense, since the recommendation workflow was already established. Most of this work amounted to just plugging things together and playing with the layout to get something I liked. Eventually I ended up with this:

The app, after making a prediction. Source: Myself.

As mentioned in the outline, I wanted to use Plotly to render that main scatterplot. I wanted to include some information about which song each point represented, and since I figured I’d be pressed for space, I thought Plotly’s tooltips could do the job. They do fairly well, as it turns out:

Plotly scatterplot, showing track name, artist, and album for one of the recommendations. Source: Myself.
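Tooltips like that mostly come down to building a text attribute for each point; a sketch with made-up column names:

library(plotly)

plot_ly(
  data = all_tracks,  # hypothetical data frame of transformed tracks
  x = ~PC1, y = ~PC2, color = ~playlist,
  text = ~paste(track_name, artist, album, sep = "<br>"),
  hoverinfo = "text",  # show only the custom text on hover
  type = "scatter", mode = "markers"
)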

Unfortunately, I didn’t get as far as I’d hoped on some of the other implementation details. Error checking in particular ended up getting little attention after some experimentation proved difficult and I couldn’t find much material on best practices for it. I didn’t want frustration over that to drain my motivation, so I just let it slide. (Though since recovering from a crash just requires reloading the app, it’s not that difficult to deal with.)

Part 5: Docker + AWS

There are plenty of tutorials on this — even specifically for the RShiny/Docker/AWS technology combination. My final result ended up using mostly information from this post, but swapping out the AWS login command at the end of step 4c for the following (I’m on Linux):

aws ecr get-login-password --region <region> | sudo docker login --username AWS --password-stdin <ID>.dkr.ecr.<region>.amazonaws.com

Getting the app into Docker was actually pretty quick. I had known that there were pre-built Docker images for R and Shiny, but it turned out that there was also a shiny-verse image, which had both Shiny and the tidyverse already installed and made setup simple. The Dockerfile is pretty brief as a result.
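For a sense of how brief: since the rocker/shiny-verse image already runs Shiny Server on its default port, a Dockerfile along these lines (an illustrative sketch, not necessarily the repository’s exact file) is about all it takes:

# Base image with Shiny Server and the tidyverse preinstalled
FROM rocker/shiny-verse
# Copy the app into Shiny Server's default app directory
COPY app /srv/shiny-server/app
EXPOSE 3838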

Getting it onto AWS proved to be a bit more of a challenge. I did a lot of searching through tutorials, but most of them seemed to be missing one thing or another. The one that I linked above was the closest to complete, because it was the only one I saw that talked about adjusting the port settings for the ECS instance to make the app properly accessible. And even then, as mentioned, it was still wrong on one point. This ended up taking a few evenings to get right, though if I’d had all the information in one place, it probably would’ve taken 20–30 minutes.

Conclusion

The app is here.

This project went pretty well, apart from stumbling over RShiny errors and the issues with AWS. Since I was just working on it on evenings and some weekends, it did end up taking over a month (you may have noticed that the time between the two outlines was just about a month).

But the most important thing is that I do have a better handle on the technologies that I didn’t know before, even if it’s going from knowing nothing to knowing a little bit. And I’d like to think I learned a little about the best ways to tackle such projects in the future.
