Analysis of My Spotify Data

K-means

Categorisation of my Spotify listening history using k-means clustering.

Author

Harry Zhong

Published

March 23, 2024

Introduction

The motivation for this project was:

I thought it would be fun.
That’s it.

So, let’s get into how we can use R and Spotify’s Web API to categorise songs that we have listened to.

Data Extraction

For this project, there are two key datasets that we need:

Listening activity history.
Track features.

The first can be obtained by requesting your Spotify data. For this project, we will using the extended streaming history option, which takes longer to process but gives us our full listening history as opposed to only the most recent year.

The second dataset can be generated from our listening activity using Spotify’s Web API to pull track features for each song in our streaming history.

Activity History Data

Our Spotify activity history data is given as a set of .json files. We can extract the data from all the .json files into a dataframe and perform some preliminary cleaning using the code below.

Activity History Code

files <- list.files(
  "data/full_history_data", 
  pattern = "*.json", 
  full.names = TRUE
)

full_streaming_history <- foreach(
  file = files, 
  .packages = c("jsonlite"),
  .combine = rbind
) %do% {
  fromJSON(file, flatten = TRUE)
} %>%
  rename(
    track_name = "master_metadata_track_name",
    artist_name = "master_metadata_album_artist_name"
  ) %>%
  mutate(
    track_uri = gsub(
      "spotify:track:", 
      "", 
      spotify_track_uri
    ),
    month = ts %>%
      substring(1, 7) %>%
      paste0("-01") %>%
      ymd()
  ) %>%
  select(
    -spotify_track_uri,
    -username,
    -platform,
    -ip_addr_decrypted
  ) %>%
  filter(month >= ymd("2019-04-01"))

ts	ms_played	conn_country	user_agent_decrypted	track_name	artist_name	master_metadata_album_album_name	episode_name	episode_show_name	spotify_episode_uri	reason_start	reason_end	shuffle	skipped	offline	offline_timestamp	incognito_mode	track_uri	month
2024-02-08T23:57:04Z	2072	AU	unknown	Kiss Me More (feat. SZA)	Doja Cat	Planet Her	NA	NA	NA	trackdone	fwdbtn	FALSE	TRUE	FALSE	1707436622	FALSE	3DarAbFujv6eYNliUTyqtz	2024-02-01
2024-02-08T23:57:01Z	219724	AU	unknown	vampire	Olivia Rodrigo	GUTS	NA	NA	NA	fwdbtn	trackdone	FALSE	FALSE	FALSE	1707436403	FALSE	1kuGVB7EU95pJObxwvfwKS	2024-02-01
2024-02-08T23:53:23Z	2284	AU	unknown	Judas	Lady Gaga	Born This Way	NA	NA	NA	fwdbtn	fwdbtn	FALSE	TRUE	FALSE	1707436400	FALSE	7F25roCtYi55JouckaayPC	2024-02-01
2024-02-08T23:53:20Z	1721	AU	unknown	Heads Will Roll	Yeah Yeah Yeahs	It’s Blitz!	NA	NA	NA	fwdbtn	fwdbtn	FALSE	TRUE	FALSE	1707436398	FALSE	2WRFD9WczJ975X2K1Y9YVs	2024-02-01
2024-02-08T23:53:18Z	4018	AU	unknown	What You Waiting For?	Gwen Stefani	Love Angel Music Baby	NA	NA	NA	fwdbtn	fwdbtn	FALSE	TRUE	FALSE	1707436394	FALSE	0ny5zITdmyNwyTPVzRGscU	2024-02-01
2024-02-08T23:53:14Z	2404	AU	unknown	Space Song	Beach House	Depression Cherry	NA	NA	NA	fwdbtn	fwdbtn	FALSE	TRUE	FALSE	1707436392	FALSE	3CLhX1JkJZ4s5umNnOqCRh	2024-02-01

From here, we can use the ggplot2 and shiny packages to visualise trends in my most listened to artists and tracks.

app.R

library(shiny)
library(tidyverse)
library(here)
library(foreach)

files <- list.files(
  paste0(here(), "/data/full_history_data"), 
  pattern = "*.json", 
  full.names = TRUE
)

full_streaming_history <- foreach(
  file = files, 
  .packages = c("jsonlite"),
  .combine = rbind
) %do% {
  fromJSON(file, flatten = TRUE)
} %>%
  rename(
    track_name = "master_metadata_track_name",
    artist_name = "master_metadata_album_artist_name"
  ) %>%
  mutate(
    track_uri = gsub("spotify:track:", "", spotify_track_uri),
    month = ts %>%
      substring(1, 7) %>%
      paste0("-01") %>%
      ymd()
  ) %>%
  select(-spotify_track_uri) %>%
  filter(month >= ymd("2019-04-01"))

min_date <- full_streaming_history %>%
  pull(month) %>%
  min()

max_date <- full_streaming_history %>%
  pull(month) %>%
  max()

ui <- fluidPage(
  titlePanel("Spotify Streaming History"),
  sidebarLayout(
    sidebarPanel(
      sliderInput(
        "top_n",
        "Top n",
        min = 1,
        max = 20,
        value = 10
      ),
      sliderInput(
        "dates",
        "Streaming History Date Range",
        min = min_date,
        max = max_date,
        value = c(max_date %m-% months(6), max_date)
      )
    ),
    mainPanel(
      tabsetPanel(
        type = "tabs",
        tabPanel(
          "Artist History Plot", 
          h2(textOutput("artist_history_title")),
          plotOutput("artist_history_plot")
        ),
        tabPanel(
          "Track History Plot",
          h2(textOutput("track_history_title")),
          plotOutput("track_history_plot")
        )
      )
    )
  )
)

server <- function(input, output) {
  output$artist_history_title <- renderText({
    top_n <- input$top_n
    
    paste0(
      "Proportion of Hours Listened: Top ",
      top_n,
      " Artists"
    )
  })
  
  output$artist_history_plot <- renderPlot({
    top_artists <- data.frame(
      month = full_streaming_history %>%
        select(month) %>%
        distinct()
    ) %>%
      mutate(
        top_artists = map(
          month,
          ~full_streaming_history %>%
            filter(month == .x) %>%
            group_by(artist_name) %>%
            summarise(time = sum(ms_played)) %>%
            slice_max(time, n = as.numeric(input$top_n)) %>%
            pull(artist_name)
        )
      )
    
    artist_summary <- full_streaming_history %>%
      filter(
        month %>%
          between(input$dates[1], input$dates[2])
      ) %>%
      left_join(
        top_artists,
        by = "month"
      ) %>%
      rowwise() %>%
      filter(artist_name %in% top_artists) %>%
      group_by(
        artist_name, 
        month
      ) %>%
      summarise(hours_listened = sum(ms_played/(1000*60)))
    
    ggplot(artist_summary, aes(x = month, y = hours_listened, fill = artist_name, label = artist_name)) +
      xlab("Date") +
      ylab("Proportion") +
      geom_bar(position = "fill", stat = "identity") +
      geom_text(size = 3, position = position_fill(vjust = 0.5)) +
      theme(legend.position = "none")
  })
  
  output$track_history_title <- renderText({
    top_n <- input$top_n
    
    paste0(
      "Proportion of Hours Listened: Top ",
      top_n,
      " Tracks"
    )
  })
  
  output$track_history_plot <- renderPlot({
    top_tracks <- data.frame(
      month = full_streaming_history %>%
        select(month) %>%
        distinct()
    ) %>%
      mutate(
        top_tracks = map(
          month,
          ~full_streaming_history %>%
            filter(month == .x) %>%
            mutate(track_artist_name = paste(track_name, artist_name, sep = "\n")) %>%
            group_by(track_artist_name) %>%
            summarise(time = sum(ms_played)) %>%
            slice_max(time, n = as.numeric(input$top_n)) %>%
            pull(track_artist_name)
        )
      )
    
    track_summary <- full_streaming_history %>%
      filter(
        month %>%
          between(input$dates[1], input$dates[2])
      ) %>%
      mutate(track_artist_name = paste(track_name, artist_name, sep = "\n")) %>%
      left_join(
        top_tracks,
        by = "month"
      ) %>%
      rowwise() %>%
      filter(track_artist_name %in% top_tracks) %>%
      group_by(
        track_artist_name, 
        month
      ) %>%
      summarise(hours_listened = sum(ms_played/(1000*60)))
    
    ggplot(track_summary, aes(x = month, y = hours_listened, fill = track_artist_name, label = track_artist_name)) +
      xlab("Date") +
      ylab("Proportion") +
      geom_bar(position = "fill", stat = "identity") +
      geom_text(size = 3, position = position_fill(vjust = 0.5)) +
      theme(legend.position = "none")
  })
}

shinyApp(ui = ui, server = server)

Track Feature Data

Next, we’ll need to use Spotify’s Web API to obtain track features, which requires the track ID of each track we’re interested in. Fortunately, since we requested our full activity history, this data is included as a column.

Note

On the topic of Spotify’s Web API, it’s interesting to note that it also includes genres. However, genres are linked to artists, not tracks, which makes this feature less noteworthy compared to track features.

We can use the httr and jsonlite packages to create a function that takes a Spotify track ID and returns its track features.

get_audio_features <- function(track_id) {
  url = paste0("https://api.spotify.com/v1/audio-features/", track_id)
  response <- GET(
    url,
    add_headers(Authorization = paste("Bearer", spotify_token))
  )
  data <- fromJSON(
    content(
      response, 
      "text", 
      encoding = "UTF-8"
    )
  )
  return(data)
}

Given the large number of track IDs, using this function on all tracks in our dataset is a long and painful process, where we will get rate limited many times by Spotify. Conveniently, I have a local file containing all of our tracks and their associated track features, which I will load in.

The description for each feature can be found in Spotify’s documentation.

track_name	artist_name	track_uri	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms	time_signature
Psychedelic Switch	Carly Rae Jepsen	7zy2kNoeD72x2NEDaAsJOX	0.681	0.805	4	-6.676	1	0.0480	0.00182	1.58e-02	0.341	0.304	127.907	272035	4
Sick Feeling	boy pablo	7zxLkZbUxITHabPzGN8Xgc	0.415	0.504	9	-10.003	1	0.0318	0.02200	3.80e-06	0.363	0.401	165.860	155714	4
You Get Me So High	The Neighbourhood	7zwn1eykZtZ5LODrf7c0tS	0.551	0.881	7	-6.099	0	0.0542	0.18600	7.91e-02	0.152	0.387	88.036	153000	4
No Different	Epik High	7ztlf9mCrjoLXAYYf0LCYx	0.732	0.600	1	-6.127	1	0.0535	0.03070	8.25e-05	0.137	0.238	131.912	200362	4
Sorry	The Rose	7zmrZMinkTMJ2kZgM9Kqgp	0.388	0.642	10	-4.659	1	0.0337	0.47100	0.00e+00	0.306	0.402	173.610	215477	4
Livin It Up (with Post Malone & A$AP Rocky)	Young Thug	7zjEyeBsaw9gV0jofJLfOM	0.767	0.313	7	-12.059	1	0.0798	0.83800	0.00e+00	0.105	0.765	82.582	210907	4

We can then remove discrete track features and scale the remaining features so that the clusters are not affected by the difference in magnitude of different features.

feature_matrix <- full_track_features %>%
  mutate(track_artist = paste(track_name, artist_name, sep = " - ")) %>%
  select(
    -track_name, 
    -artist_name
  ) %>%
  column_to_rownames(var = "track_artist") %>%
  select(
    -track_uri,
    -key,
    -mode,
    -time_signature
  ) %>%
  scale()

	danceability	energy	loudness	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
Psychedelic Switch - Carly Rae Jepsen	0.5034878	1.3400006	0.4206709	-0.3325940	-1.3132909	-0.3305910	1.3463448	-0.5950222	0.3365591	1.3459625
Sick Feeling - boy pablo	-1.2632525	-0.0712252	-0.3296591	-0.5474916	-1.2506975	-0.3915509	1.5205894	-0.1757695	1.6071509	-0.7309358
You Get Me So High - The Neighbourhood	-0.3599567	1.6963234	0.5508003	-0.2503492	-0.7420099	-0.0863063	-0.1505748	-0.2362802	-0.9982436	-0.7793940
No Different - Epik High	0.8422237	0.3788667	0.5444855	-0.2596349	-1.2237122	-0.3912473	-0.2693779	-0.8802870	0.4706386	0.0662492
Sorry - The Rose	-1.4425833	0.5757820	0.8755599	-0.5222876	0.1419900	-0.3915657	1.0691375	-0.1714473	1.8666057	0.3361258
Livin It Up (with Post Malone & A$AP Rocky) - Young Thug	1.0746895	-0.9667206	-0.7933436	0.0892422	1.2803337	-0.3915657	-0.5228246	1.3975087	-1.1808328	0.2545290

K-means

Now that we have the required data, we can move on to the machine learning algorithm of interest, k-means clustering.

What is K-means?

The k-means algorithm basically goes:

Choose $k$ random points within the domain of your factors.
Create $k$ clusters by assigning each observation to its nearest point, which is now referred to as a mean.
The centroid of each cluster then becomes the new mean.
Repeat until convergence.

The obvious question is: how do we determine the value of $k$ for a given set of factors? One method would be to use something called a silhouette value. We can understand the silhouette value by considering a group of clustered data points, shown below.

Chart Code

set.seed(2687)

rand_data <- data.frame(
  x = rnorm(50),
  y = rnorm(50)
)

km <- kmeans(rand_data, 3)

fviz_cluster(
  km,
  data = rand_data,
  geom = "point"
)

We’ll let $s_i$ be the silhouette value for point $i$ which belongs to cluster $C_I$, then

\[ \begin{split} s_i&=\frac{b_i-a_i}{\text{max}(a_i,b_i)},\text{ if }|C_I|>1,\\ s_i&=0,\text{ if }|C_I|=0, \end{split} \]

where

\[ \begin{split} a_i&=\frac{1}{|C_I|-1}\sum_{j\in C_I,i\neq j}\text{d}(i,j),\\ b_i&=\text{min}\frac{1}{|C_J|}\sum_{j\in C_J}\text{d}(i,j),\text{ where }J\neq I. \end{split} \]

Simply put, $a_i$ is the average of some measure of distance between point $i$ and every other point in cluster $C_I$ besides itself, and $b_i$ is the minimum average of some measure of distance between $i$ and every other point in some other cluster $C_J$. The cluster $C_J$, used to determine $b_i$, is sometimes referred to as the neighboring cluster of point $i$ as it is the next closest cluster after $C_I$.

Thus, given the definition of $s_i$, higher values of $s_i$ indicate a better fit of a point $i$ in its cluster $C_I$.

Following the definition of a silhouette value for a single point, the clustering performance of the entire dataset is calculated via the average silhouette value of all points.

Thus, we can determine the optimal number of clusters by:

Running the k-means algorithm using $n$ clusters.
Evaluating the average silhouette value.
Repeat for a reasonable range of $n$.
Ranking $n$ by maximum average silhouette value.

Conveniently, the fviz_nbclust function takes care of this process for us.

Feature selection

The next problem is finding the optimal features to include our model. Generally speaking, an easy way to determine the relevant features to include in a model is to use domain knowledge. However, we do not have this luxury as I know nothing about audio engineering. So, we will do it the hard way, by trying every combination of features and selecting the ones with the best performance.

To do this, we can use a function I wrote that does the following:

Takes inputs n, data, nstart, itermax. Where n is is the number of factors to consider, and data is the feature matrix previously generated. The nstart and itermax inputs are values passed on to the kmeans function.
Finds all combinations of n factors within data.
For each combination, determine the optimal number of clusters using using average silhouette value, and fit k-means clusters.
Records performance metrics and cluster plot.

The function then outputs a dataframe that contains a row for each combination of n factors.

Function Code

kmeans_select_features <- function(n, data, nstart, itermax) {
  comb_n <- data %>%
    colnames() %>%
    combn(n, simplify = FALSE)
  
  old_cols <- seq(1, n)
  new_cols <- paste0("factor_", seq(1, n))
  
  new_cols_sym <- syms(new_cols)
  
  factor_combinations_n <- do.call(rbind.data.frame, comb_n) %>%
    rename_with(~new_cols, all_of(old_cols)) %>%
    mutate(factors = pmap(list(!!!new_cols_sym), c)) %>%
    mutate(n_factors = n) %>%
    select(-(!!new_cols)) %>%
    mutate(data = map(factors,
                      ~data %>%
                        as.data.frame() %>%
                        select(all_of(.x)))) %>%
    mutate(n_clusters = map(data,
                            ~fviz_nbclust(.x, 
                                          kmeans, 
                                          nstart = nstart, 
                                          iter.max = itermax)[["data"]] %>%
                              slice(which.max(y)) %>%
                              select(clusters) %>%
                              as.numeric(),
                            .progress = paste("Finding optimal n_clusters:", 
                                              n, 
                                              "factors")) %>%
             as.numeric()) %>% 
    mutate(km = map2(data,
                     n_clusters,
                     ~kmeans(.x, 
                             .y, 
                             nstart = nstart, 
                             iter.max = itermax,
                             algorithm = "MacQueen"),
                     .progress = paste("Calculating kmeans:", 
                                       n, 
                                       "factors"))) %>%
    mutate(total_withinss = map(km,
                                ~.x$tot.withinss) %>%
             as.numeric(),
           bsstssRatio = map(km,
                             ~.x$betweenss/.x$totss) %>%
             as.numeric()) %>%
    mutate(km_plot = map2(km,
                          data,
                          ~fviz_cluster(.x,
                                        data = .y,
                                        geom = "point",
                                        ellipse.type = "convex"))) %>%
    arrange(desc(bsstssRatio))  
  
  return(factor_combinations_n)
}

The function can then be used on all values of n, from 2 to the total number of factors. As this process is computationally intensive, and most R packages do not support multi-threading, we can use the foreach and doParallel packages to write a multi-threaded for loop which utilises all cores of the local computer. If we paid for all our CPU cores, we might as well use them right?

set.seed(2687)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

kmeans_nfact <- foreach(n = seq(2, ncol(feature_matrix)),
                        .packages = c(
                          "tidyverse",
                          "cluster",
                          "factoextra",
                          "rlang"
                        ),
                        .combine = bind_rows) %dopar% {
                          kmeans_select_features(n, feature_matrix, 25, 1000)
                        } %>% 
  arrange(desc(bsstssRatio))

stopCluster(cl)

This results in a dataframe containing all possible combinations of factors for our dataset, and their k-means clustering results and performance, based on a value of $k$ determined by average silhouette value.

Results

We can now visualise the top 20 k-means results from our previous calculation using a shiny application. It’s interesting that the highest performing clustering results only contain 2-3 factors.

app.R

library(shiny)
library(tidyverse)
library(here)
library(foreach)
library(english)

# Load Data

load("data/R_data/kmeans_nfact.RData")

files <- list.files(
  paste0(here(), "/data/full_history_data"), 
  pattern = "*.json", 
  full.names = TRUE
)

full_streaming_history <- foreach(
  file = files, 
  .packages = c("jsonlite"),
  .combine = rbind
) %do% {
  fromJSON(file, flatten = TRUE)
} %>%
  rename(
    track_name = "master_metadata_track_name",
    artist_name = "master_metadata_album_artist_name"
  ) %>%
  mutate(
    track_uri = gsub("spotify:track:", "", spotify_track_uri),
    month = ts %>%
      substring(1, 7) %>%
      paste0("-01") %>%
      ymd()
  ) %>%
  select(-spotify_track_uri) %>%
  filter(month >= ymd("2019-04-01"))

min_date <- full_streaming_history %>%
  pull(month) %>%
  min()

max_date <- full_streaming_history %>%
  pull(month) %>%
  max()

# UI

ui <- fluidPage(
  titlePanel("Spotify Clustering"),
  sidebarLayout(
    sidebarPanel(
      selectInput(
        "index", 
        "Clustering Performance Rank", 
        choices = seq(1, nrow(kmeans_nfact_save))
      ),
      sliderInput(
        "dates",
        "Streaming History Date Range",
        min = min_date,
        max = max_date,
        value = c(max_date %m-% months(6), max_date)
      )
    ),
    mainPanel(
      tabsetPanel(
        type = "tabs",
        tabPanel(
          "Cluster Plot",
          h1(textOutput("cluster_title")),
          textOutput("cluster_description"),
          h2("Cluster Centres"),
          tableOutput("cluster_table"),
          h2("Cluster Plot"),
          plotOutput("cluster_plot")
        ),
        tabPanel(
          "Cluster History Plot", 
          h1("Cluster History Plot"),
          p("This plot shows the proportion of hours listened for each cluster."),
          plotOutput("cluster_history_plot")
        ),
        tabPanel(
          "Track Clusters", 
          h1("Track Clusters"),
          textOutput("track_clusters_desc"),
          dataTableOutput("track_clusters")
        )
      )
    )
  )
)

# SERVER

server <- function(input, output) {
  output$cluster_title <- renderText({
    rank <- input$index
    
    paste0(
      "K-means Cluster: Rank ",
      rank
    )
  })
  
  output$cluster_description <- renderText({
    total_withinss <- kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(total_withinss) %>%
      round(digits = 2)
    
    bsstssRatio <- kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(bsstssRatio) %>%
      round(digits = 2)
    
    index_ordinal <- input$index %>%
      as.numeric() %>%
      ordinal()
    
    paste0(
      "The graph below shows the k-means cluster with the ", 
      index_ordinal,
      " highest BSS/TSS ratio of ",
      bsstssRatio,
      " and a total within sum of squares of ",
      total_withinss,
      "."
    )
  })
  
  output$track_clusters_desc <- renderText({
    start_date <- input$dates[1]
    end_date <- input$dates[2]
    
    paste0(
      "This table shows the cluster of all tracks listened to from ",
      start_date,
      " to ",
      end_date,
      "."
    )
  })
  
  output$track_clusters <- renderDataTable({
    kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(km) %>%
      pluck(1) %>%
      pluck("cluster") %>%
      as.data.frame() %>%
      rownames_to_column("track_artist") %>%
      rename(cluster = 2) %>% 
      mutate(cluster = paste0("Cluster ", cluster)) %>%
      left_join(
        full_streaming_history %>%
          filter(
            month %>%
              between(input$dates[1], input$dates[2])
          ) %>%
          select(
            track_name,
            artist_name
          ) %>%
          na.omit() %>%
          distinct() %>%
          mutate(
            track_artist = paste(
              track_name,
              artist_name,
              sep = " - "
            )
          ),
        .,
        by = "track_artist"
      ) %>%
      select(
        cluster,
        artist_name,
        track_name
      ) %>%
      arrange(
        cluster,
        artist_name,
        track_name
      ) %>%
      rename(
        Cluster = "cluster",
        `Artist Name` = "artist_name",
        `Track Name` = "track_name"
      )
  })
  
  output$cluster_plot <- renderPlot({
    kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(km_plot)
  })
  
  output$cluster_table <- renderTable({
    kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(km) %>%
      pluck(1) %>%
      pluck("centers") %>%
      as.data.frame() %>%
      rownames_to_column("cluster") %>%
      mutate(cluster = paste0("Cluster ", cluster))
  })
  
  output$cluster_history_plot <- renderPlot({
    track_clusters <- kmeans_nfact_save %>%
      slice(as.numeric(input$index)) %>%
      pull(km) %>%
      pluck(1) %>%
      pluck("cluster") %>%
      as.data.frame() %>%
      rownames_to_column("track_artist") %>%
      rename(cluster = 2) %>% 
      mutate(cluster = paste0("Cluster ", cluster)) %>%
      left_join(
        full_streaming_history %>%
          select(
            track_name,
            artist_name
          ) %>%
          na.omit() %>%
          distinct() %>%
          mutate(
            track_artist = paste(
              track_name,
              artist_name,
              sep = " - "
            )
          ),
        .,
        by = "track_artist"
      ) %>%
      select(-track_artist) %>%
      arrange(
        artist_name,
        track_name
      )
    
    cluster_summary <- full_streaming_history %>%
      select(
        month,
        track_name,
        artist_name,
        ms_played
      ) %>%
      na.omit() %>%
      filter(
        month %>%
          between(input$dates[1], input$dates[2])
      ) %>%
      left_join(
        track_clusters, 
        by = c("track_name", "artist_name")
      ) %>%
      group_by(
        month, 
        cluster
      ) %>%
      summarise(hours_listened = sum(ms_played/(1000*60)))
    
    ggplot(cluster_summary, aes(x = month, y = hours_listened, fill = cluster, label = cluster)) +
      xlab("Date") +
      ylab("Proportion") +
      geom_bar(position = "fill", stat = "identity")
  })
}

shinyApp(ui = ui, server = server)

Cluster Playlists

Just for fun, and to see if our subjective interpretation of grouping songs together aligns at all with k-means and Spotify’s API, we can calculate the top 5 tracks by hours listened for each cluster in the highest performing k-means result.

Top 5 Tracks Code

cluster_top_tracks <- kmeans_nfact_save %>%
  slice(1) %>%
  pull(km) %>%
  pluck(1) %>%
  pluck("cluster") %>%
  as.data.frame() %>%
  rownames_to_column("track_artist") %>%
  rename(cluster = 2) %>% 
  mutate(cluster = paste0("Cluster ", cluster)) %>%
  left_join(
    full_streaming_history %>%
      select(
        track_name,
        artist_name,
        ms_played
      ) %>%
      na.omit() %>%
      group_by(
        track_name,
        artist_name
      ) %>%
      summarise(
        hours_listened = sum(ms_played/(1000*60))
      ) %>%
      mutate(
        track_artist = paste(
          track_name,
          artist_name,
          sep = " - "
        )
      ),
      .,
      by = "track_artist"
    ) %>%
  select(-track_artist) %>%
  group_by(cluster) %>%
  slice_max(hours_listened, n = 5) %>%
  ungroup() %>%
  select(
    cluster,
    track_name,
    artist_name
  )

cluster	track_name	artist_name
Cluster 1	i <3 u	boy pablo
Cluster 1	Apocalypse	Cigarettes After Sex
Cluster 1	Nothing	Bruno Major
Cluster 1	Sunsetz	Cigarettes After Sex
Cluster 1	Wonder	ADOY
Cluster 2	May I Ask	Luke Chiang
Cluster 2	drunk	keshi
Cluster 2	is your bedroom ceiling bored? (feat. Rxseboy) - Fudasca Remix	Sody
Cluster 2	drivers license	Olivia Rodrigo
Cluster 2	Location Unknown ◐	HONNE
Cluster 3	change ur mind	Sarcastic Sounds
Cluster 3	stay4ever (feat. Mounika.)	Powfu
Cluster 3	affection	BETWEEN FRIENDS
Cluster 3	boyfriend (with Social House)	Ariana Grande
Cluster 3	not ur friend	Jeremy Zucker

Then, we can make some playlists:

Analysis of My Spotify Data

Introduction

Data Extraction

Activity History Data

Track Feature Data

K-means

What is K-means?

Feature selection

Results

Cluster Playlists

Cluster 1 Playlist

Cluster 2 Playlist

Cluster 3 Playlist