
Shopee Code League 2021 - Data Analytics Problem


Introduction

I participated in this year’s Shopee Code League 2021 alongside two of my colleagues. As the most junior member of the team, it has been quite a growing experience thus far. The Code League has three parts spread over three weekends, and the first challenge was held last Saturday. It was a Data Analytics problem that had to be solved within three hours. Here, I talk about the solution that I eventually came up with. It was not the solution that we submitted, but one that I developed after the three hours were up, as I was sure there was a more elegant solution to the problem than the one we had presented.

The Challenge

Briefly, this is the gist of the challenge.

  1. Customer service is important to Shopee, which wants to ensure that every customer is attended to.
  2. However, some customers contact Shopee more than once, using different channels (e.g. via the app, a phone call, an email, etc.).
  3. Shopee wants to measure the Repeat Contact Rate (RCR) to ensure that they are doing a good job at responding to their customers.
  4. The dataset consists of a number of support tickets with varying levels of contact detail. Some support tickets have all three pieces of information: email, phone, and order ID. Others only have one or two.
  5. The task is to group these support tickets, matching them to one another in order to count how many times the same customer has contacted Shopee.

More information can be found on Shopee’s Google Drive.

The Dataset

The dataset consists of 500,000 support tickets, each with varying degrees of information. Here’s a screenshot from Shopee of what the first few rows of the dataset look like.

[Image: the first few rows of the dataset]

As you can see, these rows all belong to the same person (linked by email, phone number, and order ID). We now have to find all such links between every customer’s tickets.
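To make the structure concrete, here’s a tiny invented stand-in for the dataset (the column names follow the code later in this post, and I’m assuming missing fields are blank strings, consistent with how the blank-value group is handled further down):

import pandas as pd

# A tiny, invented stand-in for the real dataset. An empty string marks a
# missing piece of contact information (an assumption for illustration).
toy = pd.DataFrame({
    "Id": [0, 1, 2, 3],
    "Email": ["ann@mail.com", "", "ann@mail.com", "bob@mail.com"],
    "Phone": ["91234567", "91234567", "", ""],
    "OrderId": ["", "A001", "", "B002"],
    "Contacts": [2, 1, 3, 1],
})
# Tickets 0, 1 and 2 belong to one customer: 0 and 2 share an email, while
# 0 and 1 share a phone number. Ticket 3 is a different customer.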

The Solution

My colleagues and I collaborated on the problem. What I present below is a solution I developed after the challenge was over, incorporating concepts from our original approach but making it (a) more efficient and (b) more elegant.

I was inspired by the idea of “tidy” data from R. In a tidy dataset, the three rules are that every variable must have its own column, every observation must have its own row, and each value must have its own cell. The rule that stuck out to me was the second: every observation must have its own row. The way that the data was given to us was as follows:

[Image: the dataset in its original wide format]

As you can see, each observation (i.e. email, phone, OrderId etc.) did not have its own row. This gave me the idea to melt the dataframe.

import pandas as pd

df = pd.read_json('contacts.json')  # import the data

# pivot the contact columns into rows, then group the Ids that share a value
df_grouped = (
    df[["Id", "Email", "Phone", "OrderId"]]
    .melt(id_vars="Id")
    .groupby("value")
    .agg(list)
)

First, I selected only the columns that I needed. I left out the Contacts column because that could be added back in after we’d grouped the various support tickets together. Next, I used melt in order to pivot the columns into rows, keeping only “Id” as the identifier. This gave me:

[Image: the melted dataframe, with one observation per row]

As you can see, each observation now had its own row! I then used pandas’ groupby to group the rows by value (so that tickets with the same information would be grouped together) and aggregated the Ids into lists.

[Image: the grouped dataframe, mapping each value to a list of ticket Ids]
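On the toy dataset from earlier, the same melt-and-group pipeline produces roughly the following. Note how every ticket with a missing field collects under the blank value, which is why that first group gets dropped with [1:] in the next step:

melted = toy[["Id", "Email", "Phone", "OrderId"]].melt(id_vars="Id")
grouped = melted.groupby("value").agg(list)
print(grouped["Id"])
# value
#                 [1, 2, 3, 0, 2]   <- blank values, i.e. missing information
# 91234567        [0, 1]
# A001            [1]
# B002            [3]
# ann@mail.com    [0, 2]
# bob@mail.com    [3]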

Now, I had every association between the support tickets based on the individual pieces of information. This was not enough, however, because not every support ticket had every piece of information. We had a list of associations, but we still needed to find the intersections between the lists and merge them - a simple problem of sets and comparisons.

Solution A - Using Sets

l = list(df_grouped["Id"][1:])  # create the list of Id lists, skipping the first group (the blank values)

len_l = len(l)  # store the length of the list
i = 0
while i < (len_l - 1):  # iterate through all the lists within the list
    for j in range(i + 1, len_l):
        i_set = set(l[i])  # comparison set
        j_set = set(l[j])  # every other set that is NOT the comparison set
        # if there is an intersection, remove both matching lists, join them together
        # and add them back to the original list
        if len(i_set.intersection(j_set)) > 0:
            l.pop(j)  # pop j first, since j > i
            l.pop(i)
            ij_union = list(i_set.union(j_set))
            l.append(ij_union)
            len_l -= 1  # two lists removed, one added back
            i -= 1  # re-check this position after the merge
            break

    i += 1

This method is rather self-explanatory. Essentially, we used sets to compare the values within the lists. If there was an intersection, we merged the two lists together and added the merged list back to the original list of lists. We iterated our way through the entire list of lists, comparing each list to every other list.
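To see the merging in action, here’s the same logic wrapped in a small helper (merge_overlapping is my own name for it, not something from our submission) and run on a toy list of lists:

def merge_overlapping(lists):
    # Same set-based merge as above, wrapped in a function for easy testing.
    l = [list(x) for x in lists]
    len_l = len(l)
    i = 0
    while i < (len_l - 1):
        for j in range(i + 1, len_l):
            if set(l[i]) & set(l[j]):  # the two lists share at least one element
                merged = list(set(l[i]) | set(l[j]))
                l.pop(j)  # pop j first, since j > i
                l.pop(i)
                l.append(merged)
                len_l -= 1
                i -= 1  # re-check this position after the merge
                break
        i += 1
    return l

print(merge_overlapping([[0, 1], [1, 2], [3]]))
# [[3], [0, 1, 2]] - the two overlapping lists collapse into one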

While this method is intuitive, a HUGE downside is that it is extremely slow: every list is compared against every other list, and each merge restarts part of the scan, so the runtime grows quadratically (or worse) on a dataset of this size. A faster way to do it would be using graphs…

Solution B - Using Graphs

import networkx as nx
from networkx.algorithms.components.connected import connected_components

# create function to add nodes and edges to graph
def to_graph(l):
    G = nx.Graph()
    for part in l:
        G.add_nodes_from(part)  # add every Id in the sublist as a node
        G.add_edges_from(zip(part[:-1], part[1:]))  # chain consecutive Ids in the sublist together as edges
    return G

G = to_graph(l)
connected_list = [list(i) for i in connected_components(G)]
Since we need to merge every pair of lists that share a value, this is a connected components problem. Each Id is a node in the graph, and the Ids within a sublist are chained together with edges, so any two sublists that share an Id end up in the same connected component. We simply add each sublist’s elements to the graph as nodes and then create edges between consecutive elements.
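As a quick check, running the same toy example from Solution A through to_graph gives the same groups:

# [0, 1] and [1, 2] share the Id 1, so they collapse into one component
toy_G = to_graph([[0, 1], [1, 2], [3]])
print([list(c) for c in connected_components(toy_G)])
# [[0, 1, 2], [3]] (ordering may vary)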

Finally, we pull out the connected components from the graph, creating another nested list.

temp_df = pd.DataFrame({'A': connected_list, 'B': connected_list,
                        'GroupID': range(len(connected_list))})  # create three columns

# explode and sort to get the right order
temp_df = (temp_df.explode('A')
                  .sort_values('A')
                  .reset_index(drop=True)
                  .rename(columns={"A": "Id", "B": "trace"}))

Once we have our list of associations between support tickets (using either Solution A or B), we create a dataframe with two identical columns holding the connected lists and then add a “GroupID”. Duplicating the column ensures that when we use explode, each Id still retains its full connected list as its trace. The GroupID will allow us to more easily calculate the contacts later.

[Image: the exploded dataframe with Id, trace, and GroupID columns]
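Here’s what that looks like on a toy connected list (invented values again), showing how the duplicated column survives the explode as each Id’s trace:

toy_connected = [[0, 1, 2], [3]]
tmp = pd.DataFrame({
    "A": toy_connected,  # exploded into one row per Id
    "B": toy_connected,  # kept whole as each Id's trace
    "GroupID": range(len(toy_connected)),
})
print(tmp.explode("A").sort_values("A").reset_index(drop=True)
         .rename(columns={"A": "Id", "B": "trace"}))
#    Id      trace  GroupID
# 0   0  [0, 1, 2]        0
# 1   1  [0, 1, 2]        0
# 2   2  [0, 1, 2]        0
# 3   3        [3]        1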

full_df = temp_df.join(df["Contacts"], how="left", on="Id")  # look up each ticket's Contacts by its Id
full_df["contact_count"] = full_df.groupby('GroupID')['Contacts'].transform('sum')
full_df[["Id", "trace", "contact_count"]]

We then join the contacts into the dataframe and use groupby with transform to sum the contacts within each group. And voila…! We have our end result: a table of support tickets, with their traces and the total number of contacts for each individual customer.

[Image: the final table of Ids, traces, and contact counts]
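To make the transform step concrete, here’s the same toy grouping with invented contact counts: every ticket in a group receives its group’s total.

toy_full = pd.DataFrame({
    "Id": [0, 1, 2, 3],
    "trace": [[0, 1, 2], [0, 1, 2], [0, 1, 2], [3]],
    "GroupID": [0, 0, 0, 1],
    "Contacts": [2, 1, 3, 1],
})
# transform('sum') broadcasts each group's total back onto every member row
toy_full["contact_count"] = toy_full.groupby("GroupID")["Contacts"].transform("sum")
print(toy_full[["Id", "trace", "contact_count"]])
#    Id      trace  contact_count
# 0   0  [0, 1, 2]              6
# 1   1  [0, 1, 2]              6
# 2   2  [0, 1, 2]              6
# 3   3        [3]              1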

Conclusion

This was a really interesting challenge and a good exercise in thinking about data and the best way to join data from different sources. I learnt quite a bit, particularly about thinking in terms of nodes and edges, something I had used before during my Contact Tracing stint but had never thought of applying elsewhere. Quite excited for next week’s challenge!