Introduction

A couple of days ago, one of my friends mentioned that it would be interesting to visualise our group chat: to see who spoke the most, who most often spoke after whom, and which words were used most. In this article, I show you how I did that using pandas, json, and plotly.

Post Outline

Gathering Data
Cleaning and Preparing Data
Visualizing the Data

1. Gathering Data

[Image: how to export the chat history from Telegram Desktop]

Our group is hosted on Telegram, so the data was easy to pull. As shown, you just use Telegram Desktop to export the chat history. You can export as either .html or .json, but I would 100% recommend .json because it's easier for Python to read.

2. Cleaning and Preparing Data

Cleaning the data was less fun. The export came as a nested JSON file and hence had to be flattened (or normalized, in pandas terms). So, we imported the libraries and loaded the file…

import pandas as pd
import json

# open the exported json file
with open('hist.json', encoding="utf8") as f:
    d = json.load(f)

# flatten the records stored under the "messages" key
# (pd.json_normalize replaces pandas.io.json.json_normalize, deprecated since pandas 1.0)
norm_msg = pd.json_normalize(d['messages'])
msg_df = pd.DataFrame(norm_msg)  # store it in a dataframe

…and then proceeded to normalize it. json_normalize unpacks nested JSON, starting from the node you point it at. Had we just used pandas' read_json instead, the result would have looked something like this.

pdread = pd.read_json('hist.json') #use pandas to read json file
pdread.loc[:, pdread.columns != 'name'] #show all columns except group name

As you can see, the actual messages are nested under the parent node “messages”, which is why we pointed json_normalize at d['messages'] to unpack them.
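To make the flattening concrete, here is a minimal sketch on a made-up miniature of an export. The field names (`location`, `lat`, `lon`, `actor`, `action`) are assumptions chosen for illustration and may not match your own file.

```python
import pandas as pd

# Hypothetical miniature of a chat export; a real file has many more fields.
export = {
    "name": "Group chat",
    "messages": [
        {"id": 1, "type": "message", "from": "FriendA", "text": "hello"},
        {"id": 2, "type": "message", "from": "FriendB", "text": "here",
         "location": {"lat": 1.3, "lon": 103.8}},
        {"id": 3, "type": "service", "actor": "FriendC", "action": "pin_message"},
    ],
}

# Pointing json_normalize at the "messages" list gives one row per message.
# Nested dicts are flattened into dotted column names such as "location.lat",
# and keys that a record does not have are filled with NaN.
flat = pd.json_normalize(export["messages"])

print(flat.columns.tolist())
print(flat.shape)
```

Each message becomes one row, and the nested `location` dict ends up as two ordinary columns, which is the shape we need for filtering later.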

Once that was done, I needed to extract the important columns. I kept only the rows whose type was “message”, because the export had also recorded other events such as starting a poll, sending a location, or sending a sticker, and these did not contain actual text. I then narrowed it down to the three columns I cared about: the date, the text, and who the text was from. Finally, I relabelled the senders to make the output easier to read.

msg_df_filtered = msg_df[msg_df["type"] == "message"]  # keep only rows of type "message"
msg_df_filtered = msg_df_filtered[["date", "text", "from"]]  # keep the important columns
# shorten the sender names
rename_map = {"FriendA": "A", "FriendB": "B", "FriendC": "C", "FriendD": "D",
              "FriendE": "E", "FriendF": "F", "FriendG": "G", "FriendH": "H"}
msg_df_filtered = msg_df_filtered.replace({"from": rename_map})
msg_df_filtered.head(3)

I also noticed that some of the messages contained random characters and I needed to get rid of those so that proper words could be formed.

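temp = msg_df_filtered.copy()  # copy the df so the original stays untouched
# replace every run of non-alphanumeric characters with a space
# (regex=True is needed on pandas 2.0+, where str.replace matches literally by default)
temp['text'] = temp['text'].str.replace('[^A-Za-z0-9]+', " ", regex=True)
temp = temp.dropna()  # remove rows with missing values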
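One thing worth checking in your own file: as far as I can tell, the export stores messages containing links or formatting as a list of parts rather than a single string. Those rows appear to come out of `.str.replace` as missing values and are then discarded by `dropna`. The sketch below shows one way to keep them, assuming each part is either a string or a dict with a "text" key. The helper `flatten_text` is a hypothetical name of mine, not part of pandas.

```python
import pandas as pd

def flatten_text(value):
    """Return plain text for a message stored as a str or a list of parts."""
    if isinstance(value, list):
        # Assumption: each part is a str or a dict carrying a "text" key.
        return "".join(
            part if isinstance(part, str) else part.get("text", "")
            for part in value
        )
    return value

# Made-up sample: a plain message, a message with a link part, and a missing one.
sample = pd.Series([
    "Hello!!! :)",
    ["see ", {"type": "link", "text": "example.com"}, " now"],
    None,
])

cleaned = (
    sample.map(flatten_text)                          # lists become strings
    .str.replace(r"[^A-Za-z0-9]+", " ", regex=True)   # same cleaning as above
    .str.strip()                                      # drop leading/trailing spaces
    .dropna()                                         # only truly empty rows go
)

print(cleaned.tolist())  # ['Hello', 'see example com now']
```

Applied to the real data, this would mean running `temp['text'].map(flatten_text)` before the regex step, so that only genuinely empty messages are removed.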

3. Visualizing the Data

Most common word used

Finally, we were done and ready to get on with the fun stuff! To count the words, I first joined all the messages in the text column into one long string, converted it to lowercase, and split it into individual words. I then used value_counts and pulled the top 100 words used in the chat.

common_word = pd.Series(' '.join(temp['text']).lower().split()).value_counts()[:100]  # top 100 words used
# name the columns explicitly; the default names from reset_index() differ between pandas versions
common_word = common_word.rename_axis("Word").reset_index(name="Count")
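To see what each step of that one-liner does, here is the same pipeline on a tiny made-up dataframe (the three messages are invented for illustration):

```python
import pandas as pd

# Made-up messages standing in for the cleaned text column.
toy = pd.DataFrame({"text": ["The cat", "the dog", "THE cat sat"]})

joined = " ".join(toy["text"])    # "The cat the dog THE cat sat"
words = joined.lower().split()    # ['the', 'cat', 'the', 'dog', 'the', 'cat', 'sat']

# Count each word, keep the two most frequent, and name the columns.
top = (
    pd.Series(words)
    .value_counts()
    .head(2)
    .rename_axis("Word")
    .reset_index(name="Count")
)

print(top)
```

Lowercasing before splitting is what makes "The", "the", and "THE" count as the same word.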

I then plotted it as a simple bar chart using plotly.express. Clearly, the word “the” was extremely popular.

import plotly.express as px

fig = px.bar(common_word, x='Word', y='Count')
fig.show()

[Figure: bar chart of the most common words]

Who sent the most messages?

In order to see who had sent the most messages, I then used value_counts on the from column and plotted it out too.

messages_sent = temp['from'].value_counts()
# name the columns explicitly; the default names from reset_index() differ between pandas versions
messages_sent = messages_sent.rename_axis("User").reset_index(name="Count")

fig2 = px.bar(messages_sent, x='User', y='Count')
fig2.show()

[Figure: bar chart of messages sent per user]

Who responds to whom?

This was an interesting one. I wanted to gauge interactions and see who most often replied after whom. So I used a list comprehension to record, for each message, who sent it and who sent the message just before it. From the looks of it, I guess person A had a lot of fun replying after himself.

# list comprehension pairing each sender with the sender of the previous message
replies = [temp.iloc[n, 2] + " to " + temp.iloc[n - 1, 2] for n in range(1, len(temp))]
replies.insert(0, "A to A")  # placeholder for the first message so the list matches the df length

temp['Replies'] = replies  # add the list to the df

replies_to = temp['Replies'].value_counts()  # count each pairing
# name the columns explicitly; the default names from reset_index() differ between pandas versions
replies_to = replies_to.rename_axis("Replies").reset_index(name="Count")

fig3 = px.bar(replies_to, x='Replies', y='Count')
fig3.show()

[Figure: bar chart of who replied after whom]
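As an aside, the same pairing can be sketched without a loop by using shift, which lines each sender up with the sender of the previous row. This is an alternative to the code above rather than what I originally ran, and the five-message dataframe is made up. It also avoids the placeholder "A to A" entry, which adds one extra count to that pairing.

```python
import pandas as pd

# Made-up sender sequence standing in for the "from" column.
toy = pd.DataFrame({"from": ["A", "A", "A", "B", "A"]})

# shift(1) moves every sender down one row, so each row sees its predecessor.
previous = toy["from"].shift(1)
toy["Replies"] = toy["from"] + " to " + previous

# The first message has no predecessor, so its pairing is missing;
# value_counts skips missing values by default.
reply_counts = (
    toy["Replies"]
    .value_counts()
    .rename_axis("Replies")
    .reset_index(name="Count")
)

print(reply_counts)
```

Here "A to A" is counted twice because A really did follow A twice, with no placeholder inflating the total.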

Custom words

Finally, I wanted to see how many times certain words of my own choosing had been used. I created a list of these words and then counted the number of times each one appeared in the pd.Series built from our text data.

list_words = ["fun", "love", "joy", "anger", "hate", "upset"]  # create list of custom words
all_words = pd.Series(' '.join(temp['text']).lower().split())  # build the word series once
int_word_count = [int((all_words == w).sum()) for w in list_words]  # count how often each word appears

#create df to store
int_words = pd.DataFrame(
{"Word" : list_words,
"Count" : int_word_count}
)

#plot df
px.bar(int_words, x="Word",y="Count",color_discrete_sequence=px.colors.qualitative.G10)

[Figure: bar chart of the custom word counts]

I guess we all love each other a lot and have plenty of fun.

Conclusion

So there you have it! A fun little exploration of words and phrases in a group chat. Thanks for reading and hope you found it interesting!