Ever wondered how engaging the content you delivered was? Whether it was clear or confusing? Or whether people misunderstood your message at that company-wide meeting?
Remote environments give teachers and leaders very little chance to gather feedback and optimize their content for better performance.
Since a considerable part of my career had already been remote (pre-COVID times, actually!), these questions sparked excitement and joy in my brain, hungry for creative solutions. I had the data; all I had to do was draft the questions I wanted to answer and get to work.
This was also the final, end-to-end project of my Bootcamp, and I had subject matter experts at hand as stakeholders (my lead teacher and my project mentor) to steer me towards a value-driven product.
Data Source
My dataset consisted of the public Slack conversations from the Ironhack Bootcamp, from the first day to the last, provided by the Slack admin as an export. It could have been collected through the Slack API as well, but that was out of scope for my project.
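For the curious, here is a minimal sketch of what that API route could look like with the official slack_sdk client; the token, channel ID, and pagination handling are illustrative assumptions, not something I used in this project:
# hypothetical sketch: pulling channel history via the Slack Web API instead of an export
from slack_sdk import WebClient

client = WebClient(token="xoxb-your-bot-token")  # assumed bot token with channels:history scope

def fetch_channel_messages(channel_id):
    """Collect all messages from one channel, following cursor-based pagination."""
    messages, cursor = [], None
    while True:
        response = client.conversations_history(channel=channel_id, cursor=cursor, limit=200)
        messages.extend(response["messages"])
        cursor = response.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            return messages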
To keep this blog post concise, I focused on highlighting the exciting parts of the code rather than the insights it produced.
If you’re looking for the:
- visuals (done in Tableau), check out my presentation
- detailed code, browse my GitHub repo here.
Data Cleaning & Wrangling
To give you a sense of the challenge posed by the sheer number of JSON files, here is just the general channel’s folder, containing all the conversations broken down by day:

So I started off by loading each channel’s JSON files into a single dataframe.
import glob
import os

import pandas as pd

# defining file path
path_to_json = '../raw_data/general/'

# get all json files from there
json_pattern = os.path.join(path_to_json, '*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_gen = pd.concat(dfs, ignore_index=True)

# test
channel_gen.tail(100)
Then I went on to merge each channel’s separate dataframe into a single one for convenience.
# merge all channel dataframes (each channel was loaded the same way as general)
df = pd.concat([channel_gen, channel_books,
                channel_dmemes, channel_dresource,
                channel_dbootcamp, channel_funcommittee,
                channel_dvizbeauties, channel_frustrations,
                channel_finalproject, channel_katas,
                channel_labhelp, channel_music,
                channel_random, channel_vanilla],
               ignore_index=True, join="outer")
By this point, my dataframe had 5263 rows and 13 columns, much of it unrelated to my project. Cleaning was tough.
Columns to clean & wrangle:
- subtype: filter out its values from df, remove the original column
- ts: change it to datetime, drop milliseconds, derive day of the week, month of the year, type of day, and part of the day (see the sketch below)
- user_profile: extract real_name into a new column, remove the original
- attachments: extract title, text, and link into new columns
- files: extract url_private and who shared it
- reactions: extract user, count, and name of the emoji
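As an example of the ts wrangling above, here is a minimal sketch of how those datetime features could be derived on the cleaned dataframe (df_clean) that appears below; the derived column names and the part-of-day boundaries are my own illustrative choices:
# convert Slack's Unix timestamps into datetime features (column names are illustrative)
df_clean['ts'] = pd.to_datetime(df_clean['ts'].astype(float), unit='s').dt.floor('s')  # drop sub-second precision
df_clean['day_of_week'] = df_clean['ts'].dt.day_name()
df_clean['month'] = df_clean['ts'].dt.month_name()
df_clean['type_of_day'] = df_clean['day_of_week'].isin(['Saturday', 'Sunday']).map({True: 'weekend', False: 'weekday'})

def part_of_day(hour):
    """Bucket an hour into a rough part of the day (boundaries are an assumption)."""
    if hour < 6:
        return 'night'
    elif hour < 9:
        return 'early_morning'
    elif hour < 12:
        return 'morning'
    elif hour < 18:
        return 'afternoon'
    return 'evening'

df_clean['part_of_day'] = df_clean['ts'].dt.hour.apply(part_of_day)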
Since almost all the data was nested inside JSON structures, the majority of my project’s time was spent iterating through feature engineering tasks to conjure up variables I could use to train models. At the same time, extracting the data is what I enjoyed the most. Below you can see a couple of examples of the functions I created to draw insights from the dataframe.
Who sent the most replies:
# user_profile column: extract real_name
def getrealnamefromprofile(x):
    """this function is applied to column user_profile"""
    # x != x is only True for NaN, i.e. rows without a user profile
    if x != x:
        return 'noname'
    else:
        return x['real_name']

df_clean['real_name'] = df_clean['user_profile'].apply(getrealnamefromprofile)
df_clean
What kind of emojis were used the most in the cohort:
# reactions column: extract frequency
def getcountfromreactions(x):
    """this function is applied to column reactions"""
    # NaN check: rows without any reactions
    if x != x:
        return 0
    else:
        return x[0]['count']

df_clean['reactions_count'] = df_clean['reactions'].apply(getcountfromreactions)
df_clean
What links people shared on the channels:
# files column: extract link
def geturlfromfile(x):
    """this function is applied to column files"""
    # NaN check: rows without any files attached
    if x != x:
        return 'nofile'
    else:
        try:
            return x[0]['url_private']
        except KeyError:
            return 'nolink_infiles'

df_clean['link_of_file'] = df_clean['files'].apply(geturlfromfile)
df_clean
To help me find the source of the communication, I created another function to differentiate the Lead Teacher and the Teaching Assistants from the students.
# create a new column separating teaching staff and students
def applyFunc(s):
    if s in ('siand the LT (she/her)', 'Florian Titze', 'Kosta'):
        return 'teacher'
    else:
        return 'student'

df_clean['participant'] = df_clean['real_name'].apply(applyFunc)
df_clean['participant'].value_counts()
Finally, before getting ready for the models, I took a moment to look at my cleaned dataframe with appreciation:


Natural Language Processing
While working through this section, I realized this is what I want to specialize in. Text analytics, sounds so cool, right? Just imagine the amount of time people spend reading cumbersome text and trying to analyze it with a brain full of biases, when a machine can do it in milliseconds. It makes me shiver.
My scope originally included text feature extraction as well (since that’s the most valuable thing you can get out of written communication), and it is something I’m working on right now. However, there was no time for it in the five days I had, and the topic was out of scope for the Bootcamp too.
Instead, I focused on computing a sentiment score for each comment and generating an awesome wordcloud from the most frequently used words as a present to my peers. ❤️
Code
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def clean_links(df):
    # replace URLs in the text column with a space
    df['text'] = df['text'].str.replace(
        r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
        ' ', regex=True)

clean_links(df_sent)
df_sent['text']

# load VADER
sid = SentimentIntensityAnalyzer()

# add VADER metrics to dataframe
df_sent['scores'] = df_sent['text'].apply(lambda text: sid.polarity_scores(text))
df_sent['compound'] = df_sent['scores'].apply(lambda score_dict: score_dict['compound'])
df_sent['comp_score'] = df_sent['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

# test
df_sent.head()

This was easy. Now, on to the more challenging preprocessing needed to create a wordcloud without links, numbers, punctuation, or stopwords:
import html
import re
import string
import unicodedata

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# set of stopwords to be removed from text
stop = set(stopwords.words('english'))
# update stopwords to have punctuation too
stop.update(list(string.punctuation))

def clean_text(text_list):
    # remove unwanted html characters
    re1 = re.compile(r'  +')
    x1 = text_list.lower().replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
        ' @-@ ', '-').replace('\\', ' ')
    text = re1.sub(' ', html.unescape(x1))
    # remove non-ascii characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # strip html
    soup = BeautifulSoup(text, 'html.parser')
    text = soup.get_text()
    # remove text between square brackets
    text = re.sub(r'\[[^]]*\]', '', text)
    # remove URLs
    text = re.sub(r'http\S+', '', text)
    # remove twitter tags
    text = text.replace("@", "")
    # remove hashtags
    text = text.replace("#", "")
    # remove all non-alphabetic characters
    text = re.sub(r'[^a-zA-Z ]', '', text)
    # remove stopwords from text
    final_text = []
    for word in text.split():
        if word.strip().lower() not in stop:
            final_text.append(word.strip().lower())
    text = " ".join(final_text)
    # lemmatize words
    lemmatizer = WordNetLemmatizer()
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])
    text = " ".join([lemmatizer.lemmatize(word, pos='v') for word in text.split()])
    # replace all numbers with "num"
    text = re.sub(r'\d+', 'num', text)
    return text.lower()

# apply cleaning function
df_train['prep_text'] = df_train['text'].apply(clean_text)
df_train['prep_text'].head(5)

# apply wordcloud function
make_wordcloud(df_train['prep_text'])
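The make_wordcloud helper itself isn’t shown in this post; a minimal sketch of what such a helper could look like with the wordcloud package (the sizing and color choices are my own assumptions):
# hypothetical sketch of a make_wordcloud helper
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def make_wordcloud(text_series):
    """Join all preprocessed comments and plot a word cloud of the most frequent words."""
    text = " ".join(text_series)
    cloud = WordCloud(width=1200, height=600, background_color='white').generate(text)
    plt.figure(figsize=(12, 6))
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()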
and the result: (ta-daa)

Machine Learning Models
To highlight something cool here, I used a Random Forest classification model to see which features a message needs in order to get a reply (and in this case, to get help from the cohort), reaching an accuracy score of 0.86:
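The plotting code below relies on a fitted model and its feature importances; here is a minimal sketch of how that part could look (the feature matrix X, target y, and split parameters are assumptions, as the full training code lives in the repo):
# hypothetical sketch: training the classifier that produces `importances`
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X holds the engineered features, y is whether a message received a reply
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(accuracy_score(y_test, rf.predict(X_test)))  # 0.86 in my case
importances = rf.feature_importances_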
import matplotlib.pyplot as plt

# feature importance
feat_importances = pd.Series(importances, index=X.columns)

plt.figure(figsize=(10, 10))
feat_importances.nlargest(15).plot(kind='barh', color='#FF9B48', width=0.7)
plt.xlabel('Level of importance', fontsize=16)
plt.ylabel('Features', fontsize=16)
plt.yticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
           ['length_of_text', 'neutral_tone', 'positive_tone',
            'amount_of_reactions', 'negative_tone',
            'may', 'morning', 'march', 'files_attached',
            'teacher_posted', 'evening', 'early_morning',
            'labhelp_channel', 'general_channel', 'got_reaction'])
plt.title("Top 15 Important Features", fontsize=20)
plt.show()

It seems you have a better chance of getting a reply if you write a lengthy message in a neutral or positive tone, if it receives a lot of reactions, if you send it in the morning, or if you attach a file to it.
Conclusion
Some things I learned through this project were that:
- working on something you are invested in is a game-changer
- having stakeholders by your side is invaluable
- iteration is key
- functions save you time in the long run
To continue, I’ll take this dataset and apply my newly acquired knowledge from Udemy’s NLP course to extract some cool stuff from the comments too.