When social scientists turn computational: a conference to save humanity?

A symposium to tackle “Bias and Discrimination” challenges in data.

mariam
Towards Data Science

--

The 2nd European Symposium Series on Societal Challenges in Computational Social Science (or EuroCSS for short) was held in Cologne, Germany, on the 4th–7th of December 2018. I attended to take part in the first-ever "Paper Hackathon", organized by Associate Professor Bennett Kleinberg (UCL) as a workshop on Linguistic Temporal Trajectory Analysis (LTTA), a very fancy way of saying text analysis: how the language of a text changes over time. I was intrigued by the "Paper Hackathon", as the aim was to have a publishable paper by the end of the day (for those who do not come from an academic background: a "paper" is a scientific report that records the findings of research conducted, is published in journals, and is a way to quantify success for academics... keep your comments on this for later, please!). Having attended numerous conventional hackathons within computer science, to say the least, I was fascinated by this mash-up of a hackathon applied to the social sciences!

The EuroCSS held in Cologne, Germany, and its Twitter hashtag.

Day 1 — Pre-Symposium Paper Hackathon

The first day in Cologne consisted of the pre-symposium workshops held at the Marriott Hotel. Numerous training courses were running at the same time, but I attended the LTTA workshop as I wanted to apply the methods to my own research (later post on my crazy research!).

The workshop started off with a background session to bring everyone in the room up to speed on the technical workings of the methods that would be used later on. The room had some 30 students and researchers from various backgrounds, ranging from pure social sciences to hardcore computer scientists and developers.

We moved on to the practical, using R and RStudio on our own laptops, to put into practice all the theory we had just absorbed, but also as a sanity check that all the right packages and data-sets were downloaded, installed and running properly. I was fortunate enough to have done all this before arriving at the workshop, and so had the comfort of playing around with the parameters of the methods while everyone else was setting up.

UCL Data Science Team running the LTTA workshop, from left to right: Isabelle van der Vegt, Maximilian Mozes and Bennett Kleinberg.

Bennett and his students, Isabelle van der Vegt and Maximilian Mozes, had done a great job preparing all the necessary files, scripts and data on their GitHub repository (please do check it out and apply it to your own research here!). There was a choice of two parsed and pre-processed (a programmer's dream!) data-sets of vlogs (if you're not familiar with the term, these are video blogs, transcribed into text for analysis): one from media channels categorized as either Right or Left in terms of political influence, and the other from YouTube's "Creators for Change" users, whose aim is to balance the toxic material found online with positive content. The LTTA method, in layman's terms, measures how positive or negative the sentiment of the text is as you progress from the start to the end of the script. These so-called "trajectories" boil down to a few types (from memory, a total of 7) with funny names describing their shapes, such as "rags-to-riches" for an upwards trajectory, or my favorite, "mood swings", for one resembling a squiggly line alternating between positive and negative sentiment (you know, girls, that once-a-month type thing).
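To make the trajectory idea concrete, here is a minimal sketch in Python. This is not the workshop's actual code (that lives in the GitHub repository mentioned above); the tiny word lists and simple windowing are illustrative placeholders for a real sentiment tool.

```python
# Toy sentiment "trajectory": split a transcript into windows and score
# each window with a small lexicon. Real analyses would use a proper
# sentiment model; these word sets are placeholders.
POSITIVE = {"good", "great", "love", "happy", "win"}
NEGATIVE = {"bad", "sad", "hate", "angry", "lose"}

def trajectory(text, n_windows=10):
    """Return one sentiment score per window, from start to end of the text."""
    words = text.lower().split()
    size = max(1, len(words) // n_windows)
    scores = []
    for i in range(0, len(words), size):
        window = words[i:i + size]
        score = sum(w in POSITIVE for w in window) - sum(w in NEGATIVE for w in window)
        scores.append(score / len(window))  # normalize by window length
    return scores
```

Plotting the returned scores against window index gives the shape of the arc, which is what the "rags-to-riches" or "mood swings" labels describe.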

Getting ready for our hackathon on LTTA!

Before getting our hands dirty, we charged up on brain juice over lunch and headed out around the Cologne Cathedral and city centre. Once filled up on carbs, we came back to the Marriott to come up with our awesome research questions, split into two groups and start hacking!

I was in group 1, with the research question of comparing the sentiment trajectories of topics between the different (left or right) media channels. This required topic modelling, sentiment analysis and clustering (if you're interested in more technical details, be my favorite person and contact me below!). The other group (and, it being a competition, the not-so-liked one), group 2, focused on the effect of popularity (the number of likes on the vlogs) on sentiment between left and right media channel vlogs.

We found a nice and comfy space within the Marriott (one that obviously included couches and many plugs) to plan our 3-hour project. After a good 20-minute brainstorm, we split in half between our "writing" members and our "analysis" members. Funnily enough, we ended up with three sub-groups within the analysis team. The fierce Python coders ran against the legendary (unbiased opinion here...) R programmers, both conducting the same topic modelling method, LDA, just in different programming languages. Then there were the less fortunate, MANUAL (it still makes me cringe), yes, you heard me right, manual and very much bias-prone coders, who had pre-selected two topics (immigration and trade war) and were, for some very odd reason, hand-crafting regex keywords and using dictionaries to search for text with related terms, manually... I can't say the word "manual" anymore (insert face palm here), but you get the point. Three sub-groups, doing exactly the same thing in different ways: we were all trying to find the top topics discussed in the media channels, so we could then apply sentiment analysis and see how they differ.

Our LDA results from our analysis on the top topics found in Media Channels from 2016–2018 — no surprise “trump” is in there…!

Despite the redundant duplication of methods, our group was pretty fun. We had people from Italy, Persia, the UK, Greece and Germany, all working together. We may have had a little too much fun: when spying on group 2, they seemed to have their sh*t together in comparison. They had already prepared a PowerPoint and had everything ready to go when the time was up. What was group 1 doing? Still running LDA. In three different ways. Our LDA was still running when it was time to present our findings, and it finally finished the second I began introducing our research and outcomes (insert a bigger face palm here). Jokes aside, it was a very creative way of getting social scientists engaged with computational methods, and getting researchers in general focused on churning out solid work within a single day!

I’m looking forward to the next one (officially volunteering here to help organize it!), with a little more time for the hacking part, more readily available snacks and, most importantly, caffeine, because it can’t be a hackathon without the over-consumption of some form of energy, as well as some sort of sleep deprivation!

As soon as the hackathon was over and after a few beers, the “science slam” wrapped up the first day of the conference, with awkward PhD students giving short 7-min stand-ups of their work. Yes. There was free beer.

Day 2 — Symposium Part 1 at the Maternushaus

Having had my first coffee after an early 10 km run along the Rhine, it was great to catch up with a group member on all the “to-dos” we would need to address if we wanted to pursue publishing a paper from the hackathon conducted the previous day. I was glad to share the commitment and interest with a fellow researcher who had a beautiful mix of a computer science background and social science interests.

All caffeinated and full of exercise endorphins, I was ready for a day full of talks and networking (bring it on!). The first keynote speaker, Christo Wilson from Northeastern University, gave an excellent presentation on auditing career search engine algorithms for gender bias. It was good to find out that the first 5–10 results weren’t biased, but as soon as a recruiter looked further down the rankings, male applicants were ranked first despite equal qualifications and merit. It was scary to see that the algorithm churns this out even though gender is never entered when creating an account…

The talk of the day, however, the one that hands down blew my mind, was from Juergen Pfeffer, on the sampling algorithm of the Twitter API. It was essentially an awesome hack on how that 1% snippet of data is provided by Twitter to the data user: it seriously only requires a good (a.k.a. expensive) internet connection to hit a specific millisecond window (657–666 ms, or some specific range of the sort) and boom! You can dangerously manipulate, exploit, bias, or whatever your preferred term is, the whole sampling data-set! Mic. Drop.
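To illustrate the trick: roughly speaking, a tweet lands in the 1% Sample stream when the millisecond component of its creation timestamp falls inside a fixed window. The sketch below is a toy model of that idea; the exact bounds are just the ones quoted in the talk, not a verified specification of Twitter's internals.

```python
# Toy model of the sampling hack from the talk: the 1% Sample stream
# reportedly includes a tweet when the millisecond part of its creation
# timestamp falls in a fixed window. Bounds below are the ones quoted in
# the talk and may not be authoritative.
SAMPLE_WINDOW = range(657, 667)  # milliseconds 657-666 inclusive

def in_sample(timestamp_ms: int) -> bool:
    """Would a tweet created at this epoch-milliseconds timestamp be sampled?"""
    return (timestamp_ms % 1000) in SAMPLE_WINDOW

# An attacker with low enough latency could time their posts to land
# inside (or outside) that window, skewing the "random" sample that
# researchers rely on.
print(in_sample(1544140800657))  # millisecond component 657: inside the window
print(in_sample(1544140800123))  # millisecond component 123: outside
```

The danger is exactly this determinism: anyone who can control posting time to the millisecond can decide whether their content appears in the sample researchers treat as random.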

So, as you can see, no 2-min poster pitches (no alliteration intended) could match that, although a shout-out to Justin Chu-Ting Ho, a PhD student from Edinburgh, who had a rad poster on his work on how the new Facebook API is heavily biased. Right. Sanity check: all this “bias” talk really means the following. Researchers use social media data to answer fundamental questions about the world, humans, and how humans behave in this world. Facebook, Twitter and all the rest own massive amounts of data that these very humans willingly give away, for free, and yet Facebook, Twitter and the likes refuse to share this very publicly available data. I’ll let you ponder the implications there, but let me know if you need any help.

On that slightly cynical note, we had our much-needed lunch break. I got the chance to speak to researchers from all around the world and even discuss potential collaborations. That’s what conferences are for anyway, right?

The second day ended with a highly technical but great final keynote speaker, Sara Hajian, sharing a few solutions to the aforementioned (and scarily numerous) biases we come across in data collection.

We all visited the local Christmas markets of Cologne; literally every few blocks there was a different Christmas market, enough for the next 5 years or so! I got a delicious half-a-meter meat stick, Glühwein (mulled wine) and loads of sweets!

Day 3 — Symposium Part 2 at the Maternushaus

The last day of the symposium may best be described as a little sluggish: either the caffeine was severely diluted in the coffee provided, or my brain was heavily saturated with the information received the previous day. Regardless, it was another talk-heavy day, with another chance to look at the posters and vote for the best one.

The highlight of the day was finally hearing from the SocioPatterns researchers, who were running an experiment during the symposium, as we had all agreed to wear RFID chips on our lanyards. I’m truly looking forward to the outcomes of the experiment; I imagine I would come out as a social butterfly if you were to track my interactions. Is there a test for being humble? I’m not too sure how I would do on that…

Overall, the conference was an eye-opener to the considerations you must make in your data collection as a researcher, but also, as a data consumer, to how all this information may reinforce bias in our society. The people I met and got the chance to converse with made for a great exchange of ideas, and acquaintances that may be maintained in the future. And finally, Cologne was a lovely city to visit and explore, with cute markets and delicious pretzels. Bis zum nächsten Mal! (Until next time!)
