The excitement and tension in the air were palpable as crowds lined up only to be turned away as conference rooms filled to capacity at PyCon DE 2023 in mid-April in Berlin. The release of ChatGPT mere months before had set off an AI frenzy, sparking a tsunami of innovation and collaboration to develop the first fully open source, state-of-the-art instruction-following LLM. During the three days of the conference alone, the open source community announced the releases of LLaVA, StableLM, and the RedPajama dataset.
If I could summarize PyCon DE 2023 in one sentence, I would say: "LLMs in isolation are not the future."
To summarize talks by Erin Mikail Staples of Label Studio and Ines Montani of Explosion: LLMs perform better on downstream tasks when they are paired with task-specific data. The other prevalent topic of conversation among attendees was OpenAI’s intrusive data collection policies, which prevent many companies, and even entire industries, from using ChatGPT and GPT-4 commercially.
The purpose of this article is to give you an overview of my five favorite talks at this year’s PyCon DE, with links to the program descriptions, slides, and code where available. All talks from the conference were recorded and will be fully accessible to the public once they are uploaded.
Improving machine learning from human feedback
(Erin Mikail Staples, Label Studio)
Models trained on enormous datasets, like ChatGPT, impose internet-scale biases on downstream tasks. Prompt engineering, the process of iteratively selecting and designing prompts to elicit a desired response from a generative language model, while popular, simply adapts to a model’s known limitations. Fortunately, there’s a better alternative for addressing bias in LLMs.
In this talk, Reinforcement Learning from Human Feedback (RLHF) is the guest star. RLHF is a process by which a model iteratively learns from feedback provided by a human in order to improve model performance. RLHF gives you finer control over LLMs, aligning model output with your specific needs and use case while also reducing the bias associated with LLMs. Label Studio is an open source data labeling platform with a user-friendly UI and Python client that allows you to incorporate RLHF into your own machine learning workflows. RLHF not only improves accuracy on downstream tasks, but it also increases truthfulness and reduces toxicity at minimal cost.
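The feedback loop at the heart of RLHF can be sketched in a few lines of plain Python. This is an illustrative toy, not the Label Studio API: `collect_feedback`, `preference_pairs`, and the simulated annotator are all names I invented for the steps described above.

```python
# Illustrative sketch of the human-feedback loop behind RLHF.
# In a real workflow, ratings would be collected in a labeling UI
# (e.g., Label Studio) and used to fine-tune a reward model.

def collect_feedback(prompt, candidates, rate):
    """Pair each candidate response with a human rating (1 = good, 0 = bad)."""
    return [(response, rate(prompt, response)) for response in candidates]

def preference_pairs(feedback):
    """Convert ratings into (preferred, rejected) pairs -- the format
    a reward model is typically trained on."""
    good = [r for r, score in feedback if score == 1]
    bad = [r for r, score in feedback if score == 0]
    return [(g, b) for g in good for b in bad]

# Simulated annotator: prefers responses that actually answer the question.
def rate(prompt, response):
    return 1 if "Paris" in response else 0

feedback = collect_feedback(
    "What is the capital of France?",
    ["Paris is the capital of France.", "France is in Europe."],
    rate,
)
pairs = preference_pairs(feedback)
```

The resulting preference pairs are exactly the kind of signal the human-in-the-loop step supplies; the expensive part in practice is the annotation interface and the reward-model fine-tuning, which Label Studio's platform helps with.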
I enjoyed this talk particularly because it demystified the concept of RLHF, a method that played a critical role in the development of ChatGPT. Furthermore, Label Studio demonstrates that RLHF is a powerful and practical open source tool that can be added to your current workflow with ease.
GitHub: heartexlabs/RLHF
Notebook: RLHF_with_Custom_Datasets.ipynb
Incorporating GPT-3 into practical NLP workflows
(Ines Montani, Explosion)
When I first tried out ChatGPT, I seriously wondered how open source NLP libraries could compete with the might of OpenAI. Ines Montani argues that LLMs complement, rather than replace, existing machine learning workflows.
Explosion has released a repository of recipes that enable users to leverage the power of OpenAI models alongside human feedback collected via their enterprise annotation tool, Prodigy. The pipeline works like this:
- Prompt GPT-3.5 (ChatGPT’s base model) with a task.
- Retrieve the response and treat it as a zero- or few-shot classification.
- Have a human decision-maker mark the response as accurate or inaccurate.
- Use the resulting annotations to train or evaluate your task-specific model.
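The four steps above can be sketched as a simple loop. `ask_llm` and `human_review` are stand-ins I invented for illustration: in the real recipes, the model call goes to the OpenAI API and the review happens in Prodigy’s annotation UI.

```python
def ask_llm(text, labels):
    """Stand-in for a zero-shot classification prompt to GPT-3.5."""
    return "positive" if "great" in text.lower() else "negative"

def human_review(text, predicted_label):
    """Stand-in for a human accepting or rejecting the model's
    suggestion in an annotation UI."""
    return True  # the annotator clicks "accept"

def build_annotations(texts, labels):
    """Run the prompt -> predict -> review loop over a batch of texts."""
    annotations = []
    for text in texts:
        predicted = ask_llm(text, labels)
        if human_review(text, predicted):  # keep only accepted examples
            annotations.append({"text": text, "label": predicted})
    return annotations  # ready to train or evaluate a task-specific model

data = build_annotations(
    ["The talk was great!", "The room was overcrowded."],
    labels=["positive", "negative"],
)
```

The key design choice is that the LLM only *proposes* labels; the human decision is what makes it into the training data.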
If I still needed convincing that RLHF was the way of the future, this talk did it. As in the talk I outlined above, Ines demonstrates that incorporating human feedback into NLP workflows yields better performance on downstream tasks than using LLMs in isolation. And given my own experience with ChatGPT, I don’t doubt these claims for a second. While I’ve found it to perform broad tasks well, I would not unequivocally trust ChatGPT or GPT-4 with sensitive tasks that require subject matter expertise.
GitHub: explosion/prodigy-openai-recipes
Methods for Text Style Transfer: Text Detoxification Case
(Daryna Dementieva, Technical University of Munich)
GitHub: dardem/text_detoxification
Publication: ParaDetox: Detoxification with Parallel Data
Global adoption of the internet has provided a platform for individuals to share information, ideas, and opinions with an ever-growing audience. Research conducted in 2020 even found that Facebook’s Feed recommendation algorithm privileged incendiary content because it tended to increase user engagement on the platform. And while hate speech and toxic text detection have been the subject of much research, far less work has been done to actually detoxify such text.
In this talk, Daryna introduces ParaDetox, a novel pipeline that uses Text Style Transfer (TST) to detoxify toxic text, along with the parallel datasets of toxic and detoxified sentences used to build it. Treating detoxification as a seq2seq text generation task, ParaDetox first curates paired datasets of toxic text and detoxified rewrites. These parallel datasets are then used to train a language model that automatically detoxifies its input. ParaDetox models for Russian and English, along with the parallel datasets used to train them, are currently hosted on the Hugging Face Hub.
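To make the parallel-data idea concrete, here is a toy sketch. The sentence pairs are invented examples of the toxic/detoxified format, not actual ParaDetox data, and the dictionary lookup merely stands in for the trained seq2seq model.

```python
# Toy illustration of ParaDetox-style parallel data: each toxic sentence
# is paired with a human-written detoxified rewrite. A seq2seq model
# (e.g., a fine-tuned BART) learns this mapping; the lookup table below
# only stands in for that trained model.

parallel_corpus = [
    ("this idea is stupid", "this idea is not good"),
    ("shut up and listen to me", "please listen to me"),
]

detox_lookup = dict(parallel_corpus)

def detoxify(sentence):
    """Return the detoxified rewrite if known, else the input unchanged."""
    return detox_lookup.get(sentence, sentence)
```

Unlike the lookup table, the real trained model generalizes to unseen sentences, which is the whole point of learning the style transfer rather than memorizing pairs.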
Before generative text models like ChatGPT came into widespread public use, we only had to worry about toxic text produced by humans on the internet. Now we need to worry about toxic, harmful, and hateful text generated by both humans and machines. ParaDetox also takes a creative approach to an age-old problem through its use of parallel corpora. This method is yet another powerful example of leveraging LLMs and human input to create an effective solution to a downstream task.
GitHub: s-nlp/paradetox
Actionable machine learning in the browser with PyScript
(Valerio Maggio, Anaconda)
If you’re used to primarily using Jupyter notebooks for end-to-end Data Science projects, you might be scared of deploying your first web app. PyScript aims to change this, by providing an easy framework for coders of all skill levels to create dynamic Python web applications. According to Valerio, "you can program Python code in the browser with no installation whatsoever."
PyScript is built on top of Pyodide, which makes the full PyData stack (minus a few unsupported modules) immediately available in the browser. Unlike PHP, PyScript is a client-side technology: no server and no installation of any kind is required. It can be used to share interactive dashboards, visualize data, and create client-side Python web apps.
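A minimal page illustrates the idea. This is a sketch assuming the 2023-era CDN URLs on pyscript.net; check the PyScript docs for the current release links.

```html
<!DOCTYPE html>
<html>
  <head>
    <!-- PyScript's stylesheet and runtime, loaded from the CDN -->
    <link rel="stylesheet" href="https://pyscript.net/latest/pyscript.css" />
    <script defer src="https://pyscript.net/latest/pyscript.js"></script>
  </head>
  <body>
    <!-- Python runs client-side via Pyodide; no server required -->
    <py-script>
print("Hello from Python in the browser!")
    </py-script>
  </body>
</html>
```

Open the file in a browser and the Python inside `<py-script>` executes on the spot, which is exactly the "deploying an HTML file" experience described below.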
Although PyScript applications may not be as advanced as those developed with Streamlit or Gradio, they present a user-friendly opportunity for data scientists to familiarize themselves and build confidence with web app deployment. As someone who is ashamedly allergic to programming languages that are not Python or R, PyScript had me at "deployment is as simple as ‘deploying’ an HTML file."
GitHub: pyscript/pyscript
How are we managing? Data teams IRL
(Noa Tamir)
While talking about management is admittedly less sexy than talking about the latest SOTA Machine Learning model, package, or platform, this keynote was well attended, and for good reason. Although the role of data scientist has been around for 15 years now, most of our conferences address the management of processes and platforms but neglect to focus on the management of people.
In this talk, Noa explains that data-driven work is probabilistic, meaning it’s difficult to manage and requires different management techniques than non-data-driven work. The rapid development of new machine learning techniques and technologies presents unique challenges to managers of data teams. And as the field has evolved over the years, so have the roles in data-driven teams. We have a hard time understanding the nuances in job titles and descriptions, which makes it harder to hire the right people for the right role and can also negatively impact job satisfaction and career development for employees.
Good managers can mitigate these consequences by building a shared understanding and communicating specific definitions of data roles with current and potential employees. Managers can also support their employees by helping them develop as either specialists or generalists, both of which bring value to data teams.
Noa’s talk succinctly described the challenges facing data teams in a way that was honest and validating. They gave practical advice for data team managers while also underscoring the fact that we work in a new, rapidly changing field and that we’re all still learning.
Both highly motivating and overwhelming, PyCon DE 2023 was truly an outstanding conference. In addition to selecting captivating and pertinent presentations and workshops, the organizers did a wonderful job of fostering a safe and inclusive atmosphere.
This year, I was relieved to hear that despite all the hype surrounding LLMs, humans still play essential roles in creating valuable data solutions. It’s hard to imagine what the state of AI will look like a year from now, but one thing I do know is that I’ll be right back in Berlin for PyCon DE 2024.
P.S. I’ll include the link to access the recordings of the conference as soon as it’s posted.
If you’d like to stay up-to-date on the latest data science trends, technologies, and packages, consider becoming a Medium member. You’ll get unlimited access to articles and blogs like Towards Data Science and you’ll be supporting my writing. (I earn a small commission for each membership).
Want to connect?
- 📖 Follow me on Medium
- 💌 Subscribe to get an email whenever I publish
- 🖌️ Check out my generative AI blog
- 🔗 Take a look at my portfolio
- 👩‍🏫 I’m also a data science coach!
I’ve also written:
GPT-4 vs. ChatGPT: An Exploration of Training, Performance, Capabilities, and Limitations