The world’s leading publication for data science, AI, and ML professionals.

A Question-Answering Bot Powered by Wikipedia Coupled to GPT-3

Still fascinated by the possibilities offered by GPT-3 and its power, here coupled to Wikipedia

If you follow me, you’ve seen that I’m fascinated with GPT-3, both as a tool for productivity and as a tool for information retrieval through natural questions. You’ve also seen that GPT-3 often provides correct answers to a question, but sometimes it does not, and it can even be misleading or confusing because its answer appears confident despite being wrong. In some cases, though not always, when it cannot find a reasonable completion (i.e. it "doesn’t know" the answer) it tells you so, or it simply doesn’t provide any answer. I showed you that factual accuracy can be improved by fine-tuning the model or, more easily, by few-shot learning. But it isn’t easy to decide what information to use in these procedures, let alone how to apply it. Here I present a rather simple way to enhance your bot with information that it retrieves directly from Wikipedia. As you will see by reading on, it works quite well.

Introduction

GPT-3 is powering many projects that were unthinkable until a year or so ago. Just look here at the articles I wrote presenting various example applications – with the twist that they are all web-based and running on the client, thus easily achieving things as futuristic-looking as having a natural talk with the computer:

All my articles on GPT-3 as of October 2022

Need for more accurate information

Although there’s a good chance that GPT-3 will provide correct answers to a question given the right settings, sometimes it will reply that it doesn’t know, or even not reply at all. Worse, it will often provide incorrect answers that can be very misleading or confusing because they are delivered with seemingly high confidence. This is something that, as we saw, can be corrected with fine-tuning or, more easily, with few-shot learning. But how, exactly?

Over the last month I’ve been experimenting with ways to make "smarter" question-answering bots that can use information from reliable sources.

Why? Because I see a tremendous potential in using GPT-3 as an assistant when doing my research or when studying a new subject for my work as a scientist, and even, why not, as a 24/7 tutor for students, an idea I specifically elaborated upon here.

How? Essentially, by querying Wikipedia for relevant articles and using the retrieved text as a few-shot prompt to which the question is appended.

In some more detail (full details further down in the article), the bot first removes all stopwords from the user’s question (in NLP jargon, stopwords are words that are filtered out before or after text processing because they are of low or no significance); then it queries Wikipedia with the cleaned-up words and gets back a list of articles that presumably contain information about the subject; finally, it appends the original question to each article and has GPT-3 process the result to compute a plausible answer based on that Wikipedia article. There is thus not a single answer but one per Wikipedia article, and each answer includes a link to its article so that the user can consult it in more depth. The tool is not infallible, but it works quite well in the tests I’ve done.

A first example

You can now see the bot in action with an example, and I show another example near the end of the article. And yes, as in all my examples, you can visit and use this web app via a link to my website (recall that you need a free API key from OpenAI to use GPT-3).

In this first example, I asked the bot about the evolution of chloroplasts, the organelles that plants use for photosynthesis, which probably originated when non-photosynthetic cells engulfed photosynthetic cyanobacteria.

The exact question was:

What’s the evolutionary origin of the chloroplast?

You’ll see the bot got the information quite well, especially from the last articles shown in the list:

This is the full output, when a maximum of 20 Wikipedia articles was requested and all the retrieved articles were fed to GPT-3 together with the question:

From Wikipedia article "Chloroplast" (link)

From Wikipedia article "Evolution of photosynthesis" (link) The chloroplast is a organelle that is thought to have originated through endosymbiosis.

From Wikipedia article "Archaeplastida" (link) The chloroplast is a organelle that was acquired by a single endosymbiosis event by feeding on a cyanobacterium.

From Wikipedia article "Euglena viridis" (link) The chloroplast is a organelle that is found in the cells of plants and algae. It is thought to have originated from a symbiotic relationship between a bacterium and an ancestor of eukaryotic cells.

From Wikipedia article "Plant evolution" (link) The chloroplast is a organelle found in plant cells and other eukaryotic organisms that conduct photosynthesis. Chloroplasts capture light energy to convert carbon dioxide and water into organic matter, such as glucose.

From Wikipedia article "Evolutionary history of plants" (link)

From Wikipedia article "History of life" (link)

From Wikipedia article "Mitochondrion" (link)

From Wikipedia article "History of evolutionary thought" (link)

From Wikipedia article "History of Earth" (link)

From Wikipedia article "Species" (link) The chloroplast is a organelle that is found in the cells of plants and algae. It is the site of photosynthesis, where the energy from the sun is used to convert carbon dioxide and water into oxygen and glucose. The chloroplast is thought to have originated from a cyanobacterium that was engulfed by a eukaryotic cell.

From Wikipedia article "Allium" (link) The chloroplast is a type of organelle found in plant cells and other eukaryotic organisms that conduct photosynthesis. Chloroplasts capture light energy from the sun and convert it into chemical energy that can be used by plants to make food. The chloroplast is a descendants of an ancient bacterium that was engulfed by an early ancestor of plants.

From Wikipedia article "Lynn Margulis" (link) Lynn Margulis proposed that chloroplasts were once independent bacteria that were taken up by other cells in a symbiotic relationship. This theory is now widely accepted.

From Wikipedia article "Origin of replication" (link) The chloroplast is a organelle that is found in the cells of plants and algae. It is thought to have originated from a symbiotic relationship between a bacterium and an ancestor of modern plants.

From Wikipedia article "Eukaryote" (link) The chloroplast is a organelle that is found in the cells of plants and algae. It is thought to have originated from a symbiotic relationship between a bacterium and an ancestor of modern plants.

From Wikipedia article "Chloroplast DNA" (link) The chloroplast is a type of organelle found in the cells of plants and algae. Chloroplasts are unique to photosynthetic organisms, and are thought to have arisen through a process of endosymbiosis, in which a primitive prokaryote was taken into the cell of another organism and became established as a symbiont. Over time, the symbiont became more and more integrated into the cell, until it became the chloroplast.

From Wikipedia article "Photosynthesis" (link) The chloroplast is a organelle that is found in the cells of plants and algae. It is the site of photosynthesis, where light energy is converted into chemical energy that can be used by plants to create glucose from carbon dioxide and water. The chloroplast is thought to have originated from a cyanobacterium that was engulfed by a eukaryotic cell, and over time, the two organisms became symbiotic.

From Wikipedia article "Symbiogenesis" (link) Symbiogenesis, endosymbiotic theory, or serial endosymbiotic theory, is the leading evolutionary theory of the origin of eukaryotic cells from prokaryotic organisms. The theory holds that mitochondria, plastids such as chloroplasts, and possibly other organelles of eukaryotic cells are descended from formerly free-living prokaryotes (more closely related to the Bacteria than to the Archaea) taken one inside the other in endosymbiosis.

From Wikipedia article "Hornwort" (link) The chloroplast is a type of plastid, a organelle found in the cells of plants and algae. Plastids are membrane-bound organelles that have a variety of functions, including photosynthesis, storage of nutrients, and synthesis of lipids and other molecules. The chloroplast is unique among plastids in that it contains the pigment chlorophyll, which is used to capture sunlight and convert it into chemical energy that plants can use to grow and thrive. Chloroplasts are thought to have originated from a type of bacteria that was engulfed by an ancestral plant cell. Over time, the

From Wikipedia article "Evolution" (link) The chloroplast is a organelle found in the cells of green plants and algae. It is the site of photosynthesis, the process by which green plants and algae produce their own food. Chloroplasts are thought to have originated from a type of bacteria called cyanobacteria. Cyanobacteria are thought to have been among the first organisms to produce oxygen gas as a by-product of photosynthesis. This oxygen gas would have built up in the atmosphere, eventually leading to the formation of the ozone layer. The ozone layer protects the Earth’s surface from harmful ultraviolet radiation from the Sun.

Important things to note:

  • First, that the bot replies with its own answers generated from each article, but…
  • Second, it does not reply at all when it hasn’t found the answer to the question as posed. This is important because some articles, such as the one titled "History of evolutionary thought", include one of the keywords ("evolutionary") yet do not really relate to the question being asked. Thanks to the power of GPT-3, our bot "understands" this, so it just doesn’t provide any reply.
  • Last, of course one could pool all the answers and then come up with a single paragraph that summarizes all the information. This could be done with GPT-3 itself or with some other program. Here for example I used Quillbot’s summarizing tool for a quick test, and I got this paragraph out:

Chloroplasts are a type of organelle found in the cells of green plants and algae. They capture light energy from the sun and convert it into chemical energy that can be used by plants to make food. The chloroplast is thought to have originated from a symbiotic relationship between a cyanobacterium and an ancestor of modern plants.

As you can see, that makes a concise but informative paragraph that includes the answer to the question. For deeper reads, the reader can follow the links in the article-by-article answers.

To know more about Quillbot, check this out:

Could This Automatic Summarization Tool Revolutionize Writing?

How does this bot work, exactly?

To create plausible answers to the user’s question based on Wikipedia articles, the bot goes through these steps:

  • First, it removes all stopwords from the user’s question. In NLP jargon, the so-called stopwords are words of low or no importance, so they are often removed from texts before processing.
  • Then, the bot queries Wikipedia with the cleaned-up words and thus gets back a list of articles that contain the queried words or some of them, hence presumably relate to the subject of the question.
  • The bot then extracts the first 2000 characters from each article, and on each of these 2000-character long texts…
  • It appends the original question, and sends the resulting string to GPT-3 so that it computes a plausible answer, based on the Wikipedia article that would act as the prompt.
  • Last, the bot extracts the answer sent back by GPT-3 and displays it in the output together with a link to the Wikipedia article fed in each GPT-3 call. In this way the user can consult the article in detail, and hopefully verify or refute the bot’s answer.
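The steps above can be sketched as follows. This is a hypothetical outline, not the app’s actual code: the helper names `removeStopWords`, `searchWikipedia` and `askGPT3` are assumptions standing in for the real functions described below, and they are injected as parameters so the flow can be shown in isolation.

```javascript
// Hypothetical sketch of the bot's pipeline; the three helpers are injected
// so the flow itself can be followed (and tested) without network calls.
async function answerQuestion(question, { removeStopWords, searchWikipedia, askGPT3 }) {
  // 1. Clean the question down to its keywords.
  const keywords = removeStopWords(question);
  // 2. Retrieve candidate articles for those keywords.
  const articles = await searchWikipedia(keywords); // [{ title, pageid, extract }, ...]
  // 3-5. For each article, cap the extract, append the question, and ask GPT-3.
  const answers = [];
  for (const article of articles) {
    const prompt = article.extract.slice(0, 2000) + "\n\n" + question;
    const answer = await askGPT3(prompt);
    // Skip articles for which GPT-3 produced no answer at all.
    if (answer.trim() !== "") {
      answers.push({ title: article.title, pageid: article.pageid, answer });
    }
  }
  return answers;
}
```

Note how an empty completion simply drops the article from the output, which is exactly the behavior observed in the first example above.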

Source code and details

To remove stopwords I used a JavaScript function written by http://geeklad.com using a list of stopwords from http://www.lextek.com/manuals/onix/stopwords1.html. An article describing this function is available at http://geeklad.com/remove-stop-words-in-javascript.
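For illustration only, here is a minimal sketch of what stopword removal looks like; the bot actually uses geeklad’s function with the full lextek list, whereas the tiny word set below is an assumption made just to keep the example short.

```javascript
// Minimal stopword-removal sketch. The tiny STOPWORDS set is illustrative;
// the real bot relies on geeklad's function and the full lextek word list.
const STOPWORDS = new Set(["a", "an", "the", "is", "of", "what", "what's", "how", "in"]);

function removeStopWords(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    // Compare each word without surrounding punctuation against the list.
    .filter((word) => !STOPWORDS.has(word.replace(/[^\w']/g, "")))
    .join(" ");
}
```

Applied to the first example’s question, this would reduce "What's the evolutionary origin of the chloroplast?" to the keywords actually worth sending to Wikipedia’s search endpoint.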

To query Wikipedia I use a typical fetch … then … then chain:

fetch(endpointurl)
  .then(function (resp) {
    return resp.json()
  })
  .then(function (data) {
    //code to analyze each retrieved article
  })

Where endpointurl points to a URL of this type:

https://en.wikipedia.org/w/api.php?action=query&list=search&prop=info&inprop=url&utf8=&format=json&origin=*&srlimit=20&srsearch=${userinput.removeStopWords()}
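One tidy way to assemble that URL is with URLSearchParams, which also takes care of encoding the user’s keywords. This is a sketch under the assumption that the cleaned-up keywords arrive as a plain string; the parameter values mirror the endpoint shown above.

```javascript
// Build the Wikipedia search endpoint URL from the cleaned-up user input.
// srlimit caps how many articles the search returns (20 in the article's example).
function buildSearchURL(keywords, limit = 20) {
  const params = new URLSearchParams({
    action: "query",
    list: "search",
    prop: "info",
    inprop: "url",
    utf8: "",
    format: "json",
    origin: "*",          // required for cross-origin requests from the browser
    srlimit: String(limit),
    srsearch: keywords,   // URLSearchParams handles the encoding
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}
```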

Then, for each object (Wikipedia article) retrieved in the list the bot makes a new call to obtain its full text, using one fetch command for each:

data.query.search.forEach(result => {
  fetch("https://en.wikipedia.org/w/api.php?action=query&pageids=" + result.pageid + "&format=json&origin=*&prop=extracts")
    .then(function (article) {
      return article.json()
    })
    .then(function (dataarticle) {
      pageidnumber = Object.keys(dataarticle.query.pages)[0]
      chatbotprocessinput(stripHTML(dataarticle.query.pages[pageidnumber].extract),
                          dataarticle.query.pages[pageidnumber].pageid,
                          dataarticle.query.pages[pageidnumber].title)
    })
})

Note that the full text of a Wikipedia article is called "extract."

Also, note that this extract is in HTML format, so to remove all HTML tags the bot uses a function called stripHTML that looks like this:

function stripHTML(html){
 let htmltoclean = new DOMParser().parseFromString(html, 'text/html');
 return htmltoclean.body.textContent;
}
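DOMParser is a browser API, so the function above only works client-side, which is where this app runs. If one wanted to reuse the same logic server-side (say, in Node.js without a DOM library), a rough regex-based approximation could look like the sketch below; note that, unlike DOMParser, it does not build a real DOM and so only covers common cases.

```javascript
// Rough server-side approximation of stripHTML: drops tags and decodes a few
// common entities. A real DOM parser remains the more robust choice.
function stripHTMLFallback(html) {
  return html
    .replace(/<[^>]*>/g, "")   // drop all tags
    .replace(/&nbsp;/g, " ")
    .replace(/&amp;/g, "&")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .trim();
}
```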

Finally, the function chatbotprocessinput() is similar to those I’ve described in previous posts to call GPT-3 when some prompt is provided for few-shot learning. Essentially, this function appends the original question to the text coming from the Wikipedia article, and sends this to the GPT-3 API via PHP just as I showed in other articles too.
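The prompt-assembly step inside that function can be sketched as follows. The helper name `buildPrompt`, the `EXTRACT_LIMIT` constant and the blank-line separator are assumptions for illustration; the real chatbotprocessinput() also takes care of the API call itself.

```javascript
// Sketch of how the prompt sent to GPT-3 is assembled: a capped slice of the
// article's extract followed by the user's original question.
const EXTRACT_LIMIT = 2000; // leave room for the question within the ~2048-char URL limit

function buildPrompt(articleText, question) {
  return articleText.slice(0, EXTRACT_LIMIT) + "\n\n" + question;
}
```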

You can see all the code right in the web app I leave as an example.

Testing the bot yourself

You can try the web app right here:

https://lucianoabriata.altervista.org/tests/gpt-3/wikipedia-gpt3-2.html

And you can see its full source code by pressing CTRL+U. It is a bit messy, yes; that’s why I’ve left the description above, including the key parts of the code.

Remember that you’ll need an API key from OpenAI to use GPT-3, and that you can get one with some free tokens from their site.

And an extra cool thing: You’ll see the bot also listens and reads answers out loud. It’s doing this just as I showed you here:

Coupling GPT-3 with speech recognition and synthesis to achieve a fully talking chatbot that runs…

One more example

Finally, here is another example using the method I presented, so you can better grasp how well it works without having to get an API key and try the bot yourself.

Question: how is virtual reality used in chemistry education?

Answers:

From Wikipedia article "List of education journals" (link) Virtual reality can be used to create simulations of chemical reactions and processes that can be used for educational purposes.

From Wikipedia article "Reality" (link) Virtual reality can be used to create simulations of chemical reactions and processes that can be used for educational purposes.

From Wikipedia article "Schell Games" (link) Virtual reality can be used in chemistry education to help students visualize and understand complex chemical reactions. It can also be used to create simulations of dangerous or difficult-to-observe chemical reactions.

From Wikipedia article "Digital library" (link) Virtual reality can be used in chemistry education to help students visualize and understand complex chemical processes. For example, VR can be used to show the three-dimensional structure of molecules, or to simulate chemical reactions. VR can also be used to create virtual laboratories, where students can experiment with chemicals without the need for expensive and dangerous equipment.

From Wikipedia article "Virtual reality" (link) Virtual reality can be used in chemistry education to create immersive environments for students to learn in. For example, students can be placed in a virtual laboratory where they can experiment with chemicals and reactions.

From Wikipedia article "Mixed reality" (link) Virtual reality can be used in chemistry education to help students visualize and understand complex chemical reactions. For example, VR can be used to show the formation of a molecule or the movement of electrons in a chemical reaction. VR can also be used to create virtual laboratories where students can experiment with chemicals without the need for expensive and dangerous equipment.

From Wikipedia article "Augmented reality" (link) Virtual reality can be used to create simulations of chemical reactions that would be too dangerous to perform in a real-world setting. This allows students to learn about the reactants, products, and intermediates involved in a reaction, as well as the kinetics and thermodynamics of the reaction.

From Wikipedia article "Simulation" (link) Virtual reality can be used in chemistry education to create immersive environments for students to explore chemical concepts. For example, students can use virtual reality to visualize the structure of molecules and how they interact with each other. VR can also be used to create simulations of chemical reactions, allowing students to see how different variables affect the outcome of the reaction.

Here several of the answers are very informative and reflect real applications of virtual reality to chemistry education, connecting also to closely related areas and concepts such as augmented reality, simulations, educational games for chemistry, etc.

Limitations and further possibilities to explore

You may have already spotted some limitations. Let’s look at them and discuss how one could possibly improve the bot by addressing them.

Probably the most critical limitation is that the bot only extracts the first 2000 characters from each retrieved Wikipedia article. This is because the amount of text that can be sent as a URL argument in the API call is limited (the URL limit is usually around 2048 characters). By capturing only 2000 characters we leave some room for the question, which is appended to the text extracted from the article.

As a general trend, I’ve found that if an article really deals with the topics asked about in the question, then the introduction will already contain some information that helps create a reasonable answer. However, if the part of the article that contains the information needed to answer the question lies beyond the first 2000 characters, it will be missed and the bot won’t be able to use it. This is probably why in some cases (check my first example) some articles are retrieved that then don’t produce any answers.

One possible workaround would be to extract not the first 2000 characters but rather portions of text, possibly even smaller than 2000 characters, around the keywords. It’s not entirely obvious, however, how to treat multiple occurrences of the keywords, which could again result in a long text that needs to be capped.
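As a sketch of that idea (entirely hypothetical, not part of the current bot): grab a fixed-size window of text around each keyword occurrence, merge overlapping windows so repeated keywords don’t duplicate text, and cap the pooled result. The function name and both parameters are assumptions.

```javascript
// Hypothetical keyword-window extraction: take `radius` characters around each
// keyword hit, merge overlapping windows, and cap the pooled text at `maxLen`.
function extractAroundKeywords(text, keywords, radius = 300, maxLen = 2000) {
  const lower = text.toLowerCase();
  const spans = [];
  for (const kw of keywords) {
    let i = lower.indexOf(kw.toLowerCase());
    while (i !== -1) {
      spans.push([Math.max(0, i - radius), Math.min(text.length, i + kw.length + radius)]);
      i = lower.indexOf(kw.toLowerCase(), i + 1);
    }
  }
  spans.sort((a, b) => a[0] - b[0]);
  // Merge overlapping windows so multiple nearby hits yield one span.
  const merged = [];
  for (const [start, end] of spans) {
    const last = merged[merged.length - 1];
    if (last && start <= last[1]) last[1] = Math.max(last[1], end);
    else merged.push([start, end]);
  }
  return merged.map(([s, e]) => text.slice(s, e)).join(" ... ").slice(0, maxLen);
}
```

The final cap addresses the concern above: even with many keyword hits, the text handed to GPT-3 never exceeds the URL budget.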

Another possible workaround is to first summarize each full article with a technique less powerful than GPT-3 but able to process longer texts, and then append the question to this summary and feed it to GPT-3. This possibility is interesting because one could even build a single summary pooled from all the retrieved articles, then append the question and process this text only once with GPT-3, obtaining a single output. This could also be implemented as an enhancement of my bot as it is now, by pooling together all the partial answers and using that as a new piece of text from which to produce a new, more synthetic answer.

There’s also the possibility of not using article texts at all, but rather some pre-made summaries. Indeed, Wikipedia provides article summaries and snippets for most articles when using its search engine, with on-the-fly updates. Then there’s also Wikidata, a free and open knowledge base that can be read and edited by both humans and computer programs. Besides, many Wikipedia articles have a summarized version that you can access by editing the URL to begin with "simple"; for example, the link https://en.wikipedia.org/wiki/Photosynthesis becomes https://simple.wikipedia.org/wiki/Photosynthesis, which consists of around 15–20% of the material, supposedly curated by humans, i.e. not generated automatically. (Note that the primary goal of Simple Wikipedia is actually to reproduce texts in simplified, more accessible English, not necessarily to shorten the articles, although in practice this ended up happening, as you can verify.)
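Switching a retrieved link to its Simple English counterpart is just a string substitution, as this small sketch shows (the function name is an assumption):

```javascript
// Turn a regular English Wikipedia URL into its Simple English counterpart
// by swapping the "en" subdomain for "simple".
function toSimpleWikipedia(url) {
  return url.replace("https://en.wikipedia.org/", "https://simple.wikipedia.org/");
}
```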

As an alternative to article texts, I tested the snippets that one can programmatically obtain from Wikipedia searches (see example JavaScript code here). This didn’t work very well, probably because snippets are pieces of text too short to be sufficiently informative.

Other desirable yet feasible features? One, which I intend to work on next, would be pooling all the partial answers into a single, definitive answer, which would be more helpful to the user.

Another point, which depends more on how GPT-3 works than on what we can do as programmers using it, is assigning reliability scores to GPT-3’s generations. Related to this, it would be great if GPT-3 could return not only answers but also the portions of the original text from which each answer was generated, as Google’s LaMDA seems capable of doing.

Usability in general? This requires testing, so I invite you to try out the bot and comment on how it worked. Have fun, and I hope this article and the open-source code I left here inspire you to create something even cooler.

Related readings, resources and projects

Language Models are Few-Shot Learners

How to build a Wikipedia Search App with JavaScript

Control web apps via natural language by casting speech to commands with GPT-3

GPT-3-like models with extended training could be the future 24/7 tutors for biology students

Coupling GPT-3 with speech recognition and synthesis to achieve a fully talking chatbot that runs…



www.lucianoabriata.com I write and photoshoot about everything that lies in my broad sphere of interests: nature, science, technology, programming, etc. Become a Medium member to access all its stories (affiliate links of the platform for which I get small revenues without cost to you) and subscribe to get my new stories by email. To consult about small jobs check my services page here. You can contact me here.

