
GPT-3-like models with extended training could be the future 24/7 tutors for biology students

Despite rather discouraging results for other subjects, GPT-3 turns out quite good at biology, even more so if its training is extended…

SERIES ON USING GPT-3-BASED SOFTWARE TO ASSIST STEM EDUCATION

Jump to: Introduction | Main examination | First analysis and deeper investigations by extending training | Discussion | Conclusion

Introduction

In my recent tests of GPT-3’s potential to assist students with their science subjects, I found little or no applicability for chemistry and physics beyond some (questionable) potential as a tool to more naturally retrieve theoretical concepts that the system has "learned" from the web.

Devising tests to measure GPT-3’s knowledge of the basic sciences

Testing GPT-3 on Elementary Physics Unveils Some Important Problems

Biology, at least at the high school and early university levels, is highly descriptive and thus could be a natural niche for GPT-3 and other NLP models. In this article I explore precisely this point, following the same procedure as in my previous articles: I treated GPT-3’s DaVinci model in OpenAI’s playground as if it were a student, and I evaluated it on a series of concepts through an exam-like chat.
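For readers who want to reproduce this kind of exam outside the playground, here is a minimal sketch of how the chat can be laid out as a prompt: a running transcript of Q:/A: lines that ends with an open "A:" for the model to complete. The function name `format_exam_prompt` is hypothetical, and no model is called here; the sketch only builds the text.

```python
from typing import List, Tuple


def format_exam_prompt(history: List[Tuple[str, str]], question: str) -> str:
    """Render earlier (question, answer) pairs plus a new question as Q:/A: text."""
    lines = []
    for past_question, past_answer in history:
        lines.append(f"Q: {past_question}")
        lines.append(f"A: {past_answer}")
    # End with an open "A:" so a completion model continues with the answer.
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


# Hypothetical usage with the first exchange of the exam.
history = [("Hello, let's discuss some topics about biology.",
            "Sure. What would you like to know?")]
prompt = format_exam_prompt(history, "What functions do mitochondria carry out?")
print(prompt)
```

Keeping the whole history in the prompt is what lets later questions ("And how do they produce energy?") refer back to earlier answers.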

Of course GPT-3 lacks any thinking capabilities, but I argued in the article below that it might still be possible to use it as an advanced resource for information retrieval through natural language:

A Note on GPT-3 and Its (Obviously Null) "Thinking" Capabilities

Main examination

Let’s first see the main exam I subjected DaVinci to. In the text below, each line starting with Q is the question I asked, and A is DaVinci’s answer (run with default settings). I placed all the chat in quotation style, and I comment on the chat as it goes in bold format:

Q: Hello, let’s discuss some topics about biology. A: Sure. What would you like to know?

Q: Let’s first see what you know about general biology. How can you divide organisms according to their cellular types? A: There are three types of cells: prokaryotic cells, eukaryotic cells, and archaeon cells.

Strictly speaking, what GPT-3 says here is wrong, because it says "there are three types of cells". But considered within the context of the actual question ("how can you divide organisms according to their cellular types", not "how many cell types are there?"), the answer is correct. GPT-3 has answered my question at the domain level, where you have bacteria and archaea (two domains of life, both consisting of single prokaryotic cells), plus eukaryotes, which consist of one or more eukaryotic cells.

The answer as given is thus not the best for a student, but I admit the question itself was not the clearest. I think a student could have been confused too. Let’s move forward.

Q: What are the similarities and differences between prokaryotic cells, eukaryotic cells, and archaeon cells? A: Prokaryotic cells are the simplest cells, and they do not have a nucleus or other organelles. Eukaryotic cells are more complex, and they have a nucleus and other organelles. Archaeon cells are similar to prokaryotic cells, but they have a different type of cell wall.

All this is correct, and is consistent with having separated cells in three main types above.

Q: Can you compare a bit more the prokaryotic and eukaryotic cells? A: Prokaryotic cells are generally smaller than eukaryotic cells, and they do not have a nucleus or other organelles. Eukaryotic cells are more complex, and they have a nucleus and other organelles.

Here I tried to guide it to elaborate specifically on prokaryotic vs. eukaryotic cells, and it did well. Since it mentioned organelles, this is a good point to start digging a bit into them:

Q: Good. What kinds of organelles can you find inside eukaryotic cells? A: Eukaryotic cells have a nucleus and other organelles, including mitochondria, ribosomes, and Golgi bodies.

Q: What functions do mitochondria carry out? A: Mitochondria are responsible for producing energy in the cell.

Q: And how do they produce energy? A: Mitochondria use oxygen to break down glucose and other molecules to produce energy.

Q: What’s the name of that process? A: The process is called cellular respiration.

Q: Good, and which is the main respiratory enzyme? A: The main respiratory enzyme is called cytochrome c oxidase.

Q: Good. What reaction does cytochrome c oxidase catalyze? A: Cytochrome c oxidase catalyzes the reaction of oxygen with glucose to produce water and carbon dioxide.

Up to this point GPT-3 has answered all these questions remarkably well, at a level that could well correspond to high school biology. I now try to go deeper:

Q: Where does it take electrons from? A: The electrons come from the glucose molecule.

Well one of the intermediate sources is glucose, so this is right. But I was expecting GPT-3 to tell me the electrons come from the Krebs (a.k.a. citric acid) cycle, or NAD molecules, or the electron transport chain, etc. So I asked a bit more:

Q: But how are the electrons extracted from glucose? A: The electrons are extracted from glucose by the cytochrome c molecule.

This is wrong. Cytochrome c is the immediate electron donor to cytochrome c oxidase, but it does not itself take the electrons directly from glucose.

Q: Not exactly. What I mean is, how is glucose degraded inside cells? A: Glucose is degraded inside cells by the process of glycolysis.

Correct, by definition.

Q: Right, what’s the product of glycolysis? A: The product of glycolysis is two molecules of pyruvate.

Perfect, so let’s dig a bit deeper:

Q: And what happens to pyruvate? A: Pyruvate is converted into acetyl CoA, which is then used in the citric acid cycle.

Q: What is the citric acid cycle? A: The citric acid cycle is a series of reactions that convert acetyl CoA into carbon dioxide and water.

Q: What happens to electrons during the citric acid cycle? A: The electrons are transferred to NAD+ and FAD to produce energy.

Q: Where do these electrons in NAD and FAD go? A: The electrons in NAD and FAD go to the electron transport chain.

This section was perfect.

Q: And where is the respiratory enzyme? A: The respiratory enzyme is called cytochrome c oxidase.

That’s not what I asked. Let’s see if I can guide it:

Q: Is it part of the electron transport chain? A: No, the electron transport chain is a separate chain of proteins.

Well this is wrong, because it’s actually considered the end of the electron transport chain. I think I’ve gone as deep as GPT-3 can correctly answer.

Q: OK let’s check out something else. Do animals and plants contain eukaryotic or prokaryotic cells? A: Animals and plants contain eukaryotic cells.

Q: So you mean plants are eukaryotic. A: Yes, plants are eukaryotic.

Q: Do animal and plant cells contain the same kinds of organelles? A: No, animal and plant cells contain different kinds of organelles.

Q: Explain more A: Animal cells contain organelles such as mitochondria and ribosomes, while plant cells contain organelles such as chloroplasts and cell walls.

All good. Now, since it mentioned chloroplasts, let’s delve into them:

Q: What are chloroplasts? A: Chloroplasts are organelles in plant cells that are responsible for photosynthesis.

Q: What is photosynthesis? A: Photosynthesis is the process of converting light energy into chemical energy.

Q: Please tell me more about how photosynthesis works. A: In photosynthesis, light energy is converted into glucose and other molecules.

Q: How? A: Chloroplasts use light energy to convert carbon dioxide and water into glucose and other molecules.

All this section was great. Let’s test it a bit more.

Q: Great. In what compartment does all this happen? A: Chloroplasts are located in the chloroplast membrane.

Q: In any special kind of cell? A: No, chloroplasts are located in the chloroplast membrane in all plant cells.

Well I wouldn’t say this. Let’s see if I can guide it a bit:

Q: But do non-green parts of cells also do photosynthesis? A: No, only green parts of cells can do photosynthesis.

Good that it could "correct" itself, even if it was just chance.

Q: Ah ok, good. And why are these parts green? A: The green parts of cells contain chlorophyll, which absorbs light energy.

Q: What is chlorophyll? A: Chlorophyll is a molecule that absorbs light energy and uses it to produce glucose.

Although it is involved in photosynthesis, saying that chlorophyll kind of directly uses light to produce glucose is too much, definitely confusing.

Q: And in what subcellular structures does photosynthesis take place? A: Photosynthesis takes place in the chloroplasts of plant cells.

Q: Any more particular location? A: No, photosynthesis takes place in the chloroplasts of plant cells.

I wanted to know a more precise location, so I kept asking:

Q: But where are the proteins that absorb light for photosynthesis? A: The proteins that absorb light for photosynthesis are located in the thylakoid membrane of the chloroplast.

Good. Now I move to ecology, just to probe GPT-3 on another topic.

Q: OK good. Now let’s change topic. What is ecology? A: Ecology is the study of the relationships between organisms and their environment.

Q: What kinds of factors do you have in an ecological system? A: In an ecological system, you have abiotic factors and biotic factors.

Q: What are biotic and abiotic factors? A: Biotic factors are the living components of an ecosystem, while abiotic factors are the non-living components of an ecosystem.

Q: OK give me examples of abiotic factors, please. A: Abiotic factors can include things like water, light, temperature, and soil.

Q: Would animals be biotic or abiotic? A: Animals would be biotic.

Good, now let me see if I can confuse it:

Q: And what about bacteria, plants and rocks? A: Bacteria, plants, and rocks would be biotic.

Well, it got the rocks wrong. Let’s ask about them individually, using related words too:

Q: Rocks are biotic? A: Yes, rocks are biotic.

Q: Are stones biotic or abiotic? A: Stones are abiotic.

Q: Are rocks biotic or abiotic? A: Rocks are abiotic.

Asked in this way it now got it right. But you see how sure it was above when it said rocks are biotic. Not good for students.
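This kind of flip-flopping can be checked systematically by asking the same thing in several phrasings and comparing the answers. The sketch below is only an illustration under stated assumptions: `ask` is a hypothetical callable standing in for the model, and here it simply replays two answers recorded in the chat above rather than querying GPT-3.

```python
import re
from typing import Callable, Dict, List


def extract_label(answer: str) -> str:
    """Return 'biotic' or 'abiotic' as stated in the answer, else 'unknown'."""
    # Word boundaries keep 'abiotic' from being counted as 'biotic'.
    found = re.findall(r"\b(abiotic|biotic)\b", answer.lower())
    return found[0] if found else "unknown"


def check_consistency(ask: Callable[[str], str],
                      questions: List[str]) -> Dict[str, List[str]]:
    """Group rephrased questions by the label the model gave for each."""
    by_label: Dict[str, List[str]] = {}
    for question in questions:
        by_label.setdefault(extract_label(ask(question)), []).append(question)
    return by_label


# Stub that replays recorded answers; a real run would call the model instead.
recorded = {
    "Rocks are biotic?": "Yes, rocks are biotic.",
    "Are rocks biotic or abiotic?": "Rocks are abiotic.",
}
result = check_consistency(recorded.__getitem__, list(recorded))
consistent = len(result) == 1  # more than one label means contradictory answers
print(result, consistent)
```

A tutoring tool could use a check like this to withhold or flag answers on which the model contradicts itself.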

I asked a couple more, and it got them right:

Q: Are minerals biotic or abiotic? A: Minerals are abiotic.

Q: Are plants biotic or abiotic? A: Plants are biotic.

Moving on to something different yet about ecology:

Q: OK let’s see something else, within the scope of ecology. What kinds of relationships can be established between biotic factors? A: There are three types of relationships that can be established between biotic factors: competition, predation, and symbiosis.

Q: What is competition, what is predation, and what is symbiosis? A: Competition is the relationship between two or more organisms where each organism tries to use the same resources as the other. Predation is the relationship between an organism and its prey, where the predator eats the prey. Symbiosis is the relationship between two or more organisms where both organisms benefit from the relationship.

Q: Can you mention one example of competition? A: One example of competition is the relationship between two plants where each plant tries to use the same resources as the other.

Q: An example of predation? A: An example of predation is the relationship between a lion and a zebra, where the lion eats the zebra.

Q: And what’s one example of symbiosis? A: One example of symbiosis is the relationship between a fungus and a tree, where the fungus helps the tree to absorb nutrients from the soil.

Another perfect block. Let’s ask around a bit more:

Q: OK, this brings me to another question: what is a lychen? A: A lychen is a type of fungus that lives in symbiosis with a tree.

Not the best definition. Let’s dig a bit deeper:

Q: Any other kinds of lychens? A: No, there are only three types of lychens: lichens, mycorrhizal fungi, and endophytes.

I’m not an expert in lichens, but this answer is wrong.

Let’s move on once more to something different:

Q: OK now something different, back to cells and organelles. Is there any theory about how eukaryotic cells emerged? A: There is a theory about how eukaryotic cells emerged called the endosymbiotic theory.

Q: What does it say? A: The endosymbiotic theory says that eukaryotic cells evolved from prokaryotic cells that engulfed other prokaryotic cells.

Q: Any specific example of that? A: One example of the endosymbiotic theory is the origin of mitochondria. Mitochondria are organelles in eukaryotic cells that produce energy. It is believed that mitochondria evolved from prokaryotic cells that were engulfed by another prokaryotic cell.

Perfect. Let’s try to dig deeper to know what prokaryotes are suspected to have become mitochondria.

Q: What prokaryotic cell? A: It is believed that the prokaryotic cell that engulfed the other prokaryotic cell was the ancestor of the eukaryotic cell.

A rather obvious, circular answer, and not informative. It is also not what I asked, but my question was a bit short, so let me be clearer:

Q: Remember what kind of prokaryotic cell gave place to the mitochondrion? A: The prokaryotic cell that engulfed the other prokaryotic cell was the ancestor of the eukaryotic cell.

Still a circular answer. Let’s leave mitochondria.

Q: Any other example? A: Another example of the endosymbiotic theory is the origin of chloroplasts. Chloroplasts are organelles in eukaryotic cells that produce energy from sunlight. It is believed that chloroplasts evolved from prokaryotic cells that were engulfed by another prokaryotic cell.

Q: You mean like cyanobacteria? A: Cyanobacteria are prokaryotic cells that produce energy from sunlight. It is believed that chloroplasts evolved from cyanobacteria that were engulfed by another prokaryotic cell.

Kind of ok, but I helped it.

First analysis and deeper investigations by extending training

I am astonished. Although there were some incorrect and some only partially correct answers, I’m surprised by the high quality and variety of "knowledge" that GPT-3 has learned. In particular, the answers were quite rich in examples, most of them correct. All this is in line with my conclusions in previous articles: GPT-3 is quite good at retrieving factual data that it has seen during training.

Although it would be risky to use GPT-3 "as is" as an unsupervised 24/7 consultant for students learning biology, it is clear that the performance in this discipline is superior to that in physics or chemistry. Now the good part is that the wrong answers, such as saying first that rocks are biotic or pointing at three main types of cells, are all curable. It is quite possible that further training of the system can correct these problems. See these further tests where I first teach the system some information, and then ask:

Test on teaching and then asking

Just like before, if I ask bluntly about the types of cells, the system returns an incorrect answer:

Q: What kinds of cells are there? A: There are three types of cells: prokaryotic cells, eukaryotic cells, and plant cells.

But if I tell it first in one paragraph that there are two main types of cells, and I then ask about this, the system gives the expected answer:

Cells are of two main types: prokaryotic and eukaryotic.

Q: What kinds of cells are there? A: There are two main types of cells: prokaryotic and eukaryotic.

Note that this works even if the question is not posed in the same way as the information was given: "Cells are of two main types" vs. "What kinds of cells are there?"
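In practice, "telling it first" amounts to placing the facts ahead of the question in the same prompt. A minimal sketch of that layout follows; the helper name `prompt_with_facts` is hypothetical and the code only assembles the text, without calling any model.

```python
def prompt_with_facts(facts: str, question: str) -> str:
    """Place curated facts ahead of the question in a single prompt."""
    # A blank line separates the facts from the Q:/A: part, as in the tests here.
    return f"{facts.strip()}\n\nQ: {question.strip()}\nA:"


facts = "Cells are of two main types: prokaryotic and eukaryotic."
print(prompt_with_facts(facts, "What kinds of cells are there?"))
```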

Besides, it’s interesting that this still works even if I give the system more information in-between the sentence with the right answer and the actual question:

Cells are of two main types: prokaryotic and eukaryotic. Eukaryotic cells contain organelles such as nucleus, Golgi apparattus, etc. Plants also have chloroplasts. Interestingly, mitochondria and choloroplasts seem to have evolved after some early eukaryotic cells engulfed bacteria. Rickettsia probably gave place to modern mitochondria, while certain species of cyanobacteria probably became the modern chloroplasts.

Q: What kinds of cells are there? A: There are two types of cells: prokaryotic and eukaryotic.

Note that if I remove the first sentence, then GPT-3’s answer is wrong again:

Eukaryotic cells contain organelles such as nucleus, Golgi apparattus, etc. Plants also have chloroplasts. Interestingly, mitochondria and choloroplasts seem to have evolved after some early eukaryotic cells engulfed bacteria. Rickettsia probably gave place to modern mitochondria, while certain species of cyanobacteria probably became the modern chloroplasts.

Q: What kinds of cells are there? A: There are three types of cells: prokaryotic cells, eukaryotic cells, and plant cells.

Now get ready for this. It’s jaw-dropping. I asked again about mitochondria and their origin, with this paragraph I fed the system, and this is what I got:

Cells are of two main types: prokaryotic and eukaryotic. Eukaryotic cells contain organelles such as nucleus, Golgi apparattus, etc. Plants also have chloroplasts. Interestingly, mitochondria and choloroplasts seem to have evolved after some early eukaryotic cells engulfed bacteria. Rickettsia probably gave place to modern mitochondria, while certain species of cyanobacteria probably became the modern chloroplasts.

Q: What kinds of cells are there? A: There are two main types of cells: prokaryotic and eukaryotic.

Q: What are mitochondria? A: Mitochondria are organelles found in eukaryotic cells. They are responsible for producing energy in the form of ATP.

Q: Where do mitochondria come from? A: Mitochondria are thought to have evolved from bacteria that were engulfed by early eukaryotic cells.

Q: What bacteria? A: It is thought that Rickettsia bacteria gave rise to modern mitochondria, while certain species of cyanobacteria became the modern chloroplasts.

I didn’t explicitly tell GPT-3 that rickettsia are bacteria (although it can be derived from my text as written), yet the system came up with this sentence which is perfect and now gives the right, complete answer I was trying to pull out from it in the exam!

Now if I remove the crucial piece of information about rickettsia being the purported precursors of mitochondria, then the answer to that last question turns out wrong:

Cells are of two main types: prokaryotic and eukaryotic. Eukaryotic cells contain organelles such as nucleus, Golgi apparattus, etc. Plants also have chloroplasts. Interestingly, mitochondria and choloroplasts seem to have evolved after some early eukaryotic cells engulfed bacteria. For example, certain species of cyanobacteria probably became the modern chloroplasts.

Q: What kinds of cells are there? A: There are two main types of cells: prokaryotic and eukaryotic.

Q: What are mitochondria? A: Mitochondria are organelles found in eukaryotic cells. They are responsible for producing energy in the form of ATP.

Q: Where do mitochondria come from? A: Mitochondria are thought to have evolved from bacteria that were engulfed by early eukaryotic cells.

Q: What bacteria? A: It is thought that certain species of cyanobacteria became the modern mitochondria.

That last answer is wrong.
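The two removal tests above were done by hand, one sentence at a time. The same idea can be generalized by generating one prompt per left-out sentence, so that each fact's effect on the answer can be checked in turn. This is a sketch under assumptions: the sentence splitter is deliberately naive, the function names are hypothetical, the context is shortened for illustration, and the prompts would still have to be sent to the model separately.

```python
import re
from typing import Iterator, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def leave_one_out_prompts(context: str, question: str) -> Iterator[Tuple[str, str]]:
    """Yield (removed_sentence, prompt) pairs, each prompt lacking one sentence."""
    sentences = split_sentences(context)
    for i, removed in enumerate(sentences):
        kept = " ".join(sentences[:i] + sentences[i + 1:])
        yield removed, f"{kept}\n\nQ: {question}\nA:"


context = ("Cells are of two main types: prokaryotic and eukaryotic. "
           "Plants also have chloroplasts. "
           "Rickettsia probably gave place to modern mitochondria.")
variants = list(leave_one_out_prompts(context, "What bacteria?"))
for removed, prompt in variants:
    print(f"--- without: {removed}\n{prompt}\n")
```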

Discussion

As summarized in the opening paragraphs of the previous section, the exam I gave to GPT-3’s DaVinci revealed that it’s quite good at answering factual questions on biology, and its answers are even quite rich in examples. Yet it fails at times, and what’s worse, all answers appear to be given with high confidence, a critical flaw for applications in supporting students.

Now, finding that you can train the system with curated facts is a game-changer. One could in principle feed GPT-3 a curated corpus that helps correct all these kinds of problems, and maybe even fill gaps in its content. Let me show you one final example of this, pushing this kind of training to the limit:

If I ask GPT-3’s DaVinci this question, highly specific to phosphate lipid biology:

Q: Which are the cellular receptors for inositol pyrophosphate?

The run just hangs and throws out an error asking me to rephrase the input prompt.

But if I input (here just a part of) the introduction of a paper about that specific topic, and I then ask, I get the information retrieved correctly:

Two mechanisms were put forward regarding the regulation of the SPX–PHR interaction in response to changes in nutrient availability: SPX domains were proposed to act as direct Pi sensors, with the SPX–PHR interaction occurring in the presence of millimolar concentrations of Pi. Alternatively, the integrity of the SPX–PHR complex could be regulated by protein degradation. Indeed, SPX degradation via the 26S proteasome is increased under Pi starvation. Fungal, plant, and human SPX domains have been independently characterized as cellular receptors for inositol pyrophosphates (PP-InsPs), which bind SPX domains with high affinity and selectivity. PP-InsPs consist of a fully phosphorylated myo-inositol ring, carrying one or two pyrophosphate groups at the C1 and/or C5 position, respectively. In plants, inositol 1,3,4-trisphosphate 5/6-kinase catalyzes the phosphorylation of phytic acid (InsP6) to 5PP-InsP5 (InsP7 hereafter). The diphosphoinositol pentakisphosphate kinases VIH1 and VIH2 then generate 1,5(PP)2-InsP4 (InsP8 hereafter) from InsP7. Plant diphosphoinositol pentakisphosphate kinases have been genetically characterized to play a role in jasmonate perception and plant defense responses and, importantly, in nutrient sensing in Chlamydomonas and Arabidopsis.

Q: Which are cellular receptors for inositol pyrophosphates? A: Fungal, plant, and human SPX domains are cellular receptors for inositol pyrophosphates.

Note that I didn’t phrase the question exactly as its answer is phrased inside the text (in bold), yet the system could find it.

Moreover, I can keep asking more questions:

(same introductory text as above)

Q: What enzyme phosphorylates phytic acid? A: In plants, inositol 1,3,4-trisphosphate 5/6-kinase catalyzes the phosphorylation of phytic acid (InsP6) to 5PP-InsP5 (InsP7 hereafter).

And note that again the question is not exactly as the answer is phrased.
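One simple way to check that such an answer really comes from the supplied text is to look for the sentence of the text that shares the most words with it. The sketch below does this with plain word overlap; it is a rough heuristic and an assumption of this illustration, not something the playground does, and `best_supporting_sentence` is a hypothetical name. The context is shortened to two sentences of the introduction quoted above.

```python
import re
from typing import Set, Tuple


def words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_supporting_sentence(context: str, answer: str) -> Tuple[str, float]:
    """Return the context sentence sharing most words with the answer, and
    the fraction of the answer's words that it covers."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", context.strip()) if s]
    answer_words = words(answer)
    scored = [(s, len(words(s) & answer_words) / max(len(answer_words), 1))
              for s in sentences]
    return max(scored, key=lambda pair: pair[1])


context = ("Indeed, SPX degradation via the 26S proteasome is increased under Pi "
           "starvation. Fungal, plant, and human SPX domains have been independently "
           "characterized as cellular receptors for inositol pyrophosphates (PP-InsPs), "
           "which bind SPX domains with high affinity and selectivity.")
answer = ("Fungal, plant, and human SPX domains are cellular receptors "
          "for inositol pyrophosphates.")
sentence, coverage = best_supporting_sentence(context, answer)
print(round(coverage, 2), sentence)
```

A low coverage score would suggest that the answer was not taken from the supplied text and deserves a closer look.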

Clearly, training GPT-3 to extend and correct its knowledge base works. I will investigate this in more detail in a future article; it would also be great to hear about this from OpenAI and NLP experts.

Conclusion

Overall, I suspect that GPT-3 has true potential to work as a 24/7 tutor for biology, if trained with a reliable corpus of data. Possible sources of such data could include material prepared by expert teachers, the text from academic reviews and papers as exemplified above, and perhaps even the full Open Access corpus of peer-reviewed scientific material on biology, such as full free articles from PubMed or at least its full record of abstracts. Naturally, such retraining would not be trivial given the large volumes of data. But since that volume is comparable to what was already used to pre-train the system, the idea does not sound too far-fetched.


