
A Gentle Introduction To Generative AI For Beginners

Let's understand the big picture behind generative AI

Image by Susan Cipriano on Pixabay

The last few months have seen the rise of so-called "Generative AI", a subfield of Artificial Intelligence (AI). Tools like ChatGPT are on everyone’s lips and are becoming fundamental for everyday tasks in many jobs (even for learning to code).

Words like "DALL-E", "ChatGPT", and "Generative AI" have pervaded social media, the news, and chats with colleagues over the last few months. Quite literally, everyone is talking about them.

But what is generative AI? And how does it differ from "normal" AI?

In this article, we’ll clarify the big picture behind generative AI. So, if you’ve participated in discussions but don’t have clear ideas on this topic, this article is definitely for you.

This is a discursive explanation to help you understand the basics of what’s behind the scenes of generative AI. So, don’t worry: the focus is on ideas and descriptions, presented in a very short and concise way, with only a few small, illustrative code sketches to make things concrete. In particular, we’ll focus on Large Language Models and Image Generation Models.

Here’s a summary of what you’ll learn here:

Table of Contents:

What is generative AI and how does it differ from traditional AI?
Large Language Models
Image generation

What is generative AI and how does it differ from traditional AI?

Generative AI is a subfield of AI that involves creating algorithms that can generate new data such as images, text, code, and music.

The big difference between generative AI and "traditional AI" is that the former generates new data based on the training data. Also, it works with types of data that "traditional AI" can’t.

Let’s say it a little more technically:

  • "Traditional AI" can be defined as discriminative AI. In this case, in fact, we train Machine Learning models so that they can make predictions or classifications on new, unseen data. These ML models can work only with numbers, and sometimes with text (for example, in the case of Natural Language Processing).
  • In generative AI, we train an ML model and it creates an output that is similar to the data it has been trained on. These kinds of ML models can work with different kinds of data like numbers, text, images, and audio.

Let’s visualize the processes:

The process behind traditional AI. Image by Author.

So, in traditional AI, we train an ML model to learn from data. Then, we feed it new, unseen data, and it can discriminate, making predictions or classifications.

In the example above, we’ve trained an ML model to recognize dogs in images. Then, we feed the trained model new, unseen pictures, and it will be able to classify whether or not these new images represent dogs.

This is the typical task for a Deep Learning algorithm, in the case of a classification problem.
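
To make this concrete, here’s a minimal sketch (in PyTorch) of the discriminative setup just described: a tiny neural network trained to classify images as "dog" or "not dog". The data below is a random stand-in batch, so treat this as an illustration of the pattern rather than a complete training script.

```python
import torch
import torch.nn as nn

class TinyDogClassifier(nn.Module):
    """A deliberately small CNN that outputs one logit: dog vs. not dog."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = TinyDogClassifier()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in data: one batch of random "images" and random 0/1 labels.
dog_loader = [(torch.rand(8, 3, 64, 64), torch.randint(0, 2, (8,)))]

# The model learns to *discriminate*: given an image, predict a label.
for images, labels in dog_loader:
    optimizer.zero_grad()
    logits = model(images).squeeze(1)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
```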

The process behind generative AI. Image by Author.

In the case of generative AI, instead, we train an ML model on a vast quantity of data from various sources. Then, thanks to a prompt (a query in natural language inserted by a user), the model gives us an output that is similar to the data it’s been trained on.

To stick to the example, our model has been trained on a massive amount of (text) data that, among others, explains what a dog is. Then, if a user queries the model asking what a dog is, the model will describe what a dog is in natural language.

This is the typical task performed by tools like ChatGPT.
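
For a hands-on flavor of this, here’s a minimal sketch using the Hugging Face `transformers` library. GPT-2 is used purely as a small, freely downloadable stand-in for the much larger models behind tools like ChatGPT, so don’t expect the same quality of answer:

```python
from transformers import pipeline

# Load a small pre-trained language model for text generation.
generator = pipeline("text-generation", model="gpt2")

# The prompt is a natural-language query; the model continues it.
prompt = "A dog is"
result = generator(prompt, max_new_tokens=40, num_return_sequences=1)
print(result[0]["generated_text"])
```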

Now, let’s see some types of generative AI models.


Large Language Models

Let’s start diving into the various generative AI subfields, beginning with Large Language Models (LLMs). [An LLM is](https://en.wikipedia.org/wiki/Large_language_model#:~:text=A%20large%20language%20model%20(LLM,learning%20or%20semi%2Dsupervised%20learning.) (from Wikipedia):

a computerized language model consisting of an artificial neural network with many parameters (tens of millions to billions), trained on large quantities of unlabeled text using self-supervised learning or semi-supervised learning.

Though the term large language model has no formal definition, it often refers to deep learning models with millions or even billions of parameters, that have been "pre-trained" on a large corpus.

So, LLMs are Deep Learning (DL) models (aka, Neural Networks) with millions or even billions of parameters, trained on a huge amount of text (this is why we call them "large"). They are useful for solving language problems like the following (two of them are sketched in code right after the list):

  • Text classification
  • Question answering
  • Document summarization
  • Text generation
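
As a quick illustration of two of these tasks, here’s a sketch using Hugging Face pipelines. The models downloaded here are the library’s defaults, chosen for convenience rather than as a recommendation:

```python
from transformers import pipeline

# Text classification: the default pre-trained sentiment model.
classifier = pipeline("sentiment-analysis")
print(classifier("I love my dog!"))

# Question answering against a short context passage.
qa = pipeline("question-answering")
context = "Large Language Models are neural networks trained on huge amounts of text."
print(qa(question="What are LLMs trained on?", context=context))
```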

So, another important difference from standard ML models is that, in this case, we can train a single DL model that can be used for different tasks.

Let me explain better.

If we need to develop a system that can recognize dogs in images, as we’ve seen before, we need to train a DL algorithm to solve a classification task, that is: telling us whether new, unseen images represent dogs or not. Nothing more.

Instead, a single trained LLM can help us with all the tasks we’ve described above. This also justifies the amount of computing power (and money!) needed to train an LLM, which learns from an enormous corpus of text.

As we know, LLMs are queried by users thanks to prompts. Now, we have to spot the difference between prompt design and prompt engineering:

  • Prompt design. This is the art of creating a prompt that is suited to the specific task the system should perform. For example, if we want to ask our LLM to translate a text from English to Italian, we have to write a specific prompt in English asking the model to translate the text we’re pasting into Italian.
  • Prompt engineering. This is the process of creating prompts to improve the performance of our LLM. This means using our domain knowledge to add details to the prompt like specific keywords, specific context and examples, and the desired output if necessary.

Of course, when we’re prompting, we sometimes use a mix of both. For example, we may want a translation from English to Italian that concerns a particular domain of knowledge, like mechanics.

So, for example, a prompt may be: "Translate the following into Italian: ‘the beam is subject to normal stress’. Consider that we’re in the field of mechanics, so ‘normal stress’ must be related to it."

Because, you know: "normal" and "stress" may be misunderstood by the model (and even by humans!).
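
If we were building such an engineered prompt programmatically, it might look like the following sketch. Everything here is illustrative: the structure (role, instruction, domain context, input) is the point, not the exact wording.

```python
source_text = "the beam is subject to normal stress."

# Illustrative prompt structure: role, instruction, domain context, input.
prompt = (
    "You are a technical translator specializing in mechanics.\n"
    "Translate the following English sentence into Italian.\n"
    "Note: 'normal stress' is the stress perpendicular to a cross-section, "
    "not everyday psychological stress.\n\n"
    f"Text: {source_text}\n"
    "Italian translation:"
)
print(prompt)
```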

The three types of LLMs

There are three types of LLMs:

  • Generic Language Models. These are able to predict a word (or a phrase) based on the language in the training data. Think, for example, of your email auto-completion feature to understand this type.
  • Instruction Tuned Models. These kinds of models are trained to predict a response to the instructions given in the input. Summarizing a given text is a typical example.
  • Dialog Tuned Models. These are trained to have a dialogue with the user, using their previous responses as context. An AI-powered chatbot is a typical example.

Anyway, consider that the models actually distributed today have mixed features; or, at least, they can perform actions that are typical of more than one of these types.

For example, if we think of ChatGPT we can clearly say that it:

  • Can predict a response to instructions, given an input. For example, it can summarize texts, give insights on a topic we provide via prompts, etc. So, it has the features of an Instruction Tuned Model.
  • Is trained to have a dialog with users. This is very clear, as it works with consecutive prompts until we’re happy with its answer. So, it also has the features of a Dialog Tuned Model (this interaction pattern is sketched right after this list).
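
This dialog pattern is usually implemented by keeping the whole conversation as a list of messages and re-sending it at every turn. Here’s a minimal sketch; `query_model` is a hypothetical stand-in for whatever LLM backend you use (here it just echoes, so the snippet runs on its own):

```python
def query_model(messages):
    # Hypothetical stand-in for a real LLM backend; it just echoes here
    # so that the sketch runs without any external service.
    return f"(model reply to: {messages[-1]['content']})"

# The conversation is a growing list of role/content messages.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a dog?"},
]
messages.append({"role": "assistant", "content": query_model(messages)})

# Each follow-up re-sends the full history, which is what makes the
# exchange feel like a dialog rather than isolated questions.
messages.append({"role": "user", "content": "And how does it differ from a wolf?"})
messages.append({"role": "assistant", "content": query_model(messages)})
print(messages)
```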

Image generation

Image generation has been around for quite some time, contrary to what one might believe.

Anyway, in recent times it has gained popularity, especially with tools like "DALL-E" and "Stable Diffusion", which have made this technology accessible to the masses worldwide.

We can say that image generation can be divided into four categories:

  • Variational Autoencoders (VAEs). Variational autoencoders are "probabilistic generative models that require neural networks as only a part of their overall structure". In operational terms, they encode images to a compressed representation and decode them back to the original size. During this process, they learn the distribution of the data.
  • Generative Adversarial Models (GANs). These are generally the best known, at least as a term that resonates in the field of generative AI. A GAN is "a class of ML framework in which two Neural Networks are pitted against each other where the gain of one is the loss of the other". This means that one Neural Network creates the image while the other predicts whether it is real or fake (the adversarial mechanics are sketched right after this list).
  • Autoregressive models. In statistics, an autoregressive model is the representation of a random process. In the context of image generation, these models create images by treating them as a sequence of pixels.
  • Diffusion models. Diffusion models are inspired by thermodynamics and are definitely the most promising and interesting kind of model in the subfield of image generation.
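
To give a feel for the adversarial idea, here’s a heavily simplified GAN training step in PyTorch. Shapes and data are toy-sized stand-ins (random tensors instead of real images), so this sketches the mechanics rather than a working image generator:

```python
import torch
import torch.nn as nn

# Generator: noise in, (flattened) image out. Discriminator: image in, one logit out.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_images = torch.rand(32, 28 * 28)  # random stand-in for a real batch

# Discriminator step: push real images towards "real", generated ones towards "fake".
fake = G(torch.randn(32, 16)).detach()
loss_d = bce(D(real_images), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: try to make the discriminator label its output as "real" --
# the gain of one network is the loss of the other.
loss_g = bce(D(G(torch.randn(32, 16))), torch.ones(32, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```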

This is the process working under the hood of diffusion models:

  • Forward diffusion process. We have an initial, iterative process where the structure of the image is "destroyed" within a data distribution. In simple words, it is like we iteratively add noise to the image until all the pixels become pure noise and the image is no longer recognizable (by the human eye). A toy version of this forward process is sketched right after this list.
  • Reverse diffusion process. Then there is a reverse diffusion process, which is the actual learning process: it restores the structure of the data. It is like our model learns how to "de-noise" the pixels to recreate the image.
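
Here’s that toy version of the forward process in plain NumPy: we repeatedly mix a (stand-in) image with Gaussian noise until nothing recognizable remains. The constant noise schedule is an arbitrary choice for illustration; real diffusion models use carefully tuned schedules:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))  # random stand-in for a grayscale image
beta = 0.05                   # fraction of noise mixed in at each step

x = image
for t in range(200):
    noise = rng.standard_normal(x.shape)
    # One forward step: keep most of the signal, mix in a little noise.
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * noise
# After enough steps, x is (approximately) pure Gaussian noise. The reverse
# process is what a diffusion model learns: predicting and removing that
# noise step by step, until an image emerges again.
```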

The power of connecting it all

If you’ve kept your attention until now, a question should naturally pop up in your mind: "Ok, Federico, it’s clear. But I’m missing something: when I use DALL-E, I insert a prompt and it outputs an image. We haven’t talked about that, have we?!"

No, we haven’t.

Above, we briefly described the most promising (and currently most used) models for generating images, but the missing part is the prompt.

We’ve discussed, in fact, how they work at a high level. Meaning: we gave a short explanation of how their learning process works.

But the real power of these models arrives when they are coupled with LLMs. This coupling gives us the possibility of using prompt engineering to ask our models for the outputs we want.

In other words: we can use natural language as input to models that can actually understand it and generate images according to it.
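
In practice, this is exactly what libraries like Hugging Face `diffusers` expose: you type a prompt, and a text encoder plus a diffusion model turn it into an image. Here’s a minimal sketch; the model ID is one publicly available example, and a GPU is effectively required:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # one publicly available example model
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The natural-language prompt drives the image the model denoises into existence.
prompt = "a golden retriever puppy playing in the snow, photorealistic"
image = pipe(prompt).images[0]
image.save("dog.png")
```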

Isn’t it a superpower?!?


Conclusions

To conclude, we can say that generative AI is a subfield of AI that generates new data similar to its training data.

While, on the one hand, LLMs can generate text based on their training data and image generation models can generate new images based on their training images, the real power of generative AI, at least in the case of images, lies in the combination of LLMs and image generation models. This gives us the possibility to create images from prompts given as inputs.


NOTE: this article has been freely inspired by the Generative AI course provided by Google, and some references are taken from it. I suggest taking this course, for a better understanding of generative AI.


Federico Trotta

Hi, I’m Federico Trotta and I’m a freelance Technical Writer.

Want to collaborate with me? Contact me.

