
In this article, I will show how you can structure and explore the content of your own articles using graph technology and some programming.
The idea of using NLP techniques to structure unstructured data is not new; however, the latest progress in LLMs (large language models) has sparked countless opportunities for doing just that. The accessibility for amateurs through the booming technology ChatGPT has created a lot of attention around LLMs and generative models.
In fact, generative AI is on the agenda in many companies already!
The way we will work with the technology in this article is through the programming language Python using OpenAI’s developer API. We will work on data from Medium (meta huh?) and build a knowledge graph. That may sound like a mouthful, but it is actually surprisingly easy to get started with.
Getting started
First things first. The plan of attack is the following.
- Get the API to work and access it through Python.
- Use a sample text to do prompt engineering, ensuring that the GPT-4 model understands what you want from it.
- Download your articles from Medium (you can of course use other pieces of text if you want) and pre-process the data.
- Extract and collect output from ChatGPT.
- Post-process the output from ChatGPT.
- Write code to structure the data further into a graph using the Cypher query language.
- Play around with your new best friend and explore your articles.
Without further ado, let’s get started by quickly setting up the basic tech.
Setup
We need to have the programming language Python and the graph database Neo4j installed on our local computer.
The first thing to do is to ensure that you have a Plus account at OpenAI so that you can use GPT-4. The second thing you should make sure of is that you have signed up for API use. Once that is in place, you need to generate an API key. Then you need to pip install openai.
Before connecting to ChatGPT, let's go to the browser and try to find the right way to ask about this task. This is called prompt engineering, and it is very important to get right. By trying out different phrasings on a random article of mine, I found that the best approach was to provide a detailed and guided prescript before giving it the actual text.
I ended up with the following prescript:
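The exact wording is not reproduced here, but an illustrative reconstruction along these lines captures the idea (the pipe-separated triple format is my assumption, and it is the format the code sketches below will assume):

```
You will be given a piece of text from an article. Extract the entities
(people, mathematical concepts, theorems, and so on) mentioned in it,
together with the relationships between them.

Return one relationship per line, in exactly this format:

entity_1 | relationship | entity_2

Only use entities and relationships that appear explicitly in the text.
Do not invent anything, and do not return duplicates.

Here is the text:
```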

As an example, I gave it a snippet from the article about the Gamma function that I wrote a long time ago:

What it came up with was the following:

Even though it clearly didn't fully understand the task, it did okay, especially with the format. However, it sometimes creates duplicates, and note that it hallucinated some entities and relationships even though we asked it not to. Annoying, disobedient machine! We will deal with this later.
For future use, we will store this prescript in a Python file called prompt_input.py.
Now that the basic setup is in place, let’s test if it actually works.
If the code is only for you and only on your local machine, you can hardcode the API key in the Python file, otherwise you can set it as an environment variable or place it in a config file that you don’t push anywhere!
Let’s test if this setup works. We create a file called connect.py containing the basic connection to ChatGPT from Python.
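Here is a minimal sketch of what connect.py could look like, assuming the pre-1.0 interface of the openai package and an API key stored in an environment variable; the helper name ask_gpt is mine:

```python
# connect.py
import os

import openai

# Read the key from an environment variable instead of hardcoding it.
openai.api_key = os.environ["OPENAI_API_KEY"]


def ask_gpt(prompt: str, model: str = "gpt-4") -> str:
    """Send a single prompt to the chat completion endpoint and return the answer."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # we want deterministic, extraction-style answers
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Quick sanity check that the connection works.
    print(ask_gpt("Say hello in one word."))
```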
We verify that this works!
Data
I need to fetch articles from my Medium account. At the time of writing, I have published 123 articles, but the download feature from Medium returns 259 files! This is because it classifies comments and drafts as posts too. We only want the published articles, but that is not the only problem. The files are HTML files! That is of course great if you want to read them in a browser, but not if you want to work with the pure text.
Well, nice try, Medium, but that can’t stop a data scientist armed with programming languages and dirty tricks!
We also note that the file names of the downloaded files are quite messy. A typical name looks like "2020-12-11_The-Most-Beautiful-Equation-in-the-World-5ab6e49c363.html".
Let’s store these files in a folder called raw.
We write a small module called extract_text_from_html.py with some functionality to extract the text from these files:
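The full module is not shown here, but a sketch using BeautifulSoup (my choice of parser; any HTML parser would do) could look like this:

```python
# extract_text_from_html.py
from bs4 import BeautifulSoup


def extract_text_from_html(path: str) -> str:
    """Read a Medium HTML export and return its visible text."""
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Drop scripts and styles, then collapse the remaining markup to plain text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```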
Before we can use it to actually get results from ChatGPT, we need to be able to split the text up into batches. The reason is that GPT-4 has a token limit. Luckily, this is easy. In a file called preprocess.py, we write:
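A simple sketch of such a batching function, splitting on whitespace with a rough word budget per batch (a crude proxy for tokens; a real implementation could count tokens with a tokenizer instead):

```python
# preprocess.py
from typing import List


def split_into_batches(text: str, max_words: int = 1500) -> List[str]:
    """Split a long text into batches of at most max_words words.

    Words are a crude stand-in for tokens; a budget of 1500 words keeps
    each batch safely below GPT-4's context limit once the prescript is added.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```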
Now we are ready to actually get some data from ChatGPT.
We write a file called process_articles.py where we loop through the articles, retrieve the titles from the frightening file names, extract the actual text from the HTML files, run each batch of text through ChatGPT, collect the results, and save the outputs from the model in new files that we store in a folder called data. We also save the actual texts in a folder called cleaned for later use.
Phew! That was a lot. But actually, the code is simple because we have already done some of the work in other files.
The above code might take a while to execute, as the GPT-4 model is relatively slow compared to the smaller, less capable models available. We make sure to use a caching setup so that if the program crashes, we don't start all over; we just pick up where we left off.
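Put together, a condensed sketch of that processing loop, including the cache check, might look like the following. The folder names are the ones mentioned above, while names such as title_from_filename and the PROMPT constant are illustrative:

```python
# process_articles.py
import os
import re

from connect import ask_gpt
from extract_text_from_html import extract_text_from_html
from preprocess import split_into_batches
from prompt_input import PROMPT

RAW, DATA, CLEANED = "raw", "data", "cleaned"


def title_from_filename(filename: str) -> str:
    """Recover a readable title from a messy Medium export file name."""
    name = os.path.splitext(filename)[0]
    name = re.sub(r"^\d{4}-\d{2}-\d{2}_", "", name)  # drop the date prefix
    name = re.sub(r"-[0-9a-f]{10,}$", "", name)       # drop the trailing hash
    return name.replace("-", " ")


for filename in sorted(os.listdir(RAW)):
    if not filename.endswith(".html"):
        continue
    out_path = os.path.join(DATA, filename.replace(".html", ".txt"))
    if os.path.exists(out_path):  # the caching trick: skip what is already done
        continue

    title = title_from_filename(filename)
    text = extract_text_from_html(os.path.join(RAW, filename))

    # Keep the cleaned text for later validation of the model output.
    with open(os.path.join(CLEANED, filename.replace(".html", ".txt")), "w") as f:
        f.write(text)

    # Run every batch through GPT-4 and store the raw output.
    results = [ask_gpt(PROMPT + batch) for batch in split_into_batches(text)]
    with open(out_path, "w") as f:
        f.write(f"# {title}\n" + "\n".join(results))
```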
Now, several hours of agony later, we have a structured dataset of results from GPT-4. Perfect. Now we "just" need to build a graph from it.
Building the knowledge graph
We will merge the post-processing and the graph creation into one single function. This is normally not very advisable (separation of concerns and all), but because we need to look at the individual relationships and entities during the post-processing anyway, we might as well create the nodes and relationships in the graph while we have our hands dirty.
Let us create a small API containing a driver so we can talk to our graph.
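A sketch of such a small API, assuming a local Neo4j instance on the default bolt port; the module name graph_api.py and the run_query helper are my own choices:

```python
# graph_api.py
import os

from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    auth=(os.environ.get("NEO4J_USER", "neo4j"), os.environ["NEO4J_PASSWORD"]),
)


def run_query(query: str, **params):
    """Run a single Cypher query against the graph and return all records."""
    with driver.session() as session:
        return list(session.run(query, **params))
```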
We need to loop over the results, and along the way we have a few requirements:
- The entities must not be too long, and the results must be cleaned.
- The nodes and relationships must be defined from the output of the GPT model.
- We don't want to call the graph with the same query multiple times.
- The entities must be connected to the original articles.
- Every entity and relationship cooked up by ChatGPT must actually appear in the text, so we don't build a graph of machine dreams!
The last of these requirements raises the probability that we can trust the graph, even though it is not foolproof if you think about it.
No biggie! We write the following:
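The following is a sketch of that function under the assumptions made so far: pipe-separated triples from the prescript, the run_query helper from above, and an illustrative schema with Article and Entity nodes connected by MENTIONED_IN and RELATION relationships. The real schema may of course look different:

```python
# build_graph.py
from graph_api import run_query


def add_article_to_graph(title, triples, article_text):
    """MERGE one article and its (entity, relationship, entity) triples into the graph.

    Only triples whose entities literally occur in the article text are kept,
    so hallucinated output from the model never reaches the graph.
    """
    run_query("MERGE (:Article {title: $title})", title=title)

    seen = set()
    lowered = article_text.lower()
    for head, relation, tail in triples:
        head, tail = head.strip(), tail.strip()
        relation = relation.strip().upper().replace(" ", "_")

        if (head, relation, tail) in seen:    # don't send the same query twice
            continue
        if len(head) > 60 or len(tail) > 60:  # entities should be short
            continue
        if head.lower() not in lowered or tail.lower() not in lowered:
            continue                          # not in the text: a machine dream
        seen.add((head, relation, tail))

        run_query(
            """
            MATCH (a:Article {title: $title})
            MERGE (h:Entity {name: $head})
            MERGE (t:Entity {name: $tail})
            MERGE (h)-[:RELATION {type: $relation}]->(t)
            MERGE (h)-[:MENTIONED_IN]->(a)
            MERGE (t)-[:MENTIONED_IN]->(a)
            """,
            title=title, head=head, tail=tail, relation=relation,
        )
```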
There are of course many ways to create a schema for a knowledge graph. It is not always easy to see what should be nodes and what should be relationships, but since we don't want relationships between relationships, we went for the above.
Moreover, I chose a minimalistic approach for this article. Normally, we would enrich the nodes and relationships with more properties.
Now we just need a main point of entry.
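A minimal sketch of such an entry point, assuming the output-file format produced by the processing sketch above:

```python
# main.py
import os

from build_graph import add_article_to_graph

DATA, CLEANED = "data", "cleaned"

for filename in sorted(os.listdir(DATA)):
    with open(os.path.join(DATA, filename)) as f:
        lines = f.read().splitlines()
    title = lines[0].lstrip("# ").strip()

    # Every remaining line is expected to be an 'entity | relationship | entity' triple.
    triples = [
        tuple(part.strip() for part in line.split("|"))
        for line in lines[1:]
        if line.count("|") == 2
    ]

    with open(os.path.join(CLEANED, filename)) as f:
        article_text = f.read()

    add_article_to_graph(title, triples, article_text)
```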
That is it. Now we have ourselves a knowledge graph containing information from my articles on Medium. In fact, we have about 2000 nodes and 4500 relationships.
Exploring the graph
So what can we do with this thing? What should we ask of it?
Let’s try to find out how many articles the different persons were found in. We have the following:
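With the illustrative schema from above, a query along these lines produces that overview. Since the sketched schema has no separate Person label, we simply list a few mathematicians by name; with a richer schema you would match on the label instead:

```python
from graph_api import run_query

records = run_query("""
    MATCH (e:Entity)-[:MENTIONED_IN]->(a:Article)
    WHERE e.name IN ['Euler', 'Riemann', 'Ramanujan', 'Newton']
    RETURN e.name AS person, count(DISTINCT a) AS articles
    ORDER BY articles DESC
""")
for record in records:
    print(record["person"], record["articles"])
```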

Not surprisingly, Euler tops the list, but I am surprised that Ramanujan and Newton were found in 4 of my articles. I could of course find the titles of the articles if I wanted, but let's move on. Okay, so this was fun, but you don't need a graph to figure this out.
Let’s try something else. Let’s see how many articles mention both Riemann and Euler.
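Again assuming the sketched schema, this boils down to finding articles with MENTIONED_IN relationships from both entities:

```python
from graph_api import run_query

records = run_query("""
    MATCH (:Entity {name: 'Riemann'})-[:MENTIONED_IN]->(a:Article)
          <-[:MENTIONED_IN]-(:Entity {name: 'Euler'})
    RETURN count(DISTINCT a) AS articles
""")
print(records[0]["articles"])
```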

Let’s see how many of Euler’s discoveries my articles mention.
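In the sketched schema, the relationship type lives in a property, and its exact wording ('DISCOVERED', 'DISCOVERY_OF', and so on) depends on what the model returned, so we match it loosely:

```python
from graph_api import run_query

records = run_query("""
    MATCH (:Entity {name: 'Euler'})-[r:RELATION]->(d:Entity)
    WHERE r.type CONTAINS 'DISCOVER'
    RETURN DISTINCT d.name AS discovery
""")
for record in records:
    print(record["discovery"])
```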

Hmm, no Euler line? I have to write an article about that.
Let’s find out how many articles shared some mathematical keyword with the article Group Theory.
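With the sketched schema, this is a matter of hopping from the Group Theory article to its entities and back out to other articles (the exact article title stored in the graph may differ):

```python
from graph_api import run_query

records = run_query("""
    MATCH (a:Article {title: 'Group Theory'})<-[:MENTIONED_IN]-(k:Entity)
          -[:MENTIONED_IN]->(other:Article)
    WHERE other <> a
    RETURN count(DISTINCT other) AS related_articles
""")
print(records[0]["related_articles"])
```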

The result is displayed as 27 other articles, connected through the non-orange nodes in the above image. Even though this is merely a toy example, one could imagine how this could just as well show how business documents are related by various sensitive keywords important within disciplines such as GDPR or audit.
Takeaways
Obviously, this work should be seen as what we call a "proof of concept". We can't use my articles for anything really, but if this had been texts from a company, containing information about its customers and employees across emails, Word files, PDFs, and so on, it could be used to map out how customers are related and which employees work closely together.
This in turn would give us a 360-degree view of how data flows through the organization as a whole: who is the most important person for a specific type of information flow, who is the right one to reach out to if you want to know about a specific topic or document, whether the client we contacted in our department was once contacted by another department, and so on.
Extremely valuable information. Of course, we can't use ChatGPT for this, because we don't know what happens to the data we send it. It is therefore not a good idea to ask it about sensitive or business-critical information. What we need to do is download another LLM (large language model) that lives only on our own machines: a local LLM. We can even fine-tune it on our own data. Many companies are already doing this to build chatbots, assistants, and so on.
But using it to build a knowledge graph over your unstructured data is next-level if you ask me, and I think I have shown that it is more than doable!
If your company wants to know how we can use your data to weave a spiderweb of possibilities for your business, then please reach out to me or my colleague Kenneth Nielsen.
Thank you for reading.