Background – The Need for Efficient Analytics
In my opinion, analytics is one of the toughest arenas to operate in because of the sheer volume of ad hoc requests. Typically, a request involves writing a SQL query or conducting some analysis in a spreadsheet, and it ends up taking longer than anticipated. As a result, analytics teams spend most of their time firefighting and building tactical solutions, with little chance to be proactive.
I have often pondered the idea of an AI assistant that could manage ad hoc analytics requests, much like the chatbots now ubiquitous in customer service. It always felt rather far-fetched given the complexity of some analytics queries. With recent advances in generative AI, however, we are at a stage where automating mundane, ad hoc requests is feasible. In this article, I present a prototype analytics bot, evaluate its performance on some "typical" analytics requests, and briefly discuss the implications for commercial analytics.
AI-Powered Data Analyst
The analytics bot serves as an AI data analyst behind a chat-like interface. Anyone with an ad hoc analytics request can pose questions to the bot just as they would to a data analyst. The bot interprets a natural-language question, converts it into SQL, runs the query against the underlying data, and returns an answer in natural language. The bot is linked at the end of the article; you will need an OpenAI API key and a data source URL to try it.
Some Technical Bits – How to Build Your Own
Building your own prototype analytics bot takes only a few lines of Python. The core functionality centres on LangChain and OpenAI (although you can swap in any large language model). If you are using OpenAI, you will need an API key, for which you can register here. The bot uses LangChain's SQLDatabaseChain, which connects to SQL databases via SQLAlchemy, so it works with any SQL dialect SQLAlchemy supports. You can construct a basic bot in a Jupyter notebook with the following code:
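(What follows is a minimal sketch rather than the bot's exact source: it loads the CSV into an in-memory SQLite database so that SQLDatabaseChain can query it, the API key is a placeholder, and the import paths reflect the LangChain releases current at the time of writing – newer versions move SQLDatabaseChain into langchain_experimental.)

```python
import pandas as pd
from sqlalchemy import create_engine
from langchain import OpenAI, SQLDatabase, SQLDatabaseChain

# Load the CSV from its raw GitHub link into an in-memory SQLite database
url = "https://raw.githubusercontent.com/john-adeojo/open_datasets/main/CO2%20Emissions_Canada.csv"
engine = create_engine("sqlite://")
pd.read_csv(url).to_sql("co2_emissions", engine, index=False)

# Point SQLDatabaseChain at the database and an OpenAI model
db = SQLDatabase(engine)
llm = OpenAI(temperature=0, openai_api_key="YOUR_API_KEY")  # placeholder key
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

# Ask in natural language; the chain writes the SQL and phrases the answer
db_chain.run("Which vehicle has the highest CO2 emissions on average?")
```

Setting the temperature to 0 keeps the generated SQL as deterministic as possible, which helps the same question produce the same query each time.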
Testing the Analytics Bot
I tested the analytics bot on a CO2 emissions dataset sourced from Kaggle, which records emissions, engine size, and fuel consumption figures for a range of vehicles.
Please note the data is available for both commercial and non-commercial purposes under the Open Data Commons License.
The test comprises three sets of questions of varying difficulty, designed to evaluate the analytical capability of the bot.
- Beginner Level: Straightforward univariate statistics requiring beginner-level analytical reasoning. No domain knowledge is needed; everything the analyst requires to answer the query lies within the dataset.
- Intermediate Level: Bivariate analysis requiring additional steps to calculate further variables from the data before arriving at the answer. The main difference from the beginner-level queries is the extra computational steps needed when writing the pandas query.
- Higher Level: Questions that require some understanding of the domain context to answer correctly. Notice in the diagram that the actual Python query is the same as at the beginner level, but the thought process is more complex.
I manually completed each test as an analyst would, using pandas. The notebook containing my test responses is linked here. You can satisfy yourself that the bot's responses align with my own.
Beginner Level Queries
Question 1 – Which vehicle has the highest CO2 emissions on average?
Question 2 – Which vehicle has the lowest CO2 emissions on average?
In each case, the bot returns the correct answers, demonstrating basic data manipulation skills – something one would expect from a junior data analyst.
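For comparison, the manual pandas equivalent of these beginner queries is a single groupby. This is a sketch: the column names "Model" and "CO2 Emissions(g/km)" are taken from the Kaggle dataset and may differ in other versions, so check the CSV header before running it.

```python
import pandas as pd

# Load the same CSV the bot queries
url = "https://raw.githubusercontent.com/john-adeojo/open_datasets/main/CO2%20Emissions_Canada.csv"
df = pd.read_csv(url)

# Average CO2 emissions per vehicle model, then take the extremes
avg_emissions = df.groupby("Model")["CO2 Emissions(g/km)"].mean()
print(avg_emissions.idxmax())  # highest average emissions
print(avg_emissions.idxmin())  # lowest average emissions
```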
Intermediate Level Queries
Question 1 – Which vehicle has the highest ratio of emissions to engine size?
Question 2 – Which vehicle has the lowest ratio of emissions to engine size?
The bot returns the correct answers, showing the capacity to respond to slightly more intricate queries, though still relatively basic. This time, the bot needed to create a ratio variable and then perform a sort on the table to locate the highest and lowest values.
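The manual equivalent adds one derived column before the sort. Again a sketch, with the same assumed column names as above:

```python
import pandas as pd

url = "https://raw.githubusercontent.com/john-adeojo/open_datasets/main/CO2%20Emissions_Canada.csv"
df = pd.read_csv(url)

# Derive the ratio variable, then average per model and take the extremes
df["emissions_per_litre"] = df["CO2 Emissions(g/km)"] / df["Engine Size(L)"]
ratio = df.groupby("Model")["emissions_per_litre"].mean()
print(ratio.idxmax())  # highest emissions-to-engine-size ratio
print(ratio.idxmin())  # lowest emissions-to-engine-size ratio
```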
Higher Level Queries
Question 1 – Which vehicle is the most fuel-efficient?
Question 2 – Which car is the least fuel-efficient?
Question 3 – Which vehicle is the worst for the environment?
The bot correctly answered all three questions. What sets these apart is that they require some degree of domain knowledge: the bot had to "understand" fuel efficiency and environmental impact to answer them. An analyst adept at data manipulation but unfamiliar with these subjects might well have struggled. These higher-level queries are typically posed by product managers or even CXO-level stakeholders.
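Once the domain mapping is made (reading "most fuel-efficient" as lowest combined fuel consumption, and "worst for the environment" as highest average emissions), the query itself is no harder than the beginner ones, which is exactly the point made above. A sketch, again assuming the Kaggle dataset's column names:

```python
import pandas as pd

url = "https://raw.githubusercontent.com/john-adeojo/open_datasets/main/CO2%20Emissions_Canada.csv"
df = pd.read_csv(url)

# "Fuel efficiency" mapped to combined fuel consumption (lower is better)
fuel = df.groupby("Model")["Fuel Consumption Comb (L/100 km)"].mean()
print(fuel.idxmin())  # most fuel-efficient
print(fuel.idxmax())  # least fuel-efficient

# "Worst for the environment" mapped to highest average CO2 emissions
print(df.groupby("Model")["CO2 Emissions(g/km)"].mean().idxmax())
```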
Commercial Implications
Let’s explore some of the commercial implications of implementing an analytics bot.
Shift Towards Engineering: If you’re an analyst, you might be concerned about being made redundant by AI. I would like to reassure you that I don’t believe this will happen, although the role of the data analyst will evolve. Here is how I see that occurring. For these AI bots to function effectively, they need clean, well-curated data sources to query, just as a human analyst would. I therefore envisage analytics roles shifting towards data engineering or analytics engineering: ensuring data-generating processes are up to scratch, and concentrating on the curation, scaling, and acquisition of high-quality data sources rather than ad hoc analytics.
On-Demand Insights: Product managers, portfolio directors, CEOs – anyone needing insights – will find their frustrations alleviated. The transition will be gradual, given the engineering work required, but these bots will make obtaining quick insights about your business easier than ever. With great power, however, comes great responsibility: the onus will be on the consumers of insights to be well trained in interpreting them, asking the right questions, and making data-driven decisions. This takes us further into the realm of decision science.
Advancement of ML & Data Science: As analytics resources are redirected towards data curation, acquisition, and management at scale, a byproduct is the enablement of data science and machine learning. Well-engineered data management structures are precisely what data scientists and machine learning engineers need to build custom machine learning solutions, accelerating the adoption of AI within a business.
Try for Yourself
You’re welcome to try the analytics bot yourself. You will need an OpenAI API key, and it’s recommended that you link to a data source hosted on GitHub. Try it with the CO2 emissions dataset if you wish; you will need to supply the raw link to the GitHub file, as shown below:
https://raw.githubusercontent.com/john-adeojo/open_datasets/main/CO2%20Emissions_Canada.csv
Watch the Live Demonstration
Conclusion
Despite being tested on only a small sample of queries, the analytics bot has demonstrated the potential of AI to automate the ad hoc analytics queries that are the bane of many commercial analytics teams. The next challenge in analytics will be effective data management, necessitating a transition for analysts towards more engineering-focused roles. Decision-makers will also have to evolve, learning the discipline of decision science to help them ask the right questions and make data-informed decisions.
If you’re interested in deploying an analytics bot solution within a commercial setting, there are a few considerations you should take into account:
- Data Privacy: This prototype uses the OpenAI API. Data privacy must be a priority for any similar solution in which data is transferred to a third party.
- Scaling: The prototype holds a single dataset in memory at a time and cannot execute queries across relational datasets. A commercial implementation would need to consider infrastructure requirements at scale.
- Data Management: I alluded to this earlier, but data management will be key here. Datasets should be well curated and the data-generating process must be thoroughly understood for any insights to be meaningful.
- Hallucinations: These bots can return inaccurate responses, and it’s difficult to quantify how often this happens. Human analysts make errors too, but decision-makers using these bots will need to be aware of their tendency to hallucinate, view responses with a healthy degree of scepticism, and apply appropriate scrutiny.
Follow me on LinkedIn
Subscribe to Medium to get more insights from me:
Should you be interested in integrating AI or data science into your business operations, we invite you to schedule a complimentary initial consultation with us: