!["Everyone can code and do NLP analysis in BigQuery with SQL knowledge and a good prompt structure" [Photo by Adi Goldstein on Unsplash]](https://towardsdatascience.com/wp-content/uploads/2023/12/023XYlevvRB_85wRf-scaled.jpg)
Introduction
Since I started working with the Google Cloud Platform, Google has not stopped surprising me with its BigQuery (BQ) features and development.
A real "wow" moment for me happened four years ago.
I remember it like it was yesterday: I was sitting in the front row at the Big Data London 2019 conference. Little did I know back then that it was possible to create machine learning models using only BQ functions, or, better said, what BigQuery Machine Learning (BQML) is.
At least not until the conference session in which a Google colleague presented how you can create classification, clustering, and time-series forecasting models by simply using Google's SQL.
The first thought that went through my mind back then was "You must be joking".
The second thought in my head was, "Does this mean that everyone who knows only SQL will be able to create machine learning models?"
As you can assume, the answer is "yes" if you are using BigQuery as your data warehouse.
Now, after using the BQML functions for a while, the correct answer to the question listed above is "maybe."
This means that even though the [CREATE MODEL](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create)
syntax is written in SQL, knowledge of machine learning modelling and statistics is still needed.
In other words, you still need to understand the math behind the available models for different types of machine learning use cases (supervised/unsupervised), conduct feature engineering, hyperparameter tuning and model evaluation tasks.
Fast forward to the year 2023, and BigQuery continues to amaze me with its new features.
This time, we are talking about the new generative AI BigQuery machine learning functions.
With these new functions, data engineers and analysts are able to perform generative natural language tasks on textual data stored in BQ tables with a few query lines.
Hence, the goal of this blog post is to showcase the new analytical advances of BQ in generative AI, with a focus on one function—the _GENERATE_TEXT_ function.
About the GENERATE_TEXT function
The main idea behind the _GENERATE_TEXT_ function is to assist data professionals with the following tasks using only SQL and prompt instructions in BigQuery [1]:
- Classification
- Sentiment Analysis
- Entity extraction
- Extractive Question Answering
- Summarization
- Re-writing text in a different style
- Ad copy generation
- Concept ideation
In other words, the function can perform generative natural language tasks on textual attributes stored in BQ by using the Vertex AI [text-bison](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text)
natural language foundation model [1].
It works in the following way [1]:
- First, it sends requests to a BigQuery ML remote model that represents a Vertex AI `text-bison` natural language foundation model (LLM).
- Then, it returns the response with the defined input parameters.
This means that different function parameters and the input prompt design (prompt instructions for analysis) affect the LLM’s response.
With this said, the following function-specific parameters can be passed to the _GENERATE_TEXT_ function and affect the response quality [1]:
#1: `model` [`STRING`] → specifies the name of a remote model that uses one of the `text-bison` Vertex AI LLMs.
#2: `query_statement` [`STRING`] → specifies the SQL query that is used to generate the prompt data.
#3: `max_output_tokens` [`INT64`] → sets the maximum number of tokens that the model outputs. For shorter responses, a lower value can be specified.
- Note: A token might be smaller than a word and is approximately four characters.
#4: `temperature` [`FLOAT64`] → argument in the range [0.0, 1.0] that is used for sampling during response generation, which occurs when `top_k` and `top_p` are applied.
- Note: The argument represents the degree of randomness in token selection, i.e., lower values are good for prompts that require a more deterministic response, while higher values can lead to more diverse results.
#5: `top_k` [`INT64`] → argument in the range [1, 40] that changes how the model selects tokens for output.
- Note: To get fewer random responses, lower values should be specified.
#6: `top_p` [`FLOAT64`] → argument in the range [0.0, 1.0] that changes how the model selects tokens for output.
- Note: To get fewer random responses, lower values should be specified. Tokens are selected from the most probable (based on the `top_k` value) to the least probable until the sum of their probabilities equals the `top_p` value.
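Putting the parameters together, the general call shape looks like the sketch below. The model name, table, column, and parameter values are placeholders for illustration, not recommendations:

```sql
-- Generic ML.GENERATE_TEXT call shape; all names are placeholders
SELECT *
FROM ML.GENERATE_TEXT(
  MODEL `my_dataset.my_llm_model`,           -- #1: remote text-bison model
  (
    SELECT CONCAT('<prompt instruction> ', text_column) AS prompt
    FROM `my_dataset.my_table`               -- #2: query_statement
  ),
  STRUCT(
    128 AS max_output_tokens,                -- #3
    0.2 AS temperature,                      -- #4
    40  AS top_k,                            -- #5
    0.8 AS top_p));                          -- #6
```

The inner query must produce a column named `prompt`; the STRUCT carries the optional model parameters.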
Now that we understand the purpose of the function and the role of each parameter, we can demo how the BigQuery _GENERATE_TEXT_ function can be used.
The 4-step methodology for using the function
This section will present four methodological steps for testing the generative AI function _GENERATE_TEXT_.
![GENERATE_TEXT function workflow [Image by author]](https://towardsdatascience.com/wp-content/uploads/2023/12/1x3m8cxqGA_KqJzHvayJe7Q.png)
In summary, the methodology consists of:
- Generation of the small mockup dataset using ChatGPT and exporting it to a Google Sheets document.
- Creation of the BigQuery dataset on top of the Google Sheets document.
- Setting up the connection between the Vertex AI Google Cloud service and BQ in order to use a remote generative AI model.
- Hands-on use-case examples for testing the _GENERATE_TEXT_ function on top of the mockup dataset in BigQuery.
More context for every step is shown in the subsections below.
Step 1: Generate a mockup dataset using ChatGPT
As I didn’t have a real-life dataset, I decided to create a mockup dataset for testing purposes.
For this, I used ChatGPT by entering the following prompt text:
"I would need to automatically generate customer reviews for 50 different imaginary hair products.
For each product, I would need 2 positive, 2 negative, and 1 neutral review.
In addition, I would expect reviews to be at least 4 sentences long and contain different information: product name, location of the store where the product was bought, and price of the product.
In negative reviews, please include different reasons, like product quality issues, price issues, and delivery issues."
The outcome was a small dataset table with five attributes (previewed in the image below):
- _product_name_ – contains the fake product names,
- _review_type_ – contains the review sentiment type (positive, negative, or neutral),
- _store_location_ – contains a random city and state name,
- _product_price_ – contains a random product price in dollars, and
- _product_review_ – contains the five-sentence-long fake product review text.
![BigQuery mockup dataset preview [Image by author]](https://towardsdatascience.com/wp-content/uploads/2023/12/1Qyf_NEaeGwvRMORD686ofw.png)
The complete dataset can be found in the Git repository here [4].
After preparing the mockup dataset and storing it in Google Sheets, the next step is transferring it to BigQuery.
Step 2: Create a BQ table on top of the Google Sheet document
My go-to method when I have a small dataset that I often need to tweak or change manually is to create a BigQuery table on top of an editable Google Sheets document.
To do so, the following substeps are needed [2]:
Substep #1: Create a dataset and table in the BigQuery project by specifying the file location and format
The `CREATE TABLE` option, located in the top-right corner of the BigQuery environment, should be selected to create the new dataset and table. Values marked with a star sign (*) need to be entered, as shown in the image below.
![Creating a BQ dataset and table on top of the mockup dataset in Google Sheets [Image by author]](https://towardsdatascience.com/wp-content/uploads/2023/12/1GS3n3gyVOcMNK0sGhQtvmg.png)
As visible from the image above, the name of the newly created BQ dataset is `hair_shop`, while the name of the newly created table is `product_review_table`.
Important note to keep in mind:
- In case you don't want to define the table schema in the `Schema` section shown in the image above, all imported attributes from Google Sheets will have the datatype `STRING` by default.
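As an alternative to the console wizard, the same setup can be scripted with DDL. The sketch below is hedged: the spreadsheet URI is a placeholder, not the real document, and omitting the schema reproduces the all-`STRING` default noted above:

```sql
-- Sketch: dataset plus external table over a Google Sheet via DDL
-- (the uris value is a placeholder)
CREATE SCHEMA IF NOT EXISTS hair_shop;

CREATE OR REPLACE EXTERNAL TABLE hair_shop.product_review_table
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/<sheet-id>'],
  skip_leading_rows = 1
);
```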
Substep #2: Query the mockup dataset in the BQ
The second substep is to simply explore the `hair_shop.product_review_table` table directly in BigQuery.
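A minimal exploration query could look like this (the column names are the ones from the mockup dataset described in step 1):

```sql
-- Preview a few rows of the imported Google Sheets data
SELECT
  product_name,
  review_type,
  product_review
FROM
  `hair_shop.product_review_table`
LIMIT 5;
```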
![Query the mockup dataset in BQ [Image by author]](https://towardsdatascience.com/wp-content/uploads/2023/12/1lNC3pH4CZd9r6h_yIjilqw.png)
Now that you can access your mockup dataset in BQ, it’s time to connect the generative AI remote model to it.
Step 3: Connect the Vertex AI service to BQ
A detailed explanation that served as a guideline for this step can be found in Google’s documentation here [3].
To summarize the provided Google tutorial, the three main substeps for connecting the Vertex AI to BQ are:
- Substep #1: Create a cloud resource connection from BigQuery to get the connection’s service account.
- Substep #2: Grant the connection’s service account an appropriate role to access the Vertex AI service.
- Substep #3: Create the `text-bison` remote model that represents a hosted Vertex AI large language model in the created BigQuery dataset `hair_shop`.
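Substep #3 boils down to a single DDL statement. A sketch is shown below, where the connection name `us.vertex_ai_connection` is an assumed placeholder for the cloud resource connection created in substep #1:

```sql
-- Sketch: create the remote text-bison model in the hair_shop dataset
-- (`us.vertex_ai_connection` is an assumed connection ID)
CREATE OR REPLACE MODEL `hair_shop.llm_model`
  REMOTE WITH CONNECTION `us.vertex_ai_connection`
  OPTIONS (ENDPOINT = 'text-bison');
```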
Once these substeps are concluded, a new `Model` object will be visible within the `hair_shop` dataset.
![A preview of the new database object Model in the selected BQ dataset [Image by author]](https://towardsdatascience.com/wp-content/uploads/2023/12/1tCzYQzJzsnAMpogIQ5bpgA.png)
Finally, the magic can begin, and we can start using the _GENERATE_TEXT_ function.
Step 4: Use the GENERATE_TEXT function in BQ
In this step, we will focus on two use cases for testing the function usage: sentiment analysis and entity extraction.
The reason for selecting these two use cases is that the input mockup dataset already contains two attributes (`review_type` and `product_review`) that can be used to validate the function results, i.e., to check the accuracy of the AI-generated values in the new attributes.
Let’s now present the concrete input-output flow for generating the new attributes for each use case.
Sentiment analysis example
The query for sentiment analysis is shown in the code block below.
```sql
--Generate the positive/negative sentiment
SELECT
  ml_generate_text_result['predictions'][0]['content'] AS review_sentiment,
  * EXCEPT (ml_generate_text_status, ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `hair_shop.llm_model`,
    (
      SELECT
        CONCAT('Extract the one word sentiment from product_review column. Possible values are positive/negative/neutral ', product_review) AS prompt,
        *
      FROM
        `macro-dreamer-393017.hair_shop.product_review_table`
      LIMIT 5
    ),
    STRUCT(
      0.1 AS temperature,
      1 AS max_output_tokens,
      0.1 AS top_p,
      1 AS top_k));
```
The breakdown of the query is as follows:
Outer SELECT statement:
- The outer query selects the newly generated string attribute `review_sentiment`, obtained from the LLM by applying the `ML.GENERATE_TEXT` function to the `hair_shop.llm_model` object.
- In addition, it selects all other columns from the input dataset `hair_shop.product_review_table`.
Inner SELECT statement:
- The inner query statement uses a `CONCAT` function to attach a specific prompt instruction to each `product_review` from the input dataset `hair_shop.product_review_table`.
- In other words, the prompt instruction guides the model to extract a one-word sentiment (positive, negative, or neutral) from the `product_review` attribute.
STRUCT for model parameters:
- `temperature: 0.1` – a lower value of 0.1 will lead to more predictable text generation.
- `max_output_tokens: 1` – limits the model's output to 1 token (the sentiment analysis outcome can be either positive, negative, or neutral).
- `top_p: 0.1` – influences the likelihood distribution of the next token.
- `top_k: 1` – restricts the number of top tokens considered.
After triggering the presented query, the first five outcomes were analysed:
![The sentiment analysis outcome of the GENERATE_TEXT function [Image by author]](https://towardsdatascience.com/wp-content/uploads/2023/12/1c46WIYfbadjskl2b5NsWHQ.png)
As visible from the image, the newly generated attribute `review_sentiment` is added to the mockup dataset. The values of the `review_sentiment` attribute were compared to the values in the `review_type` attribute, and their records matched.
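This spot-check can also be done as a query. The sketch below assumes the generated output was first materialised into a table, here hypothetically named `hair_shop.sentiment_results`, with the generated value cast to a plain `STRING` column `review_sentiment`:

```sql
-- Sketch: count how many generated sentiments match the original
-- review_type labels (hair_shop.sentiment_results is an assumed
-- materialisation of the previous query's output)
SELECT
  COUNTIF(LOWER(TRIM(review_sentiment)) = LOWER(TRIM(review_type))) AS matches,
  COUNT(*) AS total_reviews
FROM
  `hair_shop.sentiment_results`;
```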
Following the sentiment analysis outcome, the aim was to identify the reason behind every sentiment, i.e., to perform entity extraction.
Entity extraction example
The query for entity extraction analysis is shown in the code block below.
```sql
--Check what's the reason behind the positive/negative sentiment
SELECT
  ml_generate_text_result['predictions'][0]['content'] AS generated_text,
  * EXCEPT (ml_generate_text_status, ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `hair_shop.llm_model`,
    (
      SELECT
        CONCAT('What is the reason for a review of the product? Possible options are: good/bad/average product quality, delivery delay, low/high price', product_review) AS prompt,
        *
      FROM
        `macro-dreamer-393017.hair_shop.product_review_table`
      LIMIT 5
    ),
    STRUCT(
      0.2 AS temperature,
      20 AS max_output_tokens,
      0.2 AS top_p,
      5 AS top_k));
```
The breakdown of the query is analogous to the sentiment analysis query breakdown, except that the prompt instruction and model parameter values have been adjusted for entity extraction analysis.
After triggering the presented query, the first five outcomes were analysed:
![The entity extraction analysis outcome of the GENERATE_TEXT function [Image by author]](https://towardsdatascience.com/wp-content/uploads/2023/12/1hWULXyU3k7nIqjqN-oeTqg.png)
The results of the `review_reason` attribute were compared to the values of the `product_review` attribute. As in the previous example, the outcomes matched.
And here you have it, folks. This tutorial showed two use cases for creating new attributes from unstructured text by using only SQL and prompt text, all with the aid of the new generative AI function in BigQuery.
We can now share a short summary of this blog post.
Conclusion
The goal of this blog post was to present the usage of the new GENERATE_TEXT
ML function in BigQuery.
With this function, SQL-oriented data professionals can now create new insights (attributes) from unstructured text stored directly in BQ data warehouse tables.
In other words, the function enables data professionals to leverage NLP machine learning tasks directly within the BigQuery environment.
Furthermore, it bridges the knowledge gap between the data professionals who develop new analytical insights using Python and the ones who are more database-oriented.
Finally, I will conclude this blog post with one saying: "Everyone can code and do NLP analysis in BigQuery with SQL knowledge and a good prompt structure."
Thank you for reading my post. Stay connected for more stories on Medium and LinkedIn.
Knowledge references
- [1] Google article "The ML.GENERATE_TEXT function", accessed November 10th 2023, https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text
- [2] Supermetrics article "Google Sheets to BigQuery: a step-by-step guide", accessed November 22nd 2023, https://t.ly/ZZ2lN
- [3] Google tutorial "Generate text by using the ML.GENERATE_TEXT function", section "Create a connection", accessed November 10th 2023, https://cloud.google.com/bigquery/docs/generate-text#generate_text
- [4] Git repository by author: https://github.com/CassandraOfTroy/bigquery-generate_text-AI_function/tree/main