
The New Generative AI Function in BigQuery

How to use BigQuery GENERATE_TEXT remote function

"Everyone can code and do NLP analysis in BigQuery with SQL knowledge and a good prompt structure" [Photo by Adi Goldstein on Unsplash]

Introduction

Since I started working with the Google Platform, Google has not stopped surprising me with its BigQuery (BQ) features and development.

A real "wow" moment for me happened four years ago.

I remember it like it was yesterday: I was sitting in the front row at the Big Data London 2019 conference. Little did I know back then about the possibility of creating machine learning models using only BQ functions, or, better said, what BQ Machine Learning (BQML) is.

At least, not until the conference session in which a Google colleague presented how you can create classification, clustering, and time-series forecasting models by simply using Google’s SQL.

The first thought that went through my mind back then was "You must be joking".

The second thought in my head was, "Does this mean that everyone who knows only SQL will be able to create machine learning models?"

As you can assume, the answer is "yes" if you are using BigQuery as your data warehouse.

Now, after using the BQML functions for a while, the correct answer to the question listed above is "maybe."

This means that even though the [CREATE MODEL](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create) syntax is written in SQL, knowledge of machine learning modelling and statistics is still needed.

In other words, you still need to understand the math behind the available models for different types of machine learning use cases (supervised/unsupervised) and conduct feature engineering, hyperparameter tuning, and model evaluation tasks.
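To make this concrete, a minimal BQML CREATE MODEL statement could look like the sketch below. The project, dataset, table, and column names are hypothetical, used only for illustration:

```sql
-- A minimal BQML sketch: train a logistic regression classifier
-- using only SQL (all names below are hypothetical placeholders).
CREATE OR REPLACE MODEL `my_project.my_dataset.churn_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned']  -- the label column to predict
) AS
SELECT
  tenure_months,
  monthly_charges,
  churned
FROM
  `my_project.my_dataset.customer_table`;
```

Even with this one statement, choices such as which features to include and how to evaluate the resulting model still require ML knowledge.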

Fast forward to 2023, and BigQuery has amazed me once again with its new features.

This time, we are talking about the new generative AI BigQuery machine learning functions.

With these new functions, data engineers and analysts are able to perform generative natural language tasks on textual data stored in BQ tables with a few query lines.

Hence, the goal of this blog post is to showcase the new analytical advances of BQ in generative AI, with a focus on one function—the _GENERATE_TEXT_ function.


About the GENERATE_TEXT function

The main idea behind the _GENERATE_TEXT_ function is to assist data professionals with the following tasks using only SQL and prompt instructions in BigQuery [1]:

  • Classification
  • Sentiment Analysis
  • Entity extraction
  • Extractive Question Answering
  • Summarization
  • Re-writing text in a different style
  • Ad copy generation
  • Concept ideation

In other words, the function can perform generative natural language tasks on textual attributes stored in BQ by using the Vertex AI [text-bison](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text) natural language foundation model [1].

It works in the following way [1]:

  • First, it sends a request to a BigQuery ML remote model that represents a Vertex AI text-bison natural language foundation model (LLM).
  • Then, it returns the model’s response, shaped by the defined input parameters.

This means that different function parameters and the input prompt design (prompt instructions for analysis) affect the LLM’s response.

With this said, the following function-specific parameters can be passed to the _GENERATE_TEXT_ function and affect the response quality [1]:

#1: model [STRING] → specifies the name of a remote model that uses one of the text-bison Vertex AI LLMs.

#2: query_statement [STRING] → specifies the SQL query that is used to generate the prompt data.

#3: max_output_tokens [INT64] → sets the maximum number of tokens that the model outputs. Specify a lower value for shorter responses.

  • Note: A token might be smaller than a word and is approximately four characters.

#4: temperature [FLOAT64] → argument in the range [0.0, 1.0] that is used for sampling during the response generation, which occurs when top_k and top_p are applied.

  • Note: The argument represents the degree of randomness in token selection, i.e., lower values are good for prompts that require a more deterministic response, while higher values can lead to more diverse results.

#5: top_k [FLOAT64] → argument in the range [1, 40] that changes how the model selects tokens for output.

  • Note: To get fewer random responses, lower values should be specified.

#6: top_p [FLOAT64] → argument in the range [0.0, 1.0] that changes how the model selects tokens for output.

  • Note: To get fewer random responses, lower values should be specified. Tokens are selected from the most probable (based on the top_k value) to the least probable until the sum of their probabilities equals the top_p value.
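Putting the parameters together, a generic call could look like the sketch below. The model, project, dataset, table, and column names are placeholders, and the parameter values are only examples:

```sql
-- A generic ML.GENERATE_TEXT skeleton showing where each
-- parameter fits (all names below are placeholders).
SELECT *
FROM
  ML.GENERATE_TEXT(
    MODEL `my_dataset.my_llm_model`,            -- #1: model
    (
      SELECT some_text_column AS prompt         -- prompt column
      FROM `my_project.my_dataset.my_table`     -- #2: query_statement
    ),
    STRUCT(
      128 AS max_output_tokens,                 -- #3
      0.2 AS temperature,                       -- #4
      40  AS top_k,                             -- #5
      0.8 AS top_p));                           -- #6
```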

Now that we understand the purpose of the function and the role of each parameter, we can start to demo how the BigQuery _GENERATE_TEXT_ function can be used.


The 4-step methodology for using the function

This section will present four methodological steps for testing the generative AI function _GENERATE_TEXT_.

GENERATE_TEXT function workflow [Image by author]

In summary, the methodology consists of:

  • Generation of the small mockup dataset using ChatGPT and exporting it to a Google Sheets document.
  • Creation of the BigQuery dataset on top of the Google Sheets document.
  • Setting up the connection between the Vertex AI service and BQ in order to use a remote generative AI model.
  • Hands-on use-case examples for testing the _GENERATE_TEXT_ function on top of the mockup dataset in BigQuery.

More context for every step is shown in the subsections below.


Step 1: Generate a mockup dataset using ChatGPT

As I didn’t have a real-life dataset, I decided to create a mockup dataset for testing purposes.

For this, I used ChatGPT by entering the following prompt text:

"I would need to automatically generate customer reviews for 50 different imaginary hair products.

For each product, I would need 2 positive, 2 negative, and 1 neutral review.

In addition, I would expect reviews to be at least 4 sentences long and contain different information: product name, location of the store where the product was bought, and price of the product.

In negative reviews, please include different reasons, like product quality issues, price issues, and delivery issues."

The outcome was a small dataset table with five attributes (previewed in the image below):

  • product_name – contains the fake product names,
  • review_type – contains the review sentiment type (positive, negative, or neutral),
  • store_location – contains a random city and state name,
  • product_price – contains a random product price in dollars, and
  • product_review – contains the five-sentence-long fake product review text.
BigQuery mockup dataset preview [Image by author]

The complete dataset can be found in the Git repository here [4].

After preparing the mockup dataset and storing it in Google Sheets, the next step is transferring it to BigQuery.


Step 2: Create a BQ table on top of the Google Sheet document

My go-to method when I have a small dataset, which I need to tweak or change often manually, is to create a BigQuery table on top of the editable Google Sheet document.

To do so, the following substeps are needed [2]:

Substep #1: Create a dataset and table in the BigQuery project by specifying the file location and format

The CREATE TABLE option, located in the top-right corner of the BigQuery environment, should be selected to create the new dataset and table. Values marked with an asterisk (*) need to be entered, as shown in the image below.

Creating a BQ dataset and table on top of the mockup dataset in Google Sheets [Image by author]

As visible from the image above, the name of the newly created BQ dataset is hair_shop, while the name of the newly created table is product_review_table.

Important note to keep in mind:

  • In case you don’t want to define the table schema in the Schema section shown in the image above, all imported attributes from Google Sheets will have the datatype STRING by default.
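For those who prefer DDL over the UI, the same external table could be created with a statement similar to the sketch below. The project name and Sheet URL are placeholders, and the schema is left implicit (so the attributes default to STRING, as noted above):

```sql
-- Hedged DDL alternative to the CREATE TABLE UI flow
-- (project name and spreadsheet URL are placeholders).
CREATE EXTERNAL TABLE `my_project.hair_shop.product_review_table`
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/your-sheet-id'],
  skip_leading_rows = 1  -- skip the header row of the sheet
);
```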

Substep #2: Query the mockup dataset in the BQ

The second substep is to simply explore the hair_shop.product_review_table table directly in BigQuery.

Query the mockup dataset in BQ [Image by author]

Now that you can access your mockup dataset in BQ, it’s time to connect the generative AI remote model to it.


Step 3: Connect the Vertex AI service to BQ

A detailed explanation that served as a guideline for this step can be found in Google’s documentation here [3].

To summarize the provided Google tutorial, the three main substeps for connecting Vertex AI to BQ are:

  • Substep #1: Create a cloud resource connection from BigQuery to get the connection’s service account.
  • Substep #2: Grant the connection’s service account an appropriate role to access the Vertex AI service.
  • Substep #3: Create the text-bison remote model that represents a hosted Vertex AI large language model in the created BigQuery dataset hair_shop.
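For reference, substep #3 could be expressed with a CREATE MODEL statement similar to the sketch below. The connection ID is a placeholder, and the exact option names may vary between BigQuery releases, so check the current documentation [3] before running it:

```sql
-- Sketch of substep #3: create a remote model backed by the
-- Vertex AI text-bison LLM (connection ID is a placeholder).
CREATE OR REPLACE MODEL `hair_shop.llm_model`
  REMOTE WITH CONNECTION `us.my_vertex_ai_connection`
  OPTIONS (remote_service_type = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1');
```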

Once these substeps are concluded, a new Model object will be visible within the hair_shop dataset.

A preview of the new database object Model in the selected BQ dataset [Image by author]

Finally, the magic can begin, and we can start using the _GENERATE_TEXT_ function.


Step 4: Use the GENERATE_TEXT function in BQ

In this step, we will focus on two use cases for testing the function usage: sentiment analysis and entity extraction.

The reason for selecting these two use cases is that we already created two attributes in the input mockup dataset (review_type and product_review) which can be used to validate the function results. By function results, we mean the accuracy of the AI-generated values within the new attributes.

Let’s now present the concrete input-output flow for generating the new attributes for each use case.

Sentiment analysis example

The query for sentiment analysis is shown in the code block below.

--Generate the positive/negative sentiment
SELECT
  ml_generate_text_result['predictions'][0]['content'] AS review_sentiment,
  * EXCEPT (ml_generate_text_status,ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `hair_shop.llm_model`,
    (
      SELECT
        CONCAT('Extract the one word sentiment from product_review column. Possible values are positive/negative/neutral ', product_review) AS prompt,
        *
      FROM
        `macro-dreamer-393017.hair_shop.product_review_table`
      LIMIT 5
    ),
    STRUCT(
      0.1 AS temperature,
      1 AS max_output_tokens,
      0.1 AS top_p, 
      1 AS top_k));

The breakdown of the query is as follows:

Outer SELECT statement:

  • The outer query selects the newly generated string attribute review_sentiment, obtained from the LLM by applying the ML.GENERATE_TEXT function to the hair_shop.llm_model object.
  • In addition, it selects all other columns from the input dataset hair_shop.product_review_table.

Inner SELECT statement:

  • The inner query statement selects a CONCAT function with a specific prompt instruction to each product_review from the input dataset hair_shop.product_review_table.
  • In other words, the prompt instruction guides the model to extract a one-word sentiment (positive, negative, or neutral) from the product_review attribute.

STRUCT for model parameters:

  • temperature: 0.1 – a lower value leads to more predictable text generation.
  • max_output_tokens: 1 – limits the model’s output to 1 token (the sentiment analysis outcome can only be positive, negative, or neutral).
  • top_p: 0.1 – influences the likelihood distribution of the next token.
  • top_k: 1 – restricts the number of top tokens considered.

After triggering the presented query, the first five outcomes were analysed:

The sentiment analysis outcome of the GENERATE_TEXT function [Image by author]

As visible from the image, the newly generated attribute named review_sentiment is added to the mockup dataset. The values of the review_sentiment attribute were compared to the values in the review_type attribute, and their records matched.
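The comparison described above can itself be sketched in SQL. This is a hedged sketch that assumes the previous query’s output was saved to a hypothetical table named hair_shop.sentiment_results with review_sentiment stored as a STRING (the extracted JSON value may carry surrounding quotes, hence the TRIM):

```sql
-- Hypothetical validation query: count how many AI-generated
-- sentiments match the original review_type labels.
SELECT
  COUNTIF(LOWER(TRIM(review_sentiment, '" ')) = LOWER(review_type))
    AS matching_records,
  COUNT(*) AS total_records
FROM
  `hair_shop.sentiment_results`;
```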

Following the sentiment analysis outcome, the aim was to identify the reason behind every sentiment, i.e., to perform entity extraction.

Entity extraction example

The query for entity extraction analysis is shown in the code block below.

--Check what's the reason behind the positive/negative sentiment
SELECT
  ml_generate_text_result['predictions'][0]['content'] AS review_reason,
  * EXCEPT (ml_generate_text_status,ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `hair_shop.llm_model`,
    (
      SELECT
        CONCAT('What is the reason for a review of the product? Possible options are: good/bad/average product quality, delivery delay, low/high price', product_review) AS prompt,
        *
      FROM
        `macro-dreamer-393017.hair_shop.product_review_table`
      LIMIT 5
    ),
    STRUCT(
      0.2 AS temperature,
      20 AS max_output_tokens,
      0.2 AS top_p, 
      5 AS top_k));

The breakdown of the query is analogous to the sentiment analysis query breakdown, except that the prompt instruction and model parameter values have been adjusted for entity extraction analysis.

After triggering the presented query, the first five outcomes were analysed:

The entity extraction analysis outcome of the GENERATE_TEXT function [Image by author]

The results of the review_reason attribute were compared to the values of the product_review attribute. As in the previous example, the outcomes matched.

And there you have it, folks. This tutorial showed two use cases for creating new attributes from unstructured text by using only SQL and prompt text, all with the aid of the new generative AI function in BigQuery.

We can now share a short summary of this blog post.

Conclusion

The goal of this blog post was to present the usage of the new GENERATE_TEXT ML function in BigQuery.

With this function, data professionals who are SQL-oriented now have options to create new insights (attributes) from the unstructured text stored directly in the BQ data warehouse table.

In other words, the function enables data professionals to leverage NLP machine learning tasks directly within the BigQuery environment.

Furthermore, it bridges the knowledge gap between data professionals who develop new analytical insights using Python and those who are more database-oriented.

Finally, I will conclude this blog post with one saying: "Everyone can code and do NLP analysis in BigQuery with SQL knowledge and a good prompt structure."

Thank you for reading my post. Stay connected for more stories on Medium and LinkedIn.

Knowledge references

