Google launched text embedding with pre-trained TensorFlow models in BigQuery

by Chigozie Boniface, February 2024


BigQuery is strengthening its capabilities for AI and ML


Introduction

In the realm of language processing and data analytics, the ability to make sense of text data is crucial. Google’s recent release brings pre-trained TensorFlow models into BigQuery, Google Cloud’s data warehouse, to simplify text analysis. With this development, text can be converted into numerical representations that machines can process, directly within BigQuery. By combining TensorFlow’s pre-trained models with BigQuery’s infrastructure, users gain streamlined access to powerful text analysis tools.

Understanding Text Embedding

Text embedding simplifies complex text data by converting it into dense vector representations [1]. These representations enable machines to capture the meaning and context of words and sentences more effectively. In essence, a text embedding maps each piece of text to a numerical vector in a multi-dimensional space.

The key principle behind text embedding is that semantically similar pieces of text have embeddings that are positioned closely together in this vector space [1]. This means that words or sentences with similar meanings will have similar numerical representations.

By encoding textual information into dense vectors, text embedding facilitates a variety of natural language processing tasks, such as sentiment analysis, document clustering, and recommendation systems. This technique empowers machine learning models to interpret and process textual data more efficiently.
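
To make “closeness” concrete, here is a minimal Python sketch (using NumPy, which is not part of the article’s setup) that compares two embedding vectors with cosine similarity; the vectors are made-up placeholders for the output a model such as SWIVEL would produce:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors: closer to 1.0 means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings; real models emit hundreds of dimensions.
emb_cat = np.array([0.9, 0.1, 0.3, 0.5])
emb_kitten = np.array([0.85, 0.15, 0.35, 0.45])  # semantically close to "cat"
emb_invoice = np.array([0.1, 0.9, 0.7, 0.0])     # unrelated topic

print(cosine_similarity(emb_cat, emb_kitten))   # high score: similar meaning
print(cosine_similarity(emb_cat, emb_invoice))  # lower score: different meaning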

Google’s Initiative: Text Embedding with Pre-trained TensorFlow Models in BigQuery

Google’s latest initiative brings cutting-edge text embedding capabilities directly into BigQuery, its powerful data warehouse platform, by leveraging pre-trained TensorFlow models such as NNLM, SWIVEL, and BERT.

By embedding pre-trained TensorFlow models within BigQuery, users gain access to state-of-the-art text analysis tools without the need for additional infrastructure or complex setup. This seamless integration empowers data scientists, analysts, and developers to perform advanced NLP tasks within the familiar BigQuery environment.

Implementation

Implementing text embedding with pre-trained TensorFlow models in BigQuery requires the appropriate permissions for, and an understanding of the costs of, BigQuery, BigQuery ML, Cloud Storage, and Vertex AI. For detailed guidelines, refer to [3]. Here are the steps for implementation:

Step 1: Create a dataset

In the Google Cloud Console, go to the BigQuery page. In the BigQuery editor, enter the following statement and click Run; remember to replace PROJECT_ID with your project ID:

CREATE SCHEMA `<PROJECT_ID>.tf_models_tutorial`;

Step 2: Generate and upload a model to cloud storage

  • Install the bigquery-ml-utils library using pip in your integrated development environment (IDE).
  • To generate the NNLM, SWIVEL, and BERT TensorFlow models, iterate over the model names ‘nnlm’, ‘swivel’, and ‘bert’. You can run this in a local IDE or a Colab notebook using the code snippet below:
from bigquery_ml_utils import model_generator
import tensorflow as tf
from google.colab import auth as google_auth

PROJECT_ID = "sample-project-id"  # replace with your project ID

google_auth.authenticate_user()
!gcloud config set project {PROJECT_ID}

MODEL_NAMES = ["nnlm", "swivel", "bert"]
GCS_BUCKET = "<bucket_name>"  # replace with your GCS bucket name

for model_name in MODEL_NAMES:
    LOCAL_OUTPUT_DIR = "./" + model_name

    # Create an instance of TextEmbeddingModelGenerator.
    text_embedding_model_generator = model_generator.TextEmbeddingModelGenerator()

    # Generate the current model (nnlm, swivel, or bert) into the local directory.
    text_embedding_model_generator.generate_text_embedding_model(model_name, LOCAL_OUTPUT_DIR)

    # Print the generated model's signature to confirm it was generated correctly.
    reload_embedding_model = tf.saved_model.load(LOCAL_OUTPUT_DIR)
    print(reload_embedding_model.signatures["serving_default"])

    # Copy the model's contents to the bucket and list them to verify the upload.
    !gsutil cp -r {LOCAL_OUTPUT_DIR} gs://{GCS_BUCKET}/
    !gsutil ls gs://{GCS_BUCKET}/{model_name}

Step 3: Load the model into BigQuery

Let’s use the “swivel” model as an example. The following BigQuery statement loads the “swivel” model stored in your Cloud Storage bucket into BigQuery; the procedure is the same for the other two models (“nnlm” and “bert”).

CREATE OR REPLACE MODEL tf_models_tutorial.swivel_model
OPTIONS (
  model_type = 'TENSORFLOW',
  model_path = 'gs://BUCKET_NAME/swivel/*');

Step 4: Generate text embeddings

Use the ML.PREDICT() inference function to generate text embeddings for the review column of the public dataset bigquery-public-data.imdb.reviews [4]:

SELECT
  *
FROM
  ML.PREDICT(
    MODEL `tf_models_tutorial.swivel_model`,
    (
      SELECT
        review AS embedding_input
      FROM
        `bigquery-public-data.imdb.reviews`
      LIMIT
        500)
  );

The output resembles the following example [4]:

+----------------------+----------------------------------------+
| embedding | embedding_input |
+----------------------+----------------------------------------+
| 2.5952553749084473 | Isabelle Huppert must be one of the... |
| -4.015787601470947 | |
| 3.6275434494018555 | |
| -6.045154333114624 | |
| ... | |
+----------------------+----------------------------------------+
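
If you prefer to run the same query programmatically rather than in the console, a minimal sketch using the google-cloud-bigquery Python client (not covered in the original steps; the project ID is a placeholder) might look like this:

from google.cloud import bigquery

# Assumes application-default credentials; replace the project ID with your own.
client = bigquery.Client(project="sample-project-id")

query = """
SELECT *
FROM ML.PREDICT(
  MODEL `tf_models_tutorial.swivel_model`,
  (SELECT review AS embedding_input
   FROM `bigquery-public-data.imdb.reviews`
   LIMIT 500)
)
"""

for row in client.query(query).result():
    # Each row carries the original text and its embedding array.
    print(row["embedding_input"][:40], row["embedding"][:3])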

Use Case

Semantic Search: Text embeddings enable the representation of user queries and documents in a high-dimensional vector space. Documents closely related to the query have shorter distances in this space, leading to higher rankings in search results [5].
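
As a rough illustration of the ranking step, here is a Python sketch that orders documents by cosine similarity to a query embedding; all vectors are made up and stand in for real model output:

import numpy as np

def rank_documents(query_emb, doc_embs, doc_texts, top_k=2):
    # Normalize so the dot product equals cosine similarity.
    query_emb = query_emb / np.linalg.norm(query_emb)
    doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = doc_embs @ query_emb
    # Highest-scoring (closest) documents rank first.
    order = np.argsort(scores)[::-1][:top_k]
    return [(doc_texts[i], float(scores[i])) for i in order]

# Made-up embeddings standing in for real model output.
docs = ["a review praising the acting", "a note on cloud pricing", "a film critique"]
doc_embs = np.array([[0.9, 0.1, 0.2], [0.1, 0.8, 0.6], [0.8, 0.2, 0.3]])
query_emb = np.array([0.85, 0.15, 0.25])  # e.g., embedding of "movie performances"

print(rank_documents(query_emb, doc_embs, docs))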

Text Classification: A model is trained to map text embeddings to specific category labels (e.g., cat vs. dog, spam vs. not spam). Once trained, the model categorizes new text inputs based on their embeddings, assigning them to relevant categories [5].
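
A minimal sketch of this pattern, assuming the embeddings already exist and using scikit-learn (not part of the article’s stack) as the classifier:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 3-dimensional embeddings with spam / not-spam labels.
X_train = np.array([
    [0.9, 0.1, 0.0],  # "WIN A FREE PRIZE NOW"    -> spam
    [0.8, 0.2, 0.1],  # "Claim your reward today" -> spam
    [0.1, 0.9, 0.7],  # "Meeting moved to 3pm"    -> not spam
    [0.2, 0.8, 0.6],  # "Lunch tomorrow?"         -> not spam
])
y_train = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

# Train a simple classifier on the embedding vectors.
clf = LogisticRegression().fit(X_train, y_train)

# Classify the embedding of a new, unseen message.
new_embedding = np.array([[0.85, 0.15, 0.05]])
print(clf.predict(new_embedding))  # expected: [1] (spam)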

Text embeddings are also valuable for dimensionality reduction, transfer learning, multi-modal applications, and more. These applications show how effective text embedding techniques are at solving a wide range of computational problems.

In conclusion, adding text embedding to BigQuery is a significant step for language processing and data analysis. Google’s use of pre-trained TensorFlow models makes advanced AI more accessible and drives innovation, underscoring its commitment to democratizing AI and helping users draw insights from text data.

[1] Google, Embed text with pretrained TensorFlow models (2024)

[2] Google, BigQuery release notes (2024)

[3] Google, Required permissions (2024)

[4] Google, Generate text embeddings (2024)

[5] Google, Text embeddings use case (2024)


