Chapter 87 – Interact with Google Big Query ML Pre-trained Model Dataset Using Python

Pre-trained machine learning models are becoming more popular as more LLMs are launched in the market and more users adopt them to boost work efficiency. As this trend continues, demand for customized machine learning models is likely to grow as well. This article therefore briefly walks through how an AI app can interact with a large dataset using Python, so that pre-trained model functions can be served to users.



What is Google Big Query

Google BigQuery is a serverless, highly scalable data warehouse that comes with a built-in query engine. The query engine is capable of running SQL queries on terabytes of data in a matter of seconds, and petabytes in only minutes. You get this performance without having to manage any infrastructure and without having to create or rebuild indexes.

Google BigQuery integrates seamlessly with other Google Cloud services and with programming languages such as Python. Its capacity and speed stand out in particular for dataset creation and for fetching data when deploying and running pre-trained models.
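As a quick, minimal sketch, connecting to BigQuery from Python looks roughly like the following. It assumes the google-cloud-bigquery package is installed (for example via pip install google-cloud-bigquery) and that a service account key file exists; the file name service_account.json is just a placeholder.

# pip install google-cloud-bigquery
from google.cloud import bigquery
from google.oauth2 import service_account

# 'service_account.json' is a placeholder path to your downloaded key file.
credentials = service_account.Credentials.from_service_account_file('service_account.json')
client = bigquery.Client(project='your-project-id', credentials=credentials)

# Quick sanity check: list the datasets that already exist in the project.
for dataset in client.list_datasets():
    print(dataset.dataset_id)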

New dataset and table creation

Whether it is done manually or through automation, creating a BigQuery data warehouse involves three fundamental elements: the project ID, the dataset ID, and the table ID.

For the project ID, users can log into the Google Cloud console and create a new project. Each project ID is unique and can be used in Google BigQuery as well.

The project ID is used to authenticate the user's identity whenever a request interacts with a specific database. For more details on Google Cloud authentication and credential creation, please check out the other Google Cloud articles previously released on Easy2Digital.

## Authentication ##
from google.cloud import bigquery
from google.oauth2 import service_account

project_id = 'xxx'
dataset_id = 'aaa'
table_id = 'bbb'

# jsonTranslateV2 holds the service account key, loaded as a Python dict.
credentials = service_account.Credentials.from_service_account_info(jsonTranslateV2)
client = bigquery.Client(project=project_id, credentials=credentials)
dataset2 = bigquery.Dataset(f'{project_id}.{dataset_id}')

## Create Dataset ##
def createNewDataSet(dataset):
    dataset.location = 'asia-southeast1'  # BigQuery region name, e.g. Singapore
    dataset = client.create_dataset(dataset, timeout=30)  # Make an API request.
    return "Created dataset {}.{}".format(client.project, dataset.dataset_id)

First, we name the new dataset ID and table ID by creating two variables. Then we can use the authenticated client object and its create_dataset method to create a new dataset with the name defined earlier.

Last but not least, once the dataset is created, we can create the new table under the new dataset.

def createADataTable(dataset, table_id):
    table_ref = dataset.table(table_id)
    table = client.create_table(bigquery.Table(table_ref))  # Make an API request.
    return table
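Here is a small usage sketch that ties the two helper functions together, assuming the dataset2 object and the IDs defined in the authentication block above:

# Create the dataset first, then a table inside it.
print(createNewDataSet(dataset2))
new_table = createADataTable(dataset2, table_id)
print("Created table {}".format(new_table.table_id))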

Upload new data to the new dataset and table

There are three steps to upload new data: converting it to ND-JSON, wrapping the JSON string in a file-like object, and configuring the table schema.

First things first: when an app interacts with the BigQuery dataset deployed earlier, we normally use the JSON format, unless users upload CSV files manually in the Google Cloud console or integrate with third-party platforms. JSON is the cheapest option. However, BigQuery requires the data to be uploaded as ND-JSON (newline-delimited JSON) rather than plain JSON.

finalNDJson = roundoneDF.to_json(orient='records', lines=True)  # roundoneDF is a pandas DataFrame
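For reference, each row of the DataFrame becomes one JSON object on its own line. With hypothetical values matching the schema defined below, the ND-JSON string would look roughly like this:

# print(finalNDJson) would output something like:
# {"CategoryID":"cat-001","CategoryDummy":1,"Section Name":"Intro","FinetunedCategory":null,"Article":"Sample text"}
# {"CategoryID":"cat-002","CategoryDummy":2,"Section Name":"Body","FinetunedCategory":null,"Article":"Sample text"}
print(finalNDJson.splitlines()[0])  # inspect the first ND-JSON row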

Secondly, we need to import the io module and wrap the ND-JSON string in an in-memory file-like object before uploading it to BigQuery.

import io

stringio_data = io.StringIO(finalNDJson)

Lastly, we need to define the table schema configuration, including each column name, data type, and mode.

table_schema = [
    {'name': 'CategoryID', 'type': 'STRING', 'mode': 'REQUIRED'},
    {'name': 'CategoryDummy', 'type': 'INTEGER', 'mode': 'REQUIRED'},
    {'name': 'Section Name', 'type': 'STRING', 'mode': 'REQUIRED'},
    {'name': 'FinetunedCategory', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'Article', 'type': 'STRING', 'mode': 'REQUIRED'}
]
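The full script is available through the newsletter below, but as a rough sketch of how the pieces above fit together, the wrapped ND-JSON data and the schema are typically passed to a load job along these lines, reusing the client, project_id, dataset_id and table_id defined earlier:

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[bigquery.SchemaField(col['name'], col['type'], mode=col['mode']) for col in table_schema],
)

# Stream the in-memory ND-JSON data into the target table.
# Note: some client versions expect a binary stream, in which case
# io.BytesIO(finalNDJson.encode('utf-8')) is a safe alternative to StringIO.
load_job = client.load_table_from_file(
    stringio_data,
    f'{project_id}.{dataset_id}.{table_id}',
    job_config=job_config,
)
load_job.result()  # Wait for the load job to complete.
print('Loaded {} rows.'.format(load_job.output_rows))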

Query the dataset saved on BigQuery

At this point the new dataset has been created and the new data uploaded to it. BigQuery is essentially a SQL engine, so standard SQL commands work in BigQuery as well.

queryJob = client.query(
    f"""
    SELECT *
    FROM `{project_id}.{dataset_id}.{table_id}`
    """)

To fetch the data, we use the authenticated client object and its query method.
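As a short sketch of how the results can then be consumed, the job object returned by query() exposes result() for row-by-row iteration and to_dataframe() for a pandas DataFrame (the latter may require pandas and the db-dtypes package to be installed):

# Iterate over the rows returned by the query.
for row in queryJob.result():
    print(row['CategoryID'], row['Article'])

# Or convert the whole result set into a pandas DataFrame.
resultDF = queryJob.to_dataframe()
print(resultDF.head())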

Insert new rows into an existing table

def insertNewRows():
    # Each dict key must match a column name in the table schema.
    rows_to_insert = [
        {"a": "cccc", "b": 902},
        {"a": "ttt", "b": 40}
    ]
    # insert_rows_json expects the fully qualified table ID.
    errors = client.insert_rows_json(f"{project_id}.{dataset_id}.{table_id}", rows_to_insert)
    return errors
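A brief usage note: insert_rows_json returns an empty list when every row is streamed successfully, so the return value of the function above can be checked like this:

errors = insertNewRows()
if not errors:
    print("New rows have been added.")
else:
    print("Errors occurred while inserting rows: {}".format(errors))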

Full Python Script of Google Big Query CRUD

If you are interested in Chapter 87 – Interact with Google Big Query ML Pre-trained Model Dataset Using Python, please subscribe to our newsletter by adding the message ‘Chapter 87 Big query script’. We will send the script to your mailbox right away.

I hope you enjoy reading Chapter 87 – Interact with Google Big Query ML Pre-trained Model Dataset Using Python. If you did, please support us by doing one of the things listed below, because it always helps out our channel.