A practical guide to synthetic data generation with Gretel and BigQuery DataFrames

In our previous post, we explored how integrating Gretel with BigQuery DataFrames streamlines synthetic data generation while preserving data privacy. To recap, BigQuery DataFrames is a Python client for BigQuery, providing pandas-compatible APIs with computations pushed down to BigQuery. Gretel offers a comprehensive toolbox for synthetic data generation using cutting-edge machine learning techniques, including large language models (LLMs). Together, they enable a streamlined workflow, allowing users to easily transfer data from BigQuery to Gretel and save the generated results back to BigQuery.

In this guide, we dive into the technical aspects of generating synthetic data to drive AI/ML innovation, while helping to ensure high data quality, privacy protection, and compliance with privacy regulations. We begin by working with a BigQuery patient records table, de-identifying the data in Part 1, and then generating synthetic data to save back to BigQuery in Part 2.


Setting the stage: Installation and configuration

You can start by using BigQuery Studio as the notebook runtime, with BigFrames pre-installed. We assume you have a Google Cloud project set up and are familiar with pandas.

Step 1: Install the Gretel Python client and BigQuery DataFrames:

!pip install -U "gretel-client>=0.22.0"

# Install bigframes if not already
# %%capture
# !pip install bigframes

Step 2: Initialize the Gretel SDK and BigFrames. You'll need a Gretel API key to access Gretel's services; you can obtain one from the Gretel console.

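A minimal sketch of the setup, assuming the high-level `Gretel` interface from the Gretel Python client; the `BigFrames` helper import path and constructor are assumptions, named to match the `gretel_bigframes` handle used in the steps below:

import bigframes.pandas as bpd
from gretel_client import Gretel
# Assumption: the BigFrames integration helper lives in gretel_client.bigquery
from gretel_client.bigquery import BigFrames

# Point BigQuery DataFrames at your Google Cloud project ("your-project" is a placeholder)
bpd.options.bigquery.project = "your-project"

# Prompt for a Gretel API key (create one in the Gretel console)
gretel = Gretel(api_key="prompt", validate=True)

# Wrap the session so Gretel jobs can consume BigFrames DataFrames directly
gretel_bigframes = BigFrames(gretel)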

Part 1: De-identifying and processing data with Gretel Transform v2

Before generating synthetic data, de-identifying personally identifiable information (PII) is a crucial first step towards data anonymization. Gretel's Transform v2 (Tv2) provides a powerful and scalable framework for this and various other data processing tasks. Tv2 combines advanced transformation techniques with named entity recognition (NER) capabilities, enabling efficient handling of large datasets. Beyond PII de-identification, Tv2 can be used for data cleansing, formatting, and other preprocessing steps, making it a versatile tool in the data preparation pipeline. Learn more about Gretel Transform v2.

Step 1: Create a BigFrames DataFrame from your BigQuery table:

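A minimal sketch, assuming a hypothetical patient records table at `your-project.healthcare.patient_records`:

import bigframes.pandas as bpd

# Load the BigQuery table into a BigFrames DataFrame; computation is pushed down to BigQuery
df = bpd.read_gbq("your-project.healthcare.patient_records")

# Preview a handful of rows without materializing the full table client-side
df.peek()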

The transform below operates on columns including `patient_id`, `first_name`, `last_name`, and `sex`: we hash the `patient_id` column and create replacement first and last names based on the value of the `sex` column.

Step 2: Transform the data with Gretel:

# Tv2 config: hash the patient ID and generate sex-appropriate fake names.
# Note: the config header and the patient_id hashing expression follow the
# standard Tv2 layout and are assumptions; the name rules are from the original.
transform_config = """schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - rows:
            update:
              - name: patient_id
                value: this | hash
              - name: first_name
                value: >
                  fake.first_name_female() if row.sex == 'Female' else
                  fake.first_name_male() if row.sex == 'Male' else
                  fake.first_name()
              - name: last_name
                value: fake.last_name()
"""

# Submit a transform job against the BigFrames table
transform_results = gretel_bigframes.submit_transforms(transform_config, df)

# Check out our Model ID; we can re-use this later to restore results.
model_id = transform_results.model_id

print(f"Gretel Model ID: {model_id}\n")
print(f"Gretel Console URL: {transform_results.model_url}")

transform_results.wait_for_completion()
transform_results.refresh()

Step 3: Explore the de-identified data:

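A hedged sketch; the accessor that pulls the transform job's output back into a DataFrame is an assumption:

# Assumption: the transform results expose the de-identified output as a DataFrame
transformed_df = transform_results.transformed_df()

# Inspect the first few de-identified records
transformed_df.head()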

Comparing the original and de-identified data, the `patient_id` values are replaced with hashes, and the `first_name` and `last_name` values are replaced with fake names consistent with each record's `sex` value; all other columns pass through unchanged.

Part 2: Generating synthetic data with Navigator Fine Tuning (LLM-based)

Gretel Navigator Fine Tuning (NavFT) generates high-quality, domain-specific synthetic data by fine-tuning pre-trained models on your datasets. Key features include:

  • Handles multiple data modalities: numeric, categorical, free text, time series, and JSON

  • Maintains complex relationships across data types and rows

  • Can introduce meaningful new patterns, potentially improving ML/AI task performance

  • Balances data utility with privacy protection

NavFT builds on Gretel Navigator's capabilities, enabling the creation of synthetic data that captures the nuances of your specific data, including the distributions and correlations for numeric, categorical, and other column types, while leveraging the strengths of domain-specific pre-trained models. Learn more about Navigator Fine Tuning.

In this example, we will fine-tune a Gretel model on the de-identified data from Part 1.

Step 1: Fine-tune a model:

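A minimal sketch, assuming a `submit_train` method on the BigFrames helper that mirrors `submit_transforms` from Part 1 (the method and parameter names are assumptions, and `event_date` is a hypothetical ordering column):

# Fine-tune Navigator Fine Tuning (NavFT) on the de-identified table.
# Grouping/ordering clusters each patient's records together during training,
# which is why generated records come out clustered per patient (see the notes below).
train_results = gretel_bigframes.submit_train(
    base_config="navigator-ft",  # NavFT blueprint
    dataframe=transformed_df,
    params={
        "group_training_examples_by": "patient_id",
        "order_training_examples_by": "event_date",  # hypothetical column name
    },
)

print(f"Gretel Console URL: {train_results.model_url}")
train_results.wait_for_completion()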

Step 2: Fetch the Gretel Synthetic Data Quality Report:

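A hedged sketch, assuming the training results expose the report object the way Gretel's high-level SDK does:

# Assumption: a `report` attribute holds the Synthetic Data Quality Report
report = train_results.report

# Render the full report inline in the notebook
report.display_in_notebook()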

The report surfaces the high-level metrics of the Gretel Synthetic Data Quality Report. Please see the Gretel documentation for more details about how to interpret this report.


Step 3: Generate synthetic data from the fine-tuned model, evaluate data quality and privacy, and write back to a BQ table.

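A minimal sketch, assuming a `submit_generate` method on the BigFrames helper and a `synthetic_data` accessor on its results (both names are assumptions); the destination table path is a placeholder:

# Generate synthetic records from the fine-tuned model
generate_results = gretel_bigframes.submit_generate(train_results.model_id, num_records=1000)
generate_results.wait_for_completion()

# Assumption: the generated output is exposed as a BigFrames DataFrame
synthetic_df = generate_results.synthetic_data

# Write the synthetic table back to BigQuery
synthetic_df.to_gbq("your-project.healthcare.patient_records_synthetic", if_exists="replace")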

Below is a sample of the final synthetic data, shown as a single wide table (the first three column names are inferred from the values):

  event_type      date        description            provider_name       reason                          result                            details
                                                     Dr. Angela Clinic   Elective right lower lobectomy  Transplant successful             {}
2 Treatment       01/22/2023  IV Immunosuppression   Oral Health Center  Postoperative care              Stable with minimal side effects  {"dosage":"Standard", "frequency":"Twice daily"}
3 Diagnosis Test  01/22/2023  Follow-up Examination  Orthopedic Inst.    Routine check after surgery     No signs of infection or relapse  {}
4 Discharge       01/26/2023                         City Hospital ER    End of hospital stay            Stabilized with normal vitals     {"referral":"Outpatient clinic"}
1 Admission       07/15/2023                         Main Hospital       Initial Checkup                                                   {}

A few things to note about the synthetic data:

  • The various modalities (JSON structures, free text) are preserved and fully synthetic while being semantically correct.

  • Because of the group-by/order-by hyperparameters used during fine-tuning, the records are clustered on a per-patient basis during generation.

How to use BigQuery with Gretel

This technical guide provides a foundation for leveraging Gretel AI and BigQuery DataFrames to generate and utilize synthetic data. By following these examples and exploring the Gretel documentation, you can unlock the power of synthetic data to enhance your data science, analytics, and AI development workflows while ensuring data privacy and compliance.

To learn more about generating synthetic data with BigQuery DataFrames and Gretel, explore the BigQuery DataFrames documentation and the Gretel documentation.

Start generating your own synthetic data today and unlock the full potential of your data!


Googlers Firat Tekiner, Jeff Ferguson and Sandeep Karmarkar contributed to this blog post. Many Googlers contributed to making these features a reality.
