Using BigQuery's MERGE in Data Engineering

20 Jul, 2023

In the world of data engineering, efficient data integration and manipulation are critical for businesses to extract valuable insights. BigQuery, a fully managed data warehouse solution provided by Google Cloud, offers a powerful tool called the MERGE statement. In this technical blog post, we will dive into the MERGE statement and discuss its benefits for data engineering tasks. To illustrate its usage, we'll walk through an example involving a fictional company called Avayant, a data integrations and marketing attribution service provider. We will replace specific project, dataset, table, and column names with dummy examples.

Understanding the MERGE Statement

The MERGE statement in BigQuery combines the functionality of both INSERT and UPDATE statements, allowing data engineers to perform conditional updates or inserts in a single operation. By leveraging the MERGE statement, we can efficiently synchronize data between tables, handle updates for existing records, and insert new records where necessary.

Example: Synchronizing External Orders in Avayant's Salesforce Data Warehouse

Let's consider an example where Avayant wants to synchronize order data in their Salesforce data warehouse. We'll use the MERGE statement to update existing records and insert new ones based on a matching condition. Here's the example MERGE statement:

MERGE `project-id.dataset.salesforce.Order` AS d
USING (
  SELECT * EXCEPT(row_number)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY orderId) row_number
    FROM `project-id.dataset.stage_Order`
  )
  WHERE row_number = 1
) AS s
ON d.orderId = s.orderId
WHEN MATCHED THEN UPDATE SET
    d.accountId = s.accountId,
    d.countryCode = s.countryCode,
    d.orderAmount = s.orderAmount,
    d.orderDate = s.orderDate
WHEN NOT MATCHED THEN INSERT (
    orderId,
    accountId,
    countryCode,
    orderAmount,
    orderDate
  )
  VALUES (
    s.orderId,
    s.accountId,
    s.countryCode,
    s.orderAmount,
    s.orderDate
  );

Explaining the MERGE Statement:

We define the target table for merging: project-id.dataset.salesforce.Order is the table where the data will be merged.

The source data is derived from the project-id.dataset.stage_Order table, with the help of a subquery that assigns row numbers based on the orderId column.

The ON clause specifies the join condition between the target and source tables, based on the orderId column.

The WHEN MATCHED THEN block defines the update logic for matched records. Each column in the target table is updated with the corresponding column value from the source table.

The WHEN NOT MATCHED THEN block defines the insert logic for new records. Values from the source table are inserted into the target table.

Benefits of the MERGE Statement for Data Engineering:

The MERGE statement offers several advantages for data engineering tasks:

Efficiency: Performing both updates and inserts in a single operation reduces the number of queries and minimizes network traffic, leading to improved performance and reduced costs.

Simplicity: The MERGE statement simplifies complex synchronization tasks by providing a concise and structured approach.

Atomicity: The operation is atomic, ensuring that either all updates and inserts succeed or none of them do, maintaining data integrity.

Flexibility: The MERGE statement allows for conditional updates and inserts, giving data engineers fine-grained control over the synchronization process.

Scalability: BigQuery's parallel processing capabilities ensure that the MERGE statement can handle large datasets efficiently.

The MERGE statement in BigQuery empowers data engineers with a powerful tool to efficiently synchronize and update data in a single operation. By leveraging the benefits of the MERGE statement, Avayant, and other data integration service providers can streamline their data engineering workflows, reduce costs, and improve overall efficiency.

#bigquery #data-engineering