How to unnest / extract nested JSON data in BigQuery
Working with nested JSON data in BigQuery analytics database might be confusing for people new to BigQuery. Yet if done well, nested data structure (JSON) is a very powerful mechanism to better express hierarchical relationships between entities comparing to the conventional flat structure of tables.
In this article, we will show you how to handle nested data in BigQuery. Beside some obvious naming and syntax differences, you can apply the same approaches to handle nested data in many other SQL databases like PostgreSQL, MySQL...
The examples below use the ga_sessions
sample dataset provided by Google. All queries are written with BigQuery's #StandardSQL mode.
How JSON data is stored in BigQuery
In BigQuery, JSON data may be stored in two ways:
- In a column of type "RECORD": This data type is specifically designed to store nested structure data (JSON) in BigQuery.
- In a column of type "STRING": The JSON value is treated just like a normal string that happens to have JSON format. Let's call this "stringified JSON".
Each scenario needs to be handled differently as follow.
Handling BigQuery's JSON data as RECORD type
What is RECORD type?
While PostgreSQL and MySQL have JSON
type, BigQuery uses the RECORD
(or STRUCT
) type to represent nested structure. A column of RECORD
type is in fact a large column containing multiple child columns.
Storing your data in RECORD
columns takes a bit more effort as you need to define the data structure up-front when creating your table. For example, this is from the Create table dialogue in BigQuery:
On the other hand, the explicit structure brings you several benefits:
- Consistency: Your future data is made sure to conform to the pre-defined structure → not have to worry that a valid query today will be invalid tomorrow.
- Performance: You can selectively scan child columns instead of scanning the whole parent
RECORD
column, which will save you query money. - Ease of use: You can use dot notation (
parent_col.child_col
) to query fromRECORD
columns, which is quite convenient.
Your RECORD
column can have a simple nested structure (each row only holds a single record), or it can hold nested and repeated data (each row holds multiple records). In this case, the RECORD
field will have REPEATED
mode:
Querying key-value JSON object (non-repeated RECORD field)
Querying single (non-repeated) RECORD
field is like querying key-value JSON objects. You can simply use the dot notations to access the child columns:
#standardsql
select
visitId
, totals.hits
, totals.pageviews
, totals.timeOnScreen
, trafficSource.campaign
, trafficSource.isTrueDirect
, trafficSource.adwordsClickInfo.adGroupId
, trafficSource.adwordsClickInfo.campaignId
from test.ga_sessions_20170801
You can chain your dot notation to query deeper field values, for example trafficSource.adwordsClickInfo.adGroupId
.
Result:
Querying JSON array (REPEATED RECORD fields)
As briefly mentioned above, REPEATED RECORD field is an array that hold multiple records in a single row. To access the records inside, you have to:
- Use
UNNEST
to flatten the field into a table where each row is an element of the array - Join the flattened table back to the base table.
Sample code:
#standardsql
select
visitId
, h.hitNumber
, h.time
, h.hour
, h.isInteraction
, h.isEntrance
, h.isExit
, h.referer
, h.page.pagePath
from test.ga_sessions_20170801
left join unnest(hits) as h -- No need to specify join key
Now we have one row for each hit (interaction) in the visit. Notice that visitId
value is repeated across four rows:
What if the child column is also a nested column?
In the following example, you can see that hit number three interacted with multiple product SKUs. hits.product
is also a REPEATED RECORD:
We can continue to unnest hits.product
and join back to the base table:
#standardsql
select
visitId
, h.hitNumber
, h.time
, h.hour
, h.isInteraction
, p.productSKU
, p.v2ProductName
, p.productPrice
from test.ga_sessions_20170801
left join unnest(hits) as h
left join unnest(h.product) as p -- No need to specify join key
where visitId = 1501589858
As you can see in the results, each row now represents a product interacted, and values of visitId
, hitNumber
, time
, hour
are repeated in each row.
Handling "stringified JSON" in BigQuery
The other common way JSON data is stored in BigQuery is as STRING
data type. For example:
Storing nested data as plain string requires no pre-defined schema, but it will bring you headaches in the future:
- Inconsistency: No explicit schema means that you can have a column holding both JSON-formatted strings and normal free-form strings. This may make your query result unpredictable.
- Bad performance and high cost: When extracting data from stringified JSON, you have to scan the whole text, which result in high query cost (more money) if your JSON string is large.
- Lower ease of use: As demonstrated below, you will need to specify the exact JSON path to the data you want. While this is not much of a hurdle, it still takes some time to get used to.
Similar to RECORD type, stringified JSON can be a single JSON object, or a JSON array. You can check this for BigQuery's documentation on JSON functions.
Handle single-value JSON string in BigQuery
BigQuery has dedicated functions to handle this format:
json_extract(json_expression, json_path)
will return all JSON valuesjson_extract_scalar(json_expression, json_path)
will return only scalar values (string, number, boolean)
Sample code:
#standardsql
SELECT
visitId
, json_extract_scalar(totals, '$.hits') as total_hits_scalar -- returns a number
, json_extract(totals, '$.hits') as total_hits_json -- returns a number
, json_extract_scalar(trafficSource, '$.adwordsClickInfo') as adwords_scalar -- returns nothing
, json_extract(trafficSource, '$.adwordsClickInfo') as adwords_json -- returns a JSON object
FROM test.test_json_string
As the value of hits
in totals
is a scalar, both functions return the same thing. The value of adwordsClickInfo
in trafficSource
is a JSON object so json_extract_scalar()
returns nothing.
Handle stringified JSON array in BigQuery
With this format, you can use json_extract_array(json_expression[, json_path])
to extract array elements (json_path
is optional).
In the example above, hits
is a stringified JSON array:
#standardsql
SELECT
visitId
, json_extract_array(hits) as hits
FROM test.test_json_string
As you can see json_extract_array(hits)
returns a repeated column. To access elements in the new hits
column, you need to unnest it just like with a REPEATED RECORD column:
#standardsql
SELECT
visitId
, json_extract_scalar(h, '$.hitNumber') as hit_number
, json_extract_scalar(p, '$.productSKU') as product_sku
, json_extract_scalar(p, '$.productBrand') as product_brand
, json_extract_scalar(p, '$.productPrice') as product_price
FROM test.test_json_string
left join unnest(json_extract_array(hits)) as h
left join unnest(json_extract_array(h, '$.product')) as p
The result is a nice flattened table:
Handling BigQuery JSON using Holistics (BI tool)
If you are using Holistics BI, there are a few things you can do to enable your end-users to explore BigQuery's nested JSON structures themselves.
Create Calculated Fields to access child columns at run time
If you have created a Table Model out on top of the source table, you can create calculated fields using the same dot notation mentioned above, for examples:
// total_hits
{{#THIS.totals}}.hits
// traffic_adw_google_click_id
{{#THIS.traffic_source}}.adwordsClickInfo.gclId
This way your end-user can drag and drop the child columns like normal during exploration:
Sorry, your browser doesn't support embedded videos.Write SQL transformation to flatten arrays into a table
With Holistics's modeling layer, you can let your end-user have access to data in nested JSON arrays by:
- Write a SQL model to unnest repeated columns in BigQuery into a flat table
- Set a relationship between this derived SQL model with the base model
- Add the derived SQL model in a dataset to expose it to your end user
Conclusion
With this, we conclude our guide to handling JSON/nested data structures in BigQuery. While we only wrote this in the context of BigQuery, you can apply the same approach to handle JSON fields and arrays in other SQL databases (like PostgreSQL, MySQL 8, Snowflake...)
If you think we have missed any frequently encountered issue when dealing with nested data, feel free to drop us a message.
Happy unnesting!