Thursday, June 7, 2018

BigQuery dataloading with the bq CLI



Sometimes you just want a simple way of importing data into BigQuery without streaming, enrichment or complex transformations. Cloud Dataflow might be too much, and the import through the webpage will not handle large amounts of data, bq load is the CLI way of doing this.

For this demo I'll use demo data from Oslobysykkel.no, which is a bike sharing service here in Oslo. Their data is out in csv and json format at https://developer.oslobysykkel.no/data

I download all the data to my local computer, and now I want to upload it to Google Cloud Storage. I've installed and setup the Google Cloud SDK from here.

Except for the initial upload everything I could've done everything here using the Cloud Shell from within the GCP console, I just like working on my laptop.

mymac:dwn $ gcloud config list
[compute]
region = europe-west3
zone = europe-west3-b
[core]
account = magnus.fagertun@emailprovider.com
disable_usage_reporting = False
project = healthy-basis-19xxxx

Your active configuration is: [default]

Let’s create a bucket in Europe and upload to this bucket, using -m for parallel uploads.

mymac:dwn $ gsutil mb -c regional -l europe-west3 gs://bysykkelbucket
Creating gs://bysykkelbucket/...

Bigquery supports csv data that is gzipped or uncompressed, but these are .zip files, so we will uncompress before uploading. Uncompressed csv is usually faster, and can be read in parallel as well, not that it's needed for these small data volumes.

mymac:dwn $ for i in `ls trips*.csv.zip`;do unzip $i;done
...
Archive:  trips-2017.9.1-2017.9.30.csv.zip
 inflating: trips-2017.9.1-2017.9.30.csv  

mymac:dwn $ gsutil -m cp tri*.csv gs://bysykkelbucket
Copying file://trips-2016.12.1-2016.12.31.csv [Content-Type=text/csv]...
Copying file://trips-2017.9.1-2017.9.30.csv [Content-Type=text/csv]...          
...
| [18/18 files][278.0 MiB/278.0 MiB] 100% Done   8.5 MiB/s ETA 00:00:00
Operation completed over 18 objects/278.0 MiB.        


mymac:dwn $ gsutil ls -lh gs://bysykkelbucket|head -n 2
13.89 MiB  2018-02-19T10:36:28Z  gs://bysykkelbucket/trips-2016.10.1-2016.10.31.csv
 6.18 MiB  2018-02-19T10:36:19Z  gs://bysykkelbucket/trips-2016.11.1-2016.11.30.csv

A quick peek of the data using gsutil cat, selecting only the first 256 bytes.

mymac:dwn $ gsutil cat -r 0-256 gs://bysykkelbucket/trips-2016.11.1-2016.11.30.csv
Start station,Start time,End station,End time
213,2016-11-01 06:00:02 +0100,163,2016-11-01 06:12:24 +0100
283,2016-11-01 06:00:04 +0100,280,2016-11-01 06:07:04 +0100
266,2016-11-01 06:00:11 +0100,196,2016-11-01 06:09:31 +0100

As we can see, the date format is not according to BQ standard, so this needs to be parsed at a later stage.

There are two easy ways to load this data, either loading into BigQuery and then transforming, or using federated queries to query the data where it is in the storage bucket, and then store it to a table in BigQuery.


1. Load into BigQuery, then transform

I will create a new dataset in BigQuery and set this as the default in my shell.
mymac:dwn $ bq mk bysykkel
Dataset 'healthy-basis-19xxxx:bysykkel' successfully created.
mymac:dwn $ echo dataset_id=bysykkel > ~/.bigqueryrc

Now we are ready to load the data. I'll skip the header row, import all into a table named tripsdump. Here I’m specifying the format of the file inline, you can also create a json file with the table definition

mymac:dwn $ bq load --skip_leading_rows=1 tripsdump \ gs://bysykkelbucket/*.csv \ start_station:string,start_time:string,end_station:string,end_time:string
Waiting on bqjob_r21e6f0f984e_000addee25d_1 ... (50s) Current status: DONE  

The data is now loaded into BigQuery, but the data type for dates are not correct. To do this transformation we will use SQL and save the result to a destination table. We could use bq for this as well, but I'll use the web GUI because it is easier to spot typos and errors with the validator.

First we will cast the start and end station as an int ID column, then we will parse the date formats and exclude bad data with nulls.


2. Create an external data source, transform and load into BigQuery with SQL

This method is great for loading large amounts of data, because we load and transform the data in one step. It can also be used to limit the data being loaded with a where clause, or enrich the data from existing tables using SQL.

First we will use the bq mkdef command to create a json table definition.

mymac:dwn $ bq mkdef --noautodetect --source_format=CSV "gs://bysykkelbucket/*.csv" start_station:string,start_time:string,end_station:string,end_time:string > tripsdef.json

Let's open the tripsdef.json and change skipLeadingRows to 1



mymac:dwn $ bq mk --external_table_definition=tripsdef.json bysykkel.trips_ext
Table 'healthy-basis-19xxxx:bysykkel.trips_ext' successfully created.

The external table is now visible in the BigQuery UI, so we run the query we created earlier to clean the data.


If you haven't tried Google Cloud yet, I recommend heading over to cloud.google.com to get started on your free trial.  For further reading on dataloading in BigQuery including newer enhancements like DDL, partitioning and materializing queries, check out this blogpost by @felipehoffa

A few links:
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators

No comments:

Post a Comment