IoT on GCP

Using Google Cloud Platform to capture, store and analyze IoT data from an Arduino.

Paul Bruffett
6 min read · Mar 3, 2022

I will be building an end-to-end architecture for capturing environmental data using an Arduino MKR WiFi 1010 and Env Shield to publish data to GCP for reporting and analytics.

CI/CD Setup

This repository contains our environment definition in Terraform. GitHub Actions will call Terraform Cloud every time we commit to the default branch in order to update our infrastructure.

In the Google Cloud Console, create or select the Project you’re going to use. In IAM, create a Service Account and assign it “Owner” permissions; we will be creating another service account using Terraform. Select the Service Account, navigate to Keys, generate a key, and download the JSON key definition.

In Terraform Cloud we need to set up a Workspace; select “API-driven workflow” for the type. Under “Variables”, register a new variable (I named mine “GOOGLE_CREDENTIALS”); for the Value, paste the entirety of the JSON file you downloaded after defining your Service Account, and flag it as “Sensitive”. We also need to register a variable for our GCP project number; I named this “PROJECT_NUMBER” and pasted the ID of the project here.

Next, we need to generate a token for Terraform Cloud so that GitHub can call it. Navigate to your profile in the top right, select “User settings”, then “Tokens”, and generate a token.

Finally, back in GitHub, register this token as a repository secret under “Settings”, then “Secrets” and “Actions”. Save it as “TF_KEY” and enter the token as the Value.

GCP Setup

Google Cloud requires us to enable APIs before we can populate our environment, so in the Console, navigate to “APIs & Services” and select “+ Enable APIs and Services”. We need to enable “Cloud Pub/Sub”, “Identity and Access Management (IAM) API”, “Google Cloud IoT API”, “Compute Engine API”, “Cloud Autoscaling API”, and “Dataflow API”.

Now committing to the repository should successfully build our environment. Key things in the header of main.tf will depend on your environment, specifically:

The organization should be the organization you set up in Terraform Cloud, and the Workspace should map to yours as well. All of the other variables will work as-is if you named your secrets in Terraform Cloud as I suggested.

Setting up the Arduino

This repository contains the Arduino code for capturing environmental readings and publishing them to GCP via MQTT.

To begin we need to generate a key pair, which will be used to authenticate our device to Google Cloud IoT. This requires OpenSSL and the gcloud CLI, and can be done from a terminal with the following commands:

openssl ecparam -genkey -name prime256v1 -noout -out ec_private.pem
openssl ec -in ec_private.pem -pubout -out ec_public.pem
openssl ec -in ec_private.pem -noout -text
gcloud auth login
gcloud config set project <project ID>
gcloud iot devices create arduino-reader --region=us-central1 --registry=arduino-registry --public-key path=ec_public.pem,type=es256

Here we’re generating certificates, a public and private key with the first two commands. The third command outputs text, we’ll need the text under “Private-Key: (256 bit)”. Copy the text that should consist of pairs of numbers or letter separated by “:”, we’ll need it later.

The final three commands log in to gcloud, set the project, and use the public key to register the Arduino as “arduino-reader” in the “arduino-registry” IoT Core registry that we created in Terraform. This could be done using the web UI instead.
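
As a side note, the private key itself is never sent to Google; the device uses it to sign a JSON Web Token that the Cloud IoT Core MQTT bridge accepts as the connection password. Here is a minimal Python sketch of roughly what the Arduino code does in C, assuming the PyJWT and cryptography packages are installed (the function name is mine):

from datetime import datetime, timedelta, timezone
import jwt  # PyJWT

# Sketch of the ES256 JWT a Cloud IoT Core device presents as its MQTT password.
def create_iot_jwt(project_id, private_key_path="ec_private.pem", ttl_minutes=60):
    with open(private_key_path, "r") as f:
        private_key = f.read()
    now = datetime.now(timezone.utc)
    claims = {
        "iat": now,                                   # issued at
        "exp": now + timedelta(minutes=ttl_minutes),  # expiry, 24 hours maximum
        "aud": project_id,                            # audience is the GCP project ID
    }
    return jwt.encode(claims, private_key, algorithm="ES256")

# token = create_iot_jwt("<your-gcp-project-ID>")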

Now that we have credentials and a registered device in our GCP project, we need to configure our secrets on the Arduino and start capturing data.

Starting Data Capture

Our Arduino code expects a “secrets.h” file to reside in the same directory as our “gcpmqtt.ino” file. The secrets file should contain:

const char* ssid = "<your-wifi-ssid>";
const char* password = "<your-wifi-password>";

// Cloud IoT details.
const char* project_id = "<your-gcp-project-ID>";
const char* location = "us-central1";
const char* registry_id = "arduino-registry";
const char* device_id = "arduino-reader";

// Generate the private-key text with:
// openssl ec -in ec_private.pem -noout -text
const char* private_key_str = "<Paste the private key text captured above>";

// Time (seconds) to expire token += 20 minutes for drift
const int jwt_exp_secs = 3600;  // Maximum 24H (3600*24)

// In case we ever need extra topics
const int ex_num_topics = 0;
const char* ex_topics[ex_num_topics];

Once we’ve filled in these values we’re ready to program the device and start sending data to GCP; the serial monitor should show the device connecting and publishing readings.
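
If you want to sanity-check that messages are actually reaching GCP before wiring up Dataflow, you can pull a few from the topic with the google-cloud-pubsub client. This is a sketch that assumes you’ve created a test subscription (here called “arduino-telemetry-debug”) on the “arduino-telemetry” topic:

import json
from google.cloud import pubsub_v1

project_id = "<your-gcp-project-ID>"
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "arduino-telemetry-debug")

# Pull a handful of telemetry messages and print the decoded JSON payloads.
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 5}, timeout=10.0)
for received in response.received_messages:
    print(json.loads(received.message.data))

# Acknowledge what we pulled so the debug subscription doesn't build up a backlog.
if response.received_messages:
    subscriber.acknowledge(request={
        "subscription": subscription_path,
        "ack_ids": [r.ack_id for r in response.received_messages],
    })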

Data Pipeline

Now that we have data being transmitted, let’s move it into BigQuery using Cloud Dataflow. I wasn’t able to script this portion in Terraform, so register the job with the CLI (or the equivalent console configuration):

gcloud dataflow jobs run readings-to-bq \
  --gcs-location gs://dataflow-templates-us-central1/latest/PubSub_to_BigQuery \
  --region us-central1 \
  --max-workers 3 \
  --num-workers 1 \
  --service-account-email <YOUR_DATAFLOW_ACCOUNT>.iam.gserviceaccount.com \
  --staging-location gs://pb-temp-gcs/files \
  --parameters inputTopic=projects/<YOUR_PROJECT_ID>/topics/arduino-telemetry,outputTableSpec=<YOUR_PROJECT_ID>:sensor_data.arduino,outputDeadletterTable=<YOUR_PROJECT_ID>:sensor_data.arduinofailed

Most of this information will be the same if you left the Terraform script alone. The key difference is the service account email: it should start with “dataflow” and can be found in IAM under Service Accounts.

One other note: if you’re configuring the job in the UI or via REST, specify “enable_prime” under “Additional experiments”. This should significantly lower the cost of running this job if you keep it for an extended period of time.

Our running job:

BigQuery

Now we should have data showing up in BigQuery under <Project ID>.sensor_data.arduino.
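
A quick query with the google-cloud-bigquery client (or the BigQuery console) confirms rows are landing; the column names below match the telemetry payload used throughout this post:

from google.cloud import bigquery

client = bigquery.Client(project="<your-gcp-project-ID>")
query = """
    SELECT timestamp, temp, humidity, pressure, illuminance
    FROM `<your-gcp-project-ID>.sensor_data.arduino`
    ORDER BY timestamp DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))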

Enhancement

So we have data flowing end-to-end, but the timestamp really should be a DATETIME rather than an integer epoch value, so let’s create a notebook to transform it.

We can do interactive development of our Dataflow job using the Dataflow Workbench. Create a User-Managed Notebook and upload “parse_streaming.ipynb” from the same repo as our Terraform code.

This notebook demonstrates connecting to the Pub/Sub topic our IoT telemetry is being published to and parsing it using Python and Beam.
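
The cells shown below assume the usual Dataflow notebook setup has already run; roughly something like this (the topic path uses a placeholder project ID):

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition
from apache_beam.runners import DataflowRunner
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

# Full path of the Pub/Sub topic the Arduino publishes to.
topic = "projects/<your-gcp-project-ID>/topics/arduino-telemetry"

# Streaming options for the interactive pipeline.
options = PipelineOptions(streaming=True)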

Parsing the date:

from datetime import datetime

# Parse the epoch timestamp into a date-time string in my timezone
def to_timezone(timestamp):
    date = datetime.fromtimestamp(timestamp)
    date_format = '%Y-%m-%d %H:%M:%S'
    date = date.strftime(date_format)
    return date
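
For example, an epoch value from early March 2022 converts like this (the exact string depends on the notebook VM’s timezone, since fromtimestamp uses local time):

print(to_timezone(1646300000))  # '2022-03-03 09:33:20' on a machine set to UTC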

Running this in a Beam Pipeline:

# Add a step to the pipeline to parse the timestamp
p = beam.Pipeline(InteractiveRunner(), options=options)

pubsub = (p | "Read Topic" >> ReadFromPubSub(topic=topic)
            | beam.Map(json.loads)
            | beam.Map(lambda x: {"timestamp": to_timezone(x['timestamp']), "temp": x['temp'],
                                  "humidity": x['humidity'], "pressure": x['pressure'],
                                  "illuminance": x['illuminance']}))

ib.show(pubsub)

Here we show interactive results from our processing pipeline. We can launch this same code as a streaming job that will be visible in Dataflow and populate another table:

def streaming_pipeline(project, region="us-central1"):
    from datetime import datetime

    # Parse the epoch timestamp into a date-time string in my timezone
    def to_timezone(timestamp):
        date = datetime.fromtimestamp(timestamp)
        date_format = '%Y-%m-%d %H:%M:%S'
        date = date.strftime(date_format)
        return date

    table = "data2-340001:sensor_data.arduino_prepared"
    schema = "timestamp:datetime,temp:float,humidity:float,pressure:float,illuminance:float"

    options = PipelineOptions(
        streaming=True,
        project=project,
        region=region,
        staging_location="gs://pb-temp-gcs/files",  # change to your bucket
        temp_location="gs://pb-temp-gcs/temp"       # change to your bucket
    )

    p = beam.Pipeline(DataflowRunner(), options=options)

    pubsub = (p | "Read Topic" >> ReadFromPubSub(topic=topic)
                | beam.Map(json.loads)
                | beam.Map(lambda x: {"timestamp": to_timezone(x['timestamp']), "temp": x['temp'],
                                      "humidity": x['humidity'], "pressure": x['pressure'],
                                      "illuminance": x['illuminance']}))

    pubsub | "Write To BigQuery" >> WriteToBigQuery(table=table, schema=schema,
                                                    create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
                                                    write_disposition=BigQueryDisposition.WRITE_APPEND)

    return p.run()
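
Launching it from the notebook is then a single call; the DataflowRunner returns a PipelineResult and the job appears in the Dataflow console shortly after (project ID is a placeholder):

result = streaming_pipeline(project="<your-gcp-project-ID>")
print(result.state)  # e.g. RUNNING once the Dataflow job has started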

In the future, we can build a model and maybe some materialized views.
