Fun with BigQuery and R – Building a Google Analytics Alternative
A few weeks ago, I started getting my hands dirty with Google BigQuery, an enterprise cloud data warehouse. (Keep reading if you need a little more background on it than that.)
P.S. The second half of the title is a joke – I love Google Analytics, and not just because it's hard to replace and elegantly built; this was simply a fun thought experiment.
During that period, I was asked to push data from Google Analytics to Google BigQuery. As I got on with it, the first problem I faced was that BigQuery accepts data exported only from Google Analytics Premium, which can be an obstacle for startups or low-budget firms, as it is quite expensive.
So the only option left to consider was third-party applications. After a bit of research, I stumbled upon a GitHub repo by Mark Edmondson that allows streaming Google Tag Manager data to BigQuery using a Python app, which takes the data received from GTM and transforms it into the BigQuery schema. Naturally, this also runs on Google App Engine 🙂
What is BigQuery?
Think about large sets of data. By large, I mean a few TBs or even PBs of data that not only need storing, but also managing, interpreting, analyzing, and manipulating. All these big words put together form a deadly combination, enough to send any analytics newbie running for the hills.
If that is not enough, imagine querying your way through those massive datasets minus the expensive hardware setup that normally makes the job easier and more automated.
When it comes to data analytics, you are going to need to do all that and more. This is where BigQuery comes in – a fast, hassle-free alternative for analytics rookies, combined with the power of Google's robust infrastructure.
The best thing about BigQuery as a RESTful web service is its low price, bundled with a fully managed infrastructure that offers advanced query processing on huge chunks of data.
To integrate a third-party tool (in this case a Python app) with Google BigQuery, there are a few simple steps you need to take:
Getting and Streaming Data:
First you need a Google Cloud account, as BigQuery is part of Google Cloud and operates within a Google Cloud project. Once that is taken care of, you need to create the tables, which is fairly easy in BigQuery. If you need any further assistance, refer to the extensive documentation at https://cloud.google.com/bigquery/docs/ .
The next important step is creating the schema.
So what is a schema?
A schema is basically the structure of a database: the names of its tables, the columns they contain, and the type of data each field will hold.
This is extremely important, as we have to be sure about exactly which data we need. Here, we needed basic GTM values like clientId, page views, and other related fields. Matching the table columns to your business needs up front is crucial, because every time you want to alter a table's schema, you have to delete the table and recreate it. BigQuery does have the option to add a field, but only at the end of the table, and even that can't be removed.
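To make this concrete, here is a minimal sketch of a schema for the kind of GTM hit data described above, written in the JSON schema format that BigQuery's `bq` CLI accepts. The field names and dataset/table names are illustrative, not the ones we actually used – pick yours to match whatever your tags will send, since changing them later is painful.

```python
# Sketch: a BigQuery JSON schema file for streamed GTM hits.
# All field names here are hypothetical examples.
import json

schema = [
    {"name": "clientId",  "type": "STRING",    "mode": "REQUIRED"},
    {"name": "pagePath",  "type": "STRING",    "mode": "NULLABLE"},
    {"name": "event",     "type": "STRING",    "mode": "NULLABLE"},
    {"name": "timestamp", "type": "TIMESTAMP", "mode": "REQUIRED"},
]

# Write the schema file that bq can consume.
with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)

# Then create the table once with the bq CLI, e.g.:
#   bq mk --table my_dataset.gtm_hits ./schema.json
```

The NULLABLE/REQUIRED modes matter later: required fields must be present in every streamed row.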
Next, you need to make the appropriate changes in the code and deploy the app to Google Cloud so that it can capture incoming data and pass it on to BigQuery. Since I was new to all this, I got stuck at pretty much every point here, from setting up the SDK to making changes to the YAML file and deploying to App Engine. After successfully deploying the app, we had to actually send the data.
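For anyone wondering what "the YAML file" is, the changes boil down to an `app.yaml` that tells App Engine how to run the app. This is only a rough sketch of what ours looked like at the time (runtime version and handler names are assumptions, not taken from Mark Edmondson's repo):

```yaml
# Minimal app.yaml sketch for a Python App Engine app of that era.
# Adjust runtime and script to match your actual app.
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /.*
  script: main.app
```

With that in place, deploying is a single command: `gcloud app deploy app.yaml`.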
Google Tag Manager to deliver the goods
To collect data and move it to App Engine, we made custom tags that capture the data from our website and stream it to BigQuery through our app.
This is the part where I faced most of the problems. Our data was somewhat dynamic: some events fire on a page and some don't. Handling that in GTM is fairly simple, but we ran out of options when all these values had to be stored in real time in a single table. So deciding when to fire a certain tag, when to send certain data to the BigQuery table, and when to send a NULL is also very important. The process wasn't painless, and, being as clumsy as any regular developer, I made a lot of common mistakes like missing out a variable. So it's quite normal if the process doesn't go as smoothly as you'd reckon.
Another problem with streaming was that I had to send values for all the fields. There were no defaults: if a field was missed, a NULL was not sent automatically. This also makes it necessary to know for sure what your schema will be beforehand.
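In practice, this means padding every row yourself before streaming it. A small helper like this (hypothetical – not from the actual app) makes sure each row carries every schema field, with an explicit `None` (which BigQuery stores as NULL) for anything the page didn't send:

```python
# Sketch: since streaming inserts give us no defaults, every row must
# contain every schema field. Field names are illustrative.
SCHEMA_FIELDS = ["clientId", "pagePath", "event", "timestamp"]

def complete_row(hit):
    """Return a row dict with every schema field, None where absent."""
    return {field: hit.get(field) for field in SCHEMA_FIELDS}

row = complete_row({"clientId": "abc123", "pagePath": "/home"})
# row -> {'clientId': 'abc123', 'pagePath': '/home', 'event': None, 'timestamp': None}
```

The resulting dict can then be handed to whatever streaming-insert call your app uses.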
After banging our heads against the wall and using Google to its fullest, we resolved our errors; the app was successfully deployed and started streaming our data to BigQuery.
The next step was to visualize our data and plot some sort of forecast for it. The best language for this purpose is R, as it has over 7,000 packages dealing with plotting, visualization, and forecasting.
If I am being honest, we knew almost nothing about working with this one-letter language.
But there is, once again, a repo on GitHub by Mark Edmondson with an R Shiny app for visualizing and forecasting Google BigQuery data.
First we tried to deploy the app to Google App Engine, but since it offered only a limited set of runtime environments, we failed miserably at setting up a custom environment for our R Shiny app.
After making a few changes to the Shiny app code and running it locally, a few issues with Google authorization came up at first, but luckily they were resolved.
The challenge at hand was now querying and retrieving the BigQuery fields. The problem was fairly silly: there is a check for consistency between the declared field names and the fields actually used, meaning we could query only the exact fields mentioned in a variable. Querying anything less or more would generate an error, and to be honest, R errors are really weird and difficult to comprehend, so they are something you definitely want to avoid.
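The Shiny app's check is R code, but the idea is easy to show in a few lines of Python (a hypothetical re-creation, not the app's actual code): the queried field list has to match the declared field list exactly, or the query is rejected before it ever reaches BigQuery.

```python
# Sketch of the field-consistency check that tripped us up.
# Table and field names are illustrative.
TABLE = "my_dataset.gtm_hits"
DECLARED_FIELDS = ["clientId", "pagePath", "timestamp"]

def build_query(fields):
    """Build a SELECT only if the fields match the declared list exactly."""
    if sorted(fields) != sorted(DECLARED_FIELDS):
        raise ValueError("queried fields must match the declared fields exactly")
    return "SELECT {} FROM {}".format(", ".join(fields), TABLE)

print(build_query(["clientId", "pagePath", "timestamp"]))
# SELECT clientId, pagePath, timestamp FROM my_dataset.gtm_hits
```

Asking for one field fewer (or one extra) raises an error, which is exactly the behavior we kept running into.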
In the end, we successfully deployed the app to shinyapps.io through RStudio, which is basically click-and-deploy. We still wanted to make a few more changes, but not having any luck, we decided to give it a rest. Essentially, what the app shows is the number of hits per timestamp and a forecast for the next couple of hours.
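The actual forecasting happens in R, but to give a feel for what "a forecast for the next couple of hours" amounts to, here is a rough, dependency-free Python analogue using simple exponential smoothing over hourly hit counts (the numbers are made up for illustration):

```python
# Sketch: simple exponential smoothing (SES) over hourly hit counts.
# SES produces a flat forecast from the final smoothed level.
def ses_forecast(series, alpha=0.5, horizon=3):
    """Smooth the series, then repeat the last level over the horizon."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * horizon

hourly_hits = [120, 130, 125, 140, 150]  # illustrative counts
print(ses_forecast(hourly_hits, alpha=0.5, horizon=2))
# [141.25, 141.25]
```

R's forecasting packages do far more than this (trend, seasonality, confidence intervals), which is why the app stayed in R.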
How was it?
This was a fairly new experience for me – painful, but fun to do. Exploring different techniques and tools is always the best way to discover something new and augment your skills.
Next time, while working with something relatively new, I will no doubt be bombarded with problems and bugs again, but I'll make sure to share them with you along with proper solutions. Also, a big thanks to Mark Edmondson, who was very kind in answering what were surely, to him, noob questions.