
How to Build and Scale a Data Platform with the Google Cloud Ecosystem

An e-commerce journey of building a data platform with Google's tool stack

Building a data platform from scratch can be both exciting and daunting.

Whether you are an early- or late-stage company, your choices in building a foundational data platform will likely have a material impact on the business for years to come.

Of course, the initial objective should be to resolve the most burning business issues (e.g., reducing repetitive, manual analysis).

However, the long-term objective should be to help you scale the business by easing your decision-making process.

And with these two objectives in mind, this blog post will present our e-commerce journey of building a data platform from scratch using Google’s tool stack.

First, we’ll explain the business requirements that triggered our decision to build a data platform.

Then, we’ll share which Google tools we used at the beginning of the project and how our data architecture has evolved around the available Google Cloud services.

Lastly, we’ll point out what to pay attention to when building a data platform if you decide to go down this path yourself.

Deciding: why build a data platform?

Our plan to build a data platform started in 2020 after a sudden boom in the e-commerce sector.

With growth higher than expected, we were experiencing two significant issues that impacted our decision-making process:

  • Issue #1 | The initial requirement:

As an e-commerce company with 40+ shops, we had data silo problems. In other words, our main data sources – sales histories, performance marketing data sources (Google Analytics and 15+ other traffic sources), financial reports, etc. – were isolated.

This resulted in increased manual work in the business departments: colleagues manually downloaded data from the different sources and combined it in Google Sheets to create cross-dataset insights. Consequently, we needed a detailed controlling framework.

So, our initial business requirement for the data platform was to solve the data island problem and automate the creation of cross-dataset (and cross-department) data insights.

  • Issue #2 | The long-term requirement:

To provide our customers with a better service, we needed to develop more advanced analytical use cases:

  • Demand forecasting models for optimizing the stock levels.
  • Market basket analysis models for creating better newsletter offers.
  • Dynamic pricing models using competitors’ prices for better pricing strategies.
  • Customer segmentation models for better understanding our customers’ shopping preferences and providing them with customized offers and loyalty discounts.

Hence, our long-term requirement for the data platform was to help us scale the business by enabling the development of more advanced analytical use cases.

With these two requirements in mind, we will share our journey of building and scaling a data platform.

Developing: how to build and scale a data platform?

After reaching a joint business and technical decision to build a data platform, we started evaluating the best cloud solution for us.

The main criteria were that the cloud provider could support us in scaling up (or down) and that we could easily control our costs.

When it came to the scaling criterion, all the cloud providers we evaluated could support our business requirements. However, pricing differed between the vendors, and we found that Google Cloud offered more discounts on the services we planned to use. In addition, as we were already using Google Ads intensively, it was easier for us to get internal and external consultancy support.

And this is why we decided to adopt the Google Cloud Platform.

Accordingly, we started architecting our initial data layers:

As shown in the image above, the initial data architecture consisted of the following layers:

  • #1: Data collection layer: presents the most relevant data sources that had to be initially imported to our data warehouse.
  • #2: Data integration layer: presents cron jobs used for importing e-commerce datasets and the Funnel.io platform for importing performance marketing datasets to our data warehouse.
  • #3: Data storage layer: presents the selected data warehouse solution, i.e. BigQuery.
  • #4: Data modelling and presentation layer: presents the data analytics platform of choice, i.e. Looker.

To summarize our initial work:

  • First, we worked on creating the data storage layer by importing data from two main clusters of data sources (shop e-commerce datasets and performance marketing sources) into BigQuery (a minimal sketch of such an import job follows this list).
  • Second, we started creating the data modelling & presentation layer by developing cross-dataset self-service data models using Looker.
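To illustrate the integration work, below is a minimal sketch of what one of these cron-driven import jobs could look like, using the google-cloud-bigquery client. The project, dataset, table, and bucket names are hypothetical placeholders, not our production setup.

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

# Hypothetical identifiers used only for illustration.
PROJECT_ID = "my-ecommerce-project"
TABLE_REF = f"{PROJECT_ID}.shop_raw.daily_sales"
SOURCE_URI = "gs://my-ecommerce-exports/sales/2024-01-01.csv"


def load_daily_sales() -> None:
    """Load one day's sales export from Cloud Storage into BigQuery."""
    client = bigquery.Client(project=PROJECT_ID)

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row
        autodetect=True,      # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Start the load job and block until it finishes (raises on failure).
    load_job = client.load_table_from_uri(SOURCE_URI, TABLE_REF, job_config=job_config)
    load_job.result()

    table = client.get_table(TABLE_REF)
    print(f"Loaded table now has {table.num_rows} rows: {TABLE_REF}")


if __name__ == "__main__":
    load_daily_sales()
```

In a setup like ours, a cron entry simply invokes such a script per shop and per day, while Funnel.io pushes the performance marketing data into BigQuery on its own schedule.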

Our initial resources for building a data platform on Google Cloud can be quantified as follows:

  • 2 tools – BigQuery and Looker,
  • 6 people – for managing data pipelines (cron jobs + Funnel.io platform) and initial analytical requirements (data modelling),
  • 3 months – from acquiring Google Cloud to presenting the first analytical insights.

It is essential to mention that initially, we didn’t have a dedicated data team to work on building a data platform.

Instead, we distributed the work between two departments – five colleagues from the IT department (working on the data pipelines) and one from the Business Development department (working on the data modelling).

And with the listed resources and organizational structure, we achieved our initial business objective and automated the creation of cross-dataset data insights.


From then on, the business requirements for data insights only grew, and the plan was to begin developing more advanced analytical use cases.

This resulted in changes in the data architecture and extension of the layers with new Google services:

As shown in the image above, we extended our data architecture with a data preprocessing layer and started using new Google Cloud services and tools:

  • Cloud Storage – for storing our external files in Google Cloud.
  • Cloud Run – used for deploying analytical pipelines developed in Python and wrapped as Flask applications (a minimal sketch follows this list).
  • Cloud Functions – for writing simple, single-purpose functions attached to events emitted from the cloud services.
  • Cloud Workflows – used for orchestrating connected analytical pipelines that needed to be executed in a specific order.
  • Google Colab – for creating quick PoC data science models.
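To make the Cloud Run item more tangible, here is a simplified, hypothetical sketch of an analytical pipeline wrapped as a Flask application; the endpoint, query, and table names are illustrative assumptions rather than our actual implementation.

```python
# pip install flask google-cloud-bigquery
# Packaged as a container and deployed to Cloud Run; a scheduler or an
# orchestration step can then trigger the pipeline via a plain HTTP request.
import os

from flask import Flask, jsonify
from google.cloud import bigquery

app = Flask(__name__)
bq_client = bigquery.Client()

# Hypothetical input query used only for illustration.
DEMAND_INPUT_QUERY = """
    SELECT shop_id, sku, order_date, SUM(quantity) AS units_sold
    FROM `my-ecommerce-project.shop_raw.daily_sales`
    GROUP BY shop_id, sku, order_date
"""


@app.route("/run-pipeline", methods=["POST"])
def run_pipeline():
    """Run one pipeline step and report how many input rows were processed."""
    rows = bq_client.query(DEMAND_INPUT_QUERY).result()
    processed = sum(1 for _ in rows)
    # ... the actual model training / scoring would happen here ...
    return jsonify({"status": "ok", "rows_processed": processed})


if __name__ == "__main__":
    # Cloud Run injects the PORT environment variable into the container.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```

Several such services, each owning one pipeline step, can then be chained by the orchestration layer so they run in the required order.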

With the scaled data architecture, our resources grew as well:

  • From 2 to 7 tools – from using only BigQuery and Looker, we started also using Cloud Storage, Cloud Run, Cloud Functions, Cloud Workflows, and Google Colab.
  • From 6 people in two teams (IT and Business Development) to 8 people in one team (Data and Analytics) – the Data and Analytics team was established and now has complete ownership over all data layers.
  • From 3 months for creating initial insights to 2+ years of continuous development – we are gradually developing more advanced analytical use cases.

And this is where we currently are – working actively on delivering new, more advanced analytical use cases to ease our decision-making process and better support our customers.

To conclude, we will share the main takeaways on what to focus on if you build a data platform from scratch.

Summarizing: what to pay attention to?

Before starting to build a data platform in the cloud, think about the following two topics:

  • What are your priorities and burning issues? – prioritize the use cases that the data platform should resolve promptly and that can generate immediate business value.
  • What are your constraints? – think and quantify everything – from software and human resources to time and effort required, level of internal knowledge, and monetary resources.

During this planning phase, keep in mind two aspects:

  • Start with quick wins – don’t dive directly into data science and machine learning model development; instead, start with quick-win use cases (usually descriptive statistics use cases).
  • Be realistic – when setting the data platform goals, the important thing is to be realistic about what’s feasible to achieve given current constraints.

In addition, during the development of the data platform, pay special attention to the following:

  • Building data pipelines – properly developed data pipelines will save you money, time, and nerves. This is the most crucial part of the development: make sure your pipelines are properly tested and deliver new data to business users without constant breakdowns due to various data and system exceptions (a simplified sketch follows this list).
  • Organizing and maintaining the data warehouse – with the new data sources, a data warehouse can quickly become messy. Implement development standards and naming conventions for a better data warehouse organization.
  • Data preprocessing – think about acquiring data preprocessing tool(s) as early as possible to improve the dashboard performance and reduce computational costs by de-normalizing your datasets.
  • Data governance and security – set the internal standards and data policies on the data lifecycle (data gathering, storing, processing, and disposal).
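As an example of what properly tested pipelines can mean in practice, below is a simplified sketch of a load step with basic validation and error handling; the table names and checks are hypothetical and would need to be adapted to your own sources.

```python
# pip install google-cloud-bigquery
import logging

from google.cloud import bigquery

logging.basicConfig(level=logging.INFO)

# Hypothetical identifiers used only for illustration.
STAGING_TABLE = "my-ecommerce-project.staging.daily_sales"
TARGET_TABLE = "my-ecommerce-project.core.daily_sales"


def validate_staging(client: bigquery.Client) -> None:
    """Fail fast if the staging data is empty or contains obviously broken rows."""
    query = f"""
        SELECT
            COUNT(*) AS row_count,
            COUNTIF(quantity < 0 OR price IS NULL) AS bad_rows
        FROM `{STAGING_TABLE}`
    """
    stats = list(client.query(query).result())[0]
    if stats.row_count == 0:
        raise ValueError("Staging table is empty - the upstream export probably failed.")
    if stats.bad_rows > 0:
        raise ValueError(f"{stats.bad_rows} staging rows failed the validation checks.")


def publish_to_target(client: bigquery.Client) -> None:
    """Append the validated staging rows to the table business users query."""
    client.query(f"INSERT INTO `{TARGET_TABLE}` SELECT * FROM `{STAGING_TABLE}`").result()


def run() -> None:
    client = bigquery.Client()
    try:
        validate_staging(client)
        publish_to_target(client)
        logging.info("Pipeline finished successfully.")
    except Exception:
        # Alert the team here (e.g. a chat or email notification) instead of
        # silently delivering broken data to business users.
        logging.exception("Pipeline failed - the target table was left untouched.")
        raise


if __name__ == "__main__":
    run()
```

The idea is simply that a pipeline should refuse to publish anything it cannot validate and should alert loudly when it fails, rather than quietly breaking dashboards downstream.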

And with these takeaways, we conclude our post.

We hope you will find this post helpful if you decide to build your data platform.


*Credits: Thanks to my colleague Stefan Gajanovic, who helped visualize the changed data architecture.

