Data infrastructure through the eyes of a data scientist.

Michael Pearmain
Towards Data Science
8 min read · Nov 2, 2018


Photo by chuttersnap on Unsplash

Data science is a hot topic today, with companies talking about being "Data Driven", "Data Informed", or "Data Centric" and about the changes they have made to their business model as a result, and why not? Businesses that use data to make more informed decisions have thrived. Naively, a lot of businesses think all it takes is to hit the big red button marked "Data Science", get the answer of 42, and sit back and watch the $$$'s roll in. If you've been in this world, you know this isn't the case.

This article is not about the strategic business process of transforming a company to be data driven, or about the difficulties you will almost certainly encounter along the way (FYI: these are mainly educating your audience and dealing with hostility to change or egos). It is focused on the ground-up approach to building the data infrastructure needed to support your data scientists.

Disclaimer: the technologies, SLAs, and particular use cases of your business will always differ from any author's; this is an overview of what worked for us.

When one thinks of traditional industries pushing the data frontier with a forward-thinking approach to using data, verticals such as financial services and insurance, computers and electronics, and media and telcos are generally the first that spring to mind. Having leaders, of course, means that other industries sit behind the digital curve.

The shipping industry in particular is synonymous with lagging behind others in using data to make key decisions or to automate processes. Whilst coming to the game late has had its disadvantages, it also allows us to look at the data ecosystem as it stands today and take advantage of recent design paradigms and technologies that suit us.

I intend to write three posts over the next year on specific areas:

  1. Cloud Infrastructure for data driven applications
  2. Data application life-cycle management
  3. MLQL (Machine Learning Query Language)

Cloud Infrastructure for data driven applications

The data science community has matured immensely over the last few years in automating processes and pipelines. Great effort has been spent on concepts like "AutoML" and "AutoFeature engineering" to cut the time a data scientist spends redoing the same workflow; go to any conference or internal presentation on data science and some version of the following diagram will pop up.

https://epistasislab.github.io/tpot/

This nicely represents the data science domain model, which enables us to codify the processes required to solve our problem somewhat optimally. Indeed, the community has gone a step further and realised that solving this problem is not enough: a mechanism is also required to productionise the solution.
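To make that automated search loop concrete, here is a minimal sketch using TPOT, the library behind the diagram linked above. The dataset and search parameters are placeholders, not a production recipe.

```python
# Minimal AutoML sketch with TPOT: search for a pipeline, then export it as code.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Evolve candidate pipelines (preprocessing + model + hyperparameters).
automl = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
automl.fit(X_train, y_train)

print(automl.score(X_test, y_test))
# Export the best pipeline as a standalone Python script, ready for productionisation.
automl.export('best_pipeline.py')
```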

Dockerizing the solution and exposing it as a micro-service is the go-to way of solving this next problem of getting into production (a minimal serving sketch follows the list below). To me, though, this is simply a mechanism: what is behind the curtain that lets it happen seamlessly, and how can I make sure I'm prepared for the future? For me, creating a data driven company means creating a data democracy within the company and enabling data access for users at all levels of data literacy, and doing so raises questions to be solved:

  • How can I create an environment that a data user WANTS to use?
  • How can I enable a reproducible data environment as a standard way of working?
  • How can I deploy my solution without worrying about dev-ops / cloud ops?
  • How can I create an abstraction of data manipulation? (includes reporting, pipelines, parallelism, persistence, etc.)
  • Will my solution scale seamlessly?
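To ground the micro-service step mentioned above the list, here is a minimal sketch of a model-serving endpoint of the kind a Docker image would wrap; the model artefact, route, and port are hypothetical.

```python
# Minimal model-serving micro-service sketch (hypothetical model.joblib artefact).
# This is what ends up inside the Docker image; everything around it
# (auth, scaling, logging) is what the rest of this post is about.
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical artefact from the training pipeline

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)           # e.g. {"features": [[1.0, 2.0, 3.0]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```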

The first two points are much softer and depend on your particular business; the last three should resonate much more deeply if the infrastructure you have exposes these problems to the data *user*.

To give an example: when starting our new data science team, we explicitly spent a great deal of time focusing on the scaling challenges we would face, and moreover on solving those challenges upfront, providing a layer of abstraction for the end data user who simply wants to use the infrastructure without getting bogged down in all manner of other design and architecture concepts… you know, "I just want to investigate the data, try out a bunch of models and put my <insert fancy model> in production."

On the surface this seems like a life-cycle management question of moving from R&D into production, and whilst that is a real problem (see the wonderful article on it here: https://medium.com/towards-data-science/understanding-ml-product-lifecycle-patterns-a39c18302452 and the git repo of a library that addresses it: https://github.com/formlio/forml), the life cycle still needs something to operate on.

This brings us to the heart of the matter: creating scalable infrastructure that enables the other layers upstream in the data stack (pipelines, life-cycle management, logging, etc.) while hiding the low-level implementation details from the user. So how did we approach this problem?

Firstly, by recognizing that data is the commodity, which means the majority of our time and effort should be spent on good methodologies for accessing data and on removing barriers to entry for rapid development. This is easier said than done, and it might mean teams thinking about some of the following:

  • Enabling a common abstraction of the logic to query data: the user shouldn't have to worry about auth credentials, tenant names, or storage formats (a sketch of such an interface follows this list)
  • Enabling security layers to avoid misconduct (This is a layer that sits on top of the data but as soon as access is granted the user should know how to query this data)
  • Enabling applications and services, once delivered, to run on infrastructure that provides predictability, scalability, and high availability.
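As a sketch of the first of those points, the aim is an interface roughly like the one below, where connection details come from the platform rather than the user. The helper name, environment variables, and example table are hypothetical, and presto-python-client is just one possible implementation.

```python
# Hypothetical query abstraction: the data user calls query(sql) and never sees
# hosts, credentials or storage formats. Sketch built on presto-python-client.
import os
import prestodb

def query(sql):
    """Run a SQL statement against the warehouse and return the rows."""
    conn = prestodb.dbapi.connect(
        host=os.environ["PRESTO_HOST"],      # injected by the platform, not the user
        port=int(os.environ.get("PRESTO_PORT", 8080)),
        user=os.environ["PRESTO_USER"],
        catalog="hive",                      # assumption: Hive catalog as the default
        schema="default",
    )
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()

# The data user only ever writes something like (hypothetical table):
# rows = query("SELECT vessel_id, avg(speed) FROM voyages GROUP BY vessel_id")
```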

Without going into the code level of our architecture, we have effectively broken the infrastructure layer into a few distinct areas, each grouping together components with common characteristics.

This is especially useful from the operational perspective, as each of these groups might have different SLAs, scaling patterns, or resource demands, requiring different approaches to management and policies such as a level of isolation that stops one group from affecting another's performance.

For any data persistence, we almost exclusively want to use cloud PaaS. It's a well-established ecosystem of solutions that have often become de-facto standards, so solving these problems ourselves would in the majority of cases be a cost-inefficient, lower-quality reinvention of the wheel. Thus we reach for the available blob storage, data lake, file share, database, cache, and message queue services.
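As an illustration of how little of this we want to own ourselves, persisting an artefact to hosted blob storage is a few lines against the provider's client. The example below uses AWS S3 via boto3 purely for illustration; the bucket and key names are hypothetical, and the equivalent exists for any cloud provider's object store.

```python
# Leaning on hosted blob storage instead of running our own: durability,
# replication and access control are handled by the PaaS.
# Bucket and key names are hypothetical; boto3 / S3 chosen purely for illustration.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="best_pipeline.py",
    Bucket="my-data-platform-artifacts",
    Key="models/best_pipeline.py",
)
```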

The Directory Subcluster is what we call a group of metadata services with relatively low CPU and I/O demands that are responsible for managing global state for the rest of the cluster. They are designed with high availability in mind and built (where relevant) on top of the data persistence PaaS described previously. Services in this category are typically coordinators, resolvers, orchestrators and schedulers, or the master parts of master-worker architectures. As examples we can name Zookeeper, Hive Server, the Hive Metastore, the PrestoDB Coordinator, Atlas, the Airflow Scheduler, and Consul, but possibly also the Yarn Resource Manager, the Spark master when in standalone mode, and similar.

Our Job Subcluster is characterized by large and spiky CPU and I/O loads generated by non-interactive, on-demand, distributed batch operations. It runs on stateless nodes, typically under the control of the directory subcluster. Its key attribute is horizontal elasticity, which allows us to ramp the resource pools up dynamically (possibly by a large factor) when needed and release them back after task completion. Services in this cluster are the counterparts of the components mentioned in the directory subcluster, so here we find pools of PrestoDB workers, Airflow workers, and Superset workers, but also Yarn Node Managers or Spark executors, among others.
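To make the scheduler/worker split concrete, a batch job is just a DAG definition handed to the Airflow scheduler in the directory subcluster, while its tasks run on the elastic worker pool. A minimal sketch, with hypothetical task names and the Airflow 1.x-style API current at the time of writing:

```python
# Minimal Airflow DAG sketch: the scheduler (directory subcluster) owns the schedule,
# the tasks execute on the elastic pool of Airflow workers (job subcluster).
# Task names and logic are hypothetical; Airflow 1.x-style imports.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    print("pull raw data from the persistence layer")

def transform():
    print("run the heavy batch transformation on a worker")

default_args = {"owner": "data-platform", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_voyage_aggregates",      # hypothetical pipeline name
    default_args=default_args,
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```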

The Application Subcluster, in contrast, is designed for rather smooth loads, often interactive sessions related to user experience or REST APIs. As such, the demand is focused on low latency and high stability. Services running on the application subcluster refer to the directory cluster for things like service discovery, or to the job cluster for offloading any batch operations. Our custom products are the primary members of this group, followed by various UIs such as the Superset frontend or the Apache Knox Gateway.
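As an example of "refer to the directory cluster", an application can resolve a healthy PrestoDB coordinator through Consul rather than hard-coding an address. A sketch using the python-consul client, where the service name and Consul host are assumptions:

```python
# Sketch of service discovery against Consul in the directory subcluster:
# resolve a healthy PrestoDB coordinator instead of hard-coding its address.
# Service name "presto-coordinator" and the Consul host are assumptions.
import consul

c = consul.Consul(host="consul.internal", port=8500)
_, entries = c.health.service("presto-coordinator", passing=True)

if not entries:
    raise RuntimeError("no healthy presto-coordinator registered in Consul")

service = entries[0]["Service"]
presto_host, presto_port = service["Address"], service["Port"]
print(f"query endpoint: {presto_host}:{presto_port}")
```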

We mentioned that the persistence layer is predominantly provided by PaaS, so the actual application (big) data is accessed directly from it. There are, however, cases where we need a specialized layer (an access interface or processing stage) that doesn't exist within the available PaaS but can be implemented as a managed feature on top of it. For this purpose there is the Data Subcluster, serving custom I/O-bound workloads.

We use this cluster to integrate all of our streaming ingestion, which covers everything from the message queues up to the final sinks that produce the primary data sets, but also secondary information like search indexing or BI aggregations. Examples of technologies found in this cluster are Kafka Streams applications, Solr, RabbitMQ, and Logstash, but potentially even HBase RegionServers running on top of hosted blob storage (in case the hosted NoSQL services for some reason don't fit the need).
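A stripped-down example of the ingestion pattern this subcluster hosts: consume events from a topic and forward them towards a sink. The sketch uses kafka-python with hypothetical broker and topic names; the real pipelines here are Kafka Streams applications, which run on the JVM.

```python
# Stripped-down streaming ingestion sketch: read events from a Kafka topic and
# forward them to a sink. Broker, topic and group names are hypothetical.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "vessel-positions",                         # hypothetical topic
    bootstrap_servers=["kafka.internal:9092"],  # hypothetical broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="ingestion-demo",
)

for message in consumer:
    event = message.value
    # In the real pipeline this would be enriched and written to the primary
    # data set, a search index (Solr) or a BI aggregation.
    print(event)
```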

There is a series of low-level services which we have separated into their own System Subcluster. These are, for example, centralized ingress proxies and application gateways providing transparent auth control, load balancing, or SSL provisioning and termination. We should also mention here a few hosted services (even though they are external to the cluster) like OAuth providers and hosted log and metrics collectors and monitoring systems.

A special part of the infrastructure is dedicated to our R&D resources. It is a vertical stack mixing hosted and managed technologies, covering various agile tools, the whole CI/CD ecosystem including artifact repositories, and a virtual development environment with a multi-user notebook service and a remote shell. The key role obviously belongs to the elastic computing resources and the scheduling around them, as instrumentation allowing for iterative and reproducible data science research based on continuous evaluation.

Creating the infrastructure in this fashion has given us the abstraction we wanted in order to build integrated life-cycle management for the user, whilst addressing our key areas of concern.

This enables a data user at a high level to:

  1. Query known data sources via PrestoDB through a single API without concern for data size (elastic scaling of workers). In fact, we have an in-house Python API abstraction over the entire data warehouse infrastructure, allowing us to use its various resources.
  2. Search the metadata around other data sources with Solr and Apache Atlas to augment primary data sources (all via Presto connectors) and retrieve them if permissions are granted (Apache Ranger)
  3. Complete EDA and other investigations with Jupyter Notebooks and save and share these with other users.
  4. Define secondary, transformed data sources to expose to BI end users via Apache Superset for further investigation.

Note this is NOT concerned with pushing models into production, or with retraining or tuning them; that is the next stage up the stack and the concern of the next post. What we have shown are the building blocks we used to make this happen, whilst keeping our number-one focus on providing the tools the data user needs and wants to use.
