Data science within the advertising industry

How data drives budget allocation in the ~$600B advertising business.

Aram Maliachi
Towards Data Science

--

You may have heard of how data drives ad placement within Facebook’s or Google’s advertising businesses. By data-driven ad placement we mean the ability of ad vendors to show an ad at a higher rate to consumers likely to be interested in its content. But apart from big tech companies’ complex data-driven ad placement systems, there is a much denser and equally relevant group of data-driven businesses within the advertising space: agencies. Though in a different way and for different purposes than big tech companies, agencies leverage data for a very important operation, one usually overlooked from outside the industry: budget allocation.

At a high level, you can think of an advertising and marketing agency as a company in charge of managing and executing the lifecycle of advertising campaigns for its clients. This includes content creation, marketing strategy design, acting as a single point of contact with ad vendors, centralized account management and, of course, data analytics. With a forecast spend of ~$615B for advertising in 2020, agencies play a key role in how companies like Facebook build profitable businesses on advertising.

Image by author

By continuously evaluating metrics such as performance vs. cost, sentiment vs. age and gender reach, or conversion rate vs. media type, agencies decide how to dynamically allocate the resources of running campaigns to meet clients’ goals.

Awareness and performance are the ruling end goals for brands in today’s market. These client goals define a campaign’s objective and thus the data required for relevant analytics. Think of an ad as the atomic unit of a campaign: the ads of a campaign usually run on different media types, each may have a different objective, and together they add up to the global intent of the campaign. For example, an ad within a campaign may be aimed at awareness, meaning it is meant to make the public aware of the product or service presented; for that ad we care about its reach, impressions and audience demographic data like age and gender. On that same campaign there may be ads scoped for performance, whose purpose is to redirect the audience to a place where they can buy the product offered or sign up for the service presented. In this case analytics need to track conversions and traffic generated, through metrics like clicks or CTR (click-through rate).

Image by author
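As a toy illustration (the metric definitions are standard; the numbers are invented), the two objectives boil down to different ratios over the same kind of raw counts:

```python
# Toy example: the same raw ad counts feed different KPIs
# depending on the ad's objective. Numbers are made up.

awareness_ad = {"impressions": 120_000, "reach": 45_000}
performance_ad = {"impressions": 80_000, "clicks": 1_600, "conversions": 96}

# Awareness: how widely and how often the ad was seen.
frequency = awareness_ad["impressions"] / awareness_ad["reach"]

# Performance: how well the ad drives action.
ctr = performance_ad["clicks"] / performance_ad["impressions"]   # click-through rate
cvr = performance_ad["conversions"] / performance_ad["clicks"]   # conversion rate

print(f"frequency={frequency:.2f}, CTR={ctr:.2%}, CVR={cvr:.2%}")
```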

Agencies run campaigns on a wide range of media types, namely digital, cinema, radio, TV, press and OOH (out of home). Data scientists within agencies engineer data analysis solutions capable of extracting and delivering meaningful insights out of ever-changing data. In the end, these insights are meant to back up critical business decisions during a campaign’s lifecycle, as well as keep clients informed on key statistics of their running campaigns.

We are spending too much on TV and we’re not getting the expected results, put a cap on their budget.

This vendor called ‘Facebook’ owns more than 40% of our digital offering, is there no one else? What about these guys ‘Instagram’?

A data analysis solution will ingest, process and analyze data, and finally deliver the insights as visualizations in a data visualization tool like Domo, Amazon QuickSight or Power BI, in an automated fashion and on a schedule (over longer or shorter periods depending on the client’s budget). The channel through which raw data is ingested, and the shape of that data, vary across the different media types where agencies run campaigns.

Digital

Anything on the internet: Facebook Ads, Google Ads, YouTube advertising and any other web portal with ads, usually powered by a major vendor like Teads. Since these are mainly tech companies, they will expose an API for extracting non-relational statistics of running campaigns, and often client libraries for the most popular programming languages.
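As a rough sketch of what that extraction looks like, here is a hypothetical cursor-paginated REST client. The endpoint, field names and pagination scheme are all invented, since every real vendor API (Facebook Marketing API, Google Ads API, etc.) differs in the details:

```python
import requests

# Hypothetical vendor endpoint and token -- real vendor APIs differ.
BASE_URL = "https://api.example-vendor.com/v1/campaigns/{cid}/insights"
TOKEN = "..."  # kept in a secrets manager in practice

def fetch_campaign_stats(campaign_id: str) -> list[dict]:
    """Pull daily stats for one campaign, following cursor pagination."""
    rows, url = [], BASE_URL.format(cid=campaign_id)
    params = {"fields": "date,impressions,reach,clicks,spend", "limit": 500}
    while url:
        resp = requests.get(url, params=params,
                            headers={"Authorization": f"Bearer {TOKEN}"},
                            timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        rows.extend(payload["data"])
        url = payload.get("next")  # None when there are no more pages
        params = None              # the cursor URL already carries the params
    return rows
```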

Cinema

Those fifteen minutes before every movie really add up to the client’s end numbers. You may have noticed advertising content is of a higher standard at the cinema.

Agencies will have the number of impressions (impressions = playbacks, in business jargon) of an ad campaign in cinema, mainly as flat-file sheets, since these are filled in by someone in account management.

Depending on the size of the agency, cinema companies may agree to deliver extracts of their sales database, which agencies can cross-reference with the known number of impressions to estimate the reach of an ad. We have a leak, though, for those who arrive 15 minutes late to their movie, LOL.
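A minimal pandas sketch of that cross-referencing, assuming invented file and column names for the impressions sheet and the sales extract:

```python
import pandas as pd

# Hypothetical inputs: the impressions sheet comes from account
# management, the sales extract from the cinema chain.
impressions = pd.read_csv("cinema_impressions.csv")  # screening_id, ad_id, impressions
sales = pd.read_csv("cinema_sales_extract.csv")      # screening_id, tickets_sold

merged = impressions.merge(sales, on="screening_id", how="left")

# Estimated reach: one screening's playback reaches (at most) its ticket
# buyers -- ignoring the latecomers mentioned above.
reach = merged.groupby("ad_id")["tickets_sold"].sum().rename("estimated_reach")
print(reach)
```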

TV

Pretty much the same as in cinema. Here, the reach field definitely needs to be filled in by the vendor side (the TV company). TV companies collect their reach data 24/7, so this is most probably what they hand to the agency.

Radio, Press and OOH

You get the idea by this point: agencies need to continuously and incrementally collect raw data about things like planned budget, spent budget, impressions, reach, clicks, vendor, campaign sentiment, audience demographics and conversion rate during the life of a campaign. The analytics must be sophisticated enough to analyze data across different data sets by following the fields they have in common.
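A minimal pandas sketch of what “following common fields” means in practice, with invented extracts and column names for two media types:

```python
import pandas as pd

# Hypothetical extracts, one per media type, each with its own column names.
digital = pd.read_json("digital_api_dump.json")  # day, imps, clicks, spend
tv = pd.read_csv("tv_vendor_extract.csv")        # air_date, impressions, reach, cost

COMMON = ["date", "media_type", "impressions", "reach", "clicks", "spend"]

digital_norm = (digital
    .rename(columns={"day": "date", "imps": "impressions"})
    .assign(media_type="digital", reach=pd.NA))   # digital vendor omits reach here

tv_norm = (tv
    .rename(columns={"air_date": "date", "cost": "spend"})
    .assign(media_type="tv", clicks=pd.NA))       # no clicks on TV

# One fact table, analyzable across media types on the shared fields.
facts = pd.concat([df.reindex(columns=COMMON) for df in (digital_norm, tv_norm)],
                  ignore_index=True)
```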

The solution

From an engineering perspective, we want all this data collected in an automated fashion, on a schedule, into an object store like Amazon S3, Azure Blob Storage or GCP Cloud Storage.

For the APIs we can use a script that sends requests to the vendor endpoints on a schedule and pushes the response object to our cloud object storage service. At my job I use serverless compute like AWS Lambda, Azure Functions or GCP Cloud Functions for this.
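Here is a minimal sketch of such a script as an AWS Lambda handler, reusing the hypothetical fetch_campaign_stats client from the Digital section and an invented bucket name; a real deployment would be triggered by an EventBridge schedule:

```python
import datetime
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "agency-data-lake"  # assumed bucket name

def handler(event, context):
    """Runs on a schedule; lands one raw JSON object per campaign per day
    in the data lake, partitioned by campaign and date."""
    for campaign_id in event.get("campaign_ids", []):
        rows = fetch_campaign_stats(campaign_id)  # API client sketched above
        today = datetime.date.today().isoformat()
        s3.put_object(
            Bucket=BUCKET,
            Key=f"raw/digital/{campaign_id}/{today}.json",
            Body=json.dumps(rows).encode("utf-8"),
        )
```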

Since the rest of the data is provided to us by account management and third parties, we need a way to give them access to our object storage. We can stand up an SFTP server connected to our storage, so that every time the server receives a file it maps it into an object within a folder of our object store (on AWS, for example, the Transfer Family service does exactly this for S3).

Once we have all these data sources connected, the ingestion process is complete: the object storage is continuously fed all the data the solution needs to process. This is called a data lake.

Image by author

The rest of the process is a collection of ETL jobs and scripts that transform all the data into a shape that makes sense for analysis: field renaming, data type assignment, regular expressions, complex JOIN operations and conversion to an analysis-friendly file format like Parquet or ORC. These operations happen within frameworks like Hadoop and Spark, or cloud-managed services like AWS Glue, Azure Data Factory or GCP Dataproc. One can also leverage open-source libraries like pandas for Python. Jobs and scripts can run on event triggers or on a schedule.
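A minimal pandas sketch of one such job, with invented paths and column names (reading and writing s3:// paths requires the s3fs package, and Parquet output requires pyarrow or fastparquet):

```python
import pandas as pd

# Assumed raw extract landed by the ingestion step.
raw = pd.read_json("s3://agency-data-lake/raw/digital/123/2020-06-01.json")

clean = (raw
    .rename(columns={"imps": "impressions", "day": "date"})  # field renaming
    .assign(
        date=lambda df: pd.to_datetime(df["date"]),          # type assignment
        # regex clean-up: strip currency symbols that leak into spend fields
        spend=lambda df: (df["spend"].astype(str)
                          .str.replace(r"[^\d.]", "", regex=True)
                          .astype("float64")),
        impressions=lambda df: df["impressions"].astype("int64"),
    ))

# Assumes the raw extract carries a campaign_id column to join on.
campaigns = pd.read_csv("campaign_metadata.csv")  # campaign_id, client, objective
enriched = clean.merge(campaigns, on="campaign_id", how="left")  # the JOIN step

# Columnar, analysis-friendly output.
enriched.to_parquet("s3://agency-data-lake/curated/digital/2020-06-01.parquet")
```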

Finally, the solution needs to store the transformed data in another cloud object store or, better yet, a cloud-hosted data warehouse like Amazon Redshift, Azure Synapse or GCP BigQuery.
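On AWS, for example, that load step can be a COPY statement issued through the Redshift Data API; the cluster, role and table names below are invented:

```python
import boto3

# Sketch: load the curated Parquet into Redshift via the Data API.
client = boto3.client("redshift-data")

copy_sql = """
    COPY analytics.campaign_facts
    FROM 's3://agency-data-lake/curated/digital/2020-06-01.parquet'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
    FORMAT AS PARQUET;
"""

client.execute_statement(
    ClusterIdentifier="agency-dwh",
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
```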

Agencies feed their data visualization tool of choice with this data, and the business intelligence department handles the final dashboard and visualization design, assuming the solution automatically updates the final data set every time new raw data arrives. Account management hosts follow-up meetings with the client that orbit around these dashboards, and thousands of dollars get dynamically reallocated on an almost daily basis, relying heavily on data analytics.

--

Transitioned from architecture into CS. Data science, computational design, advertising and support. Author at Microsoft DevBlogs.