
If you’re anything like me, you think that serverless is great. Negligible running costs, no server configuration to take up your time, auto-scaling by default, etc. Of course, by virtue of the ‘no free lunch’ mantra, this convenience and these cost savings have to come at a price. Whilst there are numerous real costs to serverless (memory constraints, package size, runtime restrictions, developer learning curve), this article is going to assume that you’ve already solved those (or don’t care about them) and instead focus on a specific performance cost – cold starts.
Why do cold starts hurt data science applications?
Cold starts are slow. That’s the problem. Exactly how slow depends on various factors, such as your runtime (Python is actually one of the quicker Lambda containers to start) and what you’re doing in the setup phase of the Lambda (source).
As data scientists and developers, we’re used to a gentle pace. Start a download of 200 GB of text data (put the kettle on), load a model (pour the tea), run a clustering algorithm (go for a walk)… sound familiar? But whilst this slow pace is fine for experimentation (and productionised systems running on persistent servers), it is potentially fatal for the serverless pattern. AWS Lambda only allows a maximum of 15 minutes of runtime before the function times out and the container dies, confining all un-persisted work (and running computations) to the graveyard of dead containers, never to be restarted.
This issue is compounded when your Lambda function is in fact an API endpoint which clients can call on an ad-hoc basis to trigger some real-time data sciencey/analytics process (in our case, NLP on a single phrase of arbitrary length). That window of 15 minutes of Lambda runtime suddenly gets slashed to a 30-second window in which to return an HTTP response to the client.
This article isn’t trying to put you off running ad hoc NLP pipelines in Lambda. On the contrary, it hopes to make you aware of these constraints and some ways to overcome them (or, alternatively, to help you live on the edge by choosing to ignore them from a place of understanding).
What is a cold start?
A cold start occurs when the container has been shut down and has to be started again when the Lambda function is invoked; typically this happens after ~5 minutes of inactivity.
A lot has already been written about cold starts, so this article won’t provide a detailed guide (I recommend you check out this article for that). But in a nutshell…
When a container starts from a cold state, the function needs to:
- Get and load the package containing the lambda code from external persistent storage (e.g. S3);
- Spin up the container;
- Load the package code in memory;
- Run the function’s handler method/function.
(https://dashbird.io/blog/can-we-solve-serverless-cold-starts/)
N.b. step 4 always happens whenever you invoke a lambda function (it’s your code!), but steps 1–3 only occur for cold starts. As the setup stages take place entirely within AWS, they are outside of our control. We can optimise to our heart’s content in step 4, but steps 1–3 can still arbitrarily hit us with latencies of ~10 seconds or more. This is obviously a problem for synchronous APIs.
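To make that split concrete, here’s a tiny illustrative sketch of which parts of your own code land in which phase: anything at module scope runs once, as part of container initialisation (i.e. only on a cold start), whilst the handler body runs on every invocation.

```python
import boto3  # imported once, during container initialisation (cold starts only)

# Module-scope work runs once per container, as part of the init/cold-start phase.
s3 = boto3.client("s3")


def handler(event, context):
    # Everything in here (step 4) runs on every invocation, warm or cold.
    text = event.get("text", "")
    return {"statusCode": 200, "body": f"received {len(text)} characters"}
```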
Our specific problem
Getting onto our specific problem now. We had a synchronous API, which:
- Took an arbitrary text input from an HTTP request
- Downloaded an NLP model from S3 (~220 MB)
- Performed NLP on the input using the model
- Returned the serialised result to the caller.
The issue here was step 2. Downloading the model from S3 on each invocation could take 15–20 seconds. This was fine for our use case the majority of the time: although we were providing a long-running synchronous endpoint, we didn’t expect it to be quick (we’re talking about NLP on the fly, not a simple GET request).
However, during cold starts we were seeing requests frequently time out. This makes total sense: if the Lambda takes ~10 seconds to start up on top of the ~20 seconds to download the model, we’re not left with much time to run the NLP and return the results within a 30-second HTTP window!
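For illustration, here’s roughly what that naive handler looks like. The bucket, key, pickled-model format and event parsing below are placeholders rather than our actual code:

```python
import json
import pickle

import boto3

s3 = boto3.client("s3")

# Placeholder names – the real bucket, key and model are specific to our pipeline.
MODEL_BUCKET = "my-model-bucket"
MODEL_KEY = "models/nlp-model.pkl"
LOCAL_PATH = "/tmp/nlp-model.pkl"


def handler(event, context):
    # Step 2: fetch the ~220 MB model from S3 on *every* invocation (15–20 seconds).
    s3.download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_PATH)
    with open(LOCAL_PATH, "rb") as f:
        model = pickle.load(f)  # assumes a pickled, callable model object

    # Step 3: run NLP on the input text (simplified – a real API Gateway event
    # would carry the text in the request body).
    result = model(event["text"])

    # Step 4: return the serialised result to the caller.
    return {"statusCode": 200, "body": json.dumps(result)}
```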

Possible solutions:
There are various ways to solve this problem, such as:
a. Provisioned concurrency
b. serverless-plugin-warmup
c. EC2
d. A bespoke solution
Each of these is discussed below:
A. Provisioned concurrency
Provisioned concurrency is a great solution if you know an exact window for when you expect traffic to your Lambda. It basically pre-allocates a set number of containers to run your Lambda, which remain up for the duration of the specified window.
One of the main advantages of this solution is that it can improve latency in both the setup (1–3) and runtime (4) phases of the Lambda process described above. It’s obvious that the setup phase is eliminated during the provisioned timeslot, as the containers are already running (i.e. no cold starts), but what maybe isn’t quite so obvious is how this approach can also speed up the runtime. The answer is that keeping the container warm avoids unnecessary re-initialisation on subsequent invocations (e.g. setting up database connections, initialising objects, downloading reference data, or loading heavy frameworks) (source).
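If you manage this outside of a framework, a sketch of the underlying call looks something like the following (the function and alias names are placeholders; provisioned concurrency is applied to a published version or alias, not $LATEST):

```python
import boto3

lambda_client = boto3.client("lambda")

# Pre-allocate 10 warm containers for a published alias of the function.
# "nlp-endpoint" and "live" are placeholder names.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="nlp-endpoint",
    Qualifier="live",  # a version or alias of the function
    ProvisionedConcurrentExecutions=10,
)
```

In practice you’d schedule calls like this (or use Application Auto Scaling’s scheduled actions) so the provisioned containers cover your expected traffic window – which is exactly the part we couldn’t pin down.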
The issue with provisioned concurrency for our use case was that, as our API has to support integrators globally across different timezones, the provisioning windows would need to be so large that it would be an ineffective solution for us. Furthermore, we didn’t know exactly when those windows would be.
B. serverless-plugin-warmup
We use the (highly recommended) serverless framework, and one of the greatest benefits of adopting the framework is the community and range of open-source plugins available. One such plugin for exactly this problem is the serverless-plugin-warmup.
This plugin takes a similar approach to provisioned concurrency, in that it attempts to keep Lambda containers from going cold, but it achieves this by hitting them with dummy requests during a specified time window. The plugin is well documented and highly configurable, but ultimately we decided it wasn’t the best route for us, for the same reason as the provisioned concurrency option – we can’t necessarily predict exactly when we’ll need the concurrency, and configuring it for a wide window would be wasteful/overkill.
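If you do go down this route, your handler needs to recognise the dummy requests and bail out early. The exact payload is configurable (check the plugin’s docs for your version); the sketch below assumes a "source" field identifying the plugin:

```python
def handler(event, context):
    # If this invocation is just a warm-up ping, return immediately so the
    # container stays warm without doing any real work. The payload is
    # configurable – here we assume a "source" field set by the plugin.
    if isinstance(event, dict) and event.get("source") == "serverless-plugin-warmup":
        return {"statusCode": 200, "body": "warm-up ping acknowledged"}

    # ...otherwise continue with the real request handling (omitted here)...
    return {"statusCode": 200, "body": "real work would happen here"}
```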
Let’s focus briefly on that last point – wastefulness. Lambdas are cheap as chips, but for us the real cost was downloading the NLP model. A cigarette-packet calculation showed that if we were to keep just 10 containers alive during normal UK office hours, we’d download ~1.89 TB of data a month. Whilst this still isn’t a massive expense in S3, it was orders of magnitude higher than our current expenditure. Bear in mind the real cost when scaling your lambdas – often it isn’t just the runtime of the lambdas themselves!
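For the curious, the exact assumptions behind that figure aren’t reproduced here, but purely illustrative numbers get you to the same ballpark (every keep-alive of the naive handler re-downloads the ~220 MB model):

```python
# Purely illustrative assumptions – not the exact figures behind the estimate above.
containers = 10
pings_per_hour = 5           # e.g. a keep-alive roughly every 12 minutes
office_hours_per_day = 8
working_days_per_month = 22
model_size_gb = 0.22         # ~220 MB model downloaded on each keep-alive

downloads = containers * pings_per_hour * office_hours_per_day * working_days_per_month
total_tb = downloads * model_size_gb / 1024
print(f"{downloads} downloads ≈ {total_tb:.2f} TB per month")  # ≈ 1.89 TB
```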
C. EC2
All this downloading models from S3 every time sounds very inefficient – can’t we just spin up a few webservers that autoscale and load the model on there so it’s ready to go? Problem solved, no?
Well, yes, but a) that would require a significant investment in refactoring away from our current, entirely serverless API backend, and b) another cigarette-packet cost calculation showed this would be orders of magnitude more expensive than implementing A or B, which were themselves orders of magnitude more expensive than what we’re currently doing.
D. A bespoke solution
Having ruled out the above solutions, it was decided a bespoke solution was the way to go.
Now, the caveat here is don’t forget the cost of development when choosing to roll your own solution. I like Yevgeniy Brikman’s rule that if you’re a startup, unless what you are building is your core value proposition, don’t build it – use an open-source library or failing that a proprietary solution.
However, in this case, I decided to break that rule as the ongoing cost of the alternatives would dwarf the one-time development cost of the feature.
So what was the solution?
Our solution hinged around the realisation that whilst we couldn’t predict exactly when we needed the service to be warm, we did know that in order to use the service the client had to first authenticate themselves using OAuth. If an authentication request to get an OAuth token was sent, this would most likely be followed by a subsequent request to do something with that token.
Therefore, the pattern was as follows:
- When generating an OAuth token, also asynchronously warm up the Lambda container.
- When the subsequent client request is made, it hits the already-warm Lambda.
Here’s a sequence diagram of the flow:

[Sequence diagram: the client requests an OAuth token; the warm-up is triggered asynchronously; the client’s subsequent API request then hits the already-warm Lambda.]
N.b. there is a slight problem with this solution: it introduces a race condition if the client sends the subsequent request before the Lambda has fully warmed up. In our case, we decided this was fine so long as we made it clear to our API integrators that they should allow time for the Lambda to warm up before sending their first request after authenticating.
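Here’s a minimal sketch of the trigger side, assuming the token endpoint is itself a Lambda with permission to invoke the NLP function (function names, payload fields and the token logic are placeholders). An asynchronous (‘Event’) invocation returns immediately, so it doesn’t slow down the token response:

```python
import json

import boto3

lambda_client = boto3.client("lambda")


def create_oauth_token(event):
    # Stand-in for your real OAuth token generation.
    return "example-token"


def issue_token_handler(event, context):
    token = create_oauth_token(event)

    # Fire-and-forget: asynchronously invoke the NLP lambda with a warm-up
    # payload so it can pull the model ready for the client's next request.
    lambda_client.invoke(
        FunctionName="nlp-endpoint",           # placeholder name
        InvocationType="Event",                # async – returns immediately
        Payload=json.dumps({"warmup": True}),  # the NLP lambda returns early on this
    )

    return {"statusCode": 200, "body": json.dumps({"access_token": token})}
```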
Conclusion
Serverless is cool. Doing NLP on serverless is cool. API request timeouts are not cool (ironically, cold starts are not cool). If you want to be cool, work out how to avoid the cold start issue by considering one of the means highlighted above.
Depending on your specific use case, an out-of-the-box solution like provisioned concurrency may well be the right route for you; alternatively, if none of these solutions is ideal for you, consider a simple bespoke approach as we did. Always consider the fully loaded cost of running and scaling your service (not just the Lambda invocation runtime). Finally, whilst avoiding optimising too early, if you start to hit timeout issues in your API then you know it’s time to optimise!
Footnote
On the point: "All this downloading models from S3 every time sounds very inefficient". Indeed it is. The solution here is to make use of the /tmp storage that comes with AWS Lambdas, and only re-download the model if it’s not already persisted there (i.e. it’s a cold start).
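A minimal sketch of that pattern (bucket, key and pickled-model details are placeholders; /tmp gives you 512 MB by default, which comfortably fits a ~220 MB model):

```python
import json
import os
import pickle

import boto3

s3 = boto3.client("s3")

MODEL_BUCKET = "my-model-bucket"    # placeholder
MODEL_KEY = "models/nlp-model.pkl"  # placeholder
LOCAL_PATH = "/tmp/nlp-model.pkl"


def handler(event, context):
    # /tmp persists for the lifetime of the container, so only a cold start
    # (fresh container) pays the 15–20 second S3 download.
    if not os.path.exists(LOCAL_PATH):
        s3.download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_PATH)

    with open(LOCAL_PATH, "rb") as f:
        model = pickle.load(f)  # assumes a pickled, callable model object

    return {"statusCode": 200, "body": json.dumps(model(event["text"]))}
```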