
This story is for those who would like to learn how to code, run tests, automate CI/CD checks, and run them in any environment, including locally.
Unit testing is a must-have skill for machine learning engineers these days. It looks great on your CV and increases your chances of getting hired.
I’m a Data Engineer, and I often need to create microservices to process data (ETL). Depending on the task, we might want to do the following (not an exhaustive list):
- Extract data from one source and pass it to another.
- Transform the data along the way, e.g. change the format, mask PII, etc.
- Load the data somewhere else, e.g. into a data warehouse solution.
In each of these cases, we would want to guarantee that our data services perform as required, and as we make changes, automated tests are run to ensure logic consistency.
## Simplify
I use a simple atomic design for each part of the data pipeline I have in mind and deploy it with AWS Lambda or Cloud Functions (Google Cloud Platform). This way, we can orchestrate it with ease, and MLOps becomes a trivial task.
## Types of tests
In unit testing, code behaviour is verified in tiny, independent pieces, called units, each covered by its own unit test.
What we want to test depends on our microservice logic and infrastructure, but in general it comes down to the following:
- **Simple unit tests** for function logic, e.g. the `processEvent()` function inside your microservice. We want to make sure that this particular function consistently returns the expected result when we provide some input for it.
- **Integration tests** for when we need to test how our service works in conjunction with some other services, e.g. it sends a request and receives something back in return (a response).
## What makes a good unit test?
- It tests expected vs actual results.
- It describes the script’s logic, corresponding to use cases.
- It is automated.
- It is independent: tests should not do setup or teardown for one another.
- It is easy to implement.
- It is repeatable: anyone should be able to run it in any environment.
- Once it’s written, it remains for future use.
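To make this concrete, here is a tiny Mocha/Chai test that ticks all of these boxes. It is a minimal sketch around a made-up `jobKey()` helper, not part of the service we build below:

```js
const { expect } = require('chai');

// A pure, hypothetical function under test: no external state, no setup needed.
const jobKey = (job) => `${job.output}/${job.name}`;

describe('jobKey()', () => {
    it('compares expected vs actual results', () => {
        expect(jobKey({ name: 'gcp_to_s3', output: 's3' })).to.equal('s3/gcp_to_s3');
    });

    it('is independent and repeatable in any environment', () => {
        // No shared state with other tests and no environment assumptions.
        expect(jobKey({ name: 'a', output: 'b' })).to.equal('b/a');
    });
});
```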
Let’s imagine a simple scenario where we are tasked to create a data pipeline that transfers data from one cloud platform to another, e.g. from GCP to AWS, and does some ETL along the way:

We can create a simple microservice with AWS Lambda or Cloud Functions that will process this pipeline. The pipeline itself can be anything, and in this way we can deploy as many data pipelines as we need.
## Creating a data service
This will be a simple atomic design for a serverless microservice to perform one single operation.
If you are a Python user, don’t worry: I have a similar story on how to write integration tests with Python and AWS Lambda. In any case, this story is just here to demonstrate how to test your data pipelines.
Let’s create our service…
In your command line run:
```bash
mkdir js-lambda-unit-tests
cd js-lambda-unit-tests
mkdir stack
cd stack
mkdir js-lambda-unit-tests
cd js-lambda-unit-tests
```
Now we can open it in the code editor and start coding…
## Our App structure
```
.
├── app.js
├── node_modules
├── package-lock.json
├── package.json
└── test
    ├── event.json
    └── unit
        ├── lambdaResponseTest.js
        └── runTimeTest.js
```
Let’s create our `app.js`:
```js
exports.handler = async (event, context) => {
    try {
        const jobs = event.jobs;
        const successfullJobs = await processEvent(jobs);
        return {
            'statusCode': 200,
            'data': successfullJobs,
            'context': context ? context.succeed() : null,
        };
    } catch (e) {
        console.log(e);
        return {
            'statusCode': 400,
            'data': e,
            'context': context ? context.done() : null,
        };
    }
};
```
Simple, right? We are learning to code, so let’s start with simple things; we can always add lots of other useful features later. (Note that `processEvent()` is not defined yet: we will add it in a moment.)
To initialise your App you can run `npm init`, or simply copy my `package.json` file and then run `npm install`.
These are the packages we are going to use:
```json
...
"author": "Mike Shakhomirov [email protected]",
"license": "ISC",
"devDependencies": {
    "aws-sdk": "2.804.0",
    "run-local-lambda": "1.1.1",
    "eslint": "^7.20.0",
    "eslint-plugin-classes": "^0.1.1",
    "eslint-plugin-promise": "^4.3.1",
    "mocha": "^7.1.1",
    "chai": "^4.2.0"
},
"dependencies": {
    "moment": "^2.24.0"
}
...
```
Again, it is simple, right? Indeed, we are going to use only what’s necessary. We move `aws-sdk` to `devDependencies` so that it never goes into our Lambda package zip: AWS provides this package in the Lambda runtime by default, so there is no need to install it. The `eslint`-related packages are good for development, and there is no need to explain why: they make your code look nice and consistent. `run-local-lambda` is just one alternative for running a Lambda locally, used here for demonstration purposes. Finally, `mocha` and `chai` are what we need for testing.
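For completeness, a minimal `.eslintrc.json` wiring up the listed plugins might look like this. It is a sketch of a sensible starting point, not the exact config from the repo:

```json
{
    "env": { "node": true, "es2020": true },
    "parserOptions": { "ecmaVersion": 2020 },
    "plugins": ["classes", "promise"],
    "extends": ["eslint:recommended", "plugin:promise/recommended"],
    "rules": {}
}
```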
## Add an event to invoke the service
Our data engineering jobs/tasks will come from the `event` that invokes our service. This event can be fired by another data service, e.g. an orchestrator Lambda, or by AWS CloudWatch to run the service on a schedule, as sketched below.
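For example, a CloudWatch (EventBridge) schedule could fire the Lambda every hour. A rough AWS CLI sketch, where the rule name, region, account ID, and function name are all placeholders (the Lambda would also need an invoke permission for the rule, omitted here):

```bash
# Create a rule that fires every hour:
aws events put-rule --name run-data-service-hourly \
    --schedule-expression "rate(1 hour)"

# Point the rule at our Lambda:
aws events put-targets --rule run-data-service-hourly \
    --targets '[{"Id": "1", "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:js-lambda-unit-tests"}]'
```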
Let’s create `./test/event.json`, e.g.:
{ "configOverride": false,
"jobs": [
{
"name": "gcp_to_s3",
"output": "s3",
"dryRun": true,
"disabled": false,
"s3Key": "gcs/",
"s3Bucket": "data-staging.avro.aws",
"sourceBucket": "data-staging-gcs-avro"
}
]
}
In this example, we define a job configuration to move data from one cloud to another, i.e. from Google Cloud Storage to AWS S3. This is a trivial data engineering task, performed almost daily. We also supply a `configOverride` parameter to tell our service whether we want to use a custom event or the one defined in the config file, for example a task that has to run daily, hourly, etc.
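We won’t implement the `configOverride` branch in this story, but a minimal sketch could look like this, assuming a hypothetical `./config.json` that holds the scheduled jobs:

```js
// Hypothetical config file holding the default, scheduled jobs.
const defaultJobs = require('./config.json').jobs;

exports.handler = async (event, context) => {
    // Use the jobs from the incoming event only when configOverride is set;
    // otherwise fall back to the jobs defined in the config file.
    const jobs = event.configOverride ? event.jobs : defaultJobs;
    // ... process jobs as shown below
};
```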
## Add processEvent() function
Now let’s add the `processEvent()` function. It will do something simple with the event data, i.e. return the current date and time for each job:
```js
const processEvent = async (jobs) => {
    const now = moment.utc();
    const jobList = [];
    for (const job of jobs) {
        const jobTime = now.format('YYYY-MM-DD HH:mm');
        jobList.push({
            name: job.name,
            runTime: jobTime,
        });
    }
    return jobList;
};
```
Now let’s add this to our `app.js`:
```js
const moment = require('moment');

exports.handler = async (event, context) => {
    console.log('Now: ', moment());
    try {
        const jobs = event.jobs;
        const successfullJobs = await processEvent(jobs);
        return {
            'statusCode': 200,
            'data': successfullJobs,
            'context': context ? context.succeed() : null,
        };
    } catch (e) {
        console.log(e);
        return {
            'statusCode': 400,
            'data': e,
            'context': context ? context.done() : null,
        };
    }
};

const processEvent = async (jobs) => {
    const now = moment.utc();
    console.log(jobs);
    const jobList = [];
    for (const job of jobs) {
        const jobTime = now.format('YYYY-MM-DD HH:mm');
        jobList.push({
            name: job.name,
            runTime: jobTime,
        });
        console.log(jobList);
    }
    return jobList;
};
```
And finally, let’s add a new script to our `package.json` file to use `run-local-lambda` with the event supplied from `./test/event.json`:
```json
...
"scripts": {
    "local": "export DEBUG=true; export NODE_ENV=staging; run-local-lambda --file app.js --event test/event.json --timeout 1000000",
    "test": "test"
},
...
```
Now we can run `npm run local` in our command line to test the Lambda locally:

So we see that it works and the Lambda returns something when we run it locally. That’s great news, but how do we unit-test it?
## Use Mocha and Chai to unit test our data service

First of all, why use `mocha` and `chai`?
Mocha is a great framework for testing which can run all tests from the test folder recursively, i.e. `npm run test` will run `test1.js`, `test2.js`, ..., `testN.js`. It enables the following:
- use of any assertion library (this is where we will use Chai)
- asynchronous testing
- test coverage reports
Chai is an assertion library for Node.js that can be paired with Mocha to raise assertions in our tests.
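As a quick taste, Chai ships three assertion styles. We will use the `expect` style throughout this article; the others are shown purely for orientation:

```js
const chai = require('chai');

// `expect` style (BDD), used in this article:
chai.expect([1, 2]).to.have.lengthOf(2);

// `assert` style (TDD):
chai.assert.strictEqual(1 + 1, 2);

// `should` style (BDD), which extends Object.prototype:
chai.should();
(5).should.be.a('number');
```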

Let’s add a couple of unit tests. One will test that the Lambda returns results with a `runTime`, and the second one will check that the Lambda returns a response with the correct status codes.
Let’s create `stack/js-lambda-unit-tests/test/unit/runTimeTest.js`:
```js
const chai = require('chai');
const expect = chai.expect;
const app = require('../../app');

describe('When transferring data from one cloud to another: return a list of jobs', () => {
    before(async () => {
    });
    beforeEach(async () => {
    });
    after(async () => {
    });
    afterEach(async () => {
    });

    it('should return a 200 statusCode and array [] of jobs each having a runTime key', async () => {
        const event = {
            'configOverride': true,
            'jobs': [
                {
                    'name': 'gcp_to_s3',
                    'output': 's3',
                    'dryRun': true,
                    'disabled': false,
                    's3Key': 'gcs/',
                    's3Bucket': 'data-staging.avro.aws',
                    'sourceBucket': 'data-staging-gcs-avro',
                },
            ],
        };
        const response = await app.handler(event);
        console.log(response);
        expect(response).to.have.property('statusCode');
        expect(response.statusCode).to.be.deep.equal(200);
        expect((response.data).length).to.equal(1);
        expect((response.data)[0]).to.have.all.keys('name', 'runTime');
    });
});
```
Now let’s add this single unit test to the scripts in our `package.json`, as sometimes we might need to run only one test:
```json
...
"scripts": {
    "local": "export DEBUG=true; export NODE_ENV=staging; run-local-lambda --file app.js --event test/event.json --timeout 1",
    "test-lambda-runtime": "NODE_ENV=test mocha --timeout 10000 ./test/unit/runTimeTest.js",
    ...
```
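As an aside, per-file npm scripts are not the only way to run a single test: mocha can also filter tests by title with its `--grep` flag. A quick sketch:

```bash
# Run only the tests whose title matches "runTime":
npx mocha --recursive ./test/unit --grep "runTime"
```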
Now we can run `npm run test-lambda-runtime` in the command line:

Now let’s add the second test, but this time we want to add new behaviour to our Lambda: it must return status `400` on error. To emulate the error, we will supply an event with a missing job `name`. Our `./test/unit/lambdaResponseTest.js` should look like this:
```js
const chai = require('chai');
const expect = chai.expect;
const app = require('../../app');

describe('When transferring data from one cloud to another: return a response code', () => {
    before(async () => {
    });
    beforeEach(async () => {
    });
    after(async () => {
    });
    afterEach(async () => {
    });

    it('should return a 400 statusCode when Error', async () => {
        const event = {
            'configOverride': true,
            'jobs': [
                {
                    'name_missing': 'gcp_to_s3',
                    'output': 's3',
                    'dryRun': true,
                    'disabled': false,
                    's3Key': 'gcs/',
                    's3Bucket': 'data-staging.avro.aws',
                    'sourceBucket': 'data-staging-gcs-avro',
                },
            ],
        };
        const response = await app.handler(event);
        console.log(response);
        expect(response).to.have.property('statusCode');
        expect(response.statusCode).to.be.deep.equal(400);
    });

    it('should return a 200 statusCode when Succeed', async () => {
        const event = {
            'configOverride': true,
            'jobs': [
                {
                    'name': 'gcp_to_s3',
                    'output': 's3',
                    'dryRun': true,
                    'disabled': false,
                    's3Key': 'gcs/',
                    's3Bucket': 'data-staging.avro.aws',
                    'sourceBucket': 'data-staging-gcs-avro',
                },
            ],
        };
        const response = await app.handler(event);
        console.log(response);
        expect(response).to.have.property('statusCode');
        expect(response.statusCode).to.be.deep.equal(200);
    });
});
```
And let’s add error handling to our `app.js`:
```js
// in lambda.handler add this:
...
if (successfullJobs.errorCode) {
    throw successfullJobs;
}
...

// in processEvent() add this:
...
if (typeof job.name === 'undefined') {
    throw { errorCode: 1, message: 'job.name is missing' };
}
...
```
So the final `app.js` should look like this:
```js
/* eslint-disable no-throw-literal */
const moment = require('moment');

exports.handler = async (event, context) => {
    console.log('Now: ', moment());
    try {
        const jobs = event.jobs;
        const successfullJobs = await processEvent(jobs);
        if (successfullJobs.errorCode) {
            throw successfullJobs;
        }
        console.log(successfullJobs);
        return {
            'statusCode': 200,
            'data': successfullJobs,
            'context': context ? context.succeed() : null,
        };
    } catch (e) {
        return {
            'statusCode': 400,
            'data': e,
            'context': context ? context.done() : null,
        };
    }
};

const processEvent = async (jobs) => {
    const now = moment.utc();
    console.log(jobs);
    const jobList = [];
    for (const job of jobs) {
        try {
            if (typeof job.name === 'undefined') {
                throw { errorCode: 1, message: 'job.name is missing' };
            }
            const jobTime = now.format('YYYY-MM-DD HH:mm');
            jobList.push({
                name: job.name,
                runTime: jobTime,
            });
        } catch (error) {
            return error;
        }
    }
    return jobList;
};
```
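A quick design note: the code above throws object literals, which is why the `eslint-disable no-throw-literal` directive is needed. If you prefer idiomatic errors with stack traces, an alternative is a small custom `Error` subclass. This is just a sketch, not what the repo uses:

```js
class JobConfigError extends Error {
    constructor(message, errorCode) {
        super(message);
        this.name = 'JobConfigError';
        this.errorCode = errorCode; // keeps the handler's errorCode check working
    }
}

// Usage inside processEvent():
// throw new JobConfigError('job.name is missing', 1);
```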
Now let’s change our `package.json` scripts to this:
```json
...
"scripts": {
    "local": "export DEBUG=true; export NODE_ENV=staging; run-local-lambda --file app.js --event test/event.json --timeout 1",
    "test-lambda-runtime": "NODE_ENV=test mocha --timeout 10000 ./test/unit/runTimeTest.js",
    "test-lambda-response": "NODE_ENV=test mocha --timeout 10000 ./test/unit/lambdaResponseTest.js",
    "test-unit": "NODE_ENV=test mocha --exit --recursive ./test/unit",
    "test-integration": "NODE_ENV=test mocha --delay --exit --require ./test/fixtures/bigquery-integration-bootstrap.js --timeout 100000 ./test/integration/multiCategoryArchiveTest.js"
},
...
```
You probably noticed the new script that runs tests recursively in the `./test/unit` folder. `npm run test-unit` will run all tests:

You probably also noticed an integration test script in `package.json`. More about integration tests can be found here:
> [**Data Platform Unit and Integration Tests Explained**](https://mydataschool.com/blog/data-platform-unit-and-integration-tests-explained/)
An integration test is when we test how our application interacts with something else, e.g. another service or API.
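For instance, an integration test for our service could invoke the actually deployed Lambda and assert on the real response. A minimal sketch, assuming the function has been deployed, with placeholder region and function name:

```js
const AWS = require('aws-sdk');
const { expect } = require('chai');

describe('deployed data service (integration)', () => {
    it('returns a 200 statusCode for a valid job event', async () => {
        const lambda = new AWS.Lambda({ region: 'eu-west-1' }); // placeholder region
        const result = await lambda.invoke({
            FunctionName: 'js-lambda-unit-tests', // placeholder function name
            Payload: JSON.stringify({ jobs: [{ name: 'gcp_to_s3' }] }),
        }).promise();
        expect(JSON.parse(result.Payload).statusCode).to.equal(200);
    });
});
```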
## Conclusion
In data engineering and machine learning, tests are frequently used to ensure that new code changes do not introduce unintended modifications, and to help describe the fundamental functionality of the data platform.
It is a must-have skill for any machine learning engineer. Getting used to reading and writing tests might feel uncomfortable at first, but the habit will pay off throughout your path into coding!
It looks great on any CV, and the idea is the same for any programming language. Make sure you add **Pytest**, Mocha, Chai, Sinon and things like Mock to your testing toolkit.
## Github repository
[https://github.com/mshakhomirov/js-lambda-unit-tests.git](https://github.com/mshakhomirov/js-lambda-unit-tests.git)
## Recommended read
1. [https://mochajs.org/](https://mochajs.org/)
2. [https://www.chaijs.com/api/bdd/](https://www.chaijs.com/api/bdd/)
3. [https://stackoverflow.com/questions/52019039/how-to-test-aws-lambda-handler-locally-using-nodejs](https://stackoverflow.com/questions/52019039/how-to-test-aws-lambda-handler-locally-using-nodejs)
4. [https://stackoverflow.com/questions/54846513/lambda-trigger-callback-vs-context-done](https://stackoverflow.com/questions/54846513/lambda-trigger-callback-vs-context-done)
5. [http://opensourceforgeeks.blogspot.com/2019/03/writing-unit-tests-for-aws-lambda-in.html](http://opensourceforgeeks.blogspot.com/2019/03/writing-unit-tests-for-aws-lambda-in.html)
6. [https://dashbird.io/blog/test-javascript-lambda-functions/](https://dashbird.io/blog/test-javascript-lambda-functions/)
7. [https://docs.aws.amazon.com/lambda/latest/dg/nodejs-context.html](https://docs.aws.amazon.com/lambda/latest/dg/nodejs-context.html)
8. [https://www.google.com/search?client=firefox-b-d&q=aws+lambda+test+TypeError%3A+context.done+is+not+a+function](https://www.google.com/search?client=firefox-b-d&q=aws+lambda+test+TypeError%3A+context.done+is+not+a+function)