
Amazon Comprehend is part of the broader AWS AI/ML stack. It's the primary Natural Language Processing (NLP) service AWS offers, and it provides a great deal of flexibility. In the past I've explained how you can use the Comprehend APIs to automate Sentiment Analysis and Entity Recognition.
For most use cases, however, you'll have a specific dataset that you want to train your model on. With Comprehend's AutoML capabilities we can fine-tune a model on our own dataset for Custom Classification. In this article we'll take a sample spam dataset and use the Comprehend APIs to launch a Custom Classification training job. From the trained Comprehend model we can then create an endpoint that performs real-time inference.
NOTE: For those of you new to AWS, make sure you create an account at the following link if you want to follow along. Costs will be incurred through the training and deployment process, especially if you leave your endpoint up and running. This article also assumes basic familiarity with AWS and working with an AWS SDK.
Setup
To get started we'll pull the sample dataset from the following Kaggle link. You can work with this dataset in any environment; I'll be using a classic SageMaker Notebook Instance (ml.c5.xlarge) for some additional compute power, but you can also work within your local Anaconda environment or whatever you're comfortable with.
First we download the dataset and get some preliminary information with Pandas. Comprehend supports two types of custom classification: Multi-class mode and Multi-label mode. This example is a Multi-class problem, as each document belongs to exactly one of two classes: Spam or Ham.
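Here's a minimal sketch of that inspection, assuming the Kaggle download is named spam.csv with the usual v1 (label) and v2 (text) columns; Comprehend's Multi-class mode expects a headerless, two-column CSV with the label in the first column.

```python
import pandas as pd

# Assumed filename/columns for the Kaggle SMS spam dataset; adjust to your download
df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]

print(df.shape)
print(df["label"].value_counts())  # two classes (ham/spam) -> Multi-class mode

# Comprehend expects a headerless CSV with the label in the first column
df.to_csv("comprehend-train.csv", index=False, header=False)
```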

To work with Comprehend we'll be using the Boto3 Python SDK, but you can use any language supported by the AWS SDKs.
A few housekeeping activities we need to take care of are uploading the dataset to S3 and making sure Comprehend has access to S3 for model training. First we create an S3 bucket (make sure it has a globally unique name), then upload our dataset to it.
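A rough sketch of those two steps with Boto3; the bucket name below is just an example and must be globally unique.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-comprehend-spam-demo"  # hypothetical name; pick your own unique bucket name

# In us-east-1 no LocationConstraint is needed; other regions require one
s3.create_bucket(Bucket=bucket)

# Upload the headerless training CSV prepared earlier
s3.upload_file("comprehend-train.csv", bucket, "train/comprehend-train.csv")
train_s3_uri = f"s3://{bucket}/train/comprehend-train.csv"
```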
Next, go to IAM and create a role for Comprehend with the ComprehendDataAccessRolePolicy attached.

Make sure to grab the ARN of this role, as you will need to provide it later for model training.
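If you'd rather script the role creation than click through the Console, a sketch with Boto3 might look like the following; the role name is hypothetical, and the attached policy is the managed one mentioned above.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Comprehend service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "comprehend.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="ComprehendCustomClassifierRole",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="ComprehendCustomClassifierRole",
    PolicyArn="arn:aws:iam::aws:policy/ComprehendDataAccessRolePolicy",
)

data_access_role_arn = role["Role"]["Arn"]  # needed for model training
```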
Model Training
For model training we first need to create a Document Classifier that specifies the location of our training data. This submits and launches a training job.
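A minimal sketch of that call with Boto3, reusing the training CSV location and IAM role ARN from the Setup section; the classifier name is just an example.

```python
import boto3

comprehend = boto3.client("comprehend")

response = comprehend.create_document_classifier(
    DocumentClassifierName="spam-classifier",  # hypothetical name
    DataAccessRoleArn=data_access_role_arn,    # Comprehend IAM role ARN from earlier
    InputDataConfig={"S3Uri": train_s3_uri},   # S3 location of the training CSV
    LanguageCode="en",
    Mode="MULTI_CLASS",
)
classifier_arn = response["DocumentClassifierArn"]
```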
Once the training job launches, we can monitor its status until it completes successfully.
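For example, a simple polling loop against describe_document_classifier:

```python
import time

while True:
    props = comprehend.describe_document_classifier(
        DocumentClassifierArn=classifier_arn
    )["DocumentClassifierProperties"]
    status = props["Status"]
    print(status)
    if status in ("TRAINED", "IN_ERROR"):
        break
    time.sleep(60)  # training typically takes tens of minutes
```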
We can also monitor this training job and see it execute successfully in the Console. This step can take anywhere from 30–60 minutes depending on the size of your dataset, the number of classes/labels, and other factors.

Here we can also see different metrics, such as Accuracy and F1-score, that are tracked during training. Using this trained Document Classifier we can grab its model ARN and use it either to deploy a Real-Time Endpoint or to run Batch Inference. For this example we'll look at a Real-Time Endpoint, but check out the documentation at the following link for asynchronous Batch Jobs.
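These evaluation metrics are also available programmatically once training finishes; a quick way to pull them from the same describe call:

```python
# Evaluation metrics computed by Comprehend on the held-out test split
metrics = comprehend.describe_document_classifier(
    DocumentClassifierArn=classifier_arn
)["DocumentClassifierProperties"]["ClassifierMetadata"]["EvaluationMetrics"]

print(metrics)  # includes Accuracy, Precision, Recall, F1Score
```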
Endpoint Creation & Invocation
To create the endpoint we continue to use the Comprehend APIs. We first grab the trained Document Classifier's ARN and feed it into a create_endpoint call that we can monitor until the endpoint has been created. This step should take about 10–15 minutes.
One parameter to note here is DesiredInferenceUnits; this is the knob you can adjust for throughput. Each inference unit represents a throughput of 100 characters per second, so make sure to choose a value sufficient for your use case. Once the endpoint reaches the IN_SERVICE status, we can grab a sample data point and feed it to our endpoint for inference.
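A sketch of endpoint creation and the wait loop, reusing the classifier ARN from training; the endpoint name is just an example.

```python
import time

endpoint_response = comprehend.create_endpoint(
    EndpointName="spam-classifier-endpoint",  # hypothetical name
    ModelArn=classifier_arn,
    DesiredInferenceUnits=1,  # 1 inference unit ~ 100 characters/second of throughput
)
endpoint_arn = endpoint_response["EndpointArn"]

while True:
    status = comprehend.describe_endpoint(
        EndpointArn=endpoint_arn
    )["EndpointProperties"]["Status"]
    print(status)
    if status in ("IN_SERVICE", "FAILED"):
        break
    time.sleep(60)

sample_text = df["text"].iloc[0]  # a sample data point from the dataset loaded earlier
```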

Using the classify_document call we can then feed in this sample point for inference.
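For example, reusing the sample text and endpoint ARN from above:

```python
result = comprehend.classify_document(
    Text=sample_text,
    EndpointArn=endpoint_arn,
)
print(result["Classes"])  # predicted class names with confidence scores
```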

Make sure to delete the endpoint with the following call if you do not want it up and running.
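Using the endpoint ARN from earlier:

```python
# Tear down the real-time endpoint to stop incurring charges
comprehend.delete_endpoint(EndpointArn=endpoint_arn)
```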
Additional Resources & Conclusion
GitHub – RamVegiraju/aws-comprehend-custom-classifier: AWS Comprehend Custom Classification
For the full code for this article, access the link above. For further examples and resources around Amazon Comprehend, check out how you can repeat this entire process directly in the AWS Console without any code. For further Comprehend code samples, check out this AWS Samples repository.
I hope this article was a good introduction to one of the core AI/ML services AWS offers. In future articles we’ll explore how we can integrate this service with others in the AWS ML stack.
If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter. If you’re new to Medium, sign up using my Membership Referral.