
Premade AI in the Cloud with Python

Create a simple bot with vision, hearing and speech using Azure and Colab Notebooks

Photo by Eberhard Großgasteiger on Unsplash

There is a fleet of clouds with their own minds floating over the internet, trying to take control of the winds. They’ve been aggressively pushing all kinds of services into the world and absorbing data from every possible source. Among this big bubble of services, an increasing number of companies and applications are relying on pre-made AI resources to extract insights, predict outcomes and gain value out of unexplored information. If you are wondering how to try them out, I’d like to give you an informal overview of what you can expect from sending different types of data to these services. In short, we’ll be sending images, text and audio files high into the clouds and exploring what we get back.

While this way of using AI doesn’t give you direct, full control over what’s happening (as you would have using machine learning frameworks), it’s a quick way for you to play around with several kinds of models and use them in your applications. It’s also a nice way to get to know what’s already out there.

Photo by Franck V. on Unsplash

In general, before we can use any kind of cloud service, we must first:

  • Create a subscription with a particular cloud provider.
  • Create a resource: to register the particular service we’ll be using and
  • Retrieve credentials: to authorize our applications to access the service.

And while there are many cloud providers likely able to suit your needs, we’ll be looking at Microsoft’s Azure Cloud. There are tons of options and tutorials that will likely confuse you if you don’t know where to start, so for the first part of this post we will walk through the setup from scratch and get to know what we need to make use of the following services:

All resources from the Cognitive Services platform of Azure, a nice collection of services with use cases in the areas of vision, speech, language, web search, and decision. In their words:

The goal of Azure Cognitive Services is to help developers create applications that can see, hear, speak, understand, and even begin to reason.

We will put them into action using a Colab notebook, where we will set up everything we need in Azure, implement code to call these services and explore the results. To make it more fun, we will also make use of the camera, microphone and speakers so we can speak to, see and hear responses from the cloud!


Setup

All you need to give life to the notebook is an Azure subscription. After that, you should be able to run the examples in this post without trouble.

Create a subscription

Create an Azure subscription following this link. (If you are currently enrolled in a university, use this link). If it is your first account you should have some trial money to spend, but make sure you always check the prices before using anything. For this tutorial, we’ll be only using free services, but it’s still a good thing to do!

After you have made an account, you will have access to the _Portal._ Here you can manage everything about your subscription and configure huge amounts of stuff.

The Azure Portal

The portal is a very interesting place: there are entire companies managed using this big pack of tools. To avoid getting lost in the woods, I implemented code inside the notebook that sets up everything you will need for this post. But I’ll take time here to explain the basics and give you an idea of how to do it yourself.

But if you already know how to create groups and resources, feel free to skip the Setup in Azure section below entirely and jump straight into the notebook.


Setup in Azure

This is all automated in the notebook but read through if you want to know how to do it yourself.

Creating a resource group

Before you can create specific resources (e.g. Computer Vision or Text Translation), you need to create a group to hold multiple resources together. In other words, each resource we create must belong to a group. This allows you to manage all of them as an entity and keep track of things more easily.

In general, there are two main ways to get stuff done in the cloud: you can either use the graphical user interface (GUI) of the cloud provider (e.g. the Portal of Azure) or type lines of code in a command-line interface (CLI). To make this clear, let’s see how to create a resource group in both ways:

  • To create a resource group using the portal:
Making a resource group

On the left menu, go to "Resource groups", click on "Add" and fill in the "Resource group" and "Region" fields with a name and a location.

Note: for the entire tutorial we will be using MyGroup as our resource group name and West Europe as our region. The location argument specifies which region you want the data center/services to be located in. If you are not in western Europe, besides a bit of latency it shouldn’t change much for you. Although it would be simple to change to another region, the argument will come up over and over again when making use of the resources (often under different names), so if you are setting things up by yourself, use MyGroup and WestEurope to keep things simple and allow you to run the code examples without changes later on.

  • To achieve exactly the same and create MyGroup without having to use the GUI in the Portal, we could have also used the command-line interface, by clicking on >_ in the top menu of your portal:

And typing this in the window that pops up:

az group create -l westeurope -n MyGroup

Just as before, this will make a resource group named MyGroup in the location westeurope.

What you see when clicking on the >_ icon in the portal is the command line of a virtual machine they provide for you. It is really powerful and comes with very handy tools: its own environments for Go, Python and other languages, docker, git, etc. ready to go. But most importantly, it comes with [az](https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli?view=azure-cli-latest), a command-line tool from Azure that provides options to do basically all you see in the portal and more. This is the tool we use inside the notebook to set up everything.

The az tool is way too much to cover here but feel free to type az and start exploring!

Creating a Resource

Let’s now create individual resources and assign them to the group MyGroup that we made. As before, we can either use the GUI or the CLI. For intuition’s sake, this is how the whole process goes using the GUI:

  • Steps to create a resource in the Portal:
Creating a resource
  • On the left menu, click on "Create a resource", search for the one you want (e.g. Computer Vision) and fill in the project details.
  • In "Location", specify which _ region you want the data center/services to be located. (We will be using ‘West Europe’_ ).
  • The "Pricing tier" of each resource defines its cost. For each resource and tier, there will be different prices and conditions. For this tutorial, we always select the free tier F0 to avoid any charges.
  • Finally, we assign our resource to an existing resource group (MyGroup).

After the resource has been made, we need to retrieve its key to be able to use it. The key is the credential that will authorize our application to use the service. This is how you retrieve a key in the Portal:

Finding the authentication key for some resource in the portal
  • On the left menu, go to "All resources" and click on top of your target resource. Under "Resource Management", go to "Keys" and write down the credential.

Briefly explained, but that’s how you create resources and retrieve keys manually in the Portal. I hope this gives you an intuition of how to do it. But following this procedure for all the resources we need would take a while. So, to relieve the pain and prevent readers from getting bored, I made a snippet to set up everything at once using the az tool in the CLI. The snippet will create the following resources (specifying the free tier F0) and retrieve their keys:

Go to the command line in the portal (or install az on your machine):

Then copy, paste and run this snippet:
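The snippet itself is embedded in the notebook. As a rough reconstruction of what it does (written in Python so it can also be run from the notebook, with the resource kinds taken from the keys.py sample below, and assuming az is installed and you are logged in):

# Hedged sketch: create the five resources in the free tier and collect their
# keys into keys.py by shelling out to the az CLI.
import json
import subprocess

resources = ['TextAnalytics', 'LUIS', 'Face', 'SpeechServices', 'ComputerVision']
subscriptions = {}

for resource in resources:
    # create each service individually inside MyGroup, in westeurope, tier F0
    subprocess.run(['az', 'cognitiveservices', 'account', 'create',
                    '--name', resource, '--resource-group', 'MyGroup',
                    '--kind', resource, '--sku', 'F0',
                    '--location', 'westeurope', '--yes'], check=True)

    # retrieve the keys of the resource we just created and keep key1
    result = subprocess.run(['az', 'cognitiveservices', 'account', 'keys', 'list',
                             '--name', resource, '--resource-group', 'MyGroup'],
                            check=True, capture_output=True, text=True)
    subscriptions[resource] = json.loads(result.stdout)['key1']

# write everything into keys.py as a plain dictionary
with open('keys.py', 'w') as f:
    f.write('subscriptions = ' + json.dumps(subscriptions, indent=2) + '\n')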

  • We create an array resources with the names of the resources we want.
  • We iterate over resources and apply the command az cognitiveservices account create to create each service individually. (Here, we specify the location WestEurope, the free tier F0 and the resource group MyGroup that we made.)
  • We loop again over resources and apply the command [az cognitiveservices account keys](https://docs.microsoft.com/de-de/cli/azure/cognitiveservices/account/keys?view=azure-cli-latest) to retrieve the keys for each resource and append them to a file called keys.py.
  • When this has finished running, keys.py should contain a standard dictionary with the resources and their credentials:
# sample keys.py with fake credentials
subscriptions = {
  'TextAnalytics': 'ec96608413easdfe4ad681',
  'LUIS': '5d77e2d0eeef4bef8basd9985',
  'Face': '78987cff4316462sdfa8af',
  'SpeechServices': 'f1692bb6desae84d84af40',
  'ComputerVision': 'a28c6ee267884sdff889be3'
  }

That’s all we need to give life to our notebook :). You can verify what we did by typing:

az resource list 

Alright! If you made it this far, I’ll reward you with a collection of snippets to call and send content to these resources.


Accessing Cognitive Services

Now that we’ve been through the boring stuff, we are ready to use the resources and see what they are capable of. Creating the resources has given us access to their respective REST APIs. There is a whole lot we can do with what we just made.

In short, to access these services we’ll be sending requests with certain parameters and our content to specific URLs, triggering an action on the server and getting back a response. To know how to structure our requests for each of the services we’ll be using, we need to refer to the API docs (here is where you can really appreciate what the API docs are useful for).

Each API has a collection of functions ready to use. For example, with the Computer Vision API, we can perform OCR (Optical Character Recognition) and extract text from images, describe images with words, detect objects, landmarks and more.

A glimpse into the Computer Vision API.

It might feel overwhelming to look up the docs for all the APIs we have access to.

So let’s go over an example to give you an intuition of what you do to get something started. Assume you are looking up the "Analyze Image" method of the Computer Vision API.

  • Go to the API docs, select the method you want ("Analyze Image") and scroll down :O!
  • You’ll see the URL to send your requests to and which parameters it accepts.
  • What the request’s header and body are made of.
  • An example response from the server.
  • Error responses and explanation.
  • And code samples for many languages, including Python, where you just need to replace the resource’s key and the data to be sent. This will become clear once you see it implemented.

Sending a request

The APIs from different Cognitive Services resemble each other a lot! If you go through the code samples, you will notice that they all share the same backbone but are just directed at different URLs. This is basically what the backbone of our requests looks like:
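(What follows is a minimal sketch rather than the exact notebook code; it uses the requests library and points at the "Describe Image" endpoint as an example, with a placeholder image URL.)

import requests
from keys import subscriptions

key = subscriptions['ComputerVision']
url = 'https://westeurope.api.cognitive.microsoft.com/vision/v2.0/describe'

headers = {'Ocp-Apim-Subscription-Key': key,       # the resource key
           'Content-Type': 'application/json'}     # we send a JSON body
body = {'url': 'https://example.com/image.jpg'}    # placeholder image URL
params = {'maxCandidates': 3}                      # optional, API-specific

response = requests.post(url, headers=headers, json=body, params=params)
results = response.json()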

In short, we define headers, body and params (request parameters), send them to a certain URL and receive a response.

  • In the headers we specify the type of data we want to send and the resource key to access the API.
  • In the body we include (or point to) the data we want to send. The body itself can have many forms and depends on the API.
  • In the params we specify the (often optional) parameters to tell the API more specifically what we want.
  • We then send these variables as a standard request to a specific endpoint, for example: westeurope.api.cognitive.microsoft.com/vision/v2.0/describe

In the notebook, we take advantage of these common elements to implement two utility functions, get_headers_body and send_request, to help us make up requests more quickly and avoid repeating too much code.
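Their exact definitions live in the notebook; a minimal sketch of what they might look like (the signatures here are assumptions) is:

import requests
from keys import subscriptions

def get_headers_body(source, key, localfile=False):
    """Build headers and body for an image, given either a URL or a local file."""
    headers = {'Ocp-Apim-Subscription-Key': key}
    if localfile:
        headers['Content-Type'] = 'application/octet-stream'
        with open(source, 'rb') as f:
            body = f.read()                 # send the raw bytes
    else:
        headers['Content-Type'] = 'application/json'
        body = {'url': source}              # point to a remote image
    return headers, body

def send_request(url, headers, body, params=None):
    """POST the request and return the parsed JSON response."""
    if isinstance(body, bytes):
        response = requests.post(url, headers=headers, data=body, params=params)
    else:
        response = requests.post(url, headers=headers, json=body, params=params)
    response.raise_for_status()
    return response.json()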

Let’s now get our hands dirty! Jump into the Colab notebook. I’ve put additional code there to take_picture, show_picture, record_audio, play_audio and more. These will be the sensors and actuators of our notebook and will allow us to interact with the cloud.

We are not gonna cover all that’s possible with each API but will simply look at a couple of methods and how to call them.

For each API, we will define a couple of functions and see practical examples of how to use them. The responses from the APIs often contain a lot of information! We will be parsing these responses and returning only a small part of them (we will not look at the full responses).

Computer Vision API

Processes images and returns various kinds of information.

Photo by Arseny Togulev on Unsplash

Let’s define a set of functions that will return visual information about images. The functions will make up our requests based on the image we want to analyze and send them to specific URLs (endpoints) of the Computer Vision API.

Here we will make use of the functions get_headers_body and send_request to speed things up (check their definitions and more information in the notebook); a sketch of how describe can be built on top of them follows the first example below.

  • [describe](https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/56f91f2e778daf14a499e1fe) : returns visual descriptions about an image.

    Let’s see it in action:

describe(source, number_of_descriptions=3)
Source: URL of the image given as an argument.
A yellow toy car 
A close up of a toy car 
A yellow and black toy car

Not bad!
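Under the hood, describe can be built on top of the two helpers above. A hedged sketch (the endpoint and parameter names follow the "Describe Image" method; the notebook’s version may differ):

vision_base = 'https://westeurope.api.cognitive.microsoft.com/vision/v2.0'

def describe(source, number_of_descriptions=1, localfile=False):
    """Return the captions the Computer Vision API proposes for an image."""
    headers, body = get_headers_body(source, subscriptions['ComputerVision'], localfile)
    params = {'maxCandidates': number_of_descriptions}
    response = send_request(vision_base + '/describe', headers, body, params)
    captions = response['description']['captions']
    return '\n'.join(caption['text'] for caption in captions)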

  • classify: assigns a category to an image and tags it.
    classify(source)
Source
Categories in Image: outdoor_stonerock 
Tags found in the image:
['outdoor', 'nature', 'mountain', 'man', 'water', 'waterfall', 'riding', 'going', 'hill', 'snow', 'covered', 'skiing', 'forest', 'large', 'lake', 'traveling', 'river', 'slope', 'standing', 'white', 'wave']

[read](https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/56f91f2e778daf14a499e1fc) : performs optical character recognition and extracts text in an image. Draws regions where text is located and displays the result.

Besides retrieving the response and printing the extracted text, inside of read we use additional OCR information to draw and label the bounding boxes of detected textual regions in the image. Let’s see it in action:

text = read(source)
for line in text:
    print(line)
ANALYSIS 
Per Quantitatum 
SERIES, FLUXIONES5 
DIFFERENTIAS. 
c UM 
Enumeratione Linearum 
TERTII ORDINIS. 
LONDI,VIS 
Ex Offcina M.DCC.XL

[see](https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/5e0cdeda77a84fcd9a6d3d0a) : returns objects recognized in an image and displays their bounding boxes in the image.

see(source)
In the image of size 800 by 1181 pixels, 2 objects were detected 
person detected at region {'x': 354, 'y': 535, 'w': 106, 'h': 280} 
Toy detected at region {'x': 214, 'y': 887, 'w': 186, 'h': 207}

Face API

Detect, recognize, and analyze human faces in images.

For brevity, we’ll be looking at only one API method:

[detect_face](https://westus.dev.cognitive.microsoft.com/docs/services/563879b61984550e40cbbe8d/operations/563879b61984550f30395236): shows bounding boxes of faces recognized in an image and some information about them (age, sex, and sentiment).

As in see and read, we use an inner function draw_show_boxes to draw the bounding boxes around the detected faces. This is the result:

detect_face(source)
Source

Cool right?

These are all the functions we will try regarding images. But let’s experiment with them a little further by taking a picture with our device using the function take_picture (see in the notebook).

Capture and Send a Picture

Let’s take a picture and see what the cloud thinks about it. Inside all of our functions, we can specify the argument localfile=True to allow us to send local files as binary images.

# turns on the camera and shows button to take a picture
img = take_picture('photo.jpg')

Now let’s see what the cloud "thinks" about it by applying the describe and classify functions:

print(describe(img, localfile=True, number_of_descriptions=3))
>> A man sitting in a dark room 
>> A man in a dark room 
>> A man standing in a dark room
print(classify(img, localfile=True))
>> Categories in Image:   dark_fire 
>> Tags found in the image    ['person', 'indoor', 'man', 'dark', 'sitting', 'looking', 'lit', 'laptop', 'front', 'room', 'light', 'standing', 'dog', 'watching', 'computer', 'wearing', 'mirror', 'black', 'living', 'night', 'screen', 'table', 'door', 'holding', 'television', 'red', 'cat', 'phone', 'white']

Text Analytics

Used to analyze unstructured text for tasks such as sentiment analysis, key phrase extraction and language detection

Photo by Patrick Tomasso on Unsplash
  • detect_language : returns the language detected for each string given.
    detect_language('Was soll das?', 'No tengo ni idea', "Don't look at me!", 'ごめんなさい', 'Sacré bleu!')
>> ['German', 'Spanish', 'English', 'Japanese', 'French']
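The definitions of these functions live in the notebook; as an example, a sketch of detect_language could look like this (assuming version v2.1 of the Text Analytics REST API):

import requests
from keys import subscriptions

text_base = 'https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.1'

def detect_language(*texts):
    """Return the name of the language detected for each given string."""
    headers = {'Ocp-Apim-Subscription-Key': subscriptions['TextAnalytics'],
               'Content-Type': 'application/json'}
    body = {'documents': [{'id': str(i), 'text': text}
                          for i, text in enumerate(texts)]}
    response = requests.post(text_base + '/languages',
                             headers=headers, json=body).json()
    return [doc['detectedLanguages'][0]['name'] for doc in response['documents']]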

key_phrases : returns a list of the keys (important, relevant textual points) for each given string. If none are found, an empty list is returned.

keys = key_phrases('I just spoke with the supreme leader of the galactic federation', 'I was dismissed', 'But I managed to steal the key', 'It was in his coat')
for key in keys:
    print(key)
>> ['supreme leader', 'galactic federation'] 
>> [] 
>> ['key'] 
>> ['coat']

check_sentiment: assigns a positive, negative or neutral sentiment to the strings given.

print(check_sentiment('Not bad', "Not good", 'Good to know', 'Not bad to know', "I didn't eat the hot dog", 'Kill all the aliens'))
>> ['positive', 'negative', 'positive', 'positive', 'negative', 'negative']

find_entities : returns a list of recognized entities, each assigned to a category and, when possible, a Wikipedia link that refers to the entity.

find_entities('Lisa attended the lecture of Richard Feynmann at Cornell')
>> [['Entity: Lisa, Type: Person',
     'Entity: Richard Feynman, Type: Person,
       Link: https://en.wikipedia.org/wiki/Richard_Feynman',
     'Entity: Cornell University, Type: Organization,
       Link: https://en.wikipedia.org/wiki/Cornell_University']]

OCR + Text Analytics

Let’s see how we can couple some stuff together. Applying read on an image effectively generates textual data for you, so it is a perfect match for our text analytics functions. Let’s make up a report function that extracts the individual text regions, analyzes them and makes up a report with our results:
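The exact report function is in the notebook; a rough sketch of the idea (assuming read returns the recognized regions of text) could be:

def report(source):
    """OCR an image and run the text analytics functions on each text region."""
    regions = read(source)                 # text regions found by OCR
    print('# Report')
    for number, text in enumerate(regions, start=1):
        print(f'## Region {number}')
        print(f'> "{text[:30]}..."')
        print('- Language:', detect_language(text)[0])
        print('- Sentiment:', check_sentiment(text)[0])
        print('- Entities:', find_entities(text)[0])
        print('- Keys:', key_phrases(text)[0])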

report(source)
Source of image analyzed
# Report  
## Region 1 
> "o. 4230..." 
- Language: English 
- Sentiment: positive 
- Entities:  
    - 4230, Type: Quantity,  
- Keys:  
## Region 2 
> "WASHINGTON, SATURDAY, APRIL 14, 1866..." 
- Language: English 
- Sentiment: positive 
- Entities:  
    - WASHINGTON, Type: Location,   
    - SATURDAY, APRIL 14, 1866, Type: DateTime,   
    - April 14, Type: Other, Wiki [Link](https://en.wikipedia.org/wiki/April_14) 
- Keys:  
    - WASHINGTON 
## Region 3 
> "PRICE TEN CENTS..." 
- Language: English 
- Sentiment: positive
- Entities:  
    - Tencent, Type: Organization, Wiki [Link](https://en.wikipedia.org/wiki/Tencent)  
    - TEN CENTS, Type: Quantity,  
- Keys:  
    - PRICE  
    - CENTS
...

Speech Services

To convert speech to text and text to speech

Photo by David Laws on Unsplash

We will be using Speech Services to transform voice to text and vice versa. Together with record_audio and play_audio (defined in the notebook), we have a way to hear and talk to our notebook. But before we can use Speech Services, we need to get a token (a secret string) that is valid for 10 minutes. We will do this with the function get_token, as seen below:
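A sketch of get_token (assuming the standard token endpoint of Speech Services):

import requests
from keys import subscriptions

def get_token():
    """Exchange the Speech Services key for a temporary access token."""
    url = 'https://westeurope.api.cognitive.microsoft.com/sts/v1.0/issueToken'
    headers = {'Ocp-Apim-Subscription-Key': subscriptions['SpeechServices']}
    return requests.post(url, headers=headers).text    # valid for ~10 minutes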

We will use it to define the headers of our requests inside our functions and authorize them to use our Speech Services resource.

speech_to_text : receives the file path of an audio file and transforms the recognized speech of the given language into text. A whole lot of languages are supported.

text_to_speech : does the exact opposite and transforms the given text into speech (an audio file) with an almost human voice. This will give a voice to our notebook.
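Sketched against the Speech Services REST API for short audio (endpoints, output format and the voice name are assumptions on my side; the notebook’s definitions may differ), the two functions could look roughly like this:

import requests

def speech_to_text(audio_file, language='en-US'):
    """Send a short WAV recording and return the recognized text."""
    url = ('https://westeurope.stt.speech.microsoft.com'
           '/speech/recognition/conversation/cognitiveservices/v1')
    headers = {'Authorization': 'Bearer ' + get_token(),
               'Content-Type': 'audio/wav; codecs=audio/pcm; samplerate=16000'}
    with open(audio_file, 'rb') as f:
        response = requests.post(url, headers=headers,
                                 params={'language': language}, data=f.read())
    return response.json().get('DisplayText', '')

def text_to_speech(text, filename='speech.wav'):
    """Synthesize text into a WAV file and return its path."""
    url = 'https://westeurope.tts.speech.microsoft.com/cognitiveservices/v1'
    headers = {'Authorization': 'Bearer ' + get_token(),
               'Content-Type': 'application/ssml+xml',
               'X-Microsoft-OutputFormat': 'riff-16khz-16bit-mono-pcm'}
    ssml = ("<speak version='1.0' xml:lang='en-US'>"
            "<voice xml:lang='en-US' name='Microsoft Server Speech "
            "Text to Speech Voice (en-US, JessaRUS)'>" + text +
            "</voice></speak>")
    response = requests.post(url, headers=headers, data=ssml.encode('utf-8'))
    with open(filename, 'wb') as f:
        f.write(response.content)
    return filename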

Because speech_to_text receives an audio file and returns words and text_to_speech receives words and returns an audio file, we can do something like this:

# transform words to speech
audio = text_to_speech("Hi, I'm your virtual assistant")
# transcribe speech back to words
words = speech_to_text(audio)
print(words)
>> Hi I am your virtual assistant

Ok cool! But that seems totally pointless. Let’s do something more interesting. We will record our voice with record_audio, transform it into words with speech_to_text, do something with the words and speak the results out loud.

Let’s check the sentiment of what you say with check_sentiment:

# speak into the mic
my_voice = record_audio('audio.wav')
# transform it into words 
text = speech_to_text(my_voice)
# analyze its feeling
sentiment = check_sentiment(text)[0]
# convert analysis into speech
diagnosis = text_to_speech(sentiment)
# hear the results
play_audio(diagnosis)

Let’s implement this idea inside a function to make it more usable:
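A minimal version, wrapping exactly the steps above (the notebook’s implementation adds some niceties):

def motivational_bot():
    my_voice = record_audio('audio.wav')       # speak into the mic
    text = speech_to_text(my_voice)            # transcribe the recording
    sentiment = check_sentiment(text)[0]       # 'positive' / 'negative' / 'neutral'
    play_audio(text_to_speech(sentiment))      # speak the diagnosis back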

Try it out!

motivational_bot()

Having your voice converted into text means you can use your voice as input to your functions. There is really a lot you can try out with something like this. For example, instead of checking for feelings in the words you say, you could translate them into a bunch of different languages (see the Text Translator API), look something up on the web (see Bing Search) or (going beyond Azure) maybe ask complex questions to answer engines like Wolfram Alpha, etc.

LUIS – Language Understanding

A machine learning-based service to build natural language understanding into apps, bots and IoT devices.

Photo by Gustavo Centurion on Unsplash

Let’s make our notebook more intelligent by giving it the ability to understand certain intentions in language using the LUIS API. In short, we will train a language model that recognizes certain intentions.

For example, let’s say we have the intent to take_picture. After training our model, if our notebook ‘hears’ sentences of the like of:

  • take a photo
  • use the camera and take a screenshot
  • take a pic

It will know that our intention is to take_picture. We call these phrases utterances; they are what we need to provide to teach the language model how to recognize our intents (the tasks or actions we want to perform).

By using varied and nonredundant utterances, as well as adding additional linguistic components such as entities, roles, and patterns, you can create flexible and robust models tailored to your needs. Well-implemented language models (backed by the proper software) are what allow answer engines to respond to questions like "What is the weather in San Francisco?", "How many kilometers from Warsaw to Prague?", "How far is the Sun?" etc.

For this post, we will keep things simple and assign a handful of utterances to each of a few intents. As you might presume, the intents will match some of the functions that we’ve already implemented.

Activate LUIS

In contrast to all the services we’ve seen, LUIS is a complex tool that comes with its own "Portal", where you manage your LUIS apps and create, train, test and iteratively improve your models. But before we can use it, we need to activate a LUIS account. Once you’ve done this:

  • Go to the LUIS dashboard, retrieve the Authoring Key for your account as seen below and paste it in the notebook.
AuthoringKey = '36bd10b73b144a5ba9cb4126ksdfs82ad2'

This was a point of confusion for me: the Authoring Key for your LUIS account is not the same as the key of the LUIS resource we made. You can, however, assign Azure resources to the LUIS app (e.g. to open up access points in different regions); refer here for more detailed information.

Creating a LUIS app

The LUIS Portal makes it very easy to create, delete and improve your LUIS models. But in this post, we’ll be using the LUIS programmatic API to set things up from within the notebook using the AuthoringKey.

Let’s start off by creating the app:
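A hedged sketch of create_luis_app against the LUIS authoring (programmatic) API v2.0; the exact fields kept inside luis_config are an assumption:

import requests

authoring_base = 'https://westeurope.api.cognitive.microsoft.com/luis/api/v2.0'
authoring_headers = {'Ocp-Apim-Subscription-Key': AuthoringKey,
                     'Content-Type': 'application/json'}

def create_luis_app(name, version='0.1'):
    """Create a new LUIS app and return its id plus the parameters we chose."""
    body = {'name': name, 'culture': 'en-us', 'initialVersionId': version}
    response = requests.post(authoring_base + '/apps/',
                             headers=authoring_headers, json=body)
    app_id = response.json()                   # the API returns the new app's id
    luis_config = {'version': version, 'region': 'westeurope'}
    return app_id, luis_config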

app_id, luis_config = create_luis_app('Notebot')

In this implementation, we keep track of the app ID (returned by the server) and the parameters that we specified inside app_id and luis_config as global variables for later use.

Add intents and utterances

Let’s now define a function to add intents and a function to add their respective utterances.

  • [create_intent](https://westus.dev.cognitive.microsoft.com/docs/services/5890b47c39e2bb17b84a55ff/operations/5890b47c39e2bb052c5b9c0c) : adds one intent to the LUIS app. The target app is specified by the variables app_id and luis_config.
  • add_utterances : adds a batch of examples/utterances to an existing intent in the LUIS app.

    With these functions, let’s define our language model inside a dictionary and apply them to it. There is big room for experimentation at this stage.

    The keys of this dictionary will be the intents for our application.
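The exact dictionary lives in the notebook; a small, hypothetical version, with keys chosen to match the vision functions we defined earlier, could look like this (LUIS also keeps a built-in None intent for everything else):

# hypothetical language model: intent -> example utterances
intentions = {
    'describe':     ['describe what you see', 'tell me what you are seeing',
                     'give me some descriptions', 'what is in the picture?'],
    'see':          ['what objects do you see?', 'detect objects',
                     'show me the objects in the picture', 'find things in the image'],
    'detect_faces': ['do you see any faces?', 'any homo sapiens in the picture?',
                     'look for people', 'find the faces'],
    'read':         ['read the text', 'extract the text from the image',
                     'what does it say?', 'perform OCR'],
}

Let’s now loop over the intents and create them: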

intents = intentions.keys()
for intent in intents:
    create_intent(intent)

Each intent has a few example utterances; let’s now add these to their respective intents.

for intent, utterances in intentions.items():
    add_utterances(intent=intent, utterances=utterances)
Photo by Lenin Estrada on Unsplash

Train the model

Let’s now train the model with the information we’ve specified with [train_luis_app](https://westus.dev.cognitive.microsoft.com/docs/services/5890b47c39e2bb17b84a55ff/operations/5890b47c39e2bb052c5b9c45).

train_luis_app(app_id, luis_config)

Publish the application

We are now ready to publish the application with [publish_app](https://westus.dev.cognitive.microsoft.com/docs/services/5890b47c39e2bb17b84a55ff/operations/5890b47c39e2bb052c5b9c3b).

publish_app(app_id)

Making a prediction

Let’s see if our model is any good by making some predictions of our intents. Note that LUIS has a separate API for making predictions, the LUIS endpoint API.

  • [understand](https://westus.dev.cognitive.microsoft.com/docs/services/5819c76f40a6350ce09de1ac/operations/5819c77140a63516d81aee78): to predict the intent using the given text
    understand('Can you give me some descriptions about what you are seeing?')
# predicted intent is:
>> `describe`
understand('Any homo sapiens in the picture?')
>> `detect_faces`

Cool! Now our notebook can approximately understand our intentions from plain language. But having to type the text ourselves doesn’t seem so helpful. The notebook should hear what we say and understand the intention that we have. Let’s address this by writing a function hear that uses understand together with the functions record_audio and speech_to_text.
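A sketch of hear is just the composition of what we already have:

def hear():
    audio = record_audio('command.wav')    # speak into the mic
    text = speech_to_text(audio)           # transcribe the recording
    return understand(text)                # let the LUIS app predict the intent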

We can now call hear to speak into the mic, transcribe our speech into words and predict the intent that we mean using our LUIS app.

intent = hear()
# see the prediction
print(intent)

Using the app

Let’s write a function that triggers a set of actions based on the predicted or recognized intent and maps them to the code to execute.

In short: a function that executes what should happen when a certain intent is predicted. There is big room for experimentation here.
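A hedged sketch of this dispatch step, using the hypothetical intents from above (the real Notebot in the notebook adds its own phrasing and extra steps):

def act_on(intent):
    """Take a picture and run the actions mapped to the predicted intent."""
    img = take_picture('photo.jpg')
    if intent == 'describe':
        play_audio(text_to_speech(describe(img, localfile=True)))
    elif intent == 'see':
        see(img, localfile=True)
    elif intent == 'detect_faces':
        detect_face(img, localfile=True)
    elif intent == 'read':
        play_audio(text_to_speech(' '.join(read(img, localfile=True))))

def Notebot():
    play_audio(text_to_speech("Hi, what would you like me to do?"))
    intent = hear()
    if intent == 'None':                   # nothing recognized: ask again
        return Notebot()
    act_on(intent)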

To finalize, let’s summon the Notebot to fulfill our wishes:

Depending on what you say, the "Notebot" can take a picture and:

  • speak out loud a description
  • display any detected objects
  • display any detected faces.
  • apply OCR and read out loud the results.
# summon your creation
Notebot()

The Notebot will run a set of actions based on what you say.

Let’s sum up what happens when you call it. In the beginning, you will hear a greeting message. After that, the Notebot will apply hear and start recording anything you say; your speech (the percept) will be transcribed to words and sent to the LUIS application to predict the intention that you have. Based on this prediction, a different set of actions will be executed. In case no clear intent is recognized from your speech, the intent "None" will be predicted and the Notebot will call itself again.

Seen from above, the Notebot ends up acting as a simple reflex-based agent: it finds a rule whose condition matches the current situation and executes it (in this case, what the Notebot does when you say one thing or another).

At this point, you might like to upgrade your agent with additional concepts, e.g. adding memory of what is perceived. But I’ll leave that task for the diligent reader.

Cleaning up

This article got way longer than I intended, so let’s clean things up before we finish. To remove everything we made in the cloud, delete the entire resource group (along with all its resources) and the keys.py file (with your credentials) in the command line >_ by running:

az group delete --name MyGroup 
rm keys.py

Alright, I hope this tutorial gave you at least a couple of ideas to implement in your projects.

That’s all from me! Thank you for reading the whole thing 🙂

