
There is a fleet of clouds with their own minds floating over the internet, trying to take control of the winds. They’ve been pushing all kinds of services into the world very aggressively and absorbing data from every possible source. Within this big bubble of services, an increasing number of companies and applications rely on pre-made AI resources to extract insights, predict outcomes and gain value from unexplored information. If you are wondering how to try them out, I’d like to give you an informal overview of what you can expect from sending different types of data to these services. In short, we’ll be sending images, text and audio files high into the clouds and exploring what we get back.
While this way of using AI doesn’t give you direct, full control over what’s happening (as you would have using machine learning frameworks), it’s a quick way to play around with several kinds of models and use them in your applications. It’s also a nice way to get to know what’s already out there.

In general, before we can use any kind of cloud service, we must first:
- Create a subscription with a particular cloud provider.
- Create a resource: to register the particular service we’ll be using and
- Retrieve credentials: to authorize our applications to access the service.
And while there are many cloud providers likely able to suit your needs, we’ll be looking at Microsoft’s Azure cloud. There are tons of options and tutorials that will likely confuse you if you don’t know where to start, so for the first part of this post we will walk through the setup from scratch and get to know what we need to make use of the following services:
All resources from the Cognitive Services platform of Azure, a nice collection of services with use cases in the areas of vision, speech, language, web search, and decision. In their words:
The goal of Azure Cognitive Services is to help developers create applications that can see, hear, speak, understand, and even begin to reason.
We will put them into action using a Colab notebook, where we will set up everything we need in Azure, implement code to call these services and explore the results. To make it more fun, we will also make use of the camera, microphone and speakers to speak, see and hear responses from the cloud!
Setup
All you need to give life to the notebook is an Azure subscription. After that, you should be able to run the examples in this post without trouble.
Create a subscription
Create an Azure subscription following this link. (If you are currently enrolled in a university, use this link). If it is your first account you should have some trial money to spend, but make sure you always check the prices before using anything. For this tutorial, we’ll be only using free services, but it’s still a good thing to do!
After you have made an account, you will have access to the _Portal_. Here you can manage everything about your subscription and configure huge amounts of stuff.

The portal is a very interesting place; entire companies are managed using this big pack of tools. To not get lost in the woods, I implemented code inside the notebook to set up all you will need for this post. But I’ll take time here to explain the basics and give you an idea of how to do it yourself.
But if you already know how to create groups and resources, feel free to skip entirely the Setup in Azure section below and jump straight into the notebook.
Setup in Azure
This is all automated in the notebook but read through if you want to know how to do it yourself.
Creating a resource group
Before you can create specific resources (e.g. Computer Vision or Text Translation), you need to create a group to hold multiple resources together. In other words, each resource we create must belong to a group. This allows you to manage all of them as an entity and keep track of things more easily.
In general, there are two main ways to get stuff done in the cloud: you can either use the graphical user interface (GUI) of the cloud provider (e.g. the Azure Portal) or type lines of code in a command-line interface (CLI). To make this clear, let’s see how to create a resource group in both ways:
- To create a resource group using the portal:

On the left menu, go to "Resource groups", click on "Add" and fill in the "Resource group" and "Region" fields with a name and a location.
Note: for the entire tutorial we will be using `MyGroup` as our resource group name and West Europe as our region. The location argument specifies which region you want the data center/services to be located in. If you are not in West Europe, besides a bit of latency it shouldn’t change much for you. Although it would be simple to switch to another region, this argument will come up over and over again when making use of the resources (often under different names), so if you are setting things up by yourself, use `MyGroup` and `WestEurope` to keep things simple and to be able to run the code examples without changes later on.
- To achieve exactly the same and create `MyGroup` without having to use the GUI in the Portal, we could have also used the command-line interface, by clicking on `>_` in the top menu of your portal:

And typing this in the window that pops up:
az group create -l westeurope -n MyGroup
Just as before, this will make a resource group named `MyGroup` in the location `westeurope`.
What you see after clicking on the `>_` icon in the portal is the command line of a virtual machine they provide for you. It is really powerful and comes with very handy tools: its own environments for Go, Python and other languages, Docker, Git, etc., ready to go. But most importantly, it comes with [az](https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli?view=azure-cli-latest), a command-line tool from Azure that lets you do basically all you see in the portal and more. This is the tool we use inside the notebook to set up everything.
The `az` tool is way too much to cover here, but feel free to type `az` and start exploring!

Creating a Resource
Let’s now create individual resources and assign them to the group `MyGroup` that we made. As before, we can either use the GUI or the CLI. For intuition’s sake, this is how the whole process goes using the GUI:
- Steps to create a resource in the Portal:

- On the left menu, click on "Create a resource", search for the one you want (e.g. Computer Vision) and fill in the project details.
- In "Location", specify which _ region you want the data center/services to be located. (We will be using ‘West Europe’_ ).
- The "Pricing tier" of each resource defines its cost. For each resource and tier, there will be different prices and conditions. For this tutorial, we always select the free tier F0 to avoid any charges.
- Finally, we assign our resource to an existing resource group (`MyGroup`).
After the resource has been made, we need to retrieve its key to be able to use it. The key is the credential that will authorize our application to use the service. This is how you retrieve a key in the Portal:

- On the left menu, go to "All resources" and click on top of your target resource. Under "Resource Management", go to "Keys" and write down the credential.
Briefly explained, but that’s how you create resources and retrieve keys manually in the Portal. I hope this gives you an intuition of how to do it. But following this procedure for all the resources we need would take a while. So, to relieve the pain and prevent readers from getting bored, I made a snippet to set up everything at once using the `az` tool in the CLI. The snippet will create the resources we need (Text Analytics, LUIS, Face, Speech Services and Computer Vision), specifying the free tier F0, and retrieve their keys.
Go to the command line in the portal (or install `az` on your machine):

Then copy, paste and run this snippet:
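The exact snippet lives in the notebook as a small shell loop over the az commands described below; as a rough, hedged equivalent you could run instead (assuming the az CLI is installed and you are logged in), the same thing can be done from Python:
# Rough Python equivalent of the notebook's shell snippet (an illustrative sketch).
# Assumes the `az` CLI is installed and you are logged in to your subscription.
import json
import subprocess

resources = ['TextAnalytics', 'LUIS', 'Face', 'SpeechServices', 'ComputerVision']

# create each resource in the free tier F0, inside MyGroup / westeurope
for kind in resources:
    subprocess.run(['az', 'cognitiveservices', 'account', 'create',
                    '--name', kind, '--resource-group', 'MyGroup',
                    '--kind', kind, '--sku', 'F0',
                    '--location', 'westeurope', '--yes'], check=True)

# retrieve the first key of each resource and collect them all into keys.py
keys = {}
for kind in resources:
    out = subprocess.run(['az', 'cognitiveservices', 'account', 'keys', 'list',
                          '--name', kind, '--resource-group', 'MyGroup'],
                         capture_output=True, text=True, check=True)
    keys[kind] = json.loads(out.stdout)['key1']

with open('keys.py', 'w') as f:
    f.write('subscriptions = ' + json.dumps(keys, indent=4))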
- We create a list `resources` with the specific names of the services we want.
- We iterate over `resources` and apply the command `az cognitiveservices account create` to create each service individually (here, we specify the location `WestEurope`, the free tier `F0` and the resource group `MyGroup` we made).
- We loop again over `resources` and apply the command [az cognitiveservices account keys](https://docs.microsoft.com/de-de/cli/azure/cognitiveservices/account/keys?view=azure-cli-latest) to retrieve the keys for each resource and collect them in a file called `keys.py`.
- When this has finished running, `keys.py` should contain a standard dictionary with the resources and their credentials:
# sample keys.py with fake credentials
subscriptions = {
'TextAnalytics': 'ec96608413easdfe4ad681',
'LUIS': '5d77e2d0eeef4bef8basd9985',
'Face': '78987cff4316462sdfa8af',
'SpeechServices': 'f1692bb6desae84d84af40',
'ComputerVision': 'a28c6ee267884sdff889be3'
}
That’s all we need to give life to our notebook :). You can verify what we did by typing:
az resource list
Alright! If you made it this far, I’ll reward you with a collection of snippets to call and send content to these resources.
Accessing Cognitive Services
Now that we’ve been through the boring stuff, we are ready to use the resources and see what they are capable of. Creating the resources has given us access to their respective REST APIs. There is a whole lot we can do with what we just made.
In short, to access these services we’ll be sending requests with certain parameters and our content to specific URLs, triggering an action on the server and getting back a response. To know how to structure our requests for each of the services we’ll be using, we need to refer to the API docs (here is where you can really appreciate what an API is useful for).
Each API has a collection of functions ready to use. For example, with the Computer Vision API, we can perform OCR (Optical Character Recognition) and extract text from images, describe images with words, detect objects, landmarks and more.

It might feel overwhelming to look up the docs for all the APIs we have access to:
So let’s go over an example to give you an intuition of what you do to get something started. Assume you are looking up the "Analyze Image" method of the Computer Vision API.
- Go to the API docs, select the method you want ("Analyze Image") and scroll down :O!
- You see the URL to send your requests to, and with which parameters.
- What the request’s header and body are made of.
- An example response from the server.
- Error responses and explanation.
- And code samples for many languages, including Python, where you just need to replace the resource’s key and the data to be sent. This will become clear once you see it implemented.
Sending a request
The APIs from different Cognitive Services resemble each other a lot! If you go through the code samples, you will notice that they all share the same backbone but are just directed at different URLs. This is basically what the backbone of our requests looks like:
In short, we define `headers`, `body` and `params` (request parameters), send them to a certain URL and receive a response.
- In the `headers` we specify the type of data we want to send and the resource key to access the API.
- In the `body` we include (or point to) the data we want to send. The body itself can take many forms and depends on the API.
- In the `params` we specify the (often optional) parameters that tell the API more specifically what we want.
- We then send these variables as a standard request to a specific endpoint, for example: `westeurope.api.cognitive.microsoft.com/vision/v2.0/describe`
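For illustration, here is roughly what that backbone could look like with Python’s requests library, using the describe endpoint above and the keys.py file from the setup (the image URL is just a placeholder):
# a minimal sketch of the request backbone (illustrative, not the notebook's exact code)
import requests
from keys import subscriptions

url = 'https://westeurope.api.cognitive.microsoft.com/vision/v2.0/describe'
headers = {'Ocp-Apim-Subscription-Key': subscriptions['ComputerVision'],
           'Content-Type': 'application/json'}
params = {'maxCandidates': 3}                          # optional request parameters
body = {'url': 'https://example.com/some_image.jpg'}   # placeholder: the content we want analyzed

response = requests.post(url, headers=headers, params=params, json=body)
print(response.json())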
In the notebook, we take advantage of these common elements to implement two utility functions, `get_headers_body` and `send_request`, to help us make up requests more quickly and avoid repeating too much code.
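Their exact definitions live in the notebook; roughly, they might look something like this:
# hedged sketch of the two helpers (the notebook's actual implementations may differ)
import requests
from keys import subscriptions

def get_headers_body(source, key, localfile=False):
    """Build request headers and body for an image given as a URL or a local file path."""
    headers = {'Ocp-Apim-Subscription-Key': key}
    if localfile:
        headers['Content-Type'] = 'application/octet-stream'
        with open(source, 'rb') as f:
            body = f.read()                  # send the raw bytes of a local file
    else:
        headers['Content-Type'] = 'application/json'
        body = {'url': source}               # or just point the service to a URL
    return headers, body

def send_request(url, headers, params, body):
    """POST the request and return the parsed JSON response."""
    if isinstance(body, dict):
        response = requests.post(url, headers=headers, params=params, json=body)
    else:
        response = requests.post(url, headers=headers, params=params, data=body)
    response.raise_for_status()
    return response.json()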
Let’s now get our hands dirty! Jump into the Colab notebook. I’ve put there additional code for `take_picture`, `show_picture`, `record_audio`, `play_audio` and more. These will be the effectors and actuators in our notebook and will allow us to interact with the cloud.
We are not gonna cover all that’s possible with each API but will simply look at a couple of methods and how to call them.
For each API, we will define a couple of functions and see practical examples of how to use them. The responses from the APIs often contain a lot of information! We will be parsing these responses and returning only a small part of it (we will not look at the full responses).
Computer Vision API
Processes images and returns various information.

Let’s define a set of functions that will return visual information about images. The functions will make up our requests based on the image we want to analyze and send them to specific URLs (endpoints) of the Computer Vision API.
Here we will make use of the functions `get_headers_body` and `send_request` to speed things up (check their definitions and more information in the notebook).
[describe](https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/56f91f2e778daf14a499e1fe): returns visual descriptions about an image. Let’s see it in action:
describe(source, number_of_descriptions=3)

A yellow toy car
A close up of a toy car
A yellow and black toy car
Not bad!
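For a sense of how thin these wrappers can be, here is a hedged sketch of what `describe` might look like on top of the helpers above (the notebook’s version handles a few more details):
# illustrative sketch of describe; relies on get_headers_body/send_request from above
def describe(source, number_of_descriptions=3, localfile=False):
    url = 'https://westeurope.api.cognitive.microsoft.com/vision/v2.0/describe'
    headers, body = get_headers_body(source, subscriptions['ComputerVision'], localfile)
    params = {'maxCandidates': number_of_descriptions}
    analysis = send_request(url, headers, params, body)
    # keep only the text of the returned captions
    return [caption['text'] for caption in analysis['description']['captions']]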
`classify`: assigns a category to an image and tags it.
classify(source)

Categories in Image: outdoor_stonerock
Tags found in the image:
['outdoor', 'nature', 'mountain', 'man', 'water', 'waterfall', 'riding', 'going', 'hill', 'snow', 'covered', 'skiing', 'forest', 'large', 'lake', 'traveling', 'river', 'slope', 'standing', 'white', 'wave']
[read](https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/56f91f2e778daf14a499e1fc): performs optical character recognition and extracts the text in an image, draws the regions where text is located and displays the result.
Besides retrieving the response and printing the extracted text, inside of `read` we use additional OCR information to draw and label the bounding boxes of the detected text regions in the image. Let’s see it in action:
text = read(source)
for line in text:
print(line)

ANALYSIS
Per Quantitatum
SERIES, FLUXIONES5
DIFFERENTIAS.
c UM
Enumeratione Linearum
TERTII ORDINIS.
LONDI,VIS
Ex Offcina M.DCC.XL
[see](https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/5e0cdeda77a84fcd9a6d3d0a): returns the objects recognized in an image and displays their bounding boxes in the image.
see(source)

In the image of size 800 by 1181 pixels, 2 objects were detected
person detected at region {'x': 354, 'y': 535, 'w': 106, 'h': 280}
Toy detected at region {'x': 214, 'y': 887, 'w': 186, 'h': 207}
Face API
Detect, recognize, and analyze human faces in images.
For brevity, we’ll be looking at only one API method:
[detect_face](https://westus.dev.cognitive.microsoft.com/docs/services/563879b61984550e40cbbe8d/operations/563879b61984550f30395236): shows bounding boxes of the faces recognized in an image and some information about them (age, sex, and sentiment).
As in `see` and `read`, we use an inner function `draw_show_boxes` to draw the bounding boxes around the detected faces. This is the result:
detect_face(source)

Cool right?
These are all the functions we will try regarding images. But let’s experiment with them a little further by taking a picture with our device using the function `take_picture` (see the notebook).
Capture and Send a Picture
Let’s take a picture and see what the cloud thinks about it. Inside all of our functions, we can specify the argument `localfile=True` to allow us to send local files as binary images.
# turns on the camera and shows button to take a picture
img = take_picture('photo.jpg')
Now let’s see what the cloud "thinks" about it by applying the `describe` and `classify` functions:
print(describe(img, localfile=True, number_of_descriptions=3))
>> A man sitting in a dark room
>> A man in a dark room
>> A man standing in a dark room
print(classify(img, localfile=True))
>> Categories in Image: dark_fire
>> Tags found in the image ['person', 'indoor', 'man', 'dark', 'sitting', 'looking', 'lit', 'laptop', 'front', 'room', 'light', 'standing', 'dog', 'watching', 'computer', 'wearing', 'mirror', 'black', 'living', 'night', 'screen', 'table', 'door', 'holding', 'television', 'red', 'cat', 'phone', 'white']
Text Analytics
Used to analyze unstructured text for tasks such as sentiment analysis, key phrase extraction and language detection

`detect_language`: returns the language detected for each given string.
detect_language('Was soll das?', 'No tengo ni idea', "Don't look at me!", 'ごめんなさい', 'Sacré bleu!')
>> ['German', 'Spanish', 'English', 'Japanese', 'French']
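Under the hood, a function like this just posts the strings to the Text Analytics languages endpoint; a rough sketch, assuming the v2.1 REST API:
# hedged sketch of detect_language against the Text Analytics v2.1 REST API
import requests
from keys import subscriptions

def detect_language(*texts):
    url = 'https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.1/languages'
    headers = {'Ocp-Apim-Subscription-Key': subscriptions['TextAnalytics']}
    body = {'documents': [{'id': str(i), 'text': t} for i, t in enumerate(texts)]}
    response = requests.post(url, headers=headers, json=body).json()
    # return only the name of the most likely language for each document
    return [doc['detectedLanguages'][0]['name'] for doc in response['documents']]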
`key_phrases`: returns a list of the key phrases (important, relevant textual points) for each given string. If none are found, an empty list is returned.
keys = key_phrases('I just spoke with the supreme leader of the galactic federation', 'I was dismissed', 'But I managed to steal the key', 'It was in his coat')
for key in keys:
print(key)
>> ['supreme leader', 'galactic federation']
>> []
>> ['key']
>> ['coat']
`check_sentiment`: assigns a positive, negative or neutral sentiment to the given strings.
print(check_sentiment('Not bad', "Not good", 'Good to know', 'Not bad to know', "I didn't eat the hot dog", 'Kill all the aliens'))
>> ['positive', 'negative', 'positive', 'positive', 'negative', 'negative']
`find_entities`: returns a list of recognized entities, each assigned to a category and, if possible, a Wikipedia link that refers to the entity.
find_entities('Lisa attended the lecture of Richard Feynmann at Cornell')
>> [['Entity: Lisa, Type: Person',
'Entity: Richard Feynman, Type: Person,
Link:https://en.wikipedia.org/wiki/Richard_Feynman',
'Entity: Cornell University, Type: Organization,
Link https://en.wikipedia.org/wiki/Cornell_University']]
OCR + Text Analytics
Let’s see how we can couple some stuff together. Applying `read` to an image effectively generates textual data for you, so it is the perfect match for our text analytics functions. Let’s make up a `report` function that extracts the individual text regions, analyzes them and compiles a report with the results:
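A rough sketch of such a function, assuming `read` can hand back one chunk of text per detected region (the notebook’s version formats its Markdown output in a similar spirit):
# illustrative sketch of report; relies on read, detect_language, check_sentiment,
# find_entities and key_phrases as defined in the notebook
def report(source):
    lines = ['# Report']
    for i, region_text in enumerate(read(source), start=1):   # one text chunk per region (assumption)
        lines.append(f'## Region {i}')
        lines.append(f'> "{region_text[:40]}..."')
        lines.append(f'- Language: {detect_language(region_text)[0]}')
        lines.append(f'- Sentiment: {check_sentiment(region_text)[0]}')
        lines.append('- Entities:')
        lines.extend(f'  - {entity}' for entity in find_entities(region_text)[0])
        lines.append('- Keys:')
        lines.extend(f'  - {key}' for key in key_phrases(region_text)[0])
    print('\n'.join(lines))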
report(source)

# Report
## Region 1
> "o. 4230..."
- Language: English
- Sentiment: positive
- Entities:
- 4230, Type: Quantity,
- Keys:
## Region 2
> "WASHINGTON, SATURDAY, APRIL 14, 1866..."
- Language: English
- Sentiment: positive
- Entities:
- WASHINGTON, Type: Location,
- SATURDAY, APRIL 14, 1866, Type: DateTime,
- April 14, Type: Other, Wiki [Link](https://en.wikipedia.org/wiki/April_14)
- Keys:
- WASHINGTON
## Region 3
> "PRICE TEN CENTS..."
- Language: English
- Sentiment: positive
- Entities:
- Tencent, Type: Organization, Wiki [Link](https://en.wikipedia.org/wiki/Tencent)
- TEN CENTS, Type: Quantity,
- Keys:
- PRICE
- CENTS
...
Speech Services
To convert audio to text and text-to-speech

We will be using Speech Services to transform voice into text and vice versa. Together with `record_audio` and `play_audio` (defined in the notebook), we have a way to hear and talk to our notebook. But before we can use Speech Services, we need to get a token (a secret string) that is valid for 10 minutes. We will do this with the function `get_token`, as seen below:
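A minimal sketch of `get_token`, assuming the standard token endpoint of our region:
# hedged sketch of get_token using the Speech Services token endpoint
import requests
from keys import subscriptions

def get_token():
    url = 'https://westeurope.api.cognitive.microsoft.com/sts/v1.0/issueToken'
    headers = {'Ocp-Apim-Subscription-Key': subscriptions['SpeechServices']}
    response = requests.post(url, headers=headers)
    return response.text        # a token string, valid for about 10 minutes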
We will use it to define the `headers` of our requests inside our functions and authorize them to use our Speech Services resource.
`speech_to_text`: receives the file path of an audio file and transforms the recognized speech in the given language into text. A whole lot of languages are supported.
`text_to_speech`: does the exact opposite and transforms the given text into speech (an audio file) with an almost human voice. This will give a voice to our notebook.
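As a hedged sketch (using the REST API for short audio; the notebook may differ in the details), `speech_to_text` could look roughly like this:
# illustrative sketch of speech_to_text; expects a 16 kHz mono WAV file
import requests

def speech_to_text(audio_file, language='en-US'):
    url = 'https://westeurope.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1'
    headers = {'Authorization': 'Bearer ' + get_token(),
               'Content-Type': 'audio/wav; codecs=audio/pcm; samplerate=16000',
               'Accept': 'application/json'}
    params = {'language': language}
    with open(audio_file, 'rb') as f:
        response = requests.post(url, headers=headers, params=params, data=f.read())
    return response.json().get('DisplayText', '')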
Because `speech_to_text` receives an audio file and returns words, and `text_to_speech` receives words and returns an audio file, we can do something like this:
# transform words to speech
audio = text_to_speech("Hi, I'm your virtual assistant")
# transcribe speech back to words
words = speech_to_text(audio)
print(words)
>> Hi I am your virtual assistant
Ok, cool! But that seems totally pointless. Let’s do something more interesting. We will record our voice with `record_audio`, transform it into words with `speech_to_text`, do something with the words and speak the results out loud.
Let’s check the sentiment of what you say with `check_sentiment`:
# speak into the mic
my_voice = record_audio('audio.wav')
# transform it into words
text = speech_to_text(my_voice)
# analyze its feeling
sentiment = check_sentiment(text)[0]
# convert analysis into speech
diagnosis = text_to_speech(sentiment)
# hear the results
play_audio(diagnosis)
Let’s implement this idea inside a function to make it more usable:
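A minimal sketch of such a function, simply wrapping the pipeline above (the file name is just an example):
# motivational_bot: record, transcribe, analyze and speak back the sentiment
def motivational_bot():
    my_voice = record_audio('audio.wav')      # speak into the mic
    text = speech_to_text(my_voice)           # transcribe the recording
    sentiment = check_sentiment(text)[0]      # 'positive', 'negative' or 'neutral'
    diagnosis = text_to_speech(sentiment)     # turn the diagnosis into speech
    play_audio(diagnosis)                     # hear the results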
Try it out!
motivational_bot()
Having your voice converted into text means you can use your voice as input to your functions. There is really a lot you can try out with something like this. For example, instead of checking the sentiment of the words you say, you could translate them into a bunch of different languages (see the Text Translator API), look something up on the web (see Bing Search) or (going beyond Azure) maybe ask complex questions to answer engines like Wolfram Alpha, etc.
LUIS – Language Understanding
A machine learning-based service to build natural language understanding into apps, bots and IoT devices.

Let’s make our notebook more intelligent by giving it the ability to understand certain intentions in language using the LUIS API. In short, we will train a linguistic model that recognizes certain intentions in language.
For example, let’s say we have the intent `take_picture`. After training our model, if our notebook ‘hears’ sentences like:
- take a photo
- use the camera and take a screenshot
- take a pic
It will know that our intention is to `take_picture`. We call these phrases utterances, and they are what we need to provide to teach the language model how to recognize our intents – the tasks or actions we want to perform.
By using varied and non-redundant utterances, as well as adding additional linguistic components such as entities, roles and patterns, you can create flexible and robust models tailored to your needs. Well-implemented language models (backed by the proper software) are what allow answer engines to respond to questions like "What is the weather in San Francisco?", "How many kilometers from Warsaw to Prague?", "How far is the Sun?" etc.
For this post, we will keep things simple and assign 5 utterances to a handful of intents. As you might presume, the intents will match some of the functions that we’ve already implemented.
Activate LUIS
In contrast to all the services we’ve seen, LUIS is a complex tool that comes with its own "Portal", where you manage your LUIS apps and create, train, test and iteratively improve your models. But before we can use it, we need to activate a LUIS account. Once you’ve done this:
- Go to the LUIS dashboard, retrieve the Authoring Key for your account as seen below and paste it into the notebook.

AuthoringKey = '36bd10b73b144a5ba9cb4126ksdfs82ad2'
This was a point of confusion for me: the Authoring Key for your LUIS account is not the same as the key of the LUIS resource we made. You can, however, assign Azure resources to the LUIS app (e.g. to open up access points in different regions); refer here for more detailed information.
Creating a LUIS app
The LUIS Portal makes it very easy to create, delete and improve your LUIS models. But in this post, we’ll be using the LUIS programmatic API to set things up from within the notebook using the `AuthoringKey`.
Let’s start off by creating the app:
app_id, luis_config = create_luis_app('Notebot' )
In this implementation, we keep track of the app ID (returned by the server) and the parameters that we specified inside `app_id` and `luis_config`, as global variables for later use.
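A hedged sketch of what `create_luis_app` might do against the authoring API (the notebook’s version may track a few more settings):
# illustrative sketch of create_luis_app using the LUIS authoring (programmatic) API
import requests

def create_luis_app(name, version='0.1'):
    url = 'https://westeurope.api.cognitive.microsoft.com/luis/api/v2.0/apps/'
    headers = {'Ocp-Apim-Subscription-Key': AuthoringKey}
    body = {'name': name, 'culture': 'en-us', 'initialVersionId': version}
    app_id = requests.post(url, headers=headers, json=body).json()   # the service returns the new app's ID
    luis_config = {'appId': app_id, 'versionId': version, 'headers': headers}
    return app_id, luis_config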
Add intents and utterances
Let’s now define a function to add intents and a function to add their respective utterances.
[create_intent](https://westus.dev.cognitive.microsoft.com/docs/services/5890b47c39e2bb17b84a55ff/operations/5890b47c39e2bb052c5b9c0c): adds one intent to the LUIS app. Which app is specified by the variables `app_id` and `luis_config`.
`add_utterances`: adds a batch of example utterances to an existing intent in the LUIS app.
With these functions, let’s define our language model inside a dictionary, as seen below, and apply them to it. There is big room for experimentation at this stage.
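Here is an illustrative `intentions` dictionary; the intents mirror functions we already have, and the utterances are only examples, so feel free to replace them with your own:
# an illustrative language model: intent -> example utterances (4 each)
intentions = {
    'take_picture': ['take a photo', 'use the camera and take a screenshot',
                     'take a pic', 'capture an image'],
    'describe':     ['describe what you see', 'tell me what is in front of you',
                     'give me some descriptions of the picture', 'what does it look like'],
    'detect_faces': ['any faces in the picture', 'do you see any people',
                     'any homo sapiens in the picture', 'look for faces'],
    'read':         ['read the text', 'extract the text from the image',
                     'perform OCR on the picture', 'what does the text say'],
    'see':          ['what objects do you see', 'detect the objects in the image',
                     'find objects in the picture', 'look for objects']
}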
The keys of this dictionary will be the intents for our application. Let’s loop over them and create them:
intents = intentions.keys()
for intent in intents:
create_intent(intent)
Each intent has 4 example utterances; let’s now add these to their respective intents.
for intent, utterances in intentions.items():
add_utterances(intent=intent, utterances=utterances)

Train the model
Let’s now train the model with the information we’ve specified, using [train_luis_app](https://westus.dev.cognitive.microsoft.com/docs/services/5890b47c39e2bb17b84a55ff/operations/5890b47c39e2bb052c5b9c45).
train_luis_app(app_id, luis_config)
Publish the application
We are now ready to publish the application with [publish_app](https://westus.dev.cognitive.microsoft.com/docs/services/5890b47c39e2bb17b84a55ff/operations/5890b47c39e2bb052c5b9c3b).
publish_app(app_id)
Making a prediction
Let’s see if our model is of any use by making predictions of our intents. Note that LUIS has a separate API for making predictions, the LUIS endpoint API.
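A hedged sketch of querying it, using the key of the LUIS resource we created earlier and the global `app_id` from `create_luis_app`:
# illustrative sketch of understand using the LUIS v2.0 prediction (endpoint) API
import requests
from keys import subscriptions

def understand(text):
    url = f'https://westeurope.api.cognitive.microsoft.com/luis/v2.0/apps/{app_id}'
    params = {'subscription-key': subscriptions['LUIS'], 'q': text}
    response = requests.get(url, params=params).json()
    return response['topScoringIntent']['intent']   # keep only the name of the best intent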
[understand](https://westus.dev.cognitive.microsoft.com/docs/services/5819c76f40a6350ce09de1ac/operations/5819c77140a63516d81aee78): predicts the intent of the given text.
understand('Can you give me some descriptions about what you are seeing?')
# predicted intent is:
>> `describe`
understand('Any homo sapiens in the picture?')
>> `detect_faces`
Cool! Now our notebook can approximately understand our intentions from plain language. But having to type the text ourselves doesn’t seem so helpful. The notebook should hear what we say and understand the intention that we have. Let’s address this by writing a function `hear` that uses `understand` together with the functions `record_audio` and `speech_to_text`.
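A minimal sketch of `hear` (the audio file name is just an example):
# hear: record from the mic, transcribe the speech and predict the intent with LUIS
def hear():
    audio = record_audio('command.wav')   # example file name
    text = speech_to_text(audio)
    return understand(text)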
We can now call `hear` to speak into the mic, transcribe our speech into words and predict the intention we mean using our LUIS app.
intent = hear()
# see the prediction
print(intent)
Using the app
Let’s write a function that triggers a set of actions based on the predicted or recognized intent and maps it to the code to execute.
In short: a function that executes what should happen when a certain intent is predicted (a sketch follows the list below). There is big room for experiments here.
To finalize, let’s summon the Notebot to fulfill our wishes:
Depending on what you say, the "Notebot" can take a picture and:
- speak out loud a description
- display any detected objects
- display any detected faces
- apply OCR and read the results out loud.
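As a hedged sketch, such a dispatcher might look like this (the exact actions and phrasing are up to you):
# illustrative sketch of the Notebot dispatcher; relies on the functions defined above
def Notebot():
    play_audio(text_to_speech("Hi, what can I do for you?"))   # greeting message
    intent = hear()                                            # listen and predict the intent
    if intent == 'None':
        return Notebot()                                       # no clear intent: ask again
    img = take_picture('photo.jpg')                            # every action below starts with a picture
    if intent == 'describe':
        for caption in describe(img, localfile=True):
            play_audio(text_to_speech(caption))                # speak out loud a description
    elif intent == 'see':
        see(img, localfile=True)                               # display detected objects
    elif intent == 'detect_faces':
        detect_face(img, localfile=True)                       # display detected faces
    elif intent == 'read':
        for line in read(img, localfile=True):
            play_audio(text_to_speech(line))                   # read the results out loud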
# summon your creation
Notebot()
The `Notebot` will run a set of actions based on what you say.
Let’s sum up what happens when you call it. In the beginning, you will hear a greeting message. After that, the `Notebot` will apply `hear` and start recording anything you say; your speech (the percept) will be transcribed into words and sent to the LUIS application to predict the intention that you have. Based on this prediction, a different set of actions will be executed. In case no clear intent is recognized from your speech, the intent "None" will be predicted and the `Notebot` will call itself again.
Seen from above, the `Notebot` ends up acting as a simple reflex-based agent, one that simply finds a rule whose condition matches the current situation and executes it (in this case, what the `Notebot` does if you say this or something else).
At this point, you might like to upgrade your agent with additional concepts, e.g adding memory about what is perceived. But I’ll leave that task for the diligent reader.
Cleaning up
This article got way longer than I intended, so let’s clean things up before we finish. To clean up everything we made in the cloud, delete the entire resource group (along with all the resources) and the `keys.py` file (with your credentials) in the command line `>_` by running:
az group delete --name MyGroup
rm keys.py
Alright, I hope this tutorial gave you at least a couple of ideas to implement in your projects.
That’s all from me! Thank you for reading the whole thing 🙂