Since the release of ChatGPT, large language models (LLMs) have received a huge amount of attention in both industry and the media, resulting in an unprecedented demand to leverage LLMs in almost every conceivable context.
Semantic Kernel is an open-source SDK originally developed by Microsoft to power products such as Microsoft 365 Copilot and Bing, designed to make it easy to integrate LLMs into applications. It enables users to leverage LLMs to orchestrate workflows based on natural language queries and commands by making it possible to connect these models with external services that provide additional functionality the model can use to complete tasks.
As it was created with the Microsoft ecosystem in mind, many of the complex examples currently available are written in C#, with fewer resources focusing on the Python SDK. In this blog post, I shall demonstrate how to get started with Semantic Kernel using Python, introducing the key components and exploring how these can be used to perform various tasks.
In this article, we shall cover the following:
- The Kernel
- Connectors
- Prompt Functions
- Creating a custom connector
- Using a Chat Service
- Making a simple chatbot
- Memory
- Using a text embedding service
- Integrating memory into context
- Plugins
- Using out-of-the-box plugins
- Orchestrating workflows with a planner
Disclaimer: Semantic Kernel, like everything related to LLMs, is moving incredibly fast. As such, interfaces may change slightly over time; I will try to keep this post updated where I can.
Whilst I work for Microsoft, I am not asked to, or compensated for, promoting Semantic Kernel in any way. In Industry Solutions Engineering (ISE), we pride ourselves on using what we feel are the best tools for the job, depending on the situation and the customer that we are working with. In cases where we choose not to use Microsoft products, we provide detailed feedback to the product teams on the reasons why, and the areas where we feel things are missing or could be improved; this feedback loop usually results in Microsoft products being well suited for our needs.
Here, I am choosing to promote Semantic Kernel because, despite a few rough edges here and there, I believe that it shows great promise, and I prefer the design choices made by Semantic Kernel compared to some of the other solutions I’ve explored.
The packages used at the time of writing were:
dependencies:
  - python=3.10.1.0
  - pip:
      - semantic-kernel==0.9.3b1
      - timm==0.9.5
      - transformers==4.38.2
      - sentence-transformers==2.2.2
      - curated-transformers==1.1.0
Tl;dr: If you just want to see some working code that you can use directly, all of the code required to replicate this post is available as a notebook here.
Acknowledgements
I’d like to thank my colleague Karol Zak, for collaborating with me on exploring how to get the most out of Semantic Kernel for our use cases, and providing code which inspired some of the examples in this post!

Now, let’s begin with the central component of the library.
The Kernel
Kernel: "The core, center, or essence of an object or system." – Wiktionary
One of the key concepts in Semantic Kernel is the kernel itself, which is the main object that we will use to orchestrate our LLM based workflows. Initially, the kernel has very limited functionality; all of its features are largely powered by external components that we will connect to. The kernel then acts as a processing engine that fulfils a request by invoking appropriate components to complete the given task.
We can create a kernel as demonstrated below:
import semantic_kernel as sk
kernel = sk.Kernel()
Connectors
To make our kernel useful, we need to connect one or more AI models, which enable us to use our kernel to understand and generate natural language; this is done using a connector. Semantic Kernel provides out-of-the-box connectors that make it easy to add AI models from different sources, such as OpenAI, Azure OpenAI, and Hugging Face. These models are then used to provide a service to the kernel.
At the time of writing, the following services are supported:
- text completion service: used to generate natural language
- chat service: used to create a conversational experience
- text embedding generation service: used to encode natural language into embeddings
Each type of service can support multiple models from different sources at the same time, making it possible to switch between different models, depending on the task and the preference of the user. If no specific service or model is specified, the kernel will default to the first service and model that was defined.
We can see all of the currently registered services using the following attribute:
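At the time of writing, the registered services should be exposed through the kernel's services attribute, which maps each service id to its service instance:

kernel.services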

As expected, we don’t currently have any connected services! Let’s change that.
Here, I will start by accessing a GPT-3.5-turbo model which I deployed using the Azure OpenAI service in my Azure subscription.
As this model can be used for both text completion and chat, I will register it with both services.
from semantic_kernel.connectors.ai.open_ai import (
    AzureChatCompletion,
    AzureTextCompletion,
)

kernel.add_service(
    service=AzureTextCompletion(
        service_id="azure_gpt35_text_completion",
        deployment_name=OPENAI_DEPLOYMENT_NAME,
        endpoint=OPENAI_ENDPOINT,
        api_key=OPENAI_API_KEY,
    ),
)
gpt35_chat_service = AzureChatCompletion(
    service_id="azure_gpt35_chat_completion",
    deployment_name=OPENAI_DEPLOYMENT_NAME,
    endpoint=OPENAI_ENDPOINT,
    api_key=OPENAI_API_KEY,
)

kernel.add_service(gpt35_chat_service)
We can now see that the chat service has been registered as both a text completion and a chat completion service.

To use the non-Azure OpenAI API, the only change we would have to make is to use the OpenAITextCompletion and OpenAIChatCompletion connectors instead of our Azure classes. Don’t worry if you don’t have access to OpenAI models; we will look at how to connect to open-source models a little later, and the choice of model won’t affect any of the following steps.
To retrieve a service after we have registered it, we can use the following methods on the kernel.
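For example, assuming the service ids that we registered above, we can retrieve the chat service by its id; get_service is also used in later sections of this post.

chat_service = kernel.get_service("azure_gpt35_chat_completion")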

Now that we have registered some services, let’s explore how we can interact with them!
Prompt functions
The way to interact with an LLM through Semantic Kernel is to create a prompt function. A prompt function expects a natural language input and uses an LLM to interpret what is being asked, then acts accordingly to return an appropriate response. For example, a prompt function could be used for tasks such as text generation, summarization, sentiment analysis, and question answering.
In Semantic Kernel, a semantic function is composed of two components:
- Prompt Template: the natural language query or command that will be sent to the LLM
- Execution config: contains the settings and options for the prompt function, such as the service that it should use, the parameters it should expect, and the description of what the function does.
The simplest way to get started is by using the kernel’s create_function_from_prompt
method, which accepts a prompt and execution config, as well as some identifiers to help keep track of the function in the kernel.
To illustrate this, let’s create a simple prompt:
prompt = """
{{$input}} is the capital city of
"""
Here, we have used the {{$}}
syntax to represent an argument that will be injected into our prompt. Whilst we shall see many more examples of this throughout this post, a comprehensive guide to templating syntax can be found in the documentation.
Next, we need to create an execution config. If we know the type of service that we want to use to execute our function, we can import the corresponding config class and create an instance of this, as demonstrated below.
from semantic_kernel.connectors.ai.open_ai import OpenAITextPromptExecutionSettings

execution_config = OpenAITextPromptExecutionSettings(
    service_id="azure_gpt35_text_completion",
    max_tokens=100,
    temperature=0,
    top_p=0.0,
)
Whilst this works, it does couple our function to a certain type of service, which limits our flexibility. An alternative approach is to retrieve the corresponding configuration class directly from the service we intend to use, as demonstrated below.

This way, we can select the service we wish to use at runtime, and automatically load an appropriate config object. Let’s use this approach to create our execution config.
target_service_id = "azure_gpt35_text_completion"

execution_config = kernel.get_service(target_service_id).instantiate_prompt_execution_settings(
    service_id=target_service_id,
    max_tokens=100,
    temperature=0,
    seed=42,
)
Now, we can create our function!
generate_capital_city_text = kernel.create_function_from_prompt(
    prompt=prompt,
    plugin_name="Generate_Capital_City_Completion",
    function_name="generate_city_completion",
    execution_settings=execution_config,
)
Now, we can call our function using the kernel’s invoke method. As many of our connected services are likely to be calling external APIs, invoke is an asynchronous method, based on Asyncio. This enables us to execute multiple calls to external services simultaneously, without waiting for a response for each one.
response = await kernel.invoke(generate_capital_city_text, input="Paris")
The response object contains valuable information about our function call, such as the parameters that were used; provided everything worked as expected, we can access our result using the str
constructor on the object.
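For example, a minimal way to display the generated text is shown below.

print(str(response))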


Here, we can see that our function has worked!
Using Local Models
In addition to using models behind APIs, we can also use the kernel to orchestrate calls to local models. To illustrate this, let’s register another text completion service, and create a config which enables us to specify that we would like to use our new service. For our second completion service, let’s use a model from the Hugging Face transformers library. To do this, we use the HuggingFaceTextCompletion
connector.
Here, as we will be running the model locally, I have selected OPT-350m, an older model aiming to roughly match the performance of GPT-3, which should run easily on most hardware.
from semantic_kernel.connectors.ai.hugging_face import HuggingFaceTextCompletion
hf_model = HuggingFaceTextCompletion(service_id="hf_text_completion", ai_model_id="facebook/opt-350m", task="text-generation")
kernel.add_service(hf_model)
Now, let’s create our config object. We can do this in a similar way as before, but this time passing the service_id associated with our Hugging Face service.
target_service_id = "hf_text_completion"

execution_config = kernel.get_service(target_service_id).instantiate_prompt_execution_settings(
    service_id=target_service_id,
    max_tokens=100,
    temperature=0,
    seed=42,
)
We can now create and execute our function as we saw earlier.
hf_complete = kernel.create_function_from_prompt(
    prompt=prompt,
    plugin_name="Generate_Capital_City_Completion",
    function_name="generate_city_completion_opt",
    execution_settings=execution_config,
)
response = await kernel.invoke(hf_complete, input='Paris')

Well, the generation seems to have worked, but it is arguably not as good as the response provided by GPT-3.5. This is not unexpected, as this is an older model! Interestingly, we can see that, before it reached its max token limit, it started generating a similar pattern about Berlin; this behaviour is not unusual when dealing with text completion models.
Creating a custom connector
So far, we have seen how to create a prompt function and specify which service should be used to execute it. However, until this point, all of the services we have used have relied on out-of-the-box connectors. In some cases, we may wish to use a model from a different library to those currently supported, for which we will need a custom connector. Let’s look at how we can do this.
As an example, let’s use a transformer model from the curated transformers library.
To create a custom connector, we need to subclass TextCompletionClientBase, which acts as a thin wrapper around our model. A simple example of how to do this is provided below.
from typing import Any, Dict, List, Optional, Union

import torch
from curated_transformers.generation import AutoGenerator, SampleGeneratorConfig

from semantic_kernel.connectors.ai.prompt_execution_settings import PromptExecutionSettings
from semantic_kernel.connectors.ai.text_completion_client_base import TextCompletionClientBase


class CuratedTransformersPromptExecutionSettings(PromptExecutionSettings):
    temperature: float = 0.0
    top_p: float = 1.0

    def prepare_settings_dict(self, **kwargs) -> Dict[str, Any]:
        settings = {
            "temperature": self.temperature,
            "top_p": self.top_p,
        }
        settings.update(kwargs)
        return settings


class CuratedTransformersCompletion(TextCompletionClientBase):
    device: Any
    generator: Any

    def __init__(
        self,
        service_id: str,
        model_name: str,
        device: Optional[int] = -1,
    ) -> None:
        """
        Use a curated transformer model for text completion.

        Arguments:
            model_name {str}
            device {Optional[int]} -- Device to run the model on, -1 for CPU, 0+ for GPU.

        Note that this model will be downloaded from the Hugging Face model hub.
        """
        device = (
            "cuda:" + str(device)
            if device >= 0 and torch.cuda.is_available()
            else "cpu"
        )
        generator = AutoGenerator.from_hf_hub(
            name=model_name, device=torch.device(device)
        )
        super().__init__(
            service_id=service_id,
            ai_model_id=model_name,
            device=device,
            generator=generator,
        )

    async def complete(
        self, prompt: str, settings: CuratedTransformersPromptExecutionSettings
    ) -> Union[str, List[str]]:
        generator_config = SampleGeneratorConfig(**settings.prepare_settings_dict())
        try:
            with torch.no_grad():
                result = self.generator([prompt], generator_config)
            return result[0]
        except Exception as e:
            raise ValueError("CuratedTransformer completion failed", e)

    async def complete_stream(self, prompt: str, request_settings):
        raise NotImplementedError(
            "Streaming is not supported for CuratedTransformersCompletion."
        )

    def get_prompt_execution_settings_from_settings(
        self, settings: CuratedTransformersPromptExecutionSettings
    ) -> CuratedTransformersPromptExecutionSettings:
        return settings
Now, we can register our connector and create a semantic function as demonstrated before. Here, I am using the Falcon-7B model, which requires a GPU to run in a reasonable amount of time; I used an NVIDIA A100 on an Azure virtual machine, as running it locally was too slow.
kernel.add_service(
    CuratedTransformersCompletion(
        service_id="custom",
        model_name="tiiuae/falcon-7b",
        device=-1,
    )
)

complete = kernel.create_function_from_prompt(
    prompt=prompt,
    plugin_name="Generate_Capital_City_Completion",
    function_name="generate_city_completion_curated",
    prompt_execution_settings=CuratedTransformersPromptExecutionSettings(
        service_id="custom", temperature=0.0, top_p=0.0
    ),
)

print(await kernel.invoke(complete, input="Paris"))

Once again, we can see that the generation has worked, but it quickly descends into repetition after it has answered our question.
A likely reason for this is the model that we have selected. Commonly, autoregressive transformer models are trained to predict the next word on a large corpus of text; essentially making them powerful autocomplete machines! Here, it appears that it has tried to ‘complete’ our question, which has resulted in it continuing to generate text, which isn’t helpful for us.
Using a Chat Service
Some LLMs have undergone additional training to make them more useful to interact with. An example of this process is detailed in OpenAI’s InstructGPT paper.
At a high level, this usually involves adding one or more supervised finetuning steps where, instead of random unstructured text, the model is trained on curated examples of tasks such as question answering and summarisation; these models are usually known as instruction-tuned or chat models.
As we already observed how base LLMs can generate more text than we need, let’s investigate whether a chat model will perform differently. To use our chat model, we need to update our config to specify an appropriate service and create a new function; we shall use azure_gpt35_chat_completion in our case.

target_service_id = "azure_gpt35_chat_completion"

generate_capital_city_chat = kernel.create_function_from_prompt(
    prompt=prompt,
    plugin_name="Generate_Capital_City",
    function_name="capital_city_chat_2",
    prompt_execution_settings=kernel.get_service(target_service_id).instantiate_prompt_execution_settings(
        service_id=target_service_id, temperature=0.0, top_p=0.0, seed=42
    ),
)
print(await kernel.invoke(generate_capital_city_chat, input="Paris"))

Excellent, we can see that the chat model has given us a much more concise answer!
Previously, as we were using text completion models, we had formatted our prompt as a sentence for the model to complete. However, the instruction tuned models should be able to understand a question, so we may be able to change our prompt to make it a little more flexible. Let’s see how we can adjust our prompt with the aim of interacting with the model as if it was a chatbot designed to provide us information about places that we may like to visit.
First, let’s adjust our function config to make our prompt more generic.
chatbot = kernel.create_function_from_prompt(
    prompt="{{$input}}",
    plugin_name="Chatbot",
    function_name="chatbot",
    prompt_execution_settings=kernel.get_service(target_service_id).instantiate_prompt_execution_settings(
        service_id=target_service_id, temperature=0.0, top_p=0.0, seed=42
    ),
)
Here, we can see that we are only passing in the user input, so we must phrase our input as a question. Let’s try this.
async def chat(user_input):
    print(await kernel.invoke(chatbot, input=user_input))
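For example, we can ask a question such as the following; the exact question is just an illustration.

await chat("What is the capital city of France?")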

Great, that seems like it has worked. Let’s try asking a follow up question.
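Using the same helper, the follow-up looks like this.

await chat("What are some interesting things to do there?")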

We can see that the model has provided a very generic answer, which doesn’t take into account our previous question at all. This is expected, as the prompt that the model received was "What are some interesting things to do there?"; we didn’t provide any context on where ‘there’ is!
Let’s see how we can extend our approach to make a simple chatbot in the following section.
Making a simple Chatbot
Now that we have seen how we can use a chat service, let’s explore how we can create a simple chatbot.
Our chatbot should be able to do three things:
- Know its purpose and inform us of this
- Understand the current conversation context
- Reply to our questions
Let’s adjust our prompt to reflect this.
chatbot_prompt = """
"You are a chatbot to provide information about different cities and countries.
For other questions not related to places, you should politely decline to answer the question, stating your purpose"
+++++
{{$history}}
User: {{$input}}
ChatBot: """
Notice that we have added the variable history
which will be used to provide previous context to the chatbot. Whilst this is quite a naive approach, as long conversations will quickly cause the prompt to reach the model’s maximum context length, it should work for our purposes.
So far, we have only used prompts which use a single variable. To use multiple variables, we need to adjust our config, as demonstrated below, by creating a PromptTemplateConfig, which defines the inputs we are expecting.
from semantic_kernel.prompt_template.input_variable import InputVariable
execution_config = kernel.get_service(target_service_id).instantiate_prompt_execution_settings(
    service_id=target_service_id,
    max_tokens=500,
    temperature=0,
    seed=42,
)

prompt_template_config = sk.PromptTemplateConfig(
    template=chatbot_prompt,
    name="chat",
    template_format="semantic-kernel",
    input_variables=[
        InputVariable(name="input", description="The user input", is_required=True),
        InputVariable(name="history", description="The conversation history", is_required=True),
    ],
    execution_settings=execution_config,
)
Now, let’s use this updated config and prompt to create our chatbot
chatbot = kernel.create_function_from_prompt(
    function_name="chatbot_with_history",
    plugin_name="chatPlugin",
    prompt_template_config=prompt_template_config,
)
To keep track of the history to include in our prompt, we can use a ChatHistory
object. Let’s create a new instance of this.
from semantic_kernel.contents.chat_history import ChatHistory
chat_history = ChatHistory()
Additionally, to pass multiple arguments to our prompt, we can use a KernelArguments object, so that we only pass a single parameter, containing all of our arguments, to invoke.
We can see how to do this by creating a simple chat function, which updates our history after each interaction.
from pprint import pprint

from semantic_kernel.functions.kernel_arguments import KernelArguments


async def chat(input_text, verbose=True):
    # Wrap the new message and the existing history in a single arguments object
    context = KernelArguments(input=input_text, history=chat_history)

    if verbose:
        # print the full prompt before each interaction
        print("Prompt:")
        print("-----")
        # inject the variables into our prompt
        print(await chatbot.prompt_template.render(kernel, context))
        print("-----")

    # Process the user message and get an answer
    answer = await kernel.invoke(chatbot, context)

    # Show the response
    pprint(f"ChatBot: {answer}")

    # Append the new interaction to the chat history
    chat_history.add_user_message(input_text)
    chat_history.add_assistant_message(str(answer))
Let’s try it out!
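For example, we can have a short exchange such as the following; the exact questions are illustrative, with the final one checking that the chatbot declines questions unrelated to places.

await chat("Hi, I would like to visit London. Can you tell me some facts about the city?")
await chat("What are some interesting things to do there?")
await chat("Can you help me write some Python code?")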



Here, we can see that this has fulfilled our requirements quite well!
Inspecting our prompt, we can see that our history is being rendered into a format which has the option of including additional metadata. Whilst this may be a useful implementation detail, it is likely that we don’t want our prompt formatted this way!
When using a library such as Semantic Kernel, it is important to be able to verify exactly what is being passed into the model, as the way that a prompt is written and formatted can have a big impact on the result.
Many language model APIs, such as the OpenAI chat APIs, do not take a single prompt as an input, but expect inputs formatted as a list of messages, alternating between the user and the model. We can inspect how our prompt will be broken down into messages below.

Here, we can see that all of the formatting associated with the chat history has been removed, and the messages appear how we would expect.
Memory
When interacting with our chatbot, one of the key aspects that made the experience feel like a useful interaction was that the chatbot was able to retain the context of our previous questions. We did this by giving the chatbot access to memory, leveraging ChatHistory
to handle this for us.
Whilst this worked well enough for our simple use case, all of our conversation history was stored in our system’s RAM and not persisted anywhere; once we shut down our system, this is gone forever. For more intelligent applications, it can be useful to be able to build and persist both short and long term memory for our models to access.
Additionally, in our example, we were feeding all of our previous interactions into our prompt. As models usually have a fixed size context window – which determines how long our prompts can be – this will quickly break down if we start to have lengthy conversations. One way to avoid this is to store our memory as separate ‘chunks’ and only load information that we think may be relevant into our prompt.
Semantic Kernel offers some functionality around how we can incorporate memory into our applications, so let’s explore how we can leverage these.
As an example, let’s extend our chatbot so that it has access to some information that is stored in memory.
First, we need some information that may be relevant to our chatbot. Whilst we could manually research and curate relevant information, it is quicker to have the model generate some for us! Let’s get the model to generate some facts about the city of London. We can do this as follows.
response = await kernel.invoke(
    chatbot,
    KernelArguments(
        input="""Please provide a comprehensive overview of things to do in London. Structure your answer in 5 paragraphs, based on:
- overview
- landmarks
- history
- culture
- food

Each paragraph should be 100 tokens, do not add titles such as `Overview:` or `Food:` to the paragraphs in your response.
Do not acknowledge the question, with a statement like "Certainly, here's a comprehensive overview of things to do in London".
Do not provide a closing comment.""",
        history=chat_history,
    ),
)

london_info = str(response)

Now that we have some text, let’s divide it into chunks, so that the model can access only the parts that it needs. Semantic Kernel offers some functionality to do this in its text_chunker module. We can use this as demonstrated below:
from semantic_kernel.text import text_chunker as tc
chunks = tc.split_plaintext_paragraph([london_info], max_tokens=100)
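We can confirm how many chunks were produced as follows.

len(chunks)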

We can see that the text has been split into 8 chunks. Depending on the text, we will have to adjust the maximum number of tokens specified for each chunk.
Using a Text Embedding Service
Now that we have chunked our data, we need to create a representation of each chunk that enables us to calculate relevance between text; we can do this by representing our text as embeddings.
To generate embeddings, we need to add a text embedding service to our kernel. Similarly to before, there are various connectors that can be used, depending on the source of the underlying model.
First, let’s use a text-embedding-ada-002
model deployed in the Azure OpenAI service. This model was trained by OpenAI, and more information about this model can be found in their launch blog post.
from semantic_kernel.connectors.ai.open_ai import AzureTextEmbedding
embedding_service = AzureTextEmbedding(
    service_id="azure_openai_embedding",
    deployment_name=OPENAI_EMBEDDING_DEPLOYMENT_NAME,
    endpoint=OPENAI_ENDPOINT,
    api_key=OPENAI_API_KEY,
)

kernel.add_service(embedding_service)
Now that we have access to a model that can generate embeddings, we need somewhere to store these. Semantic Kernel provides the concept of a MemoryStore, which is an interface to various persistence providers.
For production systems, we would probably want to use a database for our persistence but, to keep things simple for our example, we shall use in-memory storage. Let’s create an instance of an in-memory memory store.
memory_store = sk.memory.VolatileMemoryStore()
For more complex systems, Semantic Kernel offers connectors to popular storage solutions such as CosmosDB, Redis, Postgres and many others. As memory stores share a common interface, switching providers only requires changing the connector used.
Now that we have defined our memory store, we need to generate our embeddings. Semantic Kernel provides semantic memory data structures to help with this, which associate a memory store with a service that can generate embeddings. Here, we are going to use SemanticTextMemory, which will enable us to embed and retrieve our document chunks.
from semantic_kernel.memory.semantic_text_memory import SemanticTextMemory
memory = SemanticTextMemory(storage=memory_store, embeddings_generator=embedding_service)
We can now save information to our memory store as follows.
for i, chunk in enumerate(chunks):
    await memory.save_information(
        collection="London", id="chunk" + str(i), text=chunk
    )
Here, we have created a new collection, to group similar documents.
We can now query this collection in the following way:
results = await memory.search(
    "London", "what food should I eat in London?", limit=2
)
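To inspect what has been returned, we can print the text and relevance score of each result; here I am assuming that each result exposes text and relevance attributes (result.text is also used later in this post).

for result in results:
    print(f"{result.relevance:.3f}: {result.text}")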

Looking at the results, we can see that relevant information has been returned; which is reflected by the high relevance scores.
However, this was quite an easy case, as we have information directly relating to what is being asked, using very similar language. Let’s try a more subtle query.
results = await memory.search(
    "London", "Where can I eat non-british food in London?", limit=2
)

Here, we can see that we have received exactly the same results. However, as our second result explicitly mentions ‘food from around the world’, I feel that this is a better match. This highlights some of the potential limitations of a semantic search approach.
Using an Open Source model
Out of interest, let’s see how an open source model compares with our OpenAI service in this context. We can register a Hugging Face sentence transformer model for this purpose, as demonstrated below:
from semantic_kernel.connectors.ai.hugging_face import HuggingFaceTextEmbedding
hf_embedding_service = HuggingFaceTextEmbedding(
    service_id="hf_embedding_service",
    ai_model_id="sentence-transformers/all-MiniLM-L6-v2",
    device=-1,
)

hf_memory = SemanticTextMemory(
    storage=sk.memory.VolatileMemoryStore(), embeddings_generator=hf_embedding_service
)
We can now populate and query this memory store in the same way as before.
for i, chunk in enumerate(chunks):
    await hf_memory.save_information(
        collection="hf_London", id="chunk" + str(i), text=chunk
    )

hf_results = await hf_memory.search(
    "hf_London", "what food should I eat in London", limit=2, min_relevance_score=0
)

hf_results = await hf_memory.search(
    "hf_London",
    "Where can I eat non-british food in London?",
    limit=2,
    min_relevance_score=0,
)

We can see that we have returned the same chunks, but our relevance scores are different. We can also observe the difference in the dimensions of the embeddings generated by the different models.

Integrating memory into context
In our previous example, we saw that whilst we could identify broadly relevant information based on an embedding search, for more subtle queries we didn’t receive the most relevant result. Let’s explore whether we can improve upon this.
One way that we could approach this is to provide the relevant information to our chatbot, and then let the model decide which parts are the most relevant. Let’s create a prompt which instructs the model to answer the question based on the context provided, and register a prompt function.
prompt_with_context = """
Use the following pieces of context to answer the users question.
This is the only information that you should use to answer the question, do not reference information outside of this context.
If the information required to answer the question is not provided in the context, just say that "I don't know", don't try to make up an answer.
----------------
Context: {{$context}}
----------------
User question: {{$question}}
----------------
Answer:
"""
execution_config = kernel.get_service(target_service_id).instantiate_prompt_execution_settings(
    service_id=target_service_id,
    max_tokens=500,
    temperature=0,
    seed=42,
)
prompt_template_config = sk.PromptTemplateConfig(
    template=prompt_with_context,
    name="chat",
    template_format="semantic-kernel",
    input_variables=[
        InputVariable(name="question", description="The user question", is_required=True),
        InputVariable(name="context", description="The context retrieved from memory", is_required=True),
    ],
    execution_settings=execution_config,
)
chatbot_with_context = kernel.create_function_from_prompt(
    function_name="chatbot_with_memory_context",
    plugin_name="chatPluginWithContext",
    prompt_template_config=prompt_template_config,
)
Now, we can use this function to answer our more subtle question. First, let’s define our question.
question = "Where can I eat non-british food in London?"
Next, we can manually perform our embedding search, and add the retrieved information to our context.
results = await hf_memory.search("hf_London", question, limit=2)
Then, we create a KernelArguments object containing our question and the retrieved context.
context = KernelArguments(
    question=question, context="\n".join([result.text for result in results])
)
Finally, we can execute our function.
answer = await kernel.invoke(chatbot_with_context, context)

This time, we can see that the response references the information that we were looking for and provides a better answer!
Plugins
A plugin in Semantic Kernel is a group of functions that can be loaded into the kernel to be exposed to AI apps and services. The functions within plugins can then be orchestrated by the kernel to accomplish tasks.
The documentation describes plugins as the "building blocks" of Semantic Kernel, which can be chained together to create complex workflows; as plugins follow the OpenAI plugin specification, plugins created for OpenAI services, Bing, and Microsoft 365 can be used with Semantic Kernel.
Semantic Kernel provides several plugins out-of-the-box, which include:
- ConversationSummaryPlugin: To summarize a conversation
- HttpPlugin: To call APIs
- TextMemoryPlugin: To store and retrieve text in memory
- TimePlugin: To acquire the time of day and any other temporal information
Let’s start by exploring how we can use a pre-defined plugin, before moving on to investigate how we can create custom plugins.
Using an out-of-the-box plugin
One of the plugins included in Semantic Kernel is TextMemoryPlugin
, which provides functionality to save and recall information from memory. Let’s see how we can use this to simplify our previous example of populating our prompt context from memory.
First, we must import our plugin, as demonstrated below.
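A minimal sketch of this, assuming that the plugin lives in the core_plugins module and is constructed with the SemanticTextMemory instance that we created earlier, could look like this; if the interface differs in your version, adjust accordingly.

from semantic_kernel.core_plugins.text_memory_plugin import TextMemoryPlugin

memory_plugin = kernel.import_plugin_from_object(
    TextMemoryPlugin(memory=memory), plugin_name="TextMemoryPlugin"
)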

Here, we can see that this plugin contains two functions, recall and save.
Now, let’s modify our prompt:
prompt_with_context_plugin = """
Use the following pieces of context to answer the users question.
This is the only information that you should use to answer the question, do not reference information outside of this context.
If the information required to answer the question is not provided in the context, just say that "I don't know", don't try to make up an answer.
----------------
Context: {{recall $question}}
----------------
User question: {{$question}}
----------------
Answer:
"""
We can see that, to use the recall
function, we can reference this in our prompt. Now, let’s create a config and register a function.
execution_config = kernel.get_service(target_service_id).instantiate_prompt_execution_settings(
    service_id=target_service_id,
    max_tokens=500,
    temperature=0,
    seed=42,
)
prompt_template_config = sk.PromptTemplateConfig(
    template=prompt_with_context_plugin,
    name="chat",
    template_format="semantic-kernel",
    input_variables=[
        InputVariable(name="question", description="The user question", is_required=True),
    ],
    execution_settings=execution_config,
)
chatbot_with_context_plugin = kernel.create_function_from_prompt(
    function_name="chatbot_with_context_plugin",
    plugin_name="chatPluginWithContextPlugin",
    prompt_template_config=prompt_template_config,
)
In our manual example, we were able to control aspects such as the number of results returned and the collection to search. When using TextMemoryPlugin, we can set these by adding them to our KernelArguments. Let’s try out our function.
context = KernelArguments(question="Where can I eat non-british food in London?", collection='London', relevance=0.2, limit=2)
answer = await kernel.invoke(chatbot_with_context_plugin, context)

We can see that this is equivalent to our manual approach.
Creating Custom Plugins
Now that we understand how to create semantic functions, and how to use plugins, we have everything we need to start making our own plugins!
Plugins can contain two types of functions:
- Prompt functions: use natural language to perform actions
- Native functions: use Python code to perform actions
which can be combined within a single plugin.
The choice of whether to use a prompt vs native function depends on the task that you are performing. For tasks involving understanding or generating language, prompt functions are the obvious choice. However, for more deterministic tasks, such as performing mathematical operations, downloading data or accessing the time, native functions are better suited.
Let’s explore how we can create each type. First, let’s create a folder to store our plugins.
from pathlib import Path
plugins_path = Path("Plugins")
plugins_path.mkdir(exist_ok=True)
Creating a Poem generator plugin
For our example, let’s create a plugin which generates poems; for this, using a prompt function seems a natural choice. We can create a folder for this plugin in our directory.
poem_gen_plugin_path = plugins_path / "PoemGeneratorPlugin"
poem_gen_plugin_path.mkdir(exist_ok=True)
Recalling that plugins are just collections of functions, and that we are creating a semantic function, the next part should be quite familiar. The key difference is that, instead of defining our prompt and config inline, we will create individual files for these, to make them easier to load.
Let’s create a folder for our semantic function, which we shall call write_poem.
poem_sc_path = poem_gen_plugin_path / "write_poem"
poem_sc_path.mkdir(exist_ok=True)
Next, we create our prompt, saving it as skprompt.txt.
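For example, we could write a short prompt to this file using the same notebook magic that we use for the config below; the exact wording of the prompt is illustrative.

prompt_path = poem_sc_path / "skprompt.txt"

%%writefile {prompt_path}
Write a short, rhyming poem about {{$input}}. The poem should be no longer than 8 lines.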

Now, let’s create our config and store this in a json file.
Whilst it is always good practice to set meaningful descriptions in our config, this becomes more important when we are defining plugins; plugins should provide clear descriptions that describe how they behave, what their inputs and outputs are, and what their side effects are. The reason for this is that this is the interface that is presented by our kernel and, if we want to be able to use an LLM to orchestrate tasks, it needs to be able to understand the plugin’s functionality and how to call it so that it can select appropriate functions.
config_path = poem_sc_path / "config.json"
%%writefile {config_path}
{
    "schema": 1,
    "description": "A poem generator, that writes a short poem based on user input",
    "execution_settings": {
        "azure_gpt35_chat_completion": {
            "max_tokens": 512,
            "temperature": 0.8,
            "top_p": 0.0,
            "presence_penalty": 0.0,
            "frequency_penalty": 0.0,
            "seed": 42
        }
    },
    "input_variables": [
        {
            "name": "input",
            "description": "The topic that the poem should be written about",
            "default": "",
            "is_required": true
        }
    ]
}
Now, we are able to import our plugin:
poem_gen_plugin = kernel.import_plugin_from_prompt_directory(
    plugins_path, "PoemGeneratorPlugin"
)
Inspecting our plugin, we can see that it exposes our write_poem
semantic function.

We can call our function using the kernel, as we have seen before.
result = await kernel.invoke(poem_gen_plugin["write_poem"], KernelArguments(input="Munich"))

or, we can use it in another semantic function:
prompt = """
{{PoemGeneratorPlugin.write_poem $input}}
"""
target_service_id = "azure_gpt35_chat_completion"
execution_config = kernel.get_service(target_service_id).instantiate_prompt_execution_settings(
    service_id=target_service_id,
    max_tokens=500,
    temperature=0.8,
    seed=42,
)

prompt_template_config = sk.PromptTemplateConfig(
    template=prompt,
    name="chat",
    template_format="semantic-kernel",
    input_variables=[
        InputVariable(name="input", description="The user input", is_required=True),
    ],
    execution_settings=execution_config,
)
write_poem_wrapper = kernel.create_function_from_prompt(
    function_name="poem_gen_wrapper",
    plugin_name="poemWrapper",
    prompt_template_config=prompt_template_config,
)
result = await kernel.invoke(write_poem_wrapper, KernelArguments(input="Munich"))

Creating an Image Classifier plugin
Now that we have seen how to use a prompt function in a plugin, let’s take a look at how we can use a native function.
Here, let’s create a plugin that takes an image url, then downloads and classifies the image. Once again, let’s create a folder for our new plugin.
image_classifier_plugin_path = plugins_path / "ImageClassifierPlugin"
image_classifier_plugin_path.mkdir(exist_ok=True)
Now, we can create our Python module. Inside the module, we can be quite flexible. Here, we have created a class with two methods; the key step is to use the kernel_function decorator to specify which methods should be exposed as part of the plugin.
For our inputs, we have used the Annotated
type hint to provide a description of what our argument does. More information can be found in the documentation.
import requests
from PIL import Image
import timm
from timm.data.imagenet_info import ImageNetInfo
from typing import Annotated

from semantic_kernel.functions.kernel_function_decorator import kernel_function


class ImageClassifierPlugin:
    def __init__(self):
        self.model = timm.create_model("convnext_tiny.in12k_ft_in1k", pretrained=True)
        self.model.eval()
        data_config = timm.data.resolve_model_data_config(self.model)
        self.transforms = timm.data.create_transform(**data_config, is_training=False)
        self.imagenet_info = ImageNetInfo()

    @kernel_function(
        description="Takes a url as an input and classifies the image",
        name="classify_image",
    )
    def classify_image(self, input: Annotated[str, "The url of the image to classify"]) -> str:
        image = self.download_image(input)
        pred = self.model(self.transforms(image)[None])
        return self.imagenet_info.index_to_description(pred.argmax())

    def download_image(self, url):
        return Image.open(requests.get(url, stream=True).raw).convert("RGB")
For this example, I have used the excellent Pytorch Image Models library to provide our classifier. For more information on how this library works, check out this blog post.
Now, we can simply import our plugin as seen below.
image_classifier = ImageClassifierPlugin()
classify_plugin = kernel.import_plugin_from_object(image_classifier, plugin_name="classify_image")
Inspecting our plugin, we can see that only our decorated function is exposed.

We can verify that our plugin works using an image of a cat from Pixabay.

url = "https://cdn.pixabay.com/photo/2016/02/10/16/37/cat-1192026_1280.jpg"
response = await kernel.invoke(classify_plugin["classify_image"], KernelArguments(input=url))

Manually calling our function, we can see that our image has been classified correctly! In the same way as before, we could also reference this function directly from a prompt. However, as we have already demonstrated this, let’s try something slightly different in the following section.
Chaining multiple plugins
Now that we have defined a variety of functions, both inline and as plugins, let’s see how we can orchestrate a workflow that calls more than one function.
If we would like to execute multiple functions independently, this is straightforward; we can simply pass a list of functions to invoke
, as demonstrated below.
answers = await kernel.invoke([classify_plugin["classify_image"], poem_gen_plugin["write_poem"]], arguments=KernelArguments(input=url))

Here, we can see that the same input has been used for each function. We could have defined different named parameters in our KernelArguments, but if multiple functions have arguments with the same name, this becomes difficult. As an aside, our poem generator seemed to do a great job given that it was only provided with a url!
A more interesting case is when we would like to use the output from one function as the input to another, so let’s explore that.
To provide more fine-grained control over how functions are invoked, the kernel enables us to define handlers, where we can inject custom behaviour:
- add_function_invoking_handler: used to register handlers that are called before a function is called
- add_function_invoked_handler: used to register handlers that are called after a function is called
As we would like to update our input to the next function with the previous function’s output, we can define a short function to do this, and register this so that it is called after each function has been invoked. Let’s see how we can do this.
First, we need to define a function which takes the kernel and an instance of FunctionInvokedEventArgs
, and updates our arguments.
from semantic_kernel.events.function_invoked_event_args import FunctionInvokedEventArgs


def store_results(kernel, invoked_function_info: FunctionInvokedEventArgs):
    # Take the result of the function that just ran and pass it as the 'input' argument to the next function
    previous_step_result = str(invoked_function_info.function_result)
    invoked_function_info.arguments["input"] = previous_step_result
    invoked_function_info.updated_arguments = True
Next, we can register this with our kernel.
kernel.add_function_invoked_handler(store_results)
Now, we can invoke our functions as before.
answers = await kernel.invoke([classify_plugin["classify_image"], poem_gen_plugin["write_poem"]], arguments=KernelArguments(input=url))

We can see that, using both plugins sequentially, we have classified the image and written a poem about it!
Orchestrating workflows with a Planner
At this point, we have thoroughly explored semantic functions, understand how functions can be grouped and used as part of a plugin, and have seen how we can chain plugins together manually. Now, let’s explore how we can create and orchestrate workflows using LLMs. To do this, Semantic Kernel provides Planner objects, which can dynamically create chains of functions to try and achieve a goal.
A planner is a class that takes a user prompt and a kernel, and uses the kernel’s services to create a plan of how to perform the task, using the functions and plugins that have been made available to the kernel. As the plugins are the main building blocks of these plans, the planner relies heavily on the descriptions provided; if plugins and functions don’t have clear descriptions, the planner will not be able to use them correctly. Additionally, as a planner can combine functions in various different ways, it is important to ensure that we only expose functions that we are happy for the planner to use.
As the planner relies on a model to generate a plan, there can be errors introduced; these usually arise when the planner doesn’t properly understand how to use the function. In these cases, I have found that providing explicit instructions – such as describing the inputs and outputs, and stating whether inputs are required – in the descriptions can lead to better results. Additionally, I have had better results using instruction tuned models than base models; base text completion models tend to hallucinate functions that don’t exist or create multiple plans. Despite these limitations, when everything works correctly, planners can be incredibly powerful!
Let’s see this in action by creating a plan to write a poem about an image, based on its url, using the plugins we created earlier. As we have defined lots of functions that we no longer need, let’s create a new kernel, so that we can control which functions are exposed.
kernel = sk.Kernel()
To create our plan, let’s use our OpenAI chat service.
service_id = "azure_gpt35_chat_completion"
kernel.add_service(
    service=AzureChatCompletion(
        service_id=service_id,
        deployment_name=OPENAI_DEPLOYMENT_NAME,
        endpoint=OPENAI_ENDPOINT,
        api_key=OPENAI_API_KEY,
    ),
)
Now, let’s import our plugins.
classify_plugin = kernel.import_plugin_from_object(
    ImageClassifierPlugin(), plugin_name="classify_image"
)

poem_gen_plugin = kernel.import_plugin_from_prompt_directory(
    plugins_path, "PoemGeneratorPlugin"
)
We can see which functions our kernel has access to as demonstrated below.
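One way to do this is to iterate over the kernel's registered plugins; here, I am assuming that kernel.plugins can be iterated to yield plugin objects which expose name and functions attributes. If the interface differs in your version of the library, adjust accordingly.

for plugin in kernel.plugins:
    for function_name in plugin.functions:
        print(f"{plugin.name}.{function_name}")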

Now, let’s import our planner object.
from semantic_kernel.planners.basic_planner import BasicPlanner
planner = BasicPlanner(service_id)
To use our planner, all we need is a prompt. Often, we will need to tweak this depending on the plans that are generated. Here, I have tried to be as explicit as possible about the input that is required.
ask = f"""
I would like you to write poem about what is contained in this image with this url: {url}. This url should be used as input.
"""
Next, we can use our planner to create a plan for how it will solve the task.
plan = await planner.create_plan(ask, kernel)

Inspecting our plan, we can see that the model has correctly identified our input, and the correct functions to use!
Finally, all that is left to do is to execute our plan.
poem = await planner.execute_plan(plan, kernel)

Wow, it worked! For a model trained to predict the next word, that is pretty powerful!
As a word of warning, I was quite lucky when making this example that the generated plan worked first time. However, we are relying on a model to correctly interpret our instructions, as well as understanding the tools available; not to mention that LLMs can hallucinate and potentially dream up new functions that don’t exist! For me personally, in a production system, I would feel much more comfortable manually creating the workflow to execute, rather than leaving it to the LLM! As the technology continues to improve, especially at the current rate, hopefully this recommendation will become outdated!
Conclusion
Hopefully this has provided a good introduction to Semantic Kernel and has inspired you to explore using it for your own use cases.
All of the code required to replicate this post is available as a notebook here.
Chris Hughes is on LinkedIn
References
- Introducing ChatGPT (openai.com)
- microsoft/semantic-kernel: Integrate cutting-edge LLM technology quickly and easily into your apps (github.com)
- Who We Are – Microsoft Solutions Playbook
- kernel – Wiktionary, the free dictionary
- Overview – OpenAI API
- Azure OpenAI Service – Advanced Language Models | Microsoft Azure
- Hugging Face Hub documentation
- Azure OpenAI Service models – Azure OpenAI | Microsoft Learn
- Create Your Azure Free Account Today | Microsoft Azure
- How to use prompt template language in Semantic Kernel | Microsoft Learn
- asyncio – Asynchronous I/O – Python 3.11.5 documentation
- Transformers (huggingface.co)
- gpt2 · Hugging Face
- explosion/curated-transformers: A PyTorch library of curated Transformer models and their composable components (github.com)
- tiiuae/falcon-7b · Hugging Face
- ND A100 v4-series – Azure Virtual Machines | Microsoft Learn
- [2203.02155] Training language models to follow instructions with human feedback (arxiv.org)
- Azure OpenAI Service models – Azure OpenAI | Microsoft Learn
- New and improved embedding model (openai.com)
- Introduction – Azure Cosmos DB | Microsoft Learn
- PostgreSQL: The world’s most advanced open source database
- sentence-transformers/all-MiniLM-L6-v2 · Hugging Face
- Understanding AI plugins in Semantic Kernel and beyond | Microsoft Learn
- Out-of-the-box plugins available in Semantic Kernel | Microsoft Learn
- How to add native code to your AI apps with Semantic Kernel | Microsoft Learn
- huggingface/pytorch-image-models: PyTorch image models, scripts, pretrained weights – ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNet-V3/V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more (github.com)
- Getting Started with PyTorch Image Models (timm): A Practitioner’s Guide | by Chris Hughes | Towards Data Science