A week ago, I shared a story on my LinkedIn about my latest project, UniFliXsg, an AI app that suggests undergraduate programs in Singapore based on your interests and career goals. If you haven't seen it, you can try it out at the link below. Currently, the database covers only single-major programs at NUS, NTU, SMU, and SUTD.
As promised, I will share the workflow and the technical aspects of how I built this from scratch in this Medium blog. The article follows my actual thought process, so you may find some jumping between steps. Additionally, I will keep the code in this article light; you can find all of it in this GitHub repository if you like.
This is my first blog, so if we haven't met yet, I'm Oad. I'm a computer science undergraduate at NUS with a strong passion for AI, particularly Large Language Models (LLMs). That's also why I started this project: to learn more about their applications.
Step 1: Outlining the Workflow
Everything starts with planning. The first idea that came to mind was something similar to content-based filtering, where we match the user's profile against each program's information. After looking through samples of the programs, I found that the information available for every program essentially boils down to a description and the career prospects. So I could compute the average similarity between the user's profile and each piece of information to recommend programs. This is somewhat similar to semantic search, where we match a user's query against items in a database.
I also noted that the universities may have different cultures, so including each university's description as another piece of information could make the recommendations more personalized. The name UniFliXsg arose at this point as a nod to Netflix, which is known for its powerful recommender systems.
On the user's side, I let them input their personal interests and career goals. To reduce extra computation, I decided to combine the inputs into a single query: "I am interested in {user_input}. Upon graduation, I want to work as {user_input}."
I used cosine similarity to compute the similarity scores. I chose this metric because it depends only on the direction of the vectors, not their magnitude, so embeddings of texts with different lengths can be compared fairly. We first convert each piece of information and the user's profile into vectors (text embeddings), then compute the scores. The overall plan is illustrated in the image below.
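For reference, cosine similarity is just the dot product of two vectors divided by the product of their norms. A small NumPy sketch with toy 3-dimensional "embeddings" (real sentence embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; insensitive to magnitude."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

user = np.array([1.0, 2.0, 0.5])
program = np.array([2.0, 4.0, 1.0])  # same direction as `user`, twice the magnitude
print(cosine_similarity(user, program))  # parallel vectors score 1.0
```

Because only direction matters, scaling a vector (e.g. a longer text producing a larger-magnitude embedding) does not change the score.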

Step 2: Data Collection and Preprocessing
This was the most time-consuming part of the project. Each program organizes its information in a completely different way, so I decided to collect all the data manually rather than with web scraping, since writing a scraper for each site would likely have taken just as much time and effort.
You can find the dataset in the Hugging Face repository provided earlier. The data is stored as a Parquet file because I planned to use Polars instead of Pandas for this project. I had never used it before but had heard it is faster, so I wanted to try it out! Originally, though, I stored the data in an Excel file, which I found most convenient for working with tables. The image below shows what the data looked like before performing any embeddings.

Step 3: Text Embedding and Model Selection
To embed the texts in the dataset, it is convenient to use pre-trained language models from Hugging Face, which already capture semantic similarity between words. We don't need a task-specific model: the Sentence Transformers library wraps a pre-trained transformer with a pooling layer so that each text maps to a single fixed-size vector. We can initialize one simply by:

```python
from sentence_transformers import SentenceTransformer

# model_name is the name of the model to use
model = SentenceTransformer(model_name)
```
First, each text in our corpus is tokenized, i.e. split into tokens (often subword units rather than whole words), following the configuration of the selected model. Each token is then mapped to a vector through the embedding matrix, giving a matrix that is passed through the transformer layers. The output, after pooling, is a single n-dimensional vector per text.
When choosing a model, we should balance performance against size. Sentence transformers are generally lighter than most models. However, since free Hugging Face Spaces are limited to CPUs, I picked one of the lightweight models known to perform well on this task. This lets users get faster results when using the app.
Honestly, I tried several models, and the one I found to be the best match for the task is "all-MiniLM-L6-v2". I used it to embed all the relevant texts in the dataset ahead of time. When users input their profiles, we only need to embed the profile and compute its similarity to each program, which saves computation by avoiding re-embedding the dataset on every run.
Step 4: Computing Cosine Similarity
Cosine similarity is computed between the user's vector and each of the program's vectors (one per piece of information). The final similarity score is the average of the three scores. There is a tricky part in this approach, though. After some test runs, I found that the results were not very accurate, so I decided to weigh the similarity scores differently: more weight on the program description and career prospects, less on the university description. This is essentially a weighted arithmetic mean, and it produced more accurate results in the end.

Once I have all the similarity scores, I sort them in descending order and return the top ten results. As mentioned, all the code and data are available on my GitHub.
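The weighted averaging and top-10 selection can be sketched like this. The weight values below are illustrative placeholders, not the app's actual numbers:

```python
import numpy as np

# Illustrative weights: more on program description and career prospects,
# less on the university description (the actual values are my assumption)
WEIGHTS = np.array([0.4, 0.4, 0.2])

def rank_programs(scores: np.ndarray, top_k: int = 10) -> np.ndarray:
    """scores: (n_programs, 3) matrix of cosine similarities per information type.
    Returns the indices of the top_k programs by weighted mean score."""
    weighted = scores @ WEIGHTS  # weighted arithmetic mean per program
    return np.argsort(weighted)[::-1][:top_k]  # descending order, keep top_k

scores = np.array([
    [0.9, 0.8, 0.1],   # program 0: strong description/career match
    [0.2, 0.3, 0.9],   # program 1: only the university description matches
    [0.6, 0.6, 0.6],   # program 2: uniformly moderate match
])
print(rank_programs(scores, top_k=2))  # → [0 2]
```

Down-weighting the last column keeps program 1's high university-description score from dominating its ranking.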

Step 5: Creating and Deploying the App
I built this app with Gradio on the Hugging Face ecosystem. If you're new to it, Gradio is a library that lets you create and deploy an app prototype directly in Python using its templates and prebuilt components. It is quite convenient: it gets a working app in front of users quickly and doesn't require much web development knowledge. I did use a little HTML, but only for centering and scaling elements. The snippet below shows what Gradio looks like in action.
```python
selected_programs = []

with gr.Blocks(theme=gr.themes.Soft(primary_hue="rose", neutral_hue="slate",
                                    font=gr.themes.GoogleFont("Inter"))) as app:
    with gr.Column():
        gr.HTML(value='<p align="center"> Made with ❤️ by <a href="https://www.instagram.com/oadultradeepfield/"> @oadultradeepfield </a> </p>', visible=True)
        gr.HTML(value='<h1 align="center">Discover Your Perfect Undergraduate Program in Singapore with AI-Powered Matching!</h1>', visible=True)
        gr.HTML(value='<h3 align="center">What Are the Key Word(s) That Best Describe You?</h3>', visible=True)
        gr.HTML(value='<p align="center"> P.S. Only single major programs will be shown. <br> You can also try out instantly with the example keywords. </p>', visible=True)
        with gr.Row():
            with gr.Column():
                input_interest = gr.Textbox(placeholder='e.g. Business, Coding, Data Analysis, etc.',
                                            label='I am interested in...', scale=3)
                input_career = gr.Textbox(placeholder='e.g. Entrepreneur, Data Scientist, Software Engineer, etc.',
                                          label='Upon graduation, I want to work as...', scale=3)
                button = gr.Button("Search Now!")
    # Create the output components first so the click handler can reference them
    for _ in range(10):
        with gr.Row():
            selected_programs.append(gr.Image(visible=False))
            selected_programs.append(gr.HTML(visible=False))
    button.click(fn=return_search_results,
                 inputs=[input_interest, input_career],
                 outputs=selected_programs)

app.launch()
```
Finally, the app is up and running, and it can be accessed from any device via the link to my Hugging Face Space!
Data Availability Statement
The dataset utilized in this project was sourced directly from the official websites of the undergraduate programs of various universities. As this information is publicly accessible, it can be located through standard search engines.
Acknowledgment
I added this special section to thank the community and everyone who engaged with my LinkedIn post. I even received a comment from Gradio, which was unexpected but appreciated. I also received feedback and kind suggestions in the comments and on other platforms. Thanks to everyone for contributing to my learning! I will share more stories in future blogs and posts. See you there soon!
Unless otherwise noted, all images are by the author.