Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code

Customer reviews about products and services provide valuable information about customer satisfaction. They provide insight into what should be improved across the whole product development cycle. Dynamic topic models in business intelligence can identify key product qualities and other satisfaction factors, cluster them into categories, and evaluate how business decisions materialize in customer satisfaction over time. This is highly valuable information, and not only for product managers.
This article will compare two of the latest topic models for classifying customer complaints data. BERTopic by Maarten Grootendorst (2022) and the recent FASTopic by Xiaobao Wu et al. (2024), presented at last year’s NeurIPS, are the current leading models for topic analytics of customer data. For both models, we’ll explore in Python code:
- how to effectively preprocess data
- how to train a Bigram topic model for customer complaint analysis
- how to model topic activity over time.
1. Customer complaints data in companies
Complaints data are generated by interactions with customers and are typically recorded in ERP systems. There are many channels through which customers can raise a concern about a product or service. Here are just a few examples:
- Email: email communication is stored for the BI team, e.g., in an SQL database.
- After-purchase survey: feedback sent to customers after product purchase. Companies either send the emails themselves or use a price comparison website (e.g., Billiger in Germany) where customers order the product.
- Phone transcriptions: after prior consent from a customer, some companies record the phone communication with customers, which is then available for the BI team.
- Google reviews: customers leave comments and reviews on products and services worldwide. Google enables authorized users to export the data, which can be used for text mining among other purposes.
- Review platforms: independent review platforms (such as Trustpilot) offer customers a place to provide feedback to brands and companies. This data is available through various APIs.
- Social media conversations: Instagram, X, and Facebook are full of product or brand-related comments. The simplest way is to use an official API to collect the data. For Instagram and Facebook, go to the developers’ portal to receive an API key. X works the same way.
2. Example data
As example data, we’ll use the Amazon Dog Food Reviews dataset from Hugging Face, released under the Apache-2.0 license. The subset for topic modeling contains only 3,693 customer reviews collected between 02/01/2016 and 31/12/2020. Here is what the data looks like:


3. Data preprocessing
Processing data systematically in the right order keeps the essential information and does not add new bias. Let’s go through the steps below (a minimal code sketch follows the list):
- #1: Numbers: digits are typically the characters to remove in the first step.
- #2: Emoticons: product reviews are typically full of them. For topic modeling in customer reviews, emojis don’t have much significance.
- #3: Stopwords: apart from standard stopwords, it is common to also remove the words on an extended stopword list for one or more languages.
- #4: Punctuation: general language has a myriad of special characters and punctuation, which should be cleaned in this step.
- #5: Additional stopwords: depending on the use case, some additional words are also useful to remove. With the Amazon dog food reviews, these are "dog", "food", "blue", "buffalo", "ha", "month", and "ago".
- #6: Lemmatization: groups inflected words into a single form (the lemma), keeping the word-root information and semantics. "Delivery" and "deliveries", "box" and "boxes", or "price" and "prices" share the same word root, but without lemmatization, topic models would treat them as separate factors. That’s why product reviews should always be lemmatized in the last step of preprocessing.
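Here is a minimal sketch of this cleaning order with NLTK. It is illustrative rather than the exact pipeline used for the dataset: the lowercasing step, the emoji regex ranges, and the assumption that the raw reviews sit in data['text'] (with the cleaned output going to data['clean_text']) are mine.
import re
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')
STANDARD_STOPWORDS = set(stopwords.words('english'))
ADDITIONAL_STOPWORDS = {'dog', 'food', 'blue', 'buffalo', 'ha', 'month', 'ago'}
lemmatizer = WordNetLemmatizer()
def clean_review(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)                                           # 1: numbers
    text = re.sub(r'[\U0001F300-\U0001FAFF\u2600-\u27BF]', '', text)          # 2: emoticons (rough unicode ranges)
    tokens = [t for t in word_tokenize(text) if t not in STANDARD_STOPWORDS]  # 3: stopwords
    tokens = [re.sub(r'[^\w\s]', '', t) for t in tokens]                      # 4: punctuation
    tokens = [t for t in tokens if t and t not in ADDITIONAL_STOPWORDS]       # 5: additional stopwords
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)                  # 6: lemmatization
data['clean_text'] = data['text'].apply(clean_review)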
Text preprocessing is model-specific:
- FASTopic works with clean data on input; some cleaning (stopwords) can also be done during training. The simplest and most effective way is to use Washer, a no-code app for cleaning text data in text mining projects.
- BERTopic: the documentation recommends that "removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context to create accurate embeddings". The transformer embeddings work on the raw text, not on text stripped of stopwords or reduced to lemmas and tokens. For this reason, cleaning operations should be included in the model training.

4. Topic modeling with top-notch models
Let’s now check how the satisfaction factors are distributed across the topics. The questions we ask here are:
- What were the key problems and qualities customers reported on the product?
- How has product satisfaction changed over time?
The BERTopic and FASTopic papers describe the model architectures in detail. Also, my TDS tutorial on topic modeling explains topic classification with BERTopic on a political speech dataset.
4.1. FASTopic
Import the libraries and the data (complete code and the requirements are here). Then, create a list of clean reviews:
import pandas as pd
from fastopic import FASTopic
from sklearn.feature_extraction.text import CountVectorizer
from topmost.preprocessing import Preprocessing
# `data` holds the preprocessed reviews (loaded in the linked notebook)
# create a list of clean reviews
docs = data['clean_text'].tolist()
In FASTopic, bigram generation is not directly implemented. To solve this, we will write a bigram preprocessing class. The model works with bigrams the same way as with individual tokens, so we join the two words of each bigram with an underscore.
# custom preprocessing class with bigram generation
class NgramPreprocessing:
    def __init__(self, ngram_range=(1, 1), vocab_size=10000, stopwords='English'):
        self.ngram_range = ngram_range
        self.preprocessing = Preprocessing(vocab_size=vocab_size, stopwords=stopwords)
        # use a custom analyzer to join bigrams with "_"
        self.vectorizer = CountVectorizer(ngram_range=self.ngram_range,
                                          max_features=vocab_size,
                                          analyzer=self._custom_analyzer)

    # custom analyzer function to join bigrams with underscores
    def _custom_analyzer(self, doc):
        # tokenize the document and create bigrams
        tokens = CountVectorizer(ngram_range=self.ngram_range).build_analyzer()(doc)
        # replace spaces in bigrams with "_"
        return [token.replace(" ", "_") for token in tokens]

    def preprocess(self, docs, pretrained_WE=False):
        parsed_docs = self.preprocessing.preprocess(docs, pretrained_WE=pretrained_WE)["train_texts"]
        train_bow = self.vectorizer.fit_transform(parsed_docs).toarray()
        rst = {
            "train_bow": train_bow,
            "train_texts": parsed_docs,
            "vocab": self.vectorizer.get_feature_names_out()
        }
        return rst

# initialize preprocessing with bigrams
ngram_preprocessing = NgramPreprocessing(ngram_range=(2, 2))
Let’s train the model for eight topics and display the top 20 bigrams for each topic in a data frame. During training, each underscore-joined bigram is treated as a single token; after training, we remove the underscores to display the bigrams.
# model training
model = FASTopic(8, ngram_preprocessing, num_top_words=10000)
# fit model to documents
topic_top_words, doc_topic_dist = model.fit_transform(docs)
# retrieve 20 bigrams for each topic
import pandas as pd
max_bigrams = 20
# Retrieve the bigrams for each topic and select only the word columns
topic_0 = pd.DataFrame(model.get_topic(0, max_bigrams), columns=["Topic_0_word", "Topic_0_prob"])[["Topic_0_word"]]
topic_1 = pd.DataFrame(model.get_topic(1, max_bigrams), columns=["Topic_1_word", "Topic_1_prob"])[["Topic_1_word"]]
topic_2 = pd.DataFrame(model.get_topic(2, max_bigrams), columns=["Topic_2_word", "Topic_2_prob"])[["Topic_2_word"]]
topic_3 = pd.DataFrame(model.get_topic(3, max_bigrams), columns=["Topic_3_word", "Topic_3_prob"])[["Topic_3_word"]]
topic_4 = pd.DataFrame(model.get_topic(4, max_bigrams), columns=["Topic_4_word", "Topic_4_prob"])[["Topic_4_word"]]
topic_5 = pd.DataFrame(model.get_topic(5, max_bigrams), columns=["Topic_5_word", "Topic_5_prob"])[["Topic_5_word"]]
topic_6 = pd.DataFrame(model.get_topic(6, max_bigrams), columns=["Topic_6_word", "Topic_6_prob"])[["Topic_6_word"]]
topic_7 = pd.DataFrame(model.get_topic(7, max_bigrams), columns=["Topic_7_word", "Topic_7_prob"])[["Topic_7_word"]]
# concatenate the dataframes
topics_df = pd.concat([topic_0,topic_1, topic_2, topic_3, topic_4,topic_5,topic_6,topic_7], axis=1)
# remove underscores from bigrams
topics_df = topics_df.applymap(lambda x: x.replace('_', ' ') if isinstance(x, str) else x)
We’ve modeled the customer satisfaction factors with a dog food product in eight distinct topics. Here are the manually annotated topic names:

FASTopic returns relatively distinct topics, sorting the customers’ comments as follows:
- 0: Negative health effects, "sensitive stomach", "small bite", "stomach issue", "lose weight", "refuse eat", "taste wild", "digestive issue", "upset stomach", "stop eat", "gain weight"
- 1: Food quality, "love flavor", "quality ingredient", "good ingredient", "healthy ingredient", "ingredient quality", "flavor good", "taste great", "healthy love", "great healthy", "good healthy", "good health", …
- 2: Positive health effects, "healthy fur", "awesome pup", "eye bright"
- 3: Digestion effects, "smell bad", "runny poop", "horrible gas", "diarrhea vet", "terrible diarrhea", "sick week", "sick buy", "day vomit"
- 4: Pricing, "great price", "good price", "love price", "price great", "love cheap", "price deliver", "great deal", "price increase", "free shipping", …
- 5: Other: remaining factors.
- 6: Fur effects, "coat shiny", "fur baby", "skin issue", "shiny coat", "love coat", "coat soft"
- 7: Delivery, "open box", "bag rip", "big bag", "hole bag", "open bag", "inside box", "bag open", "bag hole", "heavy bag", "rip open", …
It is also useful to check the weight of these categories in the data. The full code is here.
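One simple way to approximate these weights is to average the document-topic distribution returned by fit_transform. This is a minimal sketch, not necessarily the exact calculation in the linked notebook.
# approximate topic weights as the average document-topic probability
topic_weights = doc_topic_dist.mean(axis=0)
weights_df = (pd.DataFrame({'topic': range(len(topic_weights)), 'weight': topic_weights})
                .sort_values('weight', ascending=False))
print(weights_df)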
We’ve modeled the customer satisfaction factors for a dog food product. But why is this beneficial for companies? Dynamic topic models offer a straightforward way of monitoring customer satisfaction over time. They indicate product-related problems and help companies take the right measures. Once the business decisions are put into action, topic models check whether they have an effect over time.
To do so, let’s model topic activity over time at a quarterly frequency.
import plotly.graph_objects as go

# convert date column to datetime
data['time'] = pd.to_datetime(data['time'])
# format date column to quarterly periods
data['date_quarterly'] = data['time'].dt.to_period('Q').astype(str)
periods = data['date_quarterly'].tolist()

# calculate topic activity over time
act = model.topic_activity_over_time(periods)
# visualize topic activity
fig = model.visualize_topic_activity(top_n=8, topic_activity=act, time_slices=periods)

# update legend to display only the topic number
fig.data = sorted(fig.data, key=lambda trace: trace.name)
for trace in fig.data:
    trace.name = trace.name[0]

# update the layout
fig.update_layout(
    width=1200,
    height=600,
    title='',
    legend_title_text='Topic',
    xaxis_tickangle=45  # set x-axis labels to a 45-degree angle
)
# show the figure
fig.show()
The delivery problems in topic 7 peaked in Q3 2018. Customers complained about open and ripped boxes much more often, but these problems were fixed in early 2019 (see the picture below).

4.2. BERTopic
BERTopic implements bigrams with vectorizer_model, which also works as a data processing pipeline. The code and the requirements are here.
from bertopic import BERTopic
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re

# download the NLTK resources used for stopwords, tokenization, and lemmatization
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# create a list of reviews
docs = data['text'].tolist()
We train on raw data and clean it with the vectorizer. During training, the vectorizer removes numbers and stopwords from the data and returns lemmatized tokens for the bigram model.
# create stopwords list
standard_stopwords = list(stopwords.words('english'))
# extended list of English stopwords (truncated here; full list in the linked code)
stopwords_extended = ["0o", ...]
# additional tokens to remove
additional_stopwords = ['blue', 'buffalo', 'dog', 'food', 'ha', 'month', 'ago']
# combine standard stopwords, extended stopwords, and additional tokens
full_stopwords = (standard_stopwords
                  + additional_stopwords
                  + stopwords_extended)
# define a tokenizer returning lemmatized text without numbers
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        doc = re.sub(r'\d+', '', doc)  # clean numbers
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]  # lemmatize

# the vectorizer handles the data processing and generates bigrams
vectorizer_model = CountVectorizer(tokenizer=LemmaTokenizer(),
                                   ngram_range=(2, 2),
                                   stop_words=full_stopwords)
# set up the model
model = BERTopic(n_gram_range=(2, 2),             # return bigrams
                 nr_topics=9,                      # reduce to 9 topics; topic -1 collects outliers
                 top_n_words=20,                   # return the top 20 bigrams per topic
                 min_topic_size=20,                # each topic contains at least 20 documents
                 vectorizer_model=vectorizer_model,
                 umap_model=UMAP(random_state=1))  # set a seed so the topics are reproducible
# fit model to data
topics, probabilities = model.fit_transform(docs)
Next, let’s prepare a dataframe with tokens from the model.
import pandas as pd
# retrieve bigrams for each topic and select only the word columns
topic_0 = pd.DataFrame(model.get_topic(0), columns=["Topic_0_word", "Topic_0_prob"])[["Topic_0_word"]]
topic_1 = pd.DataFrame(model.get_topic(1), columns=["Topic_1_word", "Topic_1_prob"])[["Topic_1_word"]]
topic_2 = pd.DataFrame(model.get_topic(2), columns=["Topic_2_word", "Topic_2_prob"])[["Topic_2_word"]]
topic_3 = pd.DataFrame(model.get_topic(3), columns=["Topic_3_word", "Topic_3_prob"])[["Topic_3_word"]]
topic_4 = pd.DataFrame(model.get_topic(4), columns=["Topic_4_word", "Topic_4_prob"])[["Topic_4_word"]]
topic_5 = pd.DataFrame(model.get_topic(5), columns=["Topic_5_word", "Topic_5_prob"])[["Topic_5_word"]]
topic_6 = pd.DataFrame(model.get_topic(6), columns=["Topic_6_word", "Topic_6_prob"])[["Topic_6_word"]]
topic_7 = pd.DataFrame(model.get_topic(7), columns=["Topic_7_word", "Topic_7_prob"])[["Topic_7_word"]]
# concatenate the dataframes
topics_df = pd.concat([topic_0, topic_1, topic_2, topic_3, topic_4,
topic_5, topic_6,topic_7], axis=1)
The annotated topics show a categorization similar to FASTopic’s. The differences are that BERTopic puts Spanish tokens into a separate topic (T7) and fills T1 and T5 with adjectives with positive meaning. Delivery problems in T4 are identical to FASTopic’s classification.

Again, let’s focus on topic activity over time, which is what gives dynamic topic models additional value for BI. BERTopic uses token frequencies (not topic weights, as FASTopic does) for topic activity analysis.
# topic activity over time
import plotly.graph_objects as go

# create timestamps
data['time'] = pd.to_datetime(data['time'])
timestamps = data['time'].to_list()

# generate topics over time; 20 bins roughly correspond to quarterly frequency
topics_over_time = model.topics_over_time(docs, timestamps, nr_bins=20)
# filter out topic -1 containing outliers
topics_over_time_filtered = topics_over_time[topics_over_time['Topic'] != -1]
# visualize the filtered topics over time
fig = model.visualize_topics_over_time(topics_over_time_filtered)

# update legend to display only the topic number
fig.data = sorted(fig.data, key=lambda trace: trace.name)
for trace in fig.data:
    trace.name = trace.name[0]

# update the layout
fig.update_layout(
    width=1200,
    height=600,
    title='',
    legend_title_text='Topic',
    xaxis_tickangle=45  # set x-axis labels to a 45-degree angle
)
# show the figure
fig.show()
Most topics are stable over time, except T4, which categorizes delivery problems. As with FASTopic, BERTopic shows that customers’ negative complaints about damaged boxes rose in mid-2018.

Summary
Both models indicated delivery problems in mid-2018, which vanished in early 2019. With a topic model API monitoring customer comments on various channels, these problems can be fixed before they have a harmful effect on the brand.
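As an illustration of what such monitoring could look like, the sketch below scores a new batch of comments with the trained BERTopic model from section 4.2 and flags a spike in the delivery topic. The fetch_new_comments() helper and the alert threshold are hypothetical placeholders, not part of the original pipeline.
# minimal monitoring sketch: assign new comments to the already-trained topics
# and flag a spike in the delivery-problems topic (T4 in section 4.2)
def check_new_comments(model, new_docs, delivery_topic=4, alert_share=0.15):
    topics, _ = model.transform(new_docs)  # classify new comments with the trained model
    share = sum(t == delivery_topic for t in topics) / max(len(topics), 1)
    if share > alert_share:
        print(f"Alert: {share:.0%} of new comments mention delivery problems.")
    return share

# new_docs = fetch_new_comments()   # hypothetical helper wrapping a channel API
# check_new_comments(model, new_docs)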
The right data processing is essential for topic models to make sense in the applied world. Cleaning text in the right order minimizes the bias of each cleaning operation. Numbers and emoticons are typically removed first, followed by stopwords. Punctuation is cleaned afterward so that stopwords don’t break into two tokens ("we’ve" -> "we" + "’ve"). Additional tokens are removed from the clean data in the next step, before lemmatization, which unifies tokens with the same semantics.
FASTopic deserves much better documentation, which currently provides only basic information, especially because its (1) simplicity of use and (2) stability when training on small datasets make it a top-notch alternative to BERTopic. It is particularly practical for small companies such as e-shops, which typically don’t collect large text datasets and seek simple, efficient solutions. The data and full code for this tutorial are here.
If you enjoy my work, you can invite me for coffee and support my writing. You can also subscribe to my email list to get notified about my new articles. Thanks!
References
[1] Grootendorst, M. (2022). BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv preprint arXiv:2203.05794.
[2] Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. arXiv preprint arXiv:2405.17978.