
The AIQC boomerang plot visualizes various performance metrics across each split (train, validation, test) for every model in an experiment. When the points of a model’s trace are tightly clustered (precise), it means that the model has discovered patterns that generalize across each population.
🧮 How to evaluate many tuned models
Imagine that you’ve just trained a large batch of models that all seem to be performing relatively well. How do you know which one is the best based on the metrics you care about? To answer this question, you’d need to:
- Run the raw predictions back through functions like [sklearn.metrics.f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) (see the sketch after this list).
- Do this for each split/fold (train, validation, test).
- Do this for each model.
- Do this for each metric.
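For a single model, that bookkeeping looks roughly like the sketch below, assuming a binary classifier. The labels and predictions are random stand-ins, and the dictionary layout is just one way to organize the results; only the scikit-learn calls are real.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

# Stand-in labels and predictions for each split -- purely illustrative.
splits = {
    "train":      (rng.integers(0, 2, 100), rng.integers(0, 2, 100)),
    "validation": (rng.integers(0, 2, 30),  rng.integers(0, 2, 30)),
    "test":       (rng.integers(0, 2, 30),  rng.integers(0, 2, 30)),
}

# Score every split on every metric you care about...
metrics = {
    split: {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }
    for split, (y_true, y_pred) in splits.items()
}
# ...and then repeat all of this for every model in the batch.
print(metrics)
```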
At this point, you could calculate aggregate metrics for each model. However, with only 2 or 3 splits to learn from, aggregate metrics aren’t very useful.
For example, if you have a model that overfits on the training & evaluation data at 100% & 95% accuracy respectively, but flunks the holdout data at 82%, then most aggregate metrics would be misleading. You’d have to introduce a range or standard deviation measure to make sense of it, but at that point couldn’t you just look at the 3 splits yourself? 🤷 So you’re right back where you started with a table of raw data.
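To make that concrete, here is the arithmetic for the hypothetical model above (NumPy is used purely for convenience):

```python
import numpy as np

# Accuracy per split for the hypothetical overfit model above.
accuracies = np.array([1.00, 0.95, 0.82])  # train, validation, test

print(round(accuracies.mean(), 3))  # 0.923 -- looks respectable, yet hides the weak test split
print(round(accuracies.std(), 3))   # 0.076 -- the spread is the real warning sign
```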
Why not just visualize it? If you’ve trained a [Queue](https://aiqc.readthedocs.io/en/latest/notebooks/api_low_level.html#8.-Queue-of-training-Jobs.) of models using AIQC, it’s as easy as:
queue.plot_performance(
    score_type:str=None,
    min_score:float=None,
    max_loss:float=None
)
- `score_type`: choose from the following metrics for categorization (accuracy, f1, roc_auc, precision, recall) or quantification (R², MSE, explained variance).
- The `min_score` and `max_loss` arguments act as thresholds that remove any model with a split that does not meet them, resizing the graph to fit the models that do.
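For example, a hypothetical call for a classification Queue (the 0.85 threshold is an arbitrary illustration) might look like this:

```python
# Plot accuracy for every split of every model,
# dropping any model whose splits fall below 85% accuracy.
queue.plot_performance(score_type="accuracy", min_score=0.85)
```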

🪃 As you may have guessed, it’s named the boomerang plot because of the curve it traces for each model.
AIQC is able to do this because, while the `Queue` is training, it automatically generates metrics and plots for each split/fold of each model according to its `Queue.analysis_type`. So when the time comes for evaluation, this information can simply be called up by the practitioner.
🔬 Interpretation of model performance
The beauty of visualization is that it enables the practitioner to conduct their own unsupervised interpretation. We perform our own clustering analysis just by looking at the plot.
A quick glance and a hover over the chart above tells us that:
- The architectures on the right are inferior.
- The highly performant models on the left are overfit on the training data.
- The most generalizable model is the orange one, `Predictor.id==15`.
- However, we’re not done training yet. We need to tweak the orange model to see if we can get its performance up. So next I’d take a look at its parameters and learning curve to see what could be improved.
predictor = aiqc.orm.Predictor.get_by_id(15)
predictor.get_hyperparameters()
predictor.plot_learning_curve()
⏩ Can we get insight sooner?
Having gone through this cycle many times, I decided to package the entire experience into a realtime dashboard that addresses the following needs:
- Add models to the plot as they are trained.
- Run the training queue and the plot in separate processes.
- Change scores and metrics without re-calling the plot.
- Fetch hyperparameters and other supplemental information without manually querying the ORM.
from aiqc.lab import Tracker
app = Tracker()
app.start()
"📊 AIQC Tracker http://127.0.0.1:9991 📊 "
Voilà – we now have the realtime Dash app shown in the GIF at the start of this blog post.
_AIQC is an open source library written by the author of this post._ Please consider giving AIQC a star on GitHub ⭐ https://github.com/aiqc/aiqc