A deep dive into the Power BI incremental refresh feature

What is it, how is it supposed to work, and how does it really work?

Jonathan Law
Towards Data Science

--

Image by the author. Names in this image are fictional and computer generated.

If you are reading this, you most probably know what Power BI (PBI) is and what it is for. In this article, I would like to dive a little deeper into its incremental refresh feature, which has caused a good deal of confusion across the community and among people I know.

In PBI, there is a feature called incremental refresh that refreshes only the newest slices of your dataset instead of reloading everything. Microsoft covered the benefits of incremental refresh too, so do take a read before continuing here. Unlike conventional ETL/ELT, PBI has a rather unique way of incrementally refreshing your data, which I will discuss below.

*Note that incremental refresh only applies to datasets published to the PBI web service.


ETL/ELT Basics

Stitchdata wrote a detailed article regarding the basics of the ETL process, which can be found here.

To summarize, an ETL tool will usually check for new data using Change Data Capture (CDC). However, not all databases have CDC enabled, for various reasons; in that case, the latest row update timestamp is used instead. New and changed data is then either appended to the data warehouse or used to overwrite the old data, depending on the business case.
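As a rough sketch of the timestamp-based fallback (the approach used when CDC is unavailable), the load can be expressed like this. The row shape and the `id`/`update_timestamp` keys are my own illustrative assumptions, not a specific tool's schema:

```python
from datetime import datetime

def incremental_load(source_rows, warehouse_rows, last_sync):
    """Naive timestamp-based incremental load: pick up source rows whose
    update timestamp is newer than the last successful sync, then upsert
    them into the warehouse keyed by id. Returns the merged warehouse
    rows and the new sync watermark."""
    changed = [r for r in source_rows if r["update_timestamp"] > last_sync]
    by_id = {r["id"]: r for r in warehouse_rows}
    for row in changed:
        by_id[row["id"]] = row  # insert new rows, overwrite updated ones
    new_sync = max((r["update_timestamp"] for r in changed), default=last_sync)
    return list(by_id.values()), new_sync
```

Real ETL tools add batching, deletes, and late-arriving-data handling on top of this, but the core watermark-and-upsert loop is the same.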

Microsoft Incremental Refresh Theoretical Basics

Looking back at the documentation Microsoft provided, incremental refresh is broken down into three components:

  1. archived data
  2. incremental data range
  3. detect data changes

The documentation also states that the PBI service refreshes data based on partitions.

To help you out, I have created a simple table in Excel to demonstrate how this works in theory. Let us assume that today is 2022–08–02.

Image dataset provided by the author

Above is a small sample of the initial dataset when we first load the data into the PBI service. A few partitions are created automatically, which would look something like this:

Initial partition dataset: Image provided by the author

1. Archived data

Image provided by the author: Power BI Incremental popup archive option screenshot

Without updating any data, and with the archive data range set to 1 year, the PBI service will remove all partitions more than 1 year before today (2022–08–02). In the example above, Partition 2020 will be removed.
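The pruning rule can be sketched as follows; this is my own simplified reconstruction, which compares partition start dates only and assumes the partitions are identified by their start date:

```python
from datetime import date

def partitions_to_remove(partition_starts, today, archive_years=1):
    """Return the partitions that fall outside the rolling archive window,
    i.e. those starting more than `archive_years` before today."""
    cutoff = date(today.year - archive_years, today.month, today.day)
    return [p for p in partition_starts if p < cutoff]
```

With partitions starting 2020–01–01, 2021–09–01, and 2022–08–01 and today set to 2022–08–02, only the 2020 partition falls before the 1-year cutoff, matching the example above.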

2. Incremental data range

Image provided by the author: Power BI Incremental popup incremental option screenshot

If we set the incremental range to the 7 days before the refresh date, the PBI service will essentially delete the data from the last 7 days and reload those 7 days from the source. In our sample dataset above, everything in "Partition 2022–08-xx" will be deleted and reloaded.

3. Detect data changes

Image provided by the author: Power BI Incremental popup detect data change option screenshot

Now when we select detect data changes, the PBI service will only refresh partitions within the incremental range (the last 7 days in our example) whose max update date has changed.

Image provided by the author

In our dataset example, if row F was changed, no refresh should happen, as Partition 2022-Q1 falls outside our 7-day incremental range.

Updated incremental refresh range: image by the author

But if we change our incremental range from 7 days to 12 months before the refresh date, the Partition 2022-Q1 data (rows E and F) will be deleted and reloaded.

The best case and worst case scenario

The best-case scenario would look like a transaction table, where new rows are added and old rows are not updated.

The normal scenario would be one where new rows are usually added and updates are confined to recent partitions. A simple concert event table is an example: ticket purchases and cancellations happen within a short time frame, and 1 month after the event the data is no longer updated.

The worst-case scenario would be one where every partition has at least 1 update. Because PBI incremental refresh reloads an entire partition whenever any row in it changes, this can trigger a near-full table reload. A use case might be an inventory system, where items checked in months or years ago are checked out daily.

Real-world testing — Power BI Dataset

Now that we know how the PBI service incrementally refreshes the dataset, does it actually perform the same when deployed?

Image by the author

To test this out, I set up a MySQL database with 500 rows of dummy data, with create and update dates ranging from 2021–08–15 to 2022–08–08.

Image by author: Incremental refresh configuration

The PBI incremental refresh configuration is shown in the screenshot above.

I enabled the MySQL general query log to capture the queries that were run.

SET global general_log = 1;
SET global log_output = 'table';
-- the captured queries can then be read back with:
-- SELECT event_time, argument FROM mysql.general_log ORDER BY event_time;

Then I published the Power BI report (and its dataset) to a Premium workspace and refreshed the dataset via the Power BI Gateway installed on my PC.

Test case PBI01: Refresh without updating any data

For the first test case, I simply refreshed the dataset again without updating any data in the database.

Image provided by the author
select max(`rows`.`update_timestamp`) as `update_timestamp`
from
(
select `_`.`update_timestamp`
from `test`.`inc_t` `_`
where `_`.`create_timestamp` >= '2022-05-01 00:00:00' and `_`.`create_timestamp` < '2022-06-01 00:00:00'
) `rows`

From the logs, the PBI service does check the max update timestamp for each partition, in our case:

  • 2022–03–01 > 2022–04–01
  • 2022–04–01 > 2022–05–01
  • 2022–05–01 > 2022–06–01
  • 2022–06–01 > 2022–07–01
  • 2022–07–01 > 2022–08–01
  • 2022–08–01 > 2022–09–01

In our case, as no data has been updated, no refresh happened and no additional queries were run. The first test case passed.

Test case PBI02: Updating row within incremental range

In the second test case, I updated the 2 rows whose create date falls on 2022–06–01 on the MySQL source database, setting their update timestamp to 2022–08–08.

UPDATE INC_t
SET UPDATE_TIMESTAMP = '2022-08-08 09:29:23'
WHERE DATE(CREATE_TIMESTAMP) = '2022-06-01';
-- 2 row(s) affected Rows matched: 2 Changed: 2 Warnings: 0 0.015 sec

Then, of course, I manually triggered the scheduled refresh.

select `$Ordered`.`id`,
`$Ordered`.`name`,
`$Ordered`.`alphanumeric`,
`$Ordered`.`phrase`,
`$Ordered`.`country`,
`$Ordered`.`update_timestamp`,
`$Ordered`.`create_timestamp`
from
(
select `_`.`id`,
`_`.`name`,
`_`.`alphanumeric`,
`_`.`phrase`,
`_`.`country`,
`_`.`update_timestamp`,
`_`.`create_timestamp`
from `test`.`inc_t` `_`
where `_`.`create_timestamp` >= '2022-06-01 00:00:00' and `_`.`create_timestamp` < '2022-07-01 00:00:00'
) `$Ordered`
order by `$Ordered`.`id`

The initial per-partition MAX update check ran first, followed by the new query above, which reloads all data from 2022–06–01 to 2022–07–01.

This shows that incremental refresh is in line with the documentation, and the entire dataset is not loaded.
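The monthly partition bounds in the WHERE clauses of these queries can be derived from a changed row's create date. A minimal sketch, assuming the monthly grain configured in the refresh policy:

```python
from datetime import date

def month_partition_bounds(create_date):
    """Return the [start, end) bounds of the monthly partition containing
    the given create date, mirroring the half-open WHERE clauses seen in
    the logged queries."""
    start = date(create_date.year, create_date.month, 1)
    if create_date.month == 12:
        end = date(create_date.year + 1, 1, 1)  # roll over the year
    else:
        end = date(create_date.year, create_date.month + 1, 1)
    return start, end
```

For our updated rows (create date 2022–06–01), this yields exactly the 2022–06–01 to 2022–07–01 window that was deleted and reloaded.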

Real-world testing — Power BI Datamart (preview) service

Microsoft recently introduced datamarts, which you can read more about here. Within the datamart itself, there is an option for incremental refresh. To test it, I connected the datamart to the same MySQL database via the same Power BI Gateway that was already set up.

Here is the query that ran upon the initial data load. As shown, all data is loaded without any WHERE clause.

Image by the author

After the initial data is loaded, it is time to set up incremental refresh on that table in the datamart.

Image by the author
Image by the author: Incremental refresh configuration

The configuration is set up to be similar to the first test case I did in Power BI Desktop. With a datamart, however, there is no need to create the RangeStart and RangeEnd parameters, as they are created automatically.

And of course, I manually triggered the scheduled refresh for the datamart.

Test case D01: Refresh without updating any data

From the queries that ran, there were no queries checking for the MAX update. Instead, all the partitions for the last 6 months were queried and reloaded.

select `_`.`id`,
`_`.`name`,
`_`.`alphanumeric`,
`_`.`phrase`,
`_`.`country`,
`_`.`update_timestamp`,
`_`.`create_timestamp`
from
(
select `_`.`id`,
`_`.`name`,
`_`.`alphanumeric`,
`_`.`phrase`,
`_`.`country`,
`_`.`update_timestamp`,
`_`.`create_timestamp`
from `test`.`inc_t` `_`
where `_`.`create_timestamp` >= '2022-05-01 00:00:00' and `_`.`create_timestamp` < '2022-06-01 00:00:00'
) `_`
order by `_`.`update_timestamp` desc
limit 4096

This indicates that "refresh changed data" in the datamart does not behave like "detect data changes" in the PBI dataset incremental refresh.

To test further, I updated my configuration to refresh only the last 1 month of data and jumped to the second test case.

Image by the author

And as expected, 1 month of data is queried and refreshed.

select `_`.`id`,
`_`.`name`,
`_`.`alphanumeric`,
`_`.`phrase`,
`_`.`country`,
`_`.`update_timestamp`,
`_`.`create_timestamp`
from
(
select `_`.`id`,
`_`.`name`,
`_`.`alphanumeric`,
`_`.`phrase`,
`_`.`country`,
`_`.`update_timestamp`,
`_`.`create_timestamp`
from `test`.`inc_t` `_`
where `_`.`create_timestamp` >= '2022-08-01 00:00:00' and `_`.`create_timestamp` < '2022-09-01 00:00:00'
) `_`
order by `_`.`update_timestamp` desc
limit 4096

Test case D02: Updating row from 2022–06–05

Given that "refresh changed data" does not behave the same as the PBI service's "detect data changes", I updated the rows whose create date falls on 2022–06–05.

UPDATE INC_t
SET UPDATE_TIMESTAMP = '2022-08-28 23:00:00'
WHERE DATE(CREATE_TIMESTAMP) = '2022-06-05';
-- 4 row(s) affected Rows matched: 4 Changed: 4 Warnings: 0 0.000 sec

After running the refresh, only 1 query ran, which refreshed this month's data only.

Image by author

Checking on the update timestamp column, it is evident that the latest update is not refreshed into the datamart. Therefore, the second test case for the datamart incremental refresh is considered a failure.

Issue with datamart

I repeated this test on the datamart with other variables and data sources, such as Oracle DB; however, the result is always the same: the "refresh changed data" setting is never respected, and a full refresh of all partitions in the incremental range always happens.

Conclusion

Incremental refresh is essential in optimizing the data load from the source database, reducing the overhead on the source while keeping the report data up to date. With the introduction of Microsoft datamart, unifying data across various sources should become easier once it is out of preview. Hopefully, this article clears up some of the confusion about how incremental refresh works today.

The Power BI dataset incremental refresh works as expected, but unfortunately, the datamart incremental refresh does not. As datamart is still in preview, I am sure Microsoft will fix this soon; in the meantime, this is just a PSA that datamart refreshes may put unnecessary load on your production database. I will update this article once I have retested and confirmed a fix, or if I find I have missed something. I will also be testing Microsoft Dataflow, which is supposed to be deprecated soon; if it serves as a better connector to datamart, I will write another article on the setup!

--


I am Jonathan Law Hui Hao, a Business Support Specialist in Malaysia. I combine logistics and process improvement with technology.