Why Incremental Loading is Valuable
One of the most important goals of big data analytics is for the data scientist to work efficiently and effectively through the extraction, transformation, and loading (ETL) process. When working with large administrative datasets, this usually means touching only the data required to complete the project. Many data processing and analysis steps can be automated with knowledge of the data and the specific project requirements. Because fields needed for analysis may contain missing, incomplete, or inaccurate data, it is often necessary for a person to manually review files that cannot be processed by the written program code. When this happens, the number of files exported for review should be handled in a way that minimizes the risk of further data degradation (preserving accuracy) and reduces the time required of the reviewer. Incremental loading is one option to turn to in this environment.
Trap the data for review
The first step is to trap, or simply identify, the data that is suspect and in need of further human review before being processed completely. For this step, you need to create a new identifier and detect all the invalid data that could potentially be processed incorrectly if it were sent through the program "as is".
In SAS, suspect data can be trapped with IF-THEN/ELSE logic (the IFN function offers a more compact alternative, shown after the example below). For example, suppose X is a numeric variable whose values must fall in the interval [0,100].
/*Call data and trap invalid values to ensure data accuracy*/
data data_adminclaims2;
set data_adminclaims1;
/*Unique ID Created for manual case review*/
id_review = _n_;
/*Create destination pathway for data*/
/*Isolate invalid values of variable X by flagging them with variable Y*/
if x<0 then y=0;
else if x>100 then y=1;
else y=x;
run;
The program here isolates invalid numeric values with a few simple lines of code that can be extended to other variables, including character variables. This is powerful for administrative data, which often has multiple variables, numeric or character, requiring review.
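As mentioned above, the same trap can also be written more compactly with the IFN function. The sketch below is an alternative form, assuming the same variable and dataset names as the example, not an additional required step.
/*Compact alternative to the IF-THEN/ELSE block using the IFN function (same variables assumed)*/
data data_adminclaims2_ifn;
set data_adminclaims1;
id_review = _n_;
/*IFN returns the second argument when the condition is true and the third when it is false*/
y = ifn(x<0, 0, ifn(x>100, 1, x));
run;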

Prepare Files For Review and Create Output Data
The second step is to sequester the "trapped" data from the previous step so that it can be packaged together and exported as a review dataset for a human. This is the selective movement of data from one system to another, and it creates a load pattern that lets you properly identify the data that has been created or modified since the previous processing run.
/*Prepare dataset of pending administrative data for manual external review*/
/*Set review rules for each file created*/
data data_y;
set data_adminclaims2;
/*Keep only records flagged for review on variable Y*/
if y=0 or y=1;
run;
data data_z;
set data_adminclaims2;
/*Keep only records flagged for review on variable Z (two separate subsetting IF statements would keep nothing)*/
if z=0 or z=1;
run;
data exported_review;
set data_y data_z;
run;
proc sort data=exported_review nodupkey;
by id_review;
run;
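When several variables carry review flags, the per-variable data steps above can be generated with a small macro. The sketch below assumes each flag variable uses the same 0/1 review values as the trap step; the macro name and parameter are illustrative, not part of the original program.
/*Hypothetical macro that builds one review subset per flagged variable*/
%macro flag_review(var);
data data_&var.;
set data_adminclaims2;
/*Keep only records flagged for review on this variable*/
if &var. in (0, 1);
run;
%mend flag_review;
%flag_review(y)
%flag_review(z)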
Export Current Dataset for Review
The third step is to create a review file, compare ("bump") it against the previously reviewed file, and export the final quality control dataset so that it can be reviewed manually. These files are exported because there is some specific problem with their data quality or content. Because the process recurs, without a mechanism such as an incremental load the number of review files can accumulate, become messy, raise data quality control concerns, and place time constraints on staff. After export, the person reviewing the data must know how the data should be correctly entered and must be able to apply a decision tree to accurately triage where the data goes after review (i.e., keep the data or remove it).
/*Export review dataset for internal manual review by employees*/
proc export data=cloud_directory.exported_review
outfile="&parent_dir.DataDestinationExportManualReview_&todaysDate..xlsx"
dbms=xlsx replace;
run;
The comparison that follows relies on an identifier that is unique and specific to each person or event. The variable can be numeric or character, and components can be concatenated or zero-padded to ensure a consistent number of digits and guarantee a unique key for each review record.
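The program shown does not include the step that builds this key. The sketch below shows one possible construction, where the dataset name exported_review2 matches the later join and the fields member_id and claim_date are hypothetical stand-ins for whatever identifies the person or event in the source data. Because the key is used to match against previously reviewed records, it should be built only from fields that stay stable across runs.
/*Hypothetical construction of the unique review key used in the comparison below; member_id and claim_date are placeholder fields*/
data exported_review2;
set exported_review;
/*Concatenate stable person/event fields into one character key*/
id_review2 = catx("_", member_id, put(claim_date, yymmddn8.));
run;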
/*Import previously reviewed data validated by employees for comparison to the current temporary dataset in need of review*/
proc import out = already_reviewed
datafile="CloudDirectoryDataalready_reviewed.xlsx"
dbms=xlsx replace;
run;
/*Create an incremental file that matches previously reviewed files with the current data run stored in the work folder to ensure only new and unique files are exported*/
proc sql;
create table exported_review3 as select l.*, r.id_review2 as id_new
from exported_review2 l left join already_reviewed r
on l.id_review2 = r.id_review2;
quit;
data exported_review4;
set exported_review3;
/*Keep only records with no match in the previously reviewed file*/
if missing(id_new) then output;
run;
/*These files are unique and are exported for complete manual review in the new data cycle*/
proc export data=exported_review4
outfile="CloudDirectoryDataManual_ReviewExportedReview4_&todaysdate..xlsx"
dbms=xlsx replace;
run;
The step above is where files from the current program run are compared to previous program runs. This ensures that the exported file contains as few files for manual review as possible. Completing this step protects the integrity of files already processed and reduces overall quality control time.
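For the comparison to keep working across cycles, the history of previously reviewed records has to be kept current. The sketch below shows one way this could be done after each export; appending the newly exported records and rewriting the history file are assumed maintenance steps, not steps shown in the original program.
/*Append the records just exported for review to the cumulative history so they are not exported again next cycle (assumed maintenance step)*/
proc append base=already_reviewed data=exported_review4 force;
run;
/*Rewrite the history file in its external location (the path shown earlier is reused as an assumption)*/
proc export data=already_reviewed
outfile="CloudDirectoryDataalready_reviewed.xlsx"
dbms=xlsx replace;
run;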

Re-integration of Separated Dataset into Program for Full Data Processing
The goal of the entire process is to re-integrate the data that required review back into the program. Once it is back in the program, a combined dataset needs to be created that merges the reviewed data with the data currently being processed (which did not require review), so that the program can finish whatever tasks the project requires.
/*This dataset is imported for integration into the current data cycle and is included in data processing*/
proc import out=Reviewed_Files
datafile="CloudDirectoryDataReviewedFilesReviewedFiles_&todaysdate..xlsx"
dbms=xlsx replace;
run;
/*Keep only records the reviewer approved*/
data Reviewed_Files2;
set Reviewed_Files;
if DataApproved = "Yes";
run;
/*This dataset is created and fully reintegrates the review data with the currently processing data to ensure the program continues with only quality controlled data*/
data Quality_Controlled;
set Exported_Review2 Reviewed_Files2;
run;
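A quick check after the recombination can confirm that no record appears twice. The sketch below is an assumed verification step, not part of the original program, and it presumes id_review remains a unique key in the combined data.
/*Remove any duplicates introduced by the recombination and save them for inspection*/
proc sort data=Quality_Controlled nodupkey dupout=dup_check;
by id_review;
run;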
What makes the process of incremental loading valuable
First and foremost, the process saves a great deal of time from both a CPU and a human standpoint. Touching only a fraction of the data in a large data warehouse environment requires less computational power at an individual workstation, and exporting a small number of review files makes any process requiring human review more efficient in terms of time, accuracy, and security. The second benefit is that any time a program is run there is potential for the process to fail or encounter problems; an incremental load adds and modifies less data, so in the event of a program error fewer data files need correction or verification (i.e., data validation and change verification are expedited). Finally, incremental loads are typically used when a program runs on a periodic schedule against large amounts of data. Implementing this process can minimize data bottlenecks and maintain a steady run time and performance across periodic program runs.
Incremental loading is also versatile and can be used for three basic functions: to insert data, to insert and update data, or to insert, update, and delete data. The process above describes the third, where new data is added, existing data is updated, and data that is no longer needed is deleted.
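To make the three actions concrete, the sketch below applies them to a hypothetical master table with PROC SQL. The table name master_claims, the reuse of id_review as the key, the assumption that the tables share the same columns, and the DataApproved = "No" rule for deletions are all illustrative assumptions, not part of the program above.
/*Illustrative insert/update/delete pattern against an assumed master table*/
proc sql;
/*Insert: add records whose key is not yet in the master table*/
insert into master_claims
select * from Quality_Controlled
where id_review not in (select id_review from master_claims);
/*Update: refresh a value for keys that already exist in the master table*/
update master_claims
set x = (select x from Quality_Controlled
where Quality_Controlled.id_review = master_claims.id_review)
where id_review in (select id_review from Quality_Controlled);
/*Delete: drop records the reviewer marked as no longer needed*/
delete from master_claims
where id_review in (select id_review from Reviewed_Files where DataApproved = "No");
quit;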

What makes incremental loading challenging
Unless you come from a programming background or are well versed in data science, the incremental loading process tends to be complex to learn at first. While a full load simply dumps all of the previous data into the program, an incremental load requires the added logic of identifying the new and specific data that needs separate handling. With large datasets stored in a data warehouse environment, the logic to sort through the data and process it appropriately can become complex quickly. This demands both knowledge of the process and the time to implement the appropriate code within the program, and quality control requires a good team of programmers who can verify each other's work within the data environment. A final consideration is that there may not be a single "good" way of identifying the data that is new, changed, or requires action, which adds the extra step of creating an appropriate identifier.
With these considerations in mind, an incremental load is most appropriate when the data is large, querying it is slow because of its size or technical constraints, there is a specific means to detect a change, and data is either deleted from or retained in an external destination.
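Change detection itself can be as simple as filtering on a last-modified date. The sketch below assumes the source data carries a last_modified variable and that the date of the previous run is stored in a macro variable; neither appears in the program above.
/*Minimal change-detection sketch; last_modified and last_run_date are assumed*/
%let last_run_date = '01JAN2024'd;
data changed_records;
set data_adminclaims1;
/*Keep only records created or modified since the previous run*/
if last_modified > &last_run_date;
run;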
Summary of key points
The goal of the data scientist is to work efficiently and effectively with large amounts of data and to be familiar with the extraction, transformation, and loading process for that data. To complete the goals of a project effectively, a data scientist must strive to work only with the necessary data and be able to separate it from the rest of the data environment. This usually means automating parts of the data processing and analysis program using specific knowledge of the data specifications and project requirements. Some data fields in administrative datasets used for analysis may contain missing, incomplete, or inaccurate data, and the goal is to create an efficient review and correction process through the program code. Incremental loading is a complex but valuable technique that enables the data scientist to handle large data and its complexities with greater ease.