An introduction to data cleaning and manipulation with Knime

Python, SQL, R and Julia are some of the popular programming languages used for data analysis by data scientists and analysts globally. There are also software suites such as Alteryx and Knime that provide an IDE (integrated development environment) where data enthusiasts, and technical experts with little programming background, can work on their data to extract meaningful information from it. The key feature of these IDEs, called visual programming, lets the user visualize the flow of data from raw information to processed information, to models, and on to the final meaningful output. In this article I will give a brief introduction to visual programming with Knime using Fitbit health data available from Kaggle.
The Extract → Load → Transform (ELT) approach will be used for this exercise. The downloaded data has already been ‘extracted’; the load and transform steps will be done in Knime.
Reading and Manipulating Data

The ‘CSV Reader’ node is used to load data into the Knime platform from its source. There are similar nodes such as Excel Reader, JSON Reader, Table Reader and Web reader; for this article I will focus on the CSV Reader. The description window in the Knime IDE, as well as the documentation on Knime Hub, has more information about each data input node used in Knime.
To input data, right-click on the CSV Reader node and choose ‘Configure’ from the options. In the Configure window that opens, you can choose the CSV file that you want to import from your source location.

If you are familiar with the data, you can click on the ‘Transformation’ tab in the ‘Configure’ window and choose which columns to read in. This way you don’t have to carry unnecessary information through later processing steps, for example when building models. In the example above, I have kept only the ‘TotalSteps’, ‘TotalDistance’ and ‘Calories’ columns and left out the rest.
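For readers coming from Python, the same load-with-selected-columns step can be sketched in pandas. This is not Knime code; the sample rows below are made up, and only the column names follow the Fitbit dataset:

```python
import io
import pandas as pd

# Stand-in for the downloaded Fitbit CSV (made-up rows); in practice
# you would pass the path to the file instead of this StringIO buffer.
raw = io.StringIO(
    "Id,ActivityDate,TotalSteps,TotalDistance,Calories\n"
    "1503960366,4/12/2016,13162,8.50,1985\n"
    "1503960366,4/13/2016,10735,6.97,1797\n"
)

# usecols plays the role of the 'Transformation' tab:
# only the listed columns are loaded, the rest are skipped.
df = pd.read_csv(raw, usecols=["TotalSteps", "TotalDistance", "Calories"])
print(df.columns.tolist())  # ['TotalSteps', 'TotalDistance', 'Calories']
```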

Another way to choose the columns is the ‘Column Filter’ node; its configuration window is shown here. Both ways of choosing columns work; however, the ‘Transformation’ tab of the CSV Reader gives you insight into the data types and also allows you to reorder the columns into an arrangement you are comfortable with. The resulting table can be viewed by right-clicking on the ‘Column Filter’ node and choosing ‘Filtered table’.
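The Column Filter step has an equally small pandas equivalent. A minimal sketch, assuming the table (with made-up values) is already loaded:

```python
import pandas as pd

# Hypothetical table with more columns than we need.
table = pd.DataFrame({
    "Id": [1503960366, 1503960366],
    "TotalSteps": [13162, 10735],
    "TotalDistance": [8.50, 6.97],
    "Calories": [1985, 1797],
})

# Column Filter equivalent: keep only the listed columns,
# in the order given.
filtered = table[["TotalSteps", "TotalDistance", "Calories"]]
```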

Another useful node for data preparation is the ‘Column Rename’ node. As the name suggests, it allows you to change the column names of the input data to simpler variable names (much like in programming) for your model. This does not change the column names in the source file at all. Here, I am renaming ‘TotalSteps’ and ‘TotalDistance’ to ‘Steps’ and ‘Distance’ respectively.
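In pandas terms, Column Rename corresponds to `DataFrame.rename`; a minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"TotalSteps": [13162], "TotalDistance": [8.5],
                   "Calories": [1985]})

# Rename only the listed columns; the data itself is untouched,
# just as the node leaves the source file unchanged.
df = df.rename(columns={"TotalSteps": "Steps", "TotalDistance": "Distance"})
print(df.columns.tolist())  # ['Steps', 'Distance', 'Calories']
```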

In my example I have used two CSV Reader nodes to read data from two sources. Now that the inputs have been modified to have the same columns, the two tables can be joined into a single output table for our data processing. The ‘Concatenate’ node is used for this purpose. You can choose whether duplicate row IDs should be skipped, suffixed with a value, or cause the execution to fail. You can also choose between an intersection and a union of columns: with intersection, only the columns present in both tables are kept, whereas with union all columns are merged into the concatenated table and any cells missing from one table are filled with missing values.
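The intersection/union behaviour of the Concatenate node maps onto the `join` argument of `pandas.concat`; a sketch with two made-up tables whose columns differ by one:

```python
import pandas as pd

a = pd.DataFrame({"Steps": [13162], "Distance": [8.50], "Calories": [1985]})
b = pd.DataFrame({"Steps": [10735], "Distance": [6.97], "Calories": [1797],
                  "Extra": ["only in b"]})

# join="inner" ~ intersection of columns: only shared columns survive.
inner = pd.concat([a, b], join="inner", ignore_index=True)

# join="outer" ~ union of columns: all columns are kept, and cells
# missing from one table are filled with NaN.
union = pd.concat([a, b], join="outer", ignore_index=True)
```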

Now that the data has been filtered by columns and merged, the last manipulation is to clean the data by removing any rows where the step count is 0. The ‘Row Filter’ node can be used for this purpose. It has various options for including or excluding rows based on row type, row number, row value and row ID. I have chosen ‘Steps’ as the column and excluded rows where the value is 0. Note that this node filters on a single column at a time. To filter on multiple columns, instead of chaining several Row Filter nodes you can use the ‘Rule-based Row Filter’ node, which lets you write simple rules that combine conditions on multiple columns.
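For this row-level cleaning, the Row Filter and Rule-based Row Filter correspond roughly to boolean indexing and `DataFrame.query` in pandas; a sketch over made-up rows:

```python
import pandas as pd

df = pd.DataFrame({"Steps": [0, 13162, 10735],
                   "Calories": [1500, 1985, 1797]})

# Row Filter equivalent: exclude rows where Steps is 0.
nonzero = df[df["Steps"] != 0]

# Rule-based Row Filter equivalent: one rule combining
# conditions on several columns.
active = df.query("Steps != 0 and Calories > 1800")
```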

The ‘Linear Correlation’ node is a visualization node that gives you a measure of correlation between the different columns in your data. This lets you see how a change in one column relates to a change in another. The correlation matrix shown here is the output of the Linear Correlation node. Dark blue cells indicate strongly correlated column pairs, while lighter blue cells indicate weaker correlation. A red cell indicates an inverse (negative) correlation between the two variables (the X and Y columns) of that cell, and a cross (X) means there is no correlation.
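The matrix the node renders can be computed the same way with `DataFrame.corr` (Pearson by default); the values below are made up, chosen so that steps and distance move together:

```python
import pandas as pd

df = pd.DataFrame({
    "Steps":    [1000, 5000, 9000, 13000],
    "Distance": [0.8, 3.1, 6.4, 8.9],
    "Calories": [1500, 1650, 1900, 2050],
})

# Pearson correlation matrix; values near +1 correspond to the dark
# blue cells, values near -1 to the red (inverse correlation) cells.
corr = df.corr()
```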

The ‘Scatter Plot’ node allows you to plot one column against another; it does not support a secondary axis. The output of the Scatter Plot node is shown above. You can customize the plot with labels, a title and sub-titles, and Knime also lets you switch the view to other columns on the X and Y axes. This makes it faster to inspect the relation between different columns than building a separate node for each chart.
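A comparable chart can be produced in Python with matplotlib; a sketch with made-up values, assuming matplotlib is installed (the Knime node itself requires no code):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen instead of opening a window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"Steps": [1000, 5000, 9000, 13000],
                   "Calories": [1500, 1650, 1900, 2050]})

fig, ax = plt.subplots()
ax.scatter(df["Steps"], df["Calories"])
ax.set_xlabel("Steps")
ax.set_ylabel("Calories")
ax.set_title("Calories vs Steps")
fig.savefig("calories_vs_steps.png")
```

Swapping the X and Y columns is just a matter of passing different columns to `scatter`, much like changing the axis selection in the node's view.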
Conclusion
This article is intended to give a brief introduction to data entry and transformation in Knime. Knime is a powerful visual programming tool with a huge repository of nodes and features available through its IDE and the Knime Hub. With little to no programming required, Knime allows users to work with their data and to build and deploy machine learning models for fast processing, rather than having to code everything in Python, R or SQL. While Python, R and SQL are key skills for data science, Knime provides an excellent alternative for those who are not comfortable with programming yet need to make decisions based on model outputs.
References:
- Knime Hub – https://hub.knime.com/
- Knime Courses – https://knime.learnupon.com/dashboard
- Kaggle – https://www.kaggle.com/
- Cover Image from Unsplash