
Make-or-Buy
Do you want to get started with data science but lack the appropriate infrastructure? Or are you already a professional but still have knowledge gaps in deep learning?
Then you have two options:
1. Rent a virtual machine from a cloud provider like Amazon, Microsoft Azure, Google Cloud or similar.
2. Build your own physical machine and install the right software.

I tried both options but in the end the decision to build my own rig was the better one, and these are the reasons why:
- Cost savings: If you need a system with a strong graphics card (GPU), a fast processor (CPU) and a lot of RAM, building a machine can actually save you money in the long term! Prices vary widely from cloud provider to cloud provider depending on the services, but for the same money you can build your own machine and keep it forever. And if you need Windows as your operating system (OS), you will definitely save money. And believe me: you will need Windows in the end!
- More power and resources: A study from Bizon-Tech shows that, over one year, a pre-built machine with 1 GPU is up to 10 times cheaper than web-based services, and one with 4 GPUs is up to 21 times cheaper. And when it comes to storage capacity, prices for web services go through the roof above a certain size.
- Your machine can also be used for other tasks: Finally, you can use your computer for other tasks, such as film editing or PC gaming, and keep it forever.
Part 1: Choosing the right system and software
To build our system, we need to consider several points in advance. One of the key points is the choice of the right OS. We have the option to choose between Windows 10 Pro, Linux and Mac OS X. But which system meets all our requirements?
Let’s list our necessary requirements first:
- Deep learning capabilities: We want to compute deep learning models. The best choice at the moment is still an nvidia GPU for this task. Both TensorFlow and Keras, the two state-of-the-art deep learning frameworks, require an nvidia graphics card for GPU acceleration. Apple's Mac OS X has stopped supporting nvidia, and even with a Hackintosh (a custom Mac built with PC components) it is no longer possible to get an nvidia GPU to work. On Windows and Linux systems, nvidia GPUs work without problems and the necessary drivers are readily available. -> Result: Windows and Linux tied.
- Jupyter Notebook: We want to install Jupyter Notebook with Python. Jupyter is data scientists' computational notebook of choice, and we definitely don't want to miss it. Anaconda offers a correspondingly uncomplicated installation for all operating systems, and with it Python is installed as well. -> Result: Windows, Linux and Mac OS X tied.
- The right data blending and ETL tool: The best data blending and ETL tool is, in my opinion, KNIME. I have already written about the reasons [here](https://www.knime.com/software-overview). There are KNIME installation packages for all three operating systems. -> Result: Windows, Linux and Mac OS X tied.
- A good visualization tool: You won't believe how difficult it is to find a visualization tool that is available for free and offers all the features you need. Tableau Public is the right choice here, although you need to know some workarounds to use it productively. Still, it is the best choice at the moment. The installation packages are only available for Windows and Mac OS X. -> Result: Windows and Mac OS X tied.
- Database for storage: To keep things easy, we will start with SQLite as our database solution. It is not a sophisticated database like Oracle, MySQL or Teradata, but you can use most SQL commands with it, and in combination with KNIME it works very well. In my opinion SQLite is completely underrated. There are installation packages for all three operating systems. (If KNIME is already installed, you do not need to install any additional SQLite packages.) -> Result: Windows, Linux and Mac OS X tied.
- Remote desktop from other laptops/tablets: We want to be able to log in to our machine remotely and work on it from a laptop or tablet, for example. I have tried various tools, from VNC to TeamViewer, but so far the best experience has been with the Windows Remote Desktop app. It also works best from a tablet, always providing the appropriate screen resolution and an experience as if you were sitting in front of your machine. To enable Remote Desktop on Windows, you have to upgrade from the Home to the Pro version. -> Result: Windows.
So, in the end, which operating system best fulfills all our requirements? Maybe you will be surprised but the winner is: Windows 10 Pro!
Part 2: Building the machine
Now we have to build our machine and we have to choose the right components. A PC build consists of the following components:
- CPU
- Motherboard
- PC-Case
- RAM
- Cooling
- Hard disk
- GPU (Graphics card)
- PSU (Power supply)
CPU: If you are working with gigabyte-sized data and need to run a lot of queries, I would recommend investing in a good CPU. In deep learning, CPUs are mainly used for data loading: more threads mean you can load more data in parallel to feed into your models for training. There is a big fight on the market between Intel and AMD for the best performance and price, but for now I would go for AMD Ryzen CPUs. They keep releasing relatively affordable multithreaded processors every year. For my PC, I even bought a used AMD Ryzen 7 2700X (3.70 GHz) from an old gamer. It's got a few years on it, but if you're lucky, you can get it cheap through an online auction like eBay. For the best CPU multi-core (not single-core) performance, check the Geekbench benchmark site.
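To make the "more CPU threads = faster data loading" point concrete, here is a minimal sketch of a parallel input pipeline with tf.data, assuming a recent TensorFlow 2.x and a hypothetical folder data/images of JPEG files:

```python
import tensorflow as tf

# Hypothetical image folder; adjust the path and pattern to your own data.
files = tf.data.Dataset.list_files("data/images/*.jpg")

def load_and_resize(path):
    # Decoding and resizing run on the CPU before each batch is handed to the GPU.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, [224, 224])

dataset = (
    files
    .map(load_and_resize, num_parallel_calls=tf.data.AUTOTUNE)  # spread loading over all CPU threads
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # overlap CPU data loading with GPU training
)
```

The more threads your CPU offers, the more files such a pipeline can decode in parallel while the GPU is busy training.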

Motherboard: Make sure the motherboard is compatible with your CPU and RAM. It is always a good choice to buy the motherboard together with the CPU in one package. When choosing a motherboard, make sure it has enough PCIe slots for the number of GPUs you want; the rule of thumb is that one GPU takes up the space of two PCIe slots. Another point is the form factor. I would definitely choose a classic ATX motherboard here, since our goal is not to build a small form factor PC, but a high-performance data science solution.
PC case: For the case I have chosen the Corsair Carbide Air 540. It offers enough space for all components, is easy to assemble and has good airflow. Always keep an eye on the temperature: if you compute a complicated deep learning model, the temperature of the machine can rise sharply over time.
RAM: There is a whole science behind getting the right RAM specifications, but the most important point is still the amount of RAM (more = better). If you are working with large data sets, such as images or log data, your data may fit entirely in memory, which can significantly speed up processing time. For my rig, I went for 32 GB of RAM from Corsair.
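A quick way to check whether a data set will actually fit in your RAM is to measure its in-memory footprint, for example with pandas (a minimal sketch; data.csv is a hypothetical file):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical data set

# Total in-memory size in gigabytes, including string (object) columns.
size_gb = df.memory_usage(deep=True).sum() / 1024**3
print(f"The DataFrame occupies {size_gb:.2f} GB of RAM")
```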
Cooler: A strong cooler is very important to keep the temperature of the system low. A water cooling setup offers high performance and reduces noise, but not always. I have tried different setups in the past and have to say that even a good air cooler can be quiet, and it usually takes up less space and is also easier to install. Here comes the advantage of an AMD Ryzen CPU: it heats up more slowly than a comparable Intel CPU.
Storage: The same applies to the hard disk as to the RAM: more = better. But not only! If you want to optimize your data loading speed, you will need the faster storage of a solid-state drive (SSD). Solid-state drives are more expensive than standard hard drives, but that should not be a purchase criterion. I would recommend installing one SSD for the operating system and the installed software (500 GB) and one for the data (1 to 2 TB). I like Samsung's SSDs, but any other brand will do.
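If you want to compare your drives, a rough check is to time how long the same large file takes to read from each of them (a minimal sketch; the file and drive paths are hypothetical):

```python
import time
import pandas as pd

def read_time(path):
    # Wall-clock time for loading a CSV from the given drive.
    start = time.perf_counter()
    pd.read_csv(path)
    return time.perf_counter() - start

print("SSD:", round(read_time("C:/data/big_file.csv"), 2), "s")  # hypothetical path on the SSD
print("HDD:", round(read_time("D:/data/big_file.csv"), 2), "s")  # hypothetical path on a slower drive
```

Keep in mind that the operating system caches files it has read recently, so read each copy only once per test.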
GPU (graphics card): Actually, it would be possible to compute (some) deep learning models with a strong CPU alone, but you would need time… a lot of time! GPUs are super-fast at computing deep learning models because, unlike CPUs with their comparatively small number of computational cores, GPUs have hundreds or thousands of simple cores that excel at matrix multiplication. As I said before, we should definitely go for an nvidia graphics card, since all current state-of-the-art frameworks (be it Keras, TensorFlow, PyTorch or any other library) fully support nvidia's CUDA SDK, a software library for interfacing with GPUs. Another important point is Tensor Cores. Tensor Cores accelerate matrix operations, which are foundational to deep learning, and perform mixed-precision matrix multiply and accumulate calculations in a single operation.
Tensor Cores can be found in the nvidia RTX GPU models. I opted for a relatively cheap GeForce RTX 2070 8G from MSI. With an RTX 2070 and 8 GB of video memory, you can train most SOTA (state-of-the-art) deep learning models and still not pay too much. See also the list of all GPUs on the Lambda Labs site.
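The Tensor Cores come into play when you train in mixed precision. In Keras with a TensorFlow 2 backend (2.4 or later), this can be switched on globally with a single policy, roughly like this sketch (the tiny model is only an illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

# float16 computation on the Tensor Cores, float32 variables for numerical stability.
mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(784,)),
    # Keep the final softmax in float32 so the loss stays numerically stable.
    layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```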

PSU (power supply): Now we have nearly all our components. The question is: how much power will we need? There are various PSU calculators for estimating the required wattage. For my system, I need around 360 W. To have enough headroom for a second graphics card in the future, I opted for a Seasonic Focus GX-750 with a more-than-sufficient 750 W.
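If you prefer a back-of-the-envelope estimate instead of an online calculator, you can simply add up the rough maximum draw of each component and add some headroom (a minimal sketch; the wattages are illustrative assumptions, not measured values):

```python
# Rough maximum power draw per component in watts (illustrative assumptions).
components = {
    "CPU (Ryzen 7 2700X)": 105,
    "GPU (RTX 2070)": 175,
    "Motherboard + RAM": 50,
    "SSDs + fans": 30,
}

total_watts = sum(components.values())
recommended_psu = total_watts * 1.4  # ~40% headroom for load peaks and a future second GPU
print(f"Estimated load: {total_watts} W, recommended PSU: {recommended_psu:.0f} W")
```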

This system cost me a total of about $1,500:
Components:
- CPU + motherboard + 32 GB RAM: $299
- SSD 1 (500 GB): $98
- SSD 2 (1 TB): $135
- PC case: $138
- GPU: $707
A year ago, I tried one of the cheapest cloud services available (still with very good value) and paid $700 for one year for a physical machine, but without a dedicated graphics card.
For your own custom system, PC Part Picker is a great help when putting your rig together.
Part 3: Installation of the software
After assembling the whole system, we need to install an operating system. We have already chosen Windows since it fits our needs.
Windows (installation): The first step is to download Windows and create installation media; we copy the installation files onto a USB stick.
nvidia and CUDA drivers (installation): To get the most out of your graphics card, you need the appropriate nvidia drivers. In addition to the normal GPU drivers, the CUDA toolkit must also be installed. In a later step, we will install KNIME with the Keras and TensorFlow integration, but more on that later.
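Once the driver is installed, you can verify it from Python by calling the nvidia-smi tool that ships with the driver (a minimal sketch; it only checks that the driver answers and lists your GPU):

```python
import subprocess

# nvidia-smi is installed together with the nvidia driver.
# If this prints a table that lists your GPU, the driver installation worked.
result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
print(result.stdout)
```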
Jupyter Notebook via Anaconda (installation): The easiest way to install Jupyter Notebook and Python is through the installation of Anaconda. When the installation is complete, you can start Jupyter Notebook from the corresponding tile (see the red arrow in the picture below).

You can now use Jupyter from the following address in the browser of your local machine: http://localhost:8888/tree
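As a quick smoke test, a first notebook cell can confirm which interpreter is running and that the usual Anaconda packages are available (a minimal sketch; the versions will of course differ on your machine):

```python
import sys
import numpy as np
import pandas as pd

print("Python :", sys.version)
print("NumPy  :", np.__version__)
print("pandas :", pd.__version__)
```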

KNIME and Deep Learning Integration (installation): KNIME is at the heart of our data science infrastructure, as it will orchestrate everything from data blending to machine learning and deep learning to preparation for visualization.
In addition to KNIME we also need to install the "Deep Learning Extension" called KNIME Deep Learning – Keras Integration. It offers a codeless GUI-based integration of the Keras library, while using TensorFlow as its backend.
The following steps need to be performed:
- Installation of the Deep Learning KNIME Extensions (Keras and TensorFlow)
- Creation of Conda environment
Installation of the Deep Learning KNIME Extensions: You can install the extensions from within KNIME Analytics Platform by clicking File in the top menu and selecting "Install KNIME Extensions…". This opens the dialog shown in Fig. 13.
To install the Keras and TensorFlow nodes, you need to select the following:
- KNIME Deep Learning – Keras Integration
- KNIME Deep Learning – TensorFlow Integration

You should now have the Keras and TensorFlow nodes in your node repository, as shown in Fig. 14.

The KNIME Keras Integration and the KNIME TensorFlow Integration depend on an existing Python installation, which requires certain Python dependencies to be installed. The KNIME Deep Learning Integration uses Anaconda to manage Python environments. If you have not yet installed Anaconda (see above), install it now.
Creation of a Conda environment: Next, we need to create an environment with the correct libraries installed. To do so from within KNIME Analytics Platform, select File -> Preferences from the top menu. This opens a new dialog with a list on the left. From that list, select KNIME -> Python Deep Learning.

From this page, you can create Conda environments with the correct packages installed for Keras or TensorFlow 2. For the moment, it is sufficient to set up an environment for Keras. To create and set up a new environment, enable "Use special Deep Learning configuration" and set Keras as the library used for deep learning in Python. Next, enable Conda and provide the path to your Conda installation directory. Then, to create the new Keras environment, click the "New environment…" button in the Keras section.

Since we have a GPU installed, we should create a new GPU environment to benefit from all the power of the graphics card. Now we are ready to work with deep learning models.
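If you want to double-check the new GPU environment outside of KNIME, you can run its Python interpreter and confirm that Keras loads and that the GPU is visible (a minimal sketch; it assumes the environment contains keras and tensorflow, which is what the KNIME deep learning configuration installs):

```python
# Run this with the Python interpreter of the Conda environment KNIME just created.
import keras
import tensorflow as tf

print("Keras version     :", keras.__version__)
print("TensorFlow version:", tf.__version__)
print("GPU available     :", tf.test.is_gpu_available())  # True means the GPU environment works
```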
In my next article, I’ll show you in detail how to create Deep Learning models with just a few clicks. In the meantime, I suggest that you watch this good introductory video about Codeless Deep Learning with KNIME.
I also highly recommend the book Codeless Deep Learning for an easy introduction to the topic.

Tableau Public (installation): To get Tableau Public, you need to sign up for a free profile here. But it is definitely worth it, because you get an online space to share your Tableau dashboards anywhere on the web.

If you are not familiar with Tableau, I recommend going through this tutorial. A good example of what you can do with the combination of KNIME, Jupyter and Tableau is described in my following video and articles:
Data Science With KNIME, Jupyter, and Tableau Using COVID-19 Projections as an Example
SQLite (installation): Now comes the easy part: if you already have KNIME installed, nothing else needs to be installed. You can already create, read and write your own SQLite databases. I would like to show you this with the following KNIME workflow (an equivalent plain-Python sketch follows the steps):
1. I put the path of the Iris data in the CSV Reader node.
2. Then I enter the location of the test.sqlite database in the SQLite Connector node.
3. With the DB Writer node, I write the Iris data set into the SQLite database.
4. Then, in the DB Query Reader node, I query all rows with a sepal.length greater than 5.
5. Finally, I output the result with an Interactive Table node.
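For comparison, the same five steps look like this in plain Python with the built-in sqlite3 module and pandas (a minimal sketch; the file paths and the iris.csv column name sepal.length are assumptions based on the common R-style Iris file):

```python
import sqlite3
import pandas as pd

iris = pd.read_csv("iris.csv")                               # 1. read the Iris data
con = sqlite3.connect("test.sqlite")                         # 2. create/open the database file
iris.to_sql("iris", con, if_exists="replace", index=False)   # 3. write the table

# 4. query all rows with a sepal length greater than 5
result = pd.read_sql_query('SELECT * FROM iris WHERE "sepal.length" > 5', con)

print(result.head())                                         # 5. inspect the result
con.close()
```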

In an upcoming article, I'll show you what else you can do with SQLite in combination with KNIME.
Remote Desktop from other laptops/tablets: To set up your PC for remote connections, you have to upgrade to Windows Pro; then you can turn on Enable Remote Desktop.

Finally, on your Windows, Android, iOS or Mac OS X device, open the Remote Desktop app (available for free from the Microsoft Store, Google Play, and the Mac App Store) and add the name of the PC you want to connect to. Select the remote PC name that you added, and then wait for the connection to complete.
I am now even able to create and edit KNIME workflows on my smartphone (more or less :-)).
Now you should be ready to get into data science and even start building deep learning models. Go ahead and…
Thanks for reading! Please feel free to share your thoughts or reading tips in the comments.
Follow me on Medium, LinkedIn or Twitter and follow my Facebook Group "Data Science with Yodime"