Scala Spark Machine Learning

Data Science and Machine Learning with Scala and Spark (Episode 01/03)

Machine Learning at Scale with Spark Distributed Computation

MA Raza, Ph.D.
Towards Data Science
4 min read · Jun 20, 2020


Learning Scala: Scala logo downloaded from Google Images

Scala is the language in which Apache Spark, the most popular distributed big data processing framework, is written. Big data processing is becoming unavoidable for enterprises of every size, and the amount of data being produced grows every second.

Extracting valuable insights from data requires state-of-the-art processing tools and frameworks. Researchers are working hard to develop application-specific tools, programming languages, and frameworks. Scala is one of those inventions, and it is well suited to big data processing.

The Scala programming language is built to implement scalable solutions that crunch big data to produce actionable insights.

In this series, I will develop a set of tutorials in Google Colab to learn Scala and apply it to data science and machine learning applications. Python is considered the most common programming language for data science and machine learning. However, when it comes to processing large data, data scientists should know alternatives such as Scala.

Who should follow these tutorials?

If you are a data scientist who works mainly in Python and want to add another big data processing tool to your toolkit, this series is for you.

I have structured the series into three parts. In this story, I will cover the basics of the Scala programming language. We will cover Spark with Scala in the next article and finish the series with machine learning with Spark and Scala. Each story comes with an associated Google Colab notebook containing explanations and practice code.

In this first story, you will learn how to set up Scala in Google Colab, along with the most important data types, Scala expressions, Scala functions, and Scala classes and objects.

Scala Data Types

In Python you never declare the type of a variable. Scala, by contrast, is statically typed: every variable has a type that is fixed at compile time, although the compiler can often infer it so that you do not have to write it out. It therefore makes sense to understand and practice the data types implemented in Scala. Below are the most basic ones. You can practice with them in Google Colab.

  1. Scala Data Types: Int, Char, Float, Double, Long, Short, Boolean
  2. Scala Collections: Arrays, Vectors
  3. Scala Maps: Key-value pairs
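As a quick illustration of the three groups above, here is a minimal sketch that can be pasted into a notebook cell or the Scala REPL. The variable names and values are made up for this example and are not taken from the accompanying notebook.

```scala
// Basic value types; the annotations are optional because Scala can infer them
val count: Int = 42
val big: Long = 3000000000L       // too large for Int, hence the L suffix
val small: Short = 7
val ratio: Float = 0.5f           // Float literals need the f suffix
val precise: Double = 3.14159
val initial: Char = 'S'
val ready: Boolean = true

// Collections: Array is mutable and fixed-size, Vector is immutable
val scores: Array[Int] = Array(90, 85, 70)
scores(0) = 95                    // in-place update of the first element
val names: Vector[String] = Vector("Ada", "Grace")
val moreNames = names :+ "Linus"  // returns a new Vector; names is unchanged

// Map: key-value pairs
val ages: Map[String, Int] = Map("Ada" -> 36, "Grace" -> 45)
val adaAge = ages("Ada")
val unknownAge = ages.getOrElse("Nobody", -1)  // default when the key is missing
```

Note that appending to a Vector produces a new collection, whereas updating an Array changes it in place. That difference matters later when the same data is shared across parallel tasks.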

Scala Expressions

As in other programming languages, Scala has numerical, Boolean, and logical expressions. Some common, basic expressions are included in the Google Colab notebook.

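// Syntax templates (not runnable as written):
//   val <identifier>[: <type>] = <expression>   // immutable value
//   var <identifier>[: <type>] = <expression>   // mutable variable
// A single-line comment starts with two slashes
println("Hello Scala")
// Sample expression; x is inferred to be a Double
var x = 2 * math.sqrt(10) / 5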
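To make the three kinds of expression concrete, the following is a small sketch with hypothetical values `a` and `b`. It also shows that `if`/`else` in Scala is itself an expression that yields a value, and how `val` differs from `var`.

```scala
val a = 7
val b = 3

// Numerical expressions
val sum = a + b              // 10
val quotient = a / b         // 2, because integer division truncates
val exact = a.toDouble / b   // converts to Double before dividing

// Boolean and logical expressions
val isBigger = a > b                 // true
val inRange = a > 0 && a < 5         // false
val either = a == 7 || b == 7        // true

// if/else is an expression and produces a value
val label = if (a % 2 == 0) "even" else "odd"   // "odd"

// A val cannot be reassigned; a var can
var counter = 0
counter += 1
```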

Scala Functions

Scala is an object-oriented programming language that also supports functional programming, and functions are written in much the same style as in other languages. Below is the syntax for defining a function in Scala. You can practice with real examples in the Google Colab notebook accompanying this blog.

def <identifier>(<identifier>: <type>[, ... ]): <type> = <expression>

An example function that multiplies two integers and returns an integer is given below.

def multiplier(x: Int, y: Int): Int = { x * y }
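A short usage sketch follows. It repeats `multiplier` from the text so that the cell runs on its own; `square`, `addOne`, and `squaresPlusOne` are hypothetical names added here for illustration.

```scala
// The function from the text
def multiplier(x: Int, y: Int): Int = { x * y }

val product = multiplier(6, 7)   // 42

// Braces are optional when the body is a single expression
def square(x: Int): Int = x * x

// An anonymous function (function literal) stored in a val
val addOne: Int => Int = n => n + 1

// Functions can be passed to collection methods such as map
val squaresPlusOne = Vector(1, 2, 3).map(square).map(addOne)   // Vector(2, 5, 10)
```

Passing functions to `map` in this way is the same pattern used later for parallel processing and for Spark transformations.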

Scala Classes and Objects

Classes are very helpful in object-oriented programming, and that holds for Scala too. When working with big data and complex code structures, classes make things simpler for Scala programmers. A minimal class example is below.

class User

val user1 = new User

The "Classes" page of the Tour of Scala is a quick guide to writing classes in Scala. More examples are given in the Google Colab notebook.
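Going one step beyond the minimal class, here is a sketch of a class with constructor parameters and methods, plus a singleton `object`. The names `Person` and `PersonStats` are hypothetical and chosen so they do not clash with the `User` class above.

```scala
// A class with constructor parameters; val makes them public read-only fields
class Person(val name: String, val age: Int) {
  // A method available on every instance
  def greet(): String = s"Hi, I am $name"

  // Returns a new Person instead of mutating this one
  def birthday(): Person = new Person(name, age + 1)
}

// A singleton object: exactly one instance, no `new` required
object PersonStats {
  def averageAge(people: Seq[Person]): Double =
    if (people.isEmpty) 0.0 else people.map(_.age).sum.toDouble / people.size
}

val ada = new Person("Ada", 36)
val grace = new Person("Grace", 44)
val greeting = ada.greet()                            // "Hi, I am Ada"
val older = ada.birthday().age                        // 37
val mean = PersonStats.averageAge(Seq(ada, grace))    // 40.0
```

An `object` is a convenient home for helper functions that do not belong to any single instance, much like module-level functions in Python.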

Import Packages

Like other languages, Scala lets you import native as well as third-party packages. The syntax is much the same as in Java and Python.

import Array._
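The most common import forms can be sketched as follows, using only the Scala and Java standard libraries. The values are made up for illustration.

```scala
// Import a single class
import scala.collection.mutable.ArrayBuffer
// Import several members from one package
import scala.math.{sqrt, pow}
// Java classes are imported the same way
import java.time.LocalDate
// Rename on import to avoid a clash with Scala's own List
import java.util.{List => JList, ArrayList}

val buffer = ArrayBuffer(1, 2, 3)
buffer += 4                                    // ArrayBuffer(1, 2, 3, 4)

val hypotenuse = sqrt(pow(3, 2) + pow(4, 2))   // 5.0

val published = LocalDate.of(2020, 6, 20)
val year = published.getYear                   // 2020

val javaList: JList[String] = new ArrayList[String]()
javaList.add("spark")
```

Because Scala runs on the JVM, any Java library on the classpath can be imported like this, which is how Spark's Java dependencies are used from Scala code.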

Parallel Processing in Scala

One of Scala's main strengths is working with large data sets. To that end, Scala offers a clean way to process data in parallel with minimal code changes. If you are coming from Python, you may have found multiprocessing hard to set up and prone to unexpected results and failures. Scala has much better support for parallelism. Among other operations, filter and map are the most commonly used when processing in parallel. Examples are included in the accompanying notebook.
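As a rough sketch of parallel filter and map, the example below splits a collection into chunks and processes each chunk in its own `Future` from the standard library. This is an assumption on my part rather than the notebook's approach: parallel collections (`.par`) are the more direct fit for this style, but since Scala 2.13 they ship in a separate module, so `Future` is used here to keep the cell self-contained. The names `ParallelSketch`, `squaresOfEvens`, and `chunkSize` are hypothetical.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSketch {
  // Keeps the even numbers and squares them, one Future per chunk
  def squaresOfEvens(numbers: Vector[Int], chunkSize: Int): Vector[Int] = {
    val partials: Vector[Future[Vector[Int]]] =
      numbers.grouped(chunkSize).toVector.map { chunk =>
        Future(chunk.filter(_ % 2 == 0).map(n => n * n))
      }
    // Future.sequence keeps the chunks in order, so the result
    // matches what the sequential filter/map would produce
    Await.result(Future.sequence(partials), 30.seconds).flatten
  }
}

val numbers = (1 to 1000).toVector
val parallel = ParallelSketch.squaresOfEvens(numbers, chunkSize = 250)
val sequential = numbers.filter(_ % 2 == 0).map(n => n * n)
```

The filter and map calls inside each chunk are identical to the sequential version; only the chunking and the `Future` wrapper are new. For a workload this small the sequential version will likely be faster, so treat this purely as a demonstration of the pattern.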

Google Colab Notebook

I have prepared a working Google Colab notebook. Feel free to use it for practice.

Conclusion

In this episode, we have learned the basics of Scala and covered the following key concepts with exercises.

  • Running Scala in Google Colab
  • Basic Scala Data Types
  • Scala Functions and Classes
  • Parallel Processing in Scala

In the next episode, we will learn about Spark with Scala using Google Colab.

References and Further Reading

Machine Learning with Scala in Google Colaboratory

Scala Docs: https://docs.scala-lang.org/tour/tour-of-scala.html

https://www.lynda.com/Scala-tutorials/Scala-Essential-Training-Data-Science/559182-2.html
