Analytical Hashing Techniques

Spark SQL Functions to Simplify your Life

Scott Haines
Towards Data Science
4 min read · Mar 11, 2021


Photo Credit: https://unsplash.com/@swimstaralex

Anyone working in the field of analytics and machine learning will eventually need to generate strong composite grouping keys and idempotent identifiers for the data they are working with. These strong, deterministic identifiers help to reduce the amount of effort required to do complex bucketing, deduplication, and a slew of other important tasks.

We will look at two ways of generating hashes:

  1. Using Base64 Encoding and String Concatenation
  2. Using Murmur Hashing & Base64 Encoding

Spark SQL Functions

The core Spark SQL functions library is a prebuilt library with over 300 common SQL functions. However, looking at the functions index and simply listing things isn’t as memorable as running the code itself. If you have the spark-shell available, you can follow along and learn some analytical hashing techniques.
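If you do want to browse that index, you can pull it up directly once your shell is running (next section). A small aside, just as a sketch:

// List every function registered with the session, then peek at the
// hashing-related ones (counts vary by Spark version).
spark.sql("SHOW FUNCTIONS").count()
spark.sql("SHOW FUNCTIONS LIKE '*hash*'").show(false)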

Spin up Spark

$SPARK_HOME/bin/spark-shell
Above: my shell environment, the Apache Spark shell running Spark 3.1.1 on Java 11 with Scala 2.12

With the spark-shell up and running, you can follow the next steps by entering :paste in the shell to paste multiline code (type :paste, paste the code, then press ctrl+D to evaluate it).

Import the Libraries and Implicits

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

Create a DataFrame

val schema = new StructType()
  .add(StructField("name", StringType, true))
  .add(StructField("emotion", StringType, true))
  .add(StructField("uuid", IntegerType, true))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(
    Seq(
      Row("happy", "smile", 1),
      Row("angry", "frown", 2)
    )
  ),
  schema
)

At this point you should have a very simple DataFrame to which you can apply the Spark SQL functions. Its contents are shown using df.show().

scala> df.show()
+-----+-------+----+
| name|emotion|uuid|
+-----+-------+----+
|happy|  smile|   1|
|angry|  frown|   2|
+-----+-------+----+

Now we have a simple DataFrame. Next, we can add a base64-encoded column to the DataFrame simply by using the withColumn function and passing in the Spark SQL functions we want to use.

Hashing Strings

Base64 Encoded String Values

val hashed = df.withColumn(
  "hash",
  base64(concat_ws("-", $"name", $"emotion"))
)

The result of this transformation is a new column containing the base64 encoding of the concatenated string values from the columns name and emotion. This breaks down into the following flow.

df.withColumn("concat",
concat_ws("-",$"name",$"emotion"))
.select("concat")
.show
+-----------+
| concat|
+-----------+
|happy-smile|
|angry-frown|
+-----------+

The end result of the full columnar expression is as follows.

scala> hashed.show()
+-----+-------+----+----------------+
| name|emotion|uuid|            hash|
+-----+-------+----+----------------+
|happy|  smile|   1|aGFwcHktc21pbGU=|
|angry|  frown|   2|YW5ncnktZnJvd24=|
+-----+-------+----+----------------+

Nice, right?
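One thing to keep in mind: base64 is an encoding, not a true hash, so the value is fully reversible. A quick sanity check (a minimal sketch against the hashed DataFrame from above) is to decode the column with unbase64:

// Decode the base64 column back into the original concatenated string.
hashed
  .select(unbase64($"hash").cast("string").as("decoded"))
  .show()

which gives you back happy-smile and angry-frown.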

Next, we can look at a stronger technique for hashing. This uses the Murmur3 hashing algorithm, plus explicit binary transformations, before feeding the result into the base64 encoder.

Murmur Hashing and Binary Encoding

There are many ways to generate a hash, and hashing has applications ranging from bucketing to graph traversal. When you want to create strong hash codes you can rely on different hashing techniques, from Cyclic Redundancy Checks (CRC) to the efficient Murmur hash (v3). We will use what we get for free in Spark, which is Murmur3.

hashed.withColumn("binhash",
base64(bin(hash($"name",$"emotion")))
)
.select("uuid", "hash", "binhash")
.show(false)

This returns the following rows, comparing the two hashing methods on the same input data.

+----+----------------+--------------------------------------------+
|uuid|hash |binhash |
+----+----------------+--------------------------------------------+
|1 |aGFwcHktc21pbGU=|MTAxMTEwMDAxMTAwMDAwMTAwMDAwMDEwMTExMDAxMA==|
|2 |YW5ncnktZnJvd24=|MTEwMTAwMDEwMTExMTExMDEwMDAwMDExMDAxMTAxMA==|
+----+----------------+--------------------------------------------+
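Murmur3 is not the only option that ships with Spark. Since CRC came up above, here is a quick side-by-side of crc32, xxhash64, and the cryptographic sha2 on the same columns; a sketch, with the aliases purely illustrative:

// Compare a few of the hash functions that ship with Spark SQL.
df.select(
    $"uuid",
    crc32(concat_ws("-", $"name", $"emotion")).as("crc32"),     // cyclic redundancy check
    xxhash64($"name", $"emotion").as("xxhash64"),                // 64-bit xxHash
    sha2(concat_ws("-", $"name", $"emotion"), 256).as("sha256")  // cryptographic SHA-256
  )
  .show(false)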

Looking at the Spark Code Generation

If you are curious to see how Spark works behind the scenes, there is a great feature of the explain function that lets you view the code Spark generates (and optimizes) for your transformations. To view it, all you need to do is the following.

hashed.withColumn("binhash",
base64(bin(hash($"name",$"emotion")))
)
.select("binhash")
.explain("codegen")

This will output the generated Java code and explain more about your computation.

Above: looking at the Spark code generation, a lot of generated Java code optimized by Spark's Catalyst engine

This code is produced by Spark’s Catalyst optimizer, and luckily there is a high probability you will never have to work at this lower level and can just go on with your life. But if you are writing custom data source readers and writers, this is something you will want to dive into. If nothing else, you can learn more about the underlying mechanics; in the example above, the codegen output shows the Murmur hash library being used. This is a nice tool for debugging and for those who just want a 360-degree view of what is going on.
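Codegen is only one of the explain modes available in Spark 3.x; if the generated Java is more detail than you need, the other modes give lighter views of the same plan. A minimal sketch:

// Other modes accepted by explain(mode) in Spark 3.x:
// "simple", "extended", "codegen", "cost", "formatted"
hashed.select("hash").explain("formatted")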

Summary

You now have two more techniques that you can use to create strong composite keys, or as a springboard for creating idempotent keys. I just thought it would be fun to share these techniques, since they come in handy and reuse the core libraries that ship alongside Spark. Happy Trails.
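As a parting sketch of the idempotent-key idea (my own illustration, not something covered above): hash the business columns into a row key, and duplicates collapse for free when the same input is re-processed.

// Hypothetical example: a deterministic row key built from business columns.
val keyed = df.withColumn(
  "row_key",
  sha2(concat_ws("-", $"name", $"emotion"), 256)
)

// Re-running the pipeline over the same input produces the same row_key,
// so dropDuplicates gives idempotent, deduplicated output.
keyed.dropDuplicates("row_key").show(false)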


