A Data Scientist’s Guide to Data Structures & Algorithms, Part 2

Paulina Zheng
Towards Data Science
12 min read · Sep 6, 2018


In my last post, I described Big O notation, why it matters, and common search and sort algorithms and their time complexity (essentially, how fast a given algorithm will run as data size changes). Now, with the basics down, we can begin to discuss data structures, space complexity, and more complex graph algorithms.

Space

Previously, I used Big O notation to describe time complexity for some common search and sort algorithms. Big O is also used to describe space complexity. After all, time is not the only limited resource at our disposal. Any computer has only so much space (also known as memory) and any given algorithm can also be characterized by the amount of working storage that it requires. One can imagine that as our dataset of interest grows, space becomes more of a concern.

When working with data, selecting the right data structure allows us to optimize both time and space, so that our pipeline runs smoothly enough to let us draw conclusions.

Data Structures

Data can be stored in different ways; the manner of storage is truly context-dependent.

Let’s begin with more abstract data structures that are defined by their ordering behavior. These data structures, known as the stack and queue, may not have a formal implementation in many languages, though they can be and often are manually implemented.

Their importance primarily lies in explaining different ways that data can be managed and memory can be accessed. As data size increases, the selection of a data structure becomes increasingly important; the big O times associated with various operations depend on the data structure. Stacks and queues, as concepts, inform how fundamental data structures (like lists and arrays) are organized. As a practical matter, stacks and queues are also often incorporated in many algorithms and integral to how such algorithms work.

Stack

A stack follows Last-In-First-Out (LIFO) ordering: the last item added to a stack will always be the first to be removed. To understand how a stack works, imagine a hypothetical stack of plates.

You always add a new plate to the top (or end) of the stack. When you want a plate, you will grab the top-most (or most recently added) plate from the stack. The last plate added (‘pushed’) to the stack is the first to be removed (‘popped’).

You might recognize this ‘push’ and ‘pop’ terminology from dealing with lists (and indeed, we will be addressing the list (or array) as a data structure later in this post).
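
In Python, a list can serve directly as a stack; a minimal sketch:

```python
# A Python list works as a stack: append() pushes an item onto the
# top, pop() removes the most recently added item (LIFO).
stack = []
stack.append('plate 1')   # push
stack.append('plate 2')
stack.append('plate 3')

top = stack.pop()          # pop: returns 'plate 3', the last plate added
print(top)                 # plate 3
print(stack)               # ['plate 1', 'plate 2']
```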

visual representation of a stack

Queue

A queue follows First-In-First-Out (FIFO) ordering: the first item added to a queue will always be the first to be removed. This is best illustrated with an actual queue of people.

The first person to enter the queue will also be the first person to receive the desired service, right? This is exactly how queues work as data structures. New items are still pushed to the end of the queue, but the first item added is the first to be popped.
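
A minimal sketch in Python, using collections.deque (popping from the front of a plain list would be O(n)):

```python
from collections import deque

# deque supports O(1) appends and pops at both ends, so it makes
# an efficient FIFO queue.
queue = deque()
queue.append('first person')    # join the back of the queue
queue.append('second person')
queue.append('third person')

print(queue.popleft())          # first person -- first in, first out
print(queue)                    # deque(['second person', 'third person'])
```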

Now that we understand stacks and queues, we can describe concrete data structure implementations that we would typically encounter.

Concrete Data Structures

How do we reconcile these abstract structures, the stack and queue, with data structures we encounter regularly?

Array
An array is a linear collection of items (known as elements), stored contiguously in memory. Any given element can be accessed using a numerical index that points to the element's location in the array. In native Python, this is implemented with the list type.
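
For example:

```python
# Index-based access into a Python list is O(1): the index points
# directly at the element's position.
elements = [10, 20, 30, 40]
print(elements[2])   # 30 -- constant-time access by index
```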

Hash Tables
In a hash table, unique keys are mapped to values. In native Python, the dictionary is the implementation of a hash table.
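
For example:

```python
# A dictionary maps unique keys to values; lookup, insertion, and
# deletion all run in O(1) on average.
ages = {'Bob': 34, 'Sue': 29}
ages['Joe'] = 41       # insertion
print(ages['Sue'])     # 29 -- lookup by key
del ages['Bob']        # deletion
```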

Graphs

As an abstract data structure, a graph comprises nodes and edges, which model connections. Think of how one would model a social network:

How are Bob, Sue, and Joe connected?

For any given node, its neighbors constitute all the nodes it's connected to. It is important to note that the above graph is an example of a directed graph (the edges are arrows, indicating the one-way direction of each connection). Because Sue is connected to Joe (the connection goes from Sue to Joe), Joe is Sue's neighbor. Sue is not Bob's neighbor, however. The relationship only goes one way.

For undirected graphs, the edges are not arrows and relationships go both ways.

In such an undirected graph, Joe and Bob would both be Sue’s neighbors.
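
Concretely, a graph is often stored as a hash table mapping each node to a list of its neighbors (an adjacency list). The original figure isn't reproduced here, so the edge directions below (Sue to Joe, Sue to Bob) are one reading consistent with the description:

```python
# An adjacency list: each node maps to the nodes its arrows point to.
graph = {
    'Sue': ['Joe', 'Bob'],   # assumed directions: Sue -> Joe, Sue -> Bob
    'Bob': [],
    'Joe': [],
}

print(graph['Sue'])   # ['Joe', 'Bob'] -- Sue's neighbors
print(graph['Bob'])   # [] -- Sue is not Bob's neighbor
```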

Trees
A tree is a special type of graph, characterized by a 'hierarchical' structure. Those familiar with decision trees will already be aware of a tree's general structure.

every tree has a single root node from which all other nodes stem. each ‘child’ node will only have one parent node.

Every tree has a single root node, which essentially serves as the 'origin' from which all other nodes stem. There will only ever be one path between any two nodes (whereas in a general graph, there can be multiple paths between two nodes).

Because of the characteristics that dictate their respective structures, a graph is typically used to model a network and a tree is used to model a hierarchy. There are many other differences between graphs and trees, but this is one of the most important.

However, the same algorithms operate on both graphs and trees (although implementation differs due to structural differences). Because graphs and trees are used to model relationships through nodes and edges, such algorithms are intended to ‘search’ along these edges to characterize the distance (and degree of connection) between any given nodes.

(More advanced trees include binary search trees and red-black trees.)

Graph/Tree Traversal

There are different ways to traverse (search) these structures such that we can find a node of interest from a starting node.

Why is this useful? Well, in a practical sense, you can do things such as:

  • Find the quickest route to class
  • Find the shortest path to winning chess
  • Find the path that leads to the exit in a ‘maze’
  • Find the closest X-company connection in your LinkedIn network

As evident from these examples, we may model many problems as graphs and then solve such problems using graph search algorithms. We use these algorithms to assess (1) whether a path exists between the nodes of interest and/or (2) the shortest path between these nodes.

These algorithms are most easily explained using a tree (due to its hierarchical nature). However, this all also applies to a graph.

Breadth-First Search (BFS)
BFS starts at the root node (or some arbitrary node for a graph) and checks all the nodes at the first 'level' or depth before moving on to the next depth. BFS is typically used to assess whether a path exists between two given nodes and, if so, to find the shortest such path.

numbers correspond with the search order

How would the implementation work? Consider the behavior that BFS requires: the first item (the root) is also the first to be searched (and removed once checked). This is first-in-first-out (FIFO) behavior, so a queue is used to sequentially check nodes until the target is found.

This may be confusing so let’s describe this in higher-level terms.

We use a graph (or tree) structure to describe relationships. Because a graph is an abstract data structure, it must be manually implemented in code using a concrete data structure such as a hash map (mapping each node to its neighboring nodes). Now we need to implement our algorithm of interest (BFS) on our graph. Because BFS is characterized by FIFO behavior, we use the queue structure to manually implement BFS.

In the abstract, we can say the tree structure is used to describe relationships. The stack and queue structures are used to describe behavior.

How would BFS be useful in an everyday context?

You’re interested in company X. You want to find someone who works at company X in your network. You’d prefer to find someone who is ‘closer’ to you; a first-degree connection is much more preferable than a fourth-degree connection. This is perfect for a BFS. You first search your first-degree connections (added first) before you search your second-degree connections (their friends) and so on.
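
As a rough sketch, here is how that might look in Python, with the graph stored as a hash table of neighbor lists and collections.deque serving as the queue (the network below is hypothetical):

```python
from collections import deque

def bfs(graph, start, target):
    """Return True if a path exists from start to target."""
    queue = deque([start])
    visited = {start}                # avoid re-checking nodes (matters for cyclic graphs)
    while queue:
        node = queue.popleft()       # FIFO: closest nodes are checked first
        if node == target:
            return True
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return False

# Hypothetical network: you -> first-degree -> second-degree connections
network = {
    'you': ['alice', 'bob'],
    'alice': ['claire'],
    'bob': [],
    'claire': [],
}
print(bfs(network, 'you', 'claire'))  # True -- found via a second-degree connection
```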

Depth-First Search (DFS)
DFS starts at the root node (or some random node for a graph) and searches along a single branch all the way down before backtracking.

numbers correspond with the search order

In contrast to BFS, which is implemented with a queue, DFS is implemented with a stack. The root node is always checked first (in both BFS and DFS). If it's not the node of interest, its children are added to the top of the stack:

  • Nodes 8, 7, and 2 are added to the stack in that order. At this point, node 2 is at the top of the stack
  • The top of the stack (node 2) is checked. If it’s not the node of interest, its children, nodes 6 and 3, are added to the top of the stack in that order. At this point, node 3 is at the top of the stack.
  • The top of the stack (node 3) is checked. If it’s not the node of interest, its children, nodes 5 and 4, are added to the top of the stack in that order. At this point, node 4 is at the top of the stack.
  • Node 4 is checked. It has no children. If it’s not the node of interest, the next node (node 5) is checked.

And so on…
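
A minimal sketch of the stack-based approach in Python, using the tree from the walkthrough (the figure isn't reproduced here, so nodes 7 and 8 are assumed to be leaves):

```python
def dfs(graph, start, target):
    """Return True if target is reachable from start, searching depth-first."""
    stack = [start]
    visited = set()
    while stack:
        node = stack.pop()       # LIFO: the most recently added node is checked next
        if node == target:
            return True
        if node in visited:
            continue
        visited.add(node)
        # Children are pushed in reverse so the leftmost child ends up
        # on top of the stack (matching the walkthrough: 8, 7, then 2).
        for child in reversed(graph[node]):
            stack.append(child)
    return False

# The tree from the walkthrough: node 1 is the root with children 2, 7, 8.
tree = {
    1: [2, 7, 8],
    2: [3, 6],
    3: [4, 5],
    4: [], 5: [], 6: [], 7: [], 8: [],
}
print(dfs(tree, 1, 5))  # True -- reached by going deep along 1 -> 2 -> 3
```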

DFS is typically used as a building block in more complex algorithms (cycle detection and topological sorting, for example).

Unlike trees, graphs may be cyclic (a node can have more than one connection, so there is a risk of returning to the same node), which is why traversal has to keep track of visited nodes. These algorithms become more complicated when used with a graph structure. I recommend the Stanford algorithms course for further information.

BFS vs. DFS

The use of BFS and DFS (and their associated run times) truly varies depending on the data and the graph/tree structure.

Time complexity is the same for both algorithms: every node is visited exactly once, so the big-O time is O(n), where n is the number of nodes in the tree.

However, the space complexity for these algorithms varies.

For BFS, which traverses all nodes at a given depth in the tree and uses a queue implementation, the width of the tree matters. The space complexity for BFS is O(w) where w is the maximum width of the tree.

For DFS, which goes along a single ‘branch’ all the way down and uses a stack implementation, the height of the tree matters. The space complexity for DFS is O(h) where h is the maximum height of the tree.

Dijkstra’s Algorithm

With BFS, we assumed the tree was unweighted, so the shortest path was defined simply as the path with the fewest edges.

Like BFS, Dijkstra's algorithm also seeks to find the shortest path between nodes, but it operates on weighted graphs, where the edges have different weights, or some cost (such as time or distance). Note that Dijkstra's algorithm requires all edge weights to be non-negative.

Because edges will have different weights, the shortest path between any two given nodes may not necessarily be the path with the fewest segments. Dijkstra’s algorithm identifies the path with the smallest total weight.

Process
1. From the starting node, find and visit the ‘cheapest’ node

assume cost is in minutes. from start, go to node A (has the lowest cost, 3 versus 10)

Time to node A: 3
Time to node B: 10
Time to end: infinity (unknown)

2. At this new node, update the costs of its neighboring nodes: for each neighbor, calculate the cost of reaching it from the starting node through the current node.

2a. If this route is cheaper than a neighbor's previously recorded cost, update that cost (cheaper routes may keep turning up until every node has been visited).

time to node B is updated from 10 to 7 minutes (fastest path is through node A)

Time to node A: 3
Time to node B: 7
Time to end: 8

3. Repeat for every node in the graph.

focus on node B now

No weights need to be updated.

Time to node A: 3
Time to node B: 7
Time to end: 8

4. Calculate the final, lowest-cost path.

The shortest path remains through node A.
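
A minimal sketch in Python using a priority queue (heapq). The node names (start, A, B, end) and weights are reconstructed from the walkthrough; the B-to-end weight isn't stated above, so a weight of 1 is assumed (any weight of at least 1 leaves the result unchanged):

```python
import heapq

def dijkstra(graph, start, target):
    """Return the lowest total edge weight from start to target.

    graph: hash table mapping each node to a dict of {neighbor: weight}.
    Assumes all weights are non-negative.
    """
    costs = {start: 0}
    heap = [(0, start)]                   # (cost so far, node), cheapest first
    while heap:
        cost, node = heapq.heappop(heap)  # visit the cheapest unprocessed node
        if node == target:
            return cost
        if cost > costs.get(node, float('inf')):
            continue                      # stale entry; a cheaper path was already found
        for neighbor, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < costs.get(neighbor, float('inf')):
                costs[neighbor] = new_cost        # cheaper route found: update
                heapq.heappush(heap, (new_cost, neighbor))
    return float('inf')

graph = {
    'start': {'A': 3, 'B': 10},
    'A': {'B': 4, 'end': 5},
    'B': {'end': 1},    # assumed weight; not stated in the walkthrough
    'end': {},
}
print(dijkstra(graph, 'start', 'end'))  # 8 -- the path through node A
```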

Big O Times for Data Structures

For any given data structure, there are different big-O run times associated with access, search, insertion, and deletion.

For example, accessing an array element is very quick (O(1)) because the elements are arranged contiguously in memory and can be reached directly by index. But that same contiguous arrangement makes deletion much worse (O(n)): if any given element is deleted, all subsequent elements have to be shifted accordingly.

source: http://bigocheatsheet.com/

With the basics of data structures down, I thought it might be helpful to briefly discuss two topics that are frequently used when discussing algorithms.

Greedy Algorithms

A greedy algorithm is a quick-and-dirty way of finding some maximum or minimum by making the locally best choice at each step. It is used to describe common algorithms such as decision tree learning (at each split, the decision is made based on how much entropy is reduced).

The knapsack problem is a common way to describe how greedy algorithms work.

Say you have a knapsack, one that can only hold 20 pounds. You have several items before you, each with a different weight and value. You want to maximize the total value of items you’re putting in the knapsack, without exceeding the maximum weight that the knapsack can carry. (Assume size isn’t an issue.)

  • Laptop: $2200 (3 lbs)
  • Monitor: $2500 (18 lbs)
  • Speakers: $1500 (15 lbs)

On the first step, you reach for the highest-value item, the monitor. On the second step, you would reach for the next highest-value item, the laptop. However, the knapsack can only hold 20 pounds so now there’s no more room for the laptop or any other items. You have $2500 worth of items.

The optimal solution would have actually been to take the laptop and speakers. You would not have exceeded the maximum weight of 20 pounds and would have had $3700 worth of items.

The knapsack problem illustrates the process of a greedy algorithm and its flaws. It’s not perfect and certainly, you’re not guaranteed the optimal solution with a greedy algorithm. But it’s close enough to work and sometimes, that’s all you need. In the end, you still have $2500 worth of items in your bag.
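
A quick sketch of this greedy pass in Python, using the weights and values above:

```python
# Greedy knapsack: always grab the highest-value item that still fits.
# Values in dollars, weights in lbs.
items = [('laptop', 2200, 3), ('monitor', 2500, 18), ('speakers', 1500, 15)]
capacity = 20

chosen, total_value = [], 0
for name, value, weight in sorted(items, key=lambda i: i[1], reverse=True):
    if weight <= capacity:
        chosen.append(name)
        capacity -= weight
        total_value += value

print(chosen, total_value)  # ['monitor'] 2500 -- not the optimal $3700
```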

Dynamic Programming

As opposed to a greedy algorithm, dynamic programming is an exhaustive method of problem solving that breaks a larger problem down into sub-problems and solves those sub-problems. It is guaranteed to find the optimal solution. At each step, decisions are made based on prior decisions, and prior steps may be reconsidered (see how Dijkstra's algorithm above keeps updating path costs as cheaper routes are found).

It is different from the divide & conquer (D&C) method, which also involves sub-problems: D&C recursively solves independent sub-problems and then combines those solutions to solve the larger problem. Dynamic programming solves overlapping sub-problems (consider how the cost of reaching any given node depends on the prior nodes visited). Each sub-problem only needs to be solved once because the solution to each sub-problem is stored in a table such as an array or hash table (a technique known as memoization) for future reference.
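
The classic small illustration of memoization is the Fibonacci sequence, where fib(n) breaks into the overlapping sub-problems fib(n-1) and fib(n-2); a minimal sketch:

```python
# Memoization stores each sub-problem's solution in a hash table,
# so every fib(k) is computed only once.
memo = {}

def fib(n):
    if n < 2:
        return n
    if n not in memo:                     # solve each sub-problem only once
        memo[n] = fib(n - 1) + fib(n - 2)
    return memo[n]

print(fib(50))  # 12586269025 -- instant; the naive recursion would take ages
```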

I hope that it is clear why understanding data structures and algorithms is so integral to successful work in data science. Certainly, there is much more beyond what was covered in these blog posts, and I encourage you to seek out additional resources to truly enrich your understanding of this topic; the resources I recommend are listed in my prior blog post.
