The world’s leading publication for data science, AI, and ML professionals.

How Data Science Helped Sherlock Holmes Find a Murderer

Solve the "Who Killed the Duke of Densmore" mystery using graph theory, constraint programming, and mixed-integer linear programming

Photo by Volodymyr Hryshchenko on Unsplash
Photo by Volodymyr Hryshchenko on Unsplash

Inventory management, portfolio optimization, machine scheduling, vehicle routing, and many other real-life problems are excellent examples of how Data Science and analytical techniques can be employed. It’s no surprise that these are some of the first problems we learn in university. However, I’m fascinated by the numerous other problems that can be represented as an abstract entity called model, which can be a graph, a schedule, or a set of mathematical equations and solved utilizing appropriate techniques. Once you acquire modeling skills, it is straightforward to apply these techniques to both a supply-chain problem and a simple puzzle.

In this article, I will demonstrate how the mystery of the Duke of Densmore’s murder can be solved using Graph Theory, Constraint Programming, and Mixed-Integer Linear Programming.


The book "Who Killed the Duke of Densmore" by Claude Berge is a mystery story that focuses on identifying the person who planted a bomb in Densmore Castle on the Isle of White, leading to the Duke’s death. Despite the passage of ten years, the case remained unsolved until Sherlock Holmes and Watson decided to delve into the matter.

When the Duke of Densmore met his demise due to a bomb explosion at his castle on the Isle of White, his six ex-wives – Ann, Betty, Charlotte, Edith, Felicia, Georgia, and Helen – became the primary suspects. Complicating matters was the destruction of the Duke’s will in the blast, rumored to be harsh against one of the former spouses. However, it was revealed that all six women had received invitations to the castle before the tragedy occurred.

All of his ex-wives swore they had been to the castle only once. They also did not remember how long they stayed there but knew who they met during the visit.

  • Ann met Betty, Charlotte, Felicia and Georgia.
  • Betty met Ann, Charlotte, Edith, Felicia and Helen.
  • Charlotte met Ann, Betty and Edith.
  • Edith met Betty, Charlotte and Felicia.
  • Felicia met Ann, Betty, Edith and Helen.
  • Georgia met Ann and Helen.
  • Helen met Betty, Felicia and Georgia.

Based on the information provided, Holmes and Watson believe that the killer had visited the castle on multiple occasions, which contradicts the ex-wives’ claims of only having visited once. The detectives suspects that one of the former spouses is be lying about the number of visits she made to the castle. Identifying which of the ex-wives lied about the frequency of their visits is the key to solving the mystery.

I’ll show three ways to solve this puzzle: using Graph Theory, Constraint Programming and Mixed-Integer Linear Programming.

Graph Theory

This is how Berge ultimately solved the mystery, and it appears to be the most straightforward approach. Each of the ex-wives had a specific duration of stay at the castle, with a defined arrival and departure time. Although the exact duration of their visits and the order in which they arrived and departed remain unknown, we know from the interrogation that some of their visits overlapped.

For the sake of simplicity, let’s focus on Helen and her interactions with Betty, Felicia, and Georgia during their respective visits to the castle. To illustrate their visits, we can use horizontal lines, with the endpoints representing their arrival and departure times, as shown below:

If two intervals overlap, it means that the corresponding individuals were at the castle simultaneously and thus must have met. We can represent the intervals as a graph, where each node corresponds to a person. If two intervals intersect, we draw an edge between the corresponding nodes. In the case of Helen, Betty, Felicia, and Georgia, the resulting graph looks like the one below:

This type of graph is called an Interval Graph, which can be used to model any set of intervals. However, not all graphs can be represented as intervals. For instance, consider the graph below:

It cannot be decomposed into intervals, as it is impossible to find four intervals, such that A-B, B-C, C-D, and D-A intersect. This graph is called C4. If a graph has C4 as a subgraph, certainly it is not an interval graph.

Therefore, if the graph is not an interval graph, we know that at least one person has lied about their visit to the castle. We can use this fact to identify the true culprit by constructing the graph from the interrogation and checking whether it is an interval graph or not.

Let’s use library networkx in python, and create the graph:

people = ['Ann', 'Betty', 'Charlotte', 'Edith', 'Felicia', 'Georgia', 'Helen']
people_met = {'Ann': ['Betty', 'Charlotte', 'Felicia', 'Georgia'],
              'Betty': ['Ann', 'Charlotte', 'Edith', 'Felicia', 'Helen'],
              'Charlotte': ['Ann', 'Betty', 'Edith'],
              'Edith': ['Betty', 'Charlotte', 'Felicia'],
              'Felicia': ['Ann', 'Betty', 'Edith', 'Helen'],
              'Georgia': ['Ann', 'Helen'],
              'Helen': ['Betty', 'Felicia', 'Georgia']}

def create_graph(plot_graph=False):
    """
    Generate graph based on interrogation, and draw it it plot_graph = True
    :param plot_graph: True if graph is plotted
    :return: Graph
    """
    G = nx.Graph()
    for person in people:
        G.add_node(person)
    for p1 in people_met.keys():
        for p2 in people_met[p1]:
            G.add_edge(p1, p2)
    if plot_graph:
        nx.draw_spectral(G, node_size=500, node_color=['red', 'green', 'pink', 'brown', 'yellow', 'orange', 'skyblue'],
                         with_labels=True)
        plt.show()
    return G

Now that graph is defined, two following steps are required:

  • Generate all the simple cycles of the graph, using _simplecycles function of networkx (see doc here). All elementary cycles of all lengths will be generated.
  • For all cycles with length equal to 4, generate the subgraph using subgraph function. And check if it is not _chordal_. In other words, check if the cycle is a C4 – a cycle of length 4 where all nodes have degree equal to 2 (see here).
def check_cycles(G):
    """
    Print all the chordless cycles with length = 4
    :param G: Input graph
    """
    list_cycles = list(sorted(nx.simple_cycles(G.to_directed())))
    for l in list_cycles:
        if len(l) == 4:
            H = G.subgraph(l)
            if not nx.is_chordal(H):
                print(l, len(l))

G = create_graph(False)
check_cycles(G)

The C4 subgraphs are:

  • Ann, Charlotte, Edith, Felicia
  • Ann, Felicia, Edith, Charlotte
  • Georgia, Ann, Betty, Helen
  • Georgia, Ann, Felicia, Helen
  • Georgia, Helen, Betty, Ann
  • Georgia, Helen, Felicia, Ann

Ann is present in all cycles, if she is eliminated from the set, an interval graph can be generated perfectly, indicating that she may be lying and could be deemed as the killer.

Constraint Programming

Constraint Programming (CP) is a problem-solving approach that addresses combinatorial and logical problems by focusing on finding feasible solutions that satisfy given constraints. CP algorithms achieve convergence by progressively reducing the set of feasible solutions through the addition of new constraints. Given its ability to handle complex logical statements and constraints, CP is particularly effective in tackling problems like the one at hand.

To solve our problem, we can use a CP model, which involves defining N as the set of people, _si as the arrival time of person i at the castle, _di as the duration of their visit, and _ei as the end of their visit. We can also introduce a logic variable _vi, which is TRUE if person i cannot be included in the visit schedule, and FALSE otherwise. Additionally, _meti is the set of people that person i met during their visit. The integer variables s, d, and e have a high upper bound (1000) and must be non-negative.

We’ll use the _ortools_ library to solve this problem:

from ortools.sat.python import cp_model
import pandas as pd
from matplotlib import pyplot as plt

people = ['Ann', 'Betty', 'Charlotte', 'Edith', 'Felicia', 'Georgia', 'Helen']
people_met = {'Ann': ['Betty', 'Charlotte', 'Felicia', 'Georgia'],
           'Betty': ['Ann', 'Charlotte', 'Edith', 'Felicia', 'Helen'],
           'Charlotte': ['Ann', 'Betty', 'Edith'],
           'Edith': ['Betty', 'Charlotte', 'Felicia'],
           'Felicia': ['Ann', 'Betty', 'Edith', 'Helen'],
           'Georgia': ['Ann', 'Helen'],
           'Helen': ['Betty', 'Felicia', 'Georgia']}

model = cp_model.CpModel()
num_vals = 1000

# Arrival time of person i in the castle
s = {i: model.NewIntVar(0, num_vals, 'start_' + i) for i in people}
# Duration of the visit of person i in the castle
e = {i: model.NewIntVar(0, num_vals, 'end_' + i) for i in people}
# Departure time of person i from the castle
d = {i: model.NewIntVar(0, num_vals, 'duration_' + i) for i in people}
# Logic variable that is True (1) is person cannot be included in the visit schedule. False otherwise
v = {i: model.NewBoolVar('v_' + i) for i in people}

The initial step is to determine if two individuals overlap or not, which means verifying if the intervals _(s_i, ei) and _(s_j, ej) intersect at any moment. This can be achieved by satisfying either one of the two following conditions:

  • _s_j < ei AND _s_i < ej
  • _s_i < ej AND _s_i > sj

It is also necessary to define when two intervals DO NOT overlap. It should respect one of the two conditions:

  • _e_i ≤ sj
  • _s_i ≥ ej

Essentially, for two people to have met, their intervals must overlap, or else they must not overlap. However, what if a person cannot be included in the schedule? This is where the variable _vi comes into play.

Finally, for every pair of people i and j:

When two individuals i and j meet, they must respect to either of the two overlapping conditions: (_s_j < ei AND _s_i < ej) OR (_s_i < ej AND _s_i > sj). If they fail to comply, one or both of them will not be included in the schedule (_vi = True OR _vj = True).

On the other hand, if two individuals i and j do not meet, they should follow the non-overlapping condition (_e_i ≤ sj OR _s_i ≥ ej). If they do not follow this condition, one or both of them will not be included in the schedule (_vi = True OR _vj = True).

To represent each of these conditions, we will use the Boolean variables b1, b2, b3, and b4.

  • b1 = _s_j < ei AND _s_i < ej
  • b2 = _s_i < ej AND _s_i > sj
  • b3 = _e_i ≤ sj
  • b4 = _s_i ≥ ej
# Auxiliary Boolean variables
b1 = {(p, p2): model.NewBoolVar(name='b1_' + p + "_" + p2) for p in people for p2 in people if p != p2}
b2 = {(p, p2): model.NewBoolVar(name='b2_' + p + "_" + p2) for p in people for p2 in people if p != p2}
b3 = {(p, p2): model.NewBoolVar(name='b3_' + p + "_" + p2) for p in people for p2 in people if p != p2}
b4 = {(p, p2): model.NewBoolVar(name='b4_' + p + "_" + p2) for p in people for p2 in people if p != p2}

for i in people:
    for j in people:
        if i != j:
            if j in people_met[i]:
                # Overlap condition 1
                model.Add(e[i] &gt; s[j]).OnlyEnforceIf(b1[(i, j)])
                model.Add(s[i] &lt; e[j]).OnlyEnforceIf(b1[(i, j)])
                # Overlap condition 2
                model.Add(s[i] &lt; e[j]).OnlyEnforceIf(b2[(i, j)])
                model.Add(s[i] &gt; s[j]).OnlyEnforceIf(b2[(i, j)])
                # At least one of the overlap conditions
                model.AddAtLeastOne([b1[(i, j)], b2[(i, j)], v[i], v[j]])
            else:
                # No overlap condition 1
                model.Add(e[i] &lt;= s[j]).OnlyEnforceIf(b3[(i, j)])
                # No overlap condition 2
                model.Add(s[i] &gt;= e[j]).OnlyEnforceIf(b4[(i, j)])
                # At least one of the non overlap conditions
                model.AddAtLeastOne([b3[(i, j)], b4[(i, j)], v[i], v[j]])

Duration of person’s visit (_di) is 0 is value _vi is True. Which means it is not included. If _vi is False, _di must be strictly positive.

# Duration of person (d_i) is 0 is value v_i is True. If v_i is False, d_i must be strictly positive.
for i in people:
    model.Add(d[i] == 0).OnlyEnforceIf(v[i])
    model.Add(d[i] &gt; 0).OnlyEnforceIf(v[i].Not())

Finally, the last constraint ensures that arrival time plus duration is equal to the end time, for each person

# Departure time is equal to arrival time plus duration
for i in people:
    model.Add(s[i] + d[i] == e[i])

Our objective function is maximizing the number of people included, which is the same as minimizing the sum of variable _vi.

# Objective function is maximizing the number of people included - minimizing the sum of variable v_i.
var_obj = 0
for i in people:
    var_obj += v[i]
model.Minimize(var_obj)Running the model we can plot the gantt chart generated:

Solving the model and plotting the gantt chart we have the following:


solver = cp_model.CpSolver()
status = solver.Solve(model)

if status == cp_model.OPTIMAL:
   for p in people:
      print(p, solver.Value(s[p]), solver.Value(e[p]), solver.Value(d[p]))
gantt_result_dict = {'Person': [], 'Start': [], 'End': [], 'Duration': []}
for i in people:
   gantt_result_dict['Person'].append(i)
   gantt_result_dict['Start'].append(solver.Value(s[i]))
   gantt_result_dict['End'].append(solver.Value(e[i]))
   gantt_result_dict['Duration'].append(solver.Value(d[i]))
gantt_result_df = pd.DataFrame.from_dict(gantt_result_dict)
print(gantt_result_df)
fig, ax = plt.subplots(1, figsize=(16, 6))
ax.barh(gantt_result_df['Person'], gantt_result_df['Duration'], left=gantt_result_df['Start'])
plt.grid()
plt.show()

Who’s again the person who could not be included and has duration equal to 0? Ann!

Mixed-Integer Linear Programming

While it may not be the simplest approach, my favorite way to solve this puzzle is by using MILP. To learn more about how to model and solve MILP, you can refer to my article below:

https://medium.com/towards-data-science/heuristics-as-warm-start-for-mixed-integer-programming-mip-models-9046781dd21f

We still have the set N, which represents the people involved, and _meti, which is the set of people who i has met. In this scenario, we will adopt a time-indexed method where we define a set of periods T beforehand (let’s assume T = 6). The variables are as follows:

  • _yi: binary. 1 if person i is included. 0 otherwise
  • x(i, t)_: binary. 1 if person i is in the castle at t. 0 otherwise
  • z(i, j, t)_: binary. 1 if person i and j are in the castle simultaneously at t. 0 otherwise
  • w(i, t)_: binary. 1 if person i arrives in the castle at t. 0 otherwise

Let’s also use ortools to solve it:

import pandas as pd
from matplotlib import pyplot as plt
from ortools.linear_solver import pywraplp

people = ['Ann', 'Betty', 'Charlotte', 'Edith', 'Felicia', 'Georgia', 'Helen']
people_met = {'Ann': ['Betty', 'Charlotte', 'Felicia', 'Georgia'],
              'Betty': ['Ann', 'Charlotte', 'Edith', 'Felicia', 'Helen'],
              'Charlotte': ['Ann', 'Betty', 'Edith'],
              'Edith': ['Betty', 'Charlotte', 'Felicia'],
              'Felicia': ['Ann', 'Betty', 'Edith', 'Helen'],
              'Georgia': ['Ann', 'Helen'],
              'Helen': ['Betty', 'Felicia', 'Georgia']}

periods = 6
T = [p for p in range(0, periods)]
solver = pywraplp.Solver('puzzle', pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)

# 1 if person i is included. 0 otherwise
y = {(i): solver.IntVar(0, 1, "y_{0}".format(i)) for i in people}

# 1 if person i is in the castle at t. 0 otherwise
x = {(i, t): solver.IntVar(0, 1, "x_{0}_{1}".format(i, t)) for i in people for t in T}

# 1 if person i and j are in the castle simultaneously at t. 0 otherwise
z = {(i, j, t): solver.IntVar(0, 1, "z_{0}_{1}_{2}".format(i, j, t))
for i in people for j in people for t in T if i != j}

# 1 if person i arrives in the castle at t. 0 otherwise
w = {(i, t): solver.IntVar(0, 1, "w_{0}_{1}".format(i, t)) for i in people for t in range(0, periods - 1)}

Constraints are shown below:

(1) guarantees that any individual included in the schedule must have been in the castle at least once. (2) ensures that if a person is included, they must have been present in the castle at the same time as other individuals they met. (3) and (4) ensure that two individuals were simultaneously present in the castle at a given time if and only if they were both included and present at the time. (5) ensures that two individuals who did not meet cannot be present in the castle at the same time. (6) defines the arrival time for each person. (7) disallows any person from being present in the castle during periods 0 and T (for simplicity). Finally, (8) guarantees that any person included in the schedule has only one recorded arrival time.

# If a person is included she must have been in the castle at least once
c1 = {i: solver.Add(solver.Sum(x[(i, t)] for t in T) &gt;= y[i]) for i in people}

# If a person is included she must have been in the castle simultaneously with other people she met at least once
c2 = {(i, j): solver.Add(solver.Sum(z[(i, j, t)] for t in T) &gt;= y[i])
      for i in people for j in people_met[i]}

# Two people were simultaneously at t in the castle only if they are both there at t
c3 = {(i, j, t): solver.Add((z[(i, j, t)] &gt;= x[(i, t)] + x[(j, t)] - 1))
      for i in people for j in people for t in T if i != j}

# Two people were simultaneously at t in the castle only if they are both there at t and if they are included
c4 = {(i, j, t): solver.Add(
    (x[(i, t)] + x[(j, t)] + periods * (1 - z[(i, j, t)]) &gt;= y[i] + y[j]))
      for i in people for j in people for t in T if i != j}

# Two people that did not meet cannot be in the castle simultaneously
c5 = {(i, j): solver.Add(solver.Sum(z[(i, j, t)] for t in T) == 0)
      for i in people for j in people if i != j and j not in people_met[i]}

# Defines the arrival time of person
c6 = {(i, t): solver.Add((w[(i, t)] &gt;= x[(i, t + 1)] - x[(i, t)]))
      for i in people for t in range(0, periods - 1)}

# Person cannot be in the castle in period 0 and period T
c7 = {(i): solver.Add(x[(i, 0)] + x[(i, periods - 1)] == 0)
      for i in people}

# If a person was included it has just one arrival time
c8 = {(i): solver.Add(solver.Sum(w[(i, t)] for t in range(0, periods - 1)) == y[i])
      for i in people}

Objective function is maximizing the number of people included

number_of_people = solver.Sum(y[i] for i in people)
solver.Maximize(number_of_people)

Solving the model and plotting the gantt chart we have the following:

gantt_result_dict = {'Person': [], 'Start': [], 'End': [], 'Duration': []}
duration = {i: 0 for i in people}
start = {i: 0 for i in people}
end = {i: 0 for i in people}
for i in people:
    for t in T:
        if x[(i, t)].solution_value() &gt;= 0.1:
            duration[i] += 1
            if x[(i, t - 1)].solution_value() == 0:
                start[i] = t
    end[i] = start[i] + duration[i]
    gantt_result_dict['Person'].append(i)
    gantt_result_dict['Start'].append(start[i])
    gantt_result_dict['End'].append(end[i])
    gantt_result_dict['Duration'].append(duration[i])
gantt_result_df = pd.DataFrame.from_dict(gantt_result_dict)
print(gantt_result_df)
fig, ax = plt.subplots(1, figsize=(16, 6))
ax.barh(gantt_result_df['Person'], gantt_result_df['Duration'], left=gantt_result_df['Start'])
plt.grid()
plt.show()

Who’s again not included? Ann!


This example demonstrates how a problem can be approached and solved in various ways using different techniques, yet all methods produce the same outcome. It’s possible that if you attempted to solve the problem yourself, you might develop an alternative approach or model.

Although, while considering this specific problem, Graph theory appears to be the most straightforward method, it cannot be universally applied to all problems since it’s an adhoc approach. The differences between Mathematical programming (MILP) and Constraint programming could be the subject of an entire article or even a book. Nevertheless, the main disparities are:

  • Mathematical programming models support both discrete and continuous decision variables, whereas constraint programming only supports discrete decision variables (integer or Boolean).
  • Constraint programming supports logical constraints as well as arithmetic expressions, including modulo, integer division, minimum, maximum, and an expression that indexes an array of values by a decision variable. It can also utilize specialized constraints, such as the all different and no overlap constraint, that can simplify modeling, whereas mathematical programming only supports , and = constraints.

References

[1] B. Claude and W. Ian, Who killed the Duke of Densmore (1994), Biblioteque Oulipienne #67

All images unless otherwise noted are by the author


Related Articles