STATISTICAL REASONING
We are travelling a tough evolutionary road.
Many of us hope and strive for Artificial Intelligence in which the word intelligence is more relevant than the word artificial. Many steps forward have been taken. Many shortcomings still have to be addressed.
Along this way, one of the more exciting models is the Markov Logic Network (MLN), which aims to bridge the gap between two very powerful branches of reasoning about reality – logic and statistics – to produce an even more powerful and effective reasoning technique.
THE MAGIC COMBO
There are many ways to create models able to reason – in other words, to achieve the ‘intelligence’ part of AI. Yet the ‘logic-statistics magic combo’ has proven to be the most useful one, both for the beauty of its theory and for its practical application to the problems we would like AI to tackle: it combines the power of deductive thinking in complex scenarios with the management of uncertainty, an unavoidable part of reality.
In this magic combo, MLNs sit in the sphere of Statistical Relational Learning. In particular, they combine first-order logic with a probabilistic graphical model (Markov networks, also called Markov random fields) and have proven to be among the most powerful models in many applications.
Other models or languages in this magic combo (ProbLog, etc.) suffer from severe problems in real applications. If you want to know more about other approaches and the critical ingredients they lack, you can read about them here on Medium:
In essence, Markov Logic works like this: syntactically, it extends first-order logic by attaching weights to logical formulas; semantically, it treats those weighted formulas as templates for constructing Markov networks.
I will come back to that in a minute!
If first-order logic says nothing to you, you are probably a penguin. But don’t worry: you can start to make up for it here.
And if you want to know some basics about Probabilistic Graphical Models you can read this interesting article on Towards Data Science by Branislav Holländer.
If instead you don’t know – and don’t want to know – anything about them, you might want to reconsider reading this article…
Here’s the plan:
- we recall some fundamentals of Markov networks, essential to better understand the core of the whole topic:
- Markov Logic.
- This will make Markov Logic Networks crystal clear…
- …and we will find out whether they are the best statistical thinking model available!
MARKOV NETWORKS
A Markov network is a log-linear model representing the joint distribution of a set of random variables corresponding to the nodes of an undirected graph having the required Markov properties. These properties give us a specific semantics that tells us which variables become independent of which others, and let us read the graph as a structure of independences.
Undirected edges connect two nodes (two random variables) whenever we perceive a direct connection, some form of symmetric ‘affinity’, between them. In other terms, edges capture the relationship between two nodes. The model parameters are the so-called potential functions, each associated with a maximal clique of the graph (a completely connected subgraph) and able to take any non-negative value. The parameters are not probabilities; rather, they can be interpreted as degrees of compatibility among the states of the clique’s variables.
This affinity can reflect a known causal relationship, some mutual influence, association, correlation, or whatever else – it is up to the designer of the model to get it right and sound.
The assignment of a specific value to every variable is a state, or world. The probability of a state is the normalized product of all the potential functions. And here we are doomed, because if the cliques are large this becomes intractable: the size of a potential function is exponential in the number of variables in its clique.
But a Markov network is a log-linear model, so we can reformulate the full joint distribution as an exponential-family model in canonical form, which is what is used in practice. Instead of a product of potential functions, we switch to a representation with an exponentiated sum of weighted features. We will still have to deal with complexity problems, though less severe ones.
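To make the two representations concrete, here is a minimal sketch (with made-up potential values, not taken from the article) of a tiny Markov network over three binary variables, computing the same joint distribution first as a normalized product of potentials and then in log-linear form, with one feature weighted by the log of the potential:

```python
import itertools
import math

# Toy Markov network over three binary variables A, B, C with one clique {A, B, C}.
# Illustrative potential: high affinity when A, B, C agree.
def potential(a, b, c):
    return 4.0 if a == b == c else 1.0

states = list(itertools.product([0, 1], repeat=3))

# Joint probability = normalized product of clique potentials.
Z = sum(potential(*s) for s in states)

def p_product(a, b, c):
    return potential(a, b, c) / Z

# Equivalent log-linear form: a binary feature for the clique state of interest,
# weighted by the log of the corresponding potential value.
def feature_agree(a, b, c):          # 1 if A, B, C agree, 0 otherwise
    return 1 if a == b == c else 0

w = math.log(4.0)                    # weight = log(potential)

def p_loglinear(a, b, c):
    Z2 = sum(math.exp(w * feature_agree(*s)) for s in states)
    return math.exp(w * feature_agree(a, b, c)) / Z2

print(p_product(1, 1, 1), p_loglinear(1, 1, 1))  # the two forms agree
```

Both functions return the same distribution; the log-linear form simply trades the table of potentials for a sum of weighted features.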
MARKOV LOGIC
Turning our attention to this representation, understanding Markov Logic, essential for MLNs, will be easier.
A feature may be any real-valued function (for example, a logical function) of the state of a clique, or of part of a clique, and is weighted by the log of the corresponding potential. For example, a dichotomous feature over a clique A–B–C could be modeled as: 1 if not A or B or C, and 0 otherwise. Or as: 1 if A and C, 0 otherwise.
And this is another beautiful point of Markov networks! OK, they model some kind of affinity among groups of variables, through the cliques. With the exponential representation, even if you have very large cliques with many states, if you know the set of states you are interested in and the subset of features that contributes to it, you can just focus on those features and their weights! What you gain is a more compact model containing only the items that matter for your analysis.
I hate to break it to you, but even though you can use a more compact representation than the potential-function form – specifying a much smaller number of states and features (for example, the logical functions we saw above) – the canonical-form representation is still exponential in the size of the cliques.
In general, inference in Markov networks is #P-complete.
So, in general, you are forced to use approximate inference. Markov chain Monte Carlo methods, and Gibbs sampling in particular, are the norm here: each variable is sampled in turn given its Markov blanket, which in a Markov network is simply the set of its neighbors in the graph. Marginal probabilities are computed by counting over these samples, and conditional probabilities by running the Gibbs sampler with the conditioning variables clamped to their given values.
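A minimal sketch of this procedure, on a toy chain-structured network A–B–C with illustrative pairwise potentials (all values are assumptions for the example), comparing the Gibbs estimate of a marginal with the exact brute-force value:

```python
import random
import itertools

random.seed(0)

# Chain-structured Markov network A - B - C with pairwise potentials
# favouring equal neighbours (illustrative numbers).
def phi(u, v):
    return 3.0 if u == v else 1.0

def unnorm(a, b, c):                 # unnormalized product of clique potentials
    return phi(a, b) * phi(b, c)

# Gibbs sampling: resample each variable from its conditional given its
# Markov blanket (here, just its neighbours in the graph).
def gibbs(n_samples, burn_in=1000):
    state = {'a': 0, 'b': 0, 'c': 0}
    counts = {'a': 0, 'b': 0, 'c': 0}
    for t in range(burn_in + n_samples):
        for var in ('a', 'b', 'c'):
            s1 = dict(state, **{var: 1})
            s0 = dict(state, **{var: 0})
            p1 = unnorm(s1['a'], s1['b'], s1['c'])
            p0 = unnorm(s0['a'], s0['b'], s0['c'])
            state[var] = 1 if random.random() < p1 / (p0 + p1) else 0
        if t >= burn_in:
            for var in counts:
                counts[var] += state[var]
    return {var: counts[var] / n_samples for var in counts}

# Exact marginal of B by brute force, for comparison.
Z = sum(unnorm(*s) for s in itertools.product([0, 1], repeat=3))
exact_b = sum(unnorm(a, 1, c) for a in (0, 1) for c in (0, 1)) / Z

est = gibbs(20000)
print(est['b'], exact_b)   # the sampled estimate should be close to the exact value
```

Clamping a conditioning variable would simply mean skipping it in the resampling loop and leaving it fixed at its observed value.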
Don’t the dichotomous feature examples above look like a set of constraints over a complex reality of interest?
And isn’t that the description of a logical knowledge base, a logical theory? A set of hard constraints on the set of possible worlds – which are nothing but assignments of values to all the variables involved, also called knowledge base interpretations, states, or groundings.
In a logical knowledge base, constraints are hard. Of course, such rules are not always true in reality. But if you violate even one constraint – that is, a single grounding of a single logical formula – the whole world becomes false. Very annoying!
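A tiny illustration of this hardness, with hypothetical ground formulas encoded as Boolean functions over a world:

```python
# Hard constraints: a world is a model only if EVERY grounding of every
# formula holds; a single violation rules the whole world out.
def satisfies_all(world, ground_formulas):
    return all(f(world) for f in ground_formulas)

# Toy knowledge base (hypothetical ground formulas, written as implications).
kb = [
    lambda w: (not w['sport_A']) or w['healthy_A'],   # Sport(A) => Healthy(A)
    lambda w: (not w['sport_B']) or w['healthy_B'],   # Sport(B) => Healthy(B)
]

world = {'sport_A': True, 'healthy_A': True, 'sport_B': True, 'healthy_B': False}
print(satisfies_all(world, kb))   # False: one violated grounding sinks it
```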
As the people from the semantic web know very well, putting together more and more hard constraints results in a very narrow space of interpretations satisfying those logical rules (called models)… if not in a contradiction (no models at all).
But that’s the idea: what if we relax the concept that a true interpretation must satisfy all the constraints?
The basic idea of Markov logic is to soften these constraints: when a world violates a constraint, it becomes less probable, but not impossible. This is what makes probabilistic inference over logical constraints possible, and it is exactly what MLNs do. Let’s see how!
MARKOV LOGIC NETWORKS
An MLN is a set of pairs (F, w), where F is a first-order formula and w is a real number, a positive or negative weight. So where did the Markov networks go?! Into the semantics!
An MLN defines templates for Markov networks, in which the features are the groundings of the constraints, i.e., the instantiations of the first-order formulas with constants. More precisely, given a set of pairs (F, w) and a database (a finite set of constants), an MLN defines a Markov network in this way:
- one node for each grounding of each predicate in the MLN
- one feature for each grounding of each formula F in the MLN with the corresponding weight w
- all the features that are groundings of the same formula have the same weight
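The construction can be sketched in a few lines. The predicates, formulas, weights, and constants below are assumptions modeled on the sport/friendship example discussed next:

```python
import itertools

# Hypothetical MLN: Sport(x) => Healthy(x) with weight 1.3, and a friendship
# formula over pairs with weight 0.7, over the constants A and B.
constants = ['A', 'B']
predicates = {'Sport': 1, 'Healthy': 1, 'Friends': 2}   # name -> arity

# One node for each grounding of each predicate.
nodes = [(name, args)
         for name, arity in predicates.items()
         for args in itertools.product(constants, repeat=arity)]

# One feature for each grounding of each formula; groundings of the same
# formula all carry that formula's weight.
formulas = [
    ('Sport(x) => Healthy(x)', 1.3, 1),                 # (text, weight, free vars)
    ('Friends(x,y) => (Sport(x) and Sport(y))', 0.7, 2),
]
features = [(text, weight, args)
            for text, weight, nvars in formulas
            for args in itertools.product(constants, repeat=nvars)]

print(len(nodes))     # 2 + 2 + 4 = 8 ground predicates
print(len(features))  # 2 + 4 = 6 ground formulas
```

The node and feature counts grow with the number of constants, which is exactly the blow-up discussed later.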
Easier to see than to read!
This example shows an MLN composed of two formulas and two weights: the purple one states that whoever plays sport is healthy, with a weight of 1.3; the blue one says that for each pair x and y that are friends, the two friends will both play sport, with a weight of 0.7. Moreover, we have a set of constants, A and B. Below, the Markov network derived from the MLN is visualized:
- A node for each grounding of each predicate: a Boolean random variable evaluating to 1 in interpretations that include that ground predicate, and 0 otherwise.
- A feature (a clique) for every grounding of every formula: an edge between any two nodes whose ground predicates appear in the same ground formula. The edges take the color of the formula they come from.
We see thereby that an MLN is a nice template maker for Markov networks.
A ground Markov network is the template filled in with specific values for each node.
The most attentive among you will notice… there are two groundings for the friendship between A and B, and even two groundings for self-friendship!
Well, trust me or ask your therapist: friendship is not symmetric and self-esteem is always a big deal!
THE BEST STATISTICAL THINKING MODEL
The idea behind MLNs is to define a distribution over possible worlds – also called interpretations (not necessarily true interpretations) of the knowledge base – that varies based on how ‘good’ they are or, in other terms, how much ‘weight’ they carry.
The more highly weighted constraints an interpretation satisfies, the ‘better’ it is. So, in the example, the more people practice sport and the healthier they are, the likelier the corresponding world becomes.
As formula weights increase, an MLN increasingly resembles a purely logical knowledge base, becoming equivalent to one in the limit as all weights tend to infinity. But remember that weights can also be negative.
To be precise, a single Markov logic network does not actually represent a single distribution, but a whole family of Markov networks and their distributions, depending on the data. But these distributions are all based on repetitions of the same template.
And here is another interesting thing for statisticians:
the peaks of these distributions are their modes, the most probable worlds – those satisfying all the constraints.
The probability distribution over possible worlds x specified by the ground Markov network is

P(X = x) = (1/Z) · exp( Σᵢ wᵢ · nᵢ(x) )

where the sum runs over the F formulas of the MLN, wᵢ is the weight of the i-th formula, nᵢ(x) is the number of true groundings of formula i in x, and Z is the normalizing partition function.
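A small numerical sketch of this distribution, using the weights from the running example and made-up true-grounding counts n_i(x) for three hypothetical worlds:

```python
import math

# Weights of the two formulas from the running example.
weights = [1.3, 0.7]

# Hypothetical worlds with their true-grounding counts [n_1(x), n_2(x)]
# (the counts are invented for illustration).
worlds = {
    'x1': [2, 6],   # satisfies many groundings of both formulas
    'x2': [1, 3],
    'x3': [0, 0],   # violates everything: less probable, not impossible
}

# P(X = x) = exp(sum_i w_i * n_i(x)) / Z
scores = {x: math.exp(sum(w * n for w, n in zip(weights, counts)))
          for x, counts in worlds.items()}
Z = sum(scores.values())
probs = {x: s / Z for x, s in scores.items()}
print(probs)   # x1 gets the highest probability, x3 the lowest but nonzero
```

Note how x3, which violates every constraint, still receives nonzero probability: the softened-constraint idea in one line of arithmetic.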
Unfortunately, this doesn’t look even remotely practical: we still have an exponential number of nodes and, just to remind you again, satisfiability in first-order logic is undecidable and inference in probabilistic graphical models is #P-complete. So, what?
Typically, your ultimate data analysis goal will be finding the most probable state of the world given some evidence, or computing marginal probabilities. This doesn’t seem headed for a happy ending, because the probability of a formula is the sum of the probabilities of the worlds where it holds, and computing it by brute force requires time exponential in the number of possible ground atoms. But don’t worry: there are many methods available to actually perform inference on MLNs.
First of all, there are several useful assumptions you can make: i) unique names (different constants refer to different objects); ii) domain closure and finiteness (the only objects in the domain are those representable using the constant and function symbols); iii) known functions (the value of each function for each tuple of arguments is always a known constant). These assumptions about the logical representation ensure that the number of possible worlds is finite and that the Markov logic network defines a well-defined probability distribution. Moreover, they are quite reasonable in most practical applications and greatly simplify the use of MLNs.
Then, there are many (approximate) procedures to use – for example, Markov chain Monte Carlo inference, which samples a sequence of states of the network according to their probabilities and counts the fraction of sampled states in which the formula holds.
Also, keep in mind an entire family of lovely methods known as lifted inference. They shrink the space – or the sampling space – of probabilistic inference by reducing the number of models to be analyzed: they speed up inference by reasoning at the first-order level about groups of indistinguishable objects, exploiting symmetries in the structure of the grounded graph.
JOYS AND SORROWS
So, are Markov Logic Networks actually qualified to be considered the best statistical thinking model? Let’s see…
💚 MLNs are a fantastic machine learning model using the magic-combo: the power of complex reasoning in first-order logic and the management of uncertainty of probabilistic graphical models.
💚 They are very compact. Compared with Markov networks they reduce the entire graph structure to groups of interrelated entities that really matter to your data analysis.
💔 Although MLNs are very compact as specifications, they produce exponentially large ground graphs: an approach to reality that is complete but not computationally sustainable.
💚 You can really use MLNs in practice due to an extensive set of (approximate) algorithms and techniques helping to cope with all this verboseness.
💚 While contradictions are a fatal problem in first-order logic, in MLNs they are not: Markov logic allows contradictions between formulas and resolves them simply by weighing the evidence on both sides, so you can smooth the ‘hardness’ of logical formulas.
💔 ‘Ex falso sequitur quodlibet’ – from falsehood, anything follows. The fact that contradiction is not a problem is itself a problem, and it is a behavior inherited from first-order logic. Material implication may not be what you want to represent your domain: don’t forget that rains -> take_umbrella is satisfied whenever it doesn’t rain, whether you take the umbrella or not! You must be very careful and conscious when handling this kind of reasoning; otherwise, you can easily end up in trouble with your model’s way of thinking.
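You can check this with the truth table of material implication:

```python
# Material implication: p => q is equivalent to (not p) or q, so the
# formula rains => take_umbrella is true whenever it does not rain,
# regardless of the umbrella.
def implies(p, q):
    return (not p) or q

for rains in (False, True):
    for umbrella in (False, True):
        print(rains, umbrella, implies(rains, umbrella))
# Only one row is false: rains=True, umbrella=False.
```

In MLN terms, a weighted implication therefore rewards every sunny world, with or without the umbrella, which may not be the dependency you meant to model.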
💚 MLNs can handle non-iid data – data that is not independent and identically distributed – by leveraging the power of first-order logic to compactly represent dependencies among objects and relations. This is paramount in many real applications.
💔 MLNs inherit a fundamental element of first-order logic grammar: existential quantification. But although extensions exist, workable versions of MLNs only support finite domains, so existential quantification cannot really create new objects in the domain of discourse. These models cannot ‘invent’ new nulls, a standard mechanism in other formalisms. One ruinous consequence is that you cannot even assert the simple statement that every person has a father whom you don’t already know (straight from your Logic 1.0 book).
💔 MLN semantics cannot express induction. Well-known examples of inductive definitions are the Fibonacci series and the factorial, and the plainest example of all is transitive closure. If a system cannot even compute the transitive closure of a binary relation, you can be sure it won’t help you with reasoning! All sorts of daily-life problems follow from this: evaluating contagion propagation within networks is just one example.
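For contrast, here is what transitive closure looks like as a least-fixpoint computation – the kind of inductive definition plain MLN semantics cannot capture (the relation below is a made-up example):

```python
# Transitive closure of a binary relation via a least-fixpoint iteration:
# keep adding composed pairs until nothing new appears.
def transitive_closure(edges):
    closure = set(edges)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:          # fixpoint reached
            return closure
        closure |= new

links = {('a', 'b'), ('b', 'c'), ('c', 'd')}
print(transitive_closure(links))
# adds ('a','c'), ('b','d'), ('a','d') to the original edges
```

The “until nothing new appears” loop is exactly the inductive step that a fixed set of weighted first-order formulas cannot express.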
💔 MLNs don’t have recursion. Can you imagine a reasoning model that cannot express concepts requiring recursion? You could not even express a reasoning tree, or simply traverse a graph.
Keep going with the conversation in the section below! Let me know if you consider Markov logic networks the best statistical thinking model of Artificial Intelligence, or not. And more importantly, let me know why! 🙂
Follow me on Medium for more.
Let’s keep in touch also on Linkedin!
If you’re wondering whether those strange, unidentified animals called ‘women’ have also contributed to science – and to this kind of science in particular, Markov chain Monte Carlo methods – read here about a big recent loss:
Arianna Rosenbluth Dies at 93; Pioneering Figure in Data Science