In high school, some of us had a love-hate relationship with geometry, especially coordinate and 3D geometry, and calculus mixed with geometry was frowned upon even more. Then came a boom in information technology, followed by a craze for machine learning, artificial intelligence, and data science. That wave has inspired many to dig deeper into some of the mysteries of mathematics, and information geometry is one of them. Information geometry can be used in statistical manifold learning, which has recently proven to be a useful tool for unsupervised learning on high-dimensional datasets. It can also quantify the distance between two probability measures, with applications in pattern matching, constructing alternative loss functions for training neural networks, belief propagation networks, and optimization problems.
Information geometry is a mathematical tool for exploring the world of information using geometry. It is also called Fisherian geometry, for a reason that will become obvious later in the article.
Information geometry is the study of decision-making using geometry, which may also include pattern matching, model fitting, and so on. But why the geometric approach? Geometry lets us study invariance in a coordinate-free framework, provides a tool for thinking intuitively, and allows us to study equivariance. As an example, the centroid of a triangle is equivariant under affine transformations, as the sketch below demonstrates.
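To make the equivariance claim concrete, here is a minimal numerical sketch (the triangle, the matrix A, and the offset b are arbitrary choices of mine):

```python
import numpy as np

# Sketch: the centroid of a triangle is equivariant under affine maps.
# Mapping the vertices first and then taking the centroid gives the same
# point as taking the centroid first and then mapping it.
triangle = np.array([[0.0, 0.0], [4.0, 0.0], [1.0, 3.0]])  # vertex rows
A = np.array([[2.0, 1.0], [0.5, 1.5]])                     # linear part
b = np.array([1.0, -2.0])                                  # translation

map_then_centroid = (triangle @ A.T + b).mean(axis=0)
centroid_then_map = triangle.mean(axis=0) @ A.T + b

print(np.allclose(map_then_centroid, centroid_then_map))   # True
```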
Let’s first go over some fundamentals needed to understand information geometry.
Fundamentals
To understand differential geometry, and hence information geometry, we need to see what a manifold is. In my previous article on manifold alignment, I discussed what a manifold is, but I will reproduce it below:
An n-dimensional manifold is the most general mathematical space that supports limits, continuity, and connectedness, and that allows the existence of continuous inverse functions with n-dimensional Euclidean spaces. Manifolds locally resemble Euclidean space, but they need not be Euclidean space globally. Essentially, a manifold is a generalization of Euclidean space.
Topological Spaces
Consider a space X defined by a set of points x ∈ X and, for each point, a set of subsets of X called neighborhoods N(x). Neighborhoods satisfy the following properties:
- If U is a neighborhood of x (so x ∈ U), and V ⊂ X with U ⊂ V, then V is also a neighborhood of x.
- The intersection of two neighborhoods of x is also a neighborhood of x.
- Any neighborhood U of x includes a neighborhood V of x, such that U is a neighborhood of all points in V.
Any space X satisfying the above properties is called a topological space.
Homeomorphism
In manifold alignment, I discussed homeomorphism, which can be presented axiomatically as follows:
Considering f: X→Y to be a function between two topological spaces, X and Y are homeomorphic if f is continuous and bijective and the inverse of f is also continuous. Now consider a manifold 𝓜: if every point x ∈ 𝓜 has a neighborhood U that is homeomorphic to ℝⁿ for some integer n, then n is the dimension of the manifold.
Chart
A homeomorphism denoted by κ: U → κ(U), where U is an open subset of 𝓜, is called a chart. There can be many ways to construct charts on 𝓜, and a collection of such charts is called an atlas. The idea is presented in Figure 1; mathematically, an atlas is defined by Equation 1. A familiar example of a chart is a coordinate system: a function that maps points on a manifold to tuples of numbers.
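For reference, here is the standard form of that definition, which is what Equation 1 expresses (my reconstruction; the index set I is a generic choice):

```latex
\mathcal{A} = \{(U_i, \kappa_i) : i \in I\},
\qquad \bigcup_{i \in I} U_i = \mathcal{M},
\qquad \kappa_i : U_i \to \kappa_i(U_i) \subseteq \mathbb{R}^n
```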
At this point, it is easy to define a differentiable manifold: a manifold whose transition maps (the functions κⱼ ∘ κᵢ⁻¹ relating overlapping charts) are infinitely differentiable.
That’s all about manifolds in abstract mathematics. But what do they mean for statistics and data science? Remember, we deal with probabilities in statistics. This leads to the notion of statistical manifolds. In a statistical manifold, every point p ∈ 𝓜 corresponds to a probability distribution over a domain 𝓧. A concrete example is the manifold formed by the family of normal distributions, sketched below.
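As a minimal sketch of this correspondence (assuming scipy and a (μ, σ) parameterization, both my choices):

```python
from scipy.stats import norm

# Sketch: the family of normal distributions as a 2-D statistical
# manifold. Each point (mu, sigma), with sigma > 0, corresponds to a
# full probability distribution over the domain (the real line).
def point_on_manifold(mu, sigma):
    return norm(loc=mu, scale=sigma)

p = point_on_manifold(0.0, 1.0)   # one point on the manifold
q = point_on_manifold(1.0, 2.0)   # a nearby point

print(p.pdf(0.0), q.pdf(0.0))     # each point is a whole distribution
```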
Vectors and Tangents on a Manifold: how to define them on curved spaces
In ordinary geometry, vectors are straight lines connecting two points, but in curved spaces this picture no longer holds. Vectors on curved spaces are instead defined as tangents to a curve at a particular point on the manifold. If u is a parameter that varies along the curve, then the curve may be written as x(u), often dropping the u and simply writing x. The vector in curved spaces becomes

v = dx/du,
which is defined locally at the point p, where u = 0. Note that the vector itself does not live on the manifold; it is a Euclidean object attached to it at p. Like charts, there can be many possible tangents at the point p. Yeah, this is mind-boggling if you are merely picturing a 2D plane, but even in 3D, say on a sphere, there can be multiple tangent lines at a point; together they form a tangent plane at that point of the sphere (Figure 2). Analogously, we can talk about a tangent space at a point p of a manifold, as the numerical sketch below illustrates.
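Here is a small numerical sketch of a tangent vector as the derivative of a curve (the circle and the finite-difference step are my choices):

```python
import numpy as np

# Sketch: a tangent vector defined as dx/du at u = 0 for a curve x(u)
# living on the unit circle, a 1-D manifold embedded in R^2.
def x(u):
    return np.array([np.cos(u), np.sin(u)])

h = 1e-6
v = (x(h) - x(-h)) / (2 * h)   # central difference approximating dx/du

print(v)   # ~[0, 1]: tangent to the circle at the point x(0) = (1, 0)
```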
Going from one chart to another is the same as a coordinate transformation, say from Cartesian coordinates to polar coordinates. If ϕ is the transformation function taking x from one chart to another, we can write x′ = ϕ(x).
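Concretely, the components of a tangent vector then transform with the Jacobian of ϕ (the standard transformation law, stated here for completeness):

```latex
v'^{\,i} = \sum_{j} \frac{\partial \phi^{i}}{\partial x^{j}}\, v^{j}
```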
Dual Space
The dual space V* of a vector space V is the space containing all linear functionals on V, i.e., all linear maps T: V → F, where F is the field over which V is defined. While scouring the internet for a better understanding of the dual space, I stumbled upon the following example³:
Imagine a 2-dimensional real vector space. Define a function that takes any vector and just returns its x-coordinate, and another function that takes any vector and just returns its y-coordinate. Call the first function f1 and the second f2. Now go a step further and treat these two functions as vectors themselves, in particular as basis vectors of a funny vector space. You can add them together: f1 + f2 is a function that takes any vector and returns the sum of its x-coordinate and y-coordinate. You can multiply them by numbers: 7f1 is a function that takes any vector and returns its x-coordinate multiplied by 7. You can form any linear combination you like: 3.5f1 − 5f2 is a function that takes any vector and returns 3.5 times its x-coordinate minus 5 times its y-coordinate. This funny vector space of functions is precisely the dual space.
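That example translates directly into code. Here is a minimal sketch (the names f1 and f2 follow the quote):

```python
# Sketch: dual-space elements as linear functionals on R^2.
def f1(v):
    return v[0]   # returns the x-coordinate of v = (x, y)

def f2(v):
    return v[1]   # returns the y-coordinate

def combo(a, b):
    # A linear combination a*f1 + b*f2 is itself a functional on R^2.
    return lambda v: a * f1(v) + b * f2(v)

g = combo(3.5, -5.0)     # the functional 3.5*f1 - 5*f2 from the quote
print(g((2.0, 1.0)))     # 3.5 * 2 - 5 * 1 = 2.0
```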
Tensor
Tensors on a manifold are the most general objects of this kind. They can be considered mathematical multi-linear beasts that eat vectors from tangent spaces and their dual spaces and spit out real numbers. The total number of vectors from the tangent space and its dual space that are fed to the tensor is called the rank of the tensor. The number of arguments coming from the dual space is called the contravariant rank, and the number coming from the tangent space is called the covariant rank.
I will skip formal discussions of them, as honestly, I myself do not understand them – not just yet. But in essence, manifolds are geometric constructs and tensors are corresponding algebraic constructs.
Metric
A metric is a tensor field that induces an inner product on the tangent space at each point of the manifold. A symmetric, positive-definite tensor field of covariant rank two can be used to define a metric. Some sources call it the Riemannian metric⁷.
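In a chart, this inner product takes the familiar component form (standard notation, with g_ij the components of the metric):

```latex
\langle u, v \rangle_p = g_{ij}(p)\, u^{i} v^{j},
\qquad dl^{2} = g_{ij}\, dx^{i} dx^{j}
```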
Now, after that long and convoluted list of terminology, we can look at something genuinely useful in information geometry.
Information Geometry
Information geometry is a branch of mathematics at the intersection of statistics and differential geometry that focuses on the study of probability distributions from a geometric point of view.
We first need to look at the information metric, better known as the Fisher Information Metric.
Fisher Information Metric
If we wish to find a suitable metric tensor at a point θ*, where θ parameterizes a family of distributions p(x|θ), we need a notion of distance between p(x|θ) and its infinitesimal perturbation p(x|θ + dθ). The relative difference between the two,

Δ = [p(x|θ + dθ) − p(x|θ)] / p(x|θ),  (Equation 3)

of course depends on the random variable x. If you do the math correctly, the expectation of Δ vanishes: 𝔼[Δ] = 0. What about the variance? It turns out the variance is non-zero, so we can define dl² = 𝔼[Δ²]. From first principles, the squared length of an infinitesimal displacement between θ* and θ* + dθ for a metric 𝓕 is dl² = 𝓕 dθ dθ. Solving dl² = 𝔼[Δ²] = 𝓕 dθ dθ gives

𝓕(θ) = 𝔼[(∂ log p(x|θ)/∂θ)²],  (Equation 4)
which is what we call the Fisher Information Metric (FIM). The FIM measures how much information an observation of the random variable X carries, on average, about the parameter θ when x ∼ p(x|θ). There is another way to arrive at Equation 4, up to a factor of 1/2, for which I refer readers to reference 1. And if it excites you, the burgeoning field of Quantum Information Science also has applications of the Fisher Information Metric; I will soon have a new series of articles on that, so subscribe if you want to be notified! Another application is the design of uninformative priors in Bayesian inference. But maybe I am going too far now. A quick numerical check of Equation 4 is sketched below.
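As a sanity check, here is a Monte Carlo sketch of Equation 4 for a normal distribution with unknown mean (the sample size and seed are arbitrary choices; the analytic answer is 1/σ²):

```python
import numpy as np

# Sketch: estimate the Fisher information of N(mu, sigma^2) with respect
# to mu via FIM = E[(d/d_mu log p(x|mu))^2], using samples x ~ p(x|mu).
rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Score function: d/d_mu log p(x|mu, sigma) = (x - mu) / sigma^2
score = (x - mu) / sigma**2
fim_estimate = np.mean(score**2)

print(fim_estimate)   # ~0.25
print(1 / sigma**2)   # exact value: 0.25
```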
I will conclude the article with one final thing: the Fisher Information Matrix 𝓘, the matrix version of 𝓕 when we are dealing with multiple parameters, can be used in optimization in a way similar to gradient descent, with the update rule

θ ← θ − η 𝓘(θ)⁻¹ ∇J(θ),

where η is the learning rate and ∇J is the gradient of the scalar objective J. This update is known as natural gradient descent.
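A minimal sketch of that update rule, assuming we already have callables for the gradient and the Fisher matrix (both below are toy stand-ins, not a real model):

```python
import numpy as np

# Sketch: natural gradient descent. The inverse Fisher matrix reshapes
# the ordinary gradient to respect the geometry of the parameter space.
def natural_gradient_step(theta, grad_J, fisher, eta=0.1):
    # theta_new = theta - eta * I(theta)^{-1} grad J(theta)
    return theta - eta * np.linalg.solve(fisher(theta), grad_J(theta))

# Toy objective J(theta) = ||theta||^2 with an arbitrary constant Fisher.
grad_J = lambda th: 2.0 * th
fisher = lambda th: np.array([[4.0, 0.0], [0.0, 1.0]])

theta = np.array([1.0, 1.0])
for _ in range(10):
    theta = natural_gradient_step(theta, grad_J, fisher)
print(theta)   # shrinks toward the minimum at the origin
```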
In this article, I tried to provide a brief overview of information geometry and related terms. A considerable amount of work on this topic has been omitted for clarity, and I encourage readers to go through the reference materials I have provided below.
Was this helpful? Buy me a Coffee.
Love my writing? Join my email list.
Want to know more about STEM-related topics? Join Medium
References
- http://www.robots.ox.ac.uk/~lsgs/posts/2019-09-27-info-geom.html*
- https://math.stackexchange.com/questions/240491/what-is-a-covector-and-what-is-it-used-for
- https://qr.ae/pv34JS
- https://franknielsen.github.io/SPIG-LesHouches2020/Geomstats-SPIGL2020.pdf*
- https://www.cmu.edu/biolphys/deserno/pdf/diff_geom.pdf*
- https://math.ucr.edu/home/baez/information/information_geometry_1.html*
- https://mathworld.wolfram.com/RiemannianMetric.html*
Note: links with * at the end have been archived on http://web.archive.org/.