# Invariant Information Measure

written by Ge Yang

In a recent publication, I needed to connect the effective rank of a random matrix, which is defined on a discrete distribution, with the spectral entropy of a continuous spectrum. The first step is to find a good entropy measure for continuous distributions. I could not find a clean write-up that is easy to understand -- Jaynes' notes in particular use non-standard notation that is quite confusing to read. To make it easier for others to check my math, I put together the following write-up.

This writeup is structured with the following parts:

1. The axioms of entropy that give rise to Shannon entropy as a measure.
2. E. T. Jaynes' derivation of the invariant information measure via the limiting density of discrete points.
3. Numerical results.

# Axioms of an Entropy Measure

Shannon derives the entropy measure on a discrete probability distribution via the following axioms [1]:

1. Monotonically decreasing: if the probability $p(m_i)$ of an event $m_i$ increases, a single observation of that event $m_i$ carries less information.
2. Non-negativity: $I(p) \geq 0$; information cannot be negative.
3. $I(1) = 0$: events that always occur do not communicate information.
4. Additivity: $I(a \cdot b) = I(a) + I(b)$ if events $a$ and $b$ are independent of each other.

The negative log measure $I(p) = -\log p$ satisfies all four axioms on discrete distributions. Its expectation is called the Shannon entropy

$H(x) = - \sum_i p(x_i) \log p(x_i)$

where $p(x_i)$ are the probabilities of individual cases in a discrete distribution.
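The discrete measure is easy to check numerically. Below is a minimal sketch (assuming NumPy is available), using the fair coin as an example:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H = -sum_i p_i log p_i, in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

# Additivity axiom: I(a * b) = I(a) + I(b) for independent events.
a, b = 0.5, 0.25
assert np.isclose(-np.log(a * b), -np.log(a) - np.log(b))

print(shannon_entropy([0.5, 0.5]))     # → 0.6931... (a fair coin carries log 2 nats)
```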

# Differential Entropy

Shannon extends the entropy from discrete distributions to continuous ones by simply replacing the sum with an integral. This is called the differential entropy.

Definition Let $X$ be a random variable with cumulative distribution function $F(x) = \Pr(X \le x)$. If $F(x)$ is continuous, the random variable is said to be continuous. Let $f(x) = F'(x)$ when the derivative is defined. If $\int_{-\infty}^\infty f(x) \mathrm d x = 1$, $f(x)$ is called the probability density function (PDF) for $X$. The set where $f(x) \gt 0$ is called the support of $X$.

Definition The differential entropy $h(X)$ of a continuous random variable $X$ with density $f(x)$ is defined as

$h(X) = - \int_S f(x) \log f(x) \mathrm d x,$

where $S$ is the support set of the random variable. This is often written as $h(f)$ as it only depends on the PDF $f(x)$.

## Issues with Differential Entropy

Differential entropy violates the non-negativity axiom above. This can be seen via a simple example. Consider a uniform distribution defined between $[0, \frac 1 2]$, $\mathcal U(0, \frac 1 2)$. The PDF is a constant $2$ over the entire support. The differential entropy is hence negative:

$h(X) = \int_0^{\frac 1 2} -2 \log (2) \, \mathrm d x = - \log 2.$
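A quick numerical cross-check of this negative value (a sketch, assuming NumPy; the midpoint Riemann sum is just one convenient discretization):

```python
import numpy as np

# Midpoint Riemann sum of h(X) = -∫ f log f dx for U(0, 1/2),
# whose PDF is the constant f(x) = 2 on its support.
N = 100_000
dx = 0.5 / N
x = (np.arange(N) + 0.5) * dx          # cell midpoints in [0, 1/2]
f = np.full_like(x, 2.0)
h = -np.sum(f * np.log(f)) * dx
print(h)                               # → -0.6931... = -log 2 < 0
```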

Jaynes pointed out in his 1963 course note [2] that this entropy measure is also problematic for two other reasons. First, the measure is dimensionally incorrect. Notice that $\mathrm d x$ has the dimensionality of space, whereas the rest of the terms and the l.h.s. should be dimensionless. Second, the expression is not invariant w.r.t. changes of variable. Consider a distribution $p(x)$ defined over the interval $[a, b) \subset \mathbb R$. Upon a change of variable, say $u = 2 x$, the density transforms into $\tilde p(u) = \frac 1 2 p(u / 2)$ on $[2a, 2b)$, and the entropy measure changes

$H_u(X) = - \int_{2a}^{2b} \tilde p(u) \log \tilde p(u) \, \mathrm d u = H(X) + \log 2.$
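The shift under a change of variable can be verified numerically. A minimal sketch assuming NumPy; the density $p(x) = 2x$ on $[0, 1]$ is an illustrative choice, not one from the text:

```python
import numpy as np

N = 1_000_000
# Density p(x) = 2x on [0, 1]; under u = 2x it becomes u/2 on [0, 2].
dx = 1.0 / N
x = (np.arange(N) + 0.5) * dx          # midpoints avoid log(0)
h_x = -np.sum(2 * x * np.log(2 * x)) * dx

du = 2.0 / N
u = (np.arange(N) + 0.5) * du
h_u = -np.sum(u / 2 * np.log(u / 2)) * du

print(h_u - h_x)                       # → 0.6931... = log 2, not zero
```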

Hence, to find an entropy measure that satisfies all four axioms, we need something else. Jaynes' course note uses non-standard notation and is quite difficult to follow. Below, I produce a cleaner derivation of the invariant measure.

# Invariant Measure as the Limiting Density of Discrete Points

Our goal is to derive an invariant information measure that satisfies all four axioms. Note that here we are referring to the overall entropy measure, not the measure $m(x)$ below. Our overall strategy is to take a discrete probability distribution to the limit where $N\rightarrow \infty$.

Consider a probability density function $w(x)$ defined on the real-line that is normalized via the integral

$\int w(x) \mathrm d x = 1.$

Sample $N$ points, and let $n_{a, b}$ denote the number of points that fall into the interval $[a, b]$. The limiting density of these points can then be expressed as

$\lim_{N\rightarrow \infty} \frac {n_{a, b}}{N} = \int_a^b m(x) \mathrm d x.$

We can assume without loss of generality that we only sample points between $[a, b]$, s.t. $n_{a, b} = N$. If we further assume the measure $m(x)$ to be constant over the support $[a, b]$, then

$m(x) = \frac{1}{\vert b - a\vert}.$

Now, to derive the invariant measure, assume that the passage to the limit is sufficiently well-behaved. The distance between nearby data points $x_i$ and $x_{i+1}$ then satisfies

$\lim_{N\rightarrow \infty} N \cdot \vert x_{i+1} - x_i\vert = m(x_i)^{-1},$

which means that for large $N$,

$\vert x_{i+1} - x_i \vert \approx \frac 1 {N\, m(x_i)}.$
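For a constant measure, this spacing relation is easy to verify with sampled points (a sketch, assuming NumPy; the interval $[0, 3]$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, N = 0.0, 3.0, 100_000
x = np.sort(rng.uniform(a, b, N))
gaps = np.diff(x)
# With the constant measure m(x) = 1/(b - a), the predicted spacing
# is 1/(N m) = (b - a)/N.
print(gaps.mean(), (b - a) / N)        # both ≈ 3.0e-05
```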

We can define the probability mass at each point of this discrete distribution using the probability density function $w(x)$ of the continuous distribution under consideration:

$p(x_i) = \left\vert x_{i+1} - x_i\right\vert \cdot w(x_i).$

Now plug in the previous equation

$p(x_i) = \frac{w(x_i)}{N m(x_i)}.$

We can do a sanity check: for a uniform distribution $\mathcal U(a, b)$ with density $w(x) = \frac 1 {\vert b - a \vert}$, the density and the measure coincide, so every point carries equal mass

$p(x_i) = \frac{w(x_i)}{N\, m(x_i)} = \frac{1 / \vert b - a \vert}{N / \vert b - a \vert} = \frac 1 N.$

We can now write down the entropy of this discrete distribution in the limit

\begin{aligned} \lim_{N\rightarrow \infty} H(x) &= - \sum_{i=1}^N p(x_i) \log p(x_i)\\ &= - \int_a^b w(x) \log \frac{w(x)}{N m(x)} \, \mathrm d x. \end{aligned}

If we pull the $N$-dependent term out,

$H(x) = \log N - \int_a^b w(x) \log{\frac{w(x)}{m(x)}} \mathrm d x,$

where $m(x) = \frac 1 {\vert b - a\vert}$.

The first term, $\log N$, diverges as $N \rightarrow \infty$. We subtract it from the expression to obtain a finite quantity

$H'(x) = - \int_a^b w(x) \log{\frac{w(x)}{m(x)}} \mathrm d x.$

In other words, for $N$ samples from a continuous distribution, the invariant information measure is the limit

$H_\text{inv}(X) = \lim_{N \rightarrow \infty} \left[ H(x_1, \dots, x_N) - \log N \right].$
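The limit can be sanity-checked numerically: discretize $[a, b]$ into $N$ equal cells (i.e. the uniform measure $m$) and watch $H - \log N$ converge. A sketch assuming NumPy; the density $w(x) = 2x$ on $[0, 1]$ is an illustrative choice:

```python
import numpy as np

def invariant_entropy_estimate(w, a, b, N):
    """Discrete entropy of N equispaced cells, minus log N.
    Equispaced points correspond to the uniform measure m(x) = 1/(b - a)."""
    dx = (b - a) / N
    x = a + (np.arange(N) + 0.5) * dx  # cell midpoints
    p = w(x) * dx                      # p(x_i) = w(x_i) |x_{i+1} - x_i|
    p = p / p.sum()                    # absorb discretization error
    return -np.sum(p * np.log(p)) - np.log(N)

w = lambda x: 2 * x                    # illustrative density on [0, 1]
for N in (10, 1_000, 100_000):
    print(N, invariant_entropy_estimate(w, 0.0, 1.0, N))
# converges to -∫ w log(w/m) dx = 1/2 - log 2 ≈ -0.1931, with m = 1
```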

# Numerical Results
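One experiment worth running here (a sketch assuming NumPy; the density $w(x) = 2x$ and the change of variable $u = 2x$ are illustrative choices): verify that $H' = -\int w \log(w / m)\, \mathrm d x$ is invariant under a change of variable, because $w$ and $m$ pick up the same Jacobian factor, which cancels inside the logarithm.

```python
import numpy as np

def H_prime(w, m, a, b, N=1_000_000):
    """Midpoint Riemann sum of H' = -∫ w log(w/m) dx over [a, b]."""
    dx = (b - a) / N
    x = a + (np.arange(N) + 0.5) * dx
    return -np.sum(w(x) * np.log(w(x) / m(x))) * dx

# Original variable: w(x) = 2x with m(x) = 1 on [0, 1].
h1 = H_prime(lambda x: 2 * x, lambda x: np.ones_like(x), 0.0, 1.0)
# After u = 2x: w(u) = u/2 with m(u) = 1/2 on [0, 2] -- both densities
# pick up the same Jacobian factor 1/2.
h2 = H_prime(lambda u: u / 2, lambda u: np.full_like(u, 0.5), 0.0, 2.0)
print(h1, h2)                          # both ≈ -0.1931 = 1/2 - log 2
```

Contrast this with the differential entropy, which shifts by $\log 2$ under the same change of variable.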


# References

1. Shannon, C. E. (1997) ‘The mathematical theory of communication. 1963’, M.D. computing: computers in medical practice, 14(4), pp. 306–317. Available at: https://www.ncbi.nlm.nih.gov/pubmed/9230594.
2. Jaynes, E. T. (1963) ‘Information Theory and Statistical Mechanics (Notes by the lecturer)’, Brandeis University Summer Institute Lectures in Theoretical Physics (Accessed: 24 October 2021).