Invariant Information Measure

written by Ge Yang

In a recent publication, I needed to connect the effective rank of a random matrix, which is computed from a discrete distribution, with the spectral entropy of a continuous spectrum. The first step in doing so is to find a good entropy measure for continuous distributions. I tried to find a clean write-up that is easy to understand -- Jaynes' notes in particular use non-standard notation that is quite confusing to read. To make it easier for others to check my math, I put together the following write-up.

This write-up is structured in the following parts:

  1. Axioms of entropy that give rise to the Shannon entropy as a measure.
  2. E.T. Jaynes' derivation of the invariant information measure via the limiting density of discrete points
  3. Numerical results

Axioms of An Entropy Measure

Shannon derives the entropy measure on a discrete probability distribution via the following axioms on the information $I(p)$ of an event with probability $p$ [1]:

  1. Monotonically decreasing: if the probability $p(m_i)$ of an event $m_i$ increases, there is less information from a single observation of that event $m_i$.
  2. Non-negativity: $I(p) \geq 0$; information cannot be negative.
  3. $I(1) = 0$: events that always occur do not communicate information.
  4. Additivity: $I(a \cdot b) = I(a) + I(b)$ if events $a$ and $b$ are independent of each other.

The negative log measure $I(p) = -\log p$ satisfies all four axioms on discrete distributions. Its expectation over the distribution is called the Shannon entropy

$$H(X) = - \sum_i p(x_i) \log p(x_i),$$

where $p(x_i)$ are the probabilities of the individual outcomes of the discrete distribution.
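As a quick illustration, here is a minimal sketch (assuming NumPy; the helper name `shannon_entropy` is my own) that computes the Shannon entropy of a discrete distribution directly from its probability vector:

```python
import numpy as np

def shannon_entropy(p, base=np.e):
    """Shannon entropy H(X) = -sum_i p_i log p_i of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                 # 0 * log 0 is taken to be 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(shannon_entropy([0.5, 0.5], base=2))       # fair coin: 1 bit per observation
print(shannon_entropy([1.0]))                    # certain event: no information (axiom 3)
```

With `base=2` the result is in bits; with the natural log it is in nats.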

Differential Entropy

Shannon extends the entropy from discrete distributions to continuous ones by simply replacing the sum with an integral. This is called the differential entropy.

Definition. Let $X$ be a random variable with cumulative distribution function $F(x) = \Pr(X \le x)$. If $F(x)$ is continuous, the random variable is said to be continuous. Let $f(x) = F'(x)$ where the derivative is defined. If $\int_{-\infty}^{\infty} f(x)\,\mathrm d x = 1$, then $f(x)$ is called the probability density function (PDF) of $X$. The set where $f(x) > 0$ is called the support of $X$.

Definition. The differential entropy $h(X)$ of a continuous random variable $X$ with density $f(x)$ is defined as

$$h(X) = - \int_S f(x) \log f(x)\,\mathrm d x,$$

where $S$ is the support set of the random variable. This is often written as $h(f)$, since it depends only on the PDF $f(x)$.
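To make the definition concrete, here is a small numerical sketch (assuming NumPy; the grid bounds and resolution are arbitrary choices of mine) that integrates $-f \log f$ for a standard normal and compares the result against the closed form $\frac 1 2 \log(2\pi e)$:

```python
import numpy as np

# Standard normal density on a grid wide enough to cover the effective support.
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# h(X) = -integral of f(x) log f(x) dx, approximated by a Riemann sum.
h_numeric = np.sum(-f * np.log(f)) * dx
h_closed_form = 0.5 * np.log(2 * np.pi * np.e)   # about 1.4189 nats for N(0, 1)

print(h_numeric, h_closed_form)
```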

Issues with Differential Entropy

Differential entropy violates the non-negativity axiom above. This can be seen via a simple example. Consider the uniform distribution $\mathcal U(0, \frac 1 2)$ defined on $[0, \frac 1 2]$. Its PDF is a constant $2$ over the entire support, so the differential entropy is negative:

$$H_d(X) = \int_0^{\frac 1 2} -2 \log(2)\,\mathrm d x = - \log 2.$$

Jaynes pointed out in his 1963 course note [2] that this entropy measure is also problematic for two other reasons. First, the measure is dimensionally incorrect: $\mathrm d x$ carries the dimension of $x$, whereas the remaining terms and the left-hand side should be dimensionless. Second, the expression is not invariant under a change of variable. Consider a distribution $p(x)$ defined over the interval $[a, b) \subset \mathbb R$. Under a change of variable, say $u = 2x$, the density becomes $p_u(u) = \frac 1 2 p\left(\frac u 2\right)$ and the entropy measure changes by an additive constant:

$$H(U) = - \int_{2a}^{2b} p_u(u) \log p_u(u)\,\mathrm d u = H(X) + \log 2.$$
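A concrete check (a sketch assuming NumPy; it reuses the uniform example from above) shows both issues at once: the value is negative, and rescaling by $u = 2x$ shifts it by $\log 2$:

```python
import numpy as np

def uniform_differential_entropy(a, b):
    """h(X) = log(b - a) for X ~ Uniform(a, b)."""
    return np.log(b - a)

h_x = uniform_differential_entropy(0.0, 0.5)   # X ~ U(0, 1/2)
h_u = uniform_differential_entropy(0.0, 1.0)   # U = 2X ~ U(0, 1)

print(h_x)          # -log 2, about -0.693: differential entropy can be negative
print(h_u - h_x)    #  log 2, about  0.693: the measure shifts under u = 2x
```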

Hence, to find an entropy measure that satisfies all four axioms, we need something else. Jaynes' course note uses non-standard notation and is quite difficult to understand. Below, I produce a cleaner derivation of the invariant measure.

Invariant Measure via the Limiting Density of Discrete Points

Our goal is to derive an invariant information measure that satisfies all four axioms. Note that here we are referring to the overall entropy measure, not the measure $m(x)$ below. Our overall strategy is to take a discrete probability distribution to the limit where $N \rightarrow \infty$.

Consider a probability density function $w(x)$ defined on the real line that is normalized via the integral

$$\int w(x)\,\mathrm d x = 1.$$

Sample $N$ points, and let $n_{a, b}$ be the number of points that fall into the interval $[a, b]$. The limit of the density can then be expressed as

$$\lim_{N\rightarrow \infty} \frac{n_{a, b}}{N} = \int_a^b m(x)\,\mathrm d x.$$

We can assume without loss of generality that we only sample points inside $[a, b]$, so that $n_{a, b} = N$. If we further assume the measure $m(x)$ to be constant over the support $[a, b]$, then

$$m(x) = \frac{1}{\vert b - a\vert}.$$

Now, to derive the invariant measure, assume that the passage to the limit is sufficiently well behaved. The distance between neighboring data points $x_i$ and $x_{i+1}$ then becomes

$$\lim_{N\rightarrow \infty} N \cdot \vert x_{i+1} - x_i\vert = m(x_i)^{-1},$$

which means

$$\lim_{N\rightarrow \infty} \vert x_{i+1} - x_i\vert = \frac{1}{N\,m(x_i)}.$$
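This limiting spacing can be checked empirically (a sketch assuming NumPy; the interval and sample size are arbitrary choices of mine): for points drawn with the constant density $m(x) = \frac{1}{\vert b - a\vert}$, $N$ times the typical spacing approaches $\vert b - a\vert = m(x)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, N = 0.0, 3.0, 1_000_000

# N points whose limiting density is the constant m(x) = 1 / (b - a).
x = np.sort(rng.uniform(a, b, size=N))
spacing = np.diff(x)

print(N * spacing.mean())   # close to b - a = 3.0, i.e. 1 / m(x)
```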

We know how to define the probability for this discrete distribution using the probability density function $w(x)$ of the continuous distribution under consideration:

$$p(x_i) = \left\vert x_{i+1} - x_i\right\vert \cdot w(x_i).$$

Now plug in the previous equation

$$p(x_i) = \frac{w(x_i)}{N\,m(x_i)}.$$

We can do a sanity check: for the uniform distribution $\mathcal U(a, b)$ with density $w(x) = \frac{1}{\vert b - a\vert}$, each point carries mass

$$p(x_i) = \frac 1 N \equiv \frac{1}{\vert b - a\vert} \cdot \frac{\vert b - a\vert}{N}.$$
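The same construction can be sanity-checked for a non-uniform $w(x)$ (a sketch assuming NumPy; the triangular density $w(x) = 2x$ on $[0, 1]$ is my own choice): with equally spaced points, so that $m(x) = \frac{1}{\vert b - a\vert}$, the masses $p(x_i) = w(x_i)/(N\,m(x_i))$ sum to approximately one, as a probability distribution should.

```python
import numpy as np

a, b, N = 0.0, 1.0, 10_000

# N equally spaced midpoints, so the measure m(x) is constant over [a, b].
x = np.linspace(a, b, N, endpoint=False) + (b - a) / (2 * N)
w = 2 * x                       # w(x) = 2x, a valid density on [0, 1]
m = 1.0 / (b - a)               # m(x) = 1 / |b - a|

p = w / (N * m)                 # p(x_i) = w(x_i) / (N m(x_i))
print(p.sum())                  # close to 1.0
```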

We can now write down the entropy of this discrete distribution and take its limit:

$$\begin{aligned} \lim_{N\rightarrow \infty} H(X) &= - \sum_i^N p(x_i) \log p(x_i)\\ &= - \int_a^b w(x) \log \frac{w(x)}{N\,m(x)}\,\mathrm d x. \end{aligned}$$

If we pull the $N$-dependent term out,

$$H(X) = \log N - \int_a^b w(x) \log\frac{w(x)}{m(x)}\,\mathrm d x,$$

where $m(x) = \frac{1}{\vert b - a\vert}$.

The first term, $\log N$, goes to infinity as $N \rightarrow \infty$. We need to subtract this term from the expression:

$$H'(X) = - \int_a^b w(x) \log\frac{w(x)}{m(x)}\,\mathrm d x.$$

In other words, for $N$ samples from a continuous distribution, the invariant information measure is

$$H_\text{inv}(X) = H_m(x_1, \dots, x_N) - \log N,$$

where $H_m(x_1, \dots, x_N)$ is the Shannon entropy of the discretized distribution above.

Numerical Results
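As a minimal numerical sanity check (a sketch assuming NumPy; the test density $w(x) = 2x$ on $[0, 1]$ and the value of $N$ are my own choices), we can compute the discrete entropy of the construction above, subtract $\log N$, and compare against the analytic $-\int_a^b w(x) \log\frac{w(x)}{m(x)}\,\mathrm d x$, which equals $\frac 1 2 - \log 2 \approx -0.193$ for this choice:

```python
import numpy as np

a, b, N = 0.0, 1.0, 100_000

# Equally spaced midpoints, so m(x) = 1 / |b - a| is constant over the support.
x = np.linspace(a, b, N, endpoint=False) + (b - a) / (2 * N)
w = 2 * x                                   # test density w(x) = 2x on [0, 1]
m = 1.0 / (b - a)

# Discrete masses p(x_i) = w(x_i) / (N m(x_i)), then subtract the divergent log N.
p = w / (N * m)
H_discrete = -np.sum(p * np.log(p))
H_inv_numeric = H_discrete - np.log(N)

H_inv_analytic = 0.5 - np.log(2)            # -integral of w log(w / m) dx for this w and m
print(H_inv_numeric, H_inv_analytic)        # both about -0.193
```

The estimate converges to the analytic value as $N$ grows, since the divergent $\log N$ term has been removed.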


References


  1. Shannon, C. E. (1997) ‘The mathematical theory of communication. 1963’, M.D. computing: computers in medical practice, 14(4), pp. 306–317. Available at: https://www.ncbi.nlm.nih.gov/pubmed/9230594.
  2. Jaynes, E. T. (1963) ‘Information Theory and Statistical Mechanics (Notes by the lecturer)’, Brandeis Summer Institute Lectures in Theoretical Physics (Accessed: 24 October 2021).