In a recent publication, I needed to connect the effective rank of a random matrix, which is defined on a discrete distribution, with the spectral entropy of a continuous spectrum. The first step in doing so is to find a good entropy measure for continuous distributions. I looked for a clean write-up that is easy to understand; Jaynes' notes in particular use non-standard notation that is quite confusing to read. To make it easier for others to check my math, I put together the following write-up.
This write-up is structured in the following parts:
- Axioms of entropy that give rise to Shannon entropy as a measure.
- E.T. Jaynes' derivation of the invariant information measure via the limiting density of discrete distributions
- Numerical results
Axioms of An Entropy Measure
Shannon derives the entropy measure on a discrete probability distribution via the following axioms on the information $I(p)$ carried by an event of probability $p$ [1]:
- monotonically decreasing: $I(p)$ is monotonically decreasing in $p$. If the probability of an event increases, there is less information from a single observation of that event.
- non-negativity: $I(p) \geq 0$. Information cannot be negative.
- $I(1) = 0$: events that always occur do not communicate information.
- additivity: $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$ if events $1$ and $2$ are independent of each other.
The negative log measure $I(p) = -\log p$ satisfies all four axioms on discrete distributions. Its expected value is called the Shannon entropy
$$H = -\sum_i p_i \log p_i,$$
where $p_i$ are the probabilities of individual cases in a discrete distribution.
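As a quick check of the statements above, here is a minimal sketch in plain Python. The function names `information` and `shannon_entropy` are illustrative choices, not from any library, and natural logarithms (nats) are an assumption since the base is not fixed in the text.

```python
import math

def information(p: float) -> float:
    """Information content -log(p) of an event with probability p, in nats."""
    return -math.log(p)

def shannon_entropy(probs) -> float:
    """Shannon entropy -sum_i p_i log p_i in nats.

    Cases with zero probability contribute nothing (0 * log 0 is taken as 0).
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

# The four axioms, checked on a few hypothetical probabilities.
p1, p2 = 0.5, 0.25
print(information(p2) > information(p1))   # rarer event, more information
print(information(p1) >= 0)                # non-negativity
print(information(1.0) == 0)               # certain events carry no information
print(math.isclose(information(p1 * p2), information(p1) + information(p2)))

# Entropy of a fair coin and of a certain outcome.
print(shannon_entropy([0.5, 0.5]), shannon_entropy([1.0]))
```

A fair coin gives $\log 2$ nats, and a distribution concentrated on a single case gives zero.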
Shannon extends the entropy from discrete distributions to continuous ones by simply replacing the sum with an integral. This is called the differential entropy.
Definition Let $X$ be a random variable with cumulative distribution function $F(x) = \Pr(X \leq x)$. If $F(x)$ is continuous, the random variable is said to be continuous. Let $f(x) = F'(x)$ when the derivative is defined. If $\int_{-\infty}^{\infty} f(x)\,dx = 1$, $f(x)$ is called the PDF for $X$. The set where $f(x) > 0$ is called the support of $X$.
Definition The differential entropy $h(X)$ of a continuous random variable $X$ with density $f(x)$ is defined as
$$h(X) = -\int_S f(x) \log f(x)\,dx,$$
where $S$ is the support set of the random variable. This is often written as $h(f)$ as it only depends on the PDF $f$.
Issues with Differential Entropy
Differential entropy violates the non-negativity axiom above. This can be seen via a simple example. Consider a uniform distribution defined between $0$ and $a$, with $a < 1$. The PDF is a constant $f(x) = 1/a$ over the entire support. The differential entropy is hence negative:
$$h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a}\,dx = \log a < 0.$$
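The negative value can be reproduced numerically. The sketch below estimates the differential entropy with a midpoint Riemann sum. The helper name `differential_entropy`, the grid size, and the interval width of $1/2$ are assumptions chosen for illustration; any width below one gives a negative result.

```python
import math

def differential_entropy(pdf, lo, hi, num=100_000):
    """Midpoint-rule estimate of -integral of f log f over [lo, hi], in nats."""
    dx = (hi - lo) / num
    total = 0.0
    for i in range(num):
        f = pdf(lo + (i + 0.5) * dx)
        if f > 0:  # treat 0 * log 0 as 0
            total -= f * math.log(f) * dx
    return total

width = 0.5  # illustrative choice, any width < 1 works
h = differential_entropy(lambda x: 1.0 / width, 0.0, width)

# The estimate should agree with the closed form log(width), which is negative.
print(h, math.log(width))
```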
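Jaynes pointed out in his 1963 course notes [2] that this entropy measure is also problematic for two other reasons. First, this measure is dimensionally incorrect. Notice that the density $f(x)$ inside the logarithm has the dimensionality of inverse space, whereas the argument of a logarithm and the l.h.s. should be dimensionless. Second, this expression is not invariant w.r.t. changes of variable. Consider a distribution with density $f(x)$ defined over the interval $[a, b]$. Upon a change of variable, say $y = c x$ with a constant $c > 0$, $c \neq 1$, the density becomes $\frac{1}{c} f(y/c)$ and the entropy measure changes:
$$h(Y) = -\int_{ca}^{cb} \frac{1}{c} f\!\left(\frac{y}{c}\right) \log\left[\frac{1}{c} f\!\left(\frac{y}{c}\right)\right] dy = h(X) + \log c \neq h(X).$$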
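The lack of invariance is easy to see numerically. The sketch below uses a hypothetical density $f(x) = 2x$ on $[0, 1]$ and a hypothetical rescaling $y = 3x$, neither of which comes from Jaynes' notes. Under a rescaling by a factor $c$, the differential entropy should shift by $\log c$.

```python
import math

def differential_entropy(pdf, lo, hi, num=20_000):
    """Midpoint-rule estimate of -integral of f log f over [lo, hi], in nats."""
    dx = (hi - lo) / num
    total = 0.0
    for i in range(num):
        f = pdf(lo + (i + 0.5) * dx)
        if f > 0:  # treat 0 * log 0 as 0
            total -= f * math.log(f) * dx
    return total

c = 3.0  # illustrative scale factor for the change of variable y = c * x

# Density of X on [0, 1], and the transformed density of Y = c X on [0, c].
h_x = differential_entropy(lambda x: 2.0 * x, 0.0, 1.0)
h_y = differential_entropy(lambda y: 2.0 * y / c**2, 0.0, c)

# The two entropies differ by log(c) even though X and Y describe the same
# distribution in different units.
print(h_x, h_y, h_y - h_x, math.log(c))
```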
Hence, to find an entropy measure that satisfies all four axioms, we need something else. Jaynes' course notes use non-standard notation and are quite difficult to understand. Below, I produce a cleaner derivation of the invariant measure.
Invariant Measure as Limiting Density of Discrete Probabilities
Our goal is to derive an invariant information measure that satisfies all four axioms. Note that here we are referring to the overall entropy measure $H$, not the measure $m(x)$ introduced below. Our overall strategy is to take a discrete probability distribution over $n$ points to the limit where $n \to \infty$.
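Consider a probability density function $f(x)$ defined on the real line that is normalized via the integral
$$\int f(x)\,dx = 1.$$
Sample $n$ points $x_1, \dots, x_n$, and suppose there are $n_{uv}$ points that fall into the interval $[u, v]$. The limit of the density of points, $m(x)$, can then be expressed as
$$\lim_{n \to \infty} \frac{n_{uv}}{n} = \int_u^v m(x)\,dx.$$
We can assume without loss of generality that we only sample points in the interval $[a, b]$, s.t. $\int_a^b m(x)\,dx = 1$. If we further assume the measure $m(x)$ to be a constant over the support $[a, b]$, then
$$m(x) = \frac{1}{b - a}.$$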
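Now to derive the invariant measure, if we assume that the passage to the limit is sufficiently well-behaved, the distance between nearby data points $x_i$ and $x_{i+1}$ becomes
$$\lim_{n \to \infty} n\,(x_{i+1} - x_i) = \frac{1}{m(x_i)}.$$
We know how to define the probability $p_i$ for this discrete distribution using the probability density function $f(x)$ of the continuous distribution under consideration:
$$p_i = f(x_i)\,(x_{i+1} - x_i).$$
Now plug in the previous equation:
$$p_i \to \frac{f(x_i)}{n\,m(x_i)}.$$
We can do a sanity check: we know that for a uniform distribution between $a$ and $b$, $f(x) = m(x) = \frac{1}{b - a}$, so every point carries the same probability
$$p_i \to \frac{1}{n}.$$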
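The limiting density of points is easier to picture with a non-constant example. The sketch below assumes a hypothetical density of points $m(x) = 2x$ on $[0, 1]$, chosen only for illustration. Its cumulative distribution is $x^2$, so the points $x_i = \sqrt{i/n}$ have that limiting density. The code checks that $n$ times the spacing approaches $1/m(x_i)$, and that the fraction of points in an interval $[u, v]$ approaches the integral of $m$ over it.

```python
import math

def m(x: float) -> float:
    """Hypothetical limiting density of points on [0, 1]."""
    return 2.0 * x

def grid_points(n: int):
    """n + 1 points on [0, 1] whose limiting density is m(x) = 2x.

    The cumulative distribution of m is x**2, so x_i = sqrt(i / n).
    """
    return [math.sqrt(i / n) for i in range(n + 1)]

n = 10_000
xs = grid_points(n)

# Spacing check: n * (x_{i+1} - x_i) * m(x_i) should be close to 1.
i = n // 2
ratio = n * (xs[i + 1] - xs[i]) * m(xs[i])
print(ratio)

# Counting check: the fraction of points in [u, v] should be close to the
# integral of m over [u, v], which is v**2 - u**2.
u, v = 0.3, 0.8
fraction = sum(u <= x <= v for x in xs) / len(xs)
print(fraction, v**2 - u**2)
```

If the density under study equals $m(x)$, then $p_i = m(x_i)(x_{i+1} - x_i)$ and `ratio` is exactly $n\,p_i$, so the same number also illustrates that each point carries a probability close to $1/n$.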
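We can now write down the entropy of this discrete distribution in its limit:
$$H_n = -\sum_i p_i \log p_i \to -\int f(x) \log \frac{f(x)}{n\,m(x)}\,dx.$$
If we pull the $n$-dependent term out, using $\int f(x)\,dx = 1$:
$$H_n \to \log n - \int f(x) \log \frac{f(x)}{m(x)}\,dx.$$
The first term goes to infinity. We need to subtract this term from the expression:
$$H = \lim_{n \to \infty} \left(H_n - \log n\right) = -\int f(x) \log \frac{f(x)}{m(x)}\,dx.$$
In other words, for $n$ samples from a continuous distribution, the invariant information measure is
$$H \approx -\sum_{i=1}^{n} p_i \log p_i - \log n.$$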
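As a numerical illustration, the sketch below computes the discrete entropy minus $\log n$ on $n$ equally spaced cells, which corresponds to a constant measure $m(x)$ over the support. The density $f(x) = 2x$ on $[0, 1]$, the rescaling factor of 3, and the function name `invariant_entropy` are assumptions made for this example. For this density the limit works out to $\frac{1}{2} - \log 2 \approx -0.1931$ nats.

```python
import math

def invariant_entropy(pdf, lo, hi, n):
    """Discrete entropy minus log(n) on n equally spaced cells of [lo, hi].

    Equal spacing corresponds to a constant measure m(x) on the support.
    """
    dx = (hi - lo) / n
    # p_i = f(x_i) * spacing, evaluated at the cell midpoints.
    probs = [pdf(lo + (i + 0.5) * dx) * dx for i in range(n)]
    total = sum(probs)
    probs = [p / total for p in probs]  # guard against discretization error
    h_n = -sum(p * math.log(p) for p in probs if p > 0)
    return h_n - math.log(n)

c = 3.0  # illustrative scale factor for the change of variable y = c * x

def f(x):
    return 2.0 * x          # density of X on [0, 1]

def g(y):
    return 2.0 * y / c**2   # density of Y = c X on [0, c]

# The value settles as n grows and is the same in both coordinate systems.
for n in (10, 100, 1_000, 10_000):
    print(n, invariant_entropy(f, 0.0, 1.0, n), invariant_entropy(g, 0.0, c, n))
```

The two columns agree because the cell probabilities $p_i$ do not depend on the units of $x$: rescaling the variable stretches the cells and shrinks the density by the same factor.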
1. Shannon, C. E. (1997) ‘The mathematical theory of communication. 1963’, M.D. Computing: Computers in Medical Practice, 14(4), pp. 306–317. Available at: https://www.ncbi.nlm.nih.gov/pubmed/9230594.
2. Jaynes, E. T. (1963) ‘Information Theory and Statistical Mechanics (Notes by the lecturer)’. PDF available at: Brandeis notes, DOI: semanticscholar.com (Accessed: 24 October 2021).