Think of it as a mismatch score.
np.sum(p * np.log(p / q)) asks:
At places where P believes probability is important, how badly does Q disagree?
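For concreteness, the expression can be run directly on a small discrete example (the probabilities here are made up purely for illustration):

```python
import numpy as np

# Two discrete distributions over the same 3 outcomes (arbitrary values).
p = np.array([0.5, 0.3, 0.2])   # the "truth"
q = np.array([0.4, 0.4, 0.2])   # the model

# KL divergence of Q from P, in nats.
kl = np.sum(p * np.log(p / q))
print(kl)  # a small positive number: Q is close to P but not identical
```

Because Q only mildly disagrees with P, the mismatch score comes out small but still positive.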
Here is the intuition piece by piece.
Suppose at one x-location:
- p is large, q is also large

Then p/q is near 1, log(p/q) is near 0, so that point
contributes almost nothing. This means Q agrees with P there.
Now suppose:
- p is large, q is very small

Then p/q is big, log(p/q) is positive and large, and
after multiplying by p, that point contributes a lot. This means
P says this region matters but Q is not paying enough attention,
so the penalty is large.
Now suppose:
- p is tiny, q is wrong there

Then even if q is very different, the contribution is small because
you multiply by p. This means KL mostly cares about being wrong where P puts mass.
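All three cases can be seen side by side by looking at the per-point contributions before summing (again with made-up numbers):

```python
import numpy as np

# One point per case described above (illustrative values).
p = np.array([0.45, 0.45, 0.10])
q = np.array([0.45, 0.05, 0.50])

# Per-point contributions to KL, before np.sum.
contrib = p * np.log(p / q)
# index 0: p and q agree           -> contribution ~ 0
# index 1: p large, q very small   -> large positive contribution
# index 2: p tiny, q very wrong    -> contribution stays small in magnitude
print(contrib)
```

The middle point dominates the total, even though the last point is also badly mismatched: the weighting by p is doing the work.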
So the weighting by p is the key idea:
p_i log(p_i / q_i)
says:
How wrong is Q at point i, weighted by how much P cares about that point?
Why the log?
The log turns ratios into a sensible relative penalty.
- if p = q, then p/q = 1, and log(1) = 0
- if q is half of p, you get a penalty
- if q is much smaller than p, the penalty grows a lot
- if q is larger than p, the term can become negative locally, but total KL is still nonnegative
So log is measuring relative surprise or relative mismatch, not just raw difference.
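These bullet points can be checked numerically by holding p fixed at one point and varying q (the value 0.4 is arbitrary):

```python
import numpy as np

p = 0.4  # P's probability at one point (an arbitrary illustrative value)

# Contribution p * log(p/q) for several choices of q at that point.
penalties = {q: p * np.log(p / q) for q in [0.4, 0.2, 0.04, 0.8]}
for q, pen in penalties.items():
    print(f"q={q}: contribution {pen:+.3f}")
```

q equal to p gives exactly 0, halving q gives a moderate positive penalty, shrinking q further makes it grow, and q larger than p gives a locally negative term.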
Very Simple Picture
Imagine P is the truth, and Q is your model.
Then KL is asking:
If the world really follows P, how much extra surprise do I get by pretending it follows Q instead?
- small KL means Q is a good approximation of P
- large KL means Q misses important parts of P
Physical Intuition With Curves
If P(x) is centered at 0 and Q(x) is shifted to the right:
- near x = 0, P is high
- but Q may be low there
- that creates a large penalty
That is why KL increases as the two curves separate.
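This can be sketched numerically by discretizing two unit-width Gaussian curves on a grid and shifting Q further and further to the right (a rough numerical approximation; for unit Gaussians the true KL is shift²/2 nats):

```python
import numpy as np

x = np.linspace(-10, 10, 2001)          # grid covering both curves

def curve(mu, sigma=1.0):
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return g / g.sum()                  # normalize samples to sum to 1

p = curve(mu=0.0)                       # P centered at 0
kls = []
for shift in [0.5, 1.0, 2.0]:
    q = curve(mu=shift)                 # Q shifted to the right
    kls.append(np.sum(p * np.log(p / q)))
print(kls)                              # grows as the curves separate
```

The computed values track shift²/2 closely, and they grow quadratically: separating the curves twice as far costs four times the penalty.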
One-Sentence Intuition
np.sum(p * np.log(p / q)) is:
the average log-mismatch of Q from P, where the averaging is done using P itself.