Dr. Alexander Fisher
Duke University
MM stands for “majorize-minimize” and “minorize-maximize”.
Key idea: it’s easier to optimize a surrogate function than the true objective.
Let \(f(\theta)\) be a function we wish to maximize. \(g(\theta | \theta_n)\) is a surrogate function for \(f\), anchored at current iterate \(\theta_n\), if
\(g\) “minorizes” \(f\): \(g(\theta | \theta_n) \leq f(\theta) \ \ \forall \ \theta\)
\(g(\theta_n | \theta_n) = f(\theta_n)\) (“tangency”).
Equivalently, if we wish to minimize \(f(\theta)\), \(g(\theta | \theta_n)\) is a surrogate function for \(f\), anchored at current iterate \(\theta_n\), if
\(g\) “majorizes” \(f\): \(g(\theta | \theta_n) \geq f(\theta) \ \ \forall \ \theta\)
\(g(\theta_n | \theta_n) = f(\theta_n)\) (“tangency”).
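Why iterating on a surrogate makes progress (a short sandwich argument using only the two properties above): if \(\theta_{n+1}\) maximizes, or merely improves, \(g(\cdot | \theta_n)\), then
\[ f(\theta_{n+1}) \geq g(\theta_{n+1} | \theta_n) \geq g(\theta_n | \theta_n) = f(\theta_n), \]
so the objective values never decrease. The minimization case is identical with the inequalities reversed.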
We wish to minimize \(f(x) = \cos(x)\).
We need a surrogate \(g\) that majorizes \(f\).
\[ g(x | x_n) = \cos(x_n) - \sin(x_n)(x - x_n) + \frac{1}{2}(x - x_n)^2 \]
We can minimize \(g\) easily: \(\frac{d}{dx}g(x | x_n) = -\sin(x_n) + (x - x_n)\).
Setting this derivative equal to zero and taking the solution as the next iterate gives \(x_{n+1} = x_n + \sin(x_n)\).
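A minimal sketch of this iteration in Python (the starting point, iteration cap, and tolerance below are illustrative choices, not from the lecture):

```python
import math

def mm_minimize_cos(x0, max_iter=100, tol=1e-10):
    """Minimize f(x) = cos(x) via the MM update x_{n+1} = x_n + sin(x_n)."""
    x = x0
    for _ in range(max_iter):
        x_new = x + math.sin(x)      # minimizer of the quadratic majorizer g(x | x_n)
        if abs(x_new - x) < tol:     # stop once the iterates stabilize
            return x_new
        x = x_new
    return x

# Starting near 2, the iterates converge to pi, where cos attains its minimum of -1.
print(mm_minimize_cos(2.0))  # approximately 3.141592653589793
```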
Finding \(g\) is an art. Still, there are widely applicable and powerful tools everyone should have in their toolkit.
Objective function: \(f(x)\)
Second-order Taylor expansion of \(f\) around \(x_n\):
\[ f(x) = f(x_n) + f'(x_n) (x-x_n) + \frac{1}{2} f''(y) (x - x_n)^2 \]
Here, \(y\) lies between \(x\) and \(x_n\). If \(f''(y) \leq B\) for all \(y\), where \(B\) is a positive constant, then
\[ g(x|x_n) = f(x_n) + f'(x_n) (x - x_n) + \frac{1}{2} B (x - x_n)^2 \]
This is the “quadratic upper bound”.
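Minimizing this surrogate in closed form (a quick calculation, included here for completeness) gives the update
\[ x_{n+1} = x_n - \frac{f'(x_n)}{B}, \]
a gradient step with fixed step size \(1/B\). The \(\cos(x)\) example above is this bound with \(B = 1\), since the second derivative of \(\cos(x)\) is bounded by \(1\).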
Equivalently, a function is convex if its epigraph (the set of points lying on or above the graph of the function) forms a convex set. For example, \(f(x) = |x|\) is convex by the epigraph test.
A function is concave iff its negative is convex.
For a twice-differentiable convex function \(f\), \(f(x) \geq f(x_n) + f'(x_n) (x - x_n)\), because \(f''(y) \geq 0\) in the Taylor expansion above; the tangent line at \(x_n\) minorizes \(f\).
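For instance (an illustration, not from the text), \(f(x) = -\ln(x)\) is convex on \((0, \infty)\), so
\[ -\ln(x) \geq -\ln(x_n) - \frac{1}{x_n}(x - x_n), \]
and the tangent line on the right is a valid minorizing surrogate anchored at \(x_n\).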
For a convex function \(f\), Jensen’s inequality states
\[ f(\alpha x + (1 - \alpha) y) \leq \alpha f(x) + (1-\alpha) f(y), \ \ \alpha \in [0, 1] \]
Applying Jensen's inequality with \(\alpha = \frac{u_n}{u_n + v_n}\), \(x = \frac{u_n + v_n}{u_n}\,u\), and \(y = \frac{u_n + v_n}{v_n}\,v\) (so that \(\alpha x + (1-\alpha) y = u + v\)):
\[ f(u + v) \leq \frac{u_n}{u_n + v_n} f\left(\frac{u_n + v_n}{u_n} u\right) + \frac{v_n}{u_n + v_n} f\left(\frac{u_n + v_n}{v_n}v\right) \]
Using \(f(x) = -\ln(x)\), show that Jensen's inequality lets us derive a minorization that splits the log of a sum.
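A sketch of the derivation (spelled out here; only the positivity of \(u, v, u_n, v_n\) is assumed): substituting \(f(x) = -\ln(x)\) into the inequality above and multiplying by \(-1\),
\[ \ln(u + v) \geq \frac{u_n}{u_n + v_n} \ln\left(\frac{u_n + v_n}{u_n}\, u\right) + \frac{v_n}{u_n + v_n} \ln\left(\frac{u_n + v_n}{v_n}\, v\right), \]
with equality at \((u, v) = (u_n, v_n)\). The right-hand side is a sum of separate terms in \(u\) and \(v\), which is exactly what "splitting the log of a sum" buys us.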
Note: this minorization will be useful in maximum likelihood estimation under a mixture model
To see why this will be useful, recall: in a mixture model we have a convex combination of density functions:
\[ h(x | \mathbf{w}, \boldsymbol{\theta}) = \sum_{i = 1}^n w_i~f_i(x | \theta_i). \]
where \(w_i > 0\) and \(\sum_i w_i = 1\).
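Applying the same Jensen argument to all \(n\) terms at once (a sketch; the weights \(\pi_{ni}\) and the iterate notation \(w_{ni}, \theta_{ni}\) are introduced here for illustration and are not in the text above), the log-density splits as
\[ \ln h(x | \mathbf{w}, \boldsymbol{\theta}) \geq \sum_{i = 1}^n \pi_{ni} \ln\left(\frac{w_i f_i(x | \theta_i)}{\pi_{ni}}\right), \qquad \pi_{ni} = \frac{w_{ni} f_i(x | \theta_{ni})}{\sum_{j = 1}^n w_{nj} f_j(x | \theta_{nj})}, \]
so the minorizing surrogate separates across components and can be maximized term by term.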
The content of this lecture is based on Chapter 1 of Dr. Ken Lange's MM Optimization Algorithms.
Lange, Kenneth. MM Optimization Algorithms. Society for Industrial and Applied Mathematics, 2016.