
The Hessian matrix appears in the optimization literature, but the intuition for how the Hessian and its inverse transform vectors is opaque to me. Let's review second order partial derivatives, and then try to build intuition for the Hessian matrix.

For the purpose of this intuition-building exercise, we'll work with functions $\Reals^2 \mapsto \Reals^1$. I'll also use partial derivative notations $\frac{\partial}{\partial y} f(x, y) = \frac{\partial f}{\partial y} = f_y$ interchangeably.

1. Partial Derivatives

Take the $\Reals^2 \mapsto \Reals^1$ function $f(x, y) = x^2 + 2y^2$.

A partial derivative is the change in an "output" variable (in our case, $f$) with respect to infinitesimal changes in an "input" variable (in our case, $x$ or $y$). For example, $\frac{\partial}{\partial y} f(x, y) = 4y$, which is to say: for any point in the domain, moving infinitesimally in the $y$ direction changes $f$ at a rate proportional to 4 times the $y$ coordinate of the starting point.

# Our example function
f(x, y) = x^2 + 2y^2

# A sample point in the domain, and the plotting ranges
x = 6
y = 4
dy = 4      # a step in the y direction (unused in the code shown here)

xlim = [-10, x]
ylim = [-10, 10]

xs = LinRange(xlim..., 101)
ys = LinRange(ylim..., 101)
zs = [f(x, y) for x in xs, y in ys]

# The partial derivative of f with respect to y
f_y(y) = 4y
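As a quick numeric sanity check (a sketch I'm adding; this finite difference is not part of the original notebook), the forward difference of $f$ in the $y$ direction at our sample point $(6, 4)$ agrees with $f_y$:

# Hypothetical forward-difference check of ∂f/∂y at (x, y) = (6, 4)
h = 1e-6
(f(x, y + h) - f(x, y)) / h     # ≈ 16.0
f_y(y)                          # 16, matching the estimate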

We can plot the function $f_y$ for every starting point:

We can do the exact same exercise with $f_x$:

using PlotlyJS

# The partial derivative of f with respect to x
f_x(x) = 2x

# 3-D scene settings (assumed here; `scene` is not defined in the cell as shown)
scene = attr(aspectmode="cube")

traces = GenericTrace[]

# The surface of f itself
push!(traces, PlotlyJS.surface(x=xs, y=ys, z=zs,
        showscale=false, opacity=0.8))
# A flat reference plane at z = 0
push!(traces, PlotlyJS.surface(x=ylim, y=ylim, z=[[0, 0] [0, 0]],
        showscale=false, colorscale="Greys", opacity=0.3))
# The surface of f_x evaluated at every starting point
push!(traces, PlotlyJS.surface(x=xs, y=ys, z=[f_x(x) for x in xs, y in ys],
        showscale=false))

plot(traces, Layout(scene=scene))

The second order partial derivative is defined as a composition, e.g. $f_{xx} = \frac{\partial}{\partial x} \left( \frac{\partial}{\partial x}\left( f(x, y) \right) \right) $. As second derivatives do, it captures the [change in the [change in the [output variable]]] with respect to infinitesimal changes in the input variable. This notion coincides with the curvature of the function: a positive second derivative at a particular point indicates that the output variable is concave up at that point, and a negative second derivative indicates that it is concave down at that point.

In the case of the function we've chosen, $f_{xx} = 2$ and $f_{yy} = 4$, which tells us that $f$ is concave up everywhere in the domain; this matches what we see in the plot.
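A central-difference check (again a sketch I'm adding, not from the original notebook) recovers these constants at our sample point:

# Hypothetical central-difference estimates of the unmixed second partials
h = 1e-4
(f(x + h, y) - 2f(x, y) + f(x - h, y)) / h^2    # ≈ 2.0  (f_xx)
(f(x, y + h) - 2f(x, y) + f(x, y - h)) / h^2    # ≈ 4.0  (f_yy)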

However, we've omitted the "mixed" partial derivatives here: $f_{xy}$ and $f_{yx}$. We can compute them to both be zero for this particular function. What does that tell us?

It's clear from the graph above that infinitesimal changes in $x$ do not influence the value of $f_y$. Interpreting $f_{yx} = \frac{\partial}{\partial x} \left( \frac{\partial}{\partial y}\left( f(x, y) \right) \right)$ as $\frac{\partial}{\partial x} f_y $, it's clear $f_{yx} = 0$. But how can we interpret that in terms of the original function f?

Here is where we find the intuition. The mixed second order partial derivatives tell us how the slope along one coordinate axis changes as we move infinitesimally along an orthogonal coordinate axis. In the function we've chosen, the slice of the graph at any x coordinate is the same parabola (just vertically offset by $x^2$) and thus has the same slope for any y. The mixed second order partial derivatives could thus be said to relay information about the "pucker" of the graph of the function (I made that name up) which is the concavity of the graph with respect to two coordinate axes.

In order to see this more clearly, let's add a mixed term to our function, to produce non-zero mixed second order partial derivatives:

# Add a mixed term to produce non-zero mixed second order partial derivatives
g(x, y) = x^2 + 2y^2 - x*y

zs = [g(x, y) for x in xs, y in ys]

# Partial derivatives of g
g_x(x, y) = 2x - y
g_y(x, y) = 4y - x

The slope along the y coordinate changes as we vary x. This function is "puckered" up.
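We can confirm this numerically (a quick finite-difference sketch I'm adding, not part of the original notebook): nudging $x$ changes the $y$-slope of $g$ at a rate of about $-1$, while the same nudge leaves the $y$-slope of $f$ unchanged:

# Hypothetical finite-difference estimates of the mixed partials at (x, y) = (6, 4)
h = 1e-3
(g_y(x + h, y) - g_y(x, y)) / h                                       # ≈ -1.0  (g_yx)
((f(x + h, y + h) - f(x + h, y)) - (f(x, y + h) - f(x, y))) / h^2     # ≈  0.0  (f_yx)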

  • Note: For twice-continuously differentiable functions, $f_{xy} = f_{yx}$. This is called the symmetry property of second derivatives, and it leads to a symmetric Hessian matrix. We'll assume this property of our functions throughout.

2. Hessian Matrix

Having now built intuition for each of the entries in the Hessian Matrix, we can begin to build intuition for the matrix itself:

$$ \bold{H}_f = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} f_{xx} & f_{yx} \\ \\ f_{xy} & f_{yy} \end{bmatrix}$$

In some sense, the Hessian is a matrix-valued function we can evaluate for any point $\bold{v_0} = \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$ in the domain of $f$:

$$ \bold{H}_f(\bold{v_0}) = \begin{bmatrix} f_{xx}(\bold{v_0}) & f_{yx}(\bold{v_0}) \\ \\ f_{xy}(\bold{v_0}) & f_{yy}(\bold{v_0}) \end{bmatrix}$$
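For our example $g(x, y) = x^2 + 2y^2 - xy$, we can write this matrix-valued function explicitly. Here's a small sketch I'm adding (the helper name H_g is mine, not from the original notebook); for a quadratic function like $g$ the Hessian comes out the same at every point:

# The Hessian of g as a matrix-valued function of a point v = [x, y] (hypothetical helper).
# For this quadratic g it is constant in v.
H_g(v) = [2.0 -1.0; -1.0 4.0]     # [g_xx  g_yx; g_xy  g_yy]

H_g([6.0, 4.0])                   # 2×2 Matrix{Float64}: 2.0 -1.0; -1.0 4.0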

The Hessian shows up in the quadratic approximation (second order Taylor expansion) of multivariate functions around a particular point $\bold{v_0}$:

$$ Q_f(\bold{v_0}) = \color{green} f(\bold{v_0}) \color{black} + \color{blue} \nabla_f(\bold{v_0}) \cdot (\bold{v} - \bold{v_0}) \color{black} + \color{indigo} {1 \over 2}(\bold{v} - \bold{v_0})^\intercal \left[ \bold{H}_f(\bold{v_0}) \right] (\bold{v} - \bold{v_0})$$
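To see this concretely (another sketch I'm adding; grad_g and Q_g are names I'm introducing, and this is not the notebook's own code), here is the quadratic approximation for $g$ around a point. It reproduces $g$ exactly, because $g$ is itself quadratic:

# Second-order Taylor expansion of g around v0 (hypothetical helpers)
grad_g(v) = [2v[1] - v[2], 4v[2] - v[1]]     # ∇g = [g_x, g_y]

Q_g(v, v0) = g(v0...) + grad_g(v0)' * (v - v0) + 0.5 * (v - v0)' * H_g(v0) * (v - v0)

v0 = [6.0, 4.0]
Q_g([5.0, 3.0], v0), g(5.0, 3.0)    # (28.0, 28.0): equal, since g equals its own quadratic approximation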

And we might ask: how does this matrix scale $(\bold{v} - \bold{v_0})$? Let's try to build some intuition:

Suppose we multiply some vector against that Hessian Matrix:

$$\begin{bmatrix} f_{xx}(\bold{v_0}) & f_{yx}(\bold{v_0}) \\ \\ f_{xy}(\bold{v_0}) & f_{yy}(\bold{v_0}) \end{bmatrix} \cdot \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} f_{xx} \cdot x + f_{xy} \cdot y \\ f_{yy} \cdot y + f_{yx} \cdot x \end{bmatrix} = $$

$$ \begin{bmatrix} ( \text{the rate of change of the slope in the x direction as you move in the x direction} ) ( \text{a distance in the x direction} ) \ + \\ ( \text{the rate of change of the slope in the x direction as you move in the y direction} ) ( \text{a distance in the y direction} ) \\ \\ ( \text{the rate of change of the slope in the y direction as you move in the y direction} ) ( \text{a distance in the y direction} ) \ + \\ ( \text{the rate of change of the slope in the y direction as you move in the x direction} ) ( \text{a distance in the x direction} ) \end{bmatrix} = $$

$$ \begin{bmatrix} \text{the approximate change in the slope in the x direction after moving a distance } x \text{ from } x_0 \\ \text{the approximate change in the slope in the y direction after moving a distance } y \text{ from } y_0 \end{bmatrix} = $$

$$ \text{how a second-order approximation of } f \text{ suggests the gradient changes between } \bold{v_0} \text{ and } \bold{v_0} + \begin{bmatrix} x \\ y \end{bmatrix} $$

However, let's note that the quadratic approximation of $f$ includes the quadratic form $(\bold{v} - \bold{v_0})^\intercal \left[ \bold{H}_f(\bold{v_0}) \right] (\bold{v} - \bold{v_0})$, not just a matrix-vector product. Applying this second product results in a scalar, which is the change in the value of the function that comes from extrapolating the (second-order) curvature of the function at $\bold{v_0}$ out along $\bold{v} - \bold{v_0}$.
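Both interpretations check out numerically for $g$ (a sketch reusing the hypothetical grad_g and H_g helpers introduced above; none of this is in the original notebook):

# The Hessian-vector product predicts the change in the gradient; the quadratic form
# collapses that prediction into a scalar contribution to the function's value.
v0 = [6.0, 4.0]
d  = [1.0, 2.0]                    # a displacement away from v0

H_g(v0) * d                        # [0.0, 7.0]
grad_g(v0 + d) - grad_g(v0)        # [0.0, 7.0]  (exact here, since g is quadratic)

0.5 * d' * H_g(v0) * d             # 7.0, the curvature term's contribution to Q_g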

3. Newton's Method

Newton's method for optimization is an iteration of

  • forming the quadratic approximation $Q_f(\bold{v_0})$ of a function around the current point.
  • jumping to the min/max of the paraboloid approximation for the next iteration.

This requires us to identify the min/max of the second-order approximation of $f$. Recall, the quadratic approximation (second order Taylor expansion) of $f$ at a particular point $\bold{v_0}$ is

$$ Q_f(\bold{v_0}) = \color{green} f(\bold{v_0}) \color{black} + \color{blue} \nabla_f(\bold{v_0}) \cdot (\bold{v} - \bold{v_0}) \color{black} + \color{indigo} {1 \over 2}(\bold{v} - \bold{v_0})^\intercal \left[ \bold{H}_f(\bold{v_0}) \right] (\bold{v} - \bold{v_0})$$

Searching for the extrema of this approximation, we can set $\nabla \left[ Q_f(\bold{v_0}) \right] = 0$, taking the gradient term by term:

  • $ \nabla \left[ \color{green} f(\bold{v_0}) \color{black} \right] = 0$
  • $ \nabla \left[ \color{blue} \nabla_f(\bold{v_0}) \cdot (\bold{v} - \bold{v_0}) \color{black} \right] = \nabla_f(\bold{v_0})$
  • $ \nabla \left[ \color{indigo} {1 \over 2}(\bold{v} - \bold{v_0})^\intercal \left[ \bold{H}_f(\bold{v_0}) \right] (\bold{v}- \bold{v_0}) \color{black} \right] = {1 \over 2} \left( \left[ \bold{H}_f(\bold{v_0}) \right] + \left[ \bold{H}_f(\bold{v_0}) \right]^\intercal \right) (\bold{v} - \bold{v_0})$

We can simplify ${1 \over 2} \left( \left[ \bold{H}_f(\bold{v_0}) \right] + \left[ \bold{H}_f(\bold{v_0}) \right]^\intercal \right) (\bold{v} - \bold{v_0})$ further: because of the symmetry property of second derivatives, $\left[ \bold{H}_f(\bold{v_0}) \right] = \left[ \bold{H}_f(\bold{v_0}) \right]^\intercal$, so this expression simplifies to ${1 \over 2} \left( 2 \left[ \bold{H}_f(\bold{v_0}) \right] \right) (\bold{v} - \bold{v_0}) = \left[ \bold{H}_f(\bold{v_0}) \right] (\bold{v} - \bold{v_0})$.

So all together, $\nabla \left[ Q_f(\bold{v_0}) \right] = \color{blue} \nabla_f(\bold{v_0}) \color{black} + \color{indigo} \left[ \bold{H}_f(\bold{v_0}) \right] (\bold{v} - \bold{v_0})$.

Well, $\nabla_f(\bold{v_0}) + \left[ \bold{H}_f(\bold{v_0}) \right] (\bold{v} - \bold{v_0}) = 0$ when:

$$ \begin{aligned} \left[ \bold{H}_f(\bold{v_0}) \right] (\bold{v} - \bold{v_0}) &= -\nabla_f(\bold{v_0}) \\ \bold{v} &= \bold{v_0} - \left[ \bold{H}_f(\bold{v_0}) \right]^{-1} \cdot \nabla_f(\bold{v_0}) \end{aligned} $$
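To make this concrete, here is a worked example I'm adding, using our $g$ and the starting point $(6, 4)$ from the code above:

$$ \bold{H}_g = \begin{bmatrix} 2 & -1 \\ -1 & 4 \end{bmatrix}, \quad \nabla_g\left(\begin{bmatrix} 6 \\ 4 \end{bmatrix}\right) = \begin{bmatrix} 8 \\ 10 \end{bmatrix}, \quad \bold{H}_g^{-1} = \frac{1}{7}\begin{bmatrix} 4 & 1 \\ 1 & 2 \end{bmatrix} $$

$$ \bold{v} = \begin{bmatrix} 6 \\ 4 \end{bmatrix} - \frac{1}{7}\begin{bmatrix} 4 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 8 \\ 10 \end{bmatrix} = \begin{bmatrix} 6 \\ 4 \end{bmatrix} - \begin{bmatrix} 6 \\ 4 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} $$

which is the exact minimizer of $g$, reached in a single step, because $g$ is itself quadratic and so equals its own quadratic approximation.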

So $\bold{v}$ is the vector which denotes the extremum of the quadratic approximation of $f$ around $\bold{v_0}$. The familiar Newton method expression pops out of our above derivation:

$$\bold{v_{k+1}} = \bold{v_k} - \left[ \bold{H}_f(\bold{v_k}) \right]^{-1} \nabla_f(\bold{v_k})$$

And now we know that this expression which seemed to come out of nowhere, $- \left[ \bold{H}_f(\bold{v_k}) \right]^{-1} \nabla_f(\bold{v_k})$, inverse Hessian and all (!!), is really just the step that takes us to the $\bold{v}$ which solves $\nabla \left[ Q_f(\bold{v_k}) \right] = 0$.
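Putting it all together in code, here is a minimal sketch of the iteration for $g$, reusing the hypothetical grad_g and H_g helpers from above (this is not the notebook's original implementation):

# Newton's method for minimizing g: repeatedly jump to the extremum of the
# quadratic approximation around the current point.
function newton(grad, hess, v0; iters=10)
    v = copy(v0)
    for _ in 1:iters
        v -= hess(v) \ grad(v)     # solve H * step = ∇ instead of forming the inverse
    end
    v
end

newton(grad_g, H_g, [6.0, 4.0])    # [0.0, 0.0]: the minimizer of g, reached in one step

Solving the linear system with `\` computes the same step as multiplying by the inverse Hessian, and is the usual way to do it in practice.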
