# Introduction
This is an advanced API for minimisers and least-squares solvers.
Optimistix takes a generalised approach to line searches, trust regions, and learning rates. This generalised approach is built on two notions: a "search", and a "descent".
## Searches
Consider a function \(f \colon \mathbb{R}^n \to \mathbb{R}\) that we would like to minimise. Searches consume information about values+gradients+Hessians etc. of \(f\), and produce a single scalar value. This corresponds to the distance along a line search, the radius of a trust region, or the value of a learning rate.
## Descents
Descents consume information about values+gradients+Hessians etc. of \(f\), along with this scalar value, and compute the update to make. This corresponds to `-scalar * gradient` for gradient descent, `-scalar * (Hessian^{-1} gradient)` for (Gauss--)Newton algorithms, a step of length `scalar` along the dogleg path for dogleg descent, `-(Jacobian^T Jacobian + scalar * I)^{-1} gradient` for Levenberg--Marquardt (damped Newton), and so on. [Although `Jacobian^T Jacobian` isn't actually materialised: the implementation does the smart thing and solves a least-squares problem using QR.]
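As a minimal sketch of the arithmetic (plain JAX, not Optimistix's actual implementation; the function and point are only illustrative):

```python
import jax
import jax.numpy as jnp


def f(y):
    # Illustrative target: the Rosenbrock function.
    return (1 - y[0]) ** 2 + 100 * (y[1] - y[0] ** 2) ** 2


y = jnp.array([1.5, 1.5])
gradient = jax.grad(f)(y)
hessian = jax.hessian(f)(y)

scalar = 0.1  # in practice, produced by the search

steepest_update = -scalar * gradient  # gradient descent
newton_update = -scalar * jnp.linalg.solve(hessian, gradient)  # (Gauss--)Newton
```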
## Examples
- Gradient descent is obtained by combining a fixed learning rate with steepest descent.
- Levenberg--Marquardt is obtained by combining a trust region algorithm with a damped Newton descent.
- BFGS is obtained by combining a backtracking Armijo line search with a Newton descent.
- etc.
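These ready-made combinations are available as solvers in their own right. A quick usage sketch (the objective and tolerances here are only illustrative):

```python
import jax.numpy as jnp
import optimistix as optx


def f(y, args):
    # Illustrative objective with minimum at y = (1, 1, 1).
    return jnp.sum((y - 1) ** 2)


solver = optx.BFGS(rtol=1e-8, atol=1e-8)  # backtracking Armijo + Newton descent
sol = optx.minimise(f, solver, jnp.zeros(3))
print(sol.value)  # approximately [1. 1. 1.]
```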
## Acceptance/rejection
The search is evaluated on every step of the solver. The descent is only evaluated on some steps.
Consider performing a backtracking search along a Newton descent direction. In this case, we don't need to re-solve the linear system `Hessian^{-1} gradient` at every step: we typically fix these "for the duration of" the search.
Quote marks are used above because we don't make a distinction between the steps of the overall optimiser and the steps within e.g. a line search: they are flattened into a single loop. This ends up being an easier abstraction to work with in generality, and a useful performance optimisation when handling batches of data.
Thus, we refer to "accepted" steps as those at which we re-evaluate the descent, and to "rejected" steps as the others. The search gets to decide which steps are accepted and rejected. For example, a learning rate accepts every step, whilst a backtracking search accepts only those steps with a good enough improvement (those satisfying the Armijo condition).
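Schematically, the accept/reject logic of a backtracking Armijo search looks something like the following (a plain-Python sketch, not Optimistix's internal flattened loop; the parameter values are illustrative):

```python
import jax
import jax.numpy as jnp


def backtracking_armijo(f, y, init_step=1.0, slope=0.1, shrink=0.5):
    f_y, gradient = jax.value_and_grad(f)(y)
    update = -gradient  # descent direction, fixed "for the duration of" the search
    step = init_step
    while True:
        y_new = y + step * update
        # Armijo condition: the actual decrease must be at least `slope` times
        # the decrease predicted by the local linear approximation.
        if f(y_new) <= f_y + slope * step * jnp.dot(gradient, update):
            return y_new  # accepted step
        step = shrink * step  # rejected step: backtrack and try again
```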
## Minimisation vs least-squares
- **Minimisation:** for a minimisation problem "minimise \(f \colon \mathbb{R}^n \to \mathbb{R}\)", the quantities that might be evaluated are typically evaluations \(f(y)\), gradients \(\nabla f(y)\), and (approximations to) the Hessian \(\nabla^2 f(y)\).
- **Least squares:** for a least-squares problem, we typically start with a function producing residuals \(r \colon \mathbb{R}^n \to \mathbb{R}^m\), and seek to minimise \(0.5 \sum_i r(y)_i^2\). This is simply a minimisation problem for \(f(y) = 0.5 \sum_i r(y)_i^2\). Evaluations \(f(y)\) can be obtained directly. The gradient is \(\nabla f(y) = (\nabla r(y))^{T} r(y)\), which can be computed efficiently as a vector-Jacobian product. The Hessian may be approximated via the Gauss--Newton approximation \(\nabla^2 f(y) \approx (\nabla r(y))^{T} \nabla r(y)\).
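For instance, these quantities can be computed with JAX as follows (a sketch with an illustrative residual function; Optimistix computes these for you):

```python
import jax
import jax.numpy as jnp


def r(y):
    # Illustrative residuals, mapping R^2 -> R^2.
    return jnp.stack([y[0] - 1, 10 * (y[1] - y[0] ** 2)])


y = jnp.array([0.5, 0.5])
residuals, vjp_fn = jax.vjp(r, y)
f_y = 0.5 * jnp.sum(residuals**2)  # f(y)
(gradient,) = vjp_fn(residuals)  # (∇r(y))^T r(y), as a vector-Jacobian product
jacobian = jax.jacfwd(r)(y)  # ∇r(y)
gauss_newton_hessian = jacobian.T @ jacobian  # ≈ ∇²f(y)
```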
## API
All searches inherit from `optimistix.AbstractSearch`, and all descents inherit from `optimistix.AbstractDescent`. See the searches and descents pages.

The varying evaluation/gradient/Hessian/Jacobian information is passed to these as an `optimistix.FunctionInfo`. See the function info page.
## Custom solvers
Finally, the really cool thing about these abstractions is how they can be mixed-and-matched! For example:
```python
from collections.abc import Callable

import optimistix as optx


class HybridSolver(optx.AbstractBFGS):
    rtol: float
    atol: float
    norm: Callable
    use_inverse: bool = True
    descent: optx.AbstractDescent = optx.DoglegDescent()
    search: optx.AbstractSearch = optx.LearningRate(0.1)
```
This solver will:

- form a quadratic approximation to the target function, using the approximate Hessian that is iteratively built up by the BFGS algorithm;
- then build a piecewise-linear dogleg-shaped descent path (interpolating between the steepest descent and Newton descent directions);
- then make a fixed step of length `0.1` down this path.
Moreover, this can be used to solve either minimisation or least squares problems.
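For example, the same solver instance can be passed to either entry point (a hypothetical usage sketch; the objective, residuals, and tolerances are placeholders):

```python
import jax.numpy as jnp
import optimistix as optx


def loss(y, args):
    # Scalar objective, for minimisation.
    return jnp.sum((y - 1) ** 2)


def residuals(y, args):
    # Vector of residuals, for least squares.
    return y - 1


solver = HybridSolver(rtol=1e-6, atol=1e-6, norm=optx.max_norm)
sol1 = optx.minimise(loss, solver, jnp.zeros(3))
sol2 = optx.least_squares(residuals, solver, jnp.zeros(3))
```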
As such, these abstractions make it possible to build very flexible optimisers, and to try out whatever works best on your problem.
**Info:** Optimistix makes heavy use of these abstractions internally. For example, take a look at the source code for `optimistix.LevenbergMarquardt`: it works by subclassing `optimistix.AbstractGaussNewton` (which provides the overall strategy), and then setting a choice of descent and search. You can define custom solvers in exactly the same way.
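For instance, the same pattern gives a Levenberg--Marquardt-like solver (a sketch of the idea, ignoring details such as verbosity options; see the actual source for the real thing):

```python
from collections.abc import Callable

import optimistix as optx


class MyLevenbergMarquardt(optx.AbstractGaussNewton):
    rtol: float
    atol: float
    norm: Callable = optx.max_norm
    verbose: frozenset = frozenset()
    # Damped Newton steps, with the damping set by a trust region algorithm.
    descent: optx.AbstractDescent = optx.DampedNewtonDescent()
    search: optx.AbstractSearch = optx.ClassicalTrustRegion()
```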