
Quantile loss function for machine learning

Motivation

It is not always sufficient for a machine learning model to make accurate predictions. For many commercial applications, it is equally important to have a measure of the prediction uncertainty.

We recently worked on a project where predictions were subject to high uncertainty. The client required their decisions to be driven both by the predicted machine learning output and by a measure of the potential prediction error. The quantile regression loss function solves this and similar problems by replacing single-value predictions with prediction intervals.

This post introduces the quantile regression loss function, gives an intuitive explanation of why it works, and walks through an example in Keras.

The quantile regression loss function

Machine learning models work by minimizing (or maximizing) an objective function. An objective function translates the problem we are trying to solve into a mathematical formula to be minimized by the model. As the name suggests, the quantile regression loss function is applied to predict quantiles. A quantile is the value below which a given fraction of observations in a group falls. For example, a prediction for quantile 0.9 should over-predict 90% of the time.

Given a predicted value y_p and an actual outcome y, the regression loss for a quantile q is

L(y_p, y) = max[q(y − y_p),  (q − 1)(y − y_p)]

For a set of predictions, the loss will be the average.
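
For example, take q = 0.9 and an actual value y = 10. An under-prediction of y_p = 8 gives a loss of max[0.9 × 2, −0.1 × 2] = 1.8, while an over-prediction of y_p = 12 gives max[0.9 × (−2), −0.1 × (−2)] = 0.2. Errors in the under-prediction direction are therefore penalized nine times as heavily, which pushes the prediction up towards the 0.9 quantile.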

A mathematical derivation of the above formula can be found in the Quantile Regression article on WikiWand. If you are interested in an intuitive explanation, read the following section. If you are just looking to apply the quantile loss function to a deep neural network, skip to the example section below.

Intuitive explanation

Let’s start the intuitive explanation by considering the most commonly used quantile, the median. If q is set to 0.5 in the equation above, the loss reduces to the mean absolute error (scaled by a constant factor of 0.5), which predicts the median. This is equivalent to saying that the mean absolute error loss function has its minimum at the median.

A simple example is probably the easiest way to explain why this is the case. Consider three points on a vertical line at different distances from each other: an upper point, a middle point and a lower point. In this one-dimensional example, the absolute error is the same as the distance. The hypothesis to be confirmed is that the mean absolute error is minimized at the median (the middle point). To check it, start at the middle point and move upward by some distance d: the distance to the upper point decreases by d, but the distances to the middle and lower points each increase by d, so the mean absolute error (i.e. the mean distance to the three points) increases. The same applies when moving downward. The hypothesis is confirmed: the middle point is both the median and the minimum of the mean absolute loss function. If instead of a single upper point and a single lower point we had one hundred points above and one hundred below, or any other (equal) number, the result would still hold.
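
If you prefer to see this numerically, a quick NumPy check (the three example points below are arbitrary) confirms that the mean absolute distance is smallest at the median:

import numpy as np

# Three arbitrary points on a vertical line: lower, middle (median) and upper.
points = np.array([0.0, 1.0, 3.0])


def mean_abs_distance(c):
    # Mean absolute distance from a candidate location c to the three points.
    return np.mean(np.abs(points - c))


# Evaluate a fine grid of candidate locations and keep the best one.
candidates = np.linspace(-1.0, 4.0, 501)
best = candidates[np.argmin([mean_abs_distance(c) for c in candidates])]
print(best, np.median(points))  # both are (approximately) 1.0, the middle point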

In the regression loss equation above, as q has a value between 0 and 1, the first term is positive and dominates when under-predicting, y > y_p, and the second term dominates when over-predicting, y < y_p. For q equal to 0.5, under-prediction and over-prediction are penalized by the same factor, and the median is obtained. The larger the value of q, the more under-predictions are penalized compared to over-predictions. For q equal to 0.75, under-predictions are penalized by a factor of 0.75 and over-predictions by a factor of 0.25, so the model tries to avoid under-predictions three times as hard as over-predictions, and the 0.75 quantile is obtained.
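
Continuing the numerical check above, the same idea applies to the full quantile loss. The sketch below (using an arbitrary sample of normally distributed values) shows that the constant prediction minimizing the q = 0.75 loss is the 0.75 quantile of the sample:

np.random.seed(0)
samples = np.random.randn(100000)  # arbitrary standard normal sample


def quantile_loss_np(q, y, y_p):
    # NumPy version of the quantile regression loss for a constant prediction y_p.
    e = y - y_p
    return np.mean(np.maximum(q*e, (q-1)*e))


# Evaluate a grid of candidate predictions and keep the one with the lowest loss.
q = 0.75
candidates = np.linspace(-3.0, 3.0, 601)
best = candidates[np.argmin([quantile_loss_np(q, samples, c) for c in candidates])]
print(best, np.quantile(samples, q))  # both close to 0.67, the 0.75 quantile of N(0, 1)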

Keras example

Time to consolidate the theoretical knowledge with an example. We will use the TensorFlow 2.0 Keras API. If you have not had the time to try TensorFlow 2.0 yet, we suggest having a look at this blog.

In our example, we will exploit the (perhaps surprising) ability of deep neural networks (DNNs) to approximate any continuous function, provided the DNN has sufficient parameters (i.e. neurons) and training data. The x − y relationship to be learned by the DNN is the following:

y = x + sin(πx/2) + N(μ = 0, σ² = 0.2²),

where the last term represents a random sample from a normal distribution with zero mean and a standard deviation of 0.2. Figure 1 plots the first two terms on the right-hand side of the equation, which can be approximated by a DNN to any desired accuracy level. The final term is unpredictable, but its effect can be modeled through quantile regression.

Figure 1: Predictable terms in the x − y relationship.

Let’s start by defining a random generator of instances that satisfy the x − y relationship described above. This is achieved by the function get_data below, which provides the attribute matrix and output vector for a number of instances (num). The variables x_train and y_train will be used for training the model, and x_test and y_test for testing the model accuracy.

import numpy as np
import tensorflow as tf


def f_predictable(x):
    return x+np.sin(np.pi*x/2)


def f(x, std=0.2):
    return f_predictable(x)+np.random.randn(len(x))*std


def get_data(num, start=0, end=4):
    # Random x values uniformly drawn from [start, end) and their noisy y values.
    x = np.sort(np.random.rand(num)*(end-start)+start)
    y = f(x)
    # Keras expects a 2-D attribute matrix, hence the reshape.
    return x.reshape(-1, 1), y

x_train, y_train = get_data(num=20000)
x_test, y_test = get_data(num=1000)

The loss function for a quantile q, given the set of predictions y_p and the actual values y, is:

def quantile_loss(q, y, y_p):
    e = y-y_p
    return tf.keras.backend.mean(tf.keras.backend.maximum(q*e, (q-1)*e))

Our example Keras model has three fully connected hidden layers, each with one hundred neurons. We use the model to predict the quantiles 0.023, 0.5 and 0.977; the code below is for quantile 0.977. Although the model requirements decrease for quantiles closer to the median (i.e. quantile 0.5), we were surprised by the number of neurons (300) and data points (20,000) required to obtain accurate results.

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(100, activation='relu', input_dim=1))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='linear'))
# The lambda function is used to input the quantile value to the quantile
# regression loss function. Keras only allows two inputs in user-defined loss
# functions, actual and predicted values.
quantile = 0.977
model.compile(optimizer='adam', loss=lambda y, y_p: quantile_loss(quantile, y, y_p))
model.fit(x_train, y_train, epochs=20)
prediction = model.predict(x_test)
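
To produce all three sets of predictions plotted in Figure 2, the same architecture can simply be retrained once per quantile. A minimal sketch (the helper function and the predictions dictionary below are just illustrative names):

def fit_quantile_model(q):
    # Rebuild the architecture above and train it for a single quantile q.
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(100, activation='relu', input_dim=1))
    model.add(tf.keras.layers.Dense(100, activation='relu'))
    model.add(tf.keras.layers.Dense(100, activation='relu'))
    model.add(tf.keras.layers.Dense(1, activation='linear'))
    model.compile(optimizer='adam',
                  loss=lambda y, y_p: quantile_loss(q, y, y_p))
    model.fit(x_train, y_train, epochs=20)
    return model

predictions = {q: fit_quantile_model(q).predict(x_test)
               for q in (0.023, 0.5, 0.977)}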

Figure 2 shows the predictions for quantiles 0.023, 0.5 and 0.977. The 0.5-quantile prediction is an accurate approximation of the predictable terms in the x − y relationship (see Figure 1). This is not surprising, as the median is not affected by noise from a normal distribution whose mean (and median) is zero. The noise does influence the other quantiles: approximately 95% of the instances fall between the 0.023-quantile and 0.977-quantile predictions. Given the properties of the normal distribution, the 0.023 and 0.977 quantiles are two standard deviations below and above the median, so the vertical distance between them should be four standard deviations (0.8), which is modeled well (see Figure 2). Increasing the prediction accuracy is as simple as adding more neurons and more training instances, provided you have the patience and the computational power!
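
As a quick sanity check on the 95% figure, the empirical coverage of the interval can be computed on the test set (using the illustrative predictions dictionary from the sketch above):

lower = predictions[0.023].flatten()
upper = predictions[0.977].flatten()
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(coverage)  # should be close to 0.95 if the quantile models fit well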

Figure 2: Predictions for quantiles 0.023, 0.5 and 0.977 and actual values (test instances).

Next steps

To learn more about loss functions for machine learning, have a look at this blog. If you would like us to cover any other machine learning topics, or if you have any machine learning-related questions, just contact us.