Quantile loss function for machine learning
It is not always sufficient for a machine learning model to make accurate predictions. For many commercial applications, it is equally important to have a measure of the prediction uncertainty.
We recently worked on a project where predictions were subject to high uncertainty. The client required their decisions to be driven by both the model's prediction and a measure of the potential prediction error. The quantile regression loss function addresses this and similar problems by replacing a single-value prediction with a prediction interval.
This post introduces the quantile regression loss function, gives an intuitive explanation of why it works, and works through an example in Keras.
The quantile regression loss function
Machine learning models work by minimizing (or maximizing) an objective function. An objective function translates the problem we are trying to solve into a mathematical formula to be minimized by the model. As the name suggests, the quantile regression loss function is applied to predict quantiles. A quantile is the value below which a given fraction of the observations in a group falls. For example, a prediction for quantile 0.9 should be above the actual outcome 90% of the time.
Given a prediction $y_i^p$ and outcome $y_i$, the regression loss for a quantile $q$ is
$L(y_i^p, y_i) = \max[q(y_i - y_i^p), (q - 1)(y_i - y_i^p)]$
For a set of predictions, the loss will be the average.
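To make the formula concrete, here is a minimal NumPy sketch of the averaged loss. The function name `quantile_loss_np` and the sample values are illustrative assumptions and are not part of the example later in this post.

```python
import numpy as np


def quantile_loss_np(q, y, y_p):
    """Average quantile loss for quantile q, outcomes y and predictions y_p."""
    e = np.asarray(y, dtype=float) - np.asarray(y_p, dtype=float)
    # Under-predictions (e > 0) cost q * e; over-predictions (e < 0) cost (1 - q) * |e|.
    return float(np.mean(np.maximum(q * e, (q - 1) * e)))


# Illustrative outcomes and a constant prediction of 2.0.
y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([2.0, 2.0, 2.0])

# Errors are [-1, 0, 2]. For q = 0.9 the per-point losses are [0.1, 0.0, 1.8].
loss_q90 = quantile_loss_np(0.9, y_true, y_pred)
# For q = 0.5 the per-point losses are [0.5, 0.0, 1.0].
loss_q50 = quantile_loss_np(0.5, y_true, y_pred)
print(loss_q90, loss_q50)
```

With these values the under-prediction of the third point dominates the loss for q = 0.9, while for q = 0.5 both directions are weighted equally.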
A mathematical derivation of the above formula can be found in the Quantile Regression article on Wikiwand. If you are interested in an intuitive explanation, read the following section. If you are just looking to apply the quantile loss function to a deep neural network, skip to the example section below.
Let’s start the intuitive explanation by considering the most commonly used quantile, the median. If q is substituted with 0.5 in the equation above, the result is the mean absolute error scaled by one half. A constant factor does not change where a function has its minimum, so this loss predicts the median. This is equivalent to saying that the mean absolute error loss function has its minimum at the median.
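Spelled out, writing e for the prediction error, the substitution q = 0.5 gives the following. This is a short worked step added for illustration, not part of the referenced derivation.

```latex
L(y_i^p, y_i) = \max\left[0.5\,e,\; -0.5\,e\right] = 0.5\,|e|, \qquad e = y_i - y_i^p
```

So at q = 0.5 the quantile loss is the absolute error up to a constant factor of one half, and that factor does not move the location of the minimum.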
A simple example is probably the easiest way to explain why this is the case. Consider three points on a vertical line at different distances from each other: an upper point, a middle point and a lower point. In this one-dimensional example, the absolute error is the same as the distance. The hypothesis to be confirmed is that the mean absolute error is at its minimum at the median (the middle point). To check our hypothesis, we start at the middle point and move upward. This brings us closer to the upper point but farther, by the same distance, from both the middle and lower points. This obviously increases the mean absolute error (i.e. the mean distance to the three points). The same applies if moving downward. Our hypothesis is confirmed, as the middle point is both the median and the minimum of the mean absolute error loss function. If instead of a single upper point and a single lower point we had one hundred points above and one hundred below, or any other equal number on each side, the result would still stand.
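The three-point argument can also be checked numerically. The sketch below uses illustrative point positions (1, 2 and 10) and a brute-force search over candidate predictions. Both are assumptions made for the demonstration.

```python
import numpy as np

# Three points on a line at uneven distances (illustrative values).
points = np.array([1.0, 2.0, 10.0])

# Evaluate the mean absolute error of every candidate prediction on a fine grid.
candidates = np.linspace(0.0, 12.0, 1201)
mae = np.abs(points[None, :] - candidates[:, None]).mean(axis=1)

best = candidates[np.argmin(mae)]
print("minimizer of the mean absolute error:", best)
print("median of the points:", np.median(points))
print("mean of the points:", points.mean())
```

The minimizer coincides with the median rather than the mean, even though the upper point sits far away from the other two.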
In the regression loss equation above, as q has a value between 0 and 1, the first term will be positive and dominate when under-predicting, $y_i > y_i^p$, and the second term will dominate when over-predicting, $y_i < y_i^p$. For q equal to 0.5, under-prediction and over-prediction will be penalized by the same factor, and the median is obtained. The larger the value of q, the more under-predictions are penalized compared to over-predictions. For q equal to 0.75, under-predictions will be penalized by a factor of 0.75, and over-predictions by a factor of 0.25. The model will then try three times as hard to avoid under-predictions as over-predictions, and the 0.75 quantile will be obtained.
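The same brute-force check works for other quantiles. The sketch below assumes standard normal samples and a grid of constant predictions, both chosen only for illustration. It searches for the constant that minimizes the quantile loss for q = 0.75 and compares it with the empirical quantile.

```python
import numpy as np


def quantile_loss_np(q, y, y_p):
    # Average quantile loss; y_p may be a scalar or an array shaped like y.
    e = y - y_p
    return np.mean(np.maximum(q * e, (q - 1) * e))


rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=10_000)

q = 0.75
# Try constant predictions on a grid and keep the one with the lowest loss.
candidates = np.linspace(-3.0, 3.0, 601)
losses = np.array([quantile_loss_np(q, y, c) for c in candidates])
best = candidates[np.argmin(losses)]

print("minimizer of the quantile loss:", best)
print("empirical 0.75 quantile:", np.quantile(y, q))
print("fraction of samples below the minimizer:", np.mean(y < best))
```

Up to the grid resolution, the minimizer matches the empirical 0.75 quantile, and about three quarters of the samples fall below it.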
Time to consolidate the theoretical knowledge with an example. We will use the TensorFlow 2.0 Keras API. If you have not had the time to try TensorFlow 2.0, we suggest having a look at this blog.
In our example, we will exploit the (perhaps surprising) ability of deep neural networks (DNNs) to approximate any continuous function, provided the DNN has sufficient parameters (i.e. neurons) and training data. The x − y relationship to be learned by the DNN is the following:
$y = x + \sin(\pi x / 2) + N(\mu = 0, \sigma^2 = 0.2^2)$,
where the last term represents a randomly obtained sample from a normal distribution with zero mean and 0.2 standard deviation. Figure 1 plots the first two terms on the right side of the equation, which can be approximated by a DNN to any desired accuracy level. The final term is unpredictable but it can be modeled through quantile regression.
Let’s start by defining a random generator of instances that satisfy the x − y relationship described above. This is achieved by the function get_data below, which provides the attribute matrix and output vector for a number of instances (num). The variables x_train and y_train will be used for training the model, and x_test and y_test for testing the model accuracy.
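import numpy as np
import tensorflow as tf


def f_predictable(x):
    # Deterministic part of the x - y relationship.
    return x + np.sin(np.pi * x / 2)


def f(x, std=0.2):
    # Add zero-mean Gaussian noise with standard deviation std.
    return f_predictable(x) + np.random.randn(len(x)) * std


def get_data(num, start=0, end=4):
    # Draw num x values uniformly from [start, end) and sort them.
    x = np.sort(np.random.rand(num) * (end - start) + start)
    y = f(x)
    # Return the attribute matrix (num x 1) and the output vector.
    return x.reshape(-1, 1), y


x_train, y_train = get_data(num=20000)
x_test, y_test = get_data(num=1000)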
The loss function for a quantile q, the set of predictions y_p, and the actual values y is:
def quantile_loss(q, y, y_p):
    # Prediction error: positive when under-predicting.
    e = y - y_p
    return tf.keras.backend.mean(tf.keras.backend.maximum(q * e, (q - 1) * e))
Our example Keras model has three fully connected hidden layers, each with one hundred neurons. For our example, we use the model to predict the quantiles 0.023, 0.5 and 0.977. The code below is for quantile 0.977. Although the model requirements decrease for quantiles closer to the median (i.e. quantile 0.5), we were actually surprised by the number of neurons (300) and data points (20,000) required to obtain accurate results.
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(100, activation='relu', input_dim=1))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='linear'))

# The lambda function is used to input the quantile value to the quantile
# regression loss function. Keras only allows two inputs in user-defined loss
# functions, actual and predicted values.
quantile = 0.977
model.compile(optimizer='adam',
              loss=lambda y, y_p: quantile_loss(quantile, y, y_p))
model.fit(x_train, y_train, epochs=20)
prediction = model.predict(x_test)
Figure 2 shows the predictions for quantiles 0.023, 0.5 and 0.977. The 0.5-quantile prediction is an accurate approximation of the predictable terms in the x − y relationship (see Figure 1). This is not surprising, as the median is not affected by adding noise from a normal distribution with a mean (and median) of zero. The normal distribution does influence the other quantiles, and approximately 95% of the instances lie between the 0.023-quantile and 0.977-quantile predictions. Given the properties of the normal distribution, the 0.023 and 0.977 quantiles are approximately two standard deviations below and above the median. Hence, the vertical distance between the 0.023 and 0.977 quantiles should be about four standard deviations (0.8), which is modeled well (see Figure 2). Increasing the prediction accuracy is as simple as adding more neurons and more instances for training, provided you have the patience and computational power!
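The coverage and interval-width figures quoted above can be sanity-checked without training a model. The sketch below is an assumption-laden stand-in: it replaces the trained predictions with the ideal quantile curves (the predictable terms plus or minus two standard deviations) and measures how many noisy samples fall between them. The variable names `lower`, `upper` and `coverage` are illustrative.

```python
import numpy as np


def f_predictable(x):
    # Deterministic part of the x - y relationship used in this post.
    return x + np.sin(np.pi * x / 2)


rng = np.random.default_rng(0)
std = 0.2
x = rng.uniform(0.0, 4.0, size=100_000)
y = f_predictable(x) + rng.normal(0.0, std, size=x.shape)

# Idealized quantile curves: the median plus/minus two standard deviations.
# Predictions from trained 0.023- and 0.977-quantile models would replace these.
lower = f_predictable(x) - 2 * std
upper = f_predictable(x) + 2 * std

# Fraction of samples inside the interval and the mean vertical distance.
coverage = np.mean((y >= lower) & (y <= upper))
width = np.mean(upper - lower)
print(f"coverage: {coverage:.3f}, mean interval width: {width:.2f}")
```

Swapping `lower` and `upper` for the model outputs on x_test gives a quick numerical counterpart to reading the result off Figure 2.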