Model Fit: Overfitting vs Underfitting: The Governing Path of Machine Learning
Let’s face it: even before we were properly exposed to data science, we had probably heard both of these terms, overfitting and underfitting. They can be regarded as the guiding philosophy of machine learning because every machine learning model is subject to a trade-off between the two, and that trade-off dictates its performance. Every machine learning algorithm therefore seeks to create models that offer the best trade-off between them.
But why do we care about it?
Whenever we model any data using machine learning, the end objective is that the trained model should be able to correctly predict the output label when a previously unseen set of inputs is provided to it. So how does the model achieve this? It does so by learning a decision mapping or simply a function from the set of inputs to the output label during the training process.
Therefore, the validity and performance of our model can only be assessed when it is evaluated on previously unseen data. So how can we say whether a model will give good predictions for previously unseen data? It depends on the ‘generalizability’ of the model: if the decision mapping learned during training remains valid for previously unseen data, so that the model produces correct predictions for it, the model can be regarded as generalizable.
As we will see, both overfitting and underfitting are hindrances to a model’s generalizability; a perfectly generalized model wouldn’t suffer from either. In reality, though, it’s impossible to achieve a perfectly generalized model with no overfitting and no underfitting. Instead, we rely on a trade-off in which we strive to reduce both to the point where we achieve a ‘best fit’, the maximum possible generalizability for a model.
What is a model?
Representation of a statistical model [source]
Before understanding overfitting and underfitting, we must understand what a model is. In the realm of statistics and data science:
A model can be understood as an abstract representation of the real world that is created using only the data we are provided, otherwise called a ‘sample’.
As an analogy, if we want to make a generic model of a physical classroom, then each physical aspect of the classroom, such as the number of benches, the number of desks, or the dimensions of the whiteboard, is information (data) associated with it that we can use to model it.
A model can also be thought of as a mathematical function that maps a set of inputs to an output. The set of inputs and the output are different ‘aspects’ of our model, and through machine learning we attempt to establish a relationship between them. As an example, given the number of benches and desks in a classroom, we can easily establish a relationship that computes the number of students who can attend the class simultaneously.
The Notion of Overfit, Good fit, and Underfit [source]
So how does this notion extend to the idea of overfitting and underfitting? Let’s consider a scenario where a student needs a tailor-made school uniform. The tailor first needs some information about the student’s physique so that the uniform will fit properly. But there’s a catch: although the tailor is extremely skilful and can take very accurate measurements and tailor the uniform exactly to them, the fabric used is known to shrink by some amount when washed. Because the amount of shrinkage can’t be predetermined, there’s always a degree of uncertainty about how well the uniform will fit the student.
So how can the tailor accommodate this uncertainty so that the uniform still fits the student? If the tailor makes the uniform much looser than the measurements, then even after shrinking it will fit loosely, or ‘underfit’ the student. If the tailor makes the uniform exactly as per the measurements, then after shrinking it is bound to be tight, or ‘overfit’ the student. So what’s the solution? The tailor should leave only a judicious margin for shrinkage, so that even after washing the uniform offers a perfect fit, or ‘best fit’, for the student.
Understanding Overfitting and Underfitting With Regression Models
Let us perform a simple experiment. To understand the notion of underfitting and overfitting, we will try to fit a few regression models to a set of data points. Our very first step will be to import a few Python libraries to enable us to fit the regression models and plot them:
Let us get ourselves some sample data points to fit the model upon:
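As a minimal sketch, assume ten hypothetical points scattered around a quadratic trend, stored as NumPy arrays. The values and the names x_train and y_train are illustrative assumptions, not the data behind the plots in this article:

```python
import numpy as np

# Hypothetical sample: ten illustrative points scattered around a
# quadratic trend. These are NOT the values plotted in the article.
x_train = np.arange(10.0)  # inputs 0.0, 1.0, ..., 9.0
y_train = np.array([3.8, 0.5, 1.9, 0.8, 4.1, 4.6, 9.6, 12.3, 20.0, 24.7])

# Print the points as (x, y) pairs.
for x, y in zip(x_train, y_train):
    print(f"x = {x:.1f}, y = {y:.1f}")
```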
Here’s how they appear in a scatterplot:
Scatter plot of 10 data points in a 2D plane
First, we fit a basic linear regression line to these data points and compute the regression line, represented by ‘a’ here, from the fitted model’s coefficient and y-intercept. We then plot the regression line along with the data points:
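A sketch of this step, assuming a hypothetical sample of ten points (illustrative values, not the article's data) and NumPy's polyfit as the least-squares fitter. The helper names r_squared and plot_fit are assumptions made for this example, and the score it prints will not match the scores quoted in this article:

```python
import numpy as np

# Hypothetical sample (illustrative values, not the article's data).
x_train = np.arange(10.0)
y_train = np.array([3.8, 0.5, 1.9, 0.8, 4.1, 4.6, 9.6, 12.3, 20.0, 24.7])


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Least-squares straight line. np.polyfit returns the highest power first,
# so for degree 1 it yields (slope, intercept).
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# 'a' holds the fitted line evaluated at the training inputs.
a = slope * x_train + intercept
linear_r2 = r_squared(y_train, a)
print(f"a = {slope:.3f} * x + {intercept:.3f}, r-squared = {linear_r2:.3f}")


def plot_fit():
    """Scatter the points and draw the fitted line (needs matplotlib)."""
    import matplotlib.pyplot as plt

    plt.scatter(x_train, y_train, label="data")
    plt.plot(x_train, a, color="red", label="linear fit")
    plt.legend()
    plt.show()
```

Calling plot_fit() draws the points together with the fitted line.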
Upon executing the above piece of code we obtain the following result:
Linear Regression Model fitting the 10 data points
A basic linear regression line like the one shown above models our data points only passably: it doesn’t do a very good job of capturing the overall trend, and we obtain an r-squared score of 0.85. But can we do better? Let’s try to fit a polynomial regression line of degree 2 and plot it:
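A sketch of the degree-2 fit, again assuming a hypothetical sample of ten points (illustrative values, not the article's data) and using numpy.polynomial.Polynomial.fit as the least-squares fitter. The names quadratic, r_squared and plot_fit are assumptions made for this example, and the printed score will differ from the one reported for the original data:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical sample (illustrative values, not the article's data).
x_train = np.arange(10.0)
y_train = np.array([3.8, 0.5, 1.9, 0.8, 4.1, 4.6, 9.6, 12.3, 20.0, 24.7])


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Least-squares polynomial of degree 2. Polynomial.fit rescales x internally
# for numerical stability; the result can be called on raw x values.
quadratic = Polynomial.fit(x_train, y_train, deg=2)
quadratic_r2 = r_squared(y_train, quadratic(x_train))
print(f"degree 2: r-squared on the training points = {quadratic_r2:.3f}")


def plot_fit():
    """Scatter the points and draw the fitted curve (needs matplotlib)."""
    import matplotlib.pyplot as plt

    grid = np.linspace(x_train.min(), x_train.max(), 200)
    plt.scatter(x_train, y_train, label="data")
    plt.plot(grid, quadratic(grid), color="red", label="degree 2")
    plt.legend()
    plt.show()
```

Calling plot_fit() draws the points together with the fitted curve.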
The above piece of code generates the following plot:
Polynomial Regression Model of degree 2 fitting the 10 data points
This time we obtain an r-squared score of 0.87. Although that isn’t a major improvement, it appears that as we increase the degree of the polynomial, the r-squared score increases and the overall trend of the data is captured more accurately. One may therefore argue that we should keep increasing the degree of the polynomial, since it keeps improving the score. Let’s find out what happens then by fitting a polynomial regression line of degree 9 and plotting it:
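A sketch of the degree-9 fit under the same assumptions: a hypothetical sample of ten points (illustrative values, not the article's data) and numpy.polynomial.Polynomial.fit as the fitter. A degree-9 polynomial has ten coefficients, so with ten distinct inputs it can pass through every training point. The names ninth, r_squared and plot_fit are assumptions made for this example:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical sample (illustrative values, not the article's data).
x_train = np.arange(10.0)
y_train = np.array([3.8, 0.5, 1.9, 0.8, 4.1, 4.6, 9.6, 12.3, 20.0, 24.7])


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Degree 9 on ten points: as many coefficients as observations.
ninth = Polynomial.fit(x_train, y_train, deg=9)
ninth_r2 = r_squared(y_train, ninth(x_train))
print(f"degree 9: r-squared on the training points = {ninth_r2:.3f}")


def plot_fit():
    """Scatter the points and draw the fitted curve (needs matplotlib)."""
    import matplotlib.pyplot as plt

    grid = np.linspace(x_train.min(), x_train.max(), 400)
    plt.scatter(x_train, y_train, label="data")
    plt.plot(grid, ninth(grid), color="red", label="degree 9")
    plt.legend()
    plt.show()
```

Calling plot_fit() draws the curve on a dense grid, which makes the swings between neighbouring training points visible.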
Upon executing the above piece of code we obtain the following plot:
Polynomial Regression Model of degree 9 fitting the 10 data points
Our model produces an r-squared score of 0.99 this time! With such an impressive score, that appears to be an astoundingly good regression model. Or is it?
As we read earlier, the goodness of any model is determined by its generalizability. So to determine how well this model generalizes, let’s add five additional observations to our synthetic dataset, which supposedly belong to the same distribution as the original data sample. It is worth noting that the model hasn’t been trained on these data points. Here’s how the model behaves for the newly added data points:
Polynomial Regression Model of degree 9 being tested for 5 additional data points
So what just happened? The degree-9 polynomial had earned the best r-squared score earlier, yet it fails to model any of these new points, and this time we get a negative r-squared score, which means its predictions for these points are worse than simply predicting their mean. We can conclude that it is not a generalized model: although it performs extremely well on the training data, it is unable to model any new data point it hasn’t been trained on. Such a model can be regarded as an overfitting model or a high variance model.
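The held-out check described above can be sketched as follows, assuming a hypothetical training sample of ten points and five hypothetical additional observations placed between them (illustrative values, not the article's data). The names x_test, y_test and r_squared are assumptions made for this example:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical training sample (illustrative values, not the article's data).
x_train = np.arange(10.0)
y_train = np.array([3.8, 0.5, 1.9, 0.8, 4.1, 4.6, 9.6, 12.3, 20.0, 24.7])

# Five hypothetical additional observations from the same trend.
# The model never sees these during fitting.
x_test = np.array([0.5, 2.5, 4.5, 6.5, 8.5])
y_test = np.array([2.4, 0.9, 4.5, 10.8, 22.3])


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Fit on the training points only, then score on both sets.
ninth = Polynomial.fit(x_train, y_train, deg=9)
train_r2 = r_squared(y_train, ninth(x_train))
test_r2 = r_squared(y_test, ninth(x_test))

print(f"degree 9, training points:   r-squared = {train_r2:.3f}")
print(f"degree 9, additional points: r-squared = {test_r2:.3f}")
```

With these illustrative values the degree-9 polynomial reproduces the training points almost exactly, yet its r-squared on the additional points drops below zero.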
According to Wikipedia, overfitting refers to
“the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.”
As evident in our experiment, the polynomial regression model of degree 9 conformed to the training data to such an extent that it lost its capability to generalize. In other words, the model picked up the random fluctuations around the overall trend, i.e. the noise in the data. Hence, it was unable to predict the previously unseen points correctly: it had learned the noise rather than the general trend. This situation is regarded as overfitting.
In this particular case, the model kept coming up with more and more complex decision rules with the objective of modelling all the training data points perfectly. But in the process it totally disregarded the notion of generalizing, and hence those decision rules fail to model the unseen data points.
Now one may come up with the intuition that complex decision rules always lead to overfitting, and hence that one should stick to the simplest possible decision rules in the hope that they will help the model generalize well. But is that the case? Let’s find out.
We will try to model the same five additional observations using the linear regression model this time. By that logic, they should be mapped extremely well, as the model should be more generalized. But instead, this happens:
Linear Regression Model being tested for 5 additional data points
We obtain an r-squared score of 0.65 this time. Evidently, the new data points still haven’t been modelled correctly by the linear regression model: instead of improving, its performance has degraded from the 0.85 it scored on the training data. Thus, this model can be regarded as an underfitting model or a high bias model.
According to Wikipedia,
“Underfitting occurs when a statistical model or machine learning algorithm cannot adequately capture the underlying structure of the data.”
As our experiment also shows, linear regression was never able to model the data correctly: it didn’t fit the training data well, and it also failed to model the unseen data.
Usually, a model is found to be underfitting when there’s not enough training data for it to learn from, or when the model itself is too simple to capture the trend in the data, as a straight line is for a curved trend.
So what is the solution then? The only solution to this dilemma is to meet somewhere in between, where the model neither overfits nor underfits and instead has a “good fit”. If we try to model our five unseen data points using the polynomial regression model of degree 2, we obtain the following result:
Polynomial Regression Model of degree 2 being tested for 5 additional data points
We obtain an r-squared score of 0.90! This time the model is able to correctly predict and model the overall trend of the data, which is confirmed by the increase in the r-squared score after the addition of the five unseen data points.
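To put the three fits side by side, here is a minimal sketch that scores degrees 1, 2 and 9 on both the training points and the additional points. It assumes a hypothetical training sample and five hypothetical held-out observations (illustrative values, not the article's data), so the scores it prints will not match those quoted above. The names scores and r_squared are assumptions made for this example:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data (illustrative values, not the article's data).
x_train = np.arange(10.0)
y_train = np.array([3.8, 0.5, 1.9, 0.8, 4.1, 4.6, 9.6, 12.3, 20.0, 24.7])
x_test = np.array([0.5, 2.5, 4.5, 6.5, 8.5])
y_test = np.array([2.4, 0.9, 4.5, 10.8, 22.3])


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# degree -> (r-squared on training points, r-squared on additional points)
scores = {}
for degree in (1, 2, 9):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    scores[degree] = (
        r_squared(y_train, model(x_train)),
        r_squared(y_test, model(x_test)),
    )
    print(f"degree {degree}: train = {scores[degree][0]:.3f}, "
          f"test = {scores[degree][1]:.3f}")
```

With these illustrative values the training score rises with the degree, while the score on the additional points is highest for degree 2, lower for the straight line and negative for degree 9.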
The decision mapping learned by this particular model is generalized enough so that it can map the data points that it hasn’t been trained upon as well.
Bias-Variance Tradeoff Curve
Overfitting and underfitting are two governing forces that dictate every aspect of a machine learning model. Although there’s no silver bullet for evading them and directly achieving a good bias-variance tradeoff, we are continually evolving and adapting our machine learning techniques at the data level as well as the algorithmic level, so that we can develop models that are less prone to overfitting and underfitting.
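One simple way to look for that trade-off is to sweep the model complexity and compare the error on the training points with the error on held-out points. A minimal sketch, assuming hypothetical training and held-out samples (illustrative values, not the article's data) and polynomial degree as the measure of complexity; the names train_mse, test_mse and best_degree are assumptions made for this example:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data (illustrative values, not the article's data).
x_train = np.arange(10.0)
y_train = np.array([3.8, 0.5, 1.9, 0.8, 4.1, 4.6, 9.6, 12.3, 20.0, 24.7])
x_test = np.array([0.5, 2.5, 4.5, 6.5, 8.5])
y_test = np.array([2.4, 0.9, 4.5, 10.8, 22.3])

degrees = list(range(1, 10))
train_mse, test_mse = [], []

for degree in degrees:
    # Fit on the training points only.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    # Mean squared error on the points the model saw and on those it did not.
    train_mse.append(float(np.mean((y_train - model(x_train)) ** 2)))
    test_mse.append(float(np.mean((y_test - model(x_test)) ** 2)))

# The degree with the lowest held-out error is the best trade-off in this sweep.
best_degree = degrees[int(np.argmin(test_mse))]

for degree, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {degree}: train MSE = {tr:.4f}, held-out MSE = {te:.4f}")
print(f"lowest held-out error at degree {best_degree}")
```

The training error never increases as the degree grows, so it cannot tell us when to stop; the held-out error is what reveals where the fit stops generalizing.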