By Mona Eslamijam
This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.
We train machine learning models to predict values such as the weather, stock prices, the class of an image, or the sentiment of a social media post. However, machine learning models often fail to meet the performance levels we expect of them.
There are several ways to improve the accuracy of machine learning models. One popular method is “boosting,” an ensemble learning technique that combines several ML models that perform poorly on their own but become stronger together.
Weak learners and strong learners
Before we get into boosting, it is worth reviewing the concepts of “weak” and “strong” learners. Weak learners are ML models that perform poorly, sometimes only slightly better than random guessing. An ML model can end up as a weak learner for several reasons: for example, there may not be enough training data, or the model may not be complex enough.
In contrast, a strong learner makes mostly correct predictions with high confidence (the desired accuracy and confidence may vary depending on the application). Our goal in machine learning is to create strong learners.
Boosting takes several weak learners and combines them to create a strong learning system. Here, we’ll discuss some of the popular boosting methods.
Boosting is closely related to “bagging,” another ensemble method. Bagging (short for “bootstrap aggregating”) trains several weak learners on different bootstrap samples drawn from the training data (bootstrap samples are random samples taken with replacement). This results in the ML models learning different patterns. After training, when the bagging model is presented with a new input, it runs the input through all the weak learners and combines their outputs through majority voting to make a final prediction. In a classification problem, the bagging model chooses the class that receives the most votes from the weak learners.
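As a rough illustration, here is a minimal bagging sketch using scikit-learn’s BaggingClassifier with shallow decision trees as the weak learners. The synthetic dataset and the hyperparameters are illustrative assumptions, not a recommended configuration.

```python
# A minimal bagging sketch (illustrative parameters, not a tuned setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each weak learner is a decision stump trained on a bootstrap sample
# (drawn with replacement); predictions are combined by majority vote.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # "base_estimator" in older scikit-learn
    n_estimators=50,
    bootstrap=True,  # sample with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))
```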
Boosting is similar to bagging, but with one key difference: it trains its weak learners sequentially, with each new learner trying to correct the mistakes of its predecessors. There are several popular boosting techniques.
Boosting
Like bagging, boosting trains a series of weak learners on samples drawn from the training dataset. However, unlike bagging, boosting methods draw their samples “without replacement.” This means that the same example can’t be drawn twice from the training dataset when gathering a sample.
The weak learners are trained sequentially. First, the boosting algorithm draws a subset of training examples from the training dataset and trains a weak learner on them. The ML model will correctly classify some examples and misclassify others.
The algorithm then draws a second set of samples (without replacement) to train the second ML model. But this time, it also adds 50 percent of the examples that were misclassified by the first weak learner.
The boosting algorithm selects the examples that the first and second learners disagree on to train the third learner.
When all three learners are trained, the boosting model makes predictions through majority voting, like bagging models. Compared to bagging, which mainly reduces variance, boosting can reduce both bias and variance.
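To make the procedure concrete, here is a simplified, hand-rolled sketch of the three-learner scheme on synthetic data, using decision stumps as the weak learners. The data, the subset sizes, and the fallback for the case where the first two learners never disagree are illustrative assumptions; real boosting implementations differ in many details.

```python
# A simplified sketch of the three-learner boosting scheme described above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
n = len(X)

# Learner 1: trained on a subset drawn without replacement.
subset1 = rng.choice(n, size=n // 3, replace=False)
clf1 = DecisionTreeClassifier(max_depth=1).fit(X[subset1], y[subset1])

# Learner 2: a fresh subset plus roughly half of the examples learner 1 got wrong.
wrong1 = np.where(clf1.predict(X) != y)[0]
subset2 = np.union1d(rng.choice(n, size=n // 3, replace=False),
                     rng.permutation(wrong1)[: len(wrong1) // 2])
clf2 = DecisionTreeClassifier(max_depth=1).fit(X[subset2], y[subset2])

# Learner 3: trained on the examples where learners 1 and 2 disagree.
disagree = np.where(clf1.predict(X) != clf2.predict(X))[0]
if len(disagree) == 0:  # fall back to a fresh subset if they never disagree
    disagree = rng.choice(n, size=n // 3, replace=False)
clf3 = DecisionTreeClassifier(max_depth=1).fit(X[disagree], y[disagree])

# Final prediction: plain majority vote of the three weak learners.
votes = np.stack([clf.predict(X) for clf in (clf1, clf2, clf3)])
ensemble_pred = (votes.sum(axis=0) >= 2).astype(int)
print("Learner 1 accuracy:", (clf1.predict(X) == y).mean())
print("Ensemble accuracy: ", (ensemble_pred == y).mean())
```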
AdaBoost (adaptive boosting)
AdaBoost is a variation of the original boosting algorithm that assigns weights to the training examples. In AdaBoost, the first learner is trained on the entire training dataset and all examples are assigned equal weights.
Since the ML model is a weak learner, it will misclassify some of the examples after training. The weights of those examples are increased, and the weights of the correctly classified examples are decreased. The second learner is then trained on the reweighted dataset, which places more emphasis on the misclassified examples.
The same self-correction process is repeated for the third (and later) learners. Once all the learners are trained, classification is done through a weighted majority vote, in which more accurate learners have a larger say in the final prediction.
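Below is a minimal AdaBoost sketch using scikit-learn’s AdaBoostClassifier with decision stumps as the weak learners; the dataset and hyperparameters are illustrative assumptions.

```python
# A minimal AdaBoost sketch (illustrative parameters).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each round reweights the training examples so the next decision stump
# focuses on the ones its predecessors misclassified.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # "base_estimator" in older scikit-learn
    n_estimators=100,
    learning_rate=0.5,
    random_state=42,
)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```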
Gradient boosting
Gradient boosting is another variant of the boosting algorithm that, like AdaBoost, trains the learners iteratively. Each new learner tries to overcome the shortcomings of the previous one.
However, instead of reweighting misclassified examples, gradient boosting uses the prediction errors themselves to guide each new learner. After each round, the algorithm measures the distance between the ensemble’s predictions and the ground truth with a loss function and computes its gradient. The next learner is then fit to these gradients (often called pseudo-residuals), so that adding it pushes the ensemble’s predictions closer to the correct values. Because it can work with any differentiable loss function, gradient boosting is more flexible than AdaBoost’s example-reweighting scheme.
The downside of gradient boosting is that it can become inefficient and computationally expensive as the training dataset grows. It is also prone to overfitting if not configured properly. This is one reason AdaBoost remains popular.
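Here is a minimal gradient boosting sketch with scikit-learn’s GradientBoostingClassifier; the dataset and hyperparameters are illustrative assumptions.

```python
# A minimal gradient boosting sketch (illustrative parameters).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each new tree is fit to the gradients (pseudo-residuals) of the loss
# with respect to the current ensemble's predictions.
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,  # shrinks each tree's contribution; helps against overfitting
    max_depth=3,
    random_state=42,
)
gb.fit(X_train, y_train)
print("Gradient boosting accuracy:", gb.score(X_test, y_test))
```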
XGBoost (Extreme Gradient Boosting)
XGBoost is an optimized implementation of gradient boosting with several performance enhancements. It takes advantage of parallel computing, distributed computing, cache optimization, and out-of-core processing to handle large datasets while keeping performance at an optimal level.
XGBoost also uses implementation tricks to make the processing of large datasets more efficient. For example, it divides the dataset into blocks that can be processed in parallel, on multiple cores or even on separate machines.
XGBoost has become very popular in applied machine learning and Kaggle competitions because of its performance advantages.
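Assuming the xgboost Python package is installed, a minimal sketch with its scikit-learn-style XGBClassifier looks like the following; the dataset and hyperparameters are illustrative assumptions.

```python
# A minimal XGBoost sketch (requires the xgboost package; illustrative parameters).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

xgb = XGBClassifier(
    n_estimators=200,
    max_depth=3,
    learning_rate=0.1,
    tree_method="hist",  # histogram-based split finding for large datasets
    n_jobs=-1,           # use all CPU cores for parallel tree construction
)
xgb.fit(X_train, y_train)
print("XGBoost accuracy:", xgb.score(X_test, y_test))
```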
For more information, you can refer to Machine Learning with PyTorch and Scikit-Learn, a good introductory book on machine learning that also has a chapter dedicated to boosting methods.
About the author
Mona Eslamijam is a business analytics (MSc) graduate from the University of Texas at Dallas.