Effective implementation of machine learning requires well-thought-out groundwork. Simply fitting a model to the available data is insufficient to ensure success. Cross-validation techniques use statistical resampling to estimate how well a given machine learning model will perform on unseen data, which helps you compare models and tune them with confidence.
Discover seven cross-validation techniques by reading the explanations and working code included in this article.
Types of Cross-validation
- K-Fold Cross-Validation
- Holdout Validation
- Stratified K-Fold Cross-Validation
- Leave-P-Out Cross-Validation
- Leave-One-Out Cross-Validation
- Monte Carlo Cross-Validation (Shuffle Split)
- Time-Series (Rolling) Cross-Validation
K-Fold Cross-Validation
Using this method involves dividing the complete dataset into k folds, or sections, of roughly equal size. The term “k-fold” comes from the number of folds, with k = 5 or k = 10 being common choices.
In each iteration, one fold is held out as the validation set while the model is trained on the remaining k-1 folds. This continues in a loop until every fold has served as the validation set exactly once, and the k validation results are then combined into a single performance estimate.
With k = 5, for example, the process runs five iterations. During each iteration, one fold is allocated as the test/validation set, while the remaining k-1 folds (here, four folds) serve as the training set. The final accuracy is computed by averaging the validation accuracies of the k models.
It’s not recommended to apply this technique to imbalanced datasets, where the classes are not represented in equal proportion, as it may produce misleading outcomes. For instance, if a dataset has far more data points for one category than the others, a fold may contain few or no examples of a minority class, and the validation method may fail to accurately evaluate the model’s performance.
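The loop described above can be sketched with scikit-learn’s KFold splitter. The iris dataset and logistic regression model here are illustrative choices, not part of the original discussion:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds: each iteration trains on 4 folds and validates on the held-out fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# the final estimate is the average accuracy across the 5 validation folds
mean_accuracy = scores.mean()
```

`cross_val_score` handles the train/validate loop internally, returning one score per fold.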
Holdout Validation
Holdout validation, or the train-test split, involves randomly partitioning the complete dataset into two different subsets: a training set and a validation set. Typically, around 70% of the data is used to create the training set, while the remaining 30% is used for the validation set. Splitting the data this way enables the model to be built on the training set and then evaluated efficiently on data it has never seen.
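A minimal sketch of the 70/30 holdout split using scikit-learn’s `train_test_split` (the iris dataset and logistic regression are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# hold out 30% of the samples for validation; train on the remaining 70%
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
```

Because the split happens once, the estimate depends on which samples land in the validation set; fixing `random_state` makes the split reproducible.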
Stratified K-Fold Cross-Validation
Since standard k-fold partitions the data into folds without regard to class labels, it is not suitable for unbalanced datasets. Stratified k-fold, an improved version of k-fold cross-validation, resolves this limitation: the dataset is still split into k folds, but the ratio of target classes remains consistent across all folds, matching the full dataset. This makes it far more reliable for skewed datasets, though, like standard k-fold, it is not appropriate for time-series data.
For example, if the original dataset contains an unequal number of males and females, stratified k-fold preserves that male-to-female ratio within every fold, so each training and validation set reflects the overall class distribution.
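The class-ratio preservation can be verified directly with scikit-learn’s `StratifiedKFold`. The iris dataset (150 samples, 50 per class) is an illustrative choice; with 5 folds, each validation fold should contain exactly 10 samples of each class:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_class_counts = []
for train_idx, val_idx in skf.split(X, y):
    # count how many samples of each class fall into this validation fold
    fold_class_counts.append(np.bincount(y[val_idx]))
```

Note that `split` takes `y` as well as `X`, since the labels are needed to stratify the folds.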
Leave-P-Out Cross-Validation
In this exhaustive technique, a dataset consisting of n samples is separated into a validation set containing p samples and a training set containing n-p samples. The procedure is repeated for every possible combination of p samples, so each subset of size p serves as the validation set exactly once.
Although the technique yields thorough results, it is computationally impractical and time-consuming for all but small datasets, since there are C(n, p) possible splits. It is also not recommended for an imbalanced dataset, because the model may become biased in favour of one class if a training split happens to comprise instances of only a single class.
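A small sketch with scikit-learn’s `LeavePOut` makes the combinatorial cost concrete (the iris dataset and p = 2 are illustrative assumptions):

```python
from itertools import islice

from sklearn.datasets import load_iris
from sklearn.model_selection import LeavePOut

X, y = load_iris(return_X_y=True)

lpo = LeavePOut(p=2)

# the number of splits is C(n, p): C(150, 2) = 11175 for iris already
n_splits = lpo.get_n_splits(X)

# iterate over only the first few splits to avoid the full combinatorial cost
first_splits = list(islice(lpo.split(X), 3))
```

Even for p = 2 on 150 samples there are over eleven thousand model fits, which is why the method is rarely used on datasets of any real size.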
Leave-One-Out Cross-Validation
In this technique, the validation set comprises exactly one sample and the training set contains the remaining n-1 samples. It is the special case of leave-p-out cross-validation in which p equals one.
To provide further clarity, consider the following example:
Suppose the dataset includes 1000 individual entries. During each iteration, only one entry will be designated as validation data, and the remaining 999 entries will be used to train the model. This cycle will be repeated until every data point in the dataset has been leveraged as a test case.
Using the leave-one-out cross-validation technique on large datasets is often not feasible, because it requires fitting the model once per sample. Nevertheless, the methodology has several advantages: it is simple to apply, requires no parameters to tune, and uses nearly all the data for training in every fit, so it provides a low-bias assessment of the model’s effectiveness (though the estimate can have high variance).
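The 1000-entry example above can be sketched with scikit-learn’s `LeaveOneOut`; here the iris dataset (150 samples, so 150 model fits) stands in for the larger dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# one model is fit per sample: 150 fits for iris
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

# each per-sample score is 0 or 1; the mean is the LOOCV accuracy estimate
mean_accuracy = scores.mean()
```

Since each validation set holds a single point, the per-iteration score is just right or wrong, and only the average is meaningful.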
Cross-Validation Employing Monte Carlo Sampling
Also known as shuffle-split cross-validation or repeated random subsampling cross-validation, the Monte Carlo approach entails splitting the entire dataset into testing and training data. The ratio of the split is up to the user and might be, for example, 70%/30% or 60%/40%. The split ratio stays fixed across cycles; what differs in each cycle is the random selection of which samples land in the training set and which in the test set.
In each cycle, the model is fitted to the training split and evaluated on the test split. To obtain a reliable assessment of the model’s effectiveness, this procedure should be repeated many times, ideally hundreds or thousands, and the test errors averaged across iterations.
Because every iteration draws a different random subset for training and testing, averaging the test errors smooths out the variance that any single split would introduce.
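The repeated random splitting described above maps directly onto scikit-learn’s `ShuffleSplit`. A minimal sketch, with the iris dataset, logistic regression, a 70/30 ratio, and 100 repetitions as illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

# 100 random 70/30 splits: the ratio is fixed, the sampled indices differ
ss = ShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)

# averaging over the repetitions smooths out single-split variance
mean_accuracy = scores.mean()
```

Raising `n_splits` into the hundreds or thousands trades computation time for a steadier estimate.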
Sequential Data (Forward Chaining / Rolling Cross-Validation Approach)
Before discussing the details of the rolling cross-validation technique, it is imperative to establish the definition of time-series data.
Time-series data refers to the representation of the fluctuations of a variable over time. This form of data provides an understanding of the various factors that influence a variable’s behaviour across time, including readings such as economic indicators, weather reports, and stock market data. By examining fluctuations in the data, one can make informed decisions about the variable and draw conclusions about potential future behaviour.
Cross-validation can be difficult when dealing with time-series datasets. Since it is not feasible to assign data instances randomly to either the train or test set, an alternative method must be employed. This method focuses on cross-validating datasets that are primarily determined by time, thereby providing a better understanding of how the data is likely to behave in the future.
When handling time-series data, in which the order of the data is crucial, it is vital to partition the dataset into training and validation sets that are arranged chronologically. This approach is commonly known as forward chaining or rolling cross-validation.
To begin with:
During the initial stages of the training process, only a small section of the dataset is used for training. The model then makes predictions for the next time interval, and its accuracy is evaluated against the actual values. Those data points are subsequently incorporated into the next training dataset, the window rolls forward, and the process repeats until the end of the series.
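The forward-chaining splits can be sketched with scikit-learn’s `TimeSeriesSplit`. The 20-observation series below is a hypothetical stand-in for real time-series data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# hypothetical series of 20 chronologically ordered observations
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)

splits = []
for train_idx, test_idx in tscv.split(X):
    # the training window always precedes the test window in time,
    # and it grows as earlier test points are folded into training
    splits.append((train_idx, test_idx))
```

Unlike the other splitters, the data is never shuffled here: each test block comes strictly after its training block, mirroring how the model would be used in practice.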
Explore these applications and experiment with cross-validation using these seven techniques.
Thank you for reading, and happy experimenting!