Successful machine learning requires more than fitting a model to the available data: you also need a trustworthy estimate of how that model will perform on data it has never seen. Cross-validation provides exactly that, using resampling statistics to predict how well a model will generalise.
This post covers the following cross-validation methods, with descriptions and working code examples:
- K-Fold Cross-Validation
- Hold-Out Validation (train-test split)
- K-Fold Cross-Validation with Stratification
- Leave-P-Out Cross-Validation
- Leave-One-Out Cross-Validation
- Monte Carlo Cross-Validation (shuffle-split)
- Time Series Cross-Validation (rolling/forward chaining)
K-Fold Cross-Validation
With this technique, the entire dataset is partitioned into k equally sized sections, or folds; the name "k-fold" comes from these k sections. The value of k is a positive integer chosen by the practitioner, with 5 and 10 being common choices.
In each iteration, one fold is held out as the validation set and the model is trained on the remaining k-1 folds. The process is repeated k times, so that every fold serves as the validation set exactly once.
The figure above illustrates the case of five folds. In each of the five iterations, one fold is used as the test/validation set while the remaining k-1 folds (here, four) form the training set. The validation accuracies of the resulting k models are then averaged to obtain the final accuracy estimate.
This validation method is not advisable for datasets with an imbalanced class ratio. If one class has many more data points than another, some folds may contain few or no examples of the minority class, and the validation scores will not accurately measure the model's performance.
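The procedure above can be sketched with scikit-learn's `KFold`; the iris dataset and logistic regression here are placeholders, not part of the original discussion:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Example dataset and model (stand-ins for illustration)
X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

# cross_val_score trains on k-1 folds and scores on the held-out fold,
# once per fold, returning one accuracy value per iteration
scores = cross_val_score(model, X, y, cv=kf)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Averaging the five fold accuracies gives the final performance estimate described above.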
Hold-Out Validation (Train-Test Split)
Holdout cross-validation, also referred to as a train-test split, randomly divides the entire dataset into two distinct subsets: a Training Set and a Validation Set. Generally, the Training Set consists of approximately 70% of the data, while the Validation Set contains about 30%. This allows the model to be built on the Training Set and then evaluated once on the Validation Set, making it the fastest of the methods covered here.
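A minimal sketch of the 70/30 holdout split using scikit-learn's `train_test_split` (the dataset and model are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70/30 holdout split; random_state fixes the shuffle for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit once on the training set, evaluate once on the held-out set
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))
```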
K-Fold Cross-Validation with Stratification
Because plain k-fold divides data into folds with a uniform probability distribution, it is a poor fit for unbalanced datasets. Stratified k-fold, an enhanced version of the k-fold cross-validation technique, addresses this: the dataset is split into k equal folds, and the ratio of occurrences of the target classes is kept constant in every fold. This makes it suitable for skewed datasets, though not for time series data, where the order of observations matters.
In the example above, the target variable distribution is heavily skewed: the original dataset contains a disproportionate number of males to females. Stratified k-fold cross-validation preserves this ratio of target variable occurrences in every fold.
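A sketch of stratified folding with scikit-learn's `StratifiedKFold`; since the post's male/female dataset is not available, a synthetic 90/10 imbalanced dataset stands in for it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced dataset: ~90% of samples in class 0, ~10% in class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Every validation fold preserves the overall class ratio
for _, val_idx in skf.split(X, y):
    print("validation fold class counts:", np.bincount(y[val_idx]))

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print("Mean accuracy:", scores.mean())
```

Printing the class counts per fold confirms that the minority class appears in roughly the same proportion in every fold.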
Leave-P-Out Cross-Validation
In this exhaustive cross-validation method, a dataset of n samples is split into a validation set of p samples and a training set of n-p samples. The process is repeated for every possible choice of p samples, so that each sample is eventually used in both the validation and training sets.
Despite the considerable processing time required, the method produces reliable estimates. Nevertheless, it is computationally impractical for all but small datasets, and it is not advisable for an imbalanced dataset: the model would become biased towards one class if a training split contained examples from only a single class.
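The combinatorial cost is easy to see with scikit-learn's `LeavePOut` on a deliberately tiny toy dataset (the arrays below are illustrative, not from the post):

```python
import numpy as np
from sklearn.model_selection import LeavePOut

# Tiny toy dataset of n = 5 samples: exhaustive leave-p-out enumerates
# C(n, p) splits, which explodes combinatorially on real data
X = np.arange(10).reshape(5, 2)

lpo = LeavePOut(p=2)
print("Number of splits:", lpo.get_n_splits(X))  # C(5, 2) = 10

for train_idx, val_idx in lpo.split(X):
    print("train:", train_idx, "validate:", val_idx)
```

Even with only 5 samples and p = 2 there are already 10 splits; with 100 samples there would be 4,950.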
Leave-One-Out Cross-Validation
This approach trains on n-1 samples and validates on a single sample. It is the special case of the leave-p-out cross-validation method where p equals one.
Here’s an illustration that might help:
Suppose the dataset contains 1000 records. In each iteration, a single record is held out for validation while the remaining 999 are used for training. The process repeats until every data point has served as the test case exactly once.
This method is straightforward to use, requires no configuration, and provides a reliable, nearly unbiased estimate of the model's efficacy. However, because it requires fitting one model per sample, the extensive processing makes it infeasible for large datasets.
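A sketch using scikit-learn's `LeaveOneOut`; the diabetes regression dataset (442 samples) and linear regression are stand-ins chosen to keep the n model fits fast:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
loo = LeaveOneOut()

# One model fit per sample: n iterations, each validating on a single point
scores = cross_val_score(
    LinearRegression(), X, y, cv=loo, scoring="neg_mean_squared_error"
)
print("Iterations:", len(scores))  # equals the number of samples
print("Mean squared error:", -scores.mean())
```

Note that `len(scores)` equals the dataset size, which is exactly why this approach becomes prohibitively slow on large datasets.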
Cross-Validation using Monte Carlo Sampling
The Monte Carlo method, also known as shuffle-split or repeated random subsampling cross-validation, repeatedly partitions the entire dataset into training and testing data. The split ratio is entirely up to the user — 70/30 and 60/40 are common — and is typically kept fixed across cycles; what changes between cycles is the random shuffle that decides which samples land on each side of the split.
In each iteration, the model is fitted to the training split and evaluated on the test split. To accurately gauge the model's efficacy, the process is repeated many times, ideally hundreds or thousands of times, and the test errors are averaged across iterations to give the final performance estimate.
In each iteration, a different random partition determines which samples are used for training and which for testing, and the test errors are averaged out across all iterations.
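This maps directly onto scikit-learn's `ShuffleSplit`; the dataset, model, and 100-iteration count below are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

# 100 independent random 70/30 splits; each iteration reshuffles the data,
# so the same sample can appear in the test set of several iterations
ss = ShuffleSplit(n_splits=100, test_size=0.3, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)
print("Mean accuracy over 100 splits:", scores.mean())
```

Unlike k-fold, the test sets here can overlap between iterations, and the number of iterations is independent of the dataset size.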
Time series data (forward chaining / rolling cross-validation)
It is necessary to define time-series data before delving into the specifics of the rolling cross-validation method.
Time series data is a type of data that represents the evolution of a variable over a period of time. This data provides insight into the factors that influence a variable’s behaviour over a period of time and can include readings such as weather reports, economic indicators, and stock market data. By inspecting the trend in the data, one can make informed decisions about the variable and draw conclusions about future behaviour.
Cross-validation can be a challenge when working with time series datasets. Since it is not possible to randomly allocate data instances to the test set or the train set, an alternate method must be employed. This method involves cross-validating datasets where time is the primary determinant, thus allowing for a better understanding of how the data will behave in the future.
When dealing with time series data, where the order of the data is important, it is necessary to divide the dataset into training and validation sets, which are then organised chronologically. This process is also referred to as forward chaining or rolling cross-validation.
Here is how it works: training starts on a small subset of the entire dataset, and the model's accuracy is verified by predicting the time intervals that immediately follow. Those predicted data points are then folded into the next, larger training set, and the next block of future points is predicted in turn. The process rolls forward in this way until the end of the series is reached.
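The expanding-window scheme above can be sketched with scikit-learn's `TimeSeriesSplit`; the 12-point series is a toy stand-in:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 12 time-ordered observations
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)

# Each split trains on an expanding window of past data and
# validates on the block that immediately follows it in time
for train_idx, val_idx in tscv.split(X):
    print("train:", train_idx, "validate:", val_idx)
```

Note that every validation index is strictly later than every training index in its split, preserving the chronological order that time series data requires.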
Play around with these programs and try your hand at cross-validation using these seven methods.
Thanks for reading, and have fun!