# What Do Normalisation Methods Do to Logistic Regression in Data Science?

Recent advancements in the area of data science have enabled us to apply mathematical principles to the study of data behavioural patterns. Normalisation of datasets is commonly used to overcome issues in multiple disciplines, and is often the first step taken in data mining processes.

This article explores how normalisation techniques affect the performance of logistic (or logit) regression classifiers in the field of data science.

Logistic regression is an effective approach for addressing classification problems. Three commonly used normalisation techniques (min-max scaling, z-score normalisation, and robust scaling) are applied to the original data in order to analyse their effect on the performance of logistic regression models. Accuracy and model lift are the two primary metrics used to evaluate the effectiveness of the models.

## Methodology

The process involves developing a data model in Python and validating its predictive performance.

## Dataset

To determine whether a patient has diabetes, a binary logistic regression model is built in Python inside a Jupyter Notebook.

Let us begin by examining diabetes, a group of metabolic disorders characterised by elevated levels of glucose in the blood. These disorders can be divided into three distinct categories:

• Type 1 diabetes disproportionately affects children. It may be caused by both genetic and viral factors.
• Type 2 diabetes is characterised by insulin resistance or insufficient insulin production. It is the most common form among diabetics.
• Gestational diabetes occurs in pregnant women whose bodies cannot produce enough insulin during pregnancy, leading to elevated blood glucose.

The Pima Indian Diabetes dataset was sourced from the UCI Machine Learning Repository and was originally compiled by the National Institute of Diabetes and Digestive and Kidney Diseases. This dataset features eight medical diagnostic attributes and one target variable (Outcome) for 768 female patients, of whom 268 (34.9%) have diabetes.
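A minimal sketch of loading the data with pandas; the file name `diabetes.csv` and the `load_pima` helper are assumptions for illustration, while the column names follow the standard Pima dataset:

```python
import pandas as pd

# Column names used in the standard Pima Indians Diabetes CSV.
COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

def load_pima(path="diabetes.csv"):
    """Load the full 768-row dataset from a local CSV (hypothetical path)."""
    return pd.read_csv(path, names=COLUMNS, header=0)

# Structure check on a small in-memory sample (2 of the 768 rows):
sample = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0]],
    columns=COLUMNS,
)
positive_share = sample["Outcome"].mean()  # fraction of diabetic patients
```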

An independent t-test revealed significant differences between diabetic and non-diabetic individuals in all eight independent variables studied. Data analysis showed that the mean blood glucose concentration of diabetic individuals was significantly higher than that of non-diabetics, averaging 142.2 mg/dL (95% CI: 138.6 to 145.7 mg/dL), t(766) = 15.67, p < 0.001.
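The t-test itself can be reproduced with `scipy.stats.ttest_ind`. The sketch below uses synthetic glucose values with the reported group sizes (268 diabetic, 500 non-diabetic), not the real data, so the exact statistic will differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic glucose values standing in for the two patient groups:
glucose_diabetic = rng.normal(142.2, 30.0, size=268)
glucose_healthy = rng.normal(110.0, 25.0, size=500)

t_stat, p_value = stats.ttest_ind(glucose_diabetic, glucose_healthy)
df = len(glucose_diabetic) + len(glucose_healthy) - 2  # 766, as in the text
```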

## Introducing logistic regression

Machine learning techniques can be divided into two main categories: supervised and unsupervised. Unsupervised learning techniques, such as Principal Component Analysis (PCA), focus on clustering and on reducing the number of dimensions of the data. These techniques do not require pre-labelled data; the patterns in the input data alone are enough to group observations or reduce the feature space.

Before supervised learning methods can be used, the data must be labelled and divided into two distinct datasets: a training dataset and a testing dataset. The most common supervised learning tasks are classification and regression. Classification involves, for example, creating a model to identify which animal is pictured, while regression entails predicting the price of a house based on its number of rooms, neighbourhood, and area. However, supervised classification algorithms are unable to recognise an animal that was not included during the training phase. Splitting the dataset in two allows the model's performance to be assessed on data it has not yet seen. Once the model has been fitted, a confusion matrix can be generated to further analyse its performance.

It is commonplace to use linear regression, a form of supervised regression, to calculate the value of a dependent variable based on one or more known variables (e.g. hours of study leading to an estimation of grades). Multiple linear regression (MLR) takes this concept further by examining how various independent variables can be used to estimate a dependent variable (e.g. hours of study, participation in extracurricular activities, number of days sick in a given period all used to predict grades). In contrast, logistic regression provides the probability of a binary outcome, such as whether or not an email is spam, whereas MLR produces a continuous value, such as the market price of a property or a grade.

Despite its deceptive name, logistic regression is actually a supervised classification technique. If the predicted probability of an occurrence is greater than 50%, the observation is assigned to the positive class (labelled 1, e.g. spam email); if the probability is 50% or less, it is assigned to the negative class (labelled 0, e.g. regular email). The logistic function is a special type of sigmoid function, whose output is restricted to the range [0, 1].
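The sigmoid and the thresholding rule can be sketched as follows; the `classify` helper is illustrative, with 0.5 as the decision threshold described above:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, threshold=0.5):
    """Probability above the threshold -> positive class (1), else negative (0)."""
    return 1 if p > threshold else 0

p = sigmoid(2.0)      # probability for a raw model score of 2.0
label = classify(p)   # assigned class label
```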

Logistic regression is a powerful tool with a range of potential medical applications, such as determining whether a tumour is malignant or benign. To utilise it effectively, a patient database with several variables is split into two parts: the training dataset, which is used to train the model, and the test dataset, which is used to assess its performance. By taking various markers into account, health professionals can determine whether a patient’s cancer is benign or malignant. For classifying individuals into more than two categories, such as married, single, or divorced, multinomial logistic regression is used; binary logistic regression remains the more common approach for two-class problems.

In order to use logistic regression, a number of assumptions about the data must hold:

• The conditional probabilities are given by a logistic function of the independent variables.
• All of the independent variables are measured without error.
• The observations are independent of one another.
• The distribution of errors is binomial.
• There is no strong linear relationship (multicollinearity) between the independent variables.
• No relevant variables are omitted and no extraneous ones are included.

Logistic regression cannot model nominal data directly, which is frequently present in datasets; the same is true of linear regression. To handle this, nominal variables are encoded as dummy variables. Dropping one dummy variable per categorical feature prevents perfect multicollinearity. If the correlation coefficients of multiple independent variables are high (>0.8), the estimates become confounded and the interpretation of individual coefficients in a logistic regression model is unreliable.
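Dummy encoding with one column dropped can be sketched with pandas; the `status` column is a made-up example mirroring the married/single/divorced categories mentioned above:

```python
import pandas as pd

df = pd.DataFrame({"status": ["married", "single", "divorced", "single"]})

# drop_first=True removes one dummy column per categorical feature,
# which avoids perfect multicollinearity among the dummies.
dummies = pd.get_dummies(df["status"], drop_first=True)
```

With three categories, only two dummy columns remain; the dropped category ("divorced", first in alphabetical order) becomes the baseline encoded as all zeros.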

## Methods of Normalisation

The three normalisation techniques employed on the dataset will be explored in detail. The selected techniques (min-max scaling, z-score normalisation, and robust scaling) were implemented to assess their effects on the logistic regression model. It is important to note, however, that there are numerous other methods of normalisation available.

### Min-max scaling

In this approach, features are linearly transformed from their original range to a new one, typically [0, 1]. Each value is rescaled using the linear transformation y = (x - min(x))/(max(x) - min(x)), where min(x) and max(x) are the minimum and maximum observed values of the feature; this same transformation is standard practice when rescaling image pixel intensities. One of the major advantages of this normalisation procedure is that all relationships between the data values are preserved.
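A minimal implementation of this transformation; the sample glucose values are made up:

```python
import numpy as np

def min_max_scale(x):
    """Rescale values linearly into [0, 1]: y = (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

glucose = np.array([85.0, 120.0, 142.0, 199.0])
scaled = min_max_scale(glucose)
```

Note that the relative ordering of the values is unchanged, illustrating the relationship-preserving property described above.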

### Normalisation using the Z-score

Standardisation is a widely used method for normalising data. The process involves calculating the mean and standard deviation of each attribute, and then transforming each value of the attribute X according to the equation y = (x - mean(X))/std(X), where mean(X) and std(X) are the mean and standard deviation of the attribute, respectively. This approach is less affected by outliers than min-max scaling, since a single extreme value does not compress the rest of the range.
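The corresponding sketch for z-score normalisation; the sample values are made up:

```python
import numpy as np

def z_score(x):
    """Standardise values: y = (x - mean(X)) / std(X)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

values = np.array([85.0, 120.0, 142.0, 199.0])
standardised = z_score(values)
# The result has mean 0 and (population) standard deviation 1.
```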

### Robust scaling

Robust scaling removes the median and rescales the data according to the interquartile range (IQR), the spread between the 25th and 75th percentiles, which are also referred to as the first and third quartiles. Because the median and IQR are insensitive to extreme values, this method is robust to outliers.
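A sketch of robust scaling using NumPy; the sample data, including the outlier 100, is made up to show the effect:

```python
import numpy as np

def robust_scale(x):
    """Centre on the median and divide by the interquartile range."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - median) / (q3 - q1)  # IQR = q3 - q1

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
scaled = robust_scale(data)
# The bulk of the data stays in a small range; the outlier remains
# visible but does not distort the scaling of the other values.
```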

## Activity structure depiction

In the logistic regression experiments, the data was presented in four different forms: the original dataset and the three datasets created by the normalisation processes. This led to four distinct models for making predictions. To compare prediction accuracy, the three normalised datasets were evaluated against the original dataset.
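The four-model workflow might be sketched as follows; synthetic data from `make_classification` stands in for the diabetes dataset, and the scaler choices mirror the three techniques above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in for the diabetes data: 768 rows, 8 features, binary outcome.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 614 training / 154 test records

scalers = {
    "original": None,
    "min-max": MinMaxScaler(),
    "z-score": StandardScaler(),
    "robust": RobustScaler(),
}

accuracies = {}
for name, scaler in scalers.items():
    if scaler is None:
        Xtr, Xte = X_train, X_test
    else:
        Xtr = scaler.fit_transform(X_train)  # fit on training data only
        Xte = scaler.transform(X_test)       # reuse the fitted parameters
    model = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
    accuracies[name] = model.score(Xte, y_test)
```

Fitting each scaler on the training split only, then reusing it on the test split, avoids leaking test-set statistics into the model.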

## Results

The purpose of the experiments conducted was to compare the effects of three varied normalising methods on the accuracy of a logistic regression model. In order to do so, two groups were created from the dataset: 80% of the records (614 entries) were reserved for the training set and the remaining 20% (154 records) were allocated to the test set. The normalisation techniques applied to the diabetes dataset were min-max, z-score, and robust scaling. Moreover, the same split of entries between the training and test sets was utilised for each normalising method.

One classification matrix was produced for the original dataset and three additional matrices for the three different normalisation procedures. Confusion matrices such as these are utilised to evaluate the performance of a classification system or classifier. Specifically, the rows of a confusion matrix show the actual values, whereas the columns indicate the predicted values.
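scikit-learn's `confusion_matrix` follows this rows-are-actual, columns-are-predicted convention; a small made-up example:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
```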

## Accuracy

A useful measure of model performance is the accuracy of its predictions, calculated as the proportion of the validation dataset that is correctly classified. The sum of the diagonal values of the confusion matrix equals the number of correctly labelled samples, so dividing this sum by the total number of samples gives the accuracy.
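Computed from a hypothetical confusion matrix over the 154 test records (the cell counts are made up for illustration):

```python
import numpy as np

cm = np.array([[90, 17],   # hypothetical confusion matrix:
               [21, 26]])  # rows = actual, columns = predicted

# Accuracy = correct predictions (diagonal sum) / all predictions.
accuracy = np.trace(cm) / cm.sum()
```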

The efficacy of the logistic regression models was evaluated under the three different normalisation techniques. Normalisation is intended to enhance the accuracy of a machine learning algorithm; however, none of the three strategies noticeably improved the logistic regression results here. Regardless of the normalisation approach applied, the four models achieved comparable accuracy levels.
