Recent advances in data science have made it possible to apply mathematical principles to the study of patterns in data. Normalisation of datasets is used across many disciplines and is often the first step in a data mining process.
This research article explores how normalisation techniques affect the performance of logistic (or logit) regression classifiers in data science.
Logistic regression is an effective approach to classification problems. Two of the most commonly used normalisation techniques, min-max scaling and z-score normalisation, together with robust scaling, are applied to the original data to analyse their effect on the performance of logistic regression models. Accuracy and model lift are the two primary metrics used to evaluate the models.
A common workflow is to develop a data model in Python and then validate it.
To determine whether a patient has diabetes, a binary logistic regression model is built in Python inside a Jupyter Notebook.
Let us begin with diabetes, a group of metabolic disorders characterised by elevated levels of glucose in the blood. It can be divided into three distinct categories:
- Children are disproportionately affected by type 1 diabetes. It might be caused by both genetics and viruses.
- Insulin resistance or insufficient insulin production characterises type 2 diabetes. Among diabetics, this is the most typical form.
- Gestational diabetes is diabetes that develops during pregnancy, when the body cannot produce enough insulin to meet the increased demand.
The Pima Indian Diabetes dataset was sourced from the UCI Machine Learning Repository and was originally compiled by the National Institute of Diabetes and Digestive and Kidney Diseases. It contains eight medical diagnostic attributes and one target variable (Outcome) for 768 female patients, of whom 268 (34.9%) have diabetes.
Independent t-tests revealed significant differences between diabetic and non-diabetic individuals in all eight independent variables studied. For example, the mean blood glucose concentration of diabetic individuals was significantly higher than that of non-diabetics, with an average of 142.2 mg/dL (95% CI: 138.6 mg/dL to 145.7 mg/dL). The test result was t(766) = 15.67, p < 0.001.
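The 766 degrees of freedom follow from the sample size: 768 patients in two groups give 768 - 2 = 766. The following is a minimal sketch of how such a statistic can be computed, assuming the classic pooled-variance (Student) form of the independent t-test, which the text does not confirm. The function name `pooled_t_test` and the five readings per group are hypothetical illustrations, not values taken from the dataset.

```python
import math
from statistics import mean, variance


def pooled_t_test(a, b):
    """Return (t, df) for Student's two-sample t-test with pooled variance."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2  # degrees of freedom, e.g. 268 + 500 - 2 = 766
    # Pooled estimate of the common variance (uses sample variances).
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / std_error, df


# Hypothetical glucose readings in mg/dL, for illustration only.
diabetic = [148, 183, 137, 168, 197]
non_diabetic = [85, 89, 116, 110, 115]

t_stat, df = pooled_t_test(diabetic, non_diabetic)
```

A positive `t_stat` means the first group has the higher mean. The p-value would then be read from a t distribution with `df` degrees of freedom.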
Introducing logistic regression
Machine learning techniques can be divided into two main categories: supervised and unsupervised. Unsupervised learning techniques focus on tasks such as clustering and dimensionality reduction; Principal Component Analysis (PCA), for example, decreases the number of dimensions of the data. These techniques do not need pre-labelled data, because they rely only on the patterns in the input data.
Before supervised learning methods can be used, the data must be labelled and divided into two distinct datasets: a training dataset and a testing dataset. The two main kinds of supervised learning are classification and regression. A classification model might identify which animal is pictured, while a regression model might predict the price of a house from its number of rooms, neighbourhood, and area. A supervised classifier cannot, however, recognise an animal that was not included in the training phase. The dataset is split in two so that the model’s performance can be assessed on data it has not yet seen. Once the model has been fitted, a confusion matrix can be generated to analyse its performance further.
It is commonplace to use linear regression, a form of supervised regression, to calculate the value of a dependent variable based on one or more known variables (e.g. hours of study leading to an estimation of grades). Multiple linear regression (MLR) takes this concept further by examining how various independent variables can be used to estimate a dependent variable (e.g. hours of study, participation in extracurricular activities, number of days sick in a given period all used to predict grades). In contrast, logistic regression provides the probability of a binary outcome, such as whether or not an email is spam, whereas MLR produces a continuous value, such as the market price of a property or a grade.
Despite its misleading name, logistic regression is actually a supervised classification technique. If the predicted probability of an occurrence is greater than 50%, the observation is assigned to the positive class (labelled 1, for example spam email); if the probability is 50% or less, it is assigned to the negative class (labelled 0, for example regular email). The logistic function is a special type of sigmoid function whose output lies between 0 and 1.
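The thresholding rule can be sketched in a few lines. This is a minimal illustration, assuming the conventional encoding of 1 for the positive class and 0 for the negative class; the function names and the `threshold` parameter are hypothetical, and `z` stands for the linear combination of the inputs that a fitted model would produce.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_class(z, threshold=0.5):
    """Return 1 (positive) if the probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


probability = sigmoid(0.0)   # a score of zero sits exactly on the boundary
label = predict_class(0.0)   # a probability of exactly 50% goes to class 0
```

Because the comparison is strict, a probability of exactly 50% falls into the negative class, matching the rule in the text.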
Logistic regression has a range of potential medical applications, such as determining whether a tumour is malignant or benign. To use it, a patient database with several variables is split into two parts: the training dataset, which is used to train the model, and the test dataset, which is used to assess its performance. By taking various markers into account, health professionals can then estimate whether a patient’s tumour is benign or malignant. Multinomial logistic regression is used to classify individuals into more than two categories, such as married, single, or divorced; binary logistic regression, however, is the more commonly used form.
Logistic regression relies on a number of assumptions about the data:
- The true conditional probabilities are a logistic function of the independent variables.
- All of the independent variables have been measured without error.
- The observations are independent of one another.
- The distribution of errors is binomial.
- The independent variables are not linear combinations of one another.
- No extraneous variables are included, and no important variables are omitted.
Logistic regression cannot model nominal (categorical) data directly, although such data is frequently present in datasets; the same is true of linear regression. To work around this, nominal variables are encoded as dummy variables. A reliable way to prevent perfect multicollinearity is to drop one dummy variable for each nominal variable. If the correlation coefficients between independent variables are high (>0.8), the estimate of the dependent variable can be confounded and the coefficients of the independent variables in the logistic regression model become difficult to interpret.
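As a minimal sketch of dummy coding, the helper below turns a nominal variable into 0/1 indicator columns and drops the first category so that the remaining columns are not perfectly collinear. The function name `dummy_encode`, the `is_` column prefix, and the marital-status values are hypothetical illustrations and are not part of the diabetes dataset.

```python
def dummy_encode(values, drop_first=True):
    """Encode a nominal variable as 0/1 dummy columns.

    With drop_first=True the first category (in sorted order) becomes the
    reference level and gets no column of its own.
    """
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return [{f"is_{c}": int(v == c) for c in kept} for v in values]


# Hypothetical nominal variable with three levels.
status = ["married", "single", "divorced"]
encoded = dummy_encode(status)
```

The dropped level ("divorced" here) is represented by a row in which every dummy column is 0, so no information is lost.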
Methods of Normalisation
The three normalisation techniques applied to the dataset are described below. Min-max, z-score, and robust scaling were selected to assess their effects on the logistic regression model, although many other normalisation methods are available.
Min-max normalisation
Min-max normalisation linearly transforms features or outputs from one range of values to another. Variables are usually rescaled to lie in the range 0 to 1 or -1 to 1. A common choice is the linear transformation y = (x - min(X))/(max(X) - min(X)), which maps the values into the interval [0, 1]. Here X is the set of all observed values of x, min(X) and max(X) are its minimum and maximum, and max(X) - min(X) is the range of X. One of the major advantages of this normalisation procedure is that all relationships in the data are preserved.
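The transformation can be sketched directly from the formula. This is a minimal illustration for a single feature held in a Python list; the function name `min_max_scale` is hypothetical, and raising an error for a constant feature is an assumption about how the undefined case (a zero range) should be handled.

```python
def min_max_scale(values):
    """Rescale one feature to [0, 1] with y = (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # The range is zero, so the formula would divide by zero.
        raise ValueError("cannot min-max scale a constant feature")
    return [(x - lowest) / (highest - lowest) for x in values]


scaled = min_max_scale([2, 4, 6])
```

The smallest observed value maps to 0 and the largest to 1, while the ordering and relative spacing of the values are unchanged.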
Normalisation using the Z-score
Standardisation, or z-score normalisation, is a widely used method for normalising data. The mean and standard deviation of each attribute are calculated, and each value x of the attribute X is then transformed according to the equation y = (x - mean(X))/std(X), where mean(X) and std(X) are the mean and standard deviation of the attribute, respectively. This approach is less sensitive to outliers than min-max scaling, although the mean and standard deviation are themselves affected by extreme values.
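A minimal sketch of the z-score transformation for a single feature is shown below. It assumes the population standard deviation (dividing by n), which is one common convention; the text does not say which form was used, and the function name `z_score` is hypothetical.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardise one feature with y = (x - mean) / std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot standardise a constant feature")
    return [(x - mu) / sigma for x in values]


standardised = z_score([2, 4, 6])
```

After the transformation the feature has mean 0 and standard deviation 1, so features measured in different units become directly comparable.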
Robust scaling
Robust scaling removes the median and scales the data by the interquartile range: y = (x - median(X))/IQR(X). The interquartile range (IQR) is the spread between the 25th and 75th percentiles, which are also referred to as the first and third quartiles.
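Robust scaling, as commonly implemented, subtracts the median and divides by the interquartile range. The sketch below illustrates this for a single feature. The function name `robust_scale` is hypothetical, and the choice of the "inclusive" quartile method (linear interpolation between data points) is an assumption, since several quartile definitions exist.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Scale one feature with y = (x - median) / IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # spread between the first and third quartiles
    if iqr == 0:
        raise ValueError("cannot robust-scale a feature with zero IQR")
    centre = median(values)
    return [(x - centre) / iqr for x in values]


evenly_spread = robust_scale([1, 2, 3, 4, 5])
with_outlier = robust_scale([1, 2, 3, 4, 100])
```

In the second call the extreme value 100 does not change the median or the quartiles, so the other four values are scaled exactly as in the first call. This is the sense in which the method is robust to outliers.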
Experimental workflow
The logistic regression model was fitted to the data in four different forms: the original dataset and the three datasets produced by the three normalisation methods. This led to the creation of four distinct models for making predictions. The prediction accuracy of the three models trained on normalised data was then compared with that of the model trained on the original dataset.
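The four-model comparison can be sketched with scikit-learn. This is an illustration under stated assumptions, not the original notebook: synthetic data from `make_classification` stands in for the diabetes dataset, the random seeds and `max_iter` value are arbitrary, and each scaler is fitted on the training portion only (via a pipeline), which the text does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in with the same shape as the dataset: 768 rows, 8 features.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One entry per model: the original data plus three normalised variants.
scalers = {
    "original": None,
    "min-max": MinMaxScaler(),
    "z-score": StandardScaler(),
    "robust": RobustScaler(),
}

accuracies = {}
for name, scaler in scalers.items():
    steps = [LogisticRegression(max_iter=1000)]
    if scaler is not None:
        steps.insert(0, scaler)  # scale first, then classify
    model = make_pipeline(*steps)
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)  # test-set accuracy
```

`accuracies` then holds one test-set accuracy per model. Reusing the same split for all four models keeps the comparison fair, since any difference can only come from the normalisation step.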
The purpose of the experiments was to compare the effects of three different normalisation methods on the accuracy of a logistic regression model. The dataset was divided into two groups: 614 entries (80%) were reserved for the training set and the remaining 154 records (20%) were allocated to the test set. The normalisation techniques applied to the diabetes dataset were min-max, z-score, and robust scaling. The same split of entries between the training and test sets was used for each normalisation method.
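The split sizes can be checked with a short sketch. It assumes a 20% test fraction rounded up to a whole record, which reproduces the 154 and 614 figures, and a fixed random seed so that the same rows are reused for every normalisation method. The function name `train_test_indices` and the seed value are hypothetical.

```python
import math
import random


def train_test_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) for a reproducible random split."""
    n_test = math.ceil(n_rows * test_fraction)  # 768 * 0.2 = 153.6, so 154
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_indices(768)
```

Calling the function again with the same seed returns identical index lists, which is what allows each normalised dataset to be evaluated on exactly the same records.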
One confusion matrix is produced for the original dataset and three more for the three normalisation procedures. Confusion matrices are used to evaluate the performance of a classification system or classifier. Specifically, the rows of a confusion matrix show the expected values, whereas the columns show the actual values.
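A confusion matrix can be built by counting each pair of true and predicted labels. The sketch below is a minimal illustration with hypothetical labels. It places the true classes on the rows and the predicted classes on the columns, which is the scikit-learn convention and an assumption here; the matrix can be transposed if the opposite orientation is wanted.

```python
def confusion_matrix(true_labels, predicted_labels, labels=(0, 1)):
    """Count label pairs; rows are true classes, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


# Hypothetical test-set labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
```

The diagonal cells count the correct predictions for each class, and the off-diagonal cells count the two kinds of error.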
A useful measure of model performance is the accuracy of its predictions: the proportion of the validation dataset that is correctly classified. The sum of the diagonal values of the confusion matrix equals the number of correctly labelled samples, so the accuracy is this sum divided by the total number of samples.
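The calculation can be sketched as follows. The counts in the example matrix are hypothetical; they were chosen only so that they add up to the 154 records of the test set and are not results from the study.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = sum of the diagonal / total number of samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical 2x2 confusion matrix for 154 test records.
example_cm = [[90, 10],
              [20, 34]]
accuracy = accuracy_from_confusion(example_cm)  # (90 + 34) / 154
```

Because only the diagonal is used, the result is the same whichever way the rows and columns of the matrix are oriented.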
The efficacy of the logistic regression models was evaluated with three different normalisation techniques. Normalisation is often used to enhance the accuracy of a machine learning algorithm; however, none of the three normalisation strategies worked optimally with logistic regression. Regardless of the size of the training dataset, the normalisation approaches resulted in comparable accuracy levels across the models.