Data Normalisation Guidelines for Data Science Using Python's Scikit-Learn

Data science and machine learning beginners often have questions about data normalisation and why it matters. Essentially, data normalisation is a pre-processing method used in machine learning to transform features. The primary objective of this transformation is to bring feature values onto a compatible scale, which improves model performance and training stability.

Note that normalisation is necessary in machine learning only when the range of attributes differs significantly.

In a typical dataset, the variables are measured in different units and on different scales, which skews the data distribution. This is addressed through data normalisation, also referred to as standardisation or feature scaling. The objective is to calibrate the measurements so that the features share a common scale or distribution.

This article covers various data normalisation techniques, their relevance, and their applications in machine learning.

What is the need for normalisation before model fitting?

Data normalisation plays a crucial role in pre-processing data for machine learning algorithms, because many algorithms are sensitive to the scale of their input features. Standardising the data puts all features on the same scale, allowing the algorithms to process and leverage the data effectively. Let's look at why data normalisation matters for specific families of algorithms.

Distance-based algorithms such as K-Nearest Neighbours (KNN), K-Means, and Support Vector Machines (SVM) use distances between data points to assess similarity, so features with larger numeric ranges dominate the result unless the data is scaled. Gradient-descent-based algorithms, such as Linear and Logistic Regression, also benefit: when features of different scales are standardised, the optimisation converges faster. Tree-based algorithms, by contrast, are largely insensitive to feature scale, because a decision tree splits a node on a single feature's value and only the split direction matters.
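As a minimal sketch of the scale problem (the feature names and numbers here are invented for illustration), the distance between two samples is dominated by whichever feature has the largest range until the data is standardised:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: [age in years, annual income in dollars]
X = np.array([[25.0, 50_000.0],
              [30.0, 52_000.0],
              [27.0, 90_000.0]])

# On the raw data, the Euclidean distance is driven almost entirely by income
print(np.linalg.norm(X[0] - X[1]))           # ~2000.0

# After standardisation, both features contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
```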

What is scikit-learn in Python?

Initially created by David Cournapeau in 2007 as a Google Summer of Code project, scikit-learn (sklearn) was first released to the public as an open-source library in 2010. It gives programmers ready-made machine learning and data-processing tools, so developers can spend less time on data loading and pre-processing and concentrate on the modelling process.

Advantages offered by Python’s scikit-learn library

  1. The scikit-learn package is a valuable resource for building predictive models and visualising data.
  2. The API documentation helps users who want to integrate the algorithms into their own platforms.
  3. The library is free to use. It is distributed under a BSD-style (Berkeley Software Distribution) licence with minimal legal or licensing restrictions, so it is accessible to users regardless of their operating system.

Data normalisation techniques in machine learning

The prevalent normalisation techniques in machine learning are:

  • Min-max
  • Z-score
  • Log scaling (logarithmic transformation)

The following scaler methods are used to implement these strategies:

  • fit(data) computes the statistics needed for scaling, such as the mean and standard deviation of each feature (for standardisation) or its minimum and maximum (for min-max scaling).
  • transform(data) scales the data using the statistics computed by .fit().
  • fit_transform(data) performs both the fitting and the transformation in a single call, as sketched in the example below.
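A minimal sketch of this workflow with StandardScaler, fitting on training data and reusing the learned statistics on new data (the array values are simply the example numbers used later in this article):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[14.0], [9.0], [24.0], [39.0], [60.0]])
X_new = np.array([[20.0], [50.0]])

scaler = StandardScaler()
scaler.fit(X_train)                          # learn the mean and standard deviation of each feature

X_train_scaled = scaler.transform(X_train)   # scale the training data with those statistics
X_new_scaled = scaler.transform(X_new)       # reuse the same statistics on unseen data

# fit_transform() combines both steps in one call on the training data
X_train_scaled_again = StandardScaler().fit_transform(X_train)
```

Fitting only on the training data and then transforming new data with the same scaler keeps both datasets on a consistent scale.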

Min-max normalisation

In the min-max scaling technique, values are shifted and rescaled so that they fall within a range of 0 to 1 (or -1 to 1). The transformation is:

x' = (x - min) / (max - min)

Here, min refers to the minimum value in the column, whereas max represents the maximum value in the column. The x is the raw value, and x' stands for the normalised value.

Example:

Suppose the only values in a column are 14, 9, 24, 39, and 60.

The minimum value is 9 and the maximum value is 60. The normalised value of each number is therefore:
For 9: (9 - 9) / (60 - 9) = 0 / 51 = 0.00
For 14: (14 - 9) / (60 - 9) = 5 / 51 = 0.098
For 24: (24 - 9) / (60 - 9) = 15 / 51 = 0.294
For 39: (39 - 9) / (60 - 9) = 30 / 51 = 0.588
For 60: (60 - 9) / (60 - 9) = 51 / 51 = 1.00

Min-max normalisation narrows the range of values to a scale of 0 to 1. In this example, the smallest normalised value is 0.00 and the largest is 1.00. To perform min-max normalisation in scikit-learn, use the MinMaxScaler class from the sklearn.preprocessing module.

Use fit(X[, y]) to compute the minimum and maximum values to be used for later scaling.
Use transform(X) to scale the features of X to the designated feature range.
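A minimal sketch of MinMaxScaler applied to the example numbers above; the output should match the hand calculation:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[14.0], [9.0], [24.0], [39.0], [60.0]])

scaler = MinMaxScaler()                      # the default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)

print(scaler.data_min_, scaler.data_max_)    # [9.] [60.]
print(X_scaled.ravel().round(3))             # [0.098 0.    0.294 0.588 1.   ]
```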

Normalization using the Z-score

Standardisation, also known as z-score normalisation, rescales the features of a dataset so that they have the characteristics of a standard normal distribution, with a mean of 0 and a standard deviation of 1. The sample standard score (z-score) is calculated with the following formula:

x' = (x - u) / sd

Here, x is the raw value, x' is the normalised value, u is the mean, and sd is the standard deviation.
For these numbers, u is obtained by adding 14, 9, 24, 39, and 60, and dividing the sum by 5: u = (14 + 9 + 24 + 39 + 60) / 5 = 146 / 5 = 29.2.
The standard deviation is obtained by summing the squared differences between each value and the mean, dividing by the total number of values, and taking the square root.

The standard deviation can be calculated using the following formula: sd = sqrt( [(14 - 29.2)^2 + (9 - 29.2)^2 + (24 - 29.2)^2 + (39 - 29.2)^2 + (60 - 29.2)^2] / 5 )
This works out to:
sd = sqrt( [(-15.2)^2 + (-20.2)^2 + (-5.2)^2 + (9.8)^2 + (30.8)^2] / 5 )
= sqrt( [231.04 + 408.04 + 27.04 + 96.04 + 948.64] / 5 )
= sqrt( 1710.8 / 5 )
= sqrt(342.16)
= 18.50 (approximately)

As a result, the z-score normalised values are:
For 14: (14 - 29.2) / 18.50 = -0.82
For 9: (9 - 29.2) / 18.50 = -1.09
For 24: (24 - 29.2) / 18.50 = -0.28
For 39: (39 - 29.2) / 18.50 = +0.53
For 60: (60 - 29.2) / 18.50 = +1.66

A normalized number with a positive z-score indicates that x is above the mean, while a negative z-score indicates that x is below the mean.
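The hand calculation can be checked with a few lines of NumPy (StandardScaler from sklearn.preprocessing produces the same result). Note that dividing by 5 above corresponds to the population standard deviation, which is NumPy's default:

```python
import numpy as np

x = np.array([14.0, 9.0, 24.0, 39.0, 60.0])

u = x.mean()          # 29.2
sd = x.std()          # population standard deviation, about 18.4976
z = (x - u) / sd

print(z.round(2))     # [-0.82 -1.09 -0.28  0.53  1.67]
# The last value prints as 1.67 here because sd was rounded to 18.50 in the hand calculation.
```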

Log scaling

Log scaling replaces each value with its logarithm; the example below uses the base-10 logarithm, so x' = log10(x). Because the logarithm grows slowly, very large values are pulled much closer to the rest of the data.

Example:

14, 9, 24, 39, 60

The base-10 logarithm of each number is shown below (and verified in the code sketch that follows):
For 14: log(14) = 1.15
For 9: log(9) = 0.95
For 24: log(24) = 1.38
For 39: log(39) = 1.59
For 60: log(60) = 1.78
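scikit-learn has no dedicated log scaler, but the transformation can be applied with NumPy or wrapped in a FunctionTransformer so that it slots into a scikit-learn pipeline. This is a minimal sketch assuming all values are positive:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[14.0], [9.0], [24.0], [39.0], [60.0]])

# Wrap np.log10 so it behaves like any other scikit-learn transformer
log_scaler = FunctionTransformer(np.log10)   # assumes strictly positive values
X_log = log_scaler.fit_transform(X)

print(X_log.ravel().round(2))   # [1.15 0.95 1.38 1.59 1.78]
```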

Comparing log, z-score, and min-max scaling

When the data does not conform to a Gaussian (normal) distribution, min-max normalisation is often the preferred method, and it works well with algorithms that make no assumptions about the data distribution, such as K-Nearest Neighbours (KNN) and neural networks. Keep in mind, however, that outliers can significantly distort the result, because a single extreme value sets the minimum or maximum used for scaling.

Standardisation is most useful for data that roughly follows a Gaussian distribution, although the distribution does not have to be exactly normal. Unlike min-max normalisation, standardisation does not bound the data to a fixed range, so extreme values in the dataset remain extreme after scaling.

When a dataset includes incredibly large outliers, log scaling is recommended.
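As a small illustration (the numbers are made up), a single extreme outlier squashes the min-max-scaled values of the remaining points towards zero, whereas log scaling keeps them clearly separated:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Four ordinary values plus one extreme outlier
X = np.array([[10.0], [20.0], [30.0], [40.0], [100_000.0]])

print(MinMaxScaler().fit_transform(X).ravel().round(4))
# [0.     0.0001 0.0002 0.0003 1.    ]  -> the first four values become nearly indistinguishable

print(np.log10(X).ravel().round(2))
# [1.   1.3  1.48 1.6  5.  ]            -> the spread among the smaller values is preserved
```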

When the scales of the features vary significantly, many machine learning algorithms perform poorly. To prevent this, it is important to bring the features onto a comparable scale before training. This article has presented three techniques for doing so: min-max, z-score, and log scaling. Applying one of these normalisation methods rescales the features to suit the given dataset.
