Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that allows machines and other systems to learn from their experiences and develop new skills and knowledge without requiring manual programming. This is done by analysing patterns from available historical data, similar to the way humans learn from their experiences and store and recall information about the world. Therefore, having access to the right training data is the most essential element for an ML system to be successful.
It is evident from the above figure that data collection and data cleaning are two of the most integral steps in the machine learning pipeline. In this article, we will explore the reasons behind their importance, as well as how to implement these processes in Python. By understanding the importance of data collection and cleaning in the machine learning process, organisations can ensure that the insights they derive from the data are accurate and actionable.
Why is data collection crucial?
Data is essential to machine learning. Predictive models are built from data in order to identify and exploit patterns and insights that have been observed in the past. As more data about past occurrences is gathered, those patterns become more apparent and predictions become more reliable.
Data gathering for machine learning
The explosion of data generated by entities such as Google, Facebook, e-commerce websites, and more, is unprecedented. This data has tremendous potential, but only if it is utilised properly. With careful planning and analysis, this data can be leveraged to provide insight into consumer behaviour and preferences, identify trends, and inform better decision-making.
Gathering data from various sources is an essential part of machine learning. Data can be presented in a wide range of formats, such as text, tables, photographs, and videos. The most commonly used data types for predictive models include categorical data, numerical data, time-series data, and text data.
Each of these data types is described below.
Categorical data: This data falls into a fixed set of categories, such as gender. When working with it, keep in mind that different labels may be used to describe the same thing (for example, "F" and "Female").
Numerical data: This data is strictly quantitative. An example is the number of boys and girls in each grade of a school.
Time-series data: This data is collected by taking multiple measurements at different points in time. When represented graphically, the x-axis of the graph typically indicates the passage of time. Time-series data can encompass any measurement taken over a period of time, such as temperature readings, stock market fluctuations, system logs, or weekly weather patterns.
Text data: Records that consist of written text, such as articles, blogs, and posts, are difficult for computers to interpret directly. To solve this issue, these records are typically converted into a numerical form before they are passed to a model.
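As a rough illustration of converting text into numbers, the sketch below builds a simple bag-of-words count matrix with Scikit-learn's CountVectorizer. The two sample documents are made up for this example, and bag-of-words is only one of several possible encodings, not necessarily the one a given project should use.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two short, made-up documents used purely for illustration.
documents = [
    "data cleaning matters",
    "clean data helps models",
]

# Learn the vocabulary and count how often each word occurs per document.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents).toarray()

# vocabulary_ maps each word to its column index in the count matrix.
for word, column in sorted(vectorizer.vocabulary_.items()):
    print(word, counts[:, column])
```

Each row of the matrix corresponds to one document and each column to one word in the learned vocabulary.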
In order to successfully incorporate data into a machine learning model, it is essential to thoroughly preprocess the data prior to implementation. Preprocessing of the data entails the application of a specific set of feature engineering methods, tailored to the type of data features present.
Use of Python for Data Preprocessing
Data preprocessing is the process of transforming unstructured or raw data into a format that can be consumed by a machine learning algorithm. It is an essential step in the development of a successful machine learning model, as it helps to ensure that the data is in a suitable format for the algorithm to process. Preprocessing techniques such as data cleaning, normalisation, and feature selection can be used to improve the quality and accuracy of the model. Additionally, preprocessing can also increase the speed at which the model is able to learn and make predictions.
A typical machine learning pipeline includes the following stages:
- Data collection
- Data preprocessing
- Model selection
- Parameter tuning
Raw data obtained at the start of the data collection process is rarely in a condition that a model can use directly. To make effective use of it, the data must undergo certain preprocessing procedures. These procedures may include:
- Splitting data into training, validation, and test sets
- Handling missing data
- Handling outliers
- Handling categorical features
- Preparing the dataset for analysis
Python Libraries for Data Preprocessing
Python’s library support for preparing data is robust. Some notable libraries are listed below.
Pandas
Built on top of NumPy, Pandas is a free and open-source Python library designed to help clean and analyse data. It is fast and flexible, and it accommodates a variety of data structures.
Pandas is an essential part of the machine learning pipeline, used directly after the data collection phase. The library supports a wide range of operations, including but not limited to:
- Data cleaning
- Filling in data
- Data normalisation
- Data visualisation
- Data analysis
- Merging and joining data frames to obtain a unified view
- Loading and saving data in a wide range of formats, including CSV, Excel, HTML, and JSON
Pandas offers several benefits:
- Allows users to import information from a variety of file types
- Facilitates the management of incomplete data
- Quick and easy data modification
- Features for working with time-based data
- Allows for the effective management of large datasets
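A few of these operations can be sketched as follows. The CSV content, the column names, and the choice of filling the gap with the column mean are assumptions made for this example; in practice the data would usually come from a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path.
csv_text = "customer_id,age\n1,34\n2,\n3,45\n"
customers = pd.read_csv(io.StringIO(csv_text))

# A second, made-up table to merge with the first one.
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [20.0, 35.5, 12.0]})

# Fill the missing age with the mean of the column.
customers["age"] = customers["age"].fillna(customers["age"].mean())

# Combine both tables into a single view, keeping every customer.
merged = customers.merge(orders, on="customer_id", how="left")
print(merged)
```

The left merge keeps customers without orders, so their amount appears as a missing value that can be handled in a later step.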
Scikit-learn
Scikit-learn offers a comprehensive set of tools for preparing data for machine learning models, and these tools are both user-friendly and efficient. Preprocessors such as OneHotEncoder, StandardScaler, and MinMaxScaler are widely used for data transformation.
Below is a short rundown of Sklearn’s supplementary preprocessing tools.
Classification, regression, and clustering algorithms are all available through Scikit-learn. The comprehensive documentation provided makes incorporating algorithms or performing pre-processing tasks a straightforward process.
Stages in Preparation for Data Analysis
Here we will examine the significance of Pandas and Sklearn in the preparation of data.
Train-test split
It is essential to divide data into distinct groups before providing it to a model. A model should be trained on training data, validated on validation data, and tested on test data. A common ratio between the number of samples used for training and those used for testing is 80:20. Scikit-learn provides a built-in function, train_test_split, for separating training data from test data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
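Since the text also mentions a validation set, one possible way to obtain all three subsets is to call train_test_split twice, as in the sketch below. The synthetic X and y, the 60/20/20 proportions, and the random_state value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and labels: 100 samples, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then take 25% of the remainder (20% of the total) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing random_state makes the split reproducible between runs.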
Handling missing data
Raw data sets frequently contain missing or invalid values, such as Not a Number (NaN). To obtain more accurate predictions, these values must be either removed or replaced. A popular technique is to replace them with the mean, median, or mode of the column. The SimpleImputer class in the sklearn.impute module provides the functionality needed to do this.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imputer.fit_transform(df)
Since the imputer returns a NumPy array, the result needs to be converted back into a DataFrame.
X = pd.DataFrame(X, columns=df.columns)
Sometimes filling in the missing values is not feasible. In that case, the dropna function in Pandas can be used to remove the affected rows.
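A minimal sketch of removing rows with dropna is shown below. The DataFrame and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame(
    {"height": [170.0, np.nan, 182.0], "weight": [65.0, 72.0, np.nan]}
)

# Drop every row that contains at least one missing value.
complete_rows = df.dropna()

# Alternatively, drop rows only when a specific column is missing.
has_height = df.dropna(subset=["height"])

print(len(complete_rows), len(has_height))
```

Dropping rows is simple but discards information, so it tends to be most suitable when only a small share of rows is affected.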
Handling categorical data
Categorical data must be converted into numerical values before it can be used in many operations. This process is referred to as encoding and can be accomplished in two different ways: label encoding and one-hot encoding. Label encoding assigns a numerical value to each category, while one-hot encoding creates a binary indicator column for each category. In both cases, the result is a numerical representation of the original categorical data.
Consider a height column in the dataset with the options Tall, Medium, and Short. Label encoding can be used to assign a numerical value to each option: for example, 0 to Tall, 1 to Medium, and 2 to Short. The preprocessing module of the Sklearn library can be used to accomplish this task.
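The height example can be sketched as follows. This sketch uses OrdinalEncoder rather than LabelEncoder, which is a substitution made here: LabelEncoder is intended for target labels and numbers the categories in sorted order, so it would not reproduce the Tall = 0, Medium = 1, Short = 2 mapping. The sample rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up height column.
df = pd.DataFrame({"height": ["Tall", "Short", "Medium", "Tall"]})

# Listing the categories explicitly fixes the mapping:
# Tall -> 0, Medium -> 1, Short -> 2.
encoder = OrdinalEncoder(categories=[["Tall", "Medium", "Short"]])

# fit_transform returns a 2-D float array; flatten it and cast to int.
df["height_encoded"] = (
    encoder.fit_transform(df[["height"]]).ravel().astype(int)
)

print(df)
```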
One-hot encoding
Categorical features can be represented as a binary array through the use of a one-hot encoder. This process creates a new column for each category within the feature. As an example, a gender column may contain two labels: male and female. The one-hot encoder generates a separate column for each label. For every row, the column corresponding to that row's label is set to 1 and the other column is set to 0.
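The gender example might look like the sketch below, which uses Scikit-learn's OneHotEncoder. The sample rows and the generated column names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up gender column.
df = pd.DataFrame({"gender": ["male", "female", "female"]})

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it to a dense array.
encoded = encoder.fit_transform(df[["gender"]]).toarray()

# categories_ holds the labels in column order (sorted alphabetically).
columns = [f"gender_{label}" for label in encoder.categories_[0]]
one_hot = pd.DataFrame(encoded, columns=columns, index=df.index)

print(one_hot)
```

Each row contains a single 1 in the column that matches its original label.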
Numerical and categorical data are two of the most commonly encountered forms of data, and the aforementioned techniques are often used to prepare them for further analysis. However, there are many more involved processes that can be employed for the preparation of data.
Some examples will be shown below.
Handling outliers
It is prudent to address outliers before feeding data to a model, as this enables the model to operate more accurately. Outliers are data points that are markedly different from the average or median value of a set of data. Researchers employ their own criteria for determining what constitutes an outlier. Outliers can be categorised as either moderate or severe, depending on the degree of their deviation from the mean. Generally, severe outliers are not incorporated into models.
Using the standard deviation for a symmetric curve
When the data follows a symmetric curve, as in a Gaussian distribution, the boundary beyond which points are treated as outliers can be set in terms of the standard deviation.
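One way to apply this idea is sketched below. The three-standard-deviation threshold is a common rule of thumb assumed for this example rather than a value given in the text, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 values around 50, plus one injected extreme value.
values = np.append(rng.normal(loc=50, scale=5, size=200), 150.0)

mean, std = values.mean(), values.std()

# Assumed rule of thumb: points further than 3 standard deviations
# from the mean are treated as outliers.
lower, upper = mean - 3 * std, mean + 3 * std

outliers = values[(values < lower) | (values > upper)]
cleaned = values[(values >= lower) & (values <= upper)]

print(outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, this approach tends to suit data that is roughly Gaussian apart from a few outliers.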
Using methods that are less sensitive to outliers
An alternative to removing outliers is to use methods that are less affected by them. Methods often cited for this include k-nearest neighbour, support vector machine, decision tree, ensemble methods, and the Naive Bayes classifier. These approaches are generally regarded as able to operate effectively despite the presence of outliers.
Data transformation
Datasets can contain features that vary significantly in scale, units of measurement, and range. If such data is fed into a model without first scaling it, the results can be unsatisfactory, because many models take into account only the magnitude of the values and not their units of measurement. The data therefore needs to be standardised or normalised to ensure consistency. Standardisation and normalisation are two types of scaling that can be used for this purpose.
Standardisation: The values of the attribute in question are rescaled so that they are centred on the mean, resulting in a mean of zero and a standard deviation of one. This is illustrated in the diagram below. The shape of the distribution is unchanged; only its location and spread are adjusted.
Normalisation: This method of scaling transforms values into a range between 0 and 1. It is also commonly referred to as min-max scaling, which maps the minimum and maximum values of a feature to 0 and 1, respectively. The result is a scaling of the data that preserves the shape of its original distribution, allowing features measured on different scales to be compared more effectively.
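The two scaling approaches described above, rescaling to zero mean and unit standard deviation and rescaling to the 0 to 1 range, can be sketched with Scikit-learn's StandardScaler and MinMaxScaler. The feature values are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two hypothetical features on very different scales (age, income).
X = np.array(
    [[25, 30_000], [35, 60_000], [45, 90_000], [55, 120_000]],
    dtype=float,
)

# Zero mean and unit standard deviation per column.
standardised = StandardScaler().fit_transform(X)

# Minimum mapped to 0 and maximum mapped to 1 per column.
min_max_scaled = MinMaxScaler().fit_transform(X)

print(standardised.mean(axis=0), standardised.std(axis=0))
print(min_max_scaled.min(axis=0), min_max_scaled.max(axis=0))
```

In a real pipeline the scaler is usually fitted on the training set only and then applied to the validation and test sets, so that no information leaks from the held-out data.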
The Gaussian Distribution
Linear and logistic regression, two of the most widely used machine learning techniques, tend to work better when the features of the data are approximately normally distributed. To bring skewed features closer to a normal distribution, the following transformations are commonly applied.
- Logarithmic transformation: np.log(df["column name"])
- Inverse transformation: 1 / df["column name"]
- Square root transformation: np.sqrt(df["column name"])
- Exponential transformation: df["column name"] ** (1 / 1.2)
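The four transformations listed above can be applied to a DataFrame column as in the sketch below. The income column and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up, strictly positive, right-skewed column.
df = pd.DataFrame({"income": [1.0, 10.0, 100.0, 1000.0]})

df["income_log"] = np.log(df["income"])          # logarithmic
df["income_inverse"] = 1 / df["income"]          # inverse
df["income_sqrt"] = np.sqrt(df["income"])        # square root
df["income_power"] = df["income"] ** (1 / 1.2)   # exponent of 1/1.2

print(df)
```

The logarithmic and square root transformations are only defined for non-negative input (the logarithm needs strictly positive values), and the inverse transformation fails at zero, so a column may need shifting or filtering first.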
Handling imbalanced datasets
A more accurate and less biased model can be obtained by balancing the distribution of classes in the data. This can be accomplished in two ways: increasing the number of observations for the minority class, or decreasing the number of observations for the majority class. By doing this, each class will have approximately the same amount of data.
Some common resampling methods are described here.
Undersampling: This method can be particularly effective when dealing with a large dataset containing millions of data points. It involves excluding data points from the majority class until the majority and minority classes have an equal number of members. However, this approach carries the risk of discarding valuable information that could enhance the performance of the model.
Oversampling: This technique boosts the representation of the minority class by supplementing it with additional data points. It is a useful method when the amount of available data is limited, although it can cause overfitting in certain cases.
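Random undersampling of the majority class and random oversampling of the minority class can be sketched with plain Pandas, as below. The tiny dataset, the column names, and the random_state value are assumptions for this example.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 majority rows (label 0)
# and 2 minority rows (label 1).
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Undersampling: keep only as many majority rows as there are minority rows.
undersampled = pd.concat(
    [majority.sample(n=len(minority), random_state=0), minority]
)

# Oversampling: draw minority rows with replacement until the classes match.
oversampled = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=0)]
)

print(undersampled["label"].value_counts())
print(oversampled["label"].value_counts())
```

Resampling is normally applied to the training set only, so that the test set still reflects the original class distribution.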
SMOTE: Instead of duplicating existing records, synthetic data for the minority class can be generated with the Synthetic Minority Oversampling Technique (SMOTE). This method randomly selects a minority class point and determines its k-nearest neighbours within the minority class. Artificial points are then created along the line segments between the selected point and its neighbours.
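The interpolation step behind SMOTE can be illustrated with the simplified sketch below. The function smote_sketch is a hypothetical helper written for this example, not a library API, and the minority points are made up; a production project would more likely rely on an existing implementation such as the SMOTE class in the imbalanced-learn package.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    points and their k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is returned as its own
    # nearest neighbour (assuming there are no duplicate points).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_minority))     # random minority point
        j = rng.choice(neighbour_idx[i, 1:])  # one of its k neighbours
        gap = rng.random()                    # position along the segment
        synthetic[row] = X_minority[i] + gap * (X_minority[j] - X_minority[i])
    return synthetic


# Made-up minority class points with two features.
X_minority = np.array(
    [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5]]
)
new_points = smote_sketch(X_minority, n_new=4)
print(new_points)
```

Every synthetic point lies on a segment between two real minority points, so it stays within the region already occupied by the minority class.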
Data preprocessing is not limited to text data. In reality, data can come in a wide range of formats, including images, time series, and more. Despite the differences in format, the initial steps in any machine learning process remain the same: data acquisition and preparation. Data cleaning is a fundamental step that should never be overlooked, as skipping it can lead to inaccurate results from the ML model. Consequently, it is essential to process raw data appropriately.