Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that gives machines and other systems the ability to acquire new abilities and knowledge without being explicitly programmed. It does so by analysing patterns in past data, much as humans learn from experience and then store and recall information. A key factor in an ML system’s success is therefore access to the right training data.
The diagram above highlights that data collection and cleansing are two of the critical steps in the machine learning pipeline. In this article, we look at why they matter and at ways to carry them out in Python. With a clear understanding of the role of data collection and cleansing in the machine learning process, organisations can make the resulting insights more accurate and actionable.
Why Gathering Information is Crucial
Data is indispensable for machine learning to operate. Using data, predictive models can be designed to uncover and make use of patterns and insights that were observed previously. As we collect more data about past incidents, patterns become more apparent and can be anticipated.
Collecting Data for Machine Learning
Companies such as Google and Facebook, e-commerce websites, and many others produce vast amounts of data. This data has immense potential if put to good use. With careful planning and analysis, it can be used to gain insights into consumer behaviour and preferences, recognise trends, and support better decision-making.
Structures of Data
Collecting data from many sources is an essential task in machine learning. Data can come in various formats, such as text, tables, images, and videos. The data types most frequently used for predictive models are categorical data, numerical data, time-series data, and text data.
Allow me to give more details on each of these aspects.
Categorical Data
Categorical data assigns each observation to one of a fixed set of groups or labels. Gender recorded as a category is one example.
Numerical Data
Numerical data consists only of quantitative values. An example is the number of male and female students at each level of a school.
Time-Series Data
This data is collected by taking several readings at distinct time intervals. When it is plotted, the x-axis typically indicates the passage of time. Time-series data can encompass any measurement taken over a period of time, such as temperature readings, stock market fluctuations, system logs, or weekly weather patterns.
Text Data
Written documents such as articles, blogs, and posts are usually difficult for computers to process directly because they are free-form text. To address this, the documents are usually converted into a numerical representation for more effective analysis.
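One common way to turn text into numbers is a bag-of-words count matrix. The following is a minimal sketch using Scikit-learn’s CountVectorizer; the two example sentences are invented for illustration and are not from any real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, invented for illustration.
documents = [
    "data cleaning improves models",
    "models learn from data",
]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term
# matrix; toarray() converts it to a dense NumPy array for inspection.
counts = vectorizer.fit_transform(documents).toarray()

# vocabulary_ maps each word to its column index in the matrix.
vocabulary = vectorizer.vocabulary_
print(vocabulary)
print(counts)
```

Each row of the matrix corresponds to one document and each column to one word in the learned vocabulary.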
Before data is fed into a machine learning model, it must be thoroughly preprocessed. Preprocessing applies a set of feature engineering methods chosen to suit the type of data at hand.
Employing Python for Data Preprocessing
Data preprocessing is the process of transforming unstructured or raw data into a format that a machine learning algorithm can handle. It is a crucial step in building a successful machine learning model, because it puts the data into a structure the algorithm can analyse. Preprocessing techniques such as data cleansing, normalisation, and feature selection can improve the quality and accuracy of the model. Preprocessing can also speed up training and prediction.
A typical machine learning workflow consists of the following steps:
- Collecting Data
- Preprocessing Data
- Selecting a Model
- Tuning the Model’s Settings
At the commencement of the data collection process, raw data is obtained, but it may not be in an optimal state. To utilise this data effectively, it must undergo some preprocessing procedures. Such procedures might encompass:
- Dividing Data into Training, Validation, and Test Sets
- Handling Missing Data
- Dealing with Data Containing Anomalies
- Handling Categorical Information and Characteristics
- Getting the Dataset Ready for Analysis
A Set of Python Modules
Python’s library support for data preparation is extensive. Here are a few noteworthy libraries:
Pandas
Pandas is an open-source Python library built on top of NumPy that facilitates data cleaning and analysis. It is fast and supports a range of data structures.
Pandas is a vital element of the machine learning pipeline, used right after the data collection phase. This library provides numerous functionalities, including but not limited to:
- Cleansing Data
- Inputting Data
- Data Normalisation
- Data Visualization
- Information Analysis
- Obtaining a Unified View through Data Frame Concatenation and Merging
- Loading and Saving Data in Many Formats, Including CSV, Excel, HTML, and JSON
Benefits of Pandas:
- Enables Users to Import Data from Different File Formats
- Eases Handling of Incomplete Data
- Swift and Effortless Data Manipulation
- Functionalities for Time-Series Data Manipulation
- Enables Efficient Handling of Big Data
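A minimal sketch of a few of the Pandas features listed above: loading CSV data, checking for incomplete values, and combining data frames by merging and concatenation. The CSV text, column names, and values are all hypothetical and chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path.
csv_text = "id,age,city\n1,34,Paris\n2,,Lima\n3,29,\n"
people = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column.
missing_per_column = people.isna().sum()

# Merge with a second hypothetical table on the shared "id" column.
scores = pd.DataFrame({"id": [1, 2, 3], "score": [0.7, 0.4, 0.9]})
merged = people.merge(scores, on="id")

# Concatenation stacks the rows of two frames that share the same columns.
more_people = pd.DataFrame({"id": [4], "age": [41.0], "city": ["Oslo"]})
combined = pd.concat([people, more_people], ignore_index=True)

print(missing_per_column)
print(merged)
print(combined)
```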
Scikit-learn
Scikit-learn provides a broad range of user-friendly and efficient tools for preparing data for machine learning models. Preprocessors such as OneHotEncoder, StandardScaler, and MinMaxScaler are frequently used.
Outlined below are some of Sklearn’s additional preprocessing tools.
Scikit-learn offers classification, regression, and clustering algorithms. Its comprehensive documentation makes incorporating algorithms or undertaking preprocessing tasks an effortless process.
Steps Involved in Preparing for Data Analysis
In this section, we will explore the importance of Pandas and Sklearn in data preparation.
Splitting Training and Validation Data
Before feeding data to a model, it is crucial to split it into distinct sets. A model is trained on training data, evaluated during development on validation data, and finally tested on test data. A typical ratio of training samples to test samples is 80:20. Scikit-learn’s train_test_split function performs the split.
from sklearn.model_selection import train_test_split
# X holds the features and y the target; 20% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
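train_test_split produces two sets per call. One possible way to obtain separate training, validation, and test sets is to call it twice, as sketched below. The synthetic X and y arrays and the 60:20:20 proportions are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First hold out 20% of all rows as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then take 25% of the remaining 80% as the validation set,
# which is 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing random_state makes the split reproducible between runs.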
Handling of Missing Data
Raw datasets often contain invalid entries such as Not a Number (NaN) or missing values. To improve prediction accuracy, these entries need to be dealt with during preprocessing. A commonly used technique is to replace them with the mean, median, or mode of the column they belong to. The SimpleImputer class in sklearn.impute provides this functionality.
import numpy as np
from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imputer.fit_transform(df)
Since this imputer produces a NumPy array, you need to convert it to a dataframe.
X = pd.DataFrame(X, columns=df.columns)
Filling in missing data is sometimes not possible. In that case, the rows that contain missing values can be removed with the dropna method of a Pandas DataFrame.
dropped_df = df.dropna()
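The snippets above assume that a DataFrame named df already exists. A self-contained sketch that puts imputation and row removal side by side might look like this; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
df = pd.DataFrame(
    {
        "height": [150.0, np.nan, 170.0],
        "weight": [50.0, 60.0, np.nan],
    }
)

# Option 1: replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Option 2: drop every row that contains at least one NaN.
dropped = df.dropna()

print(imputed)
print(dropped)
```

Imputation keeps all three rows, while dropping leaves only the single complete row, which shows how much data row removal can cost on a small dataset.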
Handling Categorical Information
Categorical data often needs to be converted into numerical values before it can be used in many operations. This process is known as encoding, and two common encoding techniques are label encoding and one-hot encoding. Label encoding assigns a single number to each category, while one-hot encoding creates a binary indicator column for each category. In both cases, the output is a numerical representation of the original categorical data.
Suppose a dataset contains a height column with the options Tall, Medium, and Short. Label encoding assigns a number to each option. For instance, Tall can be assigned the value 0, Medium the value 1, and Short the value 2. The preprocessing module of the Sklearn library can perform this task.
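The height example can be sketched with Scikit-learn’s OrdinalEncoder, which accepts an explicit category order. That encoder is chosen here as an assumption, because Scikit-learn’s LabelEncoder assigns codes in alphabetical order and is intended for target labels. The DataFrame and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data for illustration.
df = pd.DataFrame({"height": ["Tall", "Short", "Medium", "Tall"]})

# Passing the categories explicitly fixes the mapping:
# Tall -> 0, Medium -> 1, Short -> 2.
encoder = OrdinalEncoder(categories=[["Tall", "Medium", "Short"]])

# The encoder expects a 2-D input, hence the double brackets;
# ravel() flattens the single output column back to 1-D.
df["height_encoded"] = encoder.fit_transform(df[["height"]]).ravel()

print(df)
```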
One-Hot Encoding
The one-hot encoder represents categorical features as a binary array. It generates a new column for each category within the feature. For example, a gender column may contain two categories: male and female. The one-hot encoder creates a separate column for each label. Each row then receives a 1 in the column that matches its label and a 0 in the other column.
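A minimal sketch of the gender example using Scikit-learn’s OneHotEncoder. The DataFrame is hypothetical; the result is converted to a dense array only so that it is easy to print.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data for illustration.
df = pd.DataFrame({"gender": ["male", "female", "female"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
one_hot = encoder.fit_transform(df[["gender"]]).toarray()

# categories_ holds the labels found in each input column, in column order.
columns = list(encoder.categories_[0])
encoded = pd.DataFrame(one_hot, columns=columns)

print(encoded)
```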
Numerical and categorical data are the two most commonly encountered types of data, and the aforementioned techniques are frequently used to prepare them for further analysis. However, many more complex data preparation techniques can be implemented. Some examples are provided below.
Dealing with Outliers
Removing outliers from the data before it is used as input for a model is recommended, because it often improves the model’s accuracy. Outliers are data points that differ substantially from the average or median value of a dataset. Researchers use their own criteria to decide what qualifies as an outlier. Outliers can be classified as moderate or severe, depending on how far they deviate from the mean. Typically, severe outliers are not included in models.
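Using the Standard Deviation for Symmetric Distributions
When the data follows a Gaussian distribution, which is symmetric about its mean, the boundary beyond which a point counts as an outlier is set in terms of the standard deviation. A common choice is three standard deviations on either side of the mean.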
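As an illustration, the sketch below flags values that lie more than three standard deviations from the mean. The cut-off of three is a widely used convention assumed here, not a value given in the text, and the data is synthetic with one extreme value added by hand.

```python
import numpy as np

# Synthetic, roughly Gaussian data plus one hand-placed extreme value.
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50.0, scale=5.0, size=200), [120.0])

mean = values.mean()
std = values.std()

# Assumed cut-off: three standard deviations on either side of the mean.
threshold = 3
lower = mean - threshold * std
upper = mean + threshold * std

# Boolean mask marking the points outside the boundaries.
is_outlier = (values < lower) | (values > upper)
cleaned = values[~is_outlier]

print(f"bounds: [{lower:.1f}, {upper:.1f}]")
print("outliers:", values[is_outlier])
```

Note that extreme values inflate the mean and standard deviation themselves, so this rule works best when outliers are few.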
Using Techniques That Are Less Sensitive to Outliers
When outliers cannot be removed, it is advisable to use modelling techniques that are less influenced by them. Techniques often suggested for this purpose include k-nearest neighbour, support vector machine, decision tree, ensemble methods, and the Naive Bayes classifier. How well each of them copes depends on the data and on how the model is configured.
Modifications to the Data
Datasets may contain features that differ significantly in scale, units, and range. If such data is fed into a model without scaling, the results may be unsatisfactory, because many models consider only the magnitude of the values and disregard their units of measurement. To put the features on a comparable footing, the data should be standardised or normalised. Standardisation and normalisation are two types of scaling that can be used for this purpose.
Standardisation: This technique rescales the values of an attribute so that their mean becomes zero and their standard deviation becomes one. The diagram below depicts this scenario: the values are centred on the mean, with roughly the same number of values on either side when the distribution is symmetrical.
Normalisation: This scaling technique transforms numbers into a range of 0 to 1. It is also known as min-max scaling because it maps the minimum and maximum values of a feature to 0 and 1, respectively. The data is rescaled while the shape of its original distribution is preserved, allowing for more meaningful comparisons between features.
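Both kinds of scaling can be sketched with Scikit-learn’s StandardScaler (zero mean, unit standard deviation) and MinMaxScaler (range 0 to 1). The small feature matrix is invented for illustration, with two columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales.
X = np.array(
    [
        [1.0, 100.0],
        [2.0, 200.0],
        [3.0, 300.0],
    ]
)

# Rescale each column to mean 0 and standard deviation 1.
standardised = StandardScaler().fit_transform(X)

# Rescale each column so its minimum maps to 0 and its maximum to 1.
normalised = MinMaxScaler().fit_transform(X)

print(standardised)
print(normalised)
```

In practice the scaler is usually fitted on the training set only and then applied to the validation and test sets with transform, so that no information leaks from the held-out data.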
Linear and logistic regression are two of the most popular machine learning techniques, and they are commonly said to benefit from features that are approximately normally distributed. When a feature is strongly skewed, transformations such as the following can bring it closer to a normal distribution.
- The logarithmic transformation applies the logarithm: np.log(df["column name"]).
- The inverse transformation applies the reciprocal: 1 / df["column name"].
- The square root transformation is np.sqrt(df["column name"]).
- The exponential (power) transformation raises the values to a power, for example df["column name"] ** (1 / 1.2).
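The four transformations can be tried side by side on a small example. The income column and its values are hypothetical, and the exponent 1/1.2 is the one used in the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature.
df = pd.DataFrame({"income": [1.0, 10.0, 100.0, 1000.0]})

# Logarithmic transformation (requires strictly positive values).
df["log"] = np.log(df["income"])

# Inverse transformation (requires non-zero values).
df["inverse"] = 1 / df["income"]

# Square root transformation (requires non-negative values).
df["sqrt"] = np.sqrt(df["income"])

# Power transformation with the exponent 1/1.2.
df["power"] = df["income"] ** (1 / 1.2)

print(df)
```

Which transformation works best depends on the feature, so it is worth checking the result, for example with a histogram, before committing to one.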
About Imbalanced Datasets
In an imbalanced dataset, one class has far more observations than another. Generating a more balanced distribution of data can lead to a more precise and less biased model. There are two ways to achieve this: increasing the number of observations for the minority class, or decreasing the number of observations for the majority class. Either way, the aim is for each class to have roughly the same amount of data.
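Both directions can be sketched with plain Pandas sampling. The dataset below is hypothetical, with ten rows in the majority class and two in the minority class.

```python
import pandas as pd

# Hypothetical imbalanced dataset: label 0 is the majority class.
df = pd.DataFrame({"feature": range(12), "label": [0] * 10 + [1] * 2})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Decrease the majority class: keep a random subset as small as the minority.
undersampled = pd.concat(
    [majority.sample(n=len(minority), random_state=0), minority]
)

# Increase the minority class: draw rows with replacement until it
# matches the size of the majority class.
oversampled = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=0)]
)

print(undersampled["label"].value_counts())
print(oversampled["label"].value_counts())
```

Resampling is normally applied to the training set only, so that the validation and test sets keep the original class distribution.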
Shown below are several methods of resampling.
Undersampling: For large datasets with millions of data points, this method can be highly beneficial. The technique removes data points from the majority class until the majority and minority classes have an equal number of members. However, this approach risks discarding valuable information that could improve the model’s performance.
Oversampling: This method can improve the representation of the underrepresented group by augmenting it with additional data. It is particularly useful when the quantity of available data is limited, though it may potentially result in overfitting in certain situations.
SMOTE: The Synthetic Minority Oversampling Technique (SMOTE) generates new, synthetic data for the minority class. The technique randomly selects a minority class point and identifies its k nearest neighbours within the minority class. Artificial points are then created on the line segments between the selected point and its neighbours.
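The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration and not the reference algorithm; the function name smote_sketch, its parameters, and the sample points are hypothetical. For real projects, the imbalanced-learn package offers a SMOTE implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is returned
    # as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, neighbour_idx = nn.kneighbors(minority)

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))       # random minority point
        j = rng.choice(neighbour_idx[i, 1:])  # one of its neighbours
        gap = rng.random()                    # position along the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)


# Hypothetical minority-class points.
minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority points, the new data stays inside the region already occupied by the minority class.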
Data preprocessing is not restricted to textual data; the data can come in various formats, including images, time series, and more. Despite the variations in format, the initial stages of any machine learning process are the same: obtaining and preparing data. Data cleaning is a crucial step that should not be overlooked, because skipping it can lead to inaccurate results from the ML model. It is therefore critical to preprocess the raw data properly.