A machine cannot interpret human language directly. Instead, the language is translated into a series of binary digits, which are then interpreted as instructions for displaying content that is legible to humans, such as text, audio, and images. Similarly, before sending data to any Machine Learning (ML) model, we must ensure that it is in the correct format and contains only numerical variables. Categorical variables, however, also hold important information, and can be encoded into numerical form by using libraries such as Pandas and Scikit-learn.
Understanding categorical variables
Categorical variables are usually stored as strings or categories and take a limited number of possible values. Some examples are as follows:
- The city in which a person resides, such as San Francisco, Chicago, Las Vegas, Seattle, etc.
- The division in which one works, such as HR, Merchandising, or Marketing.
- The highest level of education attained by an individual, such as a Bachelor’s, Master’s, or Doctorate.
Categorical data comes in two types: ordinal and nominal.
- Ordinal data: The categories have an inherent order, and it is essential to recognise and preserve that order when encoding. Failing to do so could lead to inaccurate conclusions. For instance, when encoding a person’s educational background, the degrees must be numbered in their proper order rather than randomly.
- Nominal data: The categories cannot be arranged in any meaningful order. Nominal data is assumed to have no hierarchy, so only the presence or absence of a given category needs to be taken into account. For example, the city in which a person resides is nominal: whether they live in Chicago or Las Vegas, neither city ranks above the other.
Now that we understand categorical variables, we can explore our options for encoding them with the Pandas and Scikit-learn libraries in Python.
Use the Find/Replace Functionality
Find and replace is the most straightforward approach for encoding categorical data. The replace() function substitutes every occurrence of an old value with a new one.
Suppose the “number of cylinders” column of our data set stores textual labels such as “two” or “one” in place of numbers. Because each label corresponds to exactly one number, we can use the Pandas replace function to swap every textual label for its numerical counterpart in a single step.
To do so, we construct a mapping dictionary that converts each string value to its associated integer value. This method is especially useful for ordinal data, because the dictionary lets us preserve the order of the categories explicitly.
In the earlier example of a person’s degree, the highest degree can be mapped to the largest number and the lowest degree to the smallest.
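The mapping-dictionary approach described above can be sketched as follows. The sample rows, the column names `num_cylinders` and `degree`, and the integers chosen for each degree are assumptions made for illustration; they do not come from a real data set.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({
    "num_cylinders": ["four", "six", "two", "four", "eight"],
    "degree": ["Master's", "Bachelor's", "Doctorate", "Bachelor's", "Master's"],
})

# Each textual label maps to its numerical counterpart.
cylinder_map = {"two": 2, "four": 4, "six": 6, "eight": 8}
# For ordinal data, the integers follow the order of the categories.
degree_map = {"Bachelor's": 1, "Master's": 2, "Doctorate": 3}

# A nested dictionary tells replace() which mapping applies to which column.
encoded = df.replace({"num_cylinders": cylinder_map, "degree": degree_map})
# Make the integer dtype explicit rather than relying on inference.
encoded = encoded.astype({"num_cylinders": int, "degree": int})

print(encoded)
```

Note that replace() leaves any label missing from the dictionary unchanged, so it is worth checking the column for leftover strings before training a model.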
Label encoding
With this strategy, every label is assigned a distinct integer based on its alphabetical position. We can implement this procedure with the Scikit-learn package.
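A minimal sketch of label encoding with Scikit-learn's `LabelEncoder` follows. The list of cities is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values.
cities = ["Seattle", "Chicago", "Las Vegas", "Chicago", "San Francisco"]

encoder = LabelEncoder()
# fit_transform() learns the sorted set of labels, then returns their codes.
city_codes = encoder.fit_transform(cities)

# classes_ holds the labels in sorted order; a label's index is its code.
print(list(encoder.classes_))
print(list(city_codes))

# inverse_transform() recovers the original labels from the codes.
restored = encoder.inverse_transform(city_codes)
```

Scikit-learn documents `LabelEncoder` as intended for target labels rather than input features; for feature columns, `OrdinalEncoder` offers a similar integer encoding.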
One-hot encoding
Label encoding has a limitation: the integers it assigns imply an order among the categories that may not exist. One-hot encoding is often employed to address this. The technique converts each category into a distinct column holding a value of either 1 or 0. The resulting columns are commonly referred to as dummy variables.
Converting categorical data to numerical data with Pandas
Pandas offers the following methods for converting categorical data into numerical form.
- Method 1: Using get_dummies()
- Method 2: Using replace()
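As a rough illustration of the get_dummies() method, the sketch below one-hot encodes a single column. The `city` column and its values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"city": ["Chicago", "Seattle", "Las Vegas", "Chicago"]})

# One new column per distinct city; dtype=int yields 1/0 instead of True/False.
dummies = pd.get_dummies(df, columns=["city"], dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column that corresponds to its original city.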
Converting categorical data to numerical data with Scikit-learn
Scikit-learn provides a number of methods for converting categorical data into numerical form.
- Method 1: Label encoding
- Method 2: One-hot encoding
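For the one-hot method in Scikit-learn, a minimal sketch using `OneHotEncoder` might look like this. The city values are hypothetical, and `handle_unknown="ignore"` is one possible setting, chosen here so that categories unseen during fitting are encoded as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data; the encoder expects a 2-D array (rows x features).
cities = np.array([["Chicago"], ["Seattle"], ["Las Vegas"], ["Chicago"]])

encoder = OneHotEncoder(handle_unknown="ignore")
# The result is a sparse matrix by default; toarray() makes it dense.
one_hot = encoder.fit_transform(cities).toarray()

# categories_[0] lists the column order for the first (and only) feature.
print(encoder.categories_[0].tolist())
print(one_hot)
```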
Which encoding method should you use?
The proper encoding method depends on how well we understand our data and on the model we then choose to train.
For example, suppose the data set has more than fifteen categorical features and we choose the support vector machine (SVM) technique. SVM training is relatively slow, and encoding that many categorical features increases the number of input features and the complexity of the model, so training could take significantly longer.
The following are some important considerations to keep in mind while deciding on an encoding method:
Employ a find-and-replace strategy
- This method is useful when maintaining the established order is paramount.
- In cases where the data is ordinal, assigning numerical values that follow the order of the categories can be a useful technique. For example, if the variable has values of small, medium, and large, they can be assigned the values 1, 2, and 3 respectively.
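The small, medium, and large example can be sketched as below. Note one substitution: the sketch uses the Pandas `Series.map` method in place of replace(), because map() returns NaN for any label missing from the dictionary, which makes unexpected values easy to spot. The sample values, including the stray "extra large" label, are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with one label that the mapping does not cover.
sizes = pd.Series(["medium", "small", "large", "small", "extra large"])

# The integers follow the natural order of the sizes.
size_order = {"small": 1, "medium": 2, "large": 3}

# map() looks each value up in the dictionary; missing keys become NaN.
size_codes = sizes.map(size_order)

# Collect the labels that were not covered by the mapping.
unmapped = sizes[size_codes.isna()]

print(size_codes)
print(unmapped.tolist())
```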
Incorporate one-hot encoding
- When the categorical feature is not ordinal, one-hot encoding can be used. An example is the city in which a person currently resides: as cities have no inherent order, each city can be represented by its own column.
- As the number of different categories increases, one-hot encoding adds more features, which increases the complexity and duration of training the model. Consequently, it is important to consider the trade-off between the accuracy of the model and the effort required to train it.
Use label encoding
- If a categorical feature has an inherent order or hierarchy, it is classified as ordinal, and integer labels that follow that order suit it. For example, the ranks within an army form a hierarchy, so each rank can be assigned an integer that reflects its position.
- When the feature has a large number of distinct categories, since one-hot encoding would add one column for every category.
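To make the trade-off concrete, the sketch below compares the output size of the two encodings on a feature with many categories. The 50 category names are generated purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical feature with 50 distinct categories, each appearing twice.
values = [f"category_{i}" for i in range(50)] * 2
df = pd.DataFrame({"feature": values})

# One-hot encoding creates one column per distinct category.
one_hot = pd.get_dummies(df["feature"])

# Label encoding keeps a single column of integer codes.
labels = LabelEncoder().fit_transform(df["feature"])

print(one_hot.shape)
print(labels.shape)
```

The one-hot result grows by one column for every additional category, while the label-encoded result stays a single column regardless of how many categories there are.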
In this article, we explored various strategies for encoding categorical data, outlining the advantages and drawbacks of each approach as well as their most frequent applications. It is essential to be aware of the benefits and potential shortcomings of each technique, as encoding is an integral part of the feature engineering process, which is fundamental to a successful model.