Classification is the use of machine learning (ML) methods to assign data points to categories based on patterns learned from examples. The approach powers many familiar products: Gmail uses it to filter spam and gauge the tone of messages, and Google Lens uses it to identify plant species from photos.
Types of classification
The main types of classification are:
- Binary classification
- Multiclass classification
- Multilabel classification
Binary classification
Binary classification places each data point into one of exactly two buckets. For example, it could be used to determine whether a patient has tuberculosis (yes or no, represented as 1 or 0 respectively), or to classify a movie review as either positive or negative (again, 1 or 0).
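A minimal sketch of a binary classifier, assuming scikit-learn and a synthetic two-class dataset (the dataset parameters here are illustrative, not from the article):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class dataset: every label is 0 or 1
X, y = make_classification(n_samples=500, n_features=8, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Each prediction is a single 0/1 decision
preds = clf.predict(X_test)
print(set(preds))
```

The key property is that the predicted label set never contains more than the two classes.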
Multiclass classification
Multiclass classification divides data into more than two distinct groups. For example, recognising a digit in a picture means assigning it one of ten labels, zero through nine. Similarly, if we offered ten distinct courses, we would assign a single course label to each data point. Ranking songs by popularity from one to five can also be framed as multiclass classification.
Multilabel classification
The same data point can carry more than one label. For example, a single image could contain both a house and a plant, and a multilabel model should assign both labels to it. Handling this correctly requires a distinct labelling scheme and an appropriate machine learning algorithm, so the first step is always to choose the categorisation setting that matches the given scenario.
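A small sketch of the multilabel setting, assuming scikit-learn's synthetic multilabel generator and a one-classifier-per-label wrapper (all parameter values here are illustrative):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Each sample can carry several labels at once, encoded as a 0/1 indicator matrix Y
X, Y = make_multilabel_classification(n_samples=300, n_features=10, n_classes=3, random_state=0)

# Train one binary logistic regression per label column
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

# The prediction for one sample is a row of independent 0/1 decisions
pred = clf.predict(X[:1])
print(pred.shape)
```

Note the contrast with multiclass classification: here the three label columns are predicted independently, so a sample may receive zero, one, or several labels.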
Here, we’ll take a closer look at multiclass classification for machine learning tasks.
Python multiclass classification using machine learning techniques
Machine learning is a powerful tool for multiclass classification, whereby algorithms are trained using previously classified data in order to detect patterns. This can be achieved using a variety of machine learning algorithms, including logistic regression and support vector machines (SVMs), which are particularly effective for binary categorization.
Although these models are natively binary, we can still apply them to a multiclass problem by combining multiple binary classifiers, using either a one-vs-one or a one-vs-rest (one-vs-all) approach.
In the following sections, we will explore these machine learning (ML) techniques in more depth, along with others such as Naive Bayes classification, decision trees, and K-Nearest Neighbours (KNN).
The one-vs-rest (one-vs-all) approach
A multiclass task with four classes (say dogs, cats, cows, and pigs) can be simplified by decomposing it into four binary classification problems. Each binary problem asks whether a sample belongs to one particular class or to any of the others: dog vs. not-dog, cat vs. not-cat, cow vs. not-cow, and pig vs. not-pig. The same decomposition works for multiclass problems with any number of classes.
There are four distinct classes, so we can utilise four distinct binary classifiers to accurately determine whether or not a given image contains a particular subject. For example, one classifier may be able to confidently identify whether a picture contains a dog, while a second classifier can be used to similarly determine if the image includes a cat. This pattern can be repeated for the other two classes.
Finally, the class whose classifier reports the highest confidence is chosen as the prediction. Any binary model, such as logistic regression, can serve as the per-class classifier, and the sklearn library makes the implementation quick and convenient.
The following code builds a synthetic dataset that can be used to test the models above. The make_classification function generates a dataset with the inputs of our choice: the number of features, classes, and samples.
from sklearn.datasets import make_classification
# Generate 3,000 samples with 12 features spread across 4 classes
X, y = make_classification(n_samples=3000, n_features=12, n_informative=5, n_redundant=5, n_classes=4, random_state=36)
To implement the one-vs-rest strategy, we can use sklearn’s logistic regression model with the multi_class parameter set to 'ovr', short for “one versus the rest.”
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(multi_class='ovr')
model.fit(X, y)  # trains one binary classifier per class under the hood
predictions = model.predict(X)
You can see how to fit the model and use it to generate predictions on the dataset in the above code snippet.
The one-vs-one approach
This strategy is similar to the preceding one, but with a key difference: instead of one binary classifier per class, we train an independent binary classifier for every possible pair of classes.
It is possible to employ a strategy of binary classification in order to differentiate between the four groups of domesticated animals: dogs, cats, cows, and pigs. To do this, we would set up and train six separate binary classifiers, as outlined below.
Classifier 1: Dog vs. Cat
Classifier 2: Dog vs. Cow
Classifier 3: Dog vs. Pig
Classifier 4: Cat vs. Cow
Classifier 5: Cat vs. Pig
Classifier 6: Cow vs. Pig
For an ‘n’-class problem, the one-vs-one approach requires n × (n − 1) / 2 binary classifiers; with four classes that is 4 × 3 / 2 = 6, matching the list above. Note that this trains more models than the one-vs-rest approach. Any binary learner can be plugged in, from logistic regression to support vector machines to k-nearest neighbours and beyond.
The multiclass module of the sklearn package contains the necessary functions for this implementation.
from sklearn.multiclass import OneVsOneClassifier
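A self-contained sketch of the one-vs-one wrapper on the same kind of synthetic four-class dataset used earlier (the choice of logistic regression as the base learner is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

X, y = make_classification(n_samples=3000, n_features=12, n_informative=5,
                           n_redundant=5, n_classes=4, random_state=36)

# OneVsOneClassifier trains one copy of the base learner per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000))
ovo.fit(X, y)

# With 4 classes we expect 4 * 3 / 2 = 6 underlying binary classifiers
print(len(ovo.estimators_))  # 6
predictions = ovo.predict(X)
```

At prediction time, each pairwise classifier casts a vote and the class with the most votes wins.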
Next, let us explore some well-known algorithms that have proven successful in a variety of contexts, such as business tasks, Kaggle challenges, hackathons, and more. These algorithms can give us insight into, and a better understanding of, the data at hand.
Decision tree classifiers
A decision tree is a predictive model that is used to identify a set of decisions and their associated outcomes. In this approach, the dataset is divided into subsets at each level of the decision tree, based on predetermined criteria. The aim of this partitioning and division of data is to group together samples that are highly similar in terms of their statistical properties. This method is used to identify patterns and correlations in the data, which can then be used to make informed predictions.
To choose the most suitable split of the data, we must consider both the entropy of the data and the information gain of each candidate split. Entropy measures how mixed, or random, the labels in a node are; a decision tree favours the split that reduces it the most.
The root nodes of a decision tree are the initial points from which decisions must be made. Subsequent nodes, which require choices to be taken, are known as decision nodes. The eventual outcome of the tree is provided by the leaf nodes at the end of the tree.
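The entropy mentioned above can be computed directly from the label counts. A minimal sketch (the helper function and its labels are illustrative, not part of sklearn):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    # log2(total / c) equals -log2(p), so every term is non-negative
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# A perfectly pure node has zero entropy; an even 50/50 split has one bit
print(entropy(["dog"] * 10))               # 0.0
print(entropy(["dog"] * 5 + ["cat"] * 5))  # 1.0
```

Information gain for a split is then the parent node’s entropy minus the weighted average entropy of the child nodes.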
In this example, we will use the sklearn package to create a dataset with three categories and train a decision tree classifier.
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, n_features=5, n_informative=4, n_redundant=1, n_classes=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1121218)
To use the model, we import it from sklearn, fit it, and generate predictions.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)  # learn the splits from the training set
y_pred = model.predict(X_test)
The model’s predictions are now at hand. To evaluate performance, we can create a confusion matrix, exactly as we would for binary classification.
For multiclass classification in Python, the interpretation of the confusion matrix will be distinct from that of binary classifiers. The Scikit-learn metrics module provides the necessary functions to carry out this type of classification. This can be demonstrated through the example below.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
fig, axe = plt.subplots(figsize=(8, 5))
cmp = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred), display_labels=["class_1", "class_2", "class_3"])
cmp.plot(ax=axe)
An additional substantial benefit of a decision tree is that it allows us to visualise the process of arriving at a decision. To accomplish this, we install the pydotplus package and call the export_graphviz function, which exports the graph from the trained model.
!pip install pydotplus
import pydotplus
from sklearn import tree
from IPython.display import Image, display
dot_data = tree.export_graphviz(model, out_file=None, filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(graph.create_png()))
Iris and newsgroups are just two examples of the kind of datasets you may test this on.
Multiclass classification evaluation metrics
Previously, we explored the use of the confusion matrix to address challenges that involve multiple classes. At present, let us take a look at several other metrics.
In multiclass classification, metrics derived from the confusion matrix remain essential. Precision, recall, and F1 score can each be calculated independently for every class, and these per-class values can then be averaged across classes to produce a single summary figure.
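Scikit-learn exposes both the per-class and the averaged versions through the `average` parameter. A sketch on a small set of toy labels (chosen here purely for illustration):

```python
from sklearn.metrics import f1_score, precision_score

# Toy four-class ground truth and predictions
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 0, 3, 3]

# average=None returns one precision value per class ...
print(precision_score(y_true, y_pred, average=None))

# ... while average="macro" takes the unweighted mean across classes
print(f1_score(y_true, y_pred, average="macro"))
```

Macro averaging treats every class equally; `average="weighted"` instead weights each class by its number of true samples.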
We trained a model using a simulated dataset consisting of 4 classes to have a better grasp of the situation.
Let’s check out the formulas for determining class A metrics.
Total samples that actually belong to class A (the class A row of the matrix) = 59 + 13 + 7 + 14 = 93
True positives (samples correctly predicted as class A) = 59
False positives (samples from other classes incorrectly predicted as class A) = 10 + 9 + 16 = 35
Precision is defined as the ratio of true positives to all positive predictions:
Precision = 59 / (59 + 35) ≈ 0.627
In short, that is the mathematical explanation. Despite being rather straightforward, it is an imperative part of the process. The classification_report() function provided by the Scikit-learn library gathers these statistics for all categories at once.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
It is straightforward to verify that the precision value reported for class 0 (our class A) matches what we estimated, up to a small approximation.
In this article, we have discussed the fundamentals of multiclass classification, its distinction from binary classification, and the approaches for utilising machine learning models for this purpose. To illustrate the concepts, we have demonstrated the use of a decision tree classifier. However, there is a broad range of other classifiers, including random forests, XGBoost, LightGBM, and more, that can be utilised to carry out multiclass classification.
To maximise performance, ensemble methods combine multiple decision trees. When the dataset contains images, models built with deep learning frameworks such as TensorFlow and PyTorch can be employed for multiclass classification. The construction of text categorisation models follows similar principles.
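As a closing sketch, the random forests mentioned above plug into exactly the same workflow as the decision tree; here on the same kind of synthetic three-class dataset (parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, n_informative=4,
                           n_redundant=1, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An ensemble of 100 decision trees; handles multiclass targets natively
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out set
```

Swapping in XGBoost or LightGBM follows the same fit/predict pattern, with their own estimator classes.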