The image above was created using AI. More specifically, it was the first image DALL·E 3 generated when given the prompt “Robots Reading Hate Tweets”.
In the previous blog post, we discussed the goals of this project: to create a machine learning program that can classify hate speech based on the specific identity groups it targets. We also covered the first step in this process, data preprocessing, where we cleaned and organized the dataset and prepared it for model training.
In this second part, we’ll dive into the roster of models we will test and compare, all of which are imported from the Scikit-learn Python library. This is the stage where we analyze various machine learning algorithms to determine which ones are most effective (or even applicable) for classifying hate speech by identifier. Each model brings its own advantages, and in this post, I’ll briefly explain some of the more commonly used ones.
Importing the Models
To begin, let’s import all the classifiers we plan to use (yes, it is a huge chunk of code):
from sklearn.calibration import CalibratedClassifierCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.ensemble import (
    AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier,
    GradientBoostingClassifier, RandomForestClassifier, VotingClassifier,
)
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import (
    LogisticRegression, LogisticRegressionCV, PassiveAggressiveClassifier,
    Perceptron, RidgeClassifier, RidgeClassifierCV, SGDClassifier,
)
from sklearn.mixture import GaussianMixture
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier, OutputCodeClassifier
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid, RadiusNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.semi_supervised import LabelPropagation, LabelSpreading
from sklearn.svm import LinearSVC, NuSVC, OneClassSVM, SVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
lg = LogisticRegression(penalty='l1', solver='liblinear')
sv = SVC(kernel='sigmoid', gamma=1.0)
mnb = MultinomialNB()
dtc = DecisionTreeClassifier(max_depth=5)
knn = KNeighborsClassifier()
rfc = RandomForestClassifier(
    n_estimators=1000,  # Number of trees in the forest
    random_state=2      # Ensures reproducibility
)
etc = ExtraTreesClassifier(
    n_estimators=1000,  # Number of trees in the forest
    random_state=2      # Ensures reproducibility
)
abc = AdaBoostClassifier(n_estimators=50, random_state=2)
bg = BaggingClassifier(n_estimators=50, random_state=2)
gbc = GradientBoostingClassifier(n_estimators=50, random_state=2)
etc_single = ExtraTreeClassifier()
dtc_single = DecisionTreeClassifier()
ocsvm = OneClassSVM()
mlp = MLPClassifier()
rnc = RadiusNeighborsClassifier()
knn_single = KNeighborsClassifier()
chain = ClassifierChain(base_estimator=LogisticRegression())
moc = MultiOutputClassifier(estimator=LogisticRegression())
occ = OutputCodeClassifier(estimator=LogisticRegression())
ovo = OneVsOneClassifier(estimator=LogisticRegression())
ovr = OneVsRestClassifier(estimator=LogisticRegression())
sgd = SGDClassifier()
ridge_cv = RidgeClassifierCV()
ridge = RidgeClassifier()
pac = PassiveAggressiveClassifier()
gpc = GaussianProcessClassifier()
vc = VotingClassifier(estimators=[('lr', lg), ('rf', rfc), ('svc', sv)])
bnb = BernoulliNB()
cccv = CalibratedClassifierCV(base_estimator=LogisticRegression())
cccv_iso = CalibratedClassifierCV(base_estimator=RidgeClassifier())
gnb = GaussianNB()
lp = LabelPropagation()
ls = LabelSpreading()
lda = LinearDiscriminantAnalysis()
linsvc = LinearSVC()
logreg_cv = LogisticRegressionCV(
solver='lbfgs', # Handles multiclass problems well
Cs=10, # A range of regularization strengths
max_iter=2000, # Increase the maximum number of iterations
random_state=2, # Ensures reproducibility
cv=5 # Number of folds for cross-validation
)
nc = NearestCentroid()
nusvc = NuSVC()
perc = Perceptron()
qda = QuadraticDiscriminantAnalysis()
svc = SVC(
    kernel='rbf',    # Radial basis function kernel
    C=1,             # Regularization parameter
    gamma='scale',   # Kernel coefficient
    max_iter=-1,     # -1 means no limit on the number of iterations (the default)
    random_state=2   # Ensures reproducibility
)
gm = GaussianMixture()
Luckily for you (and tediously for me), all of the classifiers in this list work and are implementable for this specific task. We’ve imported a wide range of models from sklearn—a popular machine learning library in Python. These models include classifiers from various categories: decision trees, support vector machines, ensemble methods, and more. Let’s break down some of the common ones:
Logistic Regression
Logistic Regression is one of the most widely used algorithms for binary classification problems. It estimates the probability of a binary event using a logistic function, and it’s particularly useful when dealing with large datasets and simple linear decision boundaries. Scikit-learn’s implementation also extends naturally to multiclass problems like ours by fitting one binary model per class under the hood.
lg = LogisticRegression(penalty='l1', solver='liblinear')
Here, we’ve created a LogisticRegression model using an L1 regularization penalty, which helps prevent overfitting by driving some coefficients all the way to zero, effectively performing feature selection. The liblinear solver is well-suited for smaller datasets and is one of the solvers that supports the L1 penalty.
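To make the effect of that L1 penalty concrete, here’s a minimal sketch on synthetic data (a stand-in for our TF-IDF features, not the real tweets): compared with the usual L2 penalty, L1 zeroes out a chunk of the coefficients entirely.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import numpy as np

# Synthetic stand-in for the TF-IDF features from Part 1: 200 "tweets",
# 50 features, only a handful of which carry real signal.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=2)

for penalty in ('l2', 'l1'):
    model = LogisticRegression(penalty=penalty, solver='liblinear').fit(X, y)
    # With L1, a noticeable share of the coefficients is exactly zero;
    # with L2 they are merely shrunk, not eliminated.
    print(penalty, ':', np.count_nonzero(model.coef_), 'of', model.coef_.size, 'non-zero coefficients')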
Decision Trees and Extra Trees
Decision Tree Classifiers split the dataset into subsets based on feature values, building a tree where each internal node represents a feature and each leaf node represents a class label. They’re intuitive and interpretable, making them a good starting point for many classification tasks.
dtc = DecisionTreeClassifier(max_depth=5)
etc = ExtraTreesClassifier(n_estimators=1000, random_state=2)
In addition to a standard DecisionTreeClassifier, we’re using ExtraTreesClassifier, an ensemble method similar to Random Forest (another algorithm we test in this project). It builds multiple trees (in this case, 1000) and averages their predictions for more robust results.
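A nice side effect of tree ensembles is that they come with a built-in measure of feature importance, which for TF-IDF features maps back to individual tokens. As a sketch, assuming etc has been fitted on the Part 1 feature matrix and that vectorizer is the fitted TfidfVectorizer from preprocessing (both names are placeholders):
import numpy as np

# Placeholder names: `etc` fitted on the Part 1 TF-IDF matrix, `vectorizer`
# the fitted TfidfVectorizer from the preprocessing step.
top = np.argsort(etc.feature_importances_)[::-1][:10]
print(vectorizer.get_feature_names_out()[top])  # the ten tokens the ensemble leaned on most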
Support Vector Machines (SVM)
SVM is a powerful algorithm for both linear and non-linear classification problems. It works by finding a hyperplane that best separates different classes in the feature space. The kernel trick is used to transform data into a higher-dimensional space when the data is not linearly separable.
svc = SVC(kernel='rbf', C=1, gamma='scale', random_state=2)
Here, we’re using the SVC with a Radial Basis Function (RBF) kernel, which helps capture complex, non-linear relationships between the features. The C parameter controls the tradeoff between a smooth decision boundary and correctly classifying the training points.
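To get a feel for what C does, here’s a quick sketch on synthetic data (not our tweet data): a small C tolerates more margin violations, so more training points tend to end up as support vectors, while a large C hugs the training data more tightly.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data purely to illustrate the C trade-off.
X, y = make_classification(n_samples=200, n_features=20, random_state=2)

for C in (0.01, 1, 100):
    model = SVC(kernel='rbf', C=C, gamma='scale', random_state=2).fit(X, y)
    print(f"C={C}: {len(model.support_)} support vectors")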
K-Nearest Neighbors (KNN)
K-Nearest Neighbors is a simple and intuitive algorithm that classifies new points based on their proximity to existing labeled data. KNN works well with small datasets and is non-parametric, meaning it makes no assumptions about the underlying data distribution.
knn = KNeighborsClassifier()
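With the defaults, k is 5. A tiny sketch on synthetic data shows that the prediction really is just a majority vote among the nearest labeled neighbors:
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Part 1 feature vectors.
X, y = make_classification(n_samples=100, n_features=10, random_state=2)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Look up the five nearest training points for the first sample and compare
# their labels with the model's prediction.
distances, indices = knn.kneighbors(X[:1])
print("neighbor labels:", y[indices[0]], "-> prediction:", knn.predict(X[:1])[0])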
Neural Networks (MLPClassifier)
Neural networks are inspired by the human brain and consist of layers of interconnected nodes. For text classification, we often use Multi-Layer Perceptrons (MLP), which are a type of fully connected feedforward neural network.
mlp = MLPClassifier()
The MLPClassifier uses backpropagation to adjust the weights during training. It’s particularly useful for capturing complex patterns in data but can be computationally expensive.
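In this project we simply benchmark the defaults shown above, but if training cost becomes a problem, these are the knobs I’d reach for first (a sketch, not what we actually ran):
from sklearn.neural_network import MLPClassifier

# A sketch of cost-control options; the project itself uses the defaults.
mlp_tuned = MLPClassifier(
    hidden_layer_sizes=(100,),  # one hidden layer of 100 neurons (the default architecture)
    early_stopping=True,        # hold out 10% of the training data and stop once the score plateaus
    max_iter=300,               # cap the number of training epochs
    random_state=2,             # reproducibility
)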
Ensemble Methods: Random Forest, AdaBoost, Gradient Boosting
Ensemble methods combine multiple models to improve overall performance. The idea is that by combining the strengths of several weak learners, we can build a more robust classifier.
rfc = RandomForestClassifier(n_estimators=1000, random_state=2)
abc = AdaBoostClassifier(n_estimators=50, random_state=2)
gbc = GradientBoostingClassifier(n_estimators=50, random_state=2)
Random Forest builds multiple decision trees and averages their predictions. AdaBoost trains a sequence of weak learners, with each new learner focusing on the mistakes of the previous one. Gradient Boosting builds models sequentially, where each new model tries to correct the errors of its predecessor.
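The “each new model corrects the errors of its predecessor” idea is easy to see with staged_predict, which replays the boosted ensemble’s predictions after each added tree. A quick sketch on synthetic data (not our tweet data):
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data purely to illustrate how boosting improves stage by stage.
X, y = make_classification(n_samples=500, n_features=20, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

gbc_demo = GradientBoostingClassifier(n_estimators=50, random_state=2).fit(X_tr, y_tr)

# staged_predict yields the ensemble's test-set predictions after 1, 2, ... trees.
for i, y_pred in enumerate(gbc_demo.staged_predict(X_te), start=1):
    if i in (1, 10, 50):
        print(f"{i:>2} trees: accuracy {accuracy_score(y_te, y_pred):.3f}")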
Training and Comparing the Models
Once we’ve selected and defined our models, we’ll train them using our preprocessed dataset and evaluate their performance on the test set. This will allow us to compare the efficacy of different models in classifying hate speech by identifier.
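As a preview, the comparison loop itself is short. Here’s a minimal sketch, assuming the X_train/X_test feature matrices and y_train/y_test labels from Part 1 (those variable names are placeholders) and using a handful of the models defined above:
from sklearn.metrics import accuracy_score, f1_score

# Placeholder names: X_train/X_test and y_train/y_test are the feature
# matrices and identity-group labels produced during preprocessing in Part 1.
candidates = {'LogisticRegression': lg, 'RandomForest': rfc, 'LinearSVC': linsvc, 'MultinomialNB': mnb}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = (accuracy_score(y_test, y_pred),
                     f1_score(y_test, y_pred, average='macro'))  # macro F1 weighs every identity group equally

for name, (acc, f1) in sorted(results.items(), key=lambda kv: kv[1][1], reverse=True):
    print(f"{name:<20} accuracy={acc:.3f}  macro-F1={f1:.3f}")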
In the next blog post, we’ll go deeper into model evaluation, tuning hyperparameters, and drawing conclusions about which models perform best for our classification task. Stay tuned for Part 3, where we’ll explore the results of our model training and evaluation!
