Classifying Hate Speech by Targeted Identifiers: Part 1

The image above was created using AI; specifically, it was the first image DALL·E 3 generated when given the prompt “Classifying Hate Speech by Targeted Identifiers”.

Since Elon Musk’s acquisition of X in 2022, daily instances of slurs against Black Americans, gay men, and women have jumped to 3,876, 3,964, and 17,937, respectively. In the digital age, hate speech is an ongoing problem, and platforms are tasked with identifying and managing harmful content. But what if we could take it a step further? Rather than just flagging hate speech, what if we could pinpoint which identity group it targets—whether it’s race, religion, gender, or other identifiers? This project aims to do just that: build a hate speech classifier that categorizes the specific demographic or identity group under attack.

This first post will discuss the overall goal and walk through the initial stage of data preprocessing. Our focus is on cleaning and organizing the data in preparation for machine learning models.

Project Goal

The broader objective of this project is twofold:

  1. To test and compare the efficacy of different machine learning (ML) text classification methods. By evaluating multiple approaches, we can understand which models work best for this specific task.
  2. To create an effective program that classifies hate speech based on the targeted identifier, refining model hyperparameters and ensuring it can reliably categorize hate speech according to the identity group being attacked.

Loading and Processing the Dataset

We begin by loading the dataset, which comes from UC Berkeley’s D-Lab. The dataset contains user-generated content labeled with a hate_speech_score and indicators of whether the content targets particular groups. We load it as follows:

import pandas as pd
import numpy as np
import random
df = pd.read_parquet("hf://datasets/ucberkeley-dlab/measuring-hate-speech/measuring-hate-speech.parquet")

Here, we use the pandas library to load the dataset from a .parquet file, which contains a large amount of text data related to hate speech. The data includes a hate_speech_score, which helps us determine whether a piece of text is considered hate speech, and additional columns that indicate whether specific demographic groups are targeted.
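
Before going further, it helps to confirm the load worked and glance at the fields we’ll use. Here is a quick sanity check (a minimal sketch; the column names come straight from the dataset loaded above):

# Overall size, plus a peek at the text, score, and two of the target flags
print(df.shape)
print(df[['text', 'hate_speech_score', 'target_race', 'target_gender']].head())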

However, not all of this data is useful to us; the dataset is almost too informational, including labels such as each annotator’s political affiliation. What is useful is that the dataset provides multiple demographic identifiers. We will focus on seven of them: sexuality, religion, race, national origin, gender, disability, and age. To do this, we define the valid categories our classifier will use:

valid_categories = [
    'target_sexuality', 'target_religion', 'target_race', 'target_origin',
    'target_gender', 'target_disability', 'target_age',
]

One complication: a single instance of hate speech can target multiple categories at once (yes, this is possible), and the dataset marks every applicable target column as True for that row. As such, we will need to make 7 different “mini-datasets,” each consisting of all non-hate-speech rows plus all hate-speech rows flagged with the corresponding target identifier:

data_target_sexuality = []
data_target_religion = []
data_target_race = []
data_target_origin = []
data_target_gender = []
data_target_disability = []
data_target_age = []

Each list will store text instances labeled either with the group they target or as “not hate speech.” Since looking up a variable by name with eval() is fragile and bad practice, we first map each column name to its list, then iterate through the dataset and organize it:

# Map each target column to its corresponding list
category_lists = {
    'target_sexuality': data_target_sexuality,
    'target_religion': data_target_religion,
    'target_race': data_target_race,
    'target_origin': data_target_origin,
    'target_gender': data_target_gender,
    'target_disability': data_target_disability,
    'target_age': data_target_age,
}

for _, row in df.iterrows():
    if row['hate_speech_score'] <= 0:
        # Scores of zero or below count as "not hate speech" in every category
        for column_name in valid_categories:
            category_lists[column_name].append([row['text'], 'not_hate_speech'])
    else:
        # A row can target several groups, so append it to every matching list
        for column_name in valid_categories:
            if row[column_name]:
                category_lists[column_name].append([row['text'], column_name])

In short, two things are happening:

  1. If the hate_speech_score is zero or below, the text is added to every list as “not hate speech.”
  2. Otherwise, the code checks which identifiers (race, religion, gender, and so on) are targeted and appends the text to every matching list.

For simplicity, let’s focus on one category, such as target_age. We create a new DataFrame for that target group, named df_age so we don’t overwrite the original df:

df_age = pd.DataFrame(data_target_age, columns=["text", "type"])
print(df_age)
df_age.describe(include='all')
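
Before modeling, it’s worth checking how balanced this mini-dataset is; the “not hate speech” rows will likely far outnumber the target_age rows, since every non-hate-speech row is included. A quick check using pandas built-ins (a minimal sketch):

# Count how many rows fall into each class
print(df_age['type'].value_counts())

# The same counts as proportions, to see the imbalance at a glance
print(df_age['type'].value_counts(normalize=True))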

Now, to build an effective machine learning model, we need to split the data into training and testing sets:

from sklearn.model_selection import train_test_split

X = df_age['text']
y = df_age['type']
# random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
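
Because the classes are imbalanced, it can also help to stratify the split so the train and test sets keep the same class proportions. This is an optional refinement, shown here as a sketch:

# stratify=y preserves the not_hate_speech / target_age ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)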

Finally, we need to convert the text data into a numerical format that machine learning models can process. We use a technique called Term Frequency-Inverse Document Frequency (TF-IDF):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

This step transforms each piece of text into a vector of word weights. TF-IDF assigns a word a high weight when it appears frequently in a given document but rarely across the rest of the corpus, so distinctive terms carry more importance than common filler words. Note that we fit the vectorizer on the training set only and reuse it to transform the test set, which keeps information from the test data from leaking into training.
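
To sanity-check the result, we can inspect the shapes of the feature matrices and a few of the learned vocabulary terms (a quick diagnostic sketch; exact numbers will vary with your split):

# Rows are documents, columns are vocabulary terms
print(X_train_tfidf.shape, X_test_tfidf.shape)

# A sample of the terms the vectorizer learned from the training text
print(vectorizer.get_feature_names_out()[:10])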

Conclusion

In this first post, we set up the foundation for classifying hate speech by targeted identifiers. We covered the project’s goal of creating a nuanced classifier and walked through the data preprocessing stage, where we loaded the dataset, organized the data by categories, and vectorized the text. In the next parts, we’ll explore model selection and training to classify hate speech based on these identifiers.

Stay tuned!
