Tuesday, November 1, 2022

Part C: Prepare The Dataset

Multi-Topic Text Classification with Various Deep Learning Models

Author: Murat Karakaya
Date created: 17 09 2021
Date published: 15 03 2022
Last modified: 15 03 2022

Description: This is Part C of the tutorial series “Multi-Topic Text Classification with Various Deep Learning Models”, which covers all the phases of text classification:

  • Exploratory Data Analysis (EDA),
  • Text preprocessing
  • TF Data Pipeline
  • Keras TextVectorization preprocessing layer
  • Multi-class (multi-topic) text classification
  • Deep Learning model design & end-to-end model implementation
  • Performance evaluation & metrics
  • Generating classification report
  • Hyper-parameter tuning
  • etc.

We will design various Deep Learning models by using

  • the Keras Embedding layer,
  • Convolutional (Conv1D) layer,
  • Recurrent (LSTM) layer,
  • Transformer Encoder block, and
  • pre-trained transformer (BERT).

We will cover all the topics related to solving Multi-Class Text Classification problems, with sample implementations in the Python / TensorFlow / Keras environment.

We will use a Kaggle Dataset in which there are 32 topics and more than 400K total reviews.

If you would like to learn more about Deep Learning with practical coding examples,

  • Please subscribe to the Murat Karakaya Akademi YouTube Channel or
  • Do not forget to turn on notifications so that you will be notified when new parts are uploaded.
  • Follow my blog on muratkarakaya.net

You can access all the codes, videos, and posts of this tutorial series from the links below.



        Photo by Jeffrey F Lin on Unsplash

PART C: PREPARE THE DATASET

You can watch this tutorial in English or Turkish using the links below:

So far, we have just observed some properties in the raw data. Using these observations, we are ready to preprocess the text data for a classifier model.

Remember the raw dataset

After operations in Part A, the raw dataset statistics are as follows:

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 422281 entries, 0 to 427230
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   category  422281 non-null  object
 1   text      422281 non-null  object
 2   words     422281 non-null  int64
dtypes: int64(1), object(2)
memory usage: 29.0+ MB
time: 92.3 ms (started: 2022-03-01 12:16:13 +00:00)

data.describe()

[output omitted: descriptive statistics table for the numeric 'words' column]
time: 37.1 ms (started: 2022-03-01 12:16:13 +00:00)

Shuffle Data

It is a good and useful habit to shuffle the data as the very first step of preprocessing, before doing anything else!

Actually, I will shuffle the data again at the last step of the pipeline, but it does not hurt to shuffle the data twice :))

data = data.sample(frac=1)

time: 106 ms (started: 2022-03-01 12:16:13 +00:00)
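If you also want the shuffle to be reproducible across runs, a minimal variant (not used here) is to fix the random seed and, optionally, rebuild a clean integer index:

# reproducible shuffle with a fixed seed; reset_index gives a fresh 0..N-1 index
data = data.sample(frac=1, random_state=42).reset_index(drop=True)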

Convert Categories From Strings to Integer Ids

Observe that the categories (topics/classes) of the reviews are strings:

data["category"]27032               beyaz-esya
396362 temizlik
54487 cep-telefon-kategori
34124 beyaz-esya
367759 sigortacilik
...
363263 sigortacilik
338102 otomotiv
285630 mekan-ve-eglence
39221 beyaz-esya
343939 otomotiv
Name: category, Length: 422281, dtype: object



time: 10.8 ms (started: 2022-03-01 12:16:13 +00:00)

Create integer category ids from text category feature:

data["category"] = data["category"].astype('category')
data.dtypes
category category
text object
words int64
dtype: object



time: 56.2 ms (started: 2022-03-01 12:16:13 +00:00)
data["category_id"] = data["category"].cat.codes
data.tail()
png
time: 29.3 ms (started: 2022-03-01 12:16:13 +00:00)data.dtypescategory       category
text object
words int64
category_id int8
dtype: object



time: 8.03 ms (started: 2022-03-01 12:16:13 +00:00)

Build a Dictionary for id to text category (topic) look-up:

id_to_category = pd.Series(data.category.values,index=data.category_id).to_dict()
id_to_category
{0: 'alisveris',
1: 'anne-bebek',
2: 'beyaz-esya',
3: 'bilgisayar',
4: 'cep-telefon-kategori',
5: 'egitim',
6: 'elektronik',
7: 'emlak-ve-insaat',
8: 'enerji',
9: 'etkinlik-ve-organizasyon',
10: 'finans',
11: 'gida',
12: 'giyim',
13: 'hizmet-sektoru',
14: 'icecek',
15: 'internet',
16: 'kamu-hizmetleri',
17: 'kargo-nakliyat',
18: 'kisisel-bakim-ve-kozmetik',
19: 'kucuk-ev-aletleri',
20: 'medya',
21: 'mekan-ve-eglence',
22: 'mobilya-ev-tekstili',
23: 'mucevher-saat-gozluk',
24: 'mutfak-arac-gerec',
25: 'otomotiv',
26: 'saglik',
27: 'sigortacilik',
28: 'spor',
29: 'temizlik',
30: 'turizm',
31: 'ulasim'}



time: 385 ms (started: 2022-03-01 12:16:13 +00:00)

Build another Dictionary for category (topic) to id look-up:

category_to_id = {v: k for k, v in id_to_category.items()}
category_to_id
{'alisveris': 0,
'anne-bebek': 1,
'beyaz-esya': 2,
'bilgisayar': 3,
'cep-telefon-kategori': 4,
'egitim': 5,
'elektronik': 6,
'emlak-ve-insaat': 7,
'enerji': 8,
'etkinlik-ve-organizasyon': 9,
'finans': 10,
'gida': 11,
'giyim': 12,
'hizmet-sektoru': 13,
'icecek': 14,
'internet': 15,
'kamu-hizmetleri': 16,
'kargo-nakliyat': 17,
'kisisel-bakim-ve-kozmetik': 18,
'kucuk-ev-aletleri': 19,
'medya': 20,
'mekan-ve-eglence': 21,
'mobilya-ev-tekstili': 22,
'mucevher-saat-gozluk': 23,
'mutfak-arac-gerec': 24,
'otomotiv': 25,
'saglik': 26,
'sigortacilik': 27,
'spor': 28,
'temizlik': 29,
'turizm': 30,
'ulasim': 31}



time: 8.54 ms (started: 2022-03-01 12:16:14 +00:00)
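Note that, since cat.codes are assigned in the order of cat.categories, the same two mappings could also be derived directly from the pandas Categorical dtype; a minimal equivalent sketch:

# id -> category taken straight from the Categorical dtype, then inverted
id_to_category = dict(enumerate(data["category"].cat.categories))
category_to_id = {category: idx for idx, category in id_to_category.items()}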

Check the conversions:

print("alisveris id is " , category_to_id["alisveris"])
print("0 is for " , id_to_category[0])
alisveris id is 0
0 is for alisveris
time: 2.76 ms (started: 2022-03-01 12:16:14 +00:00)

Check the number of categories

It should be 32 as we observed in the raw dataset above:

number_of_categories = len(category_to_id)
print("number_of_categories: ",number_of_categories)
number_of_categories: 32
time: 2.82 ms (started: 2022-03-01 12:16:14 +00:00)

Finally, check the columns and rows of the modified data frame:

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 422281 entries, 27032 to 343939
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   category     422281 non-null  category
 1   text         422281 non-null  object
 2   words        422281 non-null  int64
 3   category_id  422281 non-null  int8
dtypes: category(1), int64(1), int8(1), object(1)
memory usage: 10.5+ MB
time: 104 ms (started: 2022-03-01 12:16:14 +00:00)

Reduce the Size of the Total Dataset

Since running the whole pipeline on a large dataset takes more time while you are still testing it, you may prefer to work with only a portion of the raw dataset, as below:

# limit the number of samples to be used in code runs
# Total Number of Reviews is 427230
data_size = 427230
data = data[:data_size]
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 422281 entries, 27032 to 343939
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category 422281 non-null category
1 text 422281 non-null object
2 words 422281 non-null int64
3 category_id 422281 non-null int8
dtypes: category(1), int64(1), int8(1), object(1)
memory usage: 10.5+ MB
time: 96.5 ms (started: 2022-03-01 12:16:14 +00:00)

Split the Raw Dataset into Train, Validation, and Test Datasets

To prevent data leakage during preprocessing the text data, we need to split the text into Train, Validation, and Test datasets.

Data leakage refers to a common mistake in which some information is accidentally shared between the test and training datasets. Typically, when splitting a dataset into test and training sets, the goal is to ensure that no data is shared between these two sets, because the test set’s purpose is to simulate real-world, unseen data. However, during preprocessing and evaluation we do have full access to both the train and test sets, so it is up to us to ensure that no information leaks from the test set into the training process.

In our case, since we want to classify reviews, we must not use the test reviews while preprocessing the text, especially during text vectorization and dictionary (vocabulary) generation.
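For instance, when we later build the vocabulary with the Keras TextVectorization layer (covered in the next part), it should be adapted on the training split only. A minimal sketch of that idea, where train_features is the training split created below and the layer parameters are placeholders rather than the final configuration:

import tensorflow as tf

# build the vocabulary from the TRAINING texts only, so the test reviews stay unseen
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=20000,
                                                    output_sequence_length=50)
vectorize_layer.adapt(train_features)  # never adapt on the validation or test features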

Thus, before beginning the text preprocessing, we will split the dataset into train, validation, and test sets.

NOTE: Even though the fit() method has a validation_split argument for generating a holdout (validation) set from the training data, we cannot use this parameter here because we will use the tf.data.Dataset API to create the data pipeline, and validation_split is not supported when training from Dataset objects. Specifically, this feature requires the ability to index the samples of the dataset, which is not possible in general with the Dataset API.
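Instead, when training from Dataset objects, the validation data is passed to fit() explicitly; a minimal sketch of the pattern we will follow later (model, train_ds, and val_ds are built in the upcoming parts):

# with tf.data pipelines, pass an explicit validation Dataset instead of validation_split
history = model.fit(train_ds, validation_data=val_ds, epochs=5)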

Split Train & Test Datasets

# train_test_split comes from scikit-learn (add the import if it is not already in place)
from sklearn.model_selection import train_test_split

# save features and targets from the 'data'
features, targets = data['text'], data['category_id']

all_train_features, test_features, all_train_targets, test_targets = train_test_split(
    features, targets,
    train_size=0.8,
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=targets
)

time: 228 ms (started: 2022-03-01 12:16:14 +00:00)
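Since we passed stratify=targets, the class proportions should be (almost) identical in the train and test splits; a quick sanity check you could run:

# compare class distributions (as fractions) across the two splits
print(all_train_targets.value_counts(normalize=True).head())
print(test_targets.value_counts(normalize=True).head())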

Reduce the size of the Train Dataset

You might want to decrease the train dataset size to observe its impact on a Deep Learning model. Notice that I still keep the test data size fixed.

print("All Train Data Set size: ",len(all_train_features))
print("Test Data Set size: ",len(test_features))
All Train Data Set size: 337824
Test Data Set size: 84457
time: 3.08 ms (started: 2022-03-01 12:16:14 +00:00)
reduce_ratio = 0.02

reduced_train_features, _, reduced_train_targets, _ = train_test_split(
    all_train_features, all_train_targets,
    train_size=reduce_ratio,
    random_state=42,
    shuffle=True,
    stratify=all_train_targets
)

time: 189 ms (started: 2022-03-01 12:16:14 +00:00)

print("Reduced Train Data Set size: ", len(reduced_train_features))
print("Test Data Set size: ", len(test_features))

Reduced Train Data Set size:  6756
Test Data Set size:  84457
time: 2.8 ms (started: 2022-03-01 12:16:14 +00:00)

Split Train & Validation Datasets

train_features, val_features, train_targets, val_targets = train_test_split(
    reduced_train_features, reduced_train_targets,
    train_size=0.9,
    random_state=42,
    shuffle=True,
    stratify=reduced_train_targets
)

time: 16.6 ms (started: 2022-03-01 12:16:14 +00:00)

print("Train Data Set size: ", len(train_features))
print("Validation Data Set size: ", len(val_features))
print("Test Data Set size: ", len(test_features))

Train Data Set size:  6080
Validation Data Set size:  676
Test Data Set size:  84457
time: 7.57 ms (started: 2022-03-01 12:16:15 +00:00)
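As a preview of the next part, these three splits will be wrapped into tf.data.Dataset pipelines roughly as follows; the batch size and shuffle buffer here are placeholder values, not the final settings:

import tensorflow as tf

batch_size = 64  # placeholder value

# build (text, label) pipelines from the pandas splits
train_ds = tf.data.Dataset.from_tensor_slices(
    (train_features.values, train_targets.values)
).shuffle(buffer_size=len(train_features)).batch(batch_size)

val_ds = tf.data.Dataset.from_tensor_slices(
    (val_features.values, val_targets.values)
).batch(batch_size)

test_ds = tf.data.Dataset.from_tensor_slices(
    (test_features.values, test_targets.values)
).batch(batch_size)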

Summary

In this part, we prepared the dataset by taking several actions and decisions:

  • we converted categories from strings to integer ids,
  • we built look-up dictionaries for id-to-category and category-to-id conversion,
  • we split the dataset into Train, Validation, and Test sets.

In the next part, we will apply the text preprocessing by using the TF Data Pipeline and the Keras TextVectorization layer.

Do you have any questions or comments? Please share them in the comment section.

Thank you for your attention!