This article discusses text-cleaning techniques that are essential for getting the best performance out of textual data.
For the demonstration of the text cleaning methods, we will use the text dataset named ‘metamorphosis’ from Kaggle.
Let’s start with importing the required Python libraries for the cleaning process.
Now let’s load the dataset.
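The import and loading steps could look roughly like the sketch below. The file name `metamorphosis_clean.txt` and the helper `load_text` are assumptions made for illustration; use whatever name and location your downloaded copy has.

```python
import re
import string
from pathlib import Path


def load_text(path):
    """Read the whole file into one string (assumes UTF-8 encoding)."""
    return Path(path).read_text(encoding="utf-8")


# Hypothetical location of the data file; replace with your own path.
DATA_PATH = "metamorphosis_clean.txt"

if Path(DATA_PATH).exists():
    text = load_text(DATA_PATH)
    print(text[:200])  # peek at the first 200 characters
```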
Note that for the above code cell to work, you need to supply the local path of the data file.
First, let’s split the text data into words by whitespace.
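A minimal sketch of whitespace splitting, using a short stand-in sentence (the `sample` string is made up for illustration and is not the dataset itself):

```python
# Stand-in sentence for illustration; in practice use the loaded text.
sample = "He lay on his armour-like back, and thought. It wasn't a dream."

# str.split() with no arguments splits on any run of whitespace.
words = sample.split()
print(words)
```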
Here we can see that punctuation inside words is preserved (e.g., armour-like and wasn’t), which is nice. We can also see that end-of-sentence punctuation stays attached to the last word (e.g., thought.), which is not great.
So, this time let’s try splitting the data using non-word characters.
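One way to split on non-word characters is a regular expression with `\W+`. The sketch below again uses a made-up `sample` sentence and drops the empty string that `re.split` leaves when the text ends with punctuation:

```python
import re

# Stand-in sentence for illustration; in practice use the loaded text.
sample = "He lay on his armour-like back, and thought. It wasn't a dream."

# \W+ matches one or more characters that are not letters, digits or underscore.
words = [w for w in re.split(r"\W+", sample) if w]
print(words)
```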
Here we see that words like ‘thought.’ have been converted into ‘thought’. The problem is that words like ‘wasn’t’ are now split into two tokens, ‘wasn’ and ‘t’. We need to fix that.
In Python, string.punctuation provides the full set of ASCII punctuation characters as a single string. We will use it to remove punctuation from our text.
So now we will split the words by whitespace and then strip all punctuation characters from each word.
Here we can see that words like ‘thought.’ no longer carry trailing punctuation, while words like ‘wasn’t’ remain a single token, which is correct. Note that string.punctuation covers only ASCII characters: a typographic apostrophe (’) is left in place, whereas a straight apostrophe (') would be removed, turning the word into ‘wasnt’.
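A sketch of this step using `str.maketrans` and `str.translate`. The `sample` sentence is a stand-in and deliberately uses a typographic apostrophe (U+2019), which is not part of `string.punctuation`:

```python
import string

# Stand-in sentence; \u2019 is the typographic apostrophe.
sample = "He lay on his armour-like back, and thought. It wasn\u2019t a dream."

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)

stripped = [w.translate(table) for w in sample.split()]
print(stripped)
```

The hyphen is ASCII punctuation too, so ‘armour-like’ becomes ‘armourlike’ with this approach.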
Sometimes the text also contains characters that are not printable, and we need to filter those out too. To do this, we can use Python’s ‘string.printable’, a string containing the ASCII characters that are considered printable. We will remove every character that is not present in it.
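A minimal sketch of the filter, with a few made-up words. Because `string.printable` is ASCII-only, this also drops accented letters and typographic quotes, which may or may not be what you want:

```python
import string

# Set lookup is faster than scanning the string for every character.
printable = set(string.printable)

# Made-up examples: an accented letter, a NUL byte, and a clean word.
words = ["caf\u00e9", "dream\x00", "back"]

cleaned = ["".join(ch for ch in w if ch in printable) for w in words]
print(cleaned)
```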
Now let’s convert all the words to lowercase. This reduces the size of our vocabulary, but it also has a disadvantage: after lowercasing, ‘Apple’ the company and ‘apple’ the fruit will be treated as the same entity.
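Lowercasing can be sketched as follows (the word list is made up for illustration); comparing the set sizes shows how the vocabulary shrinks:

```python
# Made-up word list for illustration.
words = ["Apple", "apple", "Gregor", "DREAM"]

lowered = [w.lower() for w in words]

# 'Apple' and 'apple' collapse into one vocabulary entry.
print(len(set(words)), "->", len(set(lowered)))
```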
Also, single-character words contribute little to most NLP tasks, so we will remove those too.
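Dropping single-character tokens is a one-line filter; the word list here is again only an illustration:

```python
# Made-up tokens, including single-character leftovers.
words = ["it", "wasn", "t", "a", "dream"]

# Keep only tokens longer than one character.
longer = [w for w in words if len(w) > 1]
print(longer)
```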
In NLP, very frequent words such as ‘is’, ‘as’, and ‘the’ do not contribute much to model training. Such words are known as stop words, and removing them during text cleaning is generally recommended.
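The sketch below filters against a small hand-written stop word set, which is an assumption made to keep the example self-contained. A fuller English list is available from `nltk.corpus.stopwords.words("english")` once the corpus has been fetched with `nltk.download("stopwords")`.

```python
# Tiny hand-picked stop word set, for illustration only.
STOP_WORDS = {"is", "as", "the", "a", "an", "and", "of", "to", "in", "it"}

# Made-up, already lowercased tokens.
words = ["the", "room", "is", "as", "quiet", "as", "a", "dream"]

filtered = [w for w in words if w not in STOP_WORDS]
print(filtered)
```

Stop word lists are usually lowercase, so this step works best after the lowercasing step.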
At this point, we will reduce words that share the same root to a single form. For example, we will reduce ‘running’, ‘run’, and ‘runs’ to ‘run’, since all three carry essentially the same meaning for the model during training. This can be done using the PorterStemmer class in the nltk library.
Stemmed words are not always real words, because a stemmer simply strips suffixes by rule. If you need the output to be real words, use lemmatization instead of stemming: it maps each word to its dictionary form (lemma), so the result remains meaningful after transformation.
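A short sketch of stemming with `PorterStemmer`, assuming the nltk package is installed (the stemmer itself needs no extra corpus download):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# The three example forms from the text.
words = ["running", "run", "runs"]

stems = [stemmer.stem(w) for w in words]
print(stems)
```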
Now let’s remove the tokens that are not made up of alphabetic characters alone.
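This filter can be sketched with `str.isalpha()`, which returns True only when every character in a non-empty string is a letter. The tokens below are made up for illustration:

```python
# Made-up tokens: plain words, a number, and a mixed token.
words = ["dream", "1915", "room2", "back"]

# Keep only tokens consisting entirely of letters.
alpha_only = [w for w in words if w.isalpha()]
print(alpha_only)
```

Be aware that tokens still containing an apostrophe or hyphen fail this check as well, so they would be dropped at this stage.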
At this stage, the textual data seems decent enough to be used for the word embedding techniques. But also note that there might be some additional steps to this process for some special kinds of data (for example, HTML code).
I hope you like the article. If you have any thoughts on the article then please let me know. Any constructive feedback is highly appreciated.
Have a great day!