Ultimate Guide to Text Cleaning for NLP in Machine Learning Preparation

In this article, I will walk through an ultimate guide to preparing string data for machine learning, a process usually called text cleaning for NLP. By the end of this piece, you will be able to start using Python to clean up string data in your own project.

Table of Contents: Ultimate Guide to Text Cleaning for NLP in Machine Learning Preparation

  • What is Text Cleaning for NLP in Machine Learning
  • Case Normalization
  • Eliminate Unicode Characters
  • Tokenize the Content into a List of Words
  • Handle stopwords
  • Lemmatization
  • POS Tagging for Semantic Processing
  • Data preprocessing using Pandas, Numpy and Scikit Learn
  • Full Python Scripts of Text Cleaning for NLP in Machine Learning

What is Text Cleaning for NLP in Machine Learning

Text cleaning is the process of preparing raw text for NLP (Natural Language Processing) so that machines can understand human language clearly and without ambiguity. Whatever project you are working on, it basically includes three main parts:

  • Text cleaning and impurity removal
  • Tokenization and grouping by semantics
  • Feature and target data structures

Case Normalization

Inconsistent capitalization, even of the same word, can confuse the computer about your content and its semantics. So, first things first, here is a code sample that converts all words to a standard lowercase form:

content = content.lower()

Eliminate Unicode Characters

Raw content often contains plenty of noisy characters and patterns, such as emoji, URLs, and email addresses. These need to be handled properly, otherwise they can also confuse the computer. Below are two regular expression code samples for removing URLs and email addresses.

URLs:

content = re.sub(r"(@\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", content)

Emails:

content = re.sub(r"([A-z0-9._%+-]+@[A-z0-9.-]+\.[A-z]{2,4}

", "", content)

Tokenize the Content into a List of Words

After removing all the unnecessary characters, we need to turn the content into a list of tokens.

Here is a sample using Python and NLTK:

from nltk.tokenize import word_tokenize

word_tokens = word_tokenize(content)
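
For example, tokenizing a short sentence splits it into individual words and punctuation (a quick illustration; it assumes NLTK's punkt data has been downloaded, which is covered in the stopwords section below):

word_tokenize("Text cleaning prepares raw text for NLP.")
# ['Text', 'cleaning', 'prepares', 'raw', 'text', 'for', 'NLP', '.']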

Handle stopwords

Every language contains plenty of words that add little to the meaning of a sentence; most of them exist purely for grammar. Handling stopwords means removing these words so there is even less to confuse the computer.

Here is a sample using NLTK, one of the most popular NLP packages in Python:

   import nltk
   from nltk.corpus import stopwords

   nltk.download('stopwords')
   nltk.download('punkt')
   nltk.download('wordnet')

   # point NLTK at locally bundled data files (useful for cloud deployments)
   nltk.data.path.append('nltk_data/corpora/stopwords')
   nltk.data.path.append('nltk_data/tokenizers/punkt')
   nltk.data.path.append('nltk_data/corpora/wordnet')

   stop_words = set(stopwords.words('english'))

If you are interested in using NLTK in functions or apps deployed to the cloud, but are not sure how to deploy the NLTK data files, please subscribe to us using the format given at the end of this article.

Last but not least, we can write code to remove all stopwords from the list we tokenized above, so that only the core message and content remain.
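
Here is a minimal sketch of that step, assuming the word_tokens list from the tokenization step and the stop_words set defined above:

   # keep only the tokens that are not stopwords
   filtered_tokens = [word for word in word_tokens if word not in stop_words]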

Lemmatization

Human language has past, present and future tenses, and the same idea can be expressed in the first, second or third person. Lemmatization handles these variations by reducing each word to its base form, so the machine does not treat different forms of the same word as unrelated tokens.

   from nltk.stem import WordNetLemmatizer

   lemmatizer = WordNetLemmatizer()

   # lemmatize the stopword-filtered tokens from the previous step
   lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
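
A quick note on how WordNetLemmatizer behaves: by default it treats every token as a noun, so passing a part-of-speech hint gives better results for verbs:

   lemmatizer.lemmatize('running')           # 'running' (treated as a noun)
   lemmatizer.lemmatize('running', pos='v')  # 'run'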

POS Tagging for Semantic Processing

To tabulate part-of-speech (POS) totals across large bodies of content, we have to tag and group words based on their semantic purpose. Human language has different types of words, such as nouns, verbs, adjectives and adverbs, and the same word can play different roles in different parts of speech.

Thus, we need POS tagging to distinguish these roles and their purpose, so the machine can understand them without getting confused.
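
Here is a minimal sketch using NLTK's pos_tag function, assuming the lemmatized_tokens list from the previous step (the download line fetches the tagger model NLTK uses by default):

   import nltk

   nltk.download('averaged_perceptron_tagger')

   # tag each token with its part of speech, e.g. ('text', 'NN')
   pos_tags = nltk.pos_tag(lemmatized_tokens)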

Data preprocessing using Pandas, Numpy and Scikit Learn

For more details on this topic, please refer to the article I released previously:

Tips for Data Preprocessing Using Python and Scikit Learn
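
As a quick taste of that step, here is a minimal sketch of my own (not taken from that article) showing how the cleaned tokens from above could be turned into a numeric feature matrix with scikit-learn and pandas:

   import pandas as pd
   from sklearn.feature_extraction.text import TfidfVectorizer

   # join the cleaned tokens back into a document string; in a real project
   # you would have one string per document
   cleaned_docs = [" ".join(lemmatized_tokens)]

   vectorizer = TfidfVectorizer()
   features = vectorizer.fit_transform(cleaned_docs)

   # a DataFrame view of the TF-IDF feature matrix
   feature_df = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())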

Full Python Scripts of Text Cleaning for NLP in Machine Learning (Includes how to handle NLTK data files in Web App or DApp or functions deployed on Cloud)

If you are interested in the full Python scripts of Text Cleaning for NLP in Machine Learning Preparation, please subscribe to our newsletter and add the message 'NLP Text Clean + NLTK Data deployment + Full scripts and scraping API free token'. We will send you the script when the up-to-date version is live.

I hope you enjoyed reading the Ultimate Guide to Text Cleaning for NLP in Machine Learning Preparation. If you did, please support us by doing one of the things listed below, as it always helps out our channel.