Ultimate Guide to Text Cleaning for NLP in Machine Learning Preparation
In this article, I will walk through a guide to preparing string data for machine learning, or what we call text cleaning for NLP. By the end of this piece, you will be able to start using Python to clean up string data in your own projects.
Table of Contents: Ultimate Guide to Text Cleaning for NLP in Machine Learning Preparation
- What is Text Cleaning for NLP in Machine Learning
- Case Normalization
- Eliminate Unicode Characters
- Tokenize all content into a list
- Handle stopwords
- Lemmatization
- POS Tagging for Semantic Processing
- Data preprocessing using Pandas, Numpy and Scikit Learn
- Full Python Scripts of Text Cleaning for NLP in Machine Learning (Includes how to handle NLTK data files in Web App or DApp or functions deployed on Cloud)
What is Text Cleaning for NLP in Machine Learning
Text cleaning is the process of preparing raw text for NLP (Natural Language Processing) so that machines can understand human language clearly and unambiguously. It generally includes three main parts, whatever project you are working on.
- Text Cleaning and impurity removal
- Tokenization and semantic grouping
- Feature and Target Data Structure
Case Normalization
Inconsistent capitalization, whether across different words or within the same word, can confuse the computer when it interprets your content and semantics. So first things first, here is a code sample that normalizes all words to lowercase:
content = content.lower()
Eliminate Unicode Characters
Content often contains special characters and patterns, such as emoji, URLs, and email addresses. These need to be handled well, or they too can confuse computers. Below are two regular-expression code samples that remove URLs and email addresses (remember to import re first).
URLs:
# Remove @mentions, any remaining non-alphanumeric characters, URLs, and a leading "rt"
content = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)|^rt|http\S+", "", content)
Emails:
content = re.sub(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", "", content)
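The paragraph above also mentions emoji. As a minimal sketch, here is one way to strip common emoji with a regular expression; note that the Unicode ranges below are illustrative, not exhaustive.

```python
import re

def remove_emoji(content):
    # Match several common emoji Unicode blocks (emoticons, symbols,
    # transport, and flag characters); extend the ranges as needed.
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) symbols
        "]+"
    )
    return emoji_pattern.sub("", content)

print(remove_emoji("Great article 😀🚀"))
```

This runs before tokenization, so the emoji never reach the token list.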
Tokenize all content into a list
After removing the unnecessary characters, we need to tokenize the content into a list of individual words.
Here is a sample using Python with NLTK:
from nltk.tokenize import word_tokenize
word_tokens = word_tokenize(content)
Handle stopwords
In any language, there are many words that add little meaning on their own; most of them are grammatical function words. Handling stopwords further reduces confusion for the computer by removing these words.
Here is a sample using NLTK, one of the most popular NLP packages in Python:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.data.path.append('nltk_data/corpora/stopwords')
nltk.data.path.append('nltk_data/tokenizers/punkt')
nltk.data.path.append('nltk_data/corpora/wordnet')
stop_words = set(stopwords.words('english'))
If you are interested in building NLTK-based functions and apps deployed on the cloud but are not sure how to deploy the NLTK data files, please subscribe to us using the format at the end of this article.
Last but not least, we can write code to remove all stopwords from the tokenized list above, so that only the core message content remains.
Lemmatization
Human language has past, present, and future tenses, as well as first-, second-, and third-person forms for expressing the same meaning. Lemmatization is a method that reduces these variants to a common base form, eliminating unnecessary word variations.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in word_tokens]
POS Tagging for Semantic Processing
To tabulate the POS totals of large bodies of content, we have to tag and group words based on their grammatical role. Human language has different types of words, such as nouns, verbs, adjectives, and adverbs, and the same word can play different roles in different sentences.
Thus, we need POS tagging to differentiate these roles, so the machine understands each word's purpose without confusion.
Data preprocessing using Pandas, Numpy and Scikit Learn
For more details regarding this topic, please refer to the article I released previously
Full Python Scripts of Text Cleaning for NLP in Machine Learning (Includes how to handle NLTK data files in Web App or DApp or functions deployed on Cloud)
If you are interested in the full Python scripts of Text Cleaning for NLP in Machine Learning Preparation, please subscribe to our newsletter and add the message 'NLP Text Clean + NLTK Data deployment + Full scripts and scraping API free token'. We will send you the script when the up-to-date app script is live.
I hope you enjoy reading Ultimate Guide to Text Cleaning for NLP in Machine Learning Preparation. If you did, please support us by doing one of the things listed below, because it always helps out our channel.
- Support and Donate to our channel through PayPal (paypal.me/Easy2digital)
- Subscribe to my channel and turn on the notification bell: Easy2Digital YouTube channel
- Follow and like my page Easy2Digital Facebook page
- Share the article on your social network with the hashtag #easy2digital
- Sign up for our weekly newsletter to receive Easy2Digital's latest articles, videos, and discount codes
- Subscribe to our monthly membership through Patreon to enjoy exclusive benefits (www.patreon.com/louisludigital)