Introduction to Text Mining: 7 techniques for text analytics

Text data is all around us! We see it in news articles, tweets, Facebook posts, and so on. It is also present in documents (PDF, Word, etc.), which could be contracts, legal documents, policy documents, and more.

Text mining, or processing text data, is important because actionable insights can be derived from such data. In the past this data was ignored, but with the emergence of high-end, low-cost computing through technologies such as Big Data, it is now possible to analyze unstructured data and incorporate it into data science models.

As an example, an insurance company could augment its churn prediction models with text data from policy documents, claim documents, and customers' interactions with call-center agents.

Retailers could incorporate customers' product reviews into their pricing or campaign strategies.

The list of such use cases goes on across multiple industries. However, it is important to understand that unlike structured data, which comes from traditional systems such as Enterprise Resource Planning (ERP) or Customer Relationship Management (CRM) systems, text data can come from non-traditional sources such as social media.

Another challenge with text data is that while humans can understand the context behind a sentence, it is difficult for a computer to do the same (at least the current ones can't!). So there are techniques and methods to extract information from raw text so that computers can understand it and it can be used intelligently in building data science models.

In this article, I would like to explain these techniques through an example. I took an editorial published in the Times of India and used it to showcase how text mining techniques can be applied to such data. I will also post the entire code used for this article on GitHub.

For natural language processing, Python offers several libraries, such as NLTK and spaCy. Even deep learning libraries such as TensorFlow provide text processing capabilities. Here, we will use NLTK and spaCy for text mining.

Technique 1: Tokenization

The first technique is called Tokenization. Since a document or article is composed of sentences (grouped into paragraphs), and sentences are in turn composed of words, tokenization is the process of breaking a document down into sentences and each sentence into words. The first paragraph of the article is:

“Three IITs, India’s premier higher education institutions (HEIs), figure in the top 200 institutions across the world ranked according to employability of students in the 2022 QS Graduate Employability Rankings. No Indian HEI is in the top 100. In comparison, an HEI each from China and Hong Kong have challenged Anglosphere hegemony in the top 10. That no Indian HEI has been able to breach the top 100 aptly sums up the employability crisis of Indian graduates.”

This can be broken down into four sentences:

1. "Three IITs, India's premier higher education institutions (HEIs), figure in the top 200 institutions across the world ranked according to employability of students in the 2022 QS Graduate Employability Rankings."
2. "No Indian HEI is in the top 100."
3. "In comparison, an HEI each from China and Hong Kong have challenged Anglosphere hegemony in the top 10."
4. "That no Indian HEI has been able to breach the top 100 aptly sums up the employability crisis of Indian graduates."

Further, each sentence can be broken down into words. The first sentence, for example, can be broken down into the following words.

['Three', 'IITs', ',', 'India', '’', 's', 'premier', 'higher', 'education', 'institutions', '(', 'HEIs', ')', ',', 'figure', 'in', 'the', 'top', '200', 'institutions', 'across', 'the', 'world', 'ranked', 'according', 'to', 'employability', 'of', 'students', 'in', 'the', '2022', 'QS', 'Graduate', 'Employability', 'Rankings', '.']
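A minimal sketch of this step with NLTK, which produces the token list above; it assumes the full article text is stored in a string named text (a variable name chosen here for illustration):

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text)      # split the document into sentences
words = word_tokenize(sentences[0])  # split the first sentence into words
print(words)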

Notice that punctuation marks, brackets, and numbers have also been identified as tokens. We need to eliminate these before we can progress further. It is also important to normalize the case of the words.

Technique 2: Lowercasing and removal of punctuation marks, numbers, etc.

We can remove the numbers and punctuation marks by checking whether each token is purely alphabetic (not merely alphanumeric, since numbers such as "200" should also go), and then lowercase what remains. After this operation, the sentence above becomes:

'three iits india s premier higher education institutions heis figure in the top institutions across the world ranked according to employability of students in the qs graduate employability rankings'
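A minimal sketch of this step, reusing the words list from the tokenization example:

# str.isalpha() is False for punctuation and for numbers such as '200' and '2022'
clean_words = [w.lower() for w in words if w.isalpha()]
print(' '.join(clean_words))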

Technique 3: Removal of Stop Words

Stopwords are commonly used words in a language, such as "a", "an", and "the". They usually don't carry much meaning and need to be removed. NLTK and spaCy provide lists of stopwords that can be applied to the text to strip out such common words and leave only the relevant ones.

After removing stop words, the sentence looks like this. It is now harder to read, though some meaning can still be derived from it:

'three iits india premier higher education institutions heis figure top institutions across world ranked according employability students qs graduate employability rankings'
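A sketch of this step using NLTK's English stopword list, continuing from clean_words above (the exact output depends on which library's stopword list is used):

import nltk
nltk.download('stopwords')  # one-time download of the stopword lists

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_words = [w for w in clean_words if w not in stop_words]
print(' '.join(filtered_words))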

Technique 4: Stemming

In any language, especially English, words are formed from a stem. For example, the words help, helper, helping, and helped all come from the same stem: help. More details about stemming can be found on Wikipedia.

So, to reduce the number of words in a sentence and bring each word back to its stem, we use the stemmers available in NLTK. The most common one is the Porter Stemmer.
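A minimal sketch of this step, continuing with the filtered word list from the stop-word step:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in filtered_words]  # reduce each word to its stem
print(' '.join(stemmed_words))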

'three iit india premier higher educ institut hei figur top institut across world rank accord employ student qs graduat employ rank'

This is how the sentence looks after applying the Porter Stemmer. The word "education", for example, has been reduced to "educ". Duplicate stems can then be removed, and the sentence becomes:

'qs institut iit hei student india educ top world figur graduat rank accord higher across three employ premier'

This has little meaning from a human perspective, but the content has been reduced to a form that can be fed into a machine learning algorithm.

Technique 5: Lemmatization

As per Wikipedia,

“Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form.”

In other words, lemmatization can be applied to a sentence as an alternative to stemming. Lemmatization derives the base form using the part of speech: it uses context, whereas stemming simply chops a word down to its stem regardless of the part of speech it is used in.

In NLTK, the part of speech has to be provided explicitly; otherwise the lemmatizer treats every word as a noun.
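A minimal sketch with NLTK's WordNetLemmatizer, again starting from the filtered (unstemmed) word list; the WordNet data needs a one-time download:

import nltk
nltk.download('wordnet')  # one-time download of the WordNet data

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# no pos argument is passed, so every word is treated as a noun
lemmatized_words = [lemmatizer.lemmatize(w) for w in filtered_words]
print(' '.join(lemmatized_words))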

Applying plain (noun-only) lemmatization results in:

'three iits india premier higher education institution heis figure top institution across world ranked according employability student q graduate employability ranking'

Removing duplicate words, this becomes

'india top world ranked iits figure ranking according employability education heis q institution three premier student graduate higher across'

Technique 6: Count Vectorization

Count Vectorization simply counts the occurrences of each word in a sentence and creates a vector of those counts. During vectorization, minimum and maximum document frequencies can be specified so that only the relevant words are included in the feature vector. This ensures that neither very frequent words nor very rare words make it into the feature vector.
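With scikit-learn's CountVectorizer this can be sketched as follows, treating each sentence of the article as a document (sentences is the list from the tokenization step; the min_df and max_df parameters implement the frequency cut-offs just described):

from sklearn.feature_extraction.text import CountVectorizer

# float values of min_df/max_df are interpreted as proportions of documents:
# keep words appearing in at least 5% and at most 95% of the sentences
vectorizer = CountVectorizer(min_df=0.05, max_df=0.95)
counts = vectorizer.fit_transform(sentences)  # sparse matrix, one row per sentence
print(vectorizer.get_feature_names_out())
print(counts.toarray())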

If Count Vectorization is applied to the entire text of the article, these are the most frequent words that come out (at least 5% and at most 95% occurrence):

'education employability engineers germany graduates hei heis india indian industry japan job level market report role skills students'

The vectors look something like this

[[0 1 2 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1]
[1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 1 0 0 1 1 0 0 2 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 1 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0]]

Each row corresponds to one sentence, and each column to one of the words listed above; each entry is the count of that word in that sentence.

Technique 7: Term Frequency-Inverse Document Frequency (TF-IDF) Vectorization

As an alternative to Count Vectorization, which simply counts how often a particular word occurs in a document, Term Frequency-Inverse Document Frequency (TF-IDF) creates a score for every word indicating how relevant that word is to a document.

Term Frequency (TF) is the number of times a word occurs in a document. On its own, it has the disadvantage of favoring long documents.

Inverse Document Frequency (IDF) measures how much information a word provides across the corpus. It is computed as the logarithm of the total number of documents divided by the number of documents that contain the word: IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing the term t (if you want a slightly mathematical view!).

TF-IDF is then the product of Term Frequency and Inverse Document Frequency.
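A sketch with scikit-learn's TfidfVectorizer, again treating each sentence as a document; note that the resulting vocabulary depends on the frequency cut-offs chosen, so the values shown here are for the settings used in this article:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=0.05, max_df=0.95)  # cut-offs are a modeling choice
scores = tfidf.fit_transform(sentences)  # one row of TF-IDF scores per sentence
print(tfidf.get_feature_names_out())
print(scores.toarray()[0])  # scores for the first sentence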

Using the TfidfVectorizer from the scikit-learn library, we get the following words:

'employability hei heis india indian level'

For the first sentence, the TF-IDF vector looks like this:

[0.73872867, 0., 0.49663801, 0.45566504, 0., 0.]

With the above seven techniques, it is easy to see how text data can be mined to extract features, and how those features can then be used in building data science models.
