Text Mining: Named Entity Recognition, Parts of Speech and Sentence Structures
The earlier article explored 7 techniques for extracting words from a document, including building a count vector and a term frequency-inverse document frequency (TF-IDF) vector. Here, we will look at Parts of Speech, Named Entity Recognition and how grammar can be used to extract chunks from a document.
Parts of Speech
In any language, Parts of Speech group words into categories that share similar grammatical properties. For example, words can be classified as nouns, pronouns, adverbs, adjectives, verbs and so on. Using the spaCy library in Python, we can identify the parts of speech quite easily.
So let’s take the same article to identify Parts of Speech. The first paragraph from the document is reproduced below.
“Three IITs, India’s premier higher education institutions (HEIs), figure in the top 200 institutions across the world ranked according to employability of students in the 2022 QS Graduate Employability Rankings. No Indian HEI is in the top 100. In comparison, an HEI each from China and Hong Kong have challenged Anglosphere hegemony in the top 10. That no Indian HEI has been able to breach the top 100 aptly sums up the employability crisis of Indian graduates.”
Parts of speech tagging attaches a tag to every token in this paragraph. spaCy can also provide a short description of what each tag means, although it does not explain why a particular word received that tag.
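The tagging step can be sketched roughly as follows. This is a minimal sketch, not the article's original code: it assumes spaCy is installed along with the small English model `en_core_web_sm` (the article does not say which model was used), and the function name `tag_parts_of_speech` is made up for illustration.

```python
# Two sentences taken from the paragraph quoted above.
PARAGRAPH = (
    "No Indian HEI is in the top 100. "
    "In comparison, an HEI each from China and Hong Kong have "
    "challenged Anglosphere hegemony in the top 10."
)


def tag_parts_of_speech(text, model_name="en_core_web_sm"):
    """Return (token, coarse POS, fine-grained tag, tag description) rows.

    Assumption: `model_name` is an installed spaCy pipeline, e.g. after
    `pip install spacy` and `python -m spacy download en_core_web_sm`.
    """
    # Imported here so a missing install shows up as a clear error
    # when the function is called.
    import spacy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [
        # spacy.explain() looks up a human-readable description of a tag.
        (token.text, token.pos_, token.tag_, spacy.explain(token.tag_))
        for token in doc
        if not token.is_space
    ]
```

Calling `tag_parts_of_speech(PARAGRAPH)` returns one row per token. The exact tags depend on the model and its version, so they are not listed here.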
Named Entity Recognition
Named Entities are the organizations, people, countries and similar real-world objects referenced in a document. In the paragraph, spaCy was able to detect the named entities automatically. It is also able to distinguish between 200 as a number and 2022 as a date. GPE is short for geopolitical entity, and spaCy is able to find India, China and Hong Kong!
The named entities can also be visualized highlighted inline in the text, for example with spaCy's built-in displaCy visualizer.
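A rough sketch of the entity extraction and visualization is shown here. It assumes the `en_core_web_sm` model and uses spaCy's displaCy visualizer for the highlighting; the function names are hypothetical, and the article does not confirm which model or visualizer produced its output.

```python
# A sentence taken from the paragraph quoted earlier.
SENTENCE = (
    "In comparison, an HEI each from China and Hong Kong have "
    "challenged Anglosphere hegemony in the top 10."
)


def extract_entities(text, model_name="en_core_web_sm"):
    """Return (entity text, label, label description) for each entity found.

    Assumption: `model_name` is an installed spaCy pipeline with an NER component.
    """
    import spacy  # imported here so a missing install fails at call time

    doc = spacy.load(model_name)(text)
    return [(ent.text, ent.label_, spacy.explain(ent.label_)) for ent in doc.ents]


def entities_as_html(text, model_name="en_core_web_sm"):
    """Return HTML markup with the detected entities highlighted inline."""
    import spacy
    from spacy import displacy

    doc = spacy.load(model_name)(text)
    # jupyter=False makes render() return the markup instead of displaying it.
    return displacy.render(doc, style="ent", jupyter=False)
```

`extract_entities(SENTENCE)` yields label and description pairs such as GPE for countries and cities, while `entities_as_html(SENTENCE)` produces markup that can be saved to a file and opened in a browser. Which spans are detected depends on the model.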
Sentence Structures
In any language, grammar defines the sentence syntax and helps establish the context. Human language has evolved over tens of thousands of years, and maybe more! The human brain has learnt to understand sentence grammar and to construct sentences according to it. It takes only a few years for a child to learn a language and to be able to communicate fluently.
Similarly, for computers to understand natural language, a grammar has to be defined. Programming languages such as C, Java or Python already work this way: each has a formal grammar, and compilers or interpreters use it to convert the programming language syntax into machine language instructions.
To enable computers to understand sentence structures, it is important to define grammatical rules. Extracting information such as parts of speech and named entities from text is useful, but these rules let us go one step further and capture how the words in a sentence are grouped together.
Consider the sentence “We saw the yellow dog”. The word “we” is a pronoun, “saw” is a verb, “the” is a determiner, “yellow” is an adjective and “dog” is a noun. Without any grammar, these are just a bunch of words without structure.
However, if we define a grammar rule that groups noun phrases together (“the yellow dog”), we can represent sentences as tree-like structures. If our grammar rule is that a Noun Phrase can be constructed as <Determiner><Adjective><Noun>, then we can create a group, or chunk, out of this sentence and use it to represent the complete sentence as a structure.
Let’s consider the sentence “The little mouse ate the yellow cheese.”
Let our grammar be defined as
Noun Phrase = <Determiner><Adjective><Noun>
Then the sentence above can be structured as two noun phrase chunks, “The little mouse” and “the yellow cheese”.
The parser has identified these two distinct chunks and combined them with the verb “ate” to create a sentence.
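The chunking step for this rule can be sketched in plain Python. This is a toy illustration rather than the parser used in the article: the tokens are tagged by hand, the tag names (DET, ADJ, NOUN, VERB) are shorthand chosen for this sketch, and `chunk_fixed` and `render` are hypothetical helper names.

```python
# Hand-tagged tokens; in practice the tags would come from a POS tagger.
SENTENCE = [
    ("The", "DET"), ("little", "ADJ"), ("mouse", "NOUN"),
    ("ate", "VERB"),
    ("the", "DET"), ("yellow", "ADJ"), ("cheese", "NOUN"),
]

# Noun Phrase = <Determiner><Adjective><Noun>
NOUN_PHRASE = ["DET", "ADJ", "NOUN"]


def chunk_fixed(tagged, rule, label="NP"):
    """Group every run of tokens whose tags equal `rule` into a labelled chunk.

    Returns a list of (label, words) pairs; label is None for tokens
    that are not part of any chunk.
    """
    if not rule:
        raise ValueError("rule must contain at least one tag")
    chunks = []
    i = 0
    while i < len(tagged):
        window = tagged[i:i + len(rule)]
        if [tag for _, tag in window] == rule:
            chunks.append((label, [word for word, _ in window]))
            i += len(rule)
        else:
            chunks.append((None, [tagged[i][0]]))
            i += 1
    return chunks


def render(chunks):
    """Show chunks in bracket notation, e.g. [NP the yellow cheese]."""
    parts = []
    for label, words in chunks:
        text = " ".join(words)
        parts.append(f"[{label} {text}]" if label else text)
    return " ".join(parts)


print(render(chunk_fixed(SENTENCE, NOUN_PHRASE)))
# [NP The little mouse] ate [NP the yellow cheese]
```

The two bracketed groups are the noun phrase chunks, and the verb stays outside them. A fixed tag sequence like this cannot express optional or repeated elements, which is why chunk grammars are usually written as patterns.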
Considering another sentence,
“The black dog snarled at the frightened kitten.”
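Now if we were to apply the same grammar, “The black dog” and “the frightened kitten” would become noun phrase chunks, while “snarled” and “at” would stay outside them.
If we were to change the grammar rule to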
Noun Phrase = <Determiner><Adjective><Noun><optional Verb>
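The sentence would then be interpreted with “The black dog snarled” as a single chunk, because the optional verb is absorbed into the first noun phrase, while “the frightened kitten” remains a chunk of its own.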
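Both grammar rules can be compared with a small pattern-based chunker. This is a sketch under stated assumptions: the tokens are tagged by hand, each rule is written as a Python regular expression over a string of `<TAG>` units (so the optional verb becomes `(?:<VERB>)?`), and `chunk_regex` is a hypothetical helper, not a library function. Libraries such as NLTK provide a fuller version of this idea in `RegexpParser`.

```python
import re

# Hand-tagged tokens; ADP marks the preposition "at".
SENTENCE = [
    ("The", "DET"), ("black", "ADJ"), ("dog", "NOUN"),
    ("snarled", "VERB"), ("at", "ADP"),
    ("the", "DET"), ("frightened", "ADJ"), ("kitten", "NOUN"),
]

BASIC_RULE = "<DET><ADJ><NOUN>"                   # <Determiner><Adjective><Noun>
OPTIONAL_VERB_RULE = "<DET><ADJ><NOUN>(?:<VERB>)?"  # ...<optional Verb>


def chunk_regex(tagged, pattern, label="NP"):
    """Bracket every tag sequence matching `pattern`.

    Assumption: `pattern` is built only from whole <TAG> units, so each
    match starts and ends on a token boundary.
    """
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    # Map character offsets in tag_string back to token positions.
    offset_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 2  # +2 for the angle brackets
    offset_to_index[offset] = len(tagged)

    parts = []
    cursor = 0
    for match in re.finditer(pattern, tag_string):
        if match.start() == match.end():
            continue  # ignore empty matches
        first = offset_to_index[match.start()]
        last = offset_to_index[match.end()]
        parts.extend(word for word, _ in tagged[cursor:first])
        words = " ".join(word for word, _ in tagged[first:last])
        parts.append(f"[{label} {words}]")
        cursor = last
    parts.extend(word for word, _ in tagged[cursor:])
    return " ".join(parts)


print(chunk_regex(SENTENCE, BASIC_RULE))
# [NP The black dog] snarled at [NP the frightened kitten]
print(chunk_regex(SENTENCE, OPTIONAL_VERB_RULE))
# [NP The black dog snarled] at [NP the frightened kitten]
```

With the optional verb in the rule, “snarled” moves inside the first chunk, while the second chunk is unchanged because no verb follows “kitten”.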
We can also apply the same grammar to a sentence from the article:
“An employability report in 2019 based on standardised testing by Aspiring Minds termed the challenge as stubborn unemployability”
Chinking is the complementary option: instead of describing what belongs inside a chunk, we define a broad chunk and then describe the tag sequences (the chinks) to remove from it wherever they occur.
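A minimal sketch of chinking follows. The article gives no concrete chinking rule, so the choice made here is an assumption for illustration: treat the whole sentence as one chunk, then chink out verbs and prepositions. The tokens are tagged by hand, and `chink` and `render` are hypothetical helper names.

```python
# Hand-tagged tokens; ADP marks the preposition "at".
SENTENCE = [
    ("The", "DET"), ("black", "ADJ"), ("dog", "NOUN"),
    ("snarled", "VERB"), ("at", "ADP"),
    ("the", "DET"), ("frightened", "ADJ"), ("kitten", "NOUN"),
]

# Assumed chink rule: verbs and prepositions are removed from the chunk.
CHINK_TAGS = {"VERB", "ADP"}


def chink(tagged, chink_tags, label="NP"):
    """Start from one chunk covering everything, then cut out chinked tokens.

    Returns a list of (label, words) pairs; label is None for the
    tokens that were chinked out.
    """
    chunks = []
    current = []
    for word, tag in tagged:
        if tag in chink_tags:
            if current:  # close the chunk that was open before the chink
                chunks.append((label, current))
                current = []
            chunks.append((None, [word]))
        else:
            current.append(word)
    if current:
        chunks.append((label, current))
    return chunks


def render(chunks):
    """Show chunks in bracket notation, e.g. [NP the frightened kitten]."""
    parts = []
    for label, words in chunks:
        text = " ".join(words)
        parts.append(f"[{label} {text}]" if label else text)
    return " ".join(parts)


print(render(chink(SENTENCE, CHINK_TAGS)))
# [NP The black dog] snarled at [NP the frightened kitten]
```

Here chinking arrives at the same chunks as the <Determiner><Adjective><Noun> rule, but by stating what to exclude rather than what to include, which can be convenient when the phrases of interest vary a lot in shape.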
With these new concepts, we can identify not only the parts of speech but also the organizations, people and institutions, that is, the Named Entities, in our text. We can also structure sentences. For example, we can find the most common noun phrases in a set of tweets that have been classified as positive, to understand what drives a positive tweet from a happy customer!