NLP with spaCy. (Part 1)

A Comprehensive Guide To NLP.

Manikanta Munnangi
Towards Data Science
7 min read · Nov 16, 2019


Natural Language Processing is widely used in almost every domain, and some in-depth knowledge is required before going hands-on.

Applications of NLP.

This is the first part of a series on Natural Language Processing (NLP) with spaCy. The next post covers the basics of NLP with code using spaCy.

Introduction:

We all know that evolving technologies generate enormous amounts of raw data. Structured data has a pre-defined format and can be searched in databases, for example numbers and text fields. Unstructured data, on the other hand, has no pre-defined format, which makes it much more difficult to collect, process, and analyze; examples are images, videos, and files. Both types of data are generated every day, and the amount is huge: 2.5 quintillion bytes of data per day, and the pace is only picking up. I had never heard the word “Quintillion” until I came across it on the web.

Did you know?

A million has 6 zeros and a quintillion has 18.

Million -> 1,000,000

Quintillion -> 1,000,000,000,000,000,000

One quintillion bytes is equivalent to 1 billion gigabytes, so about 2.5 billion gigabytes of data are generated per day. You may be wondering why the number is so big. It is because data comes from everywhere: sensors used to gather shopper information, posts to social media sites, digital pictures and videos, purchase transactions, cell phone GPS signals, and the list goes on.
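
To make the size of these numbers concrete, here is a minimal sketch that counts the zeros and converts bytes to gigabytes. It assumes decimal (SI) units, where 1 gigabyte is 10^9 bytes; the name `bytes_to_gigabytes` is only an illustrative choice.

```python
# Assumption: decimal (SI) units, where 1 GB = 10**9 bytes.
BYTES_PER_GIGABYTE = 10**9

MILLION = 10**6
QUINTILLION = 10**18


def bytes_to_gigabytes(num_bytes: int) -> float:
    """Convert a byte count to gigabytes (decimal units)."""
    return num_bytes / BYTES_PER_GIGABYTE


# 2.5 quintillion bytes per day, kept as an exact integer.
daily_bytes = 25 * QUINTILLION // 10

print(len(str(MILLION)) - 1)            # number of zeros in a million
print(len(str(QUINTILLION)) - 1)        # number of zeros in a quintillion
print(bytes_to_gigabytes(QUINTILLION))  # gigabytes in one quintillion bytes
print(bytes_to_gigabytes(daily_bytes))  # gigabytes generated per day
```

With binary units (1 GiB = 2^30 bytes) the figures would come out slightly smaller, so the choice of unit is worth stating whenever such numbers are quoted.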

The question is how we process, analyze, and convert this data into something meaningful that helps address our own needs and solve new technical problems.

Natural Language Processing helps with analyzing text-related problems. As this article is about NLP, we will look only at text and leave the other categories of data aside.

What you will be learning from this article:

  1. What is Natural Language Processing?
  2. Subsets of Natural Language Processing.
  3. What are the Applications of NLP?

1. What is Natural Language Processing?

With the help of a bunch of algorithms and rules, a computer is able to understand and communicate with humans in many human languages, and to scale other language-related tasks. With NLP, it is possible to perform tasks like automated speech and automated text writing in less time. Given the growth of large (text) data, why not use computers, which have high computing power, can work all day, and can run several algorithms to perform tasks in very little time?

After the successes in speech recognition and vision research, natural language processing is one of the most targeted research areas in artificial intelligence.

Although the field started decades ago, most people lack NLP experience, because it is hard to teach a machine given the challenges listed below:

  • Ambiguity: the challenge that arises when a single word has different meanings, when a sentence has different meanings depending on the context, or even when a sentence is sarcastic.
Lack of clarity in meaning.
  • Ambiguity is divided into two types: lexical and syntactic ambiguity.

Lexical Ambiguity is the presence of two or more possible meanings within a single word.

Syntactic Ambiguity is the presence of two or more possible meanings within a single sentence or sequence of words.

Source: ThoughtCo.
  • Syntax: Whether a sentence is valid depends on two things, syntax and semantics. Syntax refers to the grammatical rules, while semantics is the meaning of the vocabulary symbols within that structure. People change the order of words in a sentence, which is valid in some cases but not all.
  • Co-reference: Referring, with pronouns, to the same person, thing, country, etc. that was mentioned earlier in a sentence or phrase. For example, in “Maria said she was tired”, the pronoun “she” refers to Maria.
Coreference.
  • Normalization: It is a well-known technique in Machine Learning and Deep Learning, where it is applied because features have different unit scales and their values are converted to a common scale. In Natural Language Processing, we instead convert informal words to their standard form to make the text suitable for further processing.
Source: Slideshare.net

It involves normalizing text from social media, URLs, text emojis, and company names with special characters (e.g. Yahoo!), and it also covers misspelled words, hashtags, new words, and new terminology. There is no single best way to do normalization.

To do this task we use the Morphology part of NLU.

  • Sarcasm: The same words carry a different meaning, which relates to the ambiguity challenge. Suppose someone does something wrong and you reply “very good” or “well done”. It is a challenge for a computer to understand sarcasm because it is very different from a normal conversation.
Your Machine asks.
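
The normalization challenge described above can be sketched in a few lines. This is only a rule-based illustration using Python's standard `re` module, not how spaCy or any particular library does it. The lookup table `INFORMAL_TO_STANDARD` and the `<URL>` placeholder are assumptions made up for this example.

```python
import re

# Hypothetical lookup table: informal token -> standard form.
# A real system would need a far larger, domain-specific resource.
INFORMAL_TO_STANDARD = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "thx": "thanks",
    "2moro": "tomorrow",
}

URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def normalize(text: str) -> str:
    """Apply a few simple, rule-based normalization steps to informal text."""
    text = text.lower()                      # case folding
    text = URL_PATTERN.sub("<URL>", text)    # replace URLs with a placeholder
    text = HASHTAG_PATTERN.sub(r"\1", text)  # "#nlp" -> "nlp"
    tokens = text.split()                    # naive whitespace tokenization
    tokens = [INFORMAL_TO_STANDARD.get(tok, tok) for tok in tokens]
    return " ".join(tokens)


print(normalize("Thx u r gr8 see u 2moro #NLP https://example.com"))
# thanks you are great see you tomorrow nlp <URL>
```

Even this tiny sketch shows why there is no single best way to normalize: the whitespace tokenizer misses tokens with attached punctuation, and a blanket mapping such as "r" to "are" would be wrong in other contexts.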

2. Subsets of Natural Language Processing.

When I talk about NLP, it isn’t just one thing. It comprises two topics: Natural Language Understanding (NLU) and Natural Language Generation (NLG).

Source: Sciforce
Classification of NLP.

==> Natural Language Processing (NLP):

  • It’s the process of converting the input (unstructured) text data into a machine-readable format and processing the text using statistical techniques.

==> Natural Language Understanding (NLU):

  • As the name says, it is about understanding the raw text. Before any modeling and analysis, the machine has to grasp the text’s underlying terms; only then can it start processing it.
  1. Phonology: the study of how sounds are organized systematically.
  2. Morphology: the study of the structure and formation of words and the relationships between them, used to analyze meaning and lexical function.
  3. Pragmatics: the study of how words, signs, and symbols are used, and of inferred meaning.
  4. Syntax: refers to arranging words or phrases into meaningful sentences, following grammatical rules.
  5. Semantics: concerned with the meaning of words and how to combine words into meaningful phrases and sentences.
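
Of the levels listed above, morphology is probably the easiest to illustrate in a few lines. The sketch below splits a word into a stem and a suffix by naive suffix stripping. The suffix list and the minimum stem length are assumptions chosen for this example, and real morphological analysis is considerably more involved.

```python
from typing import Tuple

# Hypothetical, deliberately tiny suffix list; real morphological analysis
# relies on much richer rules or trained models.
SUFFIXES = ["ing", "ed", "ly", "s"]

# Assumed minimum stem length, to avoid over-stripping short words.
MIN_STEM_LENGTH = 3


def split_suffix(word: str) -> Tuple[str, str]:
    """Split a word into (stem, suffix) using the first matching suffix."""
    for suffix in SUFFIXES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return word[:stem_length], suffix
    return word, ""


for word in ["playing", "played", "quickly", "plays", "sing"]:
    print(word, "->", split_suffix(word))
```

The minimum stem length keeps "sing" intact, but the same rules would still split a word like "bring" incorrectly if the threshold were lowered, which is one reason rule lists alone are rarely enough.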

==> Natural Language Generation (NLG):

  • When a computer turns data from some internal representation into meaningful phrases or sentences.
  • It involves text planning, sentence planning, and text realization.

Text Planning: Retrieving relevant information from the knowledge base.

Sentence Planning: Choosing the words required to form complete, meaningful sentences.

Text Realization: According to Wikipedia, realization is a subtask of natural language generation that involves creating an actual text in a human language (English, French, etc.) from a syntactic representation.
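
The three NLG stages above can be sketched as a toy, template-style pipeline. Everything here is hypothetical: the knowledge base contents, the function names, and the fixed word order are made up to show how the stages hand data to each other, and real NLG systems are far more flexible.

```python
# Hypothetical knowledge base, for illustration only.
KNOWLEDGE_BASE = {
    "city": "Springfield",
    "condition": "sunny",
    "temperature_c": 21,
    "internal_sensor_id": "A-17",  # not relevant to the message below
}


def plan_text(knowledge_base: dict, topic_keys: list) -> dict:
    """Text planning: pick only the facts relevant to the message."""
    return {key: knowledge_base[key] for key in topic_keys}


def plan_sentence(facts: dict) -> list:
    """Sentence planning: choose the words and their order."""
    return [
        "it", "is", facts["condition"], "in", facts["city"],
        "with", "a", "temperature", "of",
        str(facts["temperature_c"]), "degrees", "Celsius",
    ]


def realize(words: list) -> str:
    """Text realization: turn the word sequence into a surface sentence."""
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."


facts = plan_text(KNOWLEDGE_BASE, ["city", "condition", "temperature_c"])
print(realize(plan_sentence(facts)))
# It is sunny in Springfield with a temperature of 21 degrees Celsius.
```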

3. What are the Applications of NLP?

Sentiment Analysis.

Sentiment Analysis helps companies understand their strengths and weaknesses, gain insights from their data, and make the necessary changes to their business strategies.
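
As a rough idea of how sentiment analysis can work at its simplest, here is a lexicon-based sketch: it counts positive and negative words and compares the totals. The word lists are tiny, hypothetical examples, and production systems typically use trained models rather than fixed lists.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "fast"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment_score(text: str) -> int:
    """Return (# positive words) - (# negative words) for the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((tok in POSITIVE_WORDS) - (tok in NEGATIVE_WORDS) for tok in tokens)


def sentiment_label(text: str) -> str:
    """Map the score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


reviews = [
    "Great product, fast delivery, love it!",
    "Terrible support and slow shipping.",
    "The package arrived on Monday.",
]
for review in reviews:
    print(sentiment_label(review), "|", review)
```

A word-counting approach like this has no notion of negation or sarcasm, so a reply such as “well done” after a mistake would be scored as positive, which is exactly the challenge described earlier.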

Chatbots.

Chatbots provide standard solutions to common customer problems as well as personalized assistance to customers.
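
The “standard solutions to common problems” part can be pictured as a keyword lookup. The sketch below is a minimal rule-based bot. The keywords, answers, and fallback message are all hypothetical, and modern chatbots usually rely on intent classification or language models instead of plain keyword matching.

```python
# Hypothetical keyword -> canned answer table, for illustration only.
RESPONSES = {
    "refund": "You can request a refund from the Orders page.",
    "password": "Use the 'Forgot password' link to reset your password.",
    "shipping": "Standard shipping usually takes 3 to 5 business days.",
}

FALLBACK = "Sorry, I did not understand. Let me connect you to a human agent."


def reply(message: str) -> str:
    """Return the answer for the first known keyword found in the message."""
    text = message.lower()
    for keyword, answer in RESPONSES.items():
        if keyword in text:
            return answer
    return FALLBACK


print(reply("How do I reset my PASSWORD?"))
print(reply("Hello there"))
```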

Virtual Assistants.

To name a few more applications: text classification, information extraction, semantic parsing, question answering, paraphrase detection, language generation, multi-document summarization, machine translation, speech and character recognition, and many more.

Conclusions:

  1. Natural Language Processing is a subset of AI that deals with teaching a computer to understand human-level language and act upon it.
  2. It still faces challenges, both in how text is processed and in performance.
  3. NLP comprises two subsets, Natural Language Understanding and Natural Language Generation, which help take text processing to the next level.
  4. It involves a lot of study and understanding of linguistics.
  5. A vast number of applications, like sentiment analysis, chatbots, and virtual assistants, benefit us by using NLP.
  6. By leveraging the power of data and high-performance computing hardware, we can train models much faster.

That’s it. I hope you liked this summary of NLP. The next part covers coding NLP basics using spaCy.

Happy Learning :)

References:

  1. Yunyao Li, Research Manager and Research Staff Member at IBM, https://www.slideshare.net/YunyaoLi/adaptive-parsercentric-text-normalization
  2. Sigmoider, “Get started with NLP”, https://medium.com/@gon.esbuyo/get-started-with-nlp-part-i-d67ca26cc828
  3. Amar Jukuntla, “Learning”, https://www.slideshare.net/amarjukuntla/learning-93260612
  4. https://www.upgrad.com/blog/5-applications-of-natural-language-processing-for-businesses/
  5. http://michealaxelsen.com/blog/?p=347


Active learner of Machine Learning and Data Science. Always passionate about learning new technologies that involve Data Science.