Part 1 - Introduction
First, a little introduction about myself: I'm a senior back-end software engineer at a major international technology company, part of a group that develops a personal assistant.
One of the biggest challenges we had, and still have, is extracting meaning from the user's natural language. This challenge was given to me early in the development stage, when we found that our regular expression model was not sufficient for one of the input sources we examined.
In order to tackle this problem I did extensive research on open source NLP engines, while our business development team scouted for companies that might offer a solution to our problem. Eventually no such company was found, and I was given the mandate to develop an engine myself.
In this article I've summed up what I've learned on the subject, trying to keep it as simple as I can: I won't go into the mathematics behind the NLP statistical models, and will focus only on what is needed to understand the building blocks that make an NLP engine work.
So let's begin:
Natural Language Processing (NLP) is a set of frameworks and tools for analyzing text or speech that enables us to extract machine-understandable meaning from human language. This is done using statistical models for pattern recognition, sentence structure analysis and semantic layering (if you're not familiar with those terms, they will all be explained later in this article).
There are many implementations of NLP engines, each designed to solve a different set of problems using models that evaluate and analyze different types of input. My experience is based on evaluating short text inputs (sentences) using NLP engines such as Stanford CoreNLP and Apache UIMA, so I'll focus here on engines with a similar structure, which are designed to analyze textual input.
Today NLP is becoming more and more popular, and it is used in fields such as research, data mining, machine learning, speech processing, AI and others. Implementations of it can be found in web search engines, mobile personal assistants, automated web crawlers and many more.
Even so, and not surprisingly, it is still one of the most complex problems in computer science. Extracting meaning from natural text is a challenging task: although there are some incredible open source libraries and tools to work with, and extensive research is being done in the field, some of it is still at the POC stage and requires intensive work to make it stable. I recommend going over the following questions before you start your development, in order to avoid working with tools that are not designed, or not mature enough, to help you solve your problem:
- Does the engine support your preferred development language?
- Does it have a live and kicking community?
- What kind of license does it have?
- Does it have independent API jars, or is it an open 3rd-party API?
- What is the documentation quality?
So let's examine some of the most common building blocks and tools that compose text-based NLP engines:
Tokenizers:
A core model for breaking the text down into tokens (a token usually refers to a word) using delimiters (such as space, comma and so on).
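Here is a minimal sketch of a tokenization-only pipeline, assuming Stanford CoreNLP's standard tokenize and ssplit annotators (the engine I'll use in the next part); the example sentence and class name are only illustrative:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class TokenizerExample {
    public static void main(String[] args) {
        // Build a pipeline that only tokenizes and splits sentences
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("My brother Joe is a software engineer.");
        pipeline.annotate(document);

        // Print every token the tokenizer produced
        for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word());
        }
    }
}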
Statistical and Annotation Engines:
- Part Of Speech annotation (POS) - A model that annotates every word in the sentence with its grammatical part of speech, such as verbs, nouns, pronouns, adverbs, adjectives and so on.
For example, the sentence "My brother Joe is a software engineer" is tagged My/PRP$ brother/NN Joe/NNP is/VBZ a/DT software/NN engineer/NN, where:
PRP$ = Possessive pronoun
NN = Noun
VBZ = Verb
NNP = Proper noun
DT = Determiner
- Named Entity Recognition (NER) - A model that annotates words that carry a semantic layer, based on a statistical model or regular expression rules. Common NER annotations are: Location, Organization, Person, Money, Percent, Time and Date (a sketch showing both POS and NER output follows this list).
- Creating new annotations - A very useful technique when analyzing text is using annotations you create yourself to highlight and give meaning to words or phrases specific to your program context. Most NLP engines give you the infrastructure to create and define your own annotations, which will run in the NLP engine's text analyzing cycle (this is part of the training described below, in technique 2 of the "Customizing the NLP engine for your own needs" section).
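To make the POS and NER models concrete, here is a minimal sketch that runs both together, assuming Stanford CoreNLP's standard annotators; the sample sentence and the printed format are only illustrative:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class AnnotationExample {
    public static void main(String[] args) {
        // POS tagging and NER both need tokenization and sentence splitting first;
        // NER additionally uses lemmas in recent CoreNLP versions.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Joe moved to London in 2015.");
        pipeline.annotate(document);

        for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
            String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);   // e.g. NNP, VBD
            String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class); // e.g. PERSON, LOCATION, DATE, O
            System.out.println(token.word() + "\t" + pos + "\t" + ner);
        }
    }
}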
Semantic Graph / Relationship extraction:
An important NLP model that maps the connections between all the sentence entities as a graph object. This enables us to traverse the produced graph and find meaningful connections.
For example:
"My brother Joe is a software
engineer"
"Joe My brother is a software engineer"
"Joe My brother is a software engineer"
"A
software engineer is Joe my brother"
Joe <=> My brother
Joe <=> software engineerDictionaries:
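Here is a minimal sketch of extracting such a graph with Stanford CoreNLP, assuming its dependency parser ("depparse") and SemanticGraph API; walking the edges is what lets us find connections like the ones above:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class SemanticGraphExample {
    public static void main(String[] args) {
        // The dependency parser builds the graph; it needs POS tags first.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("My brother Joe is a software engineer.");
        pipeline.annotate(document);

        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            SemanticGraph graph =
                sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
            // Walk every edge of the graph: governor --relation--> dependent
            for (SemanticGraphEdge edge : graph.edgeIterable()) {
                System.out.println(edge.getGovernor().word()
                        + " --" + edge.getRelation() + "--> "
                        + edge.getDependent().word());
            }
        }
    }
}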
Dictionaries:
A regular-expression based model that holds a set of words or regular expressions. Those words should have the same meaning or translation, and should be handled / annotated in the same way once found in a text. Some dictionaries are provided by the NLP engine and used in the NER model or as other predefined annotations (in the core annotation engine, depending on the NLP engine used), while others will be created by us when we customize the engine (as part of the training described below, in technique 2 of the "Customizing the NLP engine for your own needs" section).
An example of a dictionary can be all the cities in a country, or all the degrees in some university. Other cases can be synonyms: for instance, road is a synonym for alley, avenue, street, boulevard and so on. Once the model recognizes a dictionary word, it will annotate it with an annotation you predefined; for example, ROAD_SYN can be the annotation for all the road synonyms above (a minimal sketch of such a dictionary-driven annotation appears after the next two bullets).
- Stemming (or Snowball) and lemmatization algorithms - Extract the root or base form of a token / word; for example, "stemmer", "stemming" and "stemmed" will all result in the same base form, "stem". This can be very useful when adding a semantic layer on a predefined verb: there is no need to know all the forms it can have, just the root one.
- TrueCase - Recognizes the true case of tokens in text where this information was lost, e.g., all upper or lower case text. This can be useful since NLP statistical models give more accurate results if the text is well formed (grammar, punctuation, spelling and casing), and the input will not always arrive with proper casing; for example, speech-to-text engines commonly output text in lower case.
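Here is a minimal sketch of the dictionary idea using Stanford CoreNLP's regexner annotator; the mapping file road_synonyms.tab and the ROAD_SYN tag are hypothetical names for this illustration, and the pipeline also prints the lemma (base form) mentioned in the stemming bullet above:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class DictionaryAnnotationExample {
    public static void main(String[] args) {
        // road_synonyms.tab is a hypothetical tab-separated mapping file, e.g.:
        //   street      ROAD_SYN
        //   avenue      ROAD_SYN
        //   boulevard   ROAD_SYN
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
        props.setProperty("regexner.mapping", "road_synonyms.tab");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Turn left at the next boulevard.");
        pipeline.annotate(document);

        for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
            // The lemma gives the base form of each token; the NER tag now includes
            // our custom ROAD_SYN annotation whenever a dictionary word is matched.
            System.out.println(token.word() + "\t"
                    + token.get(CoreAnnotations.LemmaAnnotation.class) + "\t"
                    + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
        }
    }
}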
Customizing the NLP engine for your own needs:
The last and most important part of evaluating text is tuning and configuring the engine to best suit our needs. There are many NLP engines, and each has its own models for training the data. There are two common training techniques:
- Training the core engine data - This technique is used to train the core data to give more accurate results based on new input fed offline to the engine (very similar to machine learning); using this training model you will change the core statistical model.
- Creating and training new data for the engine to consider - This training is done by creating new models and algorithms (such as those described above) that are dedicated to our program's needs, for example: new annotations, dictionaries, regular expressions and semantic layering that help recognize phrases or patterns in the analyzed text and give them meaning in our program context (a small sketch follows).
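As a small taste of the second technique before the next part, here is a sketch that uses CoreNLP's TokensRegex on top of the built-in annotations to recognize a program-specific phrase pattern; the pattern and the printed labels are illustrative assumptions, not something the library defines:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.List;
import java.util.Properties;

public class PatternExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("My brother Joe is a software engineer.");
        pipeline.annotate(document);
        List<CoreLabel> tokens = document.get(CoreAnnotations.TokensAnnotation.class);

        // A TokensRegex pattern: a PERSON entity followed by "is a" and one or more nouns.
        // The pattern is an illustrative assumption for a "who is what" phrase in our context.
        TokenSequencePattern pattern =
            TokenSequencePattern.compile("([{ner:PERSON}]+) /is/ /a/ ([{tag:/NN.*/}]+)");
        TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
        while (matcher.find()) {
            System.out.println("Person: " + matcher.group(1) + ", Role: " + matcher.group(2));
        }
    }
}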
In the next part I'll give code examples on how to use Stanford CoreNLP (an open source library in Java) to start writing and configuring your own NLP engine.