Sunday, November 30, 2014

Analyzing Text Using Stanford CoreNLP

Part 2 - Implementation

This tutorial is a step-by-step “how-to” for implementing a text analysis project using the Stanford CoreNLP framework.

Let's start with a short introduction:
Stanford CoreNLP is an open source NLP framework (under the GNU General Public License) created by Stanford University for labeling text with NLP annotations (such as POS, NER, lemma, coreference and so on) and for relationship extraction.

For more detail on what kind of NLP engine Stanford CoreNLP is, you can refer to my “How do NLP engines work in a nutshell” blog post.

OK, so let's start!

1. Downloading Stanford CoreNLP:
Stanford CoreNLP jars can be downloaded from: http://nlp.stanford.edu/software/corenlp.shtml
It is also available in the Maven Central repository, so you can add it to your Maven or Gradle project as a dependency.

2. What the Download Contains:
The Core module – contains all the Java class files
The Models module – contains all the resource (model) files
Sources jar – the source code of the Core module
Dependency jars – all the 3rd party jars needed for compiling the project
Additional resources – some additional resources and examples (I will ignore those here)

3. Creating a project:
I prefer working with Maven or Gradle, since they simplify creating the project and managing its dependencies, and I recommend this approach.
Having said that, if you prefer to create a non-Maven/Gradle project, all you need to do is add all the jars contained in the downloaded CoreNLP zip file to your project build path.
Also note that the latest CoreNLP (at the time of writing) requires Java 8, so make sure you have Java 8 installed and configured in your project.

4. Analyzing Text
OK, now that we have our project we can start writing code to evaluate text.
Here is a code snippet illustrating all the steps needed to start working with CoreNLP (you can copy the code below into a main method to try it out; tested on version 3.5):

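// Imports used by this snippet (put them at the top of your source file, outside the main method)
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.RegexNERAnnotator;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.pipeline.TokensRegexAnnotator;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.trees.GrammaticalRelation;
import edu.stanford.nlp.util.CoreMap;

/* The first step is initializing the Stanford CoreNLP pipeline
   (the pipeline will later be used to evaluate the text and annotate it).
   The pipeline is initialized using a Properties object, which is used for setting all needed entities,
   annotations, training data and so on, in order to customize the pipeline initialization so that it
   contains only the models you need */
Properties props = new Properties();

/* The "annotators" property key tells the pipeline which annotators should be initialized with our
     pipeline object. See http://nlp.stanford.edu/software/corenlp.shtml for a complete reference
     to the "annotators" values you can set here and what each contributes to the analysis process */
props.setProperty( "annotators", "tokenize, ssplit, pos, lemma, ner, regexner, parse, dcoref" );
StanfordCoreNLP pipeLine = new StanfordCoreNLP( props );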

/* Next we can add customized annotators and trained data.
   I will elaborate on training data in my next blog chapter. The file names below are placeholders,
   so these lines are commented out for now; uncomment them once you have your own mapping/rules files */
// pipeLine.addAnnotator(new RegexNERAnnotator("some RegexNer structured file"));
// pipeLine.addAnnotator(new TokensRegexAnnotator("some tokenRegex structured file"));

// Next we format today's date; it will be attached to the annotation object as the document date
SimpleDateFormat formatter = new SimpleDateFormat( "yyyy-MM-dd" );
String currentTime = formatter.format( new Date() );

// inputText will be the text to evaluate in this example
String inputText = "some text to evaluate";
Annotation document = new Annotation( inputText );
document.set( CoreAnnotations.DocDateAnnotation.class, currentTime );

// Finally we use the pipeline to annotate the document we created
pipeLine.annotate( document );

/* Now that we have the document (wrapping our inputText) annotated, we can extract the
    annotated sentences from it. Annotated sentences are represented by CoreMap objects */
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

/* Next we can go over the annotated sentences and extract the annotated words,
    using the CoreLabel object */
for (CoreMap sentence : sentences)
{
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class))
    {
        // Using the CoreLabel object we can start retrieving NLP annotation data
        // Extracting the text of the token
        String text = token.getString(CoreAnnotations.TextAnnotation.class);

        // Extracting the Named Entity Recognition tag
        String ner = token.getString(CoreAnnotations.NamedEntityTagAnnotation.class);

        // Extracting the Part Of Speech tag
        String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);

        // Extracting the lemma
        String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
        System.out.println("text=" + text + ";NER=" + ner +
                        ";POS=" + pos + ";LEMMA=" + lemma);

        /* There are more annotations available for extraction
            (depending on which "annotators" you set in the pipeline properties);
            examine the token, sentence and document objects to find any relevant annotation
            you might need */
    }

    /* Next we will extract the SemanticGraph to examine the connections
       between the words in our evaluated sentence */
    SemanticGraph dependencies = sentence.get(
            SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);

    /* The IndexedWord object is very similar to the CoreLabel object,
        only it is used in the SemanticGraph context.
        Note that a root node usually has no incoming edges, so the first loop
        below mainly shows the API and will typically not iterate */
    IndexedWord firstRoot = dependencies.getFirstRoot();
    List<SemanticGraphEdge> incomingEdgesSorted =
                                dependencies.getIncomingEdgesSorted(firstRoot);

    for(SemanticGraphEdge edge : incomingEdgesSorted)
    {
        // Getting the target node of the edge (the dependent word)
        IndexedWord dep = edge.getDependent();

        // Getting the source node of the edge (the governor word)
        IndexedWord gov = edge.getGovernor();

        // Getting the name of the relation between them
        GrammaticalRelation relation = edge.getRelation();
    }
 
    // This section is the same as above, only here we retrieve the outgoing edges
    List<SemanticGraphEdge> outEdgesSorted = dependencies.getOutEdgesSorted(firstRoot);
    for(SemanticGraphEdge edge : outEdgesSorted)
    {
        IndexedWord dep = edge.getDependent();
        System.out.println("Dependent=" + dep);
        IndexedWord gov = edge.getGovernor();
        System.out.println("Governor=" + gov);
        GrammaticalRelation relation = edge.getRelation();
        System.out.println("Relation=" + relation);
    }
}
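
The order of the names in the "annotators" property matters, because most annotators rely on the output of earlier ones. The following is a minimal, CoreNLP-free sketch of a sanity check you could run on the property value before building the pipeline. The class name AnnotatorOrderCheck and the method findProblems are hypothetical helpers, and the prerequisite table is my reading of the annotator list on the CoreNLP home page, so treat it as an assumption and verify it against the version you use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AnnotatorOrderCheck {

    // Assumed prerequisites per annotator (not read from CoreNLP itself).
    private static final Map<String, List<String>> REQUIRES = new LinkedHashMap<>();
    static {
        REQUIRES.put("ssplit", Arrays.asList("tokenize"));
        REQUIRES.put("pos", Arrays.asList("tokenize", "ssplit"));
        REQUIRES.put("lemma", Arrays.asList("tokenize", "ssplit", "pos"));
        REQUIRES.put("ner", Arrays.asList("tokenize", "ssplit", "pos", "lemma"));
        REQUIRES.put("parse", Arrays.asList("tokenize", "ssplit"));
        REQUIRES.put("dcoref",
                Arrays.asList("tokenize", "ssplit", "pos", "lemma", "ner", "parse"));
    }

    // Returns one message per prerequisite that does not appear earlier in the list.
    // Annotators missing from the table (e.g. regexner) are not checked.
    public static List<String> findProblems(String annotators) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String name : annotators.trim().split("[,\\s]+")) {
            for (String needed : REQUIRES.getOrDefault(name, Collections.<String>emptyList())) {
                if (!seen.contains(needed)) {
                    problems.add(name + " needs " + needed + " before it");
                }
            }
            seen.add(name);
        }
        return problems;
    }

    public static void main(String[] args) {
        // The list used in this tutorial passes the check
        System.out.println(findProblems(
                "tokenize, ssplit, pos, lemma, ner, regexner, parse, dcoref"));
        // A list that skips ssplit and pos is reported
        System.out.println(findProblems("tokenize, lemma"));
    }
}
```

The check only looks at ordering inside the string; it does not load any models, so it is cheap to run before the (slow) pipeline construction.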

The code snippet above should be a good starting point for evaluating text input using the CoreNLP framework, but there is much more to the Stanford framework than I can cover in this short tutorial. You can always refer to the Stanford CoreNLP home page (http://nlp.stanford.edu/software/corenlp.shtml) and look for additional abilities not covered here.


Tuesday, April 22, 2014

How do NLP engines work in a nutshell

Part 1 - Introduction

First I'll start with a little introduction about myself: I'm a senior back-end software engineer at a major international technology company, part of a group that develops a personal assistant.

One of the biggest challenges we had, and still have, is extracting meaning from users' natural language. This challenge was given to me early in the development stage, when we found that our regular expression model was not sufficient for one of the input sources we examined.

To tackle this problem I did extensive research on open source NLP engines, while our business development team scouted for companies that might offer a solution to our problem. In the end no such company was found, and I was given the mandate to develop an engine myself.

In this article I sum up what I've learned on the subject, trying to keep it as simple as I can: I won't go into the mathematics behind the NLP statistical models, and will focus only on what is needed to understand the building blocks that make an NLP engine work.

So let's begin:
Natural Language Processing (NLP) is a set of frameworks and tools for analyzing text or speech that enables us to extract machine-understandable meaning from human language. This is done using statistical models for pattern recognition, sentence structure analysis and semantic layering (if you're not familiar with those terms, they will all be explained later in this article).

There are many implementations of NLP engines, each designed to solve a different set of problems using models that evaluate and analyze different types of input. My experience is based on evaluating short text inputs (sentences) using NLP engines such as Stanford CoreNLP and Apache UIMA, so I'll focus here on engines with a similar structure, which are designed to analyze textual input.

Today NLP is becoming more and more popular, and it is used in fields such as research, data mining, machine learning, speech processing, AI and others. Implementations of it can be found in web search engines, mobile personal assistants, automated web crawlers and many more.

Not surprisingly, though, it is still one of the most complex problems in computer science: extracting meaning from natural text is a challenging task. Even though there are some incredible open source libraries and tools to work with, and extensive research is being done in the field, some of it is still at the POC stage and requires intensive work to make it stable. I recommend asking the following questions before you start development, in order to avoid working with tools that are not designed for your problem or not mature enough to help you solve it:
  1. Does the engine support your preferred development language?
  2. Does it have a live and kicking community?
  3. What kind of license does it have?
  4. Does it have Independent API jars? Or is it an open 3rd party API?
  5. What is the documentation quality? 
So let's examine some of the different and most common building blocks and tools that compose text based NLP engines:

Tokenizers:
A core model for breaking the text down into tokens (a token usually refers to a word) using delimiters (such as space, comma and so on).
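
As a rough illustration of delimiter-based tokenizing, here is a minimal sketch in plain Java. The class name SimpleTokenizer and the chosen delimiter set are assumptions made for this example; real tokenizers (including the one in CoreNLP) handle many more cases, such as abbreviations, contractions and punctuation tokens.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizer {

    // Splits the text on whitespace and a few common punctuation delimiters.
    // The delimiters themselves are dropped, which real tokenizers usually do not do.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[\\s,.;:!?]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [My, brother, Joe, is, a, software, engineer]
        System.out.println(tokenize("My brother Joe is a software engineer."));
    }
}
```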

Statistical and Annotation Engines:
  1. Part Of Speech annotation (POS) - A model that annotates every word in the sentence with its part of speech, such as: verb, noun, pronoun, adverb, adjective and so on...
    For example, some common POS tags are:
    PRP$ = Possessive pronoun
    NN = Noun
    VBZ = Verb (3rd person singular present)
    NNP = Proper noun
  2. Named Entity Recognition (NER) - A model that annotates words that carry a semantic layer, based on a statistical model or on regular expression rules. Common NER annotations are: Location, Organization, Person, Money, Percent, Time and Date…

  3. Creating new annotations - A very useful technique when analyzing text is using your own annotations to highlight and give meaning to words or phrases that are specific to your program's context. Most NLP engines give you the infrastructure to create and define your own annotations, which will run in the NLP engine's text analysis cycle (this is part of the training described in section 2 of "Customizing NLP Engine for your own needs" below).
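
To make the idea of rule-based and custom annotations more concrete, here is a minimal sketch in plain Java that labels tokens using regular expression rules. The class name RegexAnnotator, the FLIGHT_NUMBER label and both patterns are hypothetical and chosen only for illustration; "O" is used here as the label for tokens that match no rule. Real engines apply such rules inside their annotation cycle and usually combine them with statistical models.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class RegexAnnotator {

    // Rules are checked in insertion order; the first full match wins.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("PERCENT", Pattern.compile("\\d+(\\.\\d+)?%"));
        // Hypothetical custom annotation for an imaginary travel application
        RULES.put("FLIGHT_NUMBER", Pattern.compile("[A-Z]{2}\\d{2,4}"));
    }

    // Returns the label of the first rule matching the whole token, or "O" if none does.
    public static String annotate(String token) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(token).matches()) {
                return rule.getKey();
            }
        }
        return "O";
    }

    public static void main(String[] args) {
        for (String token : new String[] {"Flight", "LH123", "is", "95%", "full"}) {
            System.out.println(token + " -> " + annotate(token));
        }
    }
}
```
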
Semantic Graph / Relationship extraction:
An important NLP model that maps the connections between all sentence entities as a graph object. This enables us to traverse the produced graph and find meaningful connections.

For example:       
"My brother Joe is a software engineer"
"Joe My brother is a software engineer"
"A software engineer is Joe my brother"

Using the semantic graph we can extract the same meaning from all of the sentences above:
Joe <=>  My brother
Joe <=>  software engineer
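
The sketch below shows, in plain Java, what "traveling on the graph" can look like once the connections exist. It does not parse anything: the edges are written by hand, and the class name TinySemanticGraph as well as the relation labels same_as and is_a are hypothetical simplifications, not the relation names a real engine produces.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TinySemanticGraph {

    // For every node, the set of "relation:otherNode" connections it takes part in.
    private final Map<String, Set<String>> edges = new TreeMap<>();

    // Stores the connection in both directions so it can be found from either node.
    public void connect(String a, String relation, String b) {
        edges.computeIfAbsent(a, k -> new TreeSet<>()).add(relation + ":" + b);
        edges.computeIfAbsent(b, k -> new TreeSet<>()).add(relation + ":" + a);
    }

    // Returns all connections of the given node (empty if the node is unknown).
    public Set<String> connectionsOf(String node) {
        return edges.getOrDefault(node, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        // Hand-written edges standing in for what an engine would extract
        // from any of the three example sentences.
        TinySemanticGraph graph = new TinySemanticGraph();
        graph.connect("Joe", "same_as", "my brother");
        graph.connect("Joe", "is_a", "software engineer");

        // prints [is_a:software engineer, same_as:my brother]
        System.out.println(graph.connectionsOf("Joe"));
    }
}
```

Whatever the word order of the input sentence, code that consumes the graph only has to ask for the connections of "Joe".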

Dictionaries:
A regular expression based model that holds a set of words or regular expressions. Those words should have the same meaning or translation, and should be handled / annotated in the same way once found in a text. Some dictionaries are provided by the NLP engine and used in the NER model or as other predefined annotations (in the core annotation engine, depending on the NLP engine used), while others will be created by us when we customize the engine (as part of the training described in section 2 of "Customizing NLP Engine for your own needs" below).

An example of a dictionary can be all the cities in a country or all the degrees offered by some university. Another case is synonyms: for example, road can be treated as a synonym for alley, avenue, street, boulevard etc...
Once the model recognizes a dictionary word, it will annotate it with an annotation you predefined. For example, ROAD_SYN can be the annotation for all the road synonyms above.
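
A minimal sketch of the ROAD_SYN dictionary in plain Java could look as follows. The class name DictionaryAnnotator and the use of "O" for words that are not in the dictionary are assumptions made for this example; in a real engine the dictionary would typically be loaded from a file and plugged into the annotation cycle.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class DictionaryAnnotator {

    // Maps every dictionary word (lower case) to the annotation it should receive.
    private static final Map<String, String> DICTIONARY = new HashMap<>();
    static {
        for (String word : Arrays.asList("road", "alley", "avenue", "street", "boulevard")) {
            DICTIONARY.put(word, "ROAD_SYN");
        }
    }

    // Case-insensitive lookup; returns "O" when the token is not in the dictionary.
    public static String annotate(String token) {
        return DICTIONARY.getOrDefault(token.toLowerCase(Locale.ROOT), "O");
    }

    public static void main(String[] args) {
        for (String token : new String[] {"Turn", "left", "at", "the", "Avenue"}) {
            System.out.println(token + " -> " + annotate(token));
        }
    }
}
```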

Some Usable Add-on Models and Algorithms:
  1. Stemming (for example Snowball) and lemmatization algorithms - Extract the root or base form of a token / word. For example, "stemmer", "stemming" and "stemmed" will all result in the same base form, "stem". This can be very useful when adding the semantic layer on a predefined verb: there is no need to know all the forms it can take, just the root one.
  2. TrueCase - Recognizes the true case of tokens in text where this information was lost, e.g., text that is all upper or all lower case. This can be useful since NLP statistical models give more accurate results when the text is well formed (grammar, punctuation, spelling and casing), and the input is not always well formed in its casing. For example, speech to text engines commonly output text in lower case.
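
To show the stemming idea on the example words above, here is a deliberately naive suffix-stripping sketch in plain Java. It is not the Snowball/Porter algorithm and not a lemmatizer: the class name NaiveStemmer, the three suffixes and the doubled-consonant rule are assumptions picked so that the example works, and the sketch will produce wrong stems for many English words (for example "passed" becomes "pas").

```java
import java.util.Locale;

public class NaiveStemmer {

    private static final String[] SUFFIXES = {"ing", "ed", "er"};

    // Strips the first matching suffix, as long as at least three characters remain,
    // then collapses a doubled final consonant ("stemm" -> "stem").
    public static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            if (w.endsWith(suffix) && stemLength >= 3) {
                return undouble(w.substring(0, stemLength));
            }
        }
        return w;
    }

    // Removes the last character if the word ends with the same consonant twice.
    private static String undouble(String w) {
        int n = w.length();
        boolean doubled = n >= 2 && w.charAt(n - 1) == w.charAt(n - 2);
        boolean consonant = n >= 1 && "aeiou".indexOf(w.charAt(n - 1)) < 0;
        return (doubled && consonant) ? w.substring(0, n - 1) : w;
    }

    public static void main(String[] args) {
        // Each of these prints "stem"
        System.out.println(stem("stemmer"));
        System.out.println(stem("stemming"));
        System.out.println(stem("stemmed"));
    }
}
```
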
Customizing NLP Engine for your own needs:
The last and most important part of evaluating text is tuning and configuring the engine to best suit our needs. There are many NLP engines and each has its own models for training the data, but there are two common training techniques:
  1. Training the core engine data - This technique is used to train the core data to give more accurate results based on new input fed offline to the engine (very similar to machine learning). Using this training model you change the core statistical model.
    To use this training we create a document (preferably a problematic one, for which the engine produces wrong output or does not recognize entities we would like it to recognize) and re-annotate it correctly ourselves. Then we feed it to the statistical model so it can recalibrate using the new information. This will eventually give more accurate results when, at run time, the engine analyzes text that is similar to the training document.
  2. Creating and training new data for the engine to consider - This training is done by using newly created models and algorithms (such as those described above) that are dedicated to our program's needs, for example: new annotations, dictionaries, regular expressions and semantic layering that help recognize phrases or patterns in the analyzed text and give them meaning in our program's context.
    This training does not change the core engine statistical model; it only adds new features on top of it, which are needed to make the engine suitable for our needs.
This sums up the main features and functionality found in the most common out-of-the-box NLP engines.

In the next part I'll give code examples showing how to use Stanford CoreNLP (an open source library in Java) to start writing and configuring your own NLP engine.