Sunday, November 30, 2014

Analyzing Text Using Stanford CoreNLP

Part 2 - Implementation

This tutorial is a step-by-step "how to" for implementing a text-analysis project using the Stanford CoreNLP framework.

Let's start with a short introduction:
Stanford CoreNLP is an open source NLP framework (under the GNU General Public License) created by Stanford University for labeling text with NLP annotations (such as POS, NER, lemmas, coreference and so on) and performing relationship extraction.

For more elaborate information on what kind of NLP engine Stanford CoreNLP is, you can refer to my "How does NLP engines work in a nutshell" blog post.

OK, so let's start!

1. Downloading Stanford CoreNLP:
Stanford CoreNLP jars can be downloaded from: http://nlp.stanford.edu/software/corenlp.shtml
The jars are also published to the Maven Central repository, so you can add them to your Maven or Gradle project as dependencies.
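
For example, with Maven the dependencies might look like this (a sketch using version 3.5.0, the version this post was tested against - check Maven Central for the latest):

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.0</version>
</dependency>
<!-- the trained models ship as a separate artifact, selected via a classifier -->
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.0</version>
    <classifier>models</classifier>
</dependency>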

2. What the Download Contains:
The core jar – containing all the Java class files
The models jar – containing all the resource files (trained models, taggers and so on)
Sources jar – the source code of the core jar
Dependencies jars – all the 3rd party jars needed to compile the project
Additional resources – some additional resources and examples (I will ignore those here) 

3. Creating a project:
I prefer working with Maven or Gradle, since it simplifies project creation and dependency management, and I recommend using this approach.
Having said that, if you prefer to create a non-Maven/Gradle project, all you need to do is add all the jars from the downloaded CoreNLP zip file to your project's build path.
Also note that the latest CoreNLP requires Java 8, so make sure you have Java 8 installed and configured in your project. 
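
As a sketch, a minimal build.gradle for such a project might look like this (Gradle syntax of that era, using the same coordinates as the Maven example above):

apply plugin: 'java'

// CoreNLP 3.5 is compiled for Java 8
sourceCompatibility = 1.8

repositories {
    mavenCentral()
}

dependencies {
    compile 'edu.stanford.nlp:stanford-corenlp:3.5.0'
    // the trained models, pulled in via the "models" classifier
    compile 'edu.stanford.nlp:stanford-corenlp:3.5.0:models'
}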

4. Analyzing Text
OK, now that we have our project we can start writing code to evaluate text.
Here is a code snippet illustrating all the steps needed to start working with CoreNLP (you can copy the code below into a main method to try it out - tested on version 3.5; the import statements at the top belong at the top of your class file):

// Imports used by the snippet below (the RegexNERAnnotator/TokensRegexAnnotator
// imports are only needed if you uncomment the custom-annotator lines further down)
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.RegexNERAnnotator;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.pipeline.TokensRegexAnnotator;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.trees.GrammaticalRelation;
import edu.stanford.nlp.util.CoreMap;

/* The first step is initiating the Stanford CoreNLP pipeline 
   (the pipeline will later be used to evaluate the text and annotate it).
   The pipeline is initiated using a Properties object, which is used for setting all the needed
   entities, annotations, training data and so on, in order to customize the pipeline
   initialization to contain only the models you need */
Properties props = new Properties();

/* The "annotators" property key tells the pipeline which annotators should be initiated with our
     pipeline object. See http://nlp.stanford.edu/software/corenlp.shtml for a complete reference 
     to the "annotators" values you can set here and what they contribute to the analysis process */
props.put( "annotators", "tokenize, ssplit, pos, lemma, ner, regexner, parse, dcoref" );
StanfordCoreNLP pipeLine = new StanfordCoreNLP( props );

/* Next we could add customized annotators and trained data.
   I will elaborate on training data in my next blog chapter; the two lines below stay
   commented out for now, since their placeholder file paths must be replaced with real files */
// pipeLine.addAnnotator(new RegexNERAnnotator("some RegexNer structured file"));
// pipeLine.addAnnotator(new TokensRegexAnnotator("some tokenRegex structured file"));

/* Next we prepare the document date; annotators that normalize time expressions
   (such as SUTime, which the "ner" annotator uses) resolve relative dates against it */
SimpleDateFormat formatter = new SimpleDateFormat( "yyyy-MM-dd" );
String currentTime = formatter.format( System.currentTimeMillis() );

// inputText will be the text to evaluate in this example
String inputText = "some text to evaluate";

// We generate an Annotation object wrapping the text and stamp it with the document date
Annotation document = new Annotation( inputText );
document.set( CoreAnnotations.DocDateAnnotation.class, currentTime );

// Finally we use the pipeline to annotate the document we created
pipeLine.annotate( document );

/* Now that we have the document (wrapping our inputText) annotated, we can extract the
    annotated sentences from it; annotated sentences are represented by CoreMap objects */
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

/* Next we can iterate over the annotated sentences and extract the annotated words,
    using the CoreLabel object */
for (CoreMap sentence : sentences)
{
    for (CoreLabel token : sentence.get(TokensAnnotation.class))
    {
        // Using the CoreLabel object we can start retrieving NLP annotation data
        // Extracting the text of the token
        String text = token.getString(TextAnnotation.class);

        // Extracting the Named Entity Recognition tag
        String ner = token.getString(NamedEntityTagAnnotation.class);

        // Extracting Part Of Speech
        String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);

        // Extracting the Lemma
        String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
        System.out.println("text=" + text + ";NER=" + ner +
                        ";POS=" + pos + ";LEMMA=" + lemma);

        /* There are more annotations available for extraction 
            (depending on which "annotators" you initiated the pipeline properties with);
            examine the token, sentence and document objects to find any relevant annotation 
            you might need */
    }

    /* Next we extract the SemanticGraph to examine the connections 
       between the words in our evaluated sentence */
    SemanticGraph dependencies = sentence.get(CollapsedDependenciesAnnotation.class);

    /* The IndexedWord object is very similar to the CoreLabel object, 
        only it is used in the SemanticGraph context */
    IndexedWord firstRoot = dependencies.getFirstRoot();

    // Note that a root typically has no incoming edges, so this list is usually empty;
    // it is shown here only to illustrate the API
    List<SemanticGraphEdge> incomingEdgesSorted = dependencies.getIncomingEdgesSorted(firstRoot);

    for(SemanticGraphEdge edge : incomingEdgesSorted)
    {
        // The dependent is the target node of the edge
        IndexedWord dep = edge.getDependent();

        // The governor is the source node of the edge
        IndexedWord gov = edge.getGovernor();

        // Get the grammatical relation between them (e.g. det, nsubj, dobj)
        GrammaticalRelation relation = edge.getRelation();
    }
 
    // This section is the same as above, except we retrieve the outgoing edges
    // (unlike the incoming ones, these do exist for the root)
    List<SemanticGraphEdge> outEdgesSorted = dependencies.getOutEdgesSorted(firstRoot);
    for(SemanticGraphEdge edge : outEdgesSorted)
    {
        IndexedWord dep = edge.getDependent();
        System.out.println("Dependent=" + dep);
        IndexedWord gov = edge.getGovernor();
        System.out.println("Governor=" + gov);
        GrammaticalRelation relation = edge.getRelation();
        System.out.println("Relation=" + relation);
    }
}
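
For reference, running the snippet on the sample string "some text to evaluate" should print one line per token along these lines (the exact tags may vary with the model version), followed by Dependent/Governor/Relation lines for the outgoing edges of the sentence root:

text=some;NER=O;POS=DT;LEMMA=some
text=text;NER=O;POS=NN;LEMMA=text
text=to;NER=O;POS=TO;LEMMA=to
text=evaluate;NER=O;POS=VB;LEMMA=evaluate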

The code snippet above should be a good starting point for evaluating text input using the CoreNLP framework, but there is much more to the Stanford framework than I can cover in this short tutorial. You can always refer to the Stanford CoreNLP home page (http://nlp.stanford.edu/software/corenlp.shtml) to look for additional capabilities not covered here.

