Sunday, November 30, 2014

Analyzing Text Using Stanford CoreNLP

Part 2 - Implementation

This tutorial is a step-by-step "how to" for implementing a text-analysis project using the Stanford CoreNLP framework.

Let's start with a short introduction:
Stanford CoreNLP is an open source NLP framework (under the GNU General Public License) created by Stanford University for labeling text with NLP annotations (such as POS, NER, lemmas, coreference and so on) and performing relationship extraction.

For more elaborate information on what kind of NLP engine Stanford CoreNLP is, you can refer to my "How does NLP engines work in a nutshell" blog post.

OK, so let's start!

1. Downloading Stanford CoreNLP:
Stanford CoreNLP jars can be downloaded from: http://nlp.stanford.edu/software/corenlp.shtml
The jars are also published to the Maven Central repository, so you can add them to your Maven or Gradle project as dependencies.
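
For example, with Maven the dependencies might look like this (a sketch using version 3.5.0, the version this post was tested against - check Maven Central for the latest):

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.0</version>
</dependency>
<!-- the trained models ship as a separate artifact, selected via a classifier -->
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.0</version>
    <classifier>models</classifier>
</dependency>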

2. What the Download Contains:
The core jar – containing all the Java class files
The models jar – containing all the resource files (trained models, taggers and so on)
Sources jar – the source code of the core jar
Dependencies jars – all the 3rd party jars needed to compile the project
Additional resources – some additional resources and examples (I will ignore those here) 

3. Creating a project:
I prefer working with Maven or Gradle, since it simplifies project creation and dependency management, and I recommend using this approach.
Having said that, if you prefer to create a non-Maven/Gradle project, all you need to do is add all the jars from the downloaded CoreNLP zip file to your project's build path.
Also note that the latest CoreNLP requires Java 8, so make sure you have Java 8 installed and configured in your project. 
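
As a sketch, a minimal build.gradle for such a project might look like this (Gradle syntax of that era, using the same coordinates as the Maven example above):

apply plugin: 'java'

// CoreNLP 3.5 is compiled for Java 8
sourceCompatibility = 1.8

repositories {
    mavenCentral()
}

dependencies {
    compile 'edu.stanford.nlp:stanford-corenlp:3.5.0'
    // the trained models, pulled in via the "models" classifier
    compile 'edu.stanford.nlp:stanford-corenlp:3.5.0:models'
}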

4. Analyzing Text
OK, now that we have our project we can start writing code to evaluate text.
Here is a code snippet illustrating all the steps needed to start working with CoreNLP (you can copy the code below into a main method to try it out - tested on version 3.5; the import statements at the top belong at the top of your class file):

// Imports used by the snippet below (the RegexNERAnnotator/TokensRegexAnnotator
// imports are only needed if you uncomment the custom-annotator lines further down)
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.RegexNERAnnotator;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.pipeline.TokensRegexAnnotator;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.trees.GrammaticalRelation;
import edu.stanford.nlp.util.CoreMap;

/* The first step is initiating the Stanford CoreNLP pipeline 
   (the pipeline will later be used to evaluate the text and annotate it).
   The pipeline is initiated using a Properties object, which is used for setting all the needed
   entities, annotations, training data and so on, in order to customize the pipeline
   initialization to contain only the models you need */
Properties props = new Properties();

/* The "annotators" property key tells the pipeline which annotators should be initiated with our
     pipeline object. See http://nlp.stanford.edu/software/corenlp.shtml for a complete reference 
     to the "annotators" values you can set here and what they contribute to the analysis process */
props.put( "annotators", "tokenize, ssplit, pos, lemma, ner, regexner, parse, dcoref" );
StanfordCoreNLP pipeLine = new StanfordCoreNLP( props );

/* Next we could add customized annotators and trained data.
   I will elaborate on training data in my next blog chapter; the two lines below stay
   commented out for now, since their placeholder file paths must be replaced with real files */
// pipeLine.addAnnotator(new RegexNERAnnotator("some RegexNer structured file"));
// pipeLine.addAnnotator(new TokensRegexAnnotator("some tokenRegex structured file"));

/* Next we prepare the document date; annotators that normalize time expressions
   (such as SUTime, which the "ner" annotator uses) resolve relative dates against it */
SimpleDateFormat formatter = new SimpleDateFormat( "yyyy-MM-dd" );
String currentTime = formatter.format( System.currentTimeMillis() );

// inputText will be the text to evaluate in this example
String inputText = "some text to evaluate";

// We generate an Annotation object wrapping the text and stamp it with the document date
Annotation document = new Annotation( inputText );
document.set( CoreAnnotations.DocDateAnnotation.class, currentTime );

// Finally we use the pipeline to annotate the document we created
pipeLine.annotate( document );

/* Now that we have the document (wrapping our inputText) annotated, we can extract the
    annotated sentences from it; annotated sentences are represented by CoreMap objects */
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

/* Next we can iterate over the annotated sentences and extract the annotated words,
    using the CoreLabel object */
for (CoreMap sentence : sentences)
{
    for (CoreLabel token : sentence.get(TokensAnnotation.class))
    {
        // Using the CoreLabel object we can start retrieving NLP annotation data
        // Extracting the text of the token
        String text = token.getString(TextAnnotation.class);

        // Extracting the Named Entity Recognition tag
        String ner = token.getString(NamedEntityTagAnnotation.class);

        // Extracting Part Of Speech
        String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);

        // Extracting the Lemma
        String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
        System.out.println("text=" + text + ";NER=" + ner +
                        ";POS=" + pos + ";LEMMA=" + lemma);

        /* There are more annotations available for extraction 
            (depending on which "annotators" you initiated the pipeline properties with);
            examine the token, sentence and document objects to find any relevant annotation 
            you might need */
    }

    /* Next we extract the SemanticGraph to examine the connections 
       between the words in our evaluated sentence */
    SemanticGraph dependencies = sentence.get(CollapsedDependenciesAnnotation.class);

    /* The IndexedWord object is very similar to the CoreLabel object, 
        only it is used in the SemanticGraph context */
    IndexedWord firstRoot = dependencies.getFirstRoot();

    // Note that a root typically has no incoming edges, so this list is usually empty;
    // it is shown here only to illustrate the API
    List<SemanticGraphEdge> incomingEdgesSorted = dependencies.getIncomingEdgesSorted(firstRoot);

    for(SemanticGraphEdge edge : incomingEdgesSorted)
    {
        // The dependent is the target node of the edge
        IndexedWord dep = edge.getDependent();

        // The governor is the source node of the edge
        IndexedWord gov = edge.getGovernor();

        // Get the grammatical relation between them (e.g. det, nsubj, dobj)
        GrammaticalRelation relation = edge.getRelation();
    }
 
    // This section is the same as above, except we retrieve the outgoing edges
    // (unlike the incoming ones, these do exist for the root)
    List<SemanticGraphEdge> outEdgesSorted = dependencies.getOutEdgesSorted(firstRoot);
    for(SemanticGraphEdge edge : outEdgesSorted)
    {
        IndexedWord dep = edge.getDependent();
        System.out.println("Dependent=" + dep);
        IndexedWord gov = edge.getGovernor();
        System.out.println("Governor=" + gov);
        GrammaticalRelation relation = edge.getRelation();
        System.out.println("Relation=" + relation);
    }
}
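
For reference, running the snippet on the sample string "some text to evaluate" should print one line per token along these lines (the exact tags may vary with the model version), followed by Dependent/Governor/Relation lines for the outgoing edges of the sentence root:

text=some;NER=O;POS=DT;LEMMA=some
text=text;NER=O;POS=NN;LEMMA=text
text=to;NER=O;POS=TO;LEMMA=to
text=evaluate;NER=O;POS=VB;LEMMA=evaluate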

The code snippet above should be a good starting point for evaluating text input using the CoreNLP framework, but there is much more to the Stanford framework than I can cover in this short tutorial. You can always refer to the Stanford CoreNLP home page (http://nlp.stanford.edu/software/corenlp.shtml) to look for additional capabilities not covered here.

