Hello there! I’m back and I want this to be the first of a series of posts on Stanford’s CoreNLP library. In this article I will focus on the installation of the library and an introduction to its basic features for Java newbies like myself. I will first go through the installation steps and a couple of tests from the command line. I will then walk you through two very simple Java programs that you will be able to easily incorporate into your Python NLP pipeline. You can find the complete code on GitHub!

CoreNLP is a toolkit with which you can generate a quite complete NLP pipeline with only a few lines of code. The library includes pre-built methods for all the main NLP procedures, such as Part of Speech (POS) tagging, Named Entity Recognition (NER), Dependency Parsing and Sentiment Analysis. It also supports languages other than English, more specifically Arabic, Chinese, German, French, and Spanish.
I am a big fan of the library, mainly because of HOW COOL its Sentiment Analysis model is ❤ (I will talk more about it in the next post). However, I can see why most people would rather use other libraries like NLTK or spaCy, as CoreNLP can be a bit of overkill. The reality is that coreNLP can be much more computationally expensive than other libraries, and for shallow NLP processes the results are not even significantly better. Plus it’s written in Java, and getting started with it is a bit of a pain for Python users (it is doable though, as you will see below, and it also has a Python API if you can’t be bothered).
- CoreNLP Pipeline and Basic Annotators
The basic building block of coreNLP is the coreNLP pipeline. The pipeline takes an input text, processes it and outputs the results of this processing in the form of a coreDocument object. A coreNLP pipeline can be customised and adapted to the needs of your NLP project. The properties object allows you to do this customisation by adding, removing or editing annotators.
That was a lot of jargon, so let’s break it down with an example. All the information and figures were extracted from the official coreNLP page.

In the figure above we have a basic coreNLP pipeline, the one that is run by default when you first run the coreNLP pipeline class without changing anything. At the very left we have the input text entering the pipeline; this will usually be a plain .txt file. The pipeline itself is composed of six annotators. Each of these annotators processes the input text sequentially, with the intermediate outputs of the processing sometimes being used as inputs by other annotators. If we wanted to change this pipeline by adding or removing annotators, we would use the properties object. The final output is a set of annotations in the form of a coreDocument object.
We will be working with this basic pipeline throughout the article. The nature of these objects will become clearer later on when we look at an example. For the moment, let’s note down what each of the annotators does:
- Annotator 1: Tokenization → turns raw text into tokens.
- Annotator 2: Sentence Splitting → divides raw text into sentences.
- Annotator 3: Part of Speech (POS) Tagging → assigns part of speech labels to tokens, such as whether they are verbs or nouns. Each token in the text will be given a tag.

- Annotator 4: Lemmatization → converts every word into its lemma, i.e. its dictionary form. For example, the word "was" is mapped to "be".
- Annotator 5: Named Entity Recognition (NER) → Recognises when an entity (a person, country, organization etc…) is named in a text. It also recognises numerical entities such as dates.

- Annotator 6: Dependency Parsing → parses the text and highlights dependencies between words.

Lastly, all the outputs from the 6 annotators are organised into a CoreDocument. These are basically data objects that contain annotation information in a structured way. CoreDocuments make our lives easier since, as you will see later on, they store all the information so that we can access it with a simple API.
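As a first taste of what that customisation looks like (the full Java examples come later in the post), a pipeline that keeps only the first four annotators above could be set up with something like the sketch below; it assumes the usual coreNLP imports, which we will see in the Java examples, and dropping the heavier annotators such as depparse should make the pipeline noticeably faster.
// minimal sketch: run only the first four annotators
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);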

- Installation
You will need to have Java installed. You can download the latest version here. To download CoreNLP, I followed the official guide:
- Download the CoreNLP zip file using curl or wget
curl -O -L http://nlp.stanford.edu/software/stanford-corenlp-latest.zip
- Unzip the file
unzip stanford-corenlp-latest.zip
- Move into the newly created directory
cd stanford-corenlp-4.1.0
Let’s now go through a couple of examples to make sure everything works.
- Example using the command line and an input.txt file
For this example, we will first open the terminal and create a test file that we will use as input. The code was adapted from coreNLP’s official site. You can use the following command:
echo "the quick brown fox jumped over the lazy dog" > test.txt
The echo command prints the sentence "the quick brown fox jumped over the lazy dog" into the test.txt file.
Let’s now run a default coreNLP pipeline on the test sentence.
java -cp "*" -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat xml -file test.txt
This is a Java command that loads and runs the coreNLP pipeline from the class edu.stanford.nlp.pipeline.StanfordCoreNLP. Since we have not changed anything in that class, the settings will be the default ones. The pipeline will take the test.txt file as input and will output an XML file.
Once you run the command the pipeline will start annotating the text. You will notice it takes a while… (around 20 seconds for a 9-word sentence 🙄 ). The output will be a file named test.txt.xml. This process will also automatically generate, as a side product, an XSLT stylesheet (CoreNLP-to-HTML.xsl), which will convert the XML into HTML if you open it in a browser.


Seems that everything is working fine!! We see the standard pipeline is actually quite complex. It includes all the annotators we saw in the section above: tokenization, sentence splitting, lemmatization, POS tagging, NER and dependency parsing.
Note: I displayed it using Firefox; however, it took me ages to figure out how to do this because apparently in 2019 Firefox stopped allowing it. One can get around this by going to the about:config page and changing the privacy.file_unique_origin setting to False. If it doesn’t work for you, you can choose json as the outputFormat or open the XML file with a text editor.
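For example, the equivalent command for a JSON output would be:
java -cp "*" -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file test.txt
This should produce a test.txt.json file instead of the XML one.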
- Example using the interactive shell mode
For our second example you will also use exclusively the terminal. CoreNLP has a cool interactive shell mode that you can enter by running the following command.
java -cp "*" -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP
Once you enter this interactive mode, you just have to type a sentence or group of sentences and they will be processed by the basic annotators on the fly! Below you can see an example of how the sentence "Hello my name is Laura" is analysed.

We can see the same annotations we saw in the XML file printed in the Terminal in a different format! You can also try it out with longer texts.
- Example using very simple Java code
Now let’s go through a couple of Java code examples! We will basically create and tune the pipeline using Java, and then we will output the results onto a .txt file that can then be incorporated into our Python or R NLP pipeline. The code was adapted from coreNLP’s official site.
Example 1
Find the complete code in my GitHub. I will first run you through the coreNLP_pipeline1_LBP.java file. We start the file by importing all the needed dependencies. Then we make up an example text that we will use for our analysis. You can change this to any other example:
public static String text = "Marie was born in Paris.";
Now we set up the pipeline, create a document and annotate it using the following lines:
// set up pipeline properties
Properties props = new Properties();
// set the list of annotators to run
props.setProperty("annotators","tokenize,ssplit,pos,lemma,ner,depparse");
// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// create a document object and annotate it
CoreDocument document = new CoreDocument(text);
pipeline.annotate(document);
The rest of the lines of the file will print out on the terminal several tests to make sure the pipeline worked fine. For instance, we first get the list of sentences of the input document.
// get sentences of the document
List<CoreSentence> sentences = document.sentences();
System.out.println("Sentences of the document");
System.out.println(sentences);
System.out.println();
Notice that we get the list of sentences using the method .sentences() on the document object. Similarly, we get the list of tokens of a sentence using the method .tokens() on the sentence object, and the individual word and lemma using the methods .word() and .lemma() on the tok object.
// loop over the sentences of the document
for (CoreSentence sentence : sentences) {
    List<CoreLabel> tokens = sentence.tokens();
    System.out.println("Tokens of the sentence:");
    for (CoreLabel tok : tokens) {
        System.out.println("Token: " + tok.word());
        System.out.println("Lemma: " + tok.lemma());
    }
}
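The coreSentence object also exposes the POS and NER annotations that we will write out in Example 2, as well as the dependency parse. Inside the same loop over sentences, a small sketch could look like this:
// sentence-level lists of POS and NER tags
System.out.println("POS tags: " + sentence.posTags());
System.out.println("NER tags: " + sentence.nerTags());
// dependency parse of the sentence
System.out.println("Dependency parse:");
System.out.println(sentence.dependencyParse());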
To run the file you only need to save it in your stanford-corenlp-4.1.0 directory and use the command:
java -cp "*" coreNLP_pipeline1_LBP.java
The results should look like:

Example 2
The second example, coreNLP_pipeline2_LBP.java, is slightly different, since it reads a file coreNLP_input.txt as the input document and outputs the results onto a coreNLP_output.txt file.
As input text we used the short story The Fox and the Grapes. It is a document with 2 paragraphs and 6 sentences. The processing will be similar to the one in the example above, except this time we will also keep track of the paragraph and sentence number.
The biggest changes will be regarding reading the input and writing the final output. This bit of code below will create the output file (if it doesn’t exist yet) and print the column names using PrintWriter…
File file = new File("coreNLP_output.txt");
// create the file if it doesn't exist
if (!file.exists()) {
    file.createNewFile();
}
PrintWriter out = new PrintWriter(file);
// print column names on the output document
out.println("par_id;sent_id;words;lemmas;posTags;nerTags;depParse");
…and this other bit will read the input document using Scanner. Each line of the input document will be saved as a String text that we can process just like the one in Example 1.
File myObj = new File("coreNLP_input.txt");
Scanner myReader = new Scanner(myObj);
while (myReader.hasNextLine()) {
    String text = myReader.nextLine();
    // ... annotate text and write one output row per sentence (sketched below)
}
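Inside that loop, each line is annotated exactly as in Example 1 and one ";"-separated row per sentence is written to the output file. Below is a hypothetical sketch of that step (parId and sentId are just placeholder names for the paragraph and sentence counters, and it assumes java.util.ArrayList is imported):
// hypothetical sketch: write one ';'-separated row per sentence
List<String> lemmas = new ArrayList<>();
for (CoreLabel tok : sentence.tokens()) {
    lemmas.add(tok.lemma());
}
out.println(parId + ";" + sentId + ";"
    + sentence.tokensAsStrings() + ";" + lemmas + ";"
    + sentence.posTags() + ";" + sentence.nerTags());
// the dependency parse (sentence.dependencyParse()) can be appended as a last column,
// but note that its default string form spans several lines
Finally, don’t forget to close the writer and the reader once the loop is done, otherwise part of the output may never get flushed to the file:
out.close();
myReader.close();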
Once the file coreNLP_pipeline2_LBP.java is run and the output generated, one can open it as a dataframe using the following Python code:
import pandas as pd
df = pd.read_csv('coreNLP_output.txt', delimiter=';', header=0)
The resulting dataframe will look like this, and can be used for further analysis!

- Conclusions
As you have seen, coreNLP can be very easy to use and easily incorporated into a Python NLP pipeline! You could also print the output directly onto a .csv file and use other delimiters, but I was having some annoying parsing problems…. Hope you enjoyed the post anyway, and remember the complete code is available on GitHub.
In the following post we will start talking about the Recursive Sentiment Analysis model and how to use it with coreNLP and Java. Stay posted to learn more about coreNLP ✌🏻
- Bibliography