A simple intro to tokenizing with OpenNLP in Clojure

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

A nice intro:

Finding sentences

Words (tokens) aren’t the only structures that we’re interested in, however. Another interesting and useful grammatical structure is the sentence. In this recipe, we’ll use a process similar to the one we used in the previous recipe, Tokenizing text, in order to create a function that will pull sentences from a string in the same way that tokenize pulled tokens from a string in the last recipe.

Getting ready

We’ll need to include clojure-opennlp in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

We will also need to require it into the current namespace:

(require '[opennlp.nlp :as nlp])

Finally, we’ll download a model for a statistical sentence splitter. I downloaded en-sent.bin from http://opennlp.sourceforge.net/models-1.5/. I then saved it into models/en-sent.bin.

How to do it…

As in the Tokenizing text recipe, we will start by loading the sentence identification model data, as shown here:

(def get-sentences
  (nlp/make-sentence-detector "models/en-sent.bin"))

Now, we use that data to split a text into a series of sentences, as follows:

user=> (get-sentences "I never saw a Purple Cow.
I never hope to see one.
But I can tell you, anyhow.
I'd rather see than be one.")
["I never saw a Purple Cow."
 "I never hope to see one."
 "But I can tell you, anyhow."
 "I'd rather see than be one."]

How it works…

The data model in models/en-sent.bin contains the information that OpenNLP needs to recreate a previously trained sentence identification algorithm. Once we have reinstantiated this algorithm, we can use it to identify the sentences in a text, as we did by calling get-sentences.
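To make the "reinstantiate, then use" idea concrete, here is a rough sketch of doing the same thing directly against the OpenNLP Java classes through interop. It assumes opennlp-tools is on the classpath (clojure-opennlp depends on it) and that the model sits at models/en-sent.bin as above. The name load-sentence-detector is made up for this illustration, and this is not necessarily how clojure-opennlp implements make-sentence-detector internally.

```clojure
(require '[clojure.java.io :as io])
(import '[opennlp.tools.sentdetect SentenceModel SentenceDetectorME])

;; Hypothetical helper: reads the trained model from disk and returns
;; a function that splits a string into a vector of sentences.
(defn load-sentence-detector
  [model-path]
  (with-open [in (io/input-stream model-path)]
    ;; SentenceModel reads the whole stream in its constructor,
    ;; so it is safe to close the stream afterwards.
    (let [model    (SentenceModel. in)
          detector (SentenceDetectorME. model)]
      (fn [text]
        ;; sentDetect returns a Java String[]; vec turns it into a vector.
        (vec (.sentDetect detector text))))))

;; Usage, assuming the model file exists:
;; (def get-sentences* (load-sentence-detector "models/en-sent.bin"))
;; (get-sentences* "I never saw a Purple Cow. I never hope to see one.")
```

The model is loaded once and the returned function is reused, which mirrors why get-sentences is defined with def rather than rebuilt on every call.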

Focusing on content words with stoplists

A stoplist is a list of words (stopwords) that should not be included in further analysis. Usually, this is because they're so common that they don't add much information to the analysis.

These lists are usually dominated by what are known as function words: words that have a grammatical purpose in the sentence but carry little meaning of their own. For example, "the" marks the noun that follows as definite, but it does not have a meaning by itself. Other function words, such as the preposition "after", do have a meaning, but they are so common that they tend to get in the way.

On the other hand, "chair" has a meaning beyond what it's doing in the sentence, and in fact, its role in the sentence will vary (subject, direct object, and so on).

You don't always want to use a stoplist, since removing words throws away information. However, because function words are more frequent than content words, focusing on the content words can sometimes add clarity to your analysis and its output. Dropping stopwords can also speed up processing, since there are fewer tokens to handle.
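A minimal sketch of stoplist filtering, using only clojure.core and clojure.string. The stopwords set here is a tiny made-up sample, not a real stoplist, and remove-stopwords is a name chosen for this illustration. The tokens are written out by hand so the example runs without a tokenizer model.

```clojure
(require '[clojure.string :as str])

;; Hypothetical, deliberately tiny stoplist; real ones are much longer.
(def stopwords
  #{"a" "an" "the" "of" "to" "in" "after" "i" "but" "than"})

(defn remove-stopwords
  "Drops every token whose lowercased form is in `stoplist`."
  [stoplist tokens]
  (remove #(contains? stoplist (str/lower-case %)) tokens))

(remove-stopwords stopwords ["I" "never" "saw" "a" "Purple" "Cow" "."])
;; => ("never" "saw" "Purple" "Cow" ".")
```

Keeping the stoplist in a set makes each membership check cheap, and lowercasing before the lookup means "I" and "i" are treated alike while the original tokens are returned unchanged.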
