When I think of programming libraries and frameworks for text-based data, I think of Python and libraries like the Natural Language Toolkit (NLTK). However, there are still plenty of language and text-processing libraries for JavaScript. I’ll look at two on this page, starting with RiTaJS. RiTaJS is the JavaScript implementation of RiTa by Daniel Howe.
Designed to support the creation of new works of computational literature, the RiTa library provides tools for artists and writers working with natural language in programmable media.
The RiTa library has numerous features for text analysis and generation. For example, it can generate text with algorithms and systems (Markov chains, context-free grammars) that I’ll cover in later tutorials.
To use RiTa, you can grab the JavaScript library files from the RiTa download page. For my examples, I’m using the RiTa lexicon, which requires rita-full.js.
For now I want to look at two features, the RiString object and the RiLexicon object. RiString allows you to analyze a piece of text. You can tokenize it, count syllables, determine parts of speech, etc. One particularly useful function is features(), which returns an object with features for the sentence, including phonemes, syllables, stresses, etc.
Note that the part-of-speech tags come from the Penn Treebank Project.
RiTaJS also has a lexicon built into it. A lexicon is another word for “vocabulary” and operates like a machine-readable dictionary. The RiTa lexicon contains about 40,000 words along with associated spelling and phonemic data. The library provides many hooks into the lexicon. For example, you can ask it for random words of a given part of speech or with a certain syllable count. It can also provide rhyming words and words that “sound similar.” To use the lexicon, you simply need to make a RiLexicon object.
Once you have the object you can query it anywhere in your code.
You can also customize the lexicon by editing the JS library files themselves or programmatically with addWord() and removeWord().
Another useful natural language processing library in JavaScript is NLP-Compromise by Spencer Kelly; the source is on GitHub.
Just like with RiTaJS, you can download the library files to use with your sketch. However, most JavaScript libraries are also available via a “CDN” (Content Delivery Network), meaning you can reference them on a web server directly in index.html.
Once the library is loaded you can create a variable to call its functions on.
NLP-Compromise works by allowing you to chain together a series of functions that build and/or adjust a block of text. For example, if you want to work with a noun you would say:
But typically you’ll see these actions chained together:
NLP-Compromise can negate statements, conjugate verbs (and thereby alter tense), provide articles and pronouns, and more.
Some parts of this page are excerpted (and adapted for JavaScript) from Learning Processing.
Data can come from many different places: websites, news feeds, spreadsheets, databases, and so on. Let's say you've decided to make a map of the world's flowers. After searching online you might find a PDF version of a flower encyclopedia, or a spreadsheet of flower genera, or a JSON feed of flower data, or a REST API that provides geolocated lat/lon coordinates, or some web page someone put together with beautiful flower photos, and so on and so forth. The question inevitably arises: “I found all this data; which should I use, and how do I get it?”
If you are really lucky, you might find a JavaScript library that hands data to you directly with code. Maybe the answer is to just download this library and write some code like:
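Since no such library exists, here is a purely hypothetical sketch of what the experience might feel like. The FlowerDatabase class, its findFlower() method, and the values inside it are all invented for illustration; the class stands in for the library someone else would have written.

```javascript
// Hypothetical stand-in for a flower data library. No such library exists;
// the class, its method, and the values below are invented for illustration.
class FlowerDatabase {
  constructor() {
    // Placeholder records, not real botanical data.
    this.flowers = new Map([
      ['sunflower', { name: 'sunflower', heightInMeters: 2 }],
    ]);
  }

  // Return the record for a flower name, or null if it is unknown.
  findFlower(name) {
    return this.flowers.get(name) ?? null;
  }
}

// What using such a library might look like from your own sketch.
const fdb = new FlowerDatabase();
const sunflower = fdb.findFlower('sunflower');
console.log(sunflower.name, sunflower.heightInMeters);
```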
In this case, someone else has done all the work for you. They've gathered data about flowers and built a library with a set of functions that hands you the data in an easy-to-understand format. This library, sadly, does not exist (not yet), but there are some that do.
Let's take another scenario. Say you’re looking to build a visualization of Major League Baseball statistics. You can't find a library to give you the data but you do see everything you’re looking for at mlb.com. If the data is online and your web browser can show it, shouldn't you be able to get the data? Passing data from one application (like a web application) to another is something that comes up again and again in software engineering. A means for doing this is an API or “application programming interface”: a means by which two computer programs can talk to each other. Now that you know this, you might decide to search online for “MLB API”. Unfortunately, mlb.com does not provide its data via an API. In this case you would have to load the raw source of the website itself and manually search for the data you’re looking for. While possible, this solution is much less desirable given the considerable time required to read through the HTML source as well as program algorithms for parsing it.
The goal of these notes is to give you an overview of techniques, ranging from the more difficult manual parsing of data, to the parsing of standardized formats, to the use of an API designed specifically for JavaScript itself. Each means of getting data comes with its own set of challenges. The ease of using a JavaScript library is dependent on the existence of clear documentation and examples. But in just about all cases, if you can find your data in a format designed for a computer (spreadsheets, XML, JSON, etc.), you'll be able to save some time in the day for a nice walk outside.
The data exchange format that all of this week's examples focus on is called JSON (pronounced like the name Jason), which stands for JavaScript Object Notation. Its design was based on the syntax for objects in the JavaScript programming language (and is most commonly used to pass data between web applications) but has become rather ubiquitous and language-agnostic. Working with it in JavaScript is incredibly convenient.
All JSON data comes in one of two forms: an object or an array. The syntax for both is identical to the syntax you see in JavaScript itself.
Let's take a look at a JSON object first. A JSON object is identical to a JavaScript object (without functions). It’s a collection of variables, each with a name and a value (a "name/value pair"). Each name is encoded as a string enclosed in quotes; this is just about the only difference. For example, the following is JSON data describing a person:
This is how it might look if you typed it into your code directly (the quotes around the names are no longer necessary).
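As a minimal sketch of that difference, the snippet below parses a small piece of JSON text with the built-in JSON.parse() and compares it with the same object written as a literal. The person and the values are placeholders chosen only for illustration.

```javascript
// JSON text: every name is a string in double quotes.
// (The person and values are placeholders.)
const jsonText = '{ "name": "Olympia", "age": 3, "height": 96.5 }';

// JSON.parse() turns the text into an ordinary JavaScript object.
const fromJson = JSON.parse(jsonText);

// The same object typed directly into code: quotes around names are optional.
const person = { name: 'Olympia', age: 3, height: 96.5 };

console.log(fromJson.name);               // Olympia
console.log(fromJson.age === person.age); // true
```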
An object can contain, as part of itself, another object. Below, the value of “brother” is an object containing two name/value pairs.
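A nested object is reached by chaining property names. Here is a small sketch, again with placeholder names and ages:

```javascript
// Placeholder data: the value of "brother" is itself an object.
const person = JSON.parse(`{
  "name": "Olympia",
  "age": 3,
  "brother": { "name": "Elias", "age": 6 }
}`);

// Chain property names with dots to reach into the nested object.
const brotherName = person.brother.name;
const brotherAge = person.brother.age;
console.log(`${person.name}'s brother is ${brotherName}, age ${brotherAge}`);
```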
To compare with a data format like XML, the preceding JSON data would look like the following (for simplicity I'm avoiding the use of XML attributes).
Multiple JSON objects can appear in the data as an array. A JSON array is simply a list of values (primitives or objects). The syntax is identical to JavaScript syntax. Here is a simple JSON array of integers:
You might find an array as part of an object. Below the value of “favorite colors” is an array of strings.
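Both cases can be sketched in a few lines. The numbers and colors are placeholder values; the one thing worth noticing is that a name containing a space, like "favorite colors", has to be read with bracket notation rather than a dot.

```javascript
// A JSON array of integers parses into a regular JavaScript array.
const numbers = JSON.parse('[1, 7, 8, 9, 10, 13, 15]');
console.log(numbers.length); // 7

// An array as the value of a name inside an object (placeholder data).
const person = JSON.parse(`{
  "name": "Olympia",
  "favorite colors": ["purple", "blue", "pink"]
}`);

// The name contains a space, so use bracket notation instead of a dot.
const colors = person['favorite colors'];
for (const color of colors) {
  console.log(color);
}
```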
A great place to find a selection of JSON data sources to play with is corpora, a GitHub repository maintained by Darius Kazemi. For example, here’s a JSON file containing information about birds in Antarctica.
Now that I've covered the syntax of JSON, I can look at using the data in JavaScript and p5.js. The first step is simply loading the data with loadJSON(). loadJSON() can be called in preload() or used with a callback. I'm using callbacks in just about all my examples so let's follow that syntax here.
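A minimal sketch of the callback approach might look like this. It assumes it runs inside a p5.js sketch (which supplies loadJSON() and calls setup() for you); the file name and the birdData variable are placeholders of mine.

```javascript
// p5.js sketch. loadJSON() is provided by p5.js; the file name is a
// placeholder for wherever your JSON file lives.
let birdData = null;

function setup() {
  // Start loading; p5.js calls gotData() once the file has arrived.
  loadJSON('birds_antarctica.json', gotData);
}

// Callback: p5.js passes the parsed JSON in as the argument.
function gotData(data) {
  birdData = data;
  console.log(data);
}
```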
The data from the JSON file is passed into the argument data in the gotData callback. Then it becomes a bit of detective work. How is the data structured: a single object? An array of objects? An object full of arrays of objects? Let’s look at a snippet from the birds of Antarctica.
If the JSON file is loaded into the variable data, the way you access that data is no different than if you had said:
For example, if you wanted to display the description and link it to the source you would say:
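Here is one way that could be sketched in plain JavaScript, assuming the file's top-level names are description and source as the text suggests. The sample object is trimmed down, the URL is a placeholder, and describeSource() is a helper name I made up.

```javascript
// Trimmed-down sample with the same top-level names as the birds file.
// The URL is a placeholder, not the file's actual source link.
const data = {
  description: 'Birds of Antarctica, grouped by family',
  source: 'https://example.com/birds-of-antarctica',
  birds: [],
};

// Build an HTML link whose text is the description and whose target is the source.
function describeSource(json) {
  return `<a href="${json.source}">${json.description}</a>`;
}

console.log(describeSource(data));
```

In a p5.js sketch, the same two properties could be handed to createA() to put the link on the page.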
And since birds is an array of objects, you can use a for loop just the way you always do with arrays. Each element of the array is an object itself with properties that can be accessed like family and members (which is also an array!).
Here’s what this looks like:
Birds of Antarctica, grouped by family
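The nested loop described above can be sketched without p5.js as follows. The sample is a small hand-written object shaped like the birds data (a birds array whose elements have family and members); listBirds() is a helper name of my own.

```javascript
// Small sample shaped like the birds data described above; the entries
// are only there to give the loops something to print.
const data = {
  birds: [
    { family: 'Albatrosses', members: ['Wandering albatross', 'Grey-headed albatross'] },
    { family: 'Penguins', members: ['Emperor penguin', 'King penguin', 'Chinstrap penguin'] },
  ],
};

// Collect one line per family, followed by an indented line per member.
function listBirds(json) {
  const lines = [];
  for (let i = 0; i < json.birds.length; i++) {
    const group = json.birds[i]; // each element is an object
    lines.push(group.family);
    for (let j = 0; j < group.members.length; j++) {
      lines.push('  ' + group.members[j]); // members is an array too
    }
  }
  return lines;
}

console.log(listBirds(data).join('\n'));
```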
What makes something an API versus just some data you found, and what are some pitfalls you might run into when using an API?
An API (Application Programming Interface) is an interface through which one application can access the services of another. These can come in many forms. Openweathermap.org is an API that offers its data in JSON, XML, and HTML formats. The key element that makes this service an API is exactly that offer; openweathermap.org's sole purpose in life is to offer you its data. And not just offer it, but allow you to query it for specific data in a specific format. Let's look at a short list of sample queries.
One thing to note about openweathermap.org is that it does not require that you tell the API any information about yourself. You simply send a request to a URL and get the data back. Other APIs, however, require you to sign up and obtain an access token. The New York Times API is one such example. Before you can make a request, you'll need to visit The New York Times Developer site and request an API key. Once you have that key, you can store it in your code as a string.
You also need to know what the URL is for the API itself. This information is documented for you on the developer site, but here it is for simplicity:
Finally, you have to tell the API what it is you are looking for. This is done with a “query string,” a sequence of name/value pairs describing the parameters of the query, joined with ampersands. This functions similarly to how you pass arguments to a function. If you wanted to search for the term "JavaScript" from a search() function you might say:
Here, the API acts as the function call, and you send it the arguments via the query string. Here is a simple example asking for a list of the oldest articles that contain the term "JavaScript" (the oldest of which turns out to be May 12th, 1852).
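As a rough sketch of how such a request URL is assembled, the snippet below joins name/value pairs into a query string. The endpoint URL and the parameter names (q, sort, api-key) are my assumptions and should be checked against the Times' documentation; the key is a placeholder and buildQuery() is a helper I made up.

```javascript
// Assumptions to check against the Times' documentation: the endpoint URL
// and the parameter names q, sort, and api-key. The key is a placeholder.
const apiKey = 'YOUR_API_KEY';
const baseUrl = 'https://api.nytimes.com/svc/search/v2/articlesearch.json';

// Join name/value pairs as name=value, separated by ampersands.
// Each value is percent-encoded so spaces and punctuation are URL-safe.
function buildQuery(params) {
  return Object.entries(params)
    .map(([name, value]) => `${name}=${encodeURIComponent(value)}`)
    .join('&');
}

const query = buildQuery({ q: 'JavaScript', sort: 'oldest', 'api-key': apiKey });
const url = `${baseUrl}?${query}`;
console.log(url);
```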
This isn't just guesswork. Figuring out how to put together a query string requires reading through the API's documentation. For The New York Times, it’s all outlined on the Times' developer website. Once you have your query you can join all the pieces together and pass it to loadJSON(). Here is a tiny example that simply displays the most recent headline.
Some APIs require a deeper level of authentication beyond an API access key. Twitter, for example, uses an authentication protocol known as “OAuth” to provide access to its data. Writing an OAuth application requires more than just passing a string into a request. There are some examples this week that use server-side programming in Node to perform the authentication.
Certain characters are invalid in URLs. For example, let’s say you were querying Wordnik for the words “bath towel”. You would have to say bath%20towel. You could do this yourself with a regex or use URI encoding with encodeURI(). Here is more documentation and an example below.
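A short sketch of both approaches, using the built-in encodeURI(). The base URL is a made-up placeholder, not Wordnik's actual address.

```javascript
// The base URL here is a made-up placeholder.
const base = 'https://example.com/words/';
const phrase = 'bath towel';

// encodeURI() replaces characters that are not allowed in a URL,
// such as the space, with percent escapes.
const encoded = encodeURI(phrase);
console.log(encoded);        // bath%20towel
console.log(base + encoded); // https://example.com/words/bath%20towel

// Doing it by hand with a regex handles spaces only.
const byHand = phrase.replace(/ /g, '%20');
console.log(byHand === encoded); // true
```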
encodeURI() does not encode the following characters: ; , / ? : @ & = + $ #. This is as it should be since these are used in URLs to mean certain things. However, if you wanted to have a $ or / as part of some text you are passing into a key/value pair you would want to encode these characters. For this, encodeURIComponent() can be used.
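The difference is easiest to see side by side. In this sketch the sample text and the parameter names q and limit are placeholders of mine.

```javascript
const text = 'rock & roll $5/each';

// encodeURI() leaves URL punctuation such as & $ / untouched.
console.log(encodeURI(text));          // rock%20&%20roll%20$5/each

// encodeURIComponent() escapes those characters too.
console.log(encodeURIComponent(text)); // rock%20%26%20roll%20%245%2Feach

// Encode each value separately, then assemble the query string yourself.
const query = 'q=' + encodeURIComponent(text) + '&limit=' + encodeURIComponent('10');
console.log(query); // q=rock%20%26%20roll%20%245%2Feach&limit=10
```

Encoding each value on its own keeps the ampersands and equals signs that separate the pairs intact.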
Coming soon.