Unstructured data can contain a huge amount of useful information. However, because it’s complex to process and analyze, practitioners often avoid venturing outside their comfort zone—that is, outside of structured datasets—to mine these unstructured goldmines.
Natural language processing (NLP) is a field within machine learning that focuses on using tools, techniques, and algorithms to process and understand natural language data, such as text and speech, which is inherently unstructured. In this article, we’re going to look at some of the basics of NLP and at what ML teams can do with DeepLearning4J.
There are two important components in NLP: natural language understanding (NLU) and natural language generation (NLG).
Natural language understanding enables the machine to understand and analyze human language by extracting metadata from content such as keywords, concepts, emotions, relations, and semantic roles. It’s mainly used in business applications to help businesses understand customer problems in both spoken and written form. NLU involves tasks like named entity recognition, sentiment analysis, and relation extraction.
Meanwhile, natural language generation acts as a translator that converts computerized data into natural language representation. NLG involves tasks like text summarization, report generation, and conversational responses.
In short, NLU is the process of reading and interpreting language whereas NLG is the process of writing or generating language. The former produces non-linguistic outputs using natural language inputs while the latter constructs natural language outputs from non-linguistic inputs.
Building an NLP pipeline for ML applications involves several steps which we’ll summarize below.
Sentence segmentation is the first step in the NLP pipeline, and it’s used to break paragraphs into individual sentences. Consider this paragraph:
“Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.”
Sentence segmentation would produce the following results:

1. “Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks.”
2. “It is seen as a part of artificial intelligence.”
3. “Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.”
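As a sketch of how this step can be implemented on the JVM, here’s sentence segmentation using only the JDK’s built-in `BreakIterator` (not a DL4J API; class and method names here are illustrative):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSegmenter {
    // Split a paragraph into sentences using the JDK's locale-aware BreakIterator.
    public static List<String> segment(String paragraph) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(paragraph);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each boundary pair delimits one sentence; trim trailing whitespace.
            sentences.add(paragraph.substring(start, end).trim());
        }
        return sentences;
    }
}
```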
Word tokenization is used to break a sentence into separate words, or tokens. For example, “It is seen as a part of artificial intelligence.” becomes:

“It”, “is”, “seen”, “as”, “a”, “part”, “of”, “artificial”, “intelligence”, “.”
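A naive illustration of this in plain Java, splitting on whitespace and separating trailing punctuation (a deliberately simplified sketch; real tokenizers handle many more edge cases):

```java
import java.util.Arrays;
import java.util.List;

public class WordTokenizer {
    // Pad punctuation with a space, then split on runs of whitespace.
    public static List<String> tokenize(String sentence) {
        return Arrays.asList(sentence
                .replaceAll("([.,!?;:])", " $1")
                .trim()
                .split("\\s+"));
    }
}
```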
Stemming is used to normalize words into their base or root form by stripping suffixes. For example, a stemmer reduces “intelligence”, “intelligent”, and “intelligently” to the common stem “intellig”. The problem with stemming is that the stem it produces is sometimes not a meaningful word.
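To make that pitfall concrete, here’s a toy suffix-stripping stemmer (a deliberately simplified sketch with a hypothetical rule list; production stemmers such as the Porter stemmer apply carefully ordered rule sets):

```java
public class NaiveStemmer {
    // Strip the first matching suffix; real stemmers use ordered, conditional rules.
    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ently", "ence", "ent", "ly"}) {
            if (w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }
}
```

Note that both “intelligence” and “intelligently” reduce to “intellig”, which is not a real word.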
Lemmatization is quite similar to stemming. It groups the different inflected forms of a word under its dictionary form, known as the “lemma”. The main difference between stemming and lemmatization is that the lemma is always a real word with a meaning. For example, “ran” is lemmatized to “run”.
In English, lots of words appear very frequently, such as “is”, “and”, “the”, “it”, and “a”. NLP pipelines flag these words as stop words, and they are often filtered out prior to any statistical analysis.
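In code, stop-word removal is usually just a set-membership filter. A minimal Java sketch (the stop-word list here is illustrative, not a standard one):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {
    // Illustrative stop-word list; real pipelines use much larger lists.
    private static final Set<String> STOP_WORDS = Set.of("is", "and", "the", "it", "a", "as", "of");

    // Remove stop words (case-insensitive) from a token list.
    public static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }
}
```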
Dependency parsing is used to identify words in a sentence that are related to one another.
POS stands for “part of speech”: nouns, verbs, adverbs, adjectives, and so on. A word’s POS tag indicates how it functions, both grammatically and in meaning, within a sentence, and a word can have more than one POS depending on the context of the sentence.
For example, although “Google” is a proper noun, it’s also commonly used as a verb.
Named entity recognition detects named entities in speech or written text, such as a person’s name, the name of a place, or the title of a movie.
Chunking collects individual tokens and groups them into larger, grammatically meaningful units, such as noun phrases.
Now that we’ve looked at the pieces of a typical NLP pipeline, let’s look at the five key phases of NLP before putting everything together in a DL4J example. These phases are:

1. Lexical analysis
2. Syntactic analysis (parsing)
3. Semantic analysis
4. Discourse integration
5. Pragmatic analysis
Eclipse Deeplearning4J (DL4J) is a suite of tools for running deep learning on the Java virtual machine. DL4J is the only framework that enables ML teams to train models from Java while interoperating with the Python ecosystem through a mix of Python execution via DL4J’s CPython bindings, model import support, and interoperability with other runtimes such as TensorFlow Java and ONNX Runtime.
Use cases for DL4J include importing and retraining models (PyTorch, TensorFlow, Keras) and deploying them in JVM microservice environments, on mobile devices, on IoT devices, and on Apache Spark. Used effectively, DL4J can be a great complement to an ML team’s Python environment, running models built in Python and deployed to, or packaged for, other environments.
Although DL4J isn’t exactly comparable to more high-level tools such as Stanford’s CoreNLP, it does include some text processing tools at its core that are useful for NLP. Let’s take a look at these in more detail.
Processing natural language involves many steps, the first of which is to iterate over your corpus to create a list of documents. These can be as short as a sentence or two or as long as an entire article.
This is done using something known as a SentenceIterator. Constructed from a String containing the path to a text file, the SentenceIterator reads the file line by line and strips the whitespace before and after each line.
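A minimal sketch using DL4J’s sentence-iterator API (the file path is a placeholder):

```java
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

// Treat each line of the file as one sentence; leading and trailing
// whitespace is stripped as the iterator advances.
String filePath = "/path/to/corpus.txt";  // placeholder path
SentenceIterator iter = new BasicLineIterator(filePath);
while (iter.hasNext()) {
    String sentence = iter.nextSentence();
    // feed the sentence into the rest of the pipeline
}
```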
The SentenceIterator encapsulates a corpus of text and organizes it. It’s responsible for feeding text piece by piece into a natural language processor and for creating a collection of strings by segmenting the corpus.
A Tokenizer further segments the text at the level of single words, or alternatively as n-grams. ClearTK provides the underlying tokenizers, as well as parts-of-speech tags and parse trees, which allow for both dependency and constituency parsing, such as that used by a recursive neural tensor network (RNTN).
A Tokenizer is created and wrapped by a TokenizerFactory. By default, tokens are words separated by spaces. The tokenization process involves some machine learning to disambiguate symbols such as the period, which both ends sentences and abbreviates words like “vs.” and “Dr.”.
Both Tokenizers and SentenceIterators work with Preprocessors to deal with anomalies in messy text like Unicode, and to render such text, say, as lowercase characters uniformly.
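Wiring this up in DL4J typically looks like the following sketch, where `DefaultTokenizerFactory` splits on whitespace and `CommonPreprocessor` lowercases each token and strips punctuation:

```java
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

// Default tokens are whitespace-separated words; the preprocessor
// normalizes each token (lowercase, punctuation stripped).
TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());
```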
Each document must be tokenized to create a vocab, the set of words that are important to that corpus. These words are stored in the vocab cache, which contains statistics about a subset of words counted in the document.
The line between significant and insignificant words is fluid, but the basic premise behind distinguishing the two groups is that words occurring only once or twice are hard to learn, and their presence amounts to unhelpful noise.
The vocab cache stores metadata for methods including Word2vec and Bag of Words, which treat words in radically different ways. For example, Word2vec creates representations of words in the form of vectors that are hundreds of coefficients long. These help neural networks predict the likelihood of a word appearing in any given context.
Word2vec is configured through a builder that sets hyperparameters such as the vector dimensionality, the context window size, and the minimum word frequency.
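Here’s a sketch based on DL4J’s Word2Vec builder; the hyperparameter values are illustrative, and `iter` and `tokenizerFactory` are the SentenceIterator and TokenizerFactory described above:

```java
import org.deeplearning4j.models.word2vec.Word2Vec;

Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)   // drop words seen fewer than 5 times
        .layerSize(100)        // dimensionality of the word vectors
        .windowSize(5)         // context window around each target word
        .seed(42)              // for reproducible runs
        .iterate(iter)                       // SentenceIterator over the corpus
        .tokenizerFactory(tokenizerFactory)  // Tokenizer wrapper
        .build();

vec.fit();  // train the model over the corpus
```

Once trained, methods such as `vec.wordsNearest("word", 10)` and `vec.similarity("a", "b")` can be used to inspect the learned vectors.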
When word vectors are obtained, they can then be fed into a deep network for classification, prediction, sentiment analysis, and more.
Want to build your own NLP project, train and evaluate your model in the cloud, and then send it to production, all from the same place? If so, Qwak has you covered!
Qwak is the full-service machine learning platform that enables teams to take their models and transform them into well-engineered products. Our cloud-based platform removes the friction from ML development and deployment while enabling fast iterations, limitless scaling, and customizable infrastructure.
Want to find out more about how Qwak could help you deploy your ML models effectively? Get in touch for your free demo!