How we reduced our NLP error rate fourfold and achieved 94% accuracy

Nikolay Zenovkin
Published in Chatbots Magazine · 6 min read · Oct 27, 2017


At Cubic.ai we’ve developed a toolkit that lets you build conversational applications efficiently. Our current approach is based on both machine learning and templates. Here we describe how we learned to use our template engine to build machine learning models, reducing development cost and cutting the error rate roughly fourfold on noisy ASR input.

The problem

As we’ve worked with customers from different areas, we’ve had to build dialog-based interfaces from scratch every time and, as a result, write out many example utterances for every new use case. A good solution to this problem would be machine learning, but very often it’s almost impossible to obtain either labeled datasets or corpora of user utterances for a specific use case.

This scarcity of datasets makes building dialog-based interfaces more challenging than question-answering interfaces, which can be trained to answer practically any question using Wikipedia or widely available small-talk datasets. It is a primary reason why NLP frameworks like ours are still not based on machine learning.

Even Api.ai, a benchmark in NLP technology today, is still mainly a template-based engine rather than a neural network, even though one can do some training on a manually prepared dataset.

Our template configuration system, although it has several advantages such as a phonetic matcher and dialog management capabilities, was more or less similar to Api.ai’s. Dialog configuration is painful because it requires writing down all possible user phrases and utterances. We wanted our system to be capable of generalization, with deeper language understanding and better tolerance for typos and ASR noise.

This problem with datasets seemed to be unsolvable until we understood one interesting property of our templates.

The research

Our first question was: what can actually be generalized in a general “text to meaning” NLP solution? One could say that semantics can obviously be generalized. It turned out that in our case semantics is hard to generalize, because we work with small domains such as smart homes, and general-purpose embeddings usually don’t reflect the distances between words as they are used in that specific subset of the language. For example, “turn” and “make” are very distant words in GloVe, but in the smart home domain these two words are practically the same.
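
To make this concrete, here is a minimal sketch (assuming gensim and its downloadable pre-trained GloVe vectors are available; the specific vector file is just an example) of how one can check such distances:

```python
import gensim.downloader as api

# Pre-trained, general-purpose GloVe vectors (downloaded on first run)
glove = api.load("glove-wiki-gigaword-100")

# In general English usage these two verbs land far apart...
print("turn vs make:", glove.similarity("turn", "make"))

# ...even though in a smart-home domain "turn the lights red" and
# "make the lights red" mean exactly the same thing.
```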

It is, however, possible to generalize morphology. The morphology of any language, although sometimes at a stretch, is regular and structured, so generalizing over it is a feasible task for machine learning. Even algorithmic stemmers can handle it, but their results are sometimes unpredictable, and we didn’t want to depend on them. Learning the difference between affixes and roots is therefore feasible, but it is a hard task that normally requires a huge dataset.
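
For illustration, a quick sketch (assuming NLTK is installed) of the kind of unpredictability we mean with an off-the-shelf stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["dimming", "brightness", "reddish", "news", "this"]:
    print(word, "->", stemmer.stem(word))

# Regular forms ("dimming" -> "dim", "brightness" -> "bright") come out fine,
# but other words get clipped in surprising ways ("news" -> "new",
# "this" -> "thi"), which is why a learned, character-level model is
# more attractive than a fixed set of stemming rules.
```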

It is also possible to generalize over syntactic diversity. There are many ways of saying similar things with similar words, and if we’re familiar with most of the words in a sentence, we can tolerate variations in word order and, to some extent, preposition usage, at least in English. We just have to train our model to find the key information and sometimes be tolerant of sentence structure. When we read user logs for the first time, we realized how many people speak like Yoda, and that syntax tolerance is definitely a must-have feature for a spoken language understanding system.

So we know that we’re able to generalize over syntax and morphology, but not semantics, and that such generalization would be very helpful in building NLP solutions. The next step is to obtain a dataset for training a machine learning algorithm and to build an algorithm capable of this kind of generalization.

Datasets

As mentioned, our main problem was obtaining datasets for a specific domain before production usage. But we always had templates of user utterances. Our insight was that if templates are a reflection of user utterances and can be matched against them, then they can also be used to generate utterance datasets.
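
As an illustration, here is a minimal sketch of that idea; the template syntax, domain description, and expansion rules below are simplified stand-ins, not our actual engine:

```python
import itertools
import re

# A toy domain description: slot names and their possible values
DOMAIN = {
    "device": ["light", "lamp", "bedroom light"],
    "color": ["red", "green", "blue"],
}

def expand(template):
    """Yield (utterance, slots) pairs generated from a single template."""
    # Expand verb alternations like (turn|make|set)
    alt = re.search(r"\(([^)]+)\)", template)
    verbs = alt.group(1).split("|") if alt else [None]
    for verb in verbs:
        base = template if verb is None else re.sub(r"\([^)]+\)", verb, template, count=1)
        # Fill every combination of slot values
        slot_names = re.findall(r"\{(\w+)\}", base)
        for values in itertools.product(*(DOMAIN[s] for s in slot_names)):
            slots = dict(zip(slot_names, values))
            yield base.format(**slots), slots

for utterance, slots in expand("(turn|make|set) the {device} {color}"):
    print(utterance, slots)
# e.g. "turn the lamp red" {'device': 'lamp', 'color': 'red'}
```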

Unsurprisingly, the first model we trained on this synthetically generated dataset failed. The dataset captured useful properties of the meaningful words, but it wasn’t enough to train a good model.

Templates are not only a source of generated samples; they are also a source of information about the content of those samples. While generating samples, we know which words matter for classifying the intent and extracting slots, and we can substitute them. That was the first idea behind a set of data augmentation algorithms that helped us train a good model. This approach let us build datasets containing both positive and negative samples. For example, if a model with character-based inputs should recognize the color red, then without proper examples it’s hard for it to distinguish red, reddish and redneck. Our data augmentation can provide such examples using templates and a list of common English words.
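
A rough sketch of the negative-sample part of this idea (the word list and labeling convention here are illustrative, not our production rules):

```python
import random

# Near-miss words sharing prefixes with real slot values
COMMON_WORDS = ["reddish", "redneck", "greenhouse", "blueprint", "ready"]

def negative_samples(template, n=3):
    """Fill the {color} slot with near-miss words; label the slot as absent."""
    samples = []
    for word in random.sample(COMMON_WORDS, n):
        samples.append((template.format(color=word), {"color": None}))
    return samples

print(negative_samples("make the light {color}"))
# e.g. ("make the light reddish", {"color": None}) teaches a character-level
# model that "reddish" is not the color red.
```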

Data augmentation can do even more: it can be designed so that adding the word ‘not’ before a slot value changes the extracted value, and this works for every trained model in every domain.
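
A minimal sketch of what such negation augmentation could look like; how the negated slot is represented in the label is a choice made for this example, not a description of the production system:

```python
def negate(utterance, slots, slot_name):
    """Insert 'not' before the slot value and flip the label for that slot."""
    value = slots[slot_name]
    negated_utterance = utterance.replace(value, "not " + value, 1)
    negated_slots = dict(slots, **{slot_name: "not_" + value})
    return negated_utterance, negated_slots

print(negate("make the light red", {"color": "red"}, "color"))
# ("make the light not red", {"color": "not_red"})
```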

The model

Generating the classifier structure from a domain description is not a major issue. Every NLP domain we developed had up to a few dozen slots, each with a limited number of values, so the model should be a multiple-output classifier with one output per extracted slot.

To meet our requirements for noise and morphology tolerance, we chose a model with character-based inputs. The challenge was to build an appropriate character-level input implementation. We started with recurrent models, but they didn’t work well enough, so we switched to a convolutional architecture. Recent research shows that convolutional architectures work well on NLP tasks: for example, the ConvS2S model outperformed the standard LSTM seq2seq approach on machine translation, and our experience is consistent with that result. Another advantage of the convolutional architecture is that on modern GPUs convolution kernels run faster than recurrent networks, which can’t process input characters in parallel.
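
For illustration, here is a minimal Keras sketch of this kind of architecture; the layer sizes, character vocabulary and slot inventory are placeholders rather than our production values:

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN = 64       # utterance length in characters
CHAR_VOCAB = 80    # number of distinct characters after normalization
SLOTS = {          # output heads, generated from the domain description
    "intent": 5,
    "color": 4,    # red / green / blue / none
    "device": 6,
}

# Character-level input and convolutional encoder
chars = layers.Input(shape=(MAX_LEN,), dtype="int32", name="chars")
x = layers.Embedding(CHAR_VOCAB, 32)(chars)
x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)

# One classification head per extracted slot
outputs = {name: layers.Dense(n, activation="softmax", name=name)(x)
           for name, n in SLOTS.items()}

model = tf.keras.Model(inputs=chars, outputs=outputs)
model.compile(optimizer="adam",
              loss={name: "sparse_categorical_crossentropy" for name in SLOTS})
model.summary()
```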

The final neural network architecture showed quite good results. Our machine learning model was generated and trained from templates; those same templates could also configure the template engine, which could run alongside the neural network. Since templates have practically zero variance (and high bias), this slightly improves the neural network’s results, helping it avoid silly mistakes on simple inputs.
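
A rough sketch of how the two components can be combined (the engine and model interfaces here are hypothetical):

```python
def parse(utterance, template_engine, neural_model):
    """Trust the zero-variance template engine on exact matches,
    fall back to the neural network for everything else."""
    template_result = template_engine.match(utterance)  # None if no template fires
    if template_result is not None:
        return template_result
    return neural_model.predict(utterance)
```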

The evaluation

Since our model was trained on synthetic data, we needed evidence that it could handle a wide variety of real user utterances. Luckily, we had the Cubic Butler application and extensive logs of real user utterances in the smart home domain. We prepared a test set from these logs and evaluated our architecture on it. We expected some improvement, but it was a huge surprise to find that the error rate of the machine learning implementation dropped almost threefold compared to the template implementation.

The logs contained many samples slightly distorted by ASR noise, and many grammatically incorrect utterances that still had a clear meaning. The template engine had difficulty understanding them, but the neural network processed them easily. By combining both neural networks and templates in our final architecture, we decreased the error rate even further, achieving a final exact match score of 94%.
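
For clarity, one simple way to compute such an exact match score (a sketch, not our evaluation harness): a prediction counts as correct only if the intent and every slot equal the reference.

```python
def exact_match_score(predictions, references):
    """Fraction of utterances where the full predicted frame equals the gold frame."""
    hits = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return hits / len(references)

preds = [{"intent": "set_color", "color": "red"},
         {"intent": "set_color", "color": "blue"}]
golds = [{"intent": "set_color", "color": "red"},
         {"intent": "set_color", "color": "green"}]
print(exact_match_score(preds, golds))  # 0.5
```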

End result

We found a way to build machine learning based conversational bots without a large amount of data. We cut the error rate of our smart home control system roughly fourfold and found a way to build high-quality NLP applications for our customers. We took a significant step toward deep, human-like language understanding and reduced the cost of developing NLP applications.
