In the beginning was the voice menu.
The first voice interfaces were command-and-control (C&C) voice menus. These are familiar to anyone who has ever called a customer support line and had to talk to a computer in order to be routed to the correct department. C&C systems were a breakthrough in their day, and were well matched to the computational limits of the late 1990s and early 2000s.
However, these systems were slow and cumbersome, forcing users to navigate rigid, hierarchical menus of options. C&C systems often limited users to one word at a time, pushing them into step-by-step navigation of the full menu tree.
As compute power increased and C&C’s limitations became clear, those systems gave way to rule-based Natural Language Understanding (NLU) systems, which allowed users to speak in multi-word phrases or whole sentences. The system could recognize anything users said, provided it matched the rules programmed into it; those rules were in turn derived from conversation flows created by the system’s designers.
Rule-based systems gave users a far better experience, allowing them to issue complex commands in a single utterance rather than tediously selecting options from one-word menus. Rule-based NLUs were better at understanding people, but they required developers to create rules for every distinct way users might phrase their requests. The result was more powerful but still brittle: users always found unanticipated phrasings that didn’t match the rules.
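The brittleness is easy to see in a toy sketch. Assume a hypothetical thermostat skill whose rules are hand-written regular expressions (the rule syntax and actions here are invented for illustration, not any particular NLU toolkit):

```python
import re

# Hand-written rules: each pattern maps a known phrasing to an action.
RULES = [
    (re.compile(r"^(turn|switch) (the )?heat(ing)? (on|off)$"), "set_heating"),
    (re.compile(r"^set (the )?temperature to \d+( degrees)?$"), "set_temperature"),
]

def match_intent(utterance):
    """Return the action for the first rule the utterance matches, else None."""
    text = utterance.lower().strip()
    for pattern, action in RULES:
        if pattern.match(text):
            return action
    return None  # unanticipated phrasing: the system simply fails

print(match_intent("turn the heating off"))        # set_heating
print(match_intent("make it 21 degrees in here"))  # same intent, no rule: None
```

The second utterance means the same thing as a supported command, but because no developer anticipated that phrasing, the system cannot handle it without yet another rule.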
To combat this brittleness, NLU developers created more sophisticated rules, added heuristics for breaking ties when an utterance matched multiple rules, and developed scoring systems that let them fine-tune the results for problematic utterances. These augmentations helped, but at enormous additional cost: rule sets already tended to grow much faster than the number of distinct actions the system supported, and each augmentation piled more manual authoring on top of writing the rules themselves.
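A minimal sketch of the tie-breaking idea, assuming each rule carries a hand-tuned score (the rules, actions, and weights below are invented for illustration):

```python
import re

# Each rule carries a hand-tuned score used to break ties between matches.
SCORED_RULES = [
    (re.compile(r"\bplay\b"), "play_media", 1.0),
    (re.compile(r"\bplay .+ by .+"), "play_artist_track", 2.5),  # more specific, scores higher
]

def best_action(utterance):
    """Return the highest-scoring action among all rules that match."""
    text = utterance.lower()
    candidates = [(score, action) for pattern, action, score in SCORED_RULES
                  if pattern.search(text)]
    if not candidates:
        return None
    return max(candidates)[0:2][1]  # highest-scoring rule wins

print(best_action("play yesterday by the beatles"))  # both rules match; the specific one wins
```

Every one of those weights is another value a developer has to choose, test, and maintain by hand, which is exactly the cost the passage describes.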
Recently, the drawbacks of rule-based NLUs have led to the rise of machine learning (ML) NLU systems. These do away with explicit rules entirely, replacing them with statistical models: give a machine learning algorithm enough examples of utterances and what actions they correspond to, and let it derive a statistical model that matches new utterances to the appropriate actions.
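The workflow can be sketched in a few lines. This toy classifier scores a new utterance by word overlap with labeled examples; production systems use far richer statistical models, but the shape is the same: examples in, model out (the training data below is invented):

```python
from collections import Counter, defaultdict

# Labeled training utterances (invented examples).
TRAINING = [
    ("turn the heat off", "set_heating"),
    ("switch off the heating", "set_heating"),
    ("make it warmer in here", "set_temperature"),
    ("set the temperature to seventy", "set_temperature"),
]

def train(examples):
    """Build a bag-of-words profile per action from labeled utterances."""
    model = defaultdict(Counter)
    for utterance, action in examples:
        model[action].update(utterance.lower().split())
    return model

def classify(model, utterance):
    """Pick the action whose word profile best overlaps the utterance."""
    words = utterance.lower().split()
    scores = {action: sum(counts[w] for w in words)
              for action, counts in model.items()}
    return max(scores, key=scores.get)

model = train(TRAINING)
print(classify(model, "please switch the heat off"))  # unseen phrasing, still classified
```

Note that no rule was ever written for "please switch the heat off"; the model generalizes from the examples, and improving it means adding data rather than authoring rules.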
Machine learning systems have swept the industry like wildfire and are transforming how NLU systems are built. The ML pipeline—data collection, annotation, and model training—is rapidly supplanting the labor-intensive process of creating and testing rules.
Machine learning NLU systems are inherently robust to new phrasings of the same actions. They have much better scaling properties than rule-based systems, relative to the number of actions they need to understand. Once an ML pipeline is established, the statistical models can be updated much more rapidly than rule-based NLUs. And even better, the stages of the ML workflow can be partly or fully automated, greatly reducing cost and time-to-market.
The tradeoff is data: machine learning demands large quantities of training data, and without it the trained models are not robust enough for real-world applications. The industry is beginning to realize that data sets are key, and that collecting and annotating high-quality data—even with partial automation—is a significant undertaking. As ML processes become widespread, the data sets become as valuable as the models they yield, if not more so. Generating a model from data is easy. Gathering good data is hard.
For those reasons, Voicebox has invested heavily in building a powerful, efficient ML pipeline. Our data scientists have made patent-pending breakthroughs in the data collection and annotation steps, allowing Voicebox to build larger, higher-quality data sets than our competitors, with less time and labor.
Machine learning shows tremendous promise for realizing Voicebox’s goal of Conversational Voice AI. Voicebox has already seen excellent results on several real-world projects, and we expect that trend to continue for years to come.