Evaluating Voice Interaction Pipelines at the Edge

Smruthi Sridhar
Matthew Tolentino, University of Washington Tacoma


With the releases of Alexa Voice Services and Google Home, voice-driven interactive computing has quickly become commonplace. Voice-interactive applications incorporate multiple components, including complex speech recognition and translation algorithms, natural language understanding and generation capabilities, and custom compute functions commonly referred to as skills. Voice-driven interactive systems are composed of software pipelines built from these components. These pipelines are typically resource intensive and must execute quickly to maintain dialogue-consistent latency; consequently, voice interaction pipelines are usually computed in the cloud. In many cases, however, cloud connectivity may not be practical, which requires that these pipelines be executed at the edge. In this paper, we evaluate the feasibility of pushing voice interaction pipelines to resource-constrained edge devices. Driven by the goal of enabling voice-driven interfaces for first responders during emergencies when connectivity to the cloud is impractical, we characterize the end-to-end performance of a complete open source voice interaction pipeline for four different configurations ranging from entirely cloud-based to completely edge-based. We then identify and evaluate several optimizations, such as caching and customized acoustic models, that enable voice-driven interaction pipelines to be fully executed on computationally weak edge devices at lower response latencies than when using high-performance cloud resources.