Continual Inference: A new paradigm for efficient online processing with deep neural networks

19 June 2023

Robots often rely on visual information to understand and interact with their environment, and they need to process visual data quickly to make decisions in real time. For example, a robot working in a dynamic environment needs to identify and track objects, recognize obstacles, interact with humans, and interpret visual cues to navigate and manipulate objects effectively. Fast online processing is crucial for robots to perceive their surroundings and react promptly to changes, as well as for fluid human-robot interaction.

At the same time, the physical design and mobility of robots impose significant constraints on processing capability and energy consumption. Deploying the most accurate perception algorithms is therefore often impractical on the limited hardware available. Instead, we must carefully balance computational complexity against predictive performance to find trade-offs that optimize the robot's overall functionality.

While there is "no free lunch", opportunities for waste reduction do exist. In processing spatio-temporal data, particularly in tasks such as human activity recognition, OpenDR researchers identified substantial redundant computation in prior state-of-the-art methods: when a stream is processed with a sliding window, most of each window is recomputed from scratch for every new frame. To address this, they introduced Continual Inference Networks, a family of deep neural network architectures that reuse intermediate results between time steps. In many cases, reformulating a prior method as a Continual Inference Network reduces computational demands during stream processing by more than tenfold. This significant improvement makes unprecedented predictive accuracy attainable while maintaining low computational requirements.
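The core idea can be illustrated with a toy example (a sketch for intuition only, not OpenDR code, with hypothetical names): a temporal convolution over a stream can be computed either by reprocessing the full sliding window for every new frame, or "continually" by caching the last few frames and computing only the newest output step.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = rng.standard_normal(3)   # temporal kernel, size k = 3
stream = rng.standard_normal(10)  # incoming frames, arriving one at a time

# Clip-based processing: redo the convolution over the whole window
# for every new frame, costing O(k) per output *per window position*.
def clip_based(frames, kernel):
    k = len(kernel)
    window = frames[-k:]                         # the k most recent frames
    return float(np.dot(window, kernel[::-1]))   # newest convolution output

# Continual processing: keep a small state buffer of the last k frames;
# each new frame triggers exactly one O(k) update, with no recomputation.
class ContinualConv1d:
    def __init__(self, kernel):
        self.kernel = kernel[::-1]               # flip once up front
        self.buffer = np.zeros(len(kernel))      # cached recent frames

    def forward_step(self, x):
        self.buffer = np.roll(self.buffer, -1)   # drop oldest frame
        self.buffer[-1] = x                      # append newest frame
        return float(np.dot(self.buffer, self.kernel))

co_conv = ContinualConv1d(kernel)
outputs = [co_conv.forward_step(x) for x in stream]

# Once the buffer is full (after k - 1 steps), both formulations agree:
for t in range(2, len(stream)):
    assert np.isclose(outputs[t], clip_based(stream[:t + 1], kernel))
```

For a single convolution the saving is modest, but in a deep network every layer's sliding-window recomputation is replaced by one cached step, which is where the order-of-magnitude reductions come from.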

The OpenDR toolkit contains multiple efficient yet high-performing Continual Inference Networks, including the CoX3DLearner for video-based human activity recognition (see fig. 1), the CoSTGCNLearner for skeleton-based human action recognition, and the CoTransEncLearner for feature-based recognition. These reformulate the temporal processing of 3D-CNN, ST-GCN, and Transformer Encoder models, respectively, as Continual Inference Networks.

Fig. 1. Stream processing with the OpenDR CoX3DLearner (demo code available online) for real-time human activity recognition on a GPU laptop.

Authored by Lukas Hedegaard, Negar Heidari, and Alexandros Iosifidis

Aarhus University, Denmark