Social robotics is a broad, interdisciplinary research field focused on developing robotic agents capable of interacting effectively with humans. It tackles the challenges of human-robot interaction in dynamic, complex environments where robots coexist and collaborate with people. The domain presents numerous hurdles, such as multi-modal perception and action planning.
This thesis addresses tasks essential to building successful social robots, with a particular emphasis on their acoustic capabilities. Specifically, the research explores Sound Source Localization (SSL) in static and active contexts, along with perceptually driven robot navigation. To facilitate this exploration, a versatile and robust acoustic simulation environment is introduced. This platform allows for the virtual recording of signals in reverberant spaces, with features to model different microphone arrays, room layouts, and sound source arrangements.
Leveraging cutting-edge sound rendering libraries, this simulation environment acts as a comprehensive testing ground for training and evaluating novel auditory algorithms. Given that deep learning has revolutionized robotics advancements but demands vast amounts of data, often challenging or costly to collect physically, simulation becomes essential for scalable data generation.
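To illustrate what such virtual recording involves in its simplest form, the sketch below (Python/NumPy; all names are hypothetical, and only free-field propagation is modeled, whereas the actual simulator uses sound rendering libraries to capture reverberation) delays and attenuates a dry source signal for each microphone of an array:

```python
import numpy as np

def simulate_array_recording(signal, fs, src_pos, mic_positions, c=343.0):
    """Free-field virtual recording: delay and attenuate a dry source
    signal for each microphone of an array (no reverberation)."""
    src_pos = np.asarray(src_pos, dtype=float)
    recordings = []
    for mic in np.asarray(mic_positions, dtype=float):
        dist = np.linalg.norm(src_pos - mic)
        delay = int(round(dist / c * fs))        # propagation delay in samples
        gain = 1.0 / max(dist, 1e-3)             # 1/r spherical attenuation
        rec = np.zeros(len(signal) + delay)
        rec[delay:] = gain * signal
        recordings.append(rec)
    n = max(len(r) for r in recordings)          # pad to a common length
    return np.stack([np.pad(r, (0, n - len(r))) for r in recordings])

# Example: a 2-microphone array with a 20 cm baseline; the source is
# closer to the second microphone, so its recording starts earlier.
np.random.seed(0)
fs = 16000
sig = np.random.randn(fs)                        # 1 s of dry "speech"
mics = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)]
out = simulate_array_recording(sig, fs, (1.0, 0.0, 0.0), mics)
```

The inter-microphone delay this produces is exactly the time-difference-of-arrival cue that SSL methods exploit; a full simulator adds wall reflections on top of this direct path.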
Within this simulated framework, the focus is on auditory perception as a foundational skill, where SSL plays a critical role for social robots. SSL entails identifying the location of active speakers and builds on a long line of prior research. The thesis explores this problem from multiple angles and introduces a set of deep learning-based methods to overcome its primary challenges. After addressing the single-source case, an advanced multi-source localization model is proposed; both models are rigorously evaluated through extensive experiments under diverse acoustic conditions.
In real-world social scenarios, robots seldom remain static, introducing new complexities to the localization issue. To tackle this, the SSL approach is extended to dynamic settings, explicitly considering robot motion. A novel method is introduced to aggregate predictions from the static multi-source localizer over time, utilizing robot movement to enhance speaker position estimations. The simulator is expanded to generate complete trajectories in virtual environments, enabling effective training and evaluation of this dynamic localization model.
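To make the aggregation idea concrete, here is a toy sketch (hypothetical names; the method in the thesis is learned, not a hand-built histogram) that re-expresses per-frame robot-frame direction-of-arrival (DOA) estimates in the world frame using odometry yaw, so that evidence about a static speaker accumulates consistently as the robot moves:

```python
import numpy as np

def aggregate_doa_over_time(doa_frames, robot_yaws, n_bins=72):
    """Accumulate per-frame DOA estimates (robot frame, radians) into a
    world-frame angular histogram, compensating for robot rotation."""
    hist = np.zeros(n_bins)
    for doas, yaw in zip(doa_frames, robot_yaws):
        for doa in doas:
            world = (doa + yaw) % (2 * np.pi)        # robot -> world frame
            bin_idx = int(round(world / (2 * np.pi) * n_bins)) % n_bins
            hist[bin_idx] += 1.0
    return hist / max(hist.sum(), 1e-9)              # normalized belief

# A static speaker at 90 degrees in the world: as the robot turns, the
# robot-frame DOA changes, but the aggregated belief stays in one bin.
yaws = np.deg2rad([0.0, 10.0, 20.0, 30.0])           # odometry yaw per frame
frames = [[np.deg2rad(90.0) - y] for y in yaws]      # perceived DOAs
belief = aggregate_doa_over_time(frames, yaws)
```

The same principle, fusing motion information with per-frame localizer outputs, lets the dynamic model sharpen speaker position estimates beyond what any single static prediction provides.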
Transitioning from perception to action, the thesis explores modern Deep Reinforcement Learning (DRL) techniques applied to robot navigation, specifically to improve auditory perception through movement. While recent Automatic Speech Recognition (ASR) models excel at transcribing human speech accurately, they may struggle in reverberant environments. To address this issue, a perceptually motivated navigation task is introduced, where a robot learns to position itself to minimize speech recognition errors based solely on audio captured by its microphone array.
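One plausible way to express such a perceptually driven objective as a reward signal (a sketch only; the thesis's actual reward formulation may differ, and the ASR model producing the transcript is treated as a black box) is the negative word error rate of the recognized speech, with a small penalty discouraging needless movement:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution
    return d[-1][-1] / max(len(ref), 1)

def step_reward(reference, asr_transcript, move_cost=0.01):
    """Per-step DRL reward: better recognition -> higher reward."""
    return -wer(reference, asr_transcript) - move_cost
```

Under such a reward, a policy that moves the robot out of highly reverberant spots (where the ASR transcript degrades) earns more than one that stays put, which is exactly the behavior the navigation task aims to induce.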
This final contribution builds on the previous work in acoustic simulation and SSL, integrating perception and action to advance embodied auditory intelligence in robots. The thesis thus examines how perception and action can be combined to enhance the capabilities of social robots.
