Keynote Speakers

We are pleased to announce the Keynote speakers for WASPAA 2021.

Learning from Natural Supervision

Carl Vondrick, Columbia University


Self-supervised learning has emerged as one of the most exciting directions in machine learning today. Rather than relying on labeled data, this paradigm aims to learn excellent representations of audio and images through prediction, often between modalities or into the future. In this talk, I will discuss both some limitations and some opportunities in this framework. For example, not everything is predictable, such as the outcome of a dice roll or the exact words a person will say. We will introduce a new type of geometric representation for capturing this uncertainty, and discuss its broader applications in both audio and visual understanding.


Carl Vondrick is on the computer science faculty at Columbia University. His research group studies computer vision and machine learning. His research is supported by the NSF, DARPA, Amazon, and Toyota, and his work has been covered by national news outlets such as CNN, NPR, and the Associated Press, featured on Stephen Colbert's television show, and has even appeared in some children's magazines. He received the 2021 NSF CAREER Award, the 2021 Toyota Young Faculty Award, and the 2018 Amazon Research Award. Previously, he was a Research Scientist at Google, and he received his PhD from MIT in 2017.

How the Brain Separates and Focuses on Sounds

Barbara Shinn-Cunningham, Carnegie Mellon University


Not every sound that is audible gets processed in the brain in the same detail. Instead, your brain filters the information reaching the ears, letting through sounds that either seem inherently important (like the sudden crash of a shattering window) or are important for whatever task you are undertaking (like the question an Important Scientist poses to you at a poster session). Depending on what aspect of a sound you focus on, you recruit distinct brain networks that are shared with other sensory modalities. This talk will explain what we know about control of both spatial and non-spatial processing of sound, based on neuroimaging and behavioral studies, and discuss ways this knowledge can be utilized in developing new assistive listening devices.


Barbara Shinn-Cunningham (Brown University, Sc.B.; Massachusetts Institute of Technology, M.S. and Ph.D.) studies auditory attention, spatial perception, and scene analysis, taking into account everything from sensory coding in the cochlea to cognitive networks in cortex. She is Director of the Neuroscience Institute at Carnegie Mellon University. She has won various awards for her research, including the Helmholtz-Rayleigh Interdisciplinary Silver Medal from the Acoustical Society of America (ASA), a Vannevar Bush Fellowship, and mentorship awards from the Society for Neuroscience and ASA. She is a Fellow of ASA and the American Institute for Medical and Biological Engineers and is a lifetime National Associate of the National Research Council.


Deep learning, please meet speech compression

Jan Skoglund, Google LLC


Deep learning has radically advanced various areas of speech processing. This has also been demonstrated for data compression of speech, i.e., speech coding. The most successful speech coding systems of the last couple of years have been based on autoregressive generative models, in which a decoded bit stream of encoded parameters drives deep neural network speech synthesis. Quite recently, significant steps forward have been made using non-generative architectures. In this talk, we will discuss the properties of some of these architectures and how they can be utilized to build a practical system for, e.g., video conferencing.


Jan Skoglund leads a team at Google in San Francisco, CA, developing speech and audio signal processing components for capture, real-time communication, storage, and rendering. These components have been deployed in Google software products such as Meet and hardware products such as Chromebooks. After receiving his Ph.D. degree from Chalmers University of Technology in Sweden in 1998, he worked on low bit rate speech coding at AT&T Labs-Research, Florham Park, NJ. He was with Global IP Solutions (GIPS), San Francisco, CA, from 2000 to 2011, working on speech and audio processing, such as compression, enhancement, and echo cancellation, tailored for packet-switched networks. GIPS' audio and video technology was found in many deployments by, e.g., IBM, Google, Yahoo, WebEx, Skype, and Samsung, and was open-sourced as WebRTC after Google acquired the company in 2011. Since then, he has been part of Chrome at Google.
