Sebastian Ewert

I am a Senior Research Manager at Spotify, where I head the Audio Intelligence research area. Before joining Spotify, I was a lecturer (≈assistant professor) in Signal Processing in the School of Electronic Engineering and Computer Science at Queen Mary University of London, Centre for Digital Music (C4DM), where I was one of the founders of the Machine Listening Lab. My background is in Computer Science and Mathematics. I did a PhD in Computer Science at the University of Bonn, Germany, supervised at the Max Planck Institute for Informatics in Saarbrücken, Germany.

Current and Past PhD Students

Yin-Jyun Luo (QMUL, UK): Unsupervised Disentangled Representation Learning for Music and Audio
Supervision: Simon Dixon, Sebastian Ewert
Abstract: Disentanglement has been one of the core challenges in representation learning. The property uniquely associates the factors of variation underlying high-dimensional observations with representations in a low-dimensional manifold. This yields interpretable latent representations, which enhance our understanding of complex data and of the corresponding learning algorithms. In addition to extracting semantically meaningful representations, generative modelling enables the synthesis of novel data by manipulating the disentangled representations in the low-dimensional space, which has facilitated creative applications across numerous modalities. In music, for example, the ability to uncover the musical attributes underlying complex phrases, and to exploit these attributes to produce novel outputs, has driven the development of disentangled representations. It has been shown, however, that disentanglement is infeasible without suitable inductive biases, which are often introduced via data annotations. The scarcity of quality labels thus warrants unsupervised learning techniques that impose these biases through a variety of regularisations and architectural design choices. In this proposal, we discuss approaches to disentangled representation learning for sequential data without annotations. We first investigate dynamical variational auto-encoders (DVAE), a framework that combines latent-variable models and state-space dynamics with deep learning techniques. Although models built upon DVAE have demonstrated the capability of extracting time-variant and time-invariant latent representations, we show that this success is sensitive to model architecture and hyperparameters. We also explore potential solutions to enhance robustness, so that the framework becomes generally applicable to music in the audio domain, including instrumental and vocal recordings. Using the robust framework as a backbone, the research aims to develop tailored specifications that enable interactive applications such as singing voice editing through manipulation of the learnt representations.
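
To give a flavour of the DVAE-style separation of time-variant and time-invariant factors mentioned above, here is a minimal, hypothetical PyTorch sketch; the module layout, dimensions, and names are illustrative assumptions, not the model developed in the thesis.

    # Illustrative sketch: a sequential VAE that factorises a feature sequence
    # into one time-invariant ("global") latent and per-frame time-variant
    # latents, in the spirit of dynamical VAEs. All names/sizes are hypothetical.
    import torch
    import torch.nn as nn

    class SeqVAE(nn.Module):
        def __init__(self, feat_dim=80, z_global=16, z_local=8, hidden=128):
            super().__init__()
            self.enc_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
            # time-invariant latent from the final RNN state
            self.mu_g, self.lv_g = nn.Linear(hidden, z_global), nn.Linear(hidden, z_global)
            # time-variant latent from each frame's RNN output
            self.mu_l, self.lv_l = nn.Linear(hidden, z_local), nn.Linear(hidden, z_local)
            self.dec = nn.Sequential(nn.Linear(z_global + z_local, hidden),
                                     nn.ReLU(), nn.Linear(hidden, feat_dim))

        @staticmethod
        def reparam(mu, logvar):
            return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

        def forward(self, x):                      # x: (batch, frames, feat_dim)
            h, h_last = self.enc_rnn(x)
            zg = self.reparam(self.mu_g(h_last[-1]), self.lv_g(h_last[-1]))
            zl = self.reparam(self.mu_l(h), self.lv_l(h))
            zg = zg.unsqueeze(1).expand(-1, x.size(1), -1)   # broadcast over time
            return self.dec(torch.cat([zg, zl], dim=-1))     # reconstruction

    x = torch.randn(4, 100, 80)                    # e.g. 100 mel-spectrogram frames
    recon = SeqVAE()(x)

A full DVAE would additionally place a (typically autoregressive) prior over the per-frame latents and train with a KL-regularised reconstruction objective; the sketch omits both for brevity.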

Jiawen Huang (QMUL, UK): Lyrics Alignment and Transcription for Polyphonic Music
Supervision: Emmanouil Benetos, Sebastian Ewert
Abstract: Lyrics-to-audio alignment and transcription can provide an easy way to navigate through vocal music and enhance the listening experience. In this report, recent work on this task is reviewed. We discuss three main challenges and how previous research addresses them: 1) the lack of datasets, 2) the large gap between speech and singing voice, and 3) the presence of background music. Inspired by recent advances on related topics, possible directions are proposed to overcome these challenges.

Ishwarya Ananthabhotla (MIT, USA): Cognitive Audio: Enabling Auditory Interfaces with an Understanding of How We Hear
Supervision: Joseph A. Paradiso, Sebastian Ewert, Poppy Crum
Abstract: Over the last several decades, neuroscientists, cognitive scientists, and psychologists have made strides in understanding the complex and mysterious processes that define the interaction between our minds and the sounds around us. Some of these processes, particularly at the lowest levels of abstraction relative to a sound wave, are well understood and easy to characterize across large sections of the human population; others, however, are the sum of both intuition and observations drawn from small-scale laboratory experiments, and remain as yet poorly understood. In this thesis, I suggest that there is value in coupling insight into the workings of auditory processing, beginning with abstractions in pre-conscious processing, with new frontiers in interface design and state-of-the-art infrastructure for parsing and identifying sound objects, as a means of unlocking audio technologies that are much more immersive, naturalistic, and synergistic than those in the existing landscape. From the vantage point of today's computational models and devices that largely represent audio at the level of the digital sample, I gesture towards a world of auditory interfaces that work deeply in concert with uniquely human tendencies, allowing us to altogether re-imagine how we capture, preserve, and experience bodies of sound – towards, for example, augmented reality devices that manipulate sound objects to minimize distractions, lossy "codecs" that operate on semantic rather than time-frequency information, and soundscape design engines operating on large corpora of audio data that optimize for aesthetic or experiential outcomes instead of purely objective ones. To do this, I aim to introduce and explore a new research direction, termed "Cognitive Audio", focused on the marriage of principles governing pre-conscious auditory cognition with traditional HCI approaches to auditory interface design via explicit statistical modeling. Along the way, I consider the major roadblocks that present themselves in approaching this convergence: I ask how we might "probe" and measure a cognitive principle of interest robustly enough to inform system design, in the absence of the immediately observable biophysical phenomena that may accompany, for example, visual cognition; I also ask how we might build reliable, meaningful statistical models from the resulting data that drive compelling experiences despite inherent noise, sparsity, and generalizations made at the level of the crowd. I discuss early insights into these questions through the lens of a series of projects centered on auditory processing at different levels of abstraction. I begin with a discussion of early work focused on cognitive models of lower-level phenomena; these exercises then inform a comprehensive effort to construct general-purpose estimators of gestalt concepts in sound understanding. I then demonstrate the affordances of these estimators in the context of application systems that I construct and characterize, incorporating additional explorations on methods for personalization that sit atop these estimators. Finally, I conclude with a dialogue on the intersection between the key contributions in this dissertation and a string of major themes relevant to the audio technology and computation world today.

Daniel Stoller (QMUL, UK): Deep Learning for Music Information Retrieval in Limited Data Scenarios
Supervision: Simon Dixon, Sebastian Ewert
Abstract: While deep learning (DL) models have achieved impressive results in settings where large amounts of annotated training data are available, overfitting often degrades performance when data is more limited. To improve the generalisation of DL models, we investigate data-driven priors that exploit additional unlabelled data or labelled data from related tasks. Unlike techniques such as data augmentation, these priors are applicable across a range of machine listening tasks, since their design does not rely on problem-specific knowledge. We first consider scenarios in which parts of samples can be missing, aiming to make more datasets available for model training. In an initial study focusing on audio source separation (ASS), we exploit additionally available unlabelled music and solo source recordings by using generative adversarial networks (GANs), resulting in higher separation quality. We then present a fully adversarial framework for learning generative models with missing data. Our discriminator consists of separately trainable components that can be combined to train the generator with the same objective as in the original GAN framework. We apply our framework to image generation, image segmentation and ASS, demonstrating superior performance compared to the original GAN. To improve performance on any given MIR task, we also aim to leverage datasets which are annotated for similar tasks. We use multi-task learning (MTL) to perform singing voice detection and singing voice separation with one model, improving performance on both tasks. Furthermore, we employ meta-learning on a diverse collection of ten MIR tasks to find a weight initialisation for a universal MIR model, so that training the model on any MIR task from this initialisation quickly leads to good performance. Since our data-driven priors encode knowledge shared across tasks and datasets, they are suited for high-dimensional, end-to-end models, rather than small models relying on task-specific feature engineering, such as the fixed spectrogram representations of audio commonly used in machine listening. To this end, we propose Wave-U-Net, an adaptation of the U-Net which can perform ASS directly on the raw waveform while performing favourably compared to its spectrogram-based counterpart. Finally, we derive Seq-U-Net as a causal variant of Wave-U-Net, which performs comparably to WaveNet and the Temporal Convolutional Network (TCN) on a variety of sequence modelling tasks while being more computationally efficient.
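
As a rough illustration of the Wave-U-Net idea described above (a 1-D U-Net applied directly to the raw waveform, with skip connections between encoder and decoder), here is a heavily simplified PyTorch sketch; layer counts, kernel sizes, and resampling choices are placeholders rather than the published architecture.

    # Toy 1-D U-Net on raw audio: encoder convolutions with decimation,
    # decoder convolutions with interpolation and skip connections.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyWaveUNet(nn.Module):
        def __init__(self, channels=(1, 16, 32, 64)):
            super().__init__()
            self.down = nn.ModuleList(nn.Conv1d(cin, cout, kernel_size=15, padding=7)
                                      for cin, cout in zip(channels[:-1], channels[1:]))
            self.up = nn.ModuleList(nn.Conv1d(cout * 2, cin, kernel_size=5, padding=2)
                                    for cin, cout in zip(channels[:-1], channels[1:]))
            self.out = nn.Conv1d(channels[0], 1, kernel_size=1)   # one estimated source

        def forward(self, x):                 # x: (batch, 1, samples)
            skips = []
            for conv in self.down:            # encoder: conv + decimation
                x = torch.relu(conv(x))
                skips.append(x)
                x = x[:, :, ::2]
            for conv, skip in zip(reversed(self.up), reversed(skips)):
                x = F.interpolate(x, size=skip.shape[-1], mode='linear',
                                  align_corners=False)            # upsample
                x = torch.relu(conv(torch.cat([x, skip], dim=1))) # skip connection
            return self.out(x)

    mix = torch.randn(2, 1, 16384)            # a batch of raw-waveform excerpts
    est_source = TinyWaveUNet()(mix)          # same length as the input

The published model makes more careful choices for downsampling, upsampling and the output layers, and predicts one waveform per source; plain decimation and linear interpolation stand in here for brevity.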

Delia Fano Yela (QMUL, UK): Signal Processing and Graph Theory Techniques for Sound Source Separation
Supervision: Mark Sandler, Dan Stowell, Sebastian Ewert
Abstract: In recent years, source separation has been a central research topic in music signal processing, with applications in stereo-to-surround up-mixing, remixing tools for DJs or producers, instrument-wise equalizing, karaoke systems, and pre-processing in music analysis tasks. This PhD focuses on various applications of source separation techniques in the music production process, from removing interfering sound sources from studio and live recordings to tools for modifying the singing voice. In this context, most previous methods specialize in so-called stationary and semi-stationary interferences, such as simple broadband noise, feedback or reverberation. In practice, however, one often faces a variety of complex, non-stationary interferences, such as coughs, door slams or traffic noise. General-purpose methods applicable in this context often employ techniques based on non-negative matrix factorization. Such methods use a dictionary of spectral templates that is computed from available training data for each interference class. A major problem here is that the training material often differs substantially in spectral and temporal properties from the noise found in a given recording, and thus such methods often fail to properly model the sound source and therefore fail to produce separation results of high or even acceptable quality. A major goal of this PhD is to explore and develop conceptually novel source separation methods that go beyond dictionary-based state-of-the-art methods and yield results of high quality even in difficult scenarios.
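
As a point of reference for the dictionary-based baseline the abstract contrasts against, the sketch below runs a generic NMF with Kullback-Leibler multiplicative updates in which the interference templates are held fixed and only the music part of the dictionary is adapted; sizes and names are made up for illustration, and this is not the method developed in the PhD.

    # Generic NMF-based interference reduction sketch (NumPy only).
    import numpy as np

    rng = np.random.default_rng(0)
    F_bins, T_frames, K_noise, K_music = 513, 200, 10, 30

    V = np.abs(rng.standard_normal((F_bins, T_frames)))       # magnitude spectrogram
    W_noise = np.abs(rng.standard_normal((F_bins, K_noise)))  # fixed templates (from training data)
    W_music = np.abs(rng.standard_normal((F_bins, K_music)))  # adapted to the recording
    H = np.abs(rng.standard_normal((K_noise + K_music, T_frames)))

    eps = 1e-9
    for _ in range(100):                                       # multiplicative updates (KL divergence)
        W = np.hstack([W_noise, W_music])
        V_hat = W @ H + eps
        H *= (W.T @ (V / V_hat)) / (W.T @ np.ones_like(V) + eps)
        V_hat = W @ H + eps
        num = (V / V_hat) @ H[K_noise:].T
        den = np.ones_like(V) @ H[K_noise:].T + eps
        W_music *= num / den                                   # only the music part is updated

    # Wiener-style soft mask built from the music components, applied to the mixture
    music_mag = (W_music @ H[K_noise:]) / (np.hstack([W_noise, W_music]) @ H + eps) * V

In practice the noise templates would be learnt from example recordings of the interference class, and the resulting soft mask would be applied to the complex spectrogram before inverting the STFT.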

Siying Wang (QMUL, UK): Computational Methods for the Alignment and Score-Informed Transcription of Piano Music
Supervision: Simon Dixon, Sebastian Ewert
Abstract: This thesis is concerned with computational methods for the alignment and score-informed transcription of piano music. Firstly, several methods are proposed to improve alignment robustness and accuracy when various versions of one piece of music show complex differences with respect to acoustic conditions or musical interpretation. Secondly, score-to-performance alignment is applied to enable score-informed transcription. Although music alignment methods have considerably improved in accuracy in recent years, the task remains challenging. The research in this thesis aims to improve robustness in cases where there are substantial differences between versions and state-of-the-art methods may fail to identify a correct alignment. The thesis first exploits the availability of multiple versions of the piece to be aligned. By processing these jointly, the alignment process can be stabilised by exploiting additional examples of how a section might be interpreted or which acoustic conditions may arise. Two methods are proposed, progressive alignment and profile HMM, both adapted from the multiple biological sequence alignment task. Experiments demonstrate that these methods can indeed improve alignment accuracy and robustness over comparable pairwise methods. Secondly, the thesis presents a score-to-performance alignment method that improves robustness in cases where some musical voices, such as the melody, are played asynchronously to others – a stylistic device used in musical expression. The asynchronies between the melody and the accompaniment are handled by treating the voices as separate timelines in a multi-dimensional variant of dynamic time warping (DTW). The method measurably improves the alignment accuracy for pieces with asynchronous voices and preserves the accuracy otherwise. Once an accurate alignment between a score and an audio recording is available, the score information can be exploited as prior knowledge in automatic music transcription (AMT) for scenarios where a score is available, such as music tutoring. Score-informed dictionary learning is used to learn, for each pitch, a spectral pattern that describes the energy distribution of the associated notes in the recording. More precisely, the dictionary learning process in non-negative matrix factorization (NMF) is constrained using the aligned score. By adapting the dictionary to a given recording in this way, the proposed method improves accuracy over the state of the art.
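
The alignment methods above build on dynamic time warping; the following bare-bones NumPy sketch aligns two chroma sequences (e.g. a rendered score and a recording) via an accumulated cost matrix and backtracking. The multiple-version and voice-aware variants developed in the thesis are considerably more involved, and all names here are illustrative.

    # Minimal DTW alignment between two feature sequences (NumPy only).
    import numpy as np

    def dtw_path(X, Y):
        """Align feature sequences X (d, N) and Y (d, M) using cosine distance."""
        Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-9)
        Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-9)
        C = 1.0 - Xn.T @ Yn                     # pairwise cost matrix (N, M)
        N, M = C.shape
        D = np.full((N + 1, M + 1), np.inf)     # accumulated cost with padding row/column
        D[0, 0] = 0.0
        for i in range(1, N + 1):
            for j in range(1, M + 1):
                D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        path, (i, j) = [], (N, M)
        while (i, j) != (1, 1):                 # backtrack from the end
            path.append((i - 1, j - 1))
            steps = {(i - 1, j): D[i - 1, j], (i, j - 1): D[i, j - 1],
                     (i - 1, j - 1): D[i - 1, j - 1]}
            i, j = min(steps, key=steps.get)
        path.append((0, 0))
        return path[::-1]

    rng = np.random.default_rng(1)
    score_chroma = np.abs(rng.standard_normal((12, 80)))    # e.g. from a rendered MIDI score
    audio_chroma = np.abs(rng.standard_normal((12, 120)))   # e.g. from the recording
    alignment = dtw_path(score_chroma, audio_chroma)        # list of (score frame, audio frame) pairs

In practice the cost matrix would be computed from features of the actual score and recording, with tuned step weights and constraints; the thesis extends this basic scheme to multiple versions and to voice-wise timelines.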