I have a background in digital signal processing (DSP) and have worked on a wide range of audio tasks, including speech and audio processing, music analysis, and music information retrieval. I have a deep understanding of machine learning for audio and am particularly interested in source count estimation and audio source separation. I lead the research team at Audioshake.ai that created the best-performing music separation model.
# About me
Since 2021, I have been head of research at audioshake.ai, working on audio source separation. Before that, I was a postdoctoral researcher at Inria and the University of Montpellier, France. I received my Ph.D. (Dr.-Ing.) from the International Audio Laboratories Erlangen, a joint institution of Fraunhofer IIS and FAU Erlangen-Nürnberg, in Germany, supervised by Bernd Edler. My dissertation, "Separation and Count Estimation for Audio Sources Overlapping in Time and Frequency", can be viewed here. Before my Ph.D., I graduated in electrical engineering / communication engineering from the University of Hannover, Germany. An extended CV is available here.
# Current Research Interests
Deep learning on data hubs: I am interested in multi-modal auto-encoders that learn the relations between modalities in order to reconstruct or enhance missing or degraded data. I also work on multistore and heterogeneous heritage datasets.
User-centered AI for audio data: I want to develop new methods and tools that let users with domain knowledge obtain interpretable audio models. Furthermore, evaluation of audio processing tasks is often done purely computationally, because signal processing researchers typically lack the expertise to organize perceptual evaluation campaigns; I want to help close this gap.
Ecological machine learning: I want to play a role in reducing the carbon footprint of my work. Reducing the size of datasets speeds up training and therefore saves energy. Reducing the computational complexity of models is an active research topic, with well-studied ideas such as quantization, pruning, and compression. Inspired by current trends in differentiable signal processing, I want to convert deep models so that they can be deployed on edge devices.
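As a toy illustration of the quantization idea mentioned above, the following sketch (hypothetical values, plain Python rather than any deep learning framework) maps float weights to int8 values plus a single scale factor, shrinking storage by roughly 4x at the cost of a small rounding error:

```python
def quantize_int8(weights):
    """Quantize a list of float weights to int8 range with one scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [round(w / scale) for w in weights]  # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in q]

# hypothetical layer weights
weights = [0.5, -1.27, 0.03, 1.27]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

The reconstruction error per weight is bounded by half the scale factor, which is why quantization usually costs little accuracy while saving memory and energy.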
# Press/Media Interviews
- 02/2022 "L'intelligence artificielle au profit des stems musicaux" ("Artificial intelligence in the service of music stems"), Radio-Canada (French)
- 12/2021 "Recycling von Songs: Wie KI neue Musik generiert" ("Recycling songs: how AI generates new music"), Deutschlandfunk Kultur (German)
# Scientific Service
- Journals: IEEE Transactions on Audio, Speech, and Language Processing, IEEE Signal Processing Letters, EURASIP, Journal of Open Source Software (Topic Editor for ML-Audio)
- Conferences: ICASSP, EUSIPCO, DAFx, ISMIR
# Student Supervision
- Yeong-Seok Jeong and Jinsung Kim, Master students, Korea University, internship on "Unsupervised Music Separation" (Summer 2022).
- Michael Tänzer, PhD student, Fraunhofer IDMT (Germany), internship on audio tagging (Summer 2021).
- Lucas Mathieu, Master student, AgroParisTech (France), Master thesis "Listening to the Wild" (03/2020). Theoretical research on self-supervised learning using data from animal-borne loggers (MUSE project). Lucas was accepted as a PhD student after his Master thesis.
- Clara Jacintho and Delton Vaz, Bachelor thesis, PolyTech Montpellier (France), "Machine Learning for Audio on the Web" (12/2019). Research on web-based separation architectures; resulted in a paper submitted to the Web Audio Conference 2021.
- Wolfgang Mack, Master thesis, FAU Erlangen-Nürnberg (Germany), "Investigations on Speaker Separation using Embeddings obtained by Deep Learning" (05/2017). Wolfgang was accepted as a PhD student after his Master thesis.
- Erik Johnson, DAAD research internship, Carleton University (Canada), "Open-Source Implementation of Multichannel BSSEval in Python" (03/2014).
- Nils Werner, Master thesis, FAU Erlangen-Nürnberg (Germany), "Parameter Estimation for Time-Varying Harmonic Audio Signals" (02/2014). Nils was accepted as a PhD student after his Master thesis.
- Jeremy Hunt, DAAD research internship, Rice University (USA).
- Bufei Liu, Master research internship, Shanghai University (China), 2014.
- Aravindh Krishnamoorty, Master internship, 2014.
- Ercan Berkan, Master thesis, Bilkent University (Turkey), "Music Instrument Source Separation", 03/2013.
- Shujie Guo, Master research internship, FAU Erlangen-Nürnberg (Germany).
# Graduate Programs
- 2021: Lecture "Selected Topics in Deep Learning for Audio, Speech, and Music Processing" (Music Source Separation), University of Erlangen (Germany).
- 2020: Research internship (Master, Stage 5), PolyTech Montpellier.
- 2018, 2019: Introduction to Deep Learning, Master 2, PolyTech Montpellier.
- 2016: Reproducible Audio Research Seminar, University of Erlangen (Germany).
- 2014-2016: Multimedia Programming, high-school students, University of Erlangen (Germany).
- 2013-2016: Lab course "Statistical Methods for Audio Experiments", Master students, University of Erlangen (Germany). Course material available online.
# Invited Talks and Tutorials
- 2020: Invited talk at the "AES Virtual Symposium: Applications of Machine Learning in Audio", titled "Current Trends in Audio Source Separation". Slides (PDF) and video available.
- 2019: Invited talk at the conference "Deep learning: From theory to applications", titled "Deep learning for music unmixing". Video and slides available.
- 2019: Tutorial at EUSIPCO 2019: "Deep learning for music separation". Slides and website available.
- 2018: Tutorial at ISMIR 2018: "Music Separation with DNNs: Making It Work". Slides and website available.
# Other Resources
- sigsep.io - Open resources for music separation.
- awesome-scientific-python-audio - Curated list of Python packages for scientific research in audio.
# open-unmix Winner: PyTorch Global Hackathon 2019
Open-Unmix is a deep neural network reference implementation (PyTorch and NNabla) for music source separation, aimed at researchers, audio engineers, and artists. It provides ready-to-use models that allow users to separate pop music into four stems: vocals, drums, bass, and the remaining other instruments.
Demo separations on the MUSDB18 dataset are available.
# CountNet
CountNet is a deep learning model that estimates the number of concurrent speakers from single-channel speech mixtures. This task is a mandatory first step towards any realistic "cocktail-party" scenario. It has various audio-based applications such as blind source separation, speaker diarisation, and audio surveillance.
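As a hedged sketch of the estimation step (the function and posterior values below are illustrative, not CountNet's actual API): count estimation is commonly framed as classification over the possible speaker counts, so the estimate is simply the arg-max over per-count posteriors:

```python
def estimate_count(posteriors):
    """Return the most probable speaker count, given one posterior per count 0..N."""
    return max(range(len(posteriors)), key=lambda k: posteriors[k])

# hypothetical network output over counts 0..10 for a 3-speaker mixture
posteriors = [0.01, 0.02, 0.20, 0.55, 0.12, 0.05,
              0.02, 0.01, 0.01, 0.005, 0.005]
count = estimate_count(posteriors)  # -> 3
```

Framing the task as classification rather than regression makes it easy to report a full posterior over counts, which is useful when downstream separation needs a confidence measure.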
# musdb + museval
A Python package to parse and process the MUSDB18 dataset, the largest open-access dataset for music source separation. The tool was originally developed for the Music Separation task of the Signal Separation Evaluation Campaign (SISEC).
musdb users can quickly iterate over multi-track music datasets. In just three lines of code, a subset of MUSDB18 is automatically downloaded and parsed:

```python
import musdb

# download a sample of MUSDB18 and iterate over its tracks
mus = musdb.DB(download=True)
for track in mus:
    train(track.audio, track.targets['vocals'].audio)
```
Given a trained model, evaluation can then be performed using museval:

```python
import museval

for track in mus:
    estimates = predict(track)  # model outputs a dict of target estimates
    scores = museval.eval_mus_track(track, estimates)
    print(scores)
```
# Hackathon Projects
# DeMask 1st Place
Event: 2020 PyTorch Summer Hackathon – Collaborators: Manuel Pariente, Samuele Cornell, Michel Olvera, Jonas Haag
DeMask is an end-to-end model for enhancing speech while wearing face masks — offering a clear benefit during times when face masks are mandatory in many spaces and for workers who wear face masks on the job. Built with Asteroid, a PyTorch-based audio source separation toolkit, DeMask is trained to recognize distortions in speech created by the muffling from face masks and to adjust the speech to make it sound clearer.
# git wig Winner
Why can't we have version control for making music? In this hack, we merged git with a terminal-based music sequencer, calling it git wig. We also created a suitable, diffable sequencer format to compose music. Finally, we realized git push by bringing this feature into a hardware controller.
# DeepFandom 1st Place
Event: 2016 Music Hackday Berlin. Collaborator: Patricio López-Serrano
DeepFandom is a deep learning model that learns from SoundCloud comments and predicts which comments YOUR track could receive and where they would be positioned on the waveform.
# Magiclock
Magiclock is a macOS application that uses haptic feedback (via the Taptic Engine™) to let you feel the MIDI clock beat from your Magic Trackpad.
# Other Software Contributions
- stempeg - read/write of STEMS multistream audio.
- trackswitch.js - A versatile web-based audio player for presenting scientific results.
- webMUSHRA - MUSHRA-compliant experiment software based on the Web Audio API.
- norbert - Painless Wiener filters for audio separation.
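The Wiener-filter idea behind norbert can be sketched for a single time-frequency bin (a minimal illustration, not norbert's API): each source receives a soft mask proportional to its estimated power, so the masks sum to one and the separated bins sum back to the mixture:

```python
def wiener_masks(power_estimates, eps=1e-12):
    """Soft masks for one TF bin: each source's share of the total power."""
    total = sum(power_estimates) + eps
    return [p / total for p in power_estimates]

# hypothetical power estimates for vocals vs. accompaniment in one bin
masks = wiener_masks([9.0, 1.0])

# applying the masks to the mixture bin conserves the mixture
mixture_bin = 2.0
separated = [m * mixture_bin for m in masks]
```

This per-bin view is the core of the approach; a real implementation applies it over full complex spectrograms and channels.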
The MUSDB18 dataset comprises 150 full-length music tracks (~10 h total duration) of different genres along with their isolated drums, bass, vocals, and other stems. It is currently the largest publicly available dataset for music separation and serves as a benchmark for music separation tasks.
The dataset contains a simulated cocktail-party environment of 0 to 10 speakers, mixed at 0 dB SNR from random utterances of different speakers from the LibriSpeech corpus. All recordings are 5 s long, and all speakers are active for most of the recording. For each unique recording, we provide the audio wave file (16 bit, 16 kHz, mono) and an annotation JSON file with the same name as the recording.
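A minimal sketch of consuming such an annotation (the JSON schema below is an assumption for illustration, not the published format): given per-speaker activity intervals, a sweep over the interval endpoints yields the maximum number of concurrent speakers:

```python
import json

# hypothetical annotation: per-speaker activity intervals in seconds
annotation = json.loads("""
{"speakers": [
  {"id": "spk1", "activity": [[0.0, 4.5]]},
  {"id": "spk2", "activity": [[1.0, 5.0]]},
  {"id": "spk3", "activity": [[2.0, 3.0]]}
]}
""")

def max_concurrent(speakers):
    """Sweep over interval endpooints is avoided: sort start/end events and track overlap."""
    events = []
    for spk in speakers:
        for start, end in spk["activity"]:
            events.append((start, 1))   # speaker becomes active
            events.append((end, -1))    # speaker becomes inactive
    events.sort()  # ends sort before starts at equal times, so touching intervals don't overlap
    count = best = 0
    for _, delta in events:
        count += delta
        best = max(best, count)
    return best
```

For the example above, `max_concurrent` reports 3 speakers active at once (between 2.0 s and 3.0 s).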
A novel dataset for musical instruments: we recorded a violoncello, including sensor recordings that capture the finger position on the fingerboard, which is converted into an instantaneous frequency estimate. We also included professional high-speed video camera data capturing string excitations at 2000 fps. All of the data is sample-synchronized.
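The position-to-frequency conversion can be illustrated with the ideal-string approximation (a sketch under that assumption, not the project's actual calibration): the fundamental frequency scales inversely with the vibrating string length, so a finger position measured from the nut maps directly to a frequency:

```python
def finger_position_to_frequency(open_frequency, open_length, finger_position):
    """Ideal string: frequency is inversely proportional to vibrating length.

    finger_position: distance from the nut, same unit as open_length.
    """
    vibrating_length = open_length - finger_position
    return open_frequency * open_length / vibrating_length

# cello A string: ~220 Hz open, ~0.69 m vibrating length (approximate values)
f = finger_position_to_frequency(220.0, 0.69, 0.345)  # stopping at half the string
```

Stopping the string at exactly half its length doubles the frequency (one octave up), which is a handy sanity check for any sensor-based position-to-pitch pipeline.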