
Distinguishing singing sample from musical instrument tone sample

I make free music games. One that I'm currently developing, now near beta, is about singing accurately. From a short sample of a note sung at a known frequency, I'd like to be able to tell whether someone is actually singing or playing an instrument instead.

It has occurred to me that perhaps the stability of the frequency might be a clue, though singers and instrumentalists might each have vibrato. There are probably other ways to address the problem. Typically I might have 1/3 second to analyze, but I could have several samples available if that helped.

Ideas?

Comments

  • Why does it matter?

    The reason I ask is because there may be other solutions that don't require this maybe impossible step.

  • It matters because I want to have leaderboards based on how well people sing notes accurately. It's all too easy to do that with a musical instrument. I do that for testing, particularly when my voice is wearing out. :-)

  • The human voice will have an identifiable timbre, with a specific set of formants which will distinguish it from other types of instrument. You would probably need to do some kind of spectral analysis to identify it though.
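A rough sketch of that spectral analysis: take the magnitude spectrum and pick its tallest local maxima, which for a voiced sound tend to sit near the strongest resonances. This is only the mechanics; a real formant tracker would smooth the spectrum or use LPC to find the envelope, and the test signal below is three plain sinusoids at assumed /a/-like frequencies, not a real voice:

```python
import numpy as np

def spectral_peaks(signal, sr, n_peaks=3):
    """Return the frequencies of the n_peaks tallest local maxima of the
    magnitude spectrum (a crude stand-in for resonance peaks)."""
    mag = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
    peaks = np.where(is_peak)[0] + 1
    top = peaks[np.argsort(mag[peaks])[-n_peaks:]]
    freqs = np.fft.rfftfreq(len(signal), 1 / sr)
    return sorted(freqs[top])

# Toy "vowel": three sinusoids near typical /a/ formant centres (assumed).
sr = 16000
t = np.arange(sr // 2) / sr
vowel = sum(np.sin(2 * np.pi * f * t) for f in (700.0, 1200.0, 2600.0))
print(spectral_peaks(vowel, sr))
```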

  • A neural net is probably going to be the best path. It's what I'd try first. You'll also have all the supporting hardware to accelerate it on an iPhone or iPad.

  • @richardyot The formant idea is one that I wanted to check out. Do you know a good informal reference that would explain to me the difference in formant between a voice and a clarinet, violin or other instrument? Thanks.

    @NeonSilicon What would be a good way of learning about neural net? I don't think I've used anything like that. Are there libraries that one could use on iOS? Thanks.

  • @playalongkeys said:
    @richardyot The formant idea is one that I wanted to check out. Do you know a good informal reference that would explain to me the difference in formant between a voice and a clarinet, violin or other instrument? Thanks.

    I'm afraid that is what I would call a "research project" :)

  • @playalongkeys ok that makes sense.

    But you don't need to do it in real time. I would try with @NeonSilicon's approach but do it after the sample is recorded. If it is discovered to be a fake then you notify the user to behave themselves.

    The big problem is that you can't tell whether they have used a sample or auto-tuned their voice.

  • @playalongkeys said:
    @richardyot The formant idea is one that I wanted to check out. Do you know a good informal reference that would explain to me the difference in formant between a voice and a clarinet, violin or other instrument? Thanks.

    @NeonSilicon What would be a good way of learning about neural net? I don't think I've used anything like that. Are there libraries that one could use on iOS? Thanks.

    If you are new to using neural nets, check out these two videos as an introduction:

    The whole course is useful too and MIT has all the lectures up on YT.

    Apple calls their libraries Core ML. If you do a search on the developer site (developer.apple.com) or in their Developer app, you'll get all sorts of links to look at.

    You would probably train your models off of the deployment device, and that seems to be done mostly in Python now. Apple has a developer video called Convert PyTorch models to Core ML in the Developer app that has other links in it as well. It's probably not what you want to look at to learn right now, but it could get you pointed in the right direction for the types of things you'll need to learn to use the tools.

    When thinking about using formants directly, there are going to be some complications. It's not as simple as, say, the human voice versus some other instrument. Each vowel has a different formant structure, for example. For most people, the human voice behaves like a set of essentially three bandpass filters that move to form the vowels. Male and female voices have different general locations for the same vowel, and there is person-to-person variation too. Different languages and dialects have different formants and different ways of using the same formant. And to make it even more complex, trained singers learn to use formants differently, to the point of even adding a fourth formant in some cases.

    A neural net that is trained well is probably going to be using all of this formant information to do the job, but you won't know what it's doing. You still need to be aware of it, though, so you can take care not to train in any cultural or other biases. The main thing is that you are going to need a really diverse training set. Singing is pretty universal.

    I haven't looked, but there are probably projects out there that have this sort of thing in them already. You might want to look if Shazam had anything in it to discern between human voices and instruments before it starts to do the song recognition. Apple bought Shazam and is rolling out the underlying tech this year as ShazamKit.

    With regard to the singing-accurately part, I used to have a paper on pitch accuracy in trumpet players. I can't find the link right now; the paper is probably 15 or 20 years old. I came across it when doing some research on micro-tonal stuff. They grouped the players into three categories: beginners, mid-level players, and top pros. The beginners' pitch was all over the place. The mid-level players had really solid pitch accuracy and the distribution was small. The top pros' pitch was almost as spread out as the beginners', but they knew where and how to be off the pitch. It's a fun thing to think about in general, but it might also be useful in training models for something like this: training sets and samples that could grade pitch placement, but also the targeting and pulling of notes. Stuff like that.
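The "three bandpass filters" picture of vowels described above can be illustrated by running a pulse train (a stand-in for the glottal source) through three parallel two-pole resonators. The formant centres below are rough textbook values for a male /a/, assumed for illustration only:

```python
import numpy as np

def resonator(x, sr, freq, bw=100.0):
    """Two-pole bandpass resonator centred on freq with bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / sr)
    a1, a2 = 2 * r * np.cos(2 * np.pi * freq / sr), -r * r
    y = np.zeros_like(x)
    y[0], y[1] = x[0], x[1]
    for n in range(2, len(x)):
        y[n] = x[n] + a1 * y[n - 1] + a2 * y[n - 2]
    return y

sr = 16000
n = sr // 2
source = np.zeros(n)
source[:: sr // 120] = 1.0            # ~120 Hz glottal pulse train
# Rough, assumed formant centres for a male /a/: F1~730, F2~1090, F3~2440 Hz
vowel = sum(resonator(source, sr, f) for f in (730.0, 1090.0, 2440.0))
spectrum = np.abs(np.fft.rfft(vowel * np.hanning(n)))
freqs = np.fft.rfftfreq(n, 1 / sr)
print(freqs[np.argmax(spectrum)])     # strongest harmonic lands near F1
```

Moving those three centre frequencies is what turns one vowel into another, which is exactly why fixed formant thresholds are hard to rely on.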
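On the grading side, pitch placement is usually measured in cents (100 cents per semitone), which makes deviations comparable across different target notes. A minimal scoring sketch, where the tolerance curve is an arbitrary assumption:

```python
import math

def cents_off(f_sung, f_target):
    """Signed deviation of the sung pitch from the target, in cents
    (100 cents = one semitone, 1200 = one octave)."""
    return 1200 * math.log2(f_sung / f_target)

def score(f_sung, f_target, full_marks_within=5.0, zero_at=100.0):
    """Linear score: 100 inside +/-5 cents, 0 beyond a semitone
    (both limits are assumed values, not an established rubric)."""
    off = abs(cents_off(f_sung, f_target))
    if off <= full_marks_within:
        return 100.0
    return max(0.0, 100.0 * (zero_at - off) / (zero_at - full_marks_within))

print(cents_off(446.0, 440.0))   # a touch sharp of A4
print(score(446.0, 440.0))
```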

  • This is going to be tough to do -- there are sampling synths that use human voice as the source tone, so it's possible to hit a key on the keyboard, and get "the real thing."

    I'd suggest looking at the spectral content of the audio, after processing through an FFT, but even here, you've got trouble. There's the root frequency f, and then the harmonics (multiples of f). Human voices have a range of harmonic ratios that are different than (for example) a guitar -- but the tone of any given voice can vary. If anything, it's the vibrato that makes human voice sound human -- but that can be emulated too.

    My expectation is that any system you come up with can be fooled. It might be more trouble than it's worth to try and block this sort of cheating.
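The harmonic-ratio measurement described above is straightforward once f0 is known: read the FFT magnitude near each multiple of f0 and normalise by the fundamental. Whether the resulting vector actually separates voices from instruments is exactly the open question; this only shows the measurement, with an assumed neighbourhood size to absorb slight detuning:

```python
import numpy as np

def harmonic_ratios(signal, sr, f0, n_harmonics=6):
    """Amplitudes at f0, 2*f0, ... relative to the fundamental."""
    mag = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    bin_hz = sr / len(signal)
    amps = []
    for k in range(1, n_harmonics + 1):
        centre = int(round(k * f0 / bin_hz))
        # strongest bin in a small neighbourhood, to absorb rounding
        # and slight detuning
        amps.append(mag[centre - 2: centre + 3].max())
    return [a / amps[0] for a in amps]

# Sawtooth-ish test tone: harmonic amplitudes falling off as 1/k.
sr, f0 = 44100, 220.0
t = np.arange(sr // 3) / sr
tone = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 7))
print(harmonic_ratios(tone, sr, f0))
```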

  • If you ask the participant to sing lyrics or vowels it will be hard to fake with a substitute.

    “La-Di-Mah-toe Mister Ro-bo-to...”

    Making a synth do that is harder... then you can just focus on long-tone pitch detection.

  • @SecretBaseDesign said:
    This is going to be tough to do -- there are sampling synths that use human voice as the source tone, so it's possible to hit a key on the keyboard, and get "the real thing."

    I'd suggest looking at the spectral content of the audio, after processing through an FFT, but even here, you've got trouble. There's the root frequency f, and then the harmonics (multiples of f). Human voices have a range of harmonic ratios that are different than (for example) a guitar -- but the tone of any given voice can vary. If anything, it's the vibrato that makes human voice sound human -- but that can be emulated too.

    My expectation is that any system you come up with can be fooled. It might be more trouble than it's worth to try and block this sort of cheating.

    Not sure if @playalongkeys would really have to go the extra mile to block such sampling-synth efforts - I'm sure the average user won't bother 😉

    As for the spectral content, you're faced with the same problem: Finding thresholds that work for different human vs non-human voices. I agree with @NeonSilicon in that neural nets would be my choice, plus enough training to make the learned information reliable enough. There are methods to play back the supposed "sweet spot clouds" in the learned information which can help a lot in fine tuning the presentation of the input data to learn.
    I would definitely try both raw audio and amplitude spectra as learning material though, and include material with different kinds of background noises.
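Feeding a net "both raw audio and amplitude spectra" just means preparing two views of each clip. A minimal feature-preparation sketch, with arbitrary frame and FFT sizes (random noise stands in for real audio here):

```python
import numpy as np

def training_example(clip, sr, frame_len=2048, hop=512):
    """Return (raw_frames, log_spectra) for one clip -- two candidate
    input representations for a classifier."""
    starts = range(0, len(clip) - frame_len + 1, hop)
    raw = np.stack([clip[i:i + frame_len] for i in starts])
    window = np.hanning(frame_len)
    # log compression keeps the large dynamic range of audio manageable
    spectra = np.log1p(np.abs(np.fft.rfft(raw * window, axis=1)))
    return raw, spectra

sr = 16000
clip = np.random.default_rng(0).standard_normal(sr // 3)  # stand-in audio
raw, spectra = training_example(clip, sr)
print(raw.shape, spectra.shape)
```

Augmenting the clips by mixing in different background noises, as suggested above, would happen before this step.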

  • It seems to me that focusing on the amplitude envelope would be a good line of attack (pun not intended).

    I would expect the attack and decay of a human voice to be relatively difficult to reproduce with many instruments other than samplers. I don't know how true that is, but it might be something to look into.

    Another thought would be to have people submit a sample of themselves singing certain sounds as a baseline, which could then more easily be compared against. That would involve overhead of approving the initial samples, of course.
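The amplitude-envelope idea above can be measured cheaply: an RMS envelope per hop plus a crude 10%-to-90% attack-time estimate. All frame sizes and thresholds here are arbitrary assumptions:

```python
import numpy as np

def rms_envelope(signal, frame_len=512, hop=256):
    """Per-frame RMS level of the signal."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def attack_time(signal, sr, frame_len=512, hop=256):
    """Seconds from 10% to 90% of the peak RMS level (rough heuristic)."""
    env = rms_envelope(signal, frame_len, hop)
    peak = env.max()
    t10 = np.argmax(env >= 0.1 * peak)   # first frame above 10% of peak
    t90 = np.argmax(env >= 0.9 * peak)   # first frame above 90% of peak
    return (t90 - t10) * hop / sr

# Synthetic tone with a slow 100 ms linear fade-in, voice-like onset.
sr = 16000
t = np.arange(sr // 2) / sr
ramp = np.minimum(t / 0.1, 1.0)
tone = ramp * np.sin(2 * np.pi * 220 * t)
print(attack_time(tone, sr))
```

A sung note with a soft onset should show a much longer attack than, say, a plucked string, though a sampler can of course reproduce either.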

  • I appreciate all that really good feedback. The comments about the trumpet player are quite believable. That aligns with my personal experience with instrumentalists and even vocalists. Regarding autotune, I hadn't thought about that, but if someone is using autotune, I don't think they are really trying to learn to sing accurately.

    I don't really want to upload samples because of privacy concerns. There seem to be pretty strict rules about this as far as kids are concerned, and I don't want a "mature" rating. But in my previous music game, players seem to be pretty earnest about redoing songs to get higher scores, and I don't want to delegitimize their efforts here. If I did upload a short sample, I could at least manually knock people off the leaderboards who don't belong there.

    The comments that @wim made about amplitude are interesting. I had been considering using frequency deviations in that regard, in a way reflecting what @NeonSilicon said about the trumpet players, i.e. if there seems to be a controlled pattern to the deviations, perhaps it is just a good singer doing some vibrato.

    With respect to @richardyot's point, I can't really let this become a big research project, since it will probably be a free app that I'm creating in my unpaid free time.

    The neural net comments are interesting. In a way, my problem could be a very good test problem for neural nets, I suspect. I've used CoreML a small amount to detect shapes in an environment and I can see how it might apply here.

    It is true that I probably don't have to do this in real time. In fact, if I did end up uploading samples, the whole process could be taken offline, but I'd rather avoid that, and something that can be done in real time makes everything a lot simpler.
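One way to test for a "controlled pattern to the deviations": autocorrelate the pitch track's deviation from its mean and look for a strong peak at vibrato rates (roughly 4-8 Hz). Regular vibrato correlates strongly with itself one cycle later; random jitter does not. The track rate and all constants below are assumptions for illustration:

```python
import numpy as np

def vibrato_regularity(f0_track, track_rate, lo_hz=4.0, hi_hz=8.0):
    """Peak normalised autocorrelation of the pitch deviation at vibrato
    rates: near 1.0 for regular vibrato, near 0 for random jitter."""
    dev = f0_track - f0_track.mean()
    ac = np.correlate(dev, dev, mode="full")[len(dev) - 1:]
    ac = ac / ac[0]
    lo, hi = int(track_rate / hi_hz), int(track_rate / lo_hz) + 1
    return float(ac[lo:hi].max())

track_rate = 100.0                   # pitch estimates per second (assumed)
t = np.arange(300) / track_rate      # 3 s of pitch track
vibrato = 220 + 3 * np.sin(2 * np.pi * 5.5 * t)   # tidy 5.5 Hz vibrato
jitter = 220 + 3 * np.random.default_rng(1).standard_normal(300)
print(vibrato_regularity(vibrato, track_rate))
print(vibrato_regularity(jitter, track_rate))
```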

  • @playalongkeys If you consider making it a realtime process:
    https://arxiv.org/pdf/2010.02871.pdf

  • Recording a sample can still be kept local to a user's device for analysis.
