
Distinguishing singing sample from musical instrument tone sample

I make free music games. One that I'm currently developing, now near beta, is about singing accurately. From a short sample of a note sung at a known frequency, I'd like to be able to tell whether someone is actually singing or playing an instrument instead.

It has occurred to me that perhaps the stability of the frequency might be a clue, though singers and instrumentalists might each have vibrato. There are probably other ways to address the problem. Typically I might have 1/3 second to analyze, but I could have several samples available if that helped.

Ideas?

Comments

  • Why does it matter?

    The reason I ask is because there may be other solutions that don't require this maybe impossible step.

  • It matters because I want to have leaderboards based on how well people sing notes accurately. It's all too easy to do that with a musical instrument. I do that for testing, particularly when my voice is wearing out. :-)

  • The human voice will have an identifiable timbre, with a specific set of formants which will distinguish it from other types of instrument. You would probably need to do some kind of spectral analysis to identify it though.
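A rough sketch of that spectral analysis: take the magnitude spectrum and pick its tallest local maxima, which for a voiced sound tend to sit near the strongest resonances. This is only the mechanics; a real formant tracker would smooth the spectrum or use LPC to find the envelope, and the test signal below is three plain sinusoids at assumed /a/-like frequencies, not a real voice:

```python
import numpy as np

def spectral_peaks(signal, sr, n_peaks=3):
    """Return the frequencies of the n_peaks tallest local maxima of the
    magnitude spectrum (a crude stand-in for resonance peaks)."""
    mag = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
    peaks = np.where(is_peak)[0] + 1
    top = peaks[np.argsort(mag[peaks])[-n_peaks:]]
    freqs = np.fft.rfftfreq(len(signal), 1 / sr)
    return sorted(freqs[top])

# Toy "vowel": three sinusoids near typical /a/ formant centres (assumed).
sr = 16000
t = np.arange(sr // 2) / sr
vowel = sum(np.sin(2 * np.pi * f * t) for f in (700.0, 1200.0, 2600.0))
print(spectral_peaks(vowel, sr))
```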

  • A neural net is probably going to be the best path. It's what I'd try first. You'll also have all the supporting hardware to accelerate it on an iPhone or iPad.

  • @richardyot The formant idea is one that I wanted to check out. Do you know a good informal reference that would explain to me the difference in formant between a voice and a clarinet, violin or other instrument? Thanks.

    @NeonSilicon What would be a good way of learning about neural net? I don't think I've used anything like that. Are there libraries that one could use on iOS? Thanks.

  • @playalongkeys said:
    @richardyot The formant idea is one that I wanted to check out. Do you know a good informal reference that would explain to me the difference in formant between a voice and a clarinet, violin or other instrument? Thanks.

    I'm afraid that is what I would call a "research project" :)

  • @playalongkeys ok that makes sense.

    But you don't need to do it in real time. I would try with @NeonSilicon's approach but do it after the sample is recorded. If it is discovered to be a fake then you notify the user to behave themselves.

    The big problem is that you can't tell whether they have used a sample or auto-tuned their voice.

  • @playalongkeys said:
    @richardyot The formant idea is one that I wanted to check out. Do you know a good informal reference that would explain to me the difference in formant between a voice and a clarinet, violin or other instrument? Thanks.

    @NeonSilicon What would be a good way of learning about neural net? I don't think I've used anything like that. Are there libraries that one could use on iOS? Thanks.

    If you are new to using neural nets, check out these two videos as an introduction:

    The whole course is useful too and MIT has all the lectures up on YT.

    Apple calls their libraries Core ML. If you do a search on the developer site (developer.apple.com) or in their Developer app, you'll get all sorts of links to look at.

    You would probably train your models off of the deployment device, and that seems to be done mostly in Python now. Apple has a developer video called Convert PyTorch models to Core ML in the Developer app that has other links in it as well. It's probably not what you want to look at to learn right now, but it could get you pointed in the right direction for the types of things you'll need to learn to use the tools.

    When thinking about using formants directly, there are going to be some complications. It's not as simple as, say, the human voice versus some other instrument. Each vowel has a different formant structure, for example. For most people, the human voice behaves like a set of essentially three bandpass filters that move to form the vowels. Male and female voices have different general locations for the same vowel, and there is person-to-person variation too. Different languages and dialects have different formants and different ways of using the same formant. And to make it even more complex, trained singers learn to use formants differently, to the point of even adding a fourth formant in some cases.

    A neural net that is trained well is probably going to be using all of this formant information to do the job, but you won't know what it's doing. You still need to be aware of it, though, so you can take care not to train in any cultural or other biases. The main thing is that you are going to need a really diverse training set. Singing is pretty universal.

    I haven't looked, but there are probably projects out there that have this sort of thing in them already. You might want to look if Shazam had anything in it to discern between human voices and instruments before it starts to do the song recognition. Apple bought Shazam and is rolling out the underlying tech this year as ShazamKit.

    With regard to the singing-accurately part, I used to have a paper on pitch accuracy in trumpet players. I can't find the link right now; the paper is probably 15 or 20 years old. I came across it when doing some research on micro-tonal stuff. They grouped the players into three categories: beginners, mid-level players, and top pros. The beginners' pitch was all over the place. The mid-level players had really solid pitch accuracy and the distribution was small. The top pros' pitch was almost as spread out as the beginners', but they knew where and how to be off the pitch. It's a fun thing to think about in general, but it might also be useful in training models for something like this: training sets and samples that could grade pitch placement, but also the targeting and pulling of notes. Stuff like that.
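The "three bandpass filters" picture of vowels described above can be illustrated by running a pulse train (a stand-in for the glottal source) through three parallel two-pole resonators. The formant centres below are rough textbook values for a male /a/, assumed for illustration only:

```python
import numpy as np

def resonator(x, sr, freq, bw=100.0):
    """Two-pole bandpass resonator centred on freq with bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / sr)
    a1, a2 = 2 * r * np.cos(2 * np.pi * freq / sr), -r * r
    y = np.zeros_like(x)
    y[0], y[1] = x[0], x[1]
    for n in range(2, len(x)):
        y[n] = x[n] + a1 * y[n - 1] + a2 * y[n - 2]
    return y

sr = 16000
n = sr // 2
source = np.zeros(n)
source[:: sr // 120] = 1.0            # ~120 Hz glottal pulse train
# Rough, assumed formant centres for a male /a/: F1~730, F2~1090, F3~2440 Hz
vowel = sum(resonator(source, sr, f) for f in (730.0, 1090.0, 2440.0))
spectrum = np.abs(np.fft.rfft(vowel * np.hanning(n)))
freqs = np.fft.rfftfreq(n, 1 / sr)
print(freqs[np.argmax(spectrum)])     # strongest harmonic lands near F1
```

Moving those three centre frequencies is what turns one vowel into another, which is exactly why fixed formant thresholds are hard to rely on.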
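On the grading side, pitch placement is usually measured in cents (100 cents per semitone), which makes deviations comparable across different target notes. A minimal scoring sketch, where the tolerance curve is an arbitrary assumption:

```python
import math

def cents_off(f_sung, f_target):
    """Signed deviation of the sung pitch from the target, in cents
    (100 cents = one semitone, 1200 = one octave)."""
    return 1200 * math.log2(f_sung / f_target)

def score(f_sung, f_target, full_marks_within=5.0, zero_at=100.0):
    """Linear score: 100 inside +/-5 cents, 0 beyond a semitone
    (both limits are assumed values, not an established rubric)."""
    off = abs(cents_off(f_sung, f_target))
    if off <= full_marks_within:
        return 100.0
    return max(0.0, 100.0 * (zero_at - off) / (zero_at - full_marks_within))

print(cents_off(446.0, 440.0))   # a touch sharp of A4
print(score(446.0, 440.0))
```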

  • This is going to be tough to do -- there are sampling synths that use human voice as the source tone, so it's possible to hit a key on the keyboard, and get "the real thing."

    I'd suggest looking at the spectral content of the audio, after processing through an FFT, but even here, you've got trouble. There's the root frequency f, and then the harmonics (multiples of f). Human voices have a range of harmonic ratios that are different than (for example) a guitar -- but the tone of any given voice can vary. If anything, it's the vibrato that makes human voice sound human -- but that can be emulated too.

    My expectation is that any system you come up with can be fooled. It might be more trouble than it's worth to try and block this sort of cheating.
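The harmonic-ratio measurement described above is straightforward once f0 is known: read the FFT magnitude near each multiple of f0 and normalise by the fundamental. Whether the resulting vector actually separates voices from instruments is exactly the open question; this only shows the measurement, with an assumed neighbourhood size to absorb slight detuning:

```python
import numpy as np

def harmonic_ratios(signal, sr, f0, n_harmonics=6):
    """Amplitudes at f0, 2*f0, ... relative to the fundamental."""
    mag = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    bin_hz = sr / len(signal)
    amps = []
    for k in range(1, n_harmonics + 1):
        centre = int(round(k * f0 / bin_hz))
        # strongest bin in a small neighbourhood, to absorb rounding
        # and slight detuning
        amps.append(mag[centre - 2: centre + 3].max())
    return [a / amps[0] for a in amps]

# Sawtooth-ish test tone: harmonic amplitudes falling off as 1/k.
sr, f0 = 44100, 220.0
t = np.arange(sr // 3) / sr
tone = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 7))
print(harmonic_ratios(tone, sr, f0))
```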

  • If you ask the participant to sing lyrics or vowels it will be hard to fake with a substitute.

    “La-Di-Mah-toe Mister Ro-bo-to...”

    Making a synth do that is harder... then you can just focus on long-tone pitch detection.

  • @SecretBaseDesign said:
    This is going to be tough to do -- there are sampling synths that use human voice as the source tone, so it's possible to hit a key on the keyboard, and get "the real thing."

    I'd suggest looking at the spectral content of the audio, after processing through an FFT, but even here, you've got trouble. There's the root frequency f, and then the harmonics (multiples of f). Human voices have a range of harmonic ratios that are different than (for example) a guitar -- but the tone of any given voice can vary. If anything, it's the vibrato that makes human voice sound human -- but that can be emulated too.

    My expectation is that any system you come up with can be fooled. It might be more trouble than it's worth to try and block this sort of cheating.

    Not sure if @playalongkeys would really have to go the extra mile to block such sampling-synth efforts - I'm sure the average user won't bother 😉

    As for the spectral content, you're faced with the same problem: Finding thresholds that work for different human vs non-human voices. I agree with @NeonSilicon in that neural nets would be my choice, plus enough training to make the learned information reliable enough. There are methods to play back the supposed "sweet spot clouds" in the learned information which can help a lot in fine tuning the presentation of the input data to learn.
    I would definitely try both raw audio and amplitude spectra as learning material though, and include material with different kinds of background noises.
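Feeding a net "both raw audio and amplitude spectra" just means preparing two views of each clip. A minimal feature-preparation sketch, with arbitrary frame and FFT sizes (random noise stands in for real audio here):

```python
import numpy as np

def training_example(clip, sr, frame_len=2048, hop=512):
    """Return (raw_frames, log_spectra) for one clip -- two candidate
    input representations for a classifier."""
    starts = range(0, len(clip) - frame_len + 1, hop)
    raw = np.stack([clip[i:i + frame_len] for i in starts])
    window = np.hanning(frame_len)
    # log compression keeps the large dynamic range of audio manageable
    spectra = np.log1p(np.abs(np.fft.rfft(raw * window, axis=1)))
    return raw, spectra

sr = 16000
clip = np.random.default_rng(0).standard_normal(sr // 3)  # stand-in audio
raw, spectra = training_example(clip, sr)
print(raw.shape, spectra.shape)
```

Augmenting the clips by mixing in different background noises, as suggested above, would happen before this step.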

  • It seems to me that focusing on the amplitude envelope would be a good line of attack (pun not intended).

    I would expect the attack and decay of a human voice to be relatively difficult to reproduce with many instruments other than samplers. I don't know how true that is, but it might be something to look into.

    Another thought would be to have people submit a sample of themselves singing certain sounds as a baseline, which could then more easily be compared against. That would involve overhead of approving the initial samples, of course.
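The amplitude-envelope idea above can be measured cheaply: an RMS envelope per hop plus a crude 10%-to-90% attack-time estimate. All frame sizes and thresholds here are arbitrary assumptions:

```python
import numpy as np

def rms_envelope(signal, frame_len=512, hop=256):
    """Per-frame RMS level of the signal."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def attack_time(signal, sr, frame_len=512, hop=256):
    """Seconds from 10% to 90% of the peak RMS level (rough heuristic)."""
    env = rms_envelope(signal, frame_len, hop)
    peak = env.max()
    t10 = np.argmax(env >= 0.1 * peak)   # first frame above 10% of peak
    t90 = np.argmax(env >= 0.9 * peak)   # first frame above 90% of peak
    return (t90 - t10) * hop / sr

# Synthetic tone with a slow 100 ms linear fade-in, voice-like onset.
sr = 16000
t = np.arange(sr // 2) / sr
ramp = np.minimum(t / 0.1, 1.0)
tone = ramp * np.sin(2 * np.pi * 220 * t)
print(attack_time(tone, sr))
```

A sung note with a soft onset should show a much longer attack than, say, a plucked string, though a sampler can of course reproduce either.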

  • I appreciate all that really good feedback. The comments about the trumpet player are quite believable. That aligns with my personal experience with instrumentalists and even vocalists. Regarding autotune, I hadn't thought about that, but if someone is using autotune, I don't think they are really trying to learn to sing accurately.

    I don't really want to upload samples because of privacy concerns. There seem to be pretty strict rules about this as far as kids are concerned, and I don't want a "mature" rating. But in my previous music game, players seem to be pretty earnest about redoing songs to get higher scores, and I don't want to delegitimize their efforts here. If I did upload a short sample, I could at least manually knock people off the leaderboards who don't belong there.

    The comments that @wim made about amplitude are interesting. I had been considering using frequency deviations in that regard, in a way reflecting what @NeonSilicon said about the trumpet players, i.e. if there seems to be a controlled pattern to the deviations, perhaps it is just a good singer doing some vibrato.

    With respect to @richardyot's point, I can't really let this become a big research project, since it will probably be a free app that I'm creating in my unpaid free time.

    The neural net comments are interesting. In a way, my problem could be a very good test problem for neural nets, I suspect. I've used CoreML a small amount to detect shapes in an environment and I can see how it might apply here.

    It is true that I probably don't have to do this in real time. In fact, if I did end up uploading samples, the whole process could be taken offline, but I'd rather avoid that, and something that can be done in real time makes everything a lot simpler.
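One way to test for a "controlled pattern to the deviations": autocorrelate the pitch track's deviation from its mean and look for a strong peak at vibrato rates (roughly 4-8 Hz). Regular vibrato correlates strongly with itself one cycle later; random jitter does not. The track rate and all constants below are assumptions for illustration:

```python
import numpy as np

def vibrato_regularity(f0_track, track_rate, lo_hz=4.0, hi_hz=8.0):
    """Peak normalised autocorrelation of the pitch deviation at vibrato
    rates: near 1.0 for regular vibrato, near 0 for random jitter."""
    dev = f0_track - f0_track.mean()
    ac = np.correlate(dev, dev, mode="full")[len(dev) - 1:]
    ac = ac / ac[0]
    lo, hi = int(track_rate / hi_hz), int(track_rate / lo_hz) + 1
    return float(ac[lo:hi].max())

track_rate = 100.0                   # pitch estimates per second (assumed)
t = np.arange(300) / track_rate      # 3 s of pitch track
vibrato = 220 + 3 * np.sin(2 * np.pi * 5.5 * t)   # tidy 5.5 Hz vibrato
jitter = 220 + 3 * np.random.default_rng(1).standard_normal(300)
print(vibrato_regularity(vibrato, track_rate))
print(vibrato_regularity(jitter, track_rate))
```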

  • @playalongkeys If you consider making it a realtime process:
    https://arxiv.org/pdf/2010.02871.pdf

  • Recording a sample can still be kept local to a user's device for analysis.
