Running a 2,048-point FFT can be taxing for small microcontrollers; autocorrelation can provide a good alternative for many applications.
A couple of months back, I decided to buy a mandolin. It had been quite a while since I last picked up a musical instrument and I thought: "Hey, this instrument looks small and easy to play." In hindsight (the one exact science), it's not turned out to be as easy as I expected. Somewhere down the line, I became curious to see if I could find a way to recognize which notes I was playing and then put them out in MIDI format to record them on a computer.
MIDI is a standard musical instrument format widely used in musical keyboards. It's basically an easy way to send musical notes encoded as particular numbers. A list correlating musical notes to MIDI numbers can be found here.
I finally got this running a couple of weeks ago -- you can see me playing the mandolin (badly) and GarageBand recording MIDI notes in this video (you may have to bump up the volume a bit to hear it).
What follows is a dive into the murky world of FFTs (Fast Fourier Transforms), correlation, and the general head-scratching I went through to achieve this.
Figuring out instrument pitch
The fundamental problem is to find what note I'm playing. The pitch of a musical note almost always coincides with the fundamental frequency. In plucked string instruments specifically, this is always true as the first 'pluck' (the wave bounces around from the bridge to the fret) has the highest energy and then the wave begins to die out. The make and type of the instrument do matter, but for simplicity I assumed that the frequency with the highest energy content would be the fundamental frequency and hence the pitch of the note.
"Easy enough," I thought. "The best way to figure out the energy content over a set of frequencies is to use an FFT, right?" So I got a couple of .WAV recordings of a guitar note and ran them through some FFTs. There was one little problem with this though. The range of notes I wanted to measure was from E2 (82.4Hz) and G2 (87.3Hz) to around E6 (1318Hz). Based on this, our FFT bin frequency sizes should be something less than 2.5Hz to ensure we don't miss the lower frequency notes. At 2.5Hz, my frequency bins would be 0, 2.5Hz, 5Hz,... 82.5Hz (close toE2), 85Hz, 87.5Hz (Close to F2),... etc.
Since the highest frequency is around 1,300Hz, a sampling frequency of 4,000Hz will be sufficient to recognizing all notes with the desired accuracy. The problem now moves to the FFT size required. The smallest FFT size that will satisfy both of the conditions is 2,048 (this gives a bin interval of 1.9Hz). I really wanted to run my project using a small embedded processor, but running a 2,048-point FFT is taxing even for processors like the Cortex M3/M4 with optimized libraries.
There had to be a better way. One approach is to use multiple short-length FFTs at different sampling frequencies to spread around the frequency spectrum. This can work pretty well because as the notes go higher, so does the amount of frequency separation; i.e., the difference between the lower two notes is 5Hz or so, while the separation between the higher notes is over 50Hz.
After a few trials, I ended up running five 64-point FFT "chunks," which turned out to be far more manageable.
Splitting the data to short FFTs (click here to see a larger image)
Example of FFT bin frequencies vs. the actual musical notes (click here to see a larger image)
All that remains is to find the frequency with the maximum energy content between the bins. I ran a couple of simulations in Python in which I fed in a real guitar pluck and performed the short FFTs. This worked pretty well, even after adding noise, so at this point I was feeling reasonably happy.
Input data for note A4 (440Hz) (click here to see a larger image)
Short FFT outputs for note A4 (440Hz) (click here to see a larger image)
However, it turned out that I had overlooked one little thing. Despite the fact that my method was working pretty well, there was a fundamental issue with the sampling time. In the case of detecting low frequency notes, I need to wait 128 samples at 250Hz (remember we have to discard the first 64 points as we don't need negative frequencies), which comes to little over half a second. The thing is that it's possibly to play more than two notes a second (some players can achieve eight notes a second fairly easily when playing tremolos).
The problem lies not with choosing the FFT sampling frequency or the FFT size, but with the condition that we need at least 2.5Hz of frequency resolution. The FFT resolution clamps the time taken to sample to at least 0.4 seconds sine.
So this just tossed my original plan out of the door; but don’t despair because there's still hope...