The Browser Sound Engine Behind Touch Pianist

At the beginning of May 2015, I released the fun browser experiment Touch Pianist. I received a lot of questions from fellow developers about the tech used to make it tick, so here is my attempt at explaining the meat of the sound engine I created to make it possible.

Touch Pianist is an HTML5 web browser experiment using HTML5 Canvas (optionally WebGL, thanks to pixi.js) and WebAudio. It provides a visualization of all-time popular classical piano pieces, and gives you a very addictive way of performing those pieces using your computer keyboard or a touch screen. It also has iOS and Android versions at the moment, with a bigger and growing music library.

These mobile apps also use native WebViews and run HTML, JavaScript and WebAudio under the hood. The exception is audio on Android: even after all these years (the link is to the infamous issue no. 3434), audio in Android still sucks quite a bit, and Chromium's WebAudio latency is still too high for the purposes of this instrument (though still pretty damn impressive, and better than my expectations considering the possible layers of buffering involved in the mess that is Linux audio). So I had to rewrite an inferior sound engine in Java to get it to work on Android devices. We'll get back to that at the end of this document.

The Important Constraints

The website was mainly aimed at casual crowds of music-loving people who are not used to waiting more than a few seconds for a game-like experience to (down)load into their web browser. I needed a decent piano in the browser. However, a good quality sampled piano library has an uncompressed size of a whopping few gigabytes. Disregarding the possible illegality of redistributing a commercial piano sample library, even if you compress the samples to lossy formats, such a library will still amount to a few hundred megabytes.

My aim was to have a convincing sampled piano inside a browser with at most 2 to 3 megabytes, so that the site would load just like any other JavaScript and media heavy site.

Before I talk about the compromises that needed to be made to make this a reality, let’s talk about why a good piano sample library requires gigabytes worth of information to begin with. What is the big deal? Why can’t sampled piano creators sample the mere 88 keys of a good piano placed in a good room with good quality microphones, package it and call it a day?

Piano Construction 101

The piano is a relatively new instrument, about 300 years old. But the idea that made the piano possible goes back hundreds of years earlier than that; the piano is an important evolutionary step within the series of instruments that preceded it. The construction continues to evolve even today, and the sound of a 200-year-old piano is significantly different from that of the modern piano we have today.

The Italian inventor of the piano named it un cimbalo di cipresso di piano e forte ("a keyboard of cypress with soft and loud") and the name was later abbreviated to pianoforte (soft-loud), then simply to piano. The main selling point of this new instrument was that it allowed the player to comfortably control the loudness of the notes while playing. It's as if the inventor, Bartolomeo Cristofori, fulfilled a feature request long sought after by music performers and composers. He had to solve a previously unsolved mechanical problem to make it happen: how do you build a keyboard instrument in which, when you press a key, a hammer strikes a string but immediately detaches from it to let the string ring, instead of sticking to it and damping it? And how do you do this while also allowing the hammer to be actuated again in quick succession?

The predecessors of the piano had a problem with loudness. The harpsichord was quite loud, but the player had no control over the loudness of individual pitches. The clavichord, in contrast, allowed you to control the dynamics of the sound produced, but the compromises made to achieve this meant that the instrument was too quiet for bigger performances.

The pianoforte's sound dynamics control was its main selling point, and many of the advancements in the evolution of the modern piano revolved around that particular feature.

What this means for our purposes is that we should consider each note of a piano as a single instrument. A piano can be thought of as an instrument containing 88 sub-instruments, one at each key, because the timbre of a single note played soft is quite different from the same note played loud. It's not just the loudness that changes; the timbre of the whole sound (especially the attack portion) is different. This means one can't simply get away with recording a note of a piano at a single loudness level and letting the software adjust its volume based on the velocity of the input. It sounds awkward. When you play back a loud piano note at a quieter level, it doesn't sound like a softly performed piano note; it sounds like a loud piano note played back at a lower volume. Just like how you can't pass off a human scream as a whisper merely by decreasing its playback volume. This simple volume adjustment method might be passable for some cases (perhaps for a web site), but it isn't enough if you want to make a professional library aimed at recording musicians.

So companies that produce piano sample libraries record each key of a piano at different loudness levels, and play back the appropriate sample for each key based on the velocity input during performance. A single key of a piano can have up to 127 different samples for different loudness levels, but in practice the most you'll see is 16, 32 or 64 in higher-end piano sample libraries.

There are lots of other details too: some libraries record multiple samples for a single loudness level and alternate between them to provide a more natural sound (so if you press the middle C twice in a row at a loudness level of 64, the sampler won't play the same sample the second time; this prevents the "machine gun" effect in fast passages). Professional libraries also provide separate samples for the mechanical noises of pianos, to be mixed into or out of the final output sound as desired. Combine all of those and you have a few gigabytes worth of sound samples.
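To make the selection logic concrete, here is a minimal sketch of how a sampler might pick a sample given these ideas. The data layout (layers as an array of velocity layers, each holding a few alternative decoded AudioBuffers for round-robin alternation) is my own hypothetical structure, not any particular library's format:

//a sketch of velocity layer selection plus round-robin alternation
var roundRobinIndex = 0;

function pickSample(layers, velocity) {
	//velocity is a MIDI-style value in [1, 127]
	var layerIndex = Math.floor(velocity / 128 * layers.length);
	var layer = layers[Math.min(layerIndex, layers.length - 1)];

	//alternate between recorded takes to avoid the "machine gun" effect
	roundRobinIndex = (roundRobinIndex + 1) % layer.length;
	return layer[roundRobinIndex];
}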

The Touch Pianist website is obviously not aimed at recording musicians, but I still wanted a somewhat convincing and, more importantly, entertaining piano sound so that the pieces performed with it would sound reasonably good. But I wanted one with a very small size, so the download would be fast and the bandwidth costs wouldn't bankrupt me if it went viral (which it did).

So I did the math. If I limited myself to one 3-4 second sample per key (I decided against using pitch shifting to reuse single samples for multiple pitches) and used mp3 and/or ogg compression, I figured I could hit my 2-3 megabyte target. Using a single sample per key, however, meant that I wouldn't be doing velocity layering. I could do simple velocity layering by using two samples per key (one loud and one soft, doubling the package size), but I wanted this thing to work on mobile, and WebAudio decompresses the files into memory as raw audio, so this wouldn't work well in the memory constrained environment of mobile devices (my iPad 2 wouldn't handle 88x2 three second mono samples in memory, for instance; I tried and received stern memory warnings).
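For the curious, here is roughly how that math works out. This is a back-of-the-envelope sketch; the 44.1kHz mono format and ~64kbps compressed bitrate are my illustrative assumptions, not exact figures from the project:

//download size: mp3/ogg at ~64kbps is about 8 kilobytes per second of audio
var keys = 88, seconds = 3.5;
var downloadMB = keys * seconds * 8 / 1024; //~2.4MB, within the target

//decoded size in memory: WebAudio stores samples as 32 bit floats
var sampleRate = 44100;
var memoryMB = keys * seconds * sampleRate * 4 / (1024 * 1024); //~52MB

//a second velocity layer would double both figures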

I needed a way to do somewhat convincing velocity layering: a way to change the timbre of soft sounds compared to loud sounds without relying on changing the volume of samples alone. And it needed to stay on budget, both in download size and in the memory required once the samples were eventually decompressed.

Hearing is Believing

Here is an example from the three different ranges of a piano. The notes are C1, C3 and C5. For each note, first the key is pressed very forcefully, then very softly.

WARNING: Can be loud, headphone users.

If you pay attention, the difference is not only in loudness. The timbre of a pitch played softly and the timbre of the same pitch played loudly are also quite different, especially at the very beginning, the attack portion. For loud sounds, the attack of the note has a lot of high frequency content, whereas in softer sounds those high frequency vibrations are damped. This qualitative difference across dynamic levels comes from the physical construction of the piano itself. This is a huge part of what makes a piano, well, a piano.

If the difference is not that clear (soft sounds are hard to hear), here is the same example but this time the softer sounds are volume matched to the louder ones.

The sounds with softer dynamics almost have a whisper-like quality. You can't take a scream and pass it off as a whisper merely by decreasing its volume. So for a decent piano, you can't simply get away with adjusting the loudness of the sound based on input velocity; you need to alter the frequency content too. But how do we do that in realtime? And in a web browser?

Cue Lowpass Filtering

The most obvious way to kind-of simulate the meat of what is happening above is to use a lowpass filter on the sound. A lowpass filter cuts the frequencies in a signal above a chosen cutoff frequency. So if I lowpass a piece of sound at 500Hz, for instance, the filter will dampen the frequency content above 500Hz and allow lower frequencies to pass.

With a lowpass filter you can control the cutoff frequency, and that is exactly what we will do: use a single loud piano sample, and lowpass filter it during playback whenever a softer sound is requested. Removing frequencies is a lot easier than adding them; that's why we start with a loud sample, so we can carve out its already existing high frequency content when desired.

The WebAudio implementation in all major recent web browsers includes a built-in lowpass filter through the AudioContext.createBiquadFilter() interface (along with highpass, bandpass and many others). Even if it didn't, we could build one ourselves with a ScriptProcessorNode in JavaScript (a decent lowpass filter is a few lines of code once you have the filter algorithm), but it would be a lot less efficient because the built-in WebAudio nodes are implemented in native code inside the browser.

Here is some WebAudio code for implementing realtime lowpass filtering on audio nodes:

//grabbing the right AudioContext constructor for the browser
var AudioContext = window.AudioContext || window.webkitAudioContext;

/*
actually creating the audio context. there is a global limit (even across tabs) on the number of AudioContexts in the WebAudio spec and implementations for now, so you need to be careful. create one only when absolutely necessary, and always share a single global instance.
*/
var ctx = ctx || new AudioContext();

//WebAudio does not currently have a native white noise generator, so let's create one ourselves
var noise = ctx.createScriptProcessor(4096, 1, 1);

noise.onaudioprocess = function(e) {
	var output = e.outputBuffer.getChannelData(0);
	for (var i = 0; i < output.length; i++) {
		//output values should be between -1 and 1
		output[i] = (Math.random() * 2) - 1;
	}
};

//creating the lowpass filter
var cutoffFreq = 500;
var lowpassNode = ctx.createBiquadFilter();
lowpassNode.type = "lowpass";
lowpassNode.frequency.value = cutoffFreq;

//connecting the nodes together
noise.connect(lowpassNode);
lowpassNode.connect(ctx.destination); //at this point, we have sound on the speakers

//..
//when you want to alter the cutoff frequency
var newFreq = 1000;
lowpassNode.frequency.value = newFreq;

//..
//when you want to end it all
noise.disconnect();
lowpassNode.disconnect();
noise = null;
lowpassNode = null; //GC will handle the rest

In action: (WARNING: Might be loud.)

[Interactive demo: white noise through a filter with an adjustable cutoff frequency, default 500Hz]

In this example, we applied the lowpass filtering on a white noise signal to demonstrate the effect. Since WebAudio has no native white noise implementation right now, we created a script processor and implemented white noise ourselves.

I provided options for testing out some of the different filter types. Highpass is the exact opposite of lowpass: frequencies above the cutoff are passed to the output. A bandpass filter passes frequencies around the cutoff frequency and nothing else. The full list of available filter types is in this MDN article.
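As an aside, to back up the earlier claim that a decent lowpass filter is only a few lines of code, here is a minimal sketch of a one-pole lowpass implemented in a ScriptProcessorNode. The smoothing factor is the textbook RC filter approximation; the built-in biquad sounds better and costs less CPU, so this is purely illustrative:

//a minimal one-pole lowpass inside a ScriptProcessorNode, for illustration only
var diyLowpass = ctx.createScriptProcessor(4096, 1, 1);
var prevSample = 0;

diyLowpass.onaudioprocess = function(e) {
	var input = e.inputBuffer.getChannelData(0);
	var output = e.outputBuffer.getChannelData(0);

	//smoothing factor derived from the desired cutoff frequency
	var dt = 1 / ctx.sampleRate;
	var rc = 1 / (2 * Math.PI * cutoffFreq);
	var alpha = dt / (rc + dt);

	for (var i = 0; i < input.length; i++) {
		prevSample = prevSample + alpha * (input[i] - prevSample);
		output[i] = prevSample;
	}
};

//usage: noise.connect(diyLowpass); diyLowpass.connect(ctx.destination);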

To apply realtime lowpass to audio samples, instead of the noise-generating script processor you'll use an AudioBufferSourceNode that reads its data from an AudioBuffer, like the following:

//soundFile holds your mp3 data as an ArrayBuffer (e.g. fetched via XHR with responseType "arraybuffer")
var decodedBuffer;

ctx.decodeAudioData(soundFile, function(buffer) {
	decodedBuffer = buffer;
});


//when decodedBuffer is ready...
var sourceNode = ctx.createBufferSource();
sourceNode.buffer = decodedBuffer;

var lowpassNode = ctx.createBiquadFilter();
lowpassNode.type = "lowpass";
/*
the cutoff should be derived from the velocity / loudness we want for this particular note. the louder the note, the higher the cutoff frequency, since we want more high frequency content in louder notes; softer notes should sound kind of muffled, which means lower cutoff frequencies.
*/
lowpassNode.frequency.value = cutoffFreq;

//gain node for loudness adjustment
var gainNode = ctx.createGain(); //createGainNode() in older WebAudio implementations

/*
the gain should be directly proportional to how loud you want the sound to be.
lowpass filtering already decreases the gain of the source (since it's a subtractive process), so use this for additional tuning of loudness.
*/
gainNode.gain.value = targetGain;

//let's make the connections: source sound -> lowpass filter -> gain node -> speakers
sourceNode.connect(lowpassNode);
lowpassNode.connect(gainNode);
gainNode.connect(ctx.destination);

sourceNode.start(0); //noteOn(0) in older implementations

The cutoff frequency of the lowpass filter needs to be tuned by ear for each sample source. You want lower cutoff frequencies for quieter sounds and higher cutoff frequencies for louder sounds, but the mapping from velocity and pitch to cutoff frequency needs manual tuning for each case. You have to try a bunch of values and tweak until it sounds good across all pitches and velocity ranges.
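As a starting point for that tweaking, here is one plausible mapping (my own illustrative curve, not the exact one Touch Pianist uses). An exponential sweep tends to feel more natural than a linear one, since our perception of pitch and brightness is roughly logarithmic:

//map a normalized velocity in [0, 1] to a lowpass cutoff frequency
function velocityToCutoff(velocity) {
	var minFreq = 350;   //muffled, for the softest notes
	var maxFreq = 12000; //bright, for the loudest notes
	return minFreq * Math.pow(maxFreq / minFreq, velocity);
}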

To see how it works in my case, go to the Touch Pianist site, choose a piece that doesn't have a lot of variation in note velocity (I suggest Prelude 1 in C Major, which can be found in Bach Pack 1) and, instead of using the keyboard, explore the range of sound qualities you get when you click / tap at different heights on the screen. The lower you click, the quieter the note: the cutoff frequency of the filter will be lower, and the value of the gain node will also be lower. If you click high, the cutoff frequency will be high and more high frequency content will pass to the speakers.

Alternatives to Sampling

There is an arguably better way of creating physical instrument sounds than recording and preserving each and every sound an instrument might make. The technique is called physical modelling synthesis. PMS (heh) is an entirely procedural approach to creating instrument sounds: you figure out the mathematical formulas that govern the sound producing characteristics of a vibrating body, and run those instead. In the piano's case, you feed the system your key press, and the system runs the formulas to create the vibrations that make the sound happen.
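To give a taste of the procedural approach at toy scale, here is a classic physical model that fits in a few lines: Karplus-Strong plucked string synthesis, which simulates a vibrating string as a noise-filled delay line with a cheap lowpass filter in its feedback loop. This is a plucked string, nowhere near a real piano model, and it assumes the shared AudioContext ctx from the earlier snippets:

//Karplus-Strong: a toy physical model of a plucked string
function pluck(frequency, duration) {
	var sampleRate = ctx.sampleRate;
	var length = Math.floor(sampleRate * duration);
	var buffer = ctx.createBuffer(1, length, sampleRate);
	var data = buffer.getChannelData(0);

	//the delay line models the string; its length determines the pitch
	var period = Math.floor(sampleRate / frequency);
	var string = new Float32Array(period);
	for (var i = 0; i < period; i++) {
		string[i] = Math.random() * 2 - 1; //the "pluck" is a burst of noise
	}

	var pos = 0;
	for (var j = 0; j < length; j++) {
		data[j] = string[pos];
		//averaging two neighbours is a cheap lowpass in the feedback loop;
		//it damps high frequencies first, so the tone decays naturally
		string[pos] = 0.996 * 0.5 * (string[pos] + string[(pos + 1) % period]);
		pos = (pos + 1) % period;
	}

	var src = ctx.createBufferSource();
	src.buffer = buffer;
	src.connect(ctx.destination);
	src.start(0);
}

pluck(220, 2); //a two second A3 pluck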

Unfortunately for me, a full piano model of that kind requires a Ph.D. in applied mathematics and years of research in the area. Such a system also requires quite a bit of processing power.

Still, if you are interested, I know at least one company that creates a PMS piano, and in my opinion it works extremely well. It is Pianoteq by Modartt (I’m not affiliated with them, I just love their work). They have audio examples.

The PMS approach also allows you to tweak the physical properties of the instrument (from a single code base) and even lets you create instruments with plausible physics that can’t possibly be constructed in real life.

Practical Problems With Relying Solely on WebAudio

The initial plan I had involved multiple fallbacks in case the browser didn't support WebAudio (e.g. Internet Explorer): first the good old <audio> tag, and if all else failed, Flash.

WebAudio Support

After implementing the whole WebAudio version, I burned out and decided to ship Touch Pianist as fast as possible. In retrospect, I'm glad I did, because the vast majority of my visitors (more than a million people in 3 weeks so far) had the support in their browsers. I suppose the kinds of people who would be interested in this sort of thing, or who are at a computer in a place where looking at music-playing websites is appropriate, already use recent versions of Chrome, Firefox or Safari. Although caniuse.com says WebAudio should be available in some form in ~66% of the browsers out there, my analytics say that about 93% of my visitors had a browser that supported it. As it turns out, it would have been a waste of time to implement the inferior fallbacks (the <audio> tag wouldn't have filtering, and Flash would have meant increased latency and worse performance).
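If you go down the same route, detecting support up front is nearly a one-liner, so you can at least show a friendly message where a fallback engine would have gone (a minimal sketch; the notice element id is hypothetical):

//feature detection: WebAudio exists either unprefixed or webkit-prefixed
var AudioContextClass = window.AudioContext || window.webkitAudioContext;

if (!AudioContextClass) {
	//no WebAudio: show a "please use a recent browser" notice instead of a fallback engine
	document.getElementById("unsupported-notice").style.display = "block"; //hypothetical element
}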

Sample Format

The reading of samples relies on the AudioContext.decodeAudioData() method of the WebAudio API. Almost all browsers can decode mp3 data this way. The only exception I found in my case was Firefox on Windows: Firefox on OS X had no problem decoding mp3 data, but I just couldn't get Firefox on Windows to do it. So for that browser alone, I had to put separate ogg audio assets on the site, and those are handled by Windows Firefox without issues.
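Here is a sketch of one way to special-case that browser when choosing which assets to fetch. User agent sniffing is fragile and the asset path is hypothetical, but this mirrors the "separate ogg assets for Windows Firefox alone" approach:

//pick the sample format based on the one problematic browser
var ua = navigator.userAgent;
var isWindowsFirefox = ua.indexOf("Firefox") !== -1 && ua.indexOf("Windows") !== -1;
var ext = isWindowsFirefox ? ".ogg" : ".mp3";

var sampleUrl = "samples/C4" + ext; //hypothetical asset path for the middle C sample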

Puzzling Garbage Collector Behaviour of OS X Firefox

One of the more frustrating issues I encountered during development was the GC behavior of this one browser: Firefox on OS X.

The AudioBufferSourceNode interface in WebAudio is designed for fire-and-forget use. When you call .start() (or noteOn() in older WebAudio implementations; you have to support both), unless you want to do something else with the node later (stop it prematurely, for instance), you can simply forget about it, i.e. remove all references to it from your code, and the GC is supposed to release it from the audio graph before accumulated live nodes become an issue. You are deliberately forbidden from reusing these nodes (calling start() on one a second time, for instance), so you just leave each one alone after playing it. When you want to play the same sample again, you create a new AudioBufferSourceNode (they are lightweight by design) and play that new instance instead.
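In code, the fire-and-forget pattern looks roughly like this (a sketch combining the pieces from earlier; ctx is the shared AudioContext, and the fallbacks cover older WebAudio implementations):

//plays a decoded AudioBuffer once, then drops every reference to the chain
function playNote(buffer, cutoffFreq, gain) {
	var src = ctx.createBufferSource();
	src.buffer = buffer;

	var filter = ctx.createBiquadFilter();
	filter.type = "lowpass";
	filter.frequency.value = cutoffFreq;

	var amp = ctx.createGain ? ctx.createGain() : ctx.createGainNode();
	amp.gain.value = gain;

	src.connect(filter);
	filter.connect(amp);
	amp.connect(ctx.destination);

	src.start ? src.start(0) : src.noteOn(0);
	//no references are kept; once playback ends, the GC is expected to collect the whole chain
}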

This works swimmingly in every browser out there except one: Firefox on OS X. There, the GC simply does not kick in until thousands of nodes have accumulated over the course of a piece, and only when your computer is brought to its knees does the browser release hundreds, sometimes thousands, of nodes at once.

Playing a piece starts out just fine, but these nodes (one for each pitch, with their connected gain and filter nodes) pile up, CPU usage monotonically increases, the framerate starts to drop, sounds start to glitch, the computer fans spin up to max rpm, and then, after anywhere from 10 to 60 seconds, OS X Firefox decides enough is enough and releases everything. After that, life is beautiful again, but your Beethoven performance was ruined.

Firefox on Windows, on the other hand, works beautifully. It does what you expect: it creates the nodes when you ask, and releases them before inactive nodes become a burden. I have no idea why the two platforms behave so differently.

I just couldn’t find a workaround for this, so my apologies to Firefox users using a Mac.

iOS Issues

I also made iOS and then Android versions of Touch Pianist.

With iOS, WebAudio support and efficiency are amazing. Really. There is almost no latency, support is great, and there are just no issues with it. I use WebAudio inside a WebView and the embedded WebKit engine handles it amazingly well.

…except for one thing I haven't been able to figure out: a very small percentage of my users report that they get no sound at all. From the small sample I have, this mostly happens on some iPads and, more rarely, on iPhone 6 devices running the latest version of iOS. What is more confusing is that the issue is resolved by a complete OS reboot on some, but not all, devices.

I haven't been able to reproduce the issue on my own devices yet and still have no solution. I did find a part of Apple's documentation that says WebAudio sound must first be initiated by a user action (making the first sound in response to a user tap), but I don't know if this holds true for embedded WebViews. I'll include such an action in the next version to see whether it helps. In any case, this doesn't explain why the problem is fixed by an OS reboot on some devices.
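The commonly used workaround for that documented restriction, and the one I plan to try, is to play a short silent buffer from inside the first touch handler (a sketch; whether it helps inside a WebView is exactly what remains to be seen):

//"unlock" WebAudio by making the first sound in response to a user tap
function unlockAudio() {
	var silence = ctx.createBuffer(1, 1, 22050); //a single silent sample
	var src = ctx.createBufferSource();
	src.buffer = silence;
	src.connect(ctx.destination);
	src.start ? src.start(0) : src.noteOn(0);
	document.removeEventListener("touchend", unlockAudio, false);
}

document.addEventListener("touchend", unlockAudio, false);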

Still, this affects, from what I can tell, a small percentage of users (I might be wrong though), and I’m sure the reason will reveal itself soon. If you have any ideas, please let me know.

Android Issues

Audio in the fragmented Android world continues to be a bag of hurt. The audio latency in embedded Chromium is not acceptable and renders the instrument unplayable: there is a significant delay between the moment you tap the screen and the moment sound comes out of the speakers, which makes keeping correct rhythm practically impossible.

Because of that, I had to disable WebAudio inside the WebView completely and implement the sampler with SoundPool, in Java code running on a separate thread. The SoundPool implementation has different bugs and limitations on different devices: on some it works extremely well, on others it just crashes and burns, and each has different polyphony limits. And there is no way to gracefully fade out running sounds (instead of cutting them off abruptly) other than running a Timer thread that adjusts the volume of each sound stream a few times per second, which is very inefficient.

At least on the devices that support it properly, the latency is adequate.

Surprisingly, the Firefox (Fennec) engine on Android supports WebAudio with far better latency than Chrome. Really, it's almost on par with WebKit on iOS; I don't know how they managed that. Unfortunately, the general graphics / JavaScript performance is a lot worse, so that engine is unusable for my purposes.

Closing Words

Over the last few months, I got the opportunity to get quite intimate with the WebAudio API. Flexible realtime audio inside web browsers had been my dream long before WebAudio was commonplace. Two of my earlier popular projects, Otomata and Circuli, used Flash after Adobe brought realtime audio processing to web browsers with Flash Player 10. To port them to iOS, though, I had to rewrite their sound engines from scratch in native code. This is the first time I'm sharing realtime sound processing code between the web browser and mobile versions of my apps, and my feeling is that WebAudio is almost ready to be used in this fashion.

The support is there: if you are planning audio experiments in web browsers, don't be afraid to rely on WebAudio alone, as most (more than 90%) of your visitors will handle it just fine. Porting to mobile devices still presents some challenges, but they aren't the type of problems that are here to stay. On iOS I hit just one glitch (though a damning one: no sound on some devices), and Android has been pushing towards low latency audio for some time already, so I'm confident WebAudio will be a workable solution for all the mobile devices you care about in the near future.

