So, if 1 bit is fine for an electronic symphonic work, why do CDs and MP3s use 16 bits? Well, it turns out that Tristan Perich’s 1 bit music is a special case, and in general, we need more bits – but is 16 enough? How many should we really be using?
Digital audio, despite its bad rap amongst hard core vinylphiles, has the potential for perfect audio reproduction. All you have to do is use enough bits.
The bit is the atomic unit of the digital world. A bit is a binary (base two) digit, and can be either a zero (0) or a one (1). To represent numbers other than 0 or 1, multiple bits are used, the same way we humans string our decimal (base 10) digits (aka fingers) together to represent numbers other than 0-9.
The fact that we have digital audio is thanks to the work of Harry Nyquist and Claude Shannon. Nyquist is the more famous of the two, but Shannon was responsible for the concepts of information theory and entropy which enabled our modern age of computers and digital signal processing.
The concept of digitally sampling a 2D signal (such as audio) is simple – at regular intervals, the amplitude of the signal is measured, or “sampled”. The frequency with which the sampling is done is known as the sampling rate. The number of bits used to store each sample is called the bit depth.
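As a rough illustration of sampling at regular intervals, here is a minimal Python sketch. The function name `sample_sine` and its parameters are hypothetical, and a pure sine wave stands in for a real audio signal.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s, amplitude=1.0):
    """Measure a sine wave's amplitude at regular intervals (one value per sample)."""
    count = int(round(duration_s * sample_rate_hz))
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(count)
    ]

# One millisecond of a 1,000 Hz tone at the CD sampling rate.
samples = sample_sine(1000, 44100, 0.001)
print(len(samples), "samples")
```

The values here are still floating point numbers; storing each one in a fixed number of bits is a separate step.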
Samples are not perfectly instantaneous. They have a non-zero duration and a shape, which is called a sinc function, which is defined by the formula sin(pi*x)/(pi*x). A sample contains information about the period around the actual point of the sample. Because of this, digital audio can be converted back into analog signals that we can hear, and all sorts of digital signal processing is possible.
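The sinc formula above translates directly into code. The sketch below also shows one common way it is used to rebuild a value between samples (the Whittaker-Shannon interpolation formula); the function names are hypothetical, and a real converter would use a finite, windowed approximation rather than this direct sum.

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi*x)/(pi*x), with the limit value 1.0 at x = 0."""
    if x == 0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)

def reconstruct(samples, sample_rate_hz, t_seconds):
    """Estimate the signal at time t by summing one sinc per sample."""
    position = t_seconds * sample_rate_hz  # time expressed in sample periods
    return sum(s * sinc(position - n) for n, s in enumerate(samples))

samples = [0.0, 0.5, -0.25, 1.0]
print(reconstruct(samples, 4, 0.25))   # at a sample instant: very close to 0.5
print(reconstruct(samples, 4, 0.375))  # halfway between two samples
```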
A plot of the sinc function is the shape of the modern world – without it, there would be no digital anything.
Audio digitized in this manner is called Pulse Code Modulation (PCM), or “uncompressed” audio, and is the most common way of digitizing audio for digital music production.
With the ubiquity of powerful personal computers and inexpensive (or free) audio editing tools, anyone can make music of high technical quality. Before the digital revolution, DIY musicians had to rely on analog tape, which suffers from noise and generational loss.
There are certain rules to live by in the world of analog audio that in general apply to digital audio production. Keeping your levels hot, while preserving some headroom, is the best way to avoid noise.
But digital audio has the potential to achieve near-perfect representation of audio, not just mimic analog audio techniques.
The first element of PCM audio is sampling rate, which affects the frequency response of the digital audio.
Compact Discs store audio using 16 bit samples 44,100 times per second (also called “hertz,” abbreviated Hz). Most MP3s use these same specifications (though not as PCM – MP3 audio is compressed).
Nyquist’s eponymous frequency defines the maximum signal frequency that can be captured by a digital signal without aliasing. Aliasing is one frequency masquerading as a lower one. This has a variety of effects in both video and audio, producing images and sounds not present in the original signal. In audio, aliasing can sound “crunchy,” particularly in the high frequencies. The Nyquist frequency is one half the sampling rate. As such, a CD can reproduce a theoretical maximum frequency of 22,050 Hz.
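To make the masquerading concrete, here is a small sketch that computes the Nyquist frequency and the frequency an unfiltered tone would appear at after sampling. The function names are hypothetical, and the folding rule assumes an ideal sampler with no anti-aliasing filter.

```python
def nyquist_frequency(sample_rate_hz):
    """The Nyquist frequency is one half the sampling rate."""
    return sample_rate_hz / 2

def alias_frequency(freq_hz, sample_rate_hz):
    """Frequency a tone appears at after sampling, folded back below Nyquist."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(nyquist_frequency(44100))        # 22050.0
print(alias_frequency(30000, 44100))   # 14100: a 30 kHz tone poses as 14.1 kHz
print(alias_frequency(1000, 44100))    # 1000: below Nyquist, unchanged
```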
The CD sampling rate was chosen in part because the upper frequency bound of human hearing is roughly 20,000 Hz. (There were a few sampling rates considered, for different reasons, all around 44,000 Hz.) This works out okay for distribution purposes. Most music and field recordings are bandwidth limited well below 20,000 Hz (meaning they don’t contain frequencies above 20,000 Hz).
Most digital audio distribution formats have sampling rates above 40,000 Hz to be able to represent frequencies up to the theoretical limits of human hearing. However, there are two reasons why sampling rates well above 40,000 Hz make sense.
First, broad-frequency signals become distorted as they approach the Nyquist frequency, even though their frequencies remain free of aliasing. A 22,050 Hz sine wave sampled at 44,100 Hz has only two samples per cycle. Depending on the phase of the samples with the signal, the amplitude of the reconstructed signal may be affected. And the signal will tend towards a triangle wave. This effect correlates with frequency – it is lessened at lower frequencies, but is still present.
An 11,025 Hz sine wave sampled at 44,100 Hz has only four samples per cycle. A 5,512.5 Hz sine wave has only eight. These are not particularly high resolutions. A 2,756.25 Hz sine wave has sixteen, which is significantly better. To achieve high resolution over the entire human frequency range would require increasing the sampling rate by a factor of 4.
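The samples-per-cycle figures are just the sampling rate divided by the signal frequency. A minimal sketch (the function name is hypothetical):

```python
def samples_per_cycle(freq_hz, sample_rate_hz):
    """How many samples land on one full cycle of a tone."""
    return sample_rate_hz / freq_hz

for freq in (22050, 11025, 5512.5, 2756.25):
    print(f"{freq:>9} Hz at 44,100 Hz: {samples_per_cycle(freq, 44100):g} samples per cycle")

# The same tones at a quadrupled sampling rate.
for freq in (22050, 11025):
    print(f"{freq:>9} Hz at 176,400 Hz: {samples_per_cycle(freq, 176400):g} samples per cycle")
```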
Professional digital audio formats mostly use a sampling rate of 48,000 Hz. It’s an easier, rounder number to work with, so quadrupling that number results in a high resolution sampling rate of 192,000 Hz. If we only want to be able to represent frequencies up to 12,000 Hz in high resolution, we can use a sampling rate of 96,000 Hz, which has become a common professional audio sampling rate.
Most popular music is fairly frequency constrained, and does not contain frequencies high enough to notice this lack of resolution at CD sampling rates. However, it is still worthwhile to be able to represent higher frequencies, particularly for live recordings. Any real-world audio source generates not only its primary (or fundamental) frequency, but also a series of overtones at integer multiples of the primary frequency. These overtones help “color” the sound. In order to avoid aliasing, a digitizer must employ a low-pass filter to cut out all frequencies above the Nyquist frequency, which also eliminates any overtones above it. The effect of eliminating these overtones can be subtle, but it can make the audio sound artificial or “dead.”
Typically, as overtones increase in frequency, their strength decreases, so the ones at twice and four times the fundamental frequency are the most important. Increased sampling rates (96,000 or 192,000 Hz) help preserve these overtones.
The other element of audio resolution is bit depth, or quantization.
Bit depth determines the fineness with which the signal amplitude is digitized. An audio signal’s amplitude more or less correlates with its loudness or volume. The bit depth of the samples determines how many steps there are between the quietest and loudest possible sound that can be stored in that digital signal.
If there are too few steps, the signal gets “steppy” and sounds like a sequence of discrete notes rather than a smooth gradient. Chiptune and other videogame-inspired music intentionally uses audio that has been quantized to too few bits (often 8 or fewer) to achieve this sound.
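A quick way to see the steps is to quantize values to a small number of bits. This is only a sketch: `quantize` is a hypothetical helper, and it assumes a symmetric signed mapping of the range -1.0 to 1.0, which is one of several conventions in use.

```python
def quantize(value, bits):
    """Round a value in [-1.0, 1.0] to the nearest level of a signed sample
    with the given number of bits, and return it as a float."""
    max_code = 2 ** (bits - 1) - 1          # e.g. 32767 for 16 bits
    code = round(value * max_code)
    code = max(-max_code, min(max_code, code))
    return code / max_code

ramp = [i / 1000 for i in range(-1000, 1001)]
for bits in (3, 8, 16):
    levels = {quantize(v, bits) for v in ramp}
    print(f"{bits:>2} bits: {len(levels)} distinct levels across the ramp")
```

At 3 bits the smooth ramp collapses to a handful of levels; at 16 bits every one of the 2,001 input values keeps its own level.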
The first thing that comes up when talking about bit depth is dynamic range. Bit depth only secondarily has to do with dynamic range, however. The quietest and loudest points in an audio signal are not fixed representations of particular sound pressure levels because there is generally a volume control involved in the playback system. Dynamic range in a digital audio signal is important primarily in the question of how many steps are necessary to achieve a smooth gradient over the full range of human hearing.
Just to be precise, loudness is a subjective perception of audio. It is difficult to measure, unlike objective attributes like amplitude or sound pressure. However, I will use the terms interchangeably because sound pressure and loudness roughly correlate, though there are some differences between the two.
The primary difference is that while the objectively measurable aspects of audio (sound pressure) are linear, the perception of loudness is logarithmic. To tie this to the way humans hear loudness, a unit of measurement called a decibel (dB) is used.
A decibel is one tenth of a bel, which was defined to describe sound loss over telephone systems due to distance. But the bel is too large to use for normal loudness purposes.
Decibels are used for a variety of measurements, and not just for audio. The dB is a relative measurement, so you always need to have a point of comparison.
For measuring human hearing, that point of comparison is typically the lower end of the hearing range. The threshold of hearing is the point below which any sound is inaudible. This point varies from person to person, but for any given person, we define this as 0 dB.
The upper end of the hearing range is the threshold of pain – you don’t really want to go higher than that even if you can. This is also subjective, but testing shows that for most people, this is about 130 dB above the threshold of hearing.
(Digital audio is typically measured in negative values from the highest possible value, which is called 0 dBFS – or 0 dB Full Scale. The direction you measure from is not really important, so we will stick with the dB above the threshold of hearing for the purposes of this discussion.)
Digital audio sample values correspond to sound pressure measurements. The formula for calculating the difference in decibels between two sample values is:
dB = 20 * LOG10( a1 / a2 )
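The same formula in Python, as a small sketch (the function name `db_difference` is hypothetical):

```python
import math

def db_difference(a1, a2):
    """Difference in decibels between two sample values (amplitudes)."""
    return 20 * math.log10(a1 / a2)

print(round(db_difference(2, 1), 2))      # doubling the amplitude: about 6.02 dB
print(round(db_difference(32768, 1), 2))  # 2^15 steps: about 90.31 dB
```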
Audio signals are effectively alternating current (AC), so they have positive and negative values. The peak positive and negative values are the loudest, and the middle zero is the quietest.
Most common digital audio formats use 16 bits to store each sample. Many sources mistakenly declare that 16 bit audio has about 96 dB of dynamic range. Most of them also claim that this is close to the range of human hearing.
Because audio signals have positive and negative values, the dynamic range must be measured in absolute values, so a 16 bit signal only has 15 bits of dynamic range, about 90 dB. 90 dB is pretty loud, so this works just fine for mastered digital audio like CDs and MP3s.
However, if we want to cover a full 130 dB of dynamic range, we need 22 bits, or a 23 bit sample. This is not a particularly convenient size, so professional audio systems use 24 bit samples, which provide 138 dB of dynamic range.
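These figures can be checked with a few lines of Python. The sketch follows the reasoning above, that one bit of each signed sample goes to the sign, so only the remaining bits count towards dynamic range. Both function names are hypothetical.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB for each doubling of amplitude

def dynamic_range_db(sample_bits):
    """Dynamic range of a signed sample, counting only the magnitude bits."""
    return (sample_bits - 1) * DB_PER_BIT

def sample_bits_for_range(range_db):
    """Smallest signed sample size whose magnitude bits cover range_db."""
    return math.ceil(range_db / DB_PER_BIT) + 1

print(round(dynamic_range_db(16)))   # about 90 dB
print(round(dynamic_range_db(24)))   # about 138 dB
print(sample_bits_for_range(130))    # 23 bit sample (22 magnitude bits)
```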
The problem is that at lower levels, very few bits cover relatively large dB ranges. The bottom bit covers 6 dB. The bottom two bits cover 12 dB. The bottom three bits cover 18 dB. And so on.
A single bit audio signal (like Tristan Perich’s One-Bit Music or microcontroller-based noise toys), by the above formula, has 6 dB of range, which doesn’t make much sense – the peaks can be set to any sound pressure level on playback. But, with only a single bit of dynamic range, there can obviously be no steps between the peaks and zero.
Even an ideal digitizer will have some noise resulting from rounding errors in the bottom bit – at least when digitizing to integer values – which means that the bottom 6 dB is composed mostly of noise.
This can become a problem during audio production. Professional audio is often peak limited, by as much as 12 dB. If the audio is normalized to remove the headroom, the 6 dB of noise becomes 18 dB of noise. We really need more code values at the bottom of the range to provide real resolution and hide the noise.
Additionally, in real-world audio recording, we want to preserve some headroom to protect against clipped peaks in the event of audio spikes. 6, 12 or even 20 dB of headroom is helpful, depending on the scenario. 6 dB of headroom at the highest levels unfortunately eliminates using half the code values possible in the file (12 dB eliminates 75%). This also pushes digitized values down towards the lower level quantization noise.
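The cost of headroom in code values follows from the decibel formula: leaving N dB unused at the top scales the usable peak by 10^(-N/20). A minimal sketch, with a hypothetical function name:

```python
def usable_code_fraction(headroom_db):
    """Fraction of the code values still reachable when the signal peaks
    headroom_db below full scale."""
    return 10 ** (-headroom_db / 20)

for headroom in (6, 12, 20):
    lost = 1 - usable_code_fraction(headroom)
    print(f"{headroom:>2} dB of headroom gives up about {lost:.0%} of the code values")
```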
There is a kind of corollary to the observation that a signal tends towards a triangle wave as it approaches the Nyquist frequency: as an audio signal approaches the lowest bits or code values, it tends towards a square wave.
Because of this, for sufficient fidelity, low dB values need to be set farther above the 0 code value. Although 1 dB is generally considered to be the just noticeable difference for human hearing, it is often useful to be able to adjust by half a dB, so I propose we use 0.1 dB granularity. Therefore, if we want at least 60 values for the bottom 6 dB (though these are log, so we strictly speaking need more), we need seven bits for just the bottom 6 dB. Another 22 bits on top of that totals 29 bits needed for full range digital audio. We can round this up to a more computer convenient 32 bits.
One could argue that this is unnecessary, as most recorded audio – music especially – uses very little of this full dynamic range. But from an audio production standpoint, having the lower end of the audio range would be very helpful.
A 32 bit audio sample yields almost 190 dB of dynamic range, enough to set 0 dB usefully above the zero code value, and leaving plenty of headroom to protect against clipping during recording or mixing.
The Audio Fool argues that 32 bits is pointless overkill for audio. The Audio Fool is wrong, at least as far as audio production goes. Once mastered, audio can be re-quantized to 24 or 16 bits for distribution.
So, if we accept the idea of using 32 bits per sample, how do we store it? 16 and 24 bit audio formats are generally signed integer values. But we have other options.
The international standard IEEE 754-2008 float format is attractive. It uses 32 bits but can represent a huge range of values to which we could map the 150 dB of dynamic range.
The primary advantage of using floating point values is that the granularity of each step is very fine. An integer value can be thought of as a series of buckets. For 8 bit, there are 256 of these buckets. Sampling is the process of choosing which bucket an analogue signal fits into at each sample point. There is a range of analogue values which end up in each bucket, and this lumping is called quantization. There are different techniques for choosing which bucket to place each sample into, particularly when the analogue value is close to the bucket boundary. This lumping of values into buckets necessarily results in some steppiness. The steppiness is exaggerated towards the bucket/sample boundaries, and when we do various types of processing (filtering, EQ, normalization, etc), this quantization noise can be compounded and amplified. We try to minimize this by choosing sufficient bit depth which hides this quantization noise well below the threshold of the just noticeable difference, but concatenated quantization noise can emerge when processing using integer values.
Technically, 32 bit float values are still buckets. There are no more of them than in a 32 bit integer (both have at most 2^32 bit patterns), but they are spread over a vastly larger range and packed most densely near zero. The ratio of the largest float value to the number of 32 bit integer values is something on the order of 7.9×10^28:1. This number is so big there’s not enough of anything on Earth to physically represent it. It’s overkill, but provides all the precision we require for representing analogue signal values.
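The 7.9×10^28 figure can be reproduced by dividing the largest finite 32 bit float by the number of distinct 32 bit integer values. That this is how the figure was arrived at is an assumption, but the arithmetic matches. The sketch uses only Python's standard `struct` module.

```python
import struct

# Largest finite IEEE 754 single precision value, from its bit pattern 0x7F7FFFFF.
FLOAT32_MAX = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

# Number of distinct values a 32 bit integer can hold.
INT32_VALUES = 2 ** 32

ratio = FLOAT32_MAX / INT32_VALUES
print(f"largest float32: {FLOAT32_MAX:.4e}")
print(f"ratio to 2^32:   {ratio:.2e}")
```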
It also minimizes errors when quantizing to integer-based audio formats, such as CD or MP3. The accuracy of this quantization will be better from a floating point value than from an integer of the same number of bits.
Unfortunately, most 32 bit floating point audio representations limit values to a normalized range between -1 and 1. IEEE 754 32 bit float values only have 24 bits of precision within this range, so this method of audio representation has little advantage over 24 bit integer values.
One way to get around this gotcha while still using floating point values would be to use double precision, or 64 bit, float values, which have 53 bits of precision. This is a bit like using a sledgehammer to pound nails, but it would get the job done with plenty of code values.
64 bit audio is not yet common. Until it is, 32 bit integer is probably the best choice.
All the precision in the world does not matter if it is not practical. However, we’re good on this point.
32 bit samples (integer or floating point) at a 192 kHz sampling rate result in 768,000 bytes per second (46 MB/min, 2.8 GB/hr). Twenty years ago, this would have been impossible on most computer systems. Today, even a 24 track workstation would generate 18 MB/sec throughput, which is manageable by even a low-end computer and hard drive. And mid-level systems could easily handle the data rate of hundreds of tracks.
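The 24 bit limit is easy to observe near full scale. Just below 1.0, neighbouring 32 bit float values are 2^-24 apart, so anything finer is rounded away. The helper `to_float32` is hypothetical; it round-trips a Python float (a 64 bit double) through single precision using the standard `struct` module.

```python
import struct

def to_float32(value):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

step = 2 ** -24  # spacing of float32 values just below 1.0

print(to_float32(1 - step) == 1 - step)  # True: this value survives
print(to_float32(1 - step / 4) == 1.0)   # True: the finer detail is rounded up to 1.0
```

Values closer to zero get finer spacing, since the exponent shrinks with them, but the resolution near the peaks is what a 24 bit significand allows.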
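The data rates are simple multiplication, sketched below with a hypothetical helper. It assumes uncompressed samples stored in whole bytes, with no file format overhead.

```python
def bytes_per_second(bit_depth, sample_rate_hz, tracks=1):
    """Raw PCM data rate for the given sample size, rate, and track count."""
    return (bit_depth // 8) * sample_rate_hz * tracks

one_track = bytes_per_second(32, 192000)
print(one_track)                                  # 768000 bytes per second
print(one_track * 60 / 1e6)                       # about 46 MB per minute
print(one_track * 3600 / 1e9)                     # about 2.8 GB per hour
print(bytes_per_second(32, 192000, tracks=24) / 1e6)  # about 18 MB per second
```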
Floating point math takes more computing power than integer math, but with processor clock speeds in excess of 3 GHz (and getting faster all the time), the processing hit is not a huge problem.
So there is no reason at this point not to do very high resolution audio at 32 bit (integer), 192 kHz sampling rate. And even 48 or 64 bit audio samples are not unrealistic for the future.