Audio testing insights: DXOMARK’s audio clips

Last October, along with the launch of our brand-new audio quality benchmark, we published a couple of papers outlining how DXOMARK tests audio playback and recording on smartphones. In this article, we’ll take a deeper look at our rigorous protocol by exploring the audio clips we use to evaluate the recording quality of each tested device.

Let’s start with an important question: why are audio clips necessary? Why not simply film a concert, record a meeting, or call a friend?

Real-life situations are by definition non-reproducible.

While all these situations are indeed representative of typical smartphone use cases, they do not qualify as valid testing environments. To evaluate every device in an unbiased and impartial manner, it is essential to expose each smartphone to strictly identical sound scenes within perfectly controlled environments, which allows our Audio team to obtain robust and consistent test results.

To respond to our protocol’s highly specific needs, we decided to create our own proprietary audio clips. In addition to the Outdoor and Electronic Concert environments, our team conceived three sound scenes — Urban, Office, and Home — each of them designed to evaluate precise attributes in particular scenarios.

The Office audio clip is a mix of various sound elements found in a generic open-plan office.

The Office audio clip, with the chatter of a small open-plan office, its mouse and keyboard clicks, and its people walking by, is dedicated to the Meeting scenario. It focuses on signal-to-noise ratio and sound envelope, background artifacts, objective and perceptual loudness, as well as numerous timbre, spatial, and artifact attributes. Each audio clip can serve several scenarios: for instance, we use the Urban recording for testing both Life Video (videos filmed with the rear camera) and Selfie Video recording quality.
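To make the signal-to-noise ratio mentioned above more concrete, here is a minimal sketch of how such a figure can be computed from two separate takes, one containing only speech and one containing only background. This is an illustration under stated assumptions, not DXOMARK's actual measurement procedure: the function names `rms` and `snr_db`, the two-take setup, and the synthetic tones standing in for real recordings are all hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech, background):
    """Signal-to-noise ratio in dB between a speech-only take and a
    background-only take (hypothetical setup, assumed to be captured
    under identical conditions)."""
    return 20.0 * math.log10(rms(speech) / rms(background))


# Synthetic stand-ins for real recordings: one second at 8 kHz.
rate = 8000
speech = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
background = [0.05 * math.sin(2 * math.pi * 97 * n / rate) for n in range(rate)]

# The speech tone has ten times the amplitude of the background tone,
# which corresponds to a ratio of about 20 dB.
print(round(snr_db(speech, background), 1))
```

A higher value means the voices stand out more clearly from the background. A perceptual evaluation goes well beyond this single number, which is why the protocol also considers loudness, timbre, and spatial attributes.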

Each clip is a mix of two kinds of material. The background sounds were recorded with a HEAD acoustics eight-microphone array in various Parisian locations (the banks of the Seine, rue de Rivoli’s intense traffic, rue de la Huchette’s tourist bustle and musicians, rue Montorgueil’s myriad shops and restaurants, and of course, DXOMARK’s offices). The vocal clips were recorded with a measurement microphone in the anechoic chamber of the famous IRCAM institute.

Recording in IRCAM’s anechoic chamber

Vocals are a crucial element for evaluating speech intelligibility, so our team thought through them carefully, coming up with four different timbres (two male, two female) with different accents interweaving at various angles and amplitudes, plus two additional troublemaking voices used exclusively for interference purposes. As for the words, they come from the Harvard sentences, a set of 72 lists of 10 phonetically balanced sentences each, in which each of English’s 44 phonemes appears at the same frequency as it does in the language of Shakespeare.

Now that we’ve discussed how these audio clips were created, let’s dive into the heart of the matter and see how we use them for evaluating smartphone recording quality. Follow us into DXOMARK’s offices: a modern building, a vast hall, an elevator, a corridor, an entrance door, another corridor, another door, a lab, an acoustic door — and here we are, in the listening room.

Tested device placed in the center of the listening room

Why, come on in, don’t be shy! Here’s where we play the audio clips for the device’s microphones in a very precise and reproducible manner: once we have meticulously placed the smartphone at the center of the listening room, we play back the background over 360° through eight calibrated speakers, while synchronously playing back the vocals through eight other dedicated speakers.

We then compare the recorded file to recordings made with other smartphones under the exact same conditions. For you to hear the difference, we prepared an excerpt of the same audio clip recorded with three smartphones: our current top scorer for recording, a second one that’s somewhere in the middle of the pack, and a third one that’s among the least capable phones we have tested so far for audio recording.

Can you recognize them, based on the audio files included in the reviews? On that subject, you may wonder what to listen for in those excerpts, in which you can make out the first sentence from the 20th Harvard Sentences list (“The fruit of a fig tree is apple-shaped!”):

In this Home clip excerpt, voices should remain clear and natural, and should not sound canned or nasal. Changes in vocal intensity shouldn’t induce sudden drops in volume (caused by temporal artifacts such as overcompression), and sources should be precisely localizable. Finally, the background shouldn’t overpower the voices, so that speech remains perfectly intelligible. Lots of elements to listen for in only a few seconds!
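For readers who would like to look for the sudden drops in volume described above in their own recordings, the following sketch flags frames whose short-term level falls sharply compared with the previous frame. It is only an illustration and not part of DXOMARK's protocol: the helper names `frame_levels_db` and `sudden_drops`, the 50 ms frame length, and the 6 dB threshold are all assumptions chosen for the example, and the test signal is a synthetic tone with an artificial dip.

```python
import math


def frame_levels_db(samples, frame_len):
    """Short-term RMS level in dB (relative to full scale) for
    consecutive, non-overlapping frames of frame_len samples."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        # Floor the power to avoid log10(0) on digital silence.
        levels.append(10.0 * math.log10(max(power, 1e-12)))
    return levels


def sudden_drops(levels_db, threshold_db=6.0):
    """Indices of frames whose level falls by more than threshold_db
    relative to the previous frame (threshold is an arbitrary choice)."""
    return [i for i in range(1, len(levels_db))
            if levels_db[i - 1] - levels_db[i] > threshold_db]


rate = 8000
frame_len = 400  # 50 ms frames at 8 kHz


def gain(n):
    """Simulated pumping: the level dips for 100 ms, starting at 0.5 s."""
    return 0.1 if 4000 <= n < 4800 else 0.5


signal = [gain(n) * math.sin(2 * math.pi * 440 * n / rate)
          for n in range(2 * rate)]

levels = frame_levels_db(signal, frame_len)
print(sudden_drops(levels))  # frame indices where the level collapses
```

On real speech, natural pauses between words also produce level drops, so a detector like this only points to moments worth listening to. It cannot tell overcompression from silence on its own.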

Feel free to guess in the comments section which smartphones were used in the comparison clip — and to tell us which other specific parts of our audio protocol you’d like explained.