How we test smartphone speaker playback

The DXOMARK audio testing laboratory
- The anechoic box
- The listening room
Playback use cases
The testing protocol for playback
“Perceptual” and “objective” measurements
The scores

There appears to be no end to the amount of content users are consuming on their smartphones. As displays become larger and brighter and offer a better overall visual experience, users are increasingly listening to content through the smartphone’s speakers instead of through headphones or Bluetooth and external speakers. It’s not unusual these days to see friends gather around a phone to watch videos.

Screens have become larger, sharper, and more feature-rich over the years, but the audio experience hasn’t always kept up, particularly when it comes to speaker output. In response, DXOMARK has created a new audio testing protocol designed to help both manufacturers and consumers better understand the strong and weak points of smartphone speaker capabilities.

The DXOMARK audio testing laboratory

We do all our audio testing in DXOMARK’s carefully designed audio testing laboratory, which houses an anechoic box and a listening room.

The anechoic box

An anechoic box is a space designed to absorb sound waves almost completely. The chamber is insulated against external noise and lined with fiberglass wedges that cover the entire ceiling, floor, and walls, so that the energy of a sound wave dissipates and echoes are virtually eliminated. Anechoic chambers are among the quietest places on Earth.

In our protocol, we place a device under test in the box and seal the box. We recreate use cases by carefully arranging speakers inside the chamber, and we record the results using microphones and sensors.

Microphones inside the chamber record the output from the device’s speakers.

We arrange speakers inside the box to recreate use cases.

The listening room

We use the listening room to reproduce typical user experiences with playback. We place the device in the center, and DXOMARK audio experts evaluate different audio attributes according to different use cases recorded at varying distances and with different smartphone orientations. We also use the room to compare the performance of two different smartphones under the same conditions to ensure the impartiality of the scoring system.

Audio experts evaluate different audio attributes.

We compare two different smartphones to ensure impartiality.

Playback use cases

DXOMARK has identified the most common use cases and faithfully reproduced them in the laboratory to test the quality of audio playback on a smartphone’s speakers under different conditions. This matters because certain aspects of sound playback play a more prominent role in the user’s audio experience when watching a video, for example, than when playing a game or listening to a concert.

Below you can see an example of a typical gaming use case. We use this game, N.O.V.A. Legacy from Gameloft, for actual testing. For this kind of game, gunshots and other game sounds need to be correctly spatialized so players know where in the scene they are coming from, and to enhance the overall user experience. (Best to watch this clip with headphones on to hear what we mean.)

DXOMARK Audio gaming use case (game clip courtesy of Gameloft).

Going further, music shared with a group is commonly played in portrait mode, with the user browsing and selecting music from their own collections and playlists. Portrait videos, such as those shared on social media, are likewise often shot natively on a smartphone. Because these videos are more likely to capture spoken words, such as a shared experience with friends and family, users will be listening for a different set of attributes from the playback than they would when listening to music. All of these distinctions have required DXOMARK to carefully develop testing protocols to fit each use case.

Playing a game on the smartphone

Sharing music with a group

The testing protocol for playback

DXOMARK’s testing protocol for speaker playback tests for the following attributes—timbre, spatial, dynamics, volume, and artifacts. Depending on the use case, some of these attributes will carry more weight than others.

Timbre

DXOMARK replicates all the use cases in the laboratory and carefully records the results of both perceptual tests and objective measurements. For timbre, DXOMARK evaluates a device’s ability to render the correct frequency output according to the use case and users’ expectations. Generally speaking, smartphones need to improve the rendering of the bass frequencies, for example, and our testing protocol will give device manufacturers the data they need to accomplish this.

Our experienced sound engineers test the bass, midrange, and treble frequencies and the overall balance among them. A good tonal balance typically consists of an even distribution of these frequencies.
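To make the idea of an even distribution across bass, midrange, and treble concrete, here is a minimal sketch that estimates each band’s share of a signal’s energy with a naive discrete Fourier transform. It is an illustration only, not DXOMARK’s method: the band edges in `BANDS`, the function names, and the synthetic three-tone test signal are all assumptions made for this example.

```python
import cmath
import math

# Hypothetical band edges in Hz; the article does not say where the bands are split.
BANDS = {"bass": (20, 250), "midrange": (250, 4000), "treble": (4000, 20000)}


def power_spectrum(samples, rate):
    """Return (frequency, power) pairs from a naive DFT, below the Nyquist frequency."""
    n = len(samples)
    spectrum = []
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        spectrum.append((k * rate / n, abs(acc) ** 2))
    return spectrum


def band_shares(samples, rate):
    """Return each band's share of the total in-band energy (shares sum to 1)."""
    energy = {name: 0.0 for name in BANDS}
    for freq, power in power_spectrum(samples, rate):
        for name, (low, high) in BANDS.items():
            if low <= freq < high:
                energy[name] += power
    total = sum(energy.values())
    return {name: value / total for name, value in energy.items()}


RATE, N = 48000, 480  # bin spacing is RATE / N = 100 Hz
# Synthetic stand-in for a recording: one equal-amplitude tone per band.
signal = [sum(math.sin(2 * math.pi * f * i / RATE) for f in (100, 1000, 8000))
          for i in range(N)]
shares = band_shares(signal, RATE)
```

With one equal-amplitude tone in each band, the three shares come out equal. A real recording would need a longer window and a calibrated microphone, and a measurement like this would complement the engineers’ listening rather than replace it.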

Spatial

Spatial audio is the ability to create the impression that sound exists in a three-dimensional space that nearly always exceeds the physical dimensions of the device itself—in other words, the soundscape. Being able to accurately place an instrument in an orchestra or the boom of an on-screen video explosion in the right location in the soundscape, for example, creates a richer experience for the listener, whether they are listening to music or watching a blockbuster movie.

Spatial audio has several sub-attributes, such as localizability, balance, distance, and wideness. Much of this is reproduced using techniques grounded in psychoacoustics, the branch of science that studies how humans perceive sound, including the illusion of space in music. For this reason, DXOMARK relies on perceptual tests and measurements to evaluate these sub-attributes.

Poor spatial wideness

Appropriate spatial wideness

Wideness, a sub-attribute of spatial, is the ability of a device to create a large peripheral area from which sound is perceived to be coming. As the illustrations above show, the aim of smartphone makers is to create an appropriate perceived wideness based on the position of the user.

Poor spatial balance

Appropriate spatial balance

The sub-attribute balance measures the equilibrium between multiple speakers on a device. This is important for a smartphone across the full range of use cases.

Poor spatial distance

Appropriate spatial distance

Distance refers to how far the perceived sound travels to the user; that is, the distance the user has to be from the speaker to hear the audio properly under different use cases.

Poor spatial localizability

Appropriate spatial localizability

Localizability is the ability of a device’s speakers to create the impression that specific sounds are coming from particular locations within the overall soundscape. From the illustrations above, it’s clear that precision is important.

Dynamics

Dynamics is a measure of the attack, bass precision, and punchiness of a recording. As with spatial testing described above, all the testing that DXOMARK records for dynamics is perceptual because much of the science of audio dynamics is grounded in psychoacoustics, the human perception of sound, rather than the objectively quantifiable output of audio levels.

Poor dynamic punch

Appropriate dynamic punch

As part of dynamics, we also test the overall volume dependency. This refers to the energy of a particular sound when played back on speakers. Our engineers evaluate how the attack, punch, and bass precision change based on the user’s volume level. (Punch measures how forceful and bold certain sounds are at certain frequencies, particularly the midrange. Another way of describing it is the overall energy of the sound.)

Volume

By contrast, we objectively analyze the volume and its sub-attributes by measuring the sound pressure levels (SPL) of a speaker at various volume settings to determine the maximum volume, the minimum volume, and the volume consistency. However, even with this easily recorded metric, perception plays a significant role, so our experts have also designed a protocol to faithfully record the perception of volume.
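The SPL measurement described above can be sketched in a few lines. The formula is the standard definition of sound pressure level (RMS pressure relative to 20 µPa). The consistency check, however, is only one plausible reading of “volume consistency”, and the function names and the 1 dB tolerance are assumptions for illustration.

```python
import math

P_REF = 20e-6  # standard reference pressure for SPL in air: 20 micropascals


def spl_db(pressure_samples):
    """Return the level in dB SPL for calibrated pressure samples in pascals."""
    mean_square = sum(p * p for p in pressure_samples) / len(pressure_samples)
    return 20.0 * math.log10(math.sqrt(mean_square) / P_REF)


def volume_steps(spl_by_step):
    """Return the SPL change between consecutive volume settings, in dB."""
    return [b - a for a, b in zip(spl_by_step, spl_by_step[1:])]


def is_consistent(spl_by_step, tolerance_db=1.0):
    """Assumed criterion: every step gets louder by roughly the same amount."""
    steps = volume_steps(spl_by_step)
    mean_step = sum(steps) / len(steps)
    return all(s > 0 and abs(s - mean_step) <= tolerance_db for s in steps)


# Hypothetical SPL readings (dB) at four consecutive volume settings.
even_ramp = [50.0, 56.0, 62.5, 68.0]
uneven_ramp = [50.0, 51.0, 65.0, 66.0]
```

Here `is_consistent(even_ramp)` holds while `is_consistent(uneven_ramp)` does not, because the second device barely changes level on two steps and jumps by 14 dB on the other. The minimum and maximum volume are simply the first and last readings.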

Artifacts

As with volume, many artifacts can be objectively measured and recorded. A spectrogram makes it easy to identify and measure typical artifacts such as pops, clicks, and other aberrant sounds. Even in the digital world, user interactions with a device can sometimes cause unwanted artifacts and sounds. For example, a user can create artifacts when increasing or decreasing the volume, when pausing and pressing play, or when otherwise handling the device.

Some of these artifacts are temporal (that is, they change over time) and can be characterized as sudden increases or decreases in volume (“pumping”). Pumping is most obvious in pop and electronic genres of music. Sometimes pumping effects are desired and sometimes they are not; DXOMARK is concerned with undesired pumping effects. We measure any unwanted artificial sound that is not part of the original desired playback using a mixture of objective and perceptual tests.
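One simple way to make a temporal artifact such as pumping visible is to track the short-term level of a recording frame by frame and look at how far it swings. The sketch below is an assumption-laden illustration, not DXOMARK’s procedure: the frame length, the helper names, and the synthetic test tones are all hypothetical, and a real test would also have to separate unwanted pumping from level changes that are part of the music.

```python
import math


def frame_levels_db(samples, frame_len):
    """Return the RMS level of each full frame, in dB relative to full scale (1.0)."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        levels.append(20.0 * math.log10(max(rms, 1e-12)))  # floor avoids log10(0)
    return levels


def level_range_db(samples, frame_len):
    """Return the spread between the loudest and quietest frame, in dB."""
    levels = frame_levels_db(samples, frame_len)
    return max(levels) - min(levels)


def tone(gains, freq=1000.0, rate=48000, frame_len=480):
    """Synthesize a sine whose amplitude follows one gain value per frame."""
    return [g * math.sin(2 * math.pi * freq * (i * frame_len + n) / rate)
            for i, g in enumerate(gains) for n in range(frame_len)]


steady = tone([1.0] * 6)                      # constant playback level
pumped = tone([1.0, 1.0, 0.5, 0.5, 1.0, 1.0])  # level briefly drops by half
```

For the steady tone the range is essentially 0 dB, while halving the amplitude for two frames produces a swing of about 6 dB.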

Distortions are measured as spectral artifacts.

The aim is to minimize spectral artifacts and obtain a clean sound.

Fundamentally, spectral artifacts relate to unwanted frequencies. At certain amplitudes, sound can get distorted and generate unwanted frequencies. (This mostly occurs at higher volumes.)
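A common way to quantify unwanted frequencies of this kind is total harmonic distortion (THD): play a pure tone and compare the energy that appears at its harmonics with the energy at the tone itself. The sketch below is a generic illustration under stated assumptions, not DXOMARK’s measurement. The hard clipping stands in for a speaker driven too loud, and the function names and the choice of five harmonics are hypothetical.

```python
import cmath
import math


def bin_magnitude(samples, freq, rate):
    """Magnitude of a single DFT bin at `freq`, normalized by the sample count."""
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * freq * k / rate)
              for k, x in enumerate(samples))
    return abs(acc) / n


def thd(samples, fundamental, rate, harmonics=5):
    """Ratio of the combined 2nd..`harmonics`th harmonic magnitude to the fundamental."""
    base = bin_magnitude(samples, fundamental, rate)
    harmonic_power = sum(bin_magnitude(samples, fundamental * h, rate) ** 2
                         for h in range(2, harmonics + 1))
    return math.sqrt(harmonic_power) / base


RATE, F0, N = 48000, 1000.0, 480  # 480 samples hold exactly 10 cycles of 1 kHz
clean = [math.sin(2 * math.pi * F0 * k / RATE) for k in range(N)]
clipped = [max(-0.5, min(0.5, x)) for x in clean]  # hard clipping at half amplitude
```

The clean tone yields a THD that is numerically zero, whereas the clipped tone shows clearly measurable harmonics, which is consistent with distortion appearing mostly at higher volumes.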

Temporal artifact, fluctuations in volume

Consistent volume levels

“Perceptual” and “objective” measurements

When we talk about perceptual testing, we are talking about using the human ear and brain as the main measurement tools. Because our sound experts have worked for years in the fields of audio engineering and audio industrial design in a wide array of industries, they are able to discern different aspects of audio and sound to a level far superior to that of untrained listeners. Furthermore, DXOMARK has created specific protocols to ensure that any perceptual measurement is consistent over time. Therefore, the same test carried out months later on the same device should provide consistent results.

Our objective tests, on the other hand, rely on sensors and on tools such as a spectrogram or a sound-level meter to record the results. Whether a test is perceptual, objective, or a combination of both, we quantify its results using proprietary protocols.

All this means that our perceptual tests are no less scientific than the objective measurement tests. We carefully record the measurements and we repeat the tests several times to ensure that the results are recorded accurately and impartially.

The scores

It would be very difficult for the average user to properly evaluate the audio quality of a device’s speakers without owning the device for several weeks and having had the opportunity to put it through most, if not all, use cases. Additionally, what most of us never get the opportunity to do is compare two separate devices side by side. This is the fundamental value of an impartial scoring system: users will be able to see in a quantifiable manner the differences between competing devices.

Within the testing protocol, the attributes and use cases have their own scores. Furthermore, each attribute has several sub-attributes that we also evaluate and score. We weight these scores individually and subsequently use a complex algorithm to calculate our headline score from them.

A number of attribute sub-scores feed into the Playback score, which in turn is a sub-score of the DXOMARK Audio overall score.
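The actual scoring algorithm is proprietary and described only as complex, so the following is not a reconstruction of it. It is a minimal sketch of the general idea of weighting sub-scores, using a plain weighted mean; the function name, the sub-score values, and the weights are all hypothetical.

```python
def weighted_score(sub_scores, weights):
    """Weighted mean of sub-scores; the weights do not need to sum to 1."""
    total_weight = sum(weights[name] for name in sub_scores)
    return sum(sub_scores[name] * weights[name] for name in sub_scores) / total_weight


# Hypothetical attribute sub-scores for one device (illustration only).
attribute_scores = {"timbre": 70, "spatial": 60, "dynamics": 65,
                    "volume": 72, "artifacts": 80}
# Hypothetical weights for one use case, where timbre matters most.
video_weights = {"timbre": 3, "spatial": 2, "dynamics": 2,
                 "volume": 2, "artifacts": 1}

playback = weighted_score(attribute_scores, video_weights)
```

Changing the weights per use case is what lets the same attribute measurements produce different emphasis for, say, gaming versus sharing music.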

DXOMARK’s reviews will drill down to every aspect of the device’s audio capabilities and will consider the use cases and the various attributes. This means that anyone who reads our reviews will be able to investigate the specific sub-scores that are relevant to their own use cases.

Finally, unlike some benchmarking systems, DXOMARK’s audio scoring will be an open scale. This means that the headline score (and the sub-scores) will be a number that will constantly climb as the technology and the performance improve. This open scale will allow you to objectively compare the results of current flagship smartphones with any future devices.

For more information about our Audio testing, read the article on how we test audio recording.