The DXOMARK Speaker protocol

Wireless speakers are an immensely popular product allowing you to connect to and control output from smartphones, computers, TVs and other devices without requiring cables. Instead they use Bluetooth, Wi-Fi or proprietary protocols to transmit data to and from the connected device. Many wireless speakers (so-called smart speakers) also come with an integrated virtual assistant that offers interactive hands-free activation and often control via a “hot word” and voice commands. (Note that we have not tested voice control in this first version of our test protocol.) DXOMARK has developed its new Speaker test protocol to evaluate and compare the audio playback quality of the products on offer in the wireless speaker market segment.

Test devices

We at DXOMARK are testing and evaluating a wide range of wireless speakers from a large number of brands, from entry-level products with fairly basic specs and features to high-end models that should satisfy the audio requirements of even the most demanding audiophiles.

Our reviews cover both battery- and mains-powered speakers and models with or without voice assistant function. All speakers we test offer wireless connectivity via Apple Airplay, Google Cast, proprietary protocols, or Bluetooth, and for now, we are only testing single units. (We may also end up testing speakers that are combined to form dual/stereo units in the future, though.)

The Speaker test protocol

The DXOMARK Speaker test protocol is designed to grade the consumer experience when using the device’s playback function by evaluating relevant audio quality attributes across a range of representative use case scenarios. Like our Camera, Selfie, and Audio test protocols, it uses a two-pronged approach, combining objective measurements done in the lab under controlled conditions with perceptual evaluation methods. Our experts carry out the tests using a wide selection of audio clips, including commissioned music tracks, voice, and other multimedia content. The DXOMARK Speaker overall score is computed from all scores and measurements obtained during the testing process.

Objective testing in the lab
Perceptual testing in an outdoor setting. Testers swap smartphones to connect to both the speaker under test and to reference (comparison) devices.

Test settings

For testing at DXOMARK, speakers are connected to and controlled via a smartphone. (Depending on compatibility, we use either an Apple iPhone or an Android model.) For our TV use case, we also connect the speakers to a smart TV: an Apple TV, an Android TV, or another smart TV if required.

Devices are connected either via Apple’s AirPlay or Google Cast. If neither of those is compatible or available, we use the speaker manufacturer’s proprietary protocol or Bluetooth as a last resort.
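
The fallback order described above can be sketched as a simple priority lookup. This is only an illustration: the protocol labels and the function name are hypothetical, and the relative order of AirPlay and Google Cast is an assumption (in practice it would depend on the smartphone used).

```python
from typing import Iterable, Optional

# Priority order as described in the text: AirPlay or Google Cast first,
# then the manufacturer's proprietary protocol, then Bluetooth as a last resort.
# Placing AirPlay ahead of Google Cast is an assumption made for this sketch.
PRIORITY = ("airplay", "google_cast", "proprietary", "bluetooth")


def pick_connection(supported: Iterable[str]) -> Optional[str]:
    """Return the highest-priority protocol the speaker supports, or None."""
    available = {name.lower() for name in supported}
    for protocol in PRIORITY:
        if protocol in available:
            return protocol
    return None


# A speaker without AirPlay or Google Cast falls back to its proprietary protocol.
choice = pick_connection(["Bluetooth", "proprietary"])
```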

Objective and perceptual measurements

To test speakers, DXOMARK uses both objective and perceptual evaluation methods, both of which are quantified using proprietary protocols. Objective testing for speakers takes place in the DXOMARK Audio lab where we also do our smartphone audio evaluations. Our experts record speaker output using such specialized equipment as sound-level meters and calibrated microphones in our brand new semi-anechoic chamber, a space that absorbs sound reflections.

The chamber is insulated against external noise; moreover, it is lined with fiberglass wedges that cover the entire ceiling and walls to ensure that all the energy of a sound wave dissipates, thus completely eliminating echoes. Anechoic chambers are among the quietest places on Earth.

When we talk about perceptual evaluation, we are talking about using the human ear and brain as the main measurement tools. Because our audio experts are trained in listening to specific cues that are defined in the DXOMARK audio protocols, they are able to discern even the slightest differences in terms of a speaker’s audio quality attributes.

Perceptual testing in the purpose-designed apartment

In addition, we have created specific protocols to ensure that any perceptual measurement is consistent over time. Therefore, the same test carried out months later on the same device provides identical results. Tests are also repeated several times to ensure that the results are recorded accurately and impartially. This means that our perceptual tests are no less scientific than our objective measurements.

Perceptual testing takes place in simulated real-life environments, for example a living room, bathroom, or kitchen set up in a purpose-designed apartment.

Precision microphones in the lab setup
Control setup for the lab

Speaker use cases

People use their speakers in a wide variety of ways. To ensure our tests cover the most common scenarios, we have designed a number of speaker use cases that cover all audio quality attributes. We designed and selected these use cases after comprehensive data collection and analysis in order to find out where and how consumers currently use speakers in their homes and elsewhere.

Where consumers use speakers in 2020 (source: Voicebot.ai)

Use cases take into account a wide range of factors, for example: the position of the speaker in the room (center, in the corner, close to a wall…), the space (living room, kitchen, bathroom, outdoors…), ambient noise (silence, chatter…), the type of content played (different music genres, podcasts, movies…), the playback volume (quiet, soft, medium, loud, maximum) and the position of the listener(s) relative to the speaker (in front, to the side, etc.).
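
One way to picture these factors is as a small record with one field per factor. The sketch below is purely illustrative: the class name, field names, and validation are assumptions, not DXOMARK's internal representation. The example values are taken from the Party use case, except for ambient noise, which is not stated for that use case.

```python
from dataclasses import dataclass
from typing import Tuple

# Volume labels as listed in the text.
VOLUME_LEVELS = ("quiet", "soft", "medium", "loud", "maximum")


@dataclass(frozen=True)
class UseCase:
    """Hypothetical record of the factors that define one use case."""
    name: str
    space: str
    speaker_positions: Tuple[str, ...]
    ambient_noise: str
    content: Tuple[str, ...]
    volume: str
    listener_positions: Tuple[str, ...]

    def __post_init__(self):
        # Reject volume labels that are not in the list given in the text.
        if self.volume not in VOLUME_LEVELS:
            raise ValueError(f"unknown volume level: {self.volume!r}")


party = UseCase(
    name="Party",
    space="living room",
    speaker_positions=("close to a wall", "middle of the room"),
    ambient_noise="unspecified",  # not stated in the text for this use case
    content=("electronic", "hip-hop", "EDM"),
    volume="loud",
    listener_positions=("in front", "around"),
)
```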

For example, for our Cooking use case we have recorded kitchen noises, such as frying in a pan or cutting with a knife, in a real environment. For the perceptual evaluation, the listener is located in a kitchen by the counter. Multiple speakers play back the pre-recorded kitchen noises. Each speaker is only playing one sound and is placed at the location that sound would come from in a real kitchen for a very immersive sound experience. The setup of this test and test conditions remain identical over time, which means results are repeatable.

DXOMARK Speaker use case definitions

Using these criteria, we have created the following use cases:

  • Bathroom (listening to music in the bathroom): The bathroom is a very challenging environment for any speaker, as tiles, glass, and mirrors tend to amplify reverberations and reflections. In this use case we mostly play bass-heavy music to make things even more difficult. Some speakers are capable of auto-adjusting to these conditions, others aren’t. This use case is designed to find out which ones adjust the best.
  • Kitchen (listening to cooking instructions in the kitchen): This use case is very much geared towards evaluating voice intelligibility, with the speaker playing podcasts and cooking instructions in a kitchen environment. The speaker is placed on a table behind the tester who is standing at the kitchen work surface. Several other speakers are placed around the kitchen playing simulated kitchen noises to make the conditions as real as possible.
  • Bedtime (listening to music or relaxing podcasts at low volume before falling asleep): For this use case, the speaker is placed on a bedside table in close proximity to the listener. Relaxing music such as classical or jazz is played, as well as relaxing podcasts that are designed to help listeners fall asleep. All content is played at low volume.
  • Gathering (listening to music in the company of friends or family): This use case is designed to simulate a social gathering at which music is not the focus but is played in the background. The speaker is placed at the center of the table, which allows for efficient evaluation of a speaker’s directivity, and the volume is set to allow for easy conversation.
  • Party (listening to music at high volume in the living room): For this use case, music genres like electronic, hip-hop, or EDM are played at loud but not maximum volume with the speaker placed close to the wall (where many people would have it placed or mounted on a permanent basis) and in the middle of the room (where you might move it for a party). Listeners are therefore standing either in front of or around the speaker. This is one of the most important use cases because the loud volume takes hardware and software to their limits, revealing weaknesses that might not be noticeable at lower volumes.
  • Relaxing (listening to music or podcasts while sitting on the sofa): This is also an important use case, simply because it replicates how many consumers use their wireless speakers at home. We play a selection of pop music, rock, or chill-out music at moderate volume and our experts evaluate performance at several distances between the speaker and the listener (who is always facing the speaker). We also use podcasts that contain mostly voices.
  • Outdoor (listening to music in an outdoor environment): For this use case, we play music in an outdoor space, such as on a balcony or in a garden, with people gathered around the speaker. This use case differs from the others, as sound reflections and reverberations are much reduced outdoors, making for very different acoustics compared to indoor settings. We also use this use case to check if a speaker’s performance under battery power differs from when it is plugged into a power outlet.
  • Movies (watching movies on a TV with the speaker connected for sound output): In this use case, the speaker is used as a soundbar replacement, putting out the audio of the TV or of a smartphone while watching movies. For the TV scenario, the speaker is placed close to the TV, just as a soundbar would be. One important point we check in this use case is whether there is any noticeable latency that would have a negative impact on the movie experience.
Bedroom use case setup
Bathroom use case setup
Kitchen use case setup: speakers on the countertop play typical kitchen background noises, such as boiling or chopping.
Relaxing use case setup

DXOMARK custom music tracks

For both our objective testing and perceptual evaluation we use a wide variety of audio clips, including a range of music tracks which were specifically commissioned for this purpose. The tracks cover most popular music genres, including classical, hip-hop, reggaeton, electronic dance music, pop, electronic, ambient and jazz.

These genres were selected to ensure our tests are relevant to the vast majority of speaker users. Additionally, each genre has its own style in terms of sound design, loudness, instruments, timbre, and other characteristics. By using different types of music we ensure the speakers are stimulated in many different ways during testing.

All tracks were produced, mixed, and mastered by professional artists after the DXOMARK Audio team had precisely defined what exact musical elements each track had to include. The objective was to create tracks for each genre that contain specific audio cues that facilitate perceptual evaluation when played back on a test speaker. In our testing these commissioned audio tracks are predominantly used for perceptual evaluation but also for some objective measurements, for example maximum loudness.

You can listen to a few extracts of some of the tracks here.

Speaker audio quality attributes

DXOMARK’s testing protocol for speakers uses the standard attributes for perceptual audio evaluation as defined by the International Telecommunication Union (ITU-R BS.2399-0). We have selected the attributes that are most relevant for speakers: timbre, spatial, dynamics, volume, and artifacts. Depending on the use case, some of these attributes will carry more weight than others.

Audio quality attributes (Source: Report ITU-R BS.2399-0 (03/2017))

Timbre

Timbre describes a speaker’s ability to render the correct frequency output according to the use case and users’ expectations, looking at bass, midrange, and treble frequencies, as well as the balance among them. Good tonal balance typically consists of an even distribution of these frequencies. In addition, we look out for resonances, notches, and extensions in each of the frequency regions.

Spatial

Spatial evaluates the speaker’s capability to properly place the audio elements in a two-dimensional sound field. Being able to accurately place an instrument within a band or orchestra, or an explosion in a movie, at its correct location in the sound field makes for a better listening experience.

Spatial audio has several sub-attributes, including localizability, balance, distance, wideness, and directivity. Localizability is the ability of a speaker to create the impression that specific sounds are coming from particular locations within the overall sound field. Balance measures the equilibrium between multiple speakers on a device. Distance refers to the ability to maintain the reciprocal distance of the audio elements in the overall mix. Wideness is the ability of a device to create a large peripheral area from where sound is perceived to be coming. Directivity is the capability of the device to reproduce consistent sound in any direction.

Dynamics

Dynamics evaluates a speaker’s ability to convey punch as well as clear attack and bass precision. As part of dynamics, we also test the overall volume dependency, or in other words, how attack, punch, and bass precision change depending on the volume step selected by the user.

Volume

Our volume tests evaluate if a speaker can produce adequate volume levels for every use case. We objectively measure the sound pressure levels (SPL) of a speaker at various volume settings to determine the maximum volume and the volume consistency.
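
As a reminder of what an SPL figure means, the sketch below applies the standard definition, SPL = 20 · log10(p_rms / p0) with the reference pressure p0 = 20 µPa, to a list of pressure samples. The second helper is a hypothetical way to look at volume consistency (the SPL difference between successive volume steps); it is an assumption for illustration and not DXOMARK's published metric.

```python
import math

P_REF = 20e-6  # standard reference sound pressure in air, in pascals (20 µPa)


def spl_db(pressure_samples_pa):
    """Sound pressure level in dB from pressure samples given in pascals."""
    mean_square = sum(p * p for p in pressure_samples_pa) / len(pressure_samples_pa)
    return 20 * math.log10(math.sqrt(mean_square) / P_REF)


def step_increments(spl_per_step):
    """Hypothetical consistency check: dB change between successive volume steps."""
    return [b - a for a, b in zip(spl_per_step, spl_per_step[1:])]


# An RMS pressure of 1 Pa corresponds to roughly 94 dB SPL.
level = spl_db([1.0, -1.0, 1.0, -1.0])
```

Evenly sized increments from step to step would suggest a consistent volume curve; large jumps or flat spots would not.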

Artifacts

During our artifacts tests we check for any sounds that could be disturbing to the listener, for example noise or clipping. Artifacts can also be caused by user interaction with the speaker, for example when changing the volume level, pausing, pressing play, or simply handling the speaker.

Artifacts can be temporal or spectral. Temporal artifacts are variations over time, for example sudden increases or decreases in volume, or “pumping.” Pumping is most obvious in pop and electronic genres of music and is sometimes desirable (and sometimes not). DXOMARK is concerned with undesired pumping effects. Spectral artifacts relate to the addition of frequencies that are not part of the input audio signal: the sound gets distorted, which generates unwanted frequencies. This mostly occurs at higher volumes.
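
To illustrate how distortion adds frequencies that were not in the input, the sketch below hard-clips a pure sine wave and compares the energy at three times the original frequency before and after clipping. It is a generic signal-processing illustration using only the Python standard library, not a description of DXOMARK's measurement chain; the sample count, frequency bin, and clipping limit are arbitrary choices.

```python
import cmath
import math


def dft_magnitude(signal, k):
    """Normalized magnitude of DFT bin k, computed directly from the definition."""
    n = len(signal)
    acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
    return abs(acc) / n


def hard_clip(signal, limit):
    """Limit every sample to the range [-limit, +limit]."""
    return [max(-limit, min(limit, x)) for x in signal]


N = 256        # number of samples (arbitrary)
FUND_BIN = 5   # the sine completes exactly 5 cycles in the window

clean = [math.sin(2 * math.pi * FUND_BIN * i / N) for i in range(N)]
clipped = hard_clip(clean, 0.5)

# The third harmonic is absent from the clean sine but present after clipping.
clean_h3 = dft_magnitude(clean, 3 * FUND_BIN)
clipped_h3 = dft_magnitude(clipped, 3 * FUND_BIN)
```

Symmetric clipping of this kind produces odd harmonics, which is why the check looks at three times the fundamental.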

In addition we measure audio/video latency, which has a negative impact on the user experience when playing movies and other videos with sound.

The scores

During our testing we obtain perceptual and/or objective scores for all sub-attributes across all use cases. Sub-scores are weighted and aggregated into attribute scores (for example, dynamics or artifacts) and use case scores (for example, bathroom, party or outdoor) using a complex algorithm. We then compute the overall Speaker score from these sub-scores.
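
DXOMARK does not publish the details of this algorithm. As a rough intuition only, the sketch below uses a plain weighted mean as a stand-in for the aggregation step; the function name, sub-score names, values, and weights are all hypothetical.

```python
def weighted_average(scores, weights):
    """Weighted mean of the given scores; a stand-in for the real aggregation."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical sub-scores and weights for one attribute, for illustration only.
dynamics_sub_scores = {"attack": 80, "punch": 60}
dynamics_weights = {"attack": 3, "punch": 1}

dynamics_score = weighted_average(dynamics_sub_scores, dynamics_weights)
```

In a scheme like this, raising the weight of a sub-attribute for a given use case is how it would come to "carry more weight" in the result.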

