Flac Lack, Alas...

For some years, now, the BBC has made its radio output available in the form of 320 kbps aac streams. People can listen to them either as live streams or ‘on demand’. There is no doubt that the audio quality of the Radio 3 stream is very high. But as a ‘lossy’ format, aac risks the possibility that some audible detail might be lost, or small alterations to the sound will occur. So there has also been a nagging concern that a ‘lossless’ form of stream would be preferred. And, indeed, various non-BBC internet audio streams have adopted lossless formats - in particular the use of FLAC (Free Lossless Audio Codec). FLAC has also become the de facto standard for high quality file downloads. Given this, the question many of us have wondered about has been, “Should the BBC also adopt FLAC as their best-quality streaming format?” Doing so would quell any remaining anxiety that their output is being even slightly degraded by the encoding process.

Enjoyment of the 2017 series of BBC Prom concerts from Radio 3 was given a real boost by the decision to stream them in FLAC format as a test. The resulting ‘Concert Sound’ was a real hit with me, and many other audio enthusiasts. The closest we could get to hearing exactly what Radio 3 producers and engineers were producing from the Hall itself. Applause all around for the BBC. I’m sure I wasn’t alone in anticipating that the BBC would then press the ‘go’ button and make FLAC streaming their new gold standard for Radio 3 – ideally in time for the 2018 Proms. Alas, this simply didn’t happen, so I’ve been trying to find out why, and what we can hope for in the future...

Having made some inquiries, the overall situation as of November 2018 is that any decision on adopting FLAC is still in a box labelled, ‘no decision has yet been taken’. Although I do wonder if this really means some people within the BBC are keenly arguing for it, whilst others feel that making the required changes isn’t really justified for various practical reasons. I also found and read an AES Conference Paper, ‘A subjective evaluation of high bitrate coding of music’ by Grivcova, Pike, and Nixon, who all work in the BBC R&D Department (144th Convention, Milan, May 2018). I suspect this may now be treated by some at the BBC as providing evidence against the adoption of FLAC.

Note that this paper has been made ‘open access’ so anyone who is interested can download a copy via http://www.aes.org/e-lib/browse.cfm?elib=19397 to read if they fancy. I contacted the authors and sent some of my questions and comments on this issue to them. They then kindly replied in some detail. I accepted many of the points they made, and agree with a lot of what is in their paper. But I still have some concerns which I will go on to outline here. If you are interested in this topic I strongly recommend you read their paper and make sure you get their side of the story as well. That said, I’ll outline what I think are the key points below in order to clarify my concerns. For brevity I’ll refer to the authors and their work as ‘GPN’...

The test method used in the GPN paper is one which professional audio engineers will be familiar with. It consists of a series of ‘RAB’ comparisons. The same source material (piece of music) was available via a three-way switch, and the listener could switch back and forth between the three switch positions as they fancied while the music played. Here ‘R’ was the ‘reference’ version, which was duplicated as either ‘A’ or ‘B’. The challenge was: could the listener identify which of the choices – ‘A’ or ‘B’ – was the same as ‘R’? The test system randomised which of ‘A’ or ‘B’ was arranged to be identical to ‘R’ during each listening comparison. The other version was different. So, for example, ‘R’ and ‘B’ might be the FLAC version whilst ‘A’ was the aac version during one run, but not during another. The aim here is that the listener’s only clue as to which version is ‘R’ will be what they hear.

A number of such test runs allowed GPN to do the statistics and discover if the listeners showed a clear sign of being able to ‘spot the FLAC’. The conclusion was that there was no reliable sign of this being the case, i.e. those participating in the test were unable to reliably hear any difference between the FLAC and 320k aac versions. The implication then may be drawn by some that there simply isn’t any need for the BBC to adopt FLAC as it won’t provide an audible improvement over aac. But the reality is more nuanced...
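To give a feel for the kind of statistics involved, here is a minimal sketch in Python of a one-sided binomial test for a forced choice between ‘A’ and ‘B’. It is a simplified illustration only, and is not the analysis GPN actually used. The function name and the trial counts are hypothetical, and it assumes that a listener who hears no difference picks correctly half the time.

```python
from math import comb


def p_value_at_least(correct: int, trials: int, chance: float = 0.5) -> float:
    """Probability of getting `correct` or more right out of `trials` by pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical scores, not taken from the GPN paper.
p_12_of_20 = p_value_at_least(12, 20)  # about 0.25: easily explained by guessing
p_16_of_20 = p_value_at_least(16, 20)  # about 0.006: hard to explain by guessing
print(f"12/20 correct: p = {p_12_of_20:.3f}")
print(f"16/20 correct: p = {p_16_of_20:.3f}")
```

A small p-value suggests the listener really could tell the versions apart. A large one only says that the scores look like guessing under the conditions of that particular test.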

As someone used to designing and using measurement systems for other kinds of experiment I know that a key challenge is to ensure that both the experimental design and the analysis of the results are appropriate for discovering what you wish to know, and for any decisions based on the results. The question then hinges on what is actually ‘appropriate’ for a given task. That is the main issue I want to consider in more detail here.

The GPN tests were carried out in a situation arranged to conform with the ITU-R BS.1116 standards. These specify many of the details - for example, the acoustic environment where listeners are placed. In essence, these details attempt to ensure a carefully controlled and well known listening environment. It is one which experienced radio broadcasters will be familiar with, and which is similar to what they may prefer when balancing or assessing radio output. This gives them a situation where they can feel able to assess sound quality with some confidence based on practical experience. It also serves as a basis for any results to be ‘reproducible’.

One vital aspect of any experimental or observational science is that the experiment should be ‘reproducible’ by others who may wish to check what has been reported. So it is, in terms of academic science, very useful for listening tests to be done in accord with a standard like BS.1116 because any other researchers know how to set up a similar test to see if they can, indeed, confirm reported results. This gives everyone involved confidence in the reported results because they can choose to check them rather than take for granted what they have been told.

For me the key potential weakness of the above is that the definition of ‘appropriate’ isn’t bound to be the same as a specific choice of what is either ‘familiar’ or ‘preferred’, or even ‘reproducible’. If everyone does their experiments in the same way, this opens up the risk that doing them in a different way might reveal different outcomes... which might then also prove to be reproducible, etc. The choice of a specific standard like BS.1116 might be an example of behaviour traditionally reported as a story about the fabled character, Nasrudin.

Nasrudin was walking home one day, and saw a man on the ground under a street-lamp. He asked the man what he was doing and was told he was looking for his key which he had dropped. Nasrudin spent some time looking with him, but no sign of the key. So he asked the man what he was doing when he lost the key. “I was over there, about to open my front door”, the man said, pointing to the dark side of the street. “So why aren’t you looking over there?” Nasrudin asked. “It’s dark over there, but here I can see”, the man replied!

Perhaps a closer analogy is a situation that actually arose many decades ago in astronomy. At the time almost all the telescopes used by astronomers only had a light collecting aperture of a few inches. And this was before the invention of photography. Astronomers could see some faint light areas of the sky which they named ‘nebulae’. But there was an argument about whether these were all patches of gas or dust, lit by starlight, or might be clusters of many, far distant, stars too small and faint and numerous to resolve with the telescopes of the day. They simply lacked the experimental ability to answer this question. Many astronomers took the view that the nebulae had to be gas or dust because they were certain that the Universe was too small for them to possibly be clouds of distant stars. (In those days the ‘Universe’ was assumed to mean the same as ‘Galaxy’. No-one realised there were many other galaxies far beyond our own.)

At the time, when astronomers compared their results to determine if they were ‘reproducible’ they were all using similar telescopes and looking at visible light with their human eyes. Later on, larger telescopes and better observational methods arrived and we realised that the idea that the nebulae were all gas or dust was due to a misleading but ‘reproducible’ result affected by the choice of the experimental system the astronomers had adopted. Which leads me back to wondering if the choice of how to run the listening tests may have influenced the results, in a way that may be reproducible, statistically valid... but misleading!

Two particular aspects of the GPN tests (and other similar tests) are worth considering here. One is the choice of music and listening environment. The other is the duration of the items of music used and the way they were compared. These factors, along with others, have come to be regarded as appropriate and compatible with BS.1116, with a good track record of aiding in previous tests where, for example, someone might have compared low or modest rate mp3 with an uncompressed LPCM original, often in tests that use a wide range of types of music, or focus on pop/rock/jazz more specifically.

The test ‘clips’ taken from the material chosen for GPN are reported to have been from 10 to 25 seconds long. This makes it possible to quickly compare different versions of a clip. In effect, it relies on short term memory and avoids the risk that either long term memory or changes in hearing or environment over a longer period lead to false recollections of previous sound quality. However this raises a point along the following lines:

The ‘short clip’ approach may well be ideally suited to, say, pop music and the defects of lossy compression at modest bitrates, or using relatively poor codecs that materially affect aspects like the timing of transients. But codecs using a higher bitrate and more modern methods may essentially cure these ‘short term’ problems whilst leaving some longer term effects that only become clear when the listener can compare much longer durations, or give themself time to listen more to the detailed sound quality or timbre. Quite subtle alterations in the sound quality of combinations of many acoustic instruments – i.e. the sound of a full orchestra – might show up more clearly in a long uninterrupted listening session than in a short ‘clip’. The risk is that everyone uses the same ‘kind of telescope’, and so agrees, but would have discovered some differences using a ‘bigger telescope’.

A practical problem here, of course, is ‘listener fatigue’, i.e. tests of this kind tend to be quite long and demanding anyway. Using much longer duration test examples might make it much harder to reach any reliable conclusions, and the tests might thus cease to be defensible on a scientific basis. But this again leaves open the possibility that someone simply listening in the long term for pleasure would get an audibly different experience given that they may be listening to an entire concert rather than a short ‘clip’.

In a similar way, most home listeners, even Hi-Fi enthusiasts who go to great lengths to set up good domestic audio systems, won’t be in a BS.1116 compliant environment. Their preferred / actual situation may emphasise a particular part of the audible spectrum, etc. That may mean that alterations caused by a lossy codec could, in some cases, be made more audible than if they were sitting in a BS.1116 compliant situation. (A possible parallel here is my own experience: I find that using a ‘peaky’ sounding set of headphones that emphasises the presence region makes it easier for me to hear and locate clicks in old LP transfers than if I use a more neutral listening arrangement.) So, again, although BS.1116 is useful as a standard for scientific research, it may be the ‘wrong kind of telescope’ for at least some domestic listeners.

All the above said, I would certainly say that the 2018 Proms, streamed as 320k aac by the BBC provides a superb sound quality. Overall, to my ears the sound is pretty convincing as a stereo representation of what it sounds like in the RAH. And as a result, I don’t get any feeling of ‘missing something’. I’ve not spent any time crying into my beer about the lack of FLAC this year as I’ve been too busy enjoying what we have heard.

However it’s also fair to say that the state of mind of the listener does tend to play a role in the extent to which a person can sit back, relax, and enjoy the music. The above points seem to me to mean that it remains quite possible that FLAC would, at least for some listeners, deliver better sound quality than the admittedly-excellent 320k aac. In addition, I suspect that many listeners would also gain increased enjoyment from the use of FLAC because it would give them more confidence that they have not ‘missed details’ that the 320k aac might have lost without them noticing. The argument here is akin to the use of JPEG for lossy compression of images. The trick with JPEG is to remove details the viewer won’t notice are absent when they lack an original to compare with the JPEG, i.e. it assumes the end user simply ‘won’t miss’ what they never had a chance to see, and thus won’t notice has been removed.

But the slight nagging worry for listeners – all too common for Hi-Fi enthusiasts – is that they may have missed audible details without knowing it. Lossless FLAC removes that concern, whereas any lossy method – no matter how good – cannot. So given the above comments about my not being certain the test method used by GPN is always the ‘right kind of telescope’ it would seem preferable to use FLAC unless there are good reasons not to do so. FLAC, of course, also has the advantage of being open, free, and well-understood. It is also used for streaming, etc, widely outwith the BBC, so has established itself for such purposes.

I contacted the authors of the paper and raised various points with them, including the ones I have explained above. Some of the points I asked them about were due to my initially misunderstanding the details of the GPN tests, etc. Having had those errors on my part cleared up I’ve not mentioned them above. For the sake of fairness, though, I’ll quote below some of their reply so you can see their views on the remaining relevant issues I have described.

“We stand by the results of our experiment. Given its limited scope (one codec, short clips of radio 3 content), we believe that the experiment design and data analysis techniques employed were suitable. If there was an observable effect under these conditions, we believe that our experiment would likely have detected it.”

“We believe that it is reasonable to extrapolate from short-term listening results such as ours to long-term perception of quality. Short test items are commonly used in listening tests because it makes them more sensitive and repeatable. Human memory for audio quality is extremely short, so most listeners can only detect subtle impairments when listening to very short segments of content. Using longer clips would introduce more variance into the results, decrease listener sensitivity to small impairments, and therefore reduce the overall sensitivity of the experiment.

“If there is some effect caused by long-term exposure to compressed music compared to lossless, it is reasonable to believe that it is a small effect. There are two mechanisms that could cause such an effect:

First, we only tested a relatively small amount of content, so there may be signals which trigger perceptible impairments which were not represented in our test stimuli. The stimuli used in the test were selected (according to the process outlined in ITU-R BS.1116-3 section 6) in order to avoid this effect. Good stimulus selection is a large part of the work of setting up a test like this, but broadly the stimuli selected are representative of broadcast content (covering a wide range of music and programme types), and are known to be critical for this codec (impairments were heard at lower bit-rates).
Second, there is the idea that listeners may be more sensitive to impairments during long term listening, or that long-term listening to coded audio may cause listener fatigue, despite there being no evidence of impairments during short-term tests. We know of no mechanism that would cause such effects, nor of test methods which could be used effectively to detect and quantify them. We would be interested to see more research in this area though.”

“Any decision to implement lossless streaming would be based on an overall assessment of the potential costs and benefits of such a change, not just on the potential increase in audio quality. Although producing a FLAC stream may be relatively easy, it creates some challenges during playback:

A new codec would need to be tested on all the platforms on which it would be used. In our experience there are often many problems encountered when implementing such a change.
Losslessly compressed audio is inherently variable bit-rate, while the other profiles we use have a constant average bit-rate, so this could potentially cause players to switch between profiles more frequently on marginal internet connections.
A lossless stream would have a significantly higher bit-rate than our current highest-quality stream (and therefore may cost the listener more money, or cause their device to use more power), but may not provide an improvement to the quality of experience.

We believe that the overall benefit to the audience would be small. We are passionate about audio quality and would like to see more lossless streams available, but having conducted this experiment it is clear to us that there wouldn't be enough of a real benefit to the audience.”
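To put rough numbers on the bit-rate point, here is a small worked calculation in Python. It assumes 16-bit, 48 kHz stereo LPCM as the source, which are the figures mentioned later in this article. The 0.6 compression ratio for FLAC is purely an assumed, illustrative value, since the real ratio varies with the music. Only the 320 kbps aac figure comes from the BBC’s actual streams.

```python
def pcm_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> float:
    """Bit-rate of uncompressed LPCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


AAC_KBPS = 320            # the BBC's current highest-quality stream
ASSUMED_FLAC_RATIO = 0.6  # assumption: FLAC output size relative to raw LPCM

raw = pcm_kbps(48_000, 16, 2)            # 1536 kbps uncompressed
flac_estimate = raw * ASSUMED_FLAC_RATIO
ratio = flac_estimate / AAC_KBPS

print(f"raw LPCM       : {raw:.0f} kbps")
print(f"FLAC (assumed) : {flac_estimate:.0f} kbps, about {ratio:.1f}x the aac rate")
```

Under these assumptions a lossless stream would need something like three times the data of the 320k aac stream, and the figure would rise and fall with the content.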

In its own terms the above does, to me, make sense as an argument, but I remain less confident that the conclusions drawn are the right ones for the decision to use FLAC or not for Radio 3 streams. The conclusions GPN draw are both plausible and likely, but might also turn out to be mistaken.

It’s perhaps worth adding something here about my own background. Although those with an interest in Hi-Fi may know about the work I did in that area, most of my career was as a University academic. There the basis of my work involved devising improved measurement systems to allow other people to make discoveries which had been beyond the capabilities of their pre-existing equipment and methods. I was reasonably successful in this and it made me realise the extent to which new discoveries tended to be made only when improved test or measurement methods made them possible. I understand why researchers rely on their standard methods, but am wary because “no sign of X” may tell us something about the measurement/observation method used, not the possible existence of “X”.
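The point that “no sign of X” can reflect the method rather than X can be illustrated with a statistical power calculation. The sketch below, in Python, estimates how often a simple forced-choice test would detect a listener who really does hear a small difference. The function names, trial counts, and the listener’s ‘true’ success rates are all hypothetical, and the binomial model is a simplification that is not taken from the GPN paper.

```python
from math import comb


def tail(correct: int, trials: int, p: float) -> float:
    """Probability of `correct` or more successes in `trials` when each succeeds with probability p."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def detection_power(trials: int, true_rate: float, alpha: float = 0.05) -> float:
    """Chance that a listener with the given true success rate passes a one-sided test at level alpha."""
    # Smallest score that pure guessing (p = 0.5) would reach with probability <= alpha.
    threshold = next(c for c in range(trials + 2) if tail(c, trials, 0.5) <= alpha)
    return tail(threshold, trials, true_rate)


# Hypothetical listener who is right 60% of the time instead of 50%.
for n in (20, 200):
    print(f"{n} trials: power = {detection_power(n, 0.6):.2f}")
```

With few trials, a listener with a small but real ability would usually fail to show up as ‘significant’. Many more trials are needed before such an effect becomes reliably visible.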

Coming back to FLAC versus aac, as things stand, in terms of sound quality the critical question seems to me to be a matter of judgement over whether the GPN results can be extrapolated into a different situation, i.e. that of experienced listeners sitting at home using a good audio system which is tailored to their individual long-term enjoyment rather than being easily used as a basis for scientifically reproducible test results.

So in summary, I’d accept that – in themselves – the GPN results are fine. As an ex-academic and an engineer I have no reason to question their work on that level. My concern is with respect to how others – who may not always be engineers and understand the cavils – may interpret it when they may be making decisions. From what I write above I think it will be clear that, even accepting GPN, to me it remains plausible that adopting FLAC would still be a desirable step forwards for the BBC iPlayer’s sound quality. And it might, despite the GPN results, provide increased enjoyment for listeners. The above examination is, however, focussed on the topic of sound quality. In reality the BBC have many other practical and cost considerations which they need to take into account. Like it or not, these must affect decisions in the real world.

During the last few years we’ve become accustomed to a BBC radio iPlayer system which works very well for listeners using a range of internet ‘receiving’ devices. These range from web browsers to set-top-boxes and commercial ‘streaming’ devices. However when the ‘Audio Factory’ system was introduced just a few years ago a number of listeners encountered problems because their existing web browser or device simply wasn’t capable of coping with the 320kbps aac streams we now regard as ‘standard’. This was a particular annoyance for people using ‘closed’ commercial boxes like ‘internet radios’ whose manufacturers didn’t promptly update them to handle the 320k aac streams. For the BBC it also meant a period when they had to maintain older ‘legacy’ streams operated in parallel with the 320k. Thus the process was more complicated and costly for the BBC than you might think. And because of the ‘user end’ problems they got a lot of ‘flak’ – of a different kind! – from some listeners and press alike!

A change now from aac to FLAC will involve another period of both extra costs and risks of criticisms surfacing again, when the chances are that many listeners who aren’t keen hi-fi enthusiasts may not think the result when they hear it was worth any inconvenience. It also means the BBC have to develop a changed system based on FLAC that has the resilience and capacity of their existing arrangements. That may seem simple enough given that they went through the process of building the Audio Factory. But each change brings its own practical problems, and involves a lot of work. Here I can give one simple example to illustrate this that a BBC engineer pointed out to me as a concern.

The current arrangements use aac which is a ‘lossy’ codec. If your internet connection and receiving setup can cope, the BBC’s system will send you 320k aac (if you live in the UK). However if the sending system finds that your connection can’t cope with this it will automagically drop down to a lower rate. This is because it is assumed listeners will find this preferable to audible ‘drop outs’ or the audio suddenly ceasing and a connection lost. Of course, it means a reduction in sound quality, but for most purposes it will still sound better than silence. (Although I can probably think of one or two programmes where silence might be preferred anyway!)

The snag here is that FLAC is not a lossy codec. Hence it can’t simply be dropped down to a much lower rate and remain ‘loss free’. So the above ‘adaptive’ method can’t be employed. Instead any ‘reduced rate’ fallback would have to either employ a lossy codec like aac, or cut down the input audio quality to a FLAC encoder – e.g. fall back to less than 16 bits per sample and 48k sample rate. Either way, the effect would be to drive a coach and horses through the aim of providing a ‘loss free’ version of what the BBC have to send to you. And, as yet, it is far from clear how this issue could best be dealt with, implemented on the scale of the Audio Factory and then incorporated into all the receiving arrangements which various listeners will be using.
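The fallback problem can be sketched as a simple ‘profile ladder’. The Python below is a hypothetical illustration and not a description of how the BBC’s system actually works. The profile names, the lower aac rates, the assumed average FLAC rate, and the headroom factor are all invented for the example. Only the 320k aac top rate comes from the real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    kbps: int       # nominal (average) bit-rate
    lossless: bool


# Hypothetical ladder, best quality first.
LADDER = [
    Profile("flac", 900, True),       # assumed average; real FLAC rates vary with content
    Profile("aac-320", 320, False),   # the current top-quality stream
    Profile("aac-128", 128, False),   # hypothetical lower rates
    Profile("aac-48", 48, False),
]


def pick_profile(available_kbps: float, headroom: float = 1.25) -> Profile:
    """Return the best profile the connection can sustain with some safety margin."""
    for profile in LADDER:
        if available_kbps >= profile.kbps * headroom:
            return profile
    return LADDER[-1]  # better a low rate than silence


for link in (2000, 1000, 100):
    chosen = pick_profile(link)
    print(f"{link} kbps link -> {chosen.name} (lossless: {chosen.lossless})")
```

As soon as the connection can’t sustain the top rung, the listener is back on a lossy profile. That is exactly the situation a ‘loss free’ stream was meant to avoid.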

This example shows us that there are some potential practical icebergs in the water here. So I can understand why they may be cautious of spending a lot of time and money only to find they hit an unexpected iceberg or two along the way! However my personal opinion here is that an intermediate step might be useful. In 2017 we did have an experimental trial. This worked very well (most of the time!) in my experience. But to listen to it I had to get hold of the source code of a specially modified version of a particular computer program called ffmpeg, and then build a working version to run on my computer. This was because the way the FLAC stream had to be sent was ‘non standard’. Hence most of the software, etc, people normally use was unable to cope. This gives a chicken-and-egg problem. The common web browsers, streaming devices, etc, won’t be updated to deal with a new arrangement until the computer programmers think it is needed. Yet there is no point broadcasting something no-one can hear!

To me, the way forwards would be via a route similar to what happened for the 320k aac. This was made available as a live Radio 3 feed, accessible from a specific webpage/address, and this feed was maintained for some time whilst making clear it was distinct from the ‘standard’ BBC iPlayer system, so might be intermittent or eventually be removed if few people used it. The long term presence of this lossless stream would then be a reason for users/listeners to put pressure on web browser developers, etc, and tell them, “I want this. When are you going to add support for it?” We could then see over a longer period if usage increased and the capability became more easily available to users who have no interest in building their own software. In essence this represents a step mid way between an initial test and full adoption. I don’t think the 2017 experiment was sufficient for this to happen because it was clearly time limited and experimental, and relatively few people would have had a chance to give it a long tryout. So there would not have been much pressure on software / hardware developers to bother with incorporating it. A longer trial of undefined duration might yield useful results.

All of that said, the main aim of what I have written here is to suggest that those making decisions at the BBC should take care not to assume that the GPN results ‘prove’ there can be no advantage in changing to FLAC. In the end, though, it’s all a question of how you look at things, or in this case, how you listen...

My thanks to the authors of the GPN paper, and others, for kindly being willing to discuss this topic with me. We may disagree about some of the implications we draw from the work, but I think we all have understandable reasons for doing so. I can confess that I’d have written about this earlier, but I was too busy simply enjoying the Proms via 320 aac!