Spatial Audio for Cinematic VR and 360 Videos

Spatial audio is a powerful way to immerse a user and direct their attention within a 360 video or VR experience. Audio cues can steer much of our attention, but a fully immersive experience requires a detailed spatial audio mix, not just cues added as an afterthought. Spatial audio makes what we hear believable by matching it to what we see and to what we have experienced in the real world. For this reason, sound design should be part of the creative brief from the very beginning: poor or misplaced audio design and cues can undermine an otherwise convincing result.

Here we’ll cover the basics of spatial audio and provide some how-tos to help you make an awesome 360 VR experience.

What is Spatial Audio?

The human brain interprets auditory signals in a specific way that allows it to make decisions about its surrounding environment. We use our two ears, together with the ability to move our heads, to judge the position of a sound source and the environment the sound is in.

Spatial audio in virtual reality involves the manipulation of audio signals so they mimic acoustic behavior in the real world. An accurate sonic representation of a virtual world is a very powerful way to create a compelling and immersive experience. Spatial audio not only serves as a mechanism to complete the immersion but is also very effective as a UI element: audio cues can call attention to various focal points in a narrative or draw the user into watching specific parts of a 360 video, for example.

Spatial audio is best experienced using a normal pair of headphones. No special speakers, hardware, or multi-channel headphones are required. For more details on auditory localisation, check out this write-up on the Oculus Developer Center.

Try it for yourself with these examples of spatialized audio. Make sure you have your headphones on!

Fuerza Imprevista (JauntVR)

Through the Ages (NatGeo)

Rapid Fire: A Brief History of Flight (Studio Transcendent)

Linear vs. Interactive Audio Design

While the playback and consumption of spatial audio are the same regardless of whether the experience is an interactive experience, a 360 video, or a cinematic mixed reality piece, the workflow to create such content is significantly different. In particular, games and interactive experiences often rely on audio samples played from discrete sound sources that are mixed in real time relative to the position of the camera. The Oculus Audio SDK is designed to add high-quality spatialization to the existing tools (such as FMOD and Wwise) and engines (Unity and Unreal) often used by game developers.
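To make "mixed in real time relative to the position of the camera" a little more concrete, here is a minimal sketch of the geometry an interactive mixer might evaluate for each sound source on every frame. It is not the Oculus Audio SDK API. The function name, the coordinate convention (x forward, y left, z up, yaw counter-clockwise in radians), and the simple inverse-distance gain are all assumptions made for illustration.

```python
import math

def source_relative_to_listener(listener_pos, listener_yaw, source_pos, min_distance=1.0):
    """Return (azimuth, elevation, gain) of a source as heard by the listener.

    Assumed convention: x forward, y left, z up; angles in radians;
    listener_yaw is a counter-clockwise rotation about the z axis.
    """
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    dz = source_pos[2] - listener_pos[2]
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)

    # Direction in world space, then made relative to where the listener faces.
    azimuth = math.atan2(dy, dx) - listener_yaw
    # Wrap the result into the range [-pi, pi].
    azimuth = math.atan2(math.sin(azimuth), math.cos(azimuth))
    elevation = math.atan2(dz, math.hypot(dx, dy))

    # Hypothetical inverse-distance rolloff, clamped so gain never exceeds 1.
    gain = min_distance / max(distance, min_distance)
    return azimuth, elevation, gain

# A source two units to the listener's left.
azimuth, elevation, gain = source_relative_to_listener((0.0, 0.0, 0.0), 0.0, (0.0, 2.0, 0.0))
```

A real engine would feed these values into its spatializer for each source. A linear 360 video mix, by contrast, is rendered ahead of time and only responds to head rotation at playback.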

Thanks to the growth in production and consumption of immersive panoramic and VR experiences, developers and infrastructure owners are re-examining the constraints of video container specifications. Better file compression, codec and streaming infrastructure support for multichannel spatial audio, and support for real-time interactive metadata that can influence traditional video playback or social reactions to live video are all examples of this trend. Together they show the worlds of traditional video broadcasting and siloed game-like apps coming together.

Ambisonics

Ambisonic technology is a method to render 3D sound fields in a spherical format around a particular point in space. It is conceptually similar to 360 video except the entire spherical sound field is audible and responds to changes in head rotation. There are many ways of rendering to an Ambisonic field, but all of them rely on decoding to a binaural stereo output to allow the user to perceive the spatial audio effect over a normal pair of headphones.
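As a rough illustration of what "rendering to an Ambisonic field" means, the sketch below encodes a single mono sample into four first-order channels. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalisation), with azimuth measured counter-clockwise from straight ahead. The function name is hypothetical, and other toolchains may use a different channel order or normalisation.

```python
import math

def encode_first_order(sample, azimuth, elevation):
    """Encode one mono sample arriving from (azimuth, elevation), in radians.

    Returns (W, Y, Z, X): assumed AmbiX layout (ACN order, SN3D normalisation).
    Azimuth is counter-clockwise from the front; elevation is upwards.
    """
    cos_el = math.cos(elevation)
    w = sample                                  # omnidirectional component
    y = sample * math.sin(azimuth) * cos_el     # left/right axis
    z = sample * math.sin(elevation)            # up/down axis
    x = sample * math.cos(azimuth) * cos_el     # front/back axis
    return (w, y, z, x)

# A source directly in front of the listener, then one directly to the left.
front = encode_first_order(1.0, 0.0, 0.0)
left = encode_first_order(1.0, math.pi / 2, 0.0)
```

A full mix is the sum of many such encoded sources, one frame per audio sample. The binaural decode to headphones is a separate step and is not shown here.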

Ambisonic audio can be of any order n, and higher orders use more channels. More channels result in higher spatial quality, although there is a limit to the perceived difference in sound quality as one goes beyond 3rd-order Ambisonics (16 channels of audio). Regardless of the number of channels used to encode the original signal, the decoded binaural output always has two channels. As the listener moves their head, the content of the decoded output stream shifts and changes accordingly, providing a 3D spatial effect.
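Two ideas from this paragraph can be sketched in a few lines. A full-sphere Ambisonic signal of order n uses (n + 1)^2 channels, which gives the 16 channels quoted for 3rd order. Head rotation can be handled by rotating the sound field before it is decoded. The rotation below covers yaw only and assumes first-order frames in (W, Y, Z, X) order, with yaw counter-clockwise in radians. Both function names are hypothetical.

```python
import math

def ambisonic_channel_count(order):
    """Number of channels in a full-sphere Ambisonic signal of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2

def rotate_for_head_yaw(frame, head_yaw):
    """Counter-rotate a first-order frame (W, Y, Z, X) for a head turned by head_yaw.

    Turning the head left by some angle moves every source right by the same
    angle relative to the head. W and Z do not change under a pure yaw rotation.
    """
    w, y, z, x = frame
    c, s = math.cos(head_yaw), math.sin(head_yaw)
    return (w, y * c - x * s, z, x * c + y * s)

counts = [ambisonic_channel_count(n) for n in range(4)]   # orders 0 to 3

# A source on the listener's left; after turning 90 degrees left it is in front.
source_on_left = (1.0, 1.0, 0.0, 0.0)
after_turn = rotate_for_head_yaw(source_on_left, math.pi / 2)
```

Pitch and roll need the full 3D rotation, and higher orders need larger rotation matrices. In every case the rotated field is still decoded to just two channels.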

Ambisonics is not the only way to render spatial audio for 360 videos. There are other solutions on the market as well, although the effectiveness, feature set, toolchains, and final render quality vary between techniques:

  • Traditional surround sound such as 5.1 or 7.1, which can be decoded over virtual speakers and rendered binaurally over headphones. Depending on the content, the rendered sound field may suffer from ‘holes’ between the speakers and won’t have the same smoothness in spatial accuracy or resolution.
  • Quad-binaural: 4 pairs of pre-rendered binaural stereo tracks, one each for head orientations of 0, 90, 180 and 270 degrees. The audio streams are crossfaded based on head rotation.
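The quad-binaural approach can be sketched as a crossfade between the two pre-rendered streams nearest to the current head yaw. The equal-power (sine/cosine) fade used here is an assumption, as is the function name. An actual player may use a different fade law.

```python
import math

# Head orientations, in degrees, for which the four stereo streams were rendered.
STREAM_ANGLES = (0.0, 90.0, 180.0, 270.0)

def quad_binaural_gains(yaw_deg):
    """Return one gain per stream for the given head yaw in degrees.

    Only the two streams adjacent to the yaw angle are audible. The
    equal-power fade keeps the sum of squared gains equal to 1.
    """
    yaw = yaw_deg % 360.0
    sector = int(yaw // 90.0)           # which 90-degree span the yaw falls in
    frac = yaw / 90.0 - sector          # 0.0 at the lower stream, 1.0 at the upper
    lower = sector % 4
    upper = (sector + 1) % 4

    gains = [0.0, 0.0, 0.0, 0.0]
    gains[lower] = math.cos(frac * math.pi / 2)
    gains[upper] = math.sin(frac * math.pi / 2)
    return gains

facing_front = quad_binaural_gains(0.0)      # only the 0-degree stream plays
halfway = quad_binaural_gains(45.0)          # equal blend of the 0 and 90 streams
```

Each output sample is then the gain-weighted sum of the four stereo streams. Because the blend only follows yaw, this format cannot respond to the listener looking up or down.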