
When people think about video game audio, they usually think about music.
In reality, music is often only a small part of the work.
- Music can elevate a game, but immersion is built long before the soundtrack starts.
- The best audio design is often the one players never consciously notice.
- Without strong diegetic audio, the world stops feeling alive.
Of course, there are exceptions. In Clair Obscur: Expedition 33, Lorien Testard's soundtrack elevated the experience and, in my opinion, helped give the game worldwide reach through the countless covers and reinterpretations made by fans.
This article is here to give newcomers — but also people already working in the industry — a broader perspective on all the invisible work hidden behind audio. Quite literally invisible, since you do not “see” audio, even though it constantly shapes the experience.
My Audio Cupcake Analogy

- Diegetic audio is the cake itself. The ingredients, the way they are assembled, and the baking process determine whether the final result succeeds or fails. In audio, this is the core of the experience: footsteps, breathing, heartbeats, interactions, environmental sounds… everything that gives life and credibility to the world.
- Essential non-diegetic audio is the frosting. UI sounds, menu feedback, transitions, notifications, and interaction cues are not part of the world itself, but they are necessary for clarity, comfort, and usability. If done poorly, they can negatively affect the whole experience.
- Optional non-diegetic audio is the cherry on top: the music. It is not always essential, but it can make a huge difference emotionally. Music can elevate the experience, strengthen immersion and help to create memorable moments.
The mistake is often to focus on the frosting or the cherry first, while neglecting the cake underneath.
A strong audio experience starts with a solid diegetic foundation, then adds non-diegetic layers to enhance and elevate the whole experience.
Diegetic & Non-Diegetic Audio

Diegetic Audio
Diegetic audio is every sound that belongs to the world itself.
Players feel it without noticing it. Most of the time, audio works subconsciously. It creates tension, comfort, and a sense of danger without the player actively thinking about it.
Silence can be as important as sound, and sometimes it creates an even stronger sense of danger.
Good audio provides constant feedback. It makes the game feel responsive and helps the world feel alive.
Audio can guide the player naturally, without breaking immersion.
Great audio design is often invisible, but players immediately notice when it is missing:
- Reverb applied differently depending on the environment gives players a sense of space: whether a room feels large, small, empty, crowded, metallic, or enclosed.
- The exact same event, such as a footstep or an object hitting the ground, must adapt to the context in which it plays (the surface material, for example) to convince the player that the world around them truly exists.
- A sound repeated dozens, hundreds, or even thousands of times should never be perfectly identical. Tiny variations in volume, pitch, timing, or layering matter enormously. Even when the player does not consciously notice it, the brain does. Exact repetition breaks immersion surprisingly quickly. Just clap your hands several times: no two sounds will ever be exactly the same.
- Audio also gives life to parts of the world the player cannot even see: distant crowds in a city, a siren far away, birds chirping, leaves rustling in the trees. These everyday sounds are essential for immersion. Interestingly, they are often much cheaper to create than visuals since they require no additional 3D assets.
- Similarly, a machine can emit complex beeps and electronic noises to suggest advanced technology without requiring expensive animations or VFX. Once again, sound can reinforce immersion while reducing production costs.
- Audio must also react to the player’s actions. This is essential both for immersion and gameplay readability. If the character grunts after being hit or starts breathing heavily after sprinting for too long, players instinctively understand those situations are dangerous — no tutorial required. Sound becomes an invisible guide.
- Realism is not always the goal. In fact, “cheating” audio often strengthens immersion rather than breaking it.
Imagine the breathing of a terrifying creature nearby. Realistically, you would only hear someone breathing from a few meters away. But exaggerating that sound and letting the player hear it through the next room can build anticipation and tension. Like Chekhov’s gun, the important part is not the event itself, but preparing the audience emotionally beforehand.
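Two of the points above, surface-dependent footsteps and tiny variations on every repetition, can be sketched in a few lines. This is only a minimal illustration, not how any particular engine or middleware does it: the sample names, the `FootstepVariator` class, and the volume and pitch ranges are all assumptions chosen for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical sample names; a real project would reference actual audio assets.
FOOTSTEP_SAMPLES = {
    "wood": ["wood_01", "wood_02", "wood_03"],
    "metal": ["metal_01", "metal_02", "metal_03"],
    "grass": ["grass_01", "grass_02", "grass_03"],
}


@dataclass
class PlayRequest:
    sample: str    # which recording to play
    volume: float  # linear gain, 1.0 = unchanged
    pitch: float   # playback-rate multiplier, 1.0 = unchanged


class FootstepVariator:
    """Picks a footstep for a surface and never plays the same sample twice in a row."""

    def __init__(self, rng=None):
        self._rng = rng or random.Random()
        self._last = {}  # last sample played, per surface

    def next_step(self, surface):
        samples = FOOTSTEP_SAMPLES[surface]
        # Exclude the sample that was just played on this surface.
        candidates = [s for s in samples if s != self._last.get(surface)]
        sample = self._rng.choice(candidates)
        self._last[surface] = sample
        # Small random offsets (assumed ranges) so no two steps are identical.
        return PlayRequest(
            sample=sample,
            volume=self._rng.uniform(0.85, 1.0),
            pitch=self._rng.uniform(0.95, 1.05),
        )


variator = FootstepVariator()
for _ in range(4):
    print(variator.next_step("metal"))
```

Even with only three recordings per surface, the combination of sample choice, volume, and pitch means the player almost never hears exactly the same footstep twice.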
Non-Diegetic Audio
Non-diegetic audio refers to sounds that exist for the player, not for the world itself.
- At the very minimum, this includes UI feedback sounds when menus are not diegetic. A simple click confirms that the game registered an action, while a descending or dissonant sound can indicate that an action is forbidden.
- Then comes music.
Personally, I believe music should only exist when it truly adds something. Audio budgets are rarely huge, and “wallpaper music” (constant background music that tells the audience exactly what to feel every second) often weakens emotional impact instead of enhancing it. Strong music should support emotion, not replace it.
- Twenty years ago, endlessly looping a short track while the player stayed in an area was acceptable; today it is not. Players have too many alternatives for that kind of repetition to go unnoticed. Creating twenty completely different tracks for every area would be far too expensive, so modern game music had to evolve.
- This is where adaptive music systems become fascinating.
There are two major approaches, vertical layering and horizontal layering (more commonly called horizontal re-sequencing), and they can be combined.
- Vertical layering treats music like stacked layers of instruments. Depending on gameplay intensity, layers can dynamically appear or disappear. Randomizing certain layers also creates subtle variation between playthroughs.
- Horizontal layering works more like assembling a train from different wagons. Musical segments are chained together dynamically, reusing and recombining pieces to generate variation. Segments can also be grouped by intensity level, allowing the system to adapt organically to gameplay.
With these systems, it becomes possible to create hours of evolving music without obvious repetition.
- Narrator voices are much rarer in games than in films because player actions constantly alter the flow of events, making narration difficult to adapt dynamically. That is precisely why The Stanley Parable works so brilliantly: the narration intentionally plays with the player and breaks the fourth wall in clever ways.
- Narration can also place the player directly at the center of the story itself, almost like the game is speaking to them personally. This is one of the reasons I often recommend Moss in VR.
- Voice narration can also replace traditional written tutorials that many players simply skip. A great example of this is A Fisherman’s Tale, where guidance feels natural and integrated into the experience.
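To make the two adaptive music approaches described earlier more concrete, here is a rough sketch of both: a vertical function that decides which instrument layers are audible at a given gameplay intensity, and a horizontal function that chains the next musical segment from a pool grouped by intensity. Everything here is an assumption for illustration only: the layer names, thresholds, segment names, and the 0.0 to 1.0 intensity scale. Audio middleware provides far richer versions of these ideas.

```python
import random

# Hypothetical layers and the intensity (0.0 to 1.0) at which each one enters.
LAYER_THRESHOLDS = {
    "pads": 0.0,
    "percussion": 0.3,
    "strings": 0.6,
    "brass": 0.85,
}

# Hypothetical musical segments, grouped by intensity level.
SEGMENTS = {
    "calm": ["calm_a", "calm_b", "calm_c"],
    "tense": ["tense_a", "tense_b"],
    "combat": ["combat_a", "combat_b", "combat_c"],
}


def active_layers(intensity):
    """Vertical approach: return the layers that should be audible right now."""
    return [name for name, threshold in LAYER_THRESHOLDS.items()
            if intensity >= threshold]


def intensity_level(intensity):
    """Map a continuous intensity to one of the segment groups (assumed cut-offs)."""
    if intensity < 0.3:
        return "calm"
    if intensity < 0.7:
        return "tense"
    return "combat"


def next_segment(intensity, previous=None, rng=random):
    """Horizontal approach: pick the next segment, avoiding an immediate repeat."""
    pool = SEGMENTS[intensity_level(intensity)]
    candidates = [s for s in pool if s != previous] or pool
    return rng.choice(candidates)


# Simulate a short session where the intensity rises and then falls.
previous = None
for intensity in (0.1, 0.4, 0.9, 0.9, 0.2):
    previous = next_segment(intensity, previous)
    print(intensity, previous, active_layers(intensity))
```

Combining both functions, as the loop does, shows how a small set of segments and layers can already produce many distinct musical states.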
Conclusion

In game production, audio teams often arrive very late in development and are forced to adapt to everything that already exists. Ironically, involving audio from day one would usually be cheaper while also improving storytelling, player guidance, immersion, and emotional intensity.
A game can absolutely exist without non-diegetic audio. Many great games barely use music at all.
But a game without diegetic audio — without a living world — is almost unthinkable. It would feel as incomplete as a game without visuals.
So when budgets are limited — and they almost always are — prioritization becomes essential.
Build the cake first. Then add the frosting. And then, if you have the budget, add the cherry on top.
And finally: music may only represent a small percentage of the overall audio workload, but it often represents a very large part of the budget due to original composition, recording, performers, and licensing rights.

