
The Science Behind Acoustic Intelligence

MFCC, Mel Spectrograms, and Empirical Mode Decomposition in voice analysis

9 min read

Turning Sound into Data

Before any AI model can analyse a voice recording, the raw audio waveform must be transformed into meaningful numerical features. This transformation — from pressure waves captured by a microphone to structured data that reveals the speaker's mental and emotional state — is the domain of acoustic signal processing.

Happo AI employs multiple complementary feature extraction techniques, each capturing different aspects of the voice signal. Together, they form a rich, multi-dimensional representation far more informative than any single technique could provide alone.

Mel-Frequency Cepstral Coefficients (MFCC)

MFCCs are the workhorses of speech and audio analysis. They represent the short-term power spectrum of sound in a way that approximates human auditory perception. The 'Mel' in MFCC refers to the Mel scale — a perceptual scale that maps physical frequencies to how humans actually hear them, allocating finer resolution to lower frequencies (where our hearing is more discriminating) and compressing higher frequencies.

The extraction process involves windowing the audio signal into short frames, computing the frequency spectrum, applying Mel-scale filter banks, taking the logarithm (mimicking the ear's logarithmic loudness perception), and applying a discrete cosine transform to decorrelate the features. The result is a compact set of coefficients — typically 13 per frame — that efficiently encode the spectral shape of the voice.
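The pipeline above can be sketched in a few dozen lines of NumPy. This is a minimal, from-scratch illustration — the frame length, hop size, FFT size, and filter count below are common textbook defaults, not Happo AI's actual parameters:

```python
import numpy as np
from scipy.fft import rfft, dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the Mel scale
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr, n_mfcc=13, frame_len=400, hop=160, n_fft=512, n_mels=26):
    # 1. Window the signal into short overlapping frames (Hamming window)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Power spectrum of each frame
    power = np.abs(rfft(frames, n=n_fft, axis=1)) ** 2
    # 3. Mel filter bank, 4. logarithm, 5. discrete cosine transform
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T
    log_mel = np.log(mel_energy + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]

# Example: one second of a 200 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
coeffs = mfcc(np.sin(2 * np.pi * 200 * t), sr)
print(coeffs.shape)  # (98, 13)
```

Each row of the output is one frame's 13-coefficient spectral summary; the DCT at the end is what decorrelates the log filter-bank energies into the compact cepstral representation.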

Happo AI extends standard MFCC extraction with delta (velocity) and delta-delta (acceleration) coefficients, capturing not just the spectral shape at each moment but how it's changing over time. This temporal dynamic information is crucial for detecting conditions like panic (rapid spectral changes) versus depression (slow, minimal changes).

With 13 base MFCCs plus their first and second derivatives, each frame yields 39 coefficients, and the full MFCC representation grows to as many as 156 dimensions — a comprehensive spectral fingerprint that evolves over time.
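The delta computation itself is a standard regression over a small window of neighbouring frames. A sketch — the window width and edge handling here are common conventions, not necessarily Happo AI's choices:

```python
import numpy as np

def delta(features, N=2):
    # Regression-based delta over a window of +/- N frames;
    # edges are handled by repeating the first/last frame.
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
    T = len(features)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    num = sum(n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
              for n in range(1, N + 1))
    return num / denom

# Stack base coefficients with velocity and acceleration: 13 -> 39 dims
mfccs = np.random.default_rng(0).normal(size=(98, 13))
feats = np.hstack([mfccs, delta(mfccs), delta(delta(mfccs))])
print(feats.shape)  # (98, 39)
```

On a feature track that rises linearly, the delta comes out as the slope — exactly the "velocity" interpretation described above.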

Mel Spectrograms

While MFCCs compress spectral information into a small set of coefficients, Mel spectrograms preserve the full frequency resolution. A Mel spectrogram is a visual and numerical representation showing how the energy distribution across Mel-scaled frequency bands changes over time.

Mel spectrograms are particularly valuable for deep learning approaches, where neural networks can learn to identify patterns directly from the rich, image-like representation. The 257-dimensional Mel spectrogram features capture fine-grained spectral details that MFCCs might smooth over, including subtle harmonic structures and noise characteristics relevant to conditions like sleep apnea or fatigue.
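A Mel spectrogram reuses the same Mel filter bank as MFCC extraction but stops before the DCT, keeping every band. A compact sketch with SciPy (the 64-band, 512-point-FFT configuration is illustrative only):

```python
import numpy as np
from scipy.signal import stft

def mel_spectrogram(x, sr, n_fft=512, hop=160, n_mels=64):
    # Short-time Fourier transform -> power spectrogram
    _, _, Z = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(Z) ** 2                 # shape: (n_fft//2 + 1, n_frames)
    # Triangular Mel filter bank built from band edges on the Mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (freqs - lo) / (ctr - lo)
        down = (hi - freqs) / (hi - ctr)
        fb[i] = np.clip(np.minimum(up, down), 0, None)
    # Log-compress, as in the MFCC pipeline, but keep all Mel bands
    return np.log(fb @ power + 1e-10)

sr = 16000
t = np.arange(sr) / sr
S = mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr)
print(S.shape)  # (64, n_frames)
```

The result is the image-like time-frequency grid that convolutional networks consume directly, with none of the smoothing the DCT step introduces.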

Empirical Mode Decomposition (EMD)

EMD is an adaptive signal processing technique that decomposes a complex signal into its intrinsic mode functions (IMFs) — oscillatory components at different time scales that are embedded in the original signal. Unlike Fourier analysis, which assumes stationary signals, EMD handles the non-stationary, non-linear characteristics of real speech.

For voice analysis, EMD is particularly powerful for detecting tremor and micro-tremor in speech. Stress and anxiety produce involuntary muscle tension that manifests as tremor in the vocal folds — oscillations too subtle to hear but clearly visible when the signal is decomposed into its constituent modes.

The algorithm works through an iterative sifting process: identifying local extrema, computing envelope functions, and extracting oscillatory modes one by one from highest frequency to lowest. The resulting IMFs reveal the layered structure of the voice signal, separating the fundamental vibration from tremor components, breathing artifacts, and baseline drift.
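The sifting loop described above can be sketched with cubic-spline envelopes. This is a simplified illustration — real EMD implementations use a convergence criterion on the sifted component rather than the fixed sift count used here:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift(x, n_sift=10):
    # Extract one intrinsic mode function: fit spline envelopes
    # through the maxima and minima and subtract their mean
    # until the component is roughly symmetric about zero.
    h = x.copy()
    t = np.arange(len(x))
    for _ in range(n_sift):
        maxima = argrelextrema(h, np.greater)[0]
        minima = argrelextrema(h, np.less)[0]
        if len(maxima) < 3 or len(minima) < 3:
            break
        upper = CubicSpline(maxima, h[maxima])(t)
        lower = CubicSpline(minima, h[minima])(t)
        h = h - (upper + lower) / 2.0
    return h

def emd(x, max_imfs=5):
    # Peel off IMFs from highest frequency to lowest; the leftover
    # residue captures baseline drift.
    imfs, residue = [], x.astype(float)
    for _ in range(max_imfs):
        imf = sift(residue)
        imfs.append(imf)
        residue = residue - imf
        if len(argrelextrema(residue, np.greater)[0]) < 3:
            break
    return imfs, residue

# Example: a fast oscillation plus a slow tremor-like component
t = np.linspace(0, 1, 2000)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 5 * t)
imfs, residue = emd(x)
```

By construction the IMFs plus the residue sum back to the original signal, and the first IMF isolates the fastest oscillation — the layer where tremor-like components in voiced speech would surface.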

Spectral and Prosodic Features

Beyond these core techniques, Happo AI extracts a broad set of complementary features. Spectral features include spectral centroid (brightness), spectral rolloff (the frequency below which a fixed share of the spectral energy lies), spectral flux (frame-to-frame rate of spectral change), and zero-crossing rate (a rough measure of frequency content). These features characterise the overall tonal quality and energy distribution of the voice.
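Three of these features — centroid, rolloff, and zero-crossing rate — fit in a few lines of NumPy. A sketch for a single frame (the 85% rolloff threshold is a common convention, not a documented Happo AI setting):

```python
import numpy as np

def spectral_features(frame, sr):
    # Magnitude spectrum of one windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    energy = spec.sum() + 1e-10
    # Centroid: energy-weighted mean frequency ("brightness")
    centroid = (freqs * spec).sum() / energy
    # Rolloff: frequency below which 85% of the spectral energy lies
    rolloff = freqs[np.searchsorted(np.cumsum(spec), 0.85 * energy)]
    # Zero-crossing rate: fraction of samples where the sign flips,
    # a crude cue for frequency content and noisiness
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return centroid, rolloff, zcr

# Sanity check on a pure 1 kHz tone at 16 kHz
sr = 16000
t = np.arange(400) / sr
c, r, z = spectral_features(np.sin(2 * np.pi * 1000 * t), sr)
```

For the pure tone, the centroid lands near 1 kHz and the zero-crossing rate near 2 × 1000 / 16000 = 0.125 — a useful way to validate a feature extractor before trusting it on real voice data.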

Prosodic features capture the rhythm and melody of speech — fundamental frequency (pitch) contours, speaking rate, pause patterns, and intensity dynamics. Prosody is a rich source of information about emotional state: depressed speech tends to be slower with flatter pitch; anxious speech is faster with more pitch variation; manic speech shows extreme prosodic range.
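The backbone of those prosodic features is the pitch (F0) contour. One classic estimator is autocorrelation: the lag of the strongest correlation peak within the plausible pitch range gives the period. A per-frame sketch (the 75–400 Hz search range is a typical speech assumption, and production pitch trackers add voicing decisions and smoothing this sketch omits):

```python
import numpy as np

def estimate_f0(frame, sr, fmin=75, fmax=400):
    # Autocorrelation pitch estimate: the lag of the strongest peak
    # within the plausible pitch-period range gives 1 / F0.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

# Sanity check on a 150 Hz tone at 16 kHz
sr = 16000
t = np.arange(2048) / sr
f0 = estimate_f0(np.sin(2 * np.pi * 150 * t), sr)
```

Tracking this estimate frame by frame yields the pitch contour from which flatness (depression), variability (anxiety), and range (mania) statistics are derived.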

The combination of spectral, cepstral, temporal, and prosodic features creates feature vectors ranging from 6 to over 200 dimensions depending on the condition being analysed. This dimensionality is carefully tuned per condition — simpler conditions like stress detection require fewer features, while complex conditions like autism spectrum analysis benefit from richer representations.

From Features to Insights

The signal processing pipeline transforms each audio segment into a structured feature matrix that downstream AI models use for classification and severity estimation. The quality and relevance of these features directly determine the accuracy of the final analysis.

Happo AI's feature extraction pipeline was developed through doctoral research, iteratively refined through clinical validation studies. Each technique was selected not just for its theoretical properties, but for its demonstrated ability to capture the acoustic correlates of specific mental health conditions in real-world recordings with varying quality, background noise, and speaker characteristics.