Real-Time Speaker Analysis in Sales Calls
How automatic speaker diarization and voiceprint identification transform sales intelligence
The Multi-Speaker Challenge
In any real-world sales call or counselling session, multiple people are speaking. A dental consultation might involve the patient, a treatment coordinator, and a dentist. A sales call includes the salesperson and one or more prospects. To generate meaningful analytics, the system must first answer a fundamental question: who is speaking when?
This is the problem of speaker diarization — automatically segmenting an audio stream into homogeneous regions, each corresponding to a single speaker. It's a problem that sounds simple but is computationally demanding, especially when done in real time with no prior knowledge of how many speakers are present or what they sound like.
Dual-Layer Voice Embeddings
Happo AI employs a dual-layer approach to speaker representation. Each segment of speech is converted into two distinct embedding vectors — compact numerical representations that capture the unique characteristics of a speaker's voice.
The first layer produces a compact 192-dimensional embedding optimised for fast comparison and real-time processing. The second layer generates a richer 1,024-dimensional embedding that captures finer voice characteristics for high-accuracy identification. Combining the two layers lets the system balance speed against accuracy.
The combined 1,216-dimensional speaker representation (192 + 1,024) captures characteristics ranging from fundamental frequency patterns to subtle vocal tract resonances, producing a voiceprint distinctive enough to tell one speaker from another.
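To make the arithmetic concrete, here is a minimal sketch of how two per-layer embeddings could be joined into a single 1,216-dimensional vector. The function names and the choice to normalise each layer before concatenating are illustrative assumptions, not a description of Happo AI's actual implementation.

```python
import math

FAST_DIM = 192    # compact layer size stated in the text
RICH_DIM = 1024   # richer layer size stated in the text


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_embeddings(fast, rich):
    """Concatenate the two layer embeddings into one 1,216-dimensional vector.

    Normalising each layer first is an assumption made for this sketch, so
    that neither layer dominates a later comparison purely because of scale.
    """
    if len(fast) != FAST_DIM or len(rich) != RICH_DIM:
        raise ValueError("unexpected embedding size")
    return l2_normalize(fast) + l2_normalize(rich)


# Dummy values stand in for real model output.
combined = combine_embeddings([0.5] * FAST_DIM, [2.0] * RICH_DIM)
print(len(combined))  # 1216
```

With both halves scaled to unit length, the compact and the rich layer contribute equally to any distance computed on the combined vector. A real system might weight them differently.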
Real-Time Diarization
During a live audio stream, Happo AI processes incoming audio in chunks. For each chunk, speaker embeddings are extracted and compared against an evolving model of the speakers in the conversation. The system uses adaptive clustering to group segments by speaker without requiring any advance knowledge of who is participating.
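The chunk-by-chunk grouping described above can be sketched as a simple online clustering loop. This is one common way to cluster without knowing the number of speakers in advance. The class name, the running-mean centroids, and the 0.7 threshold are assumptions chosen for illustration, and the source does not specify which adaptive clustering method is used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineDiarizer:
    """Hypothetical incremental clusterer: one running-mean centroid per speaker."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = []         # evolving model of each speaker
        self.counts = []            # segments seen per speaker

    def assign(self, embedding):
        """Return a speaker index for this chunk, creating a speaker if needed."""
        best_idx, best_sim = -1, -1.0
        for idx, centroid in enumerate(self.centroids):
            sim = cosine_similarity(embedding, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= self.threshold:
            # Update the matched speaker's centroid with a running mean.
            n = self.counts[best_idx]
            self.centroids[best_idx] = [
                (c * n + e) / (n + 1)
                for c, e in zip(self.centroids[best_idx], embedding)
            ]
            self.counts[best_idx] = n + 1
            return best_idx
        # No existing speaker is close enough: start a new one.
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1


diarizer = OnlineDiarizer()
chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]  # toy 2-D embeddings
labels = [diarizer.assign(e) for e in chunks]
print(labels)  # [0, 0, 1, 1]
```

The threshold controls the trade-off: set it too low and distinct speakers merge, set it too high and one speaker splits into several labels.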
Diarization runs in parallel with all other analyses: sentiment tracking, condition detection, and transcription happen simultaneously. Because the analyses run side by side rather than one after another, diarization is not an extra sequential step and adds no latency of its own to the analysis pipeline. Results are delivered with speaker labels attached, enabling per-speaker timelines and individual performance metrics.
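As a rough illustration of this layout, the sketch below runs several analyses on the same chunk concurrently and merges their results under a speaker label. All coroutine names and return values are placeholders invented for the example. It uses asyncio, which interleaves tasks on one thread, so a production system running heavy models would more likely use separate processes or workers.

```python
import asyncio


async def diarize(chunk):
    """Placeholder for speaker labelling of one audio chunk."""
    await asyncio.sleep(0)
    return "Speaker 1"


async def score_sentiment(chunk):
    """Placeholder for sentiment tracking on the same chunk."""
    await asyncio.sleep(0)
    return 0.25


async def transcribe(chunk):
    """Placeholder for transcription of the same chunk."""
    await asyncio.sleep(0)
    return "hello"


async def analyse_chunk(chunk):
    """Run all analyses concurrently, then attach the speaker label."""
    speaker, sentiment, text = await asyncio.gather(
        diarize(chunk), score_sentiment(chunk), transcribe(chunk)
    )
    return {"speaker": speaker, "sentiment": sentiment, "text": text}


result = asyncio.run(analyse_chunk(b"\x00\x01"))
print(result["speaker"])  # Speaker 1
```

In this arrangement the time to process a chunk is set by the slowest analysis rather than by the sum of all of them.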
Speaker Enrollment and Recognition
Beyond anonymous diarization (labelling speakers as Speaker 1, Speaker 2, etc.), Happo AI supports speaker enrollment — registering known individuals so they can be recognised automatically in future sessions.
When a speaker is enrolled, their voice embeddings are stored as a reference profile. In subsequent sessions, the system compares detected speakers against the enrolled database using cosine similarity matching. This enables tracking individual sales staff performance across multiple calls, or monitoring a patient's vocal health across counselling sessions over time.
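A minimal sketch of cosine similarity matching against an enrolled database follows. The dictionary layout, the `identify` function, and the 0.75 acceptance threshold are assumptions for illustration only, and the toy vectors are three-dimensional for readability.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(embedding, enrolled, threshold=0.75):
    """Return the best-matching enrolled name, or None if nobody is close enough."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical reference profiles keyed by speaker name.
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}

print(identify([0.9, 0.1, 0.0], enrolled))  # alice
print(identify([0.5, 0.5, 0.7], enrolled))  # None
```

Returning `None` below the threshold lets an unrecognised voice fall back to an anonymous label such as Speaker 1 instead of being forced onto the nearest enrolled profile.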
The enrollment process requires only a short sample of natural speech — no special recording session is needed. The system can enroll speakers retroactively from previously analysed recordings, making it easy to build a speaker database from existing call archives.
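One plausible way to build a reference profile from archived segments is to average the embeddings already attributed to a speaker and normalise the result. The sketch below assumes that approach and a hypothetical `build_profile` helper. The source does not say how profiles are actually computed.

```python
import math


def build_profile(segment_embeddings):
    """Average a speaker's segment embeddings into one unit-length profile."""
    if not segment_embeddings:
        raise ValueError("at least one segment is required")
    count = len(segment_embeddings)
    dim = len(segment_embeddings[0])
    mean = [sum(e[i] for e in segment_embeddings) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean


# Toy embeddings taken from segments of a previously analysed recording.
archived_segments = [[1.0, 0.0], [0.0, 1.0]]
profile = build_profile(archived_segments)
print(len(profile))  # 2
```

Averaging several segments smooths out variation within a call, which is why even a short sample of natural speech can yield a usable profile.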
Per-Speaker Sales Analytics
With speakers identified, every metric becomes attributable. The system can report that the salesperson's empathy scores dropped during the pricing discussion, or that the patient's stress levels peaked when hearing about a specific treatment option. This granularity transforms raw audio into actionable coaching insights.
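Once each segment carries a speaker label, attribution comes down to grouping and averaging. The sketch below uses invented segment records, field names, and scores purely to show the shape of that aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical labelled segments; values are made up for illustration.
segments = [
    {"speaker": "salesperson", "topic": "intro", "empathy": 0.8},
    {"speaker": "salesperson", "topic": "pricing", "empathy": 0.5},
    {"speaker": "salesperson", "topic": "pricing", "empathy": 0.3},
    {"speaker": "prospect", "topic": "pricing", "empathy": 0.6},
]


def average_by_speaker_and_topic(rows, metric):
    """Group segments by (speaker, topic) and average the chosen metric."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["speaker"], row["topic"])].append(row[metric])
    return {key: mean(values) for key, values in grouped.items()}


summary = average_by_speaker_and_topic(segments, "empathy")
for (speaker, topic), score in sorted(summary.items()):
    print(f"{speaker:12s} {topic:8s} {score:.2f}")
```

In this toy data the salesperson's average empathy score falls from the intro to the pricing discussion, which is the kind of pattern a coach would want flagged.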
Sales managers can compare how different team members handle objections, track individual improvement over time through sales journals, and identify specific moments where a sale was won or lost. The per-speaker analysis turns every call into a structured learning opportunity.