AI Earbuds Learn to Speak in Your Voice
From Sci-Fi to Real-Time
A groundbreaking prototype headset is redefining real-time translation by not only converting language on the fly but also cloning each speaker’s voice in the process. Developed by a team of researchers at Meta, the system can translate a conversation between multiple people speaking different languages while preserving the unique voice and tone of each participant. This marks a notable departure from traditional translation apps and services, which typically render output in a single generic voice. Because the system operates in real time, it is well suited to smart earbuds or AR glasses—essentially bringing the Star Trek-style universal translator closer to reality.
How It Works—and Where It Goes
The technology combines large language models with text-to-speech synthesis, integrating speaker recognition and multilingual voice cloning. It works in three stages: first transcribing speech into text, then translating that text, and finally synthesizing speech that sounds like the original speaker—even capturing emotional tone and cadence. Though the demo was restricted to six languages and controlled lab conditions, Meta sees potential for expansion. The team acknowledges, however, that deploying the system at scale must address latency, privacy, on-device compute, and the possibility of misuse—particularly the impersonation risks posed by cloned voices.
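The three-stage pipeline described above can be sketched in outline. This is a minimal illustrative sketch, not Meta's actual implementation: the function names, the `Utterance` type, and the stub stages (which just tag strings rather than run real ASR, translation, or TTS models) are all assumptions made for clarity.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker_id: str  # stands in for a learned speaker/voice embedding
    audio: bytes     # one speaker turn of raw audio

# The three stages below are placeholders for real models (ASR,
# machine translation, voice-cloning TTS); names and signatures are
# hypothetical, chosen only to show how the stages chain together.

def transcribe(audio: bytes) -> str:
    """Speech-to-text. Stub: pretend the audio decodes directly to text."""
    return audio.decode("utf-8")

def translate(text: str, target_lang: str) -> str:
    """Text-to-text translation. Stub: tag the text with the target language."""
    return f"[{target_lang}] {text}"

def synthesize(text: str, speaker_embedding: str) -> bytes:
    """Voice-cloning TTS: render text in the original speaker's voice.
    Stub: tag the output with the speaker embedding."""
    return f"({speaker_embedding}) {text}".encode("utf-8")

def translate_turn(utt: Utterance, target_lang: str) -> bytes:
    """Transcribe -> translate -> re-synthesize in the speaker's own voice."""
    text = transcribe(utt.audio)
    translated = translate(text, target_lang)
    return synthesize(translated, speaker_embedding=utt.speaker_id)
```

Running one turn through the stubbed pipeline shows how the speaker identity travels with the audio from input to output, which is the key difference from a generic-voice translator:

```python
out = translate_turn(Utterance("alice", b"hola"), "en")
# the stub output carries both the speaker tag and the target language
```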