Engineering the Future: The Promise and Perils of Voice AI
By Rupal Patel
Early days at VocaliD
If we are honest, we started VocaliD to change the world, or at least to change the way voice is represented within it. We are proud to know that for some individuals, our technology has changed their world and improved their lives and the lives of those around them.
Building a technology company with the goal of changing societal norms comes with significant responsibilities. Every action has an impact, and when you bring a new technology into the world, fully predicting every impact is impossible. Chaos theory, the butterfly effect: despite our best intentions, there will be unintended consequences. And so, as a company that began with the goal of empowering every voice to be heard, how do we limit the misuse of our work?
We have asked ourselves hard questions early and often. What do we stand for? Are we opening a Pandora's box? How will we act responsibly today to mitigate risks in the future? It is never an easy exercise, but nowhere is it as challenging as in building a technology company within a nascent space. There is no formula to follow, and you have to be comfortable with the discomfort of not knowing.
In the early stages, we were just figuring out who we were; then, as the industry grew around us, we needed to respond to the changing environment. We needed to decide how and whether to pivot or expand. We asked ourselves whether we could grow and still stay true to ourselves.
When we began this company in 2014, we could only begin to imagine the potential for voice. Our initial focus was on the needs of individuals living without speech, or with impaired speech. It soon became apparent that the need for unique synthetic voices extended beyond assistive technology.
Technological Advances Bring New Opportunities and New Risks
A few years ago, the technology didn't allow for subtle nuances in voice, hence the robotic sameness that existed for so long. Today, we can emulate the unique characteristics of individual voices and create vocal identities that have personality. As synthetic voice approaches ultra-realistic quality, it brings unparalleled opportunities for individuals and businesses, but also increased risk in the form of fraud and deception. These concerns are on the minds of many technologists, especially those of us working in synthetic media. How can we ensure that the work we do contributes positively, rather than fueling harmful and nefarious activities?
As we prepared for last year's GDPR compliance, we realized that compliance alone was not enough to combat the misappropriation of these technologies. We need more proactive engagement from the AI community to ward off misuses well before they happen.
How do we do this? What are the steps? It is important to note that what we think is adequate today will evolve with the growth of the industry and the consumers of this technology. While we can't possibly anticipate all potential misuses and build countermeasures to block them, we believe it is essential to draw attention to these complex issues in the formative stage of the synthetic media industry, so that developers think through the unintended consequences and consumers have the awareness to carefully evaluate the products.
Starting the AI Ethics Conversation
"We cannot afford to blur the lines between the virtual and physical worlds to the extent that it compromises the core values of our society. We cannot blindly build technologies without understanding that these new tools will change us fundamentally."
We believe education and open, ongoing communication are the way forward. With this in mind, we've partnered with Modulate to form the AITHOS Coalition, a way for all of us working within synthetic media, from technologists and CEOs to sales and marketing professionals, to help shape the field with mindful attention to the overarching issues.
Together, we came up with a self-reflection guide to facilitate the critical discussions we were having within our organizations and amongst others in the field. Some of the topics may not be relevant to everyone, but we believe they are a starting point for the conversation and are worth considering when building products that have the potential both to disrupt and to be misused.
VocaliD’s Take on the AITHOS Dialog
Why is this Synthetic Media Valuable in the First Place?
When we began, it was obvious. A young girl and a middle-aged man are so inherently different - unique in age, gender, and personality. Of course, they both deserved to have unique vocal identities to speak through their assistive devices. However, this isn’t what was happening. The devices had few voices available and too often, they didn’t represent the person or personality using the prosthesis. VocaliD set out to change that by building personalized synthetic voices.
As we continue this important work, we have begun listening to all of the voices around us, and we've realized that the issue we were solving doesn't only exist in the assistive technology space; it exists in everyday life. All you need to do is listen, and you will hear very quickly that the synthetic voices we come into contact with in our daily lives are bland, at best. The current voice offerings are not representative of all of us, of our communities, of our needs. For voice actors, this means that only white or white-sounding men and women are voicing nearly all of the virtual assistants, IVR systems, alert and notification systems, and gaming avatars. VocaliD democratizes synthetic voice, creating vocal identities that are as colorful and diverse as the world we live in. Whether it is regional or social dialects or internationally accented English, today’s synthetic voice should sound like us - all of us.
Who or What Should Your Tech be Able to Emulate?
Scientists and artists have been fascinated with emulating speech for centuries. What’s different today is that we are not only emulating speech but also voice: how someone sounds, the actual vocal identity of the speaker, which makes it susceptible to misuse. Adding to the problem, we are now saving and archiving audio data at an unprecedented rate. CEOs, political figures, and influencers all have hours of relatively clean audio that is easily accessible to anyone. Moreover, there is a push to make tools and technologies open source, both to fuel innovation and to equalize access, but with that comes risk. We use proprietary techniques that we purposefully do not open source as a precautionary measure.
Can we build the voices of children, influencers, and other vulnerable populations? Technically, yes: we can build anyone’s voice if we have enough data. We have policies in place to protect data, and we have guidelines for voice building. From day one, our approach was to blend voices for those with disabilities. Today, as we work with enterprises, we require that the talent has consented to the use of their voice before we start the project. Moreover, we take proactive steps to ensure that, should our data or technology fall into the wrong hands, it would be sufficiently obfuscated.
When Should You Share How Your Technology Works with the World?
Given the anticipated consequences, we feel it is responsible to be cautious about what, and how much, we share regarding how the technology works. While we are committed to advancing the field through the dissemination of our technology and findings, we also understand that some aspects of the IP and know-how need to remain undisclosed.
We are focused on educating consumers at this stage, before synthetic voices become indiscernible to humans. Engaging in damage control once audio deepfakes proliferate would be a dangerous proposition.
Where Can You Sell Your Technology While Still Ensuring it is Used Responsibly?
Should we sell just because we can? No. This is a matter of principle. We purposefully seek out engagements only with organizations and individuals that are aligned with our values. Our licensing agreements and business contracts also reflect these fundamentals.
What Data Does Your Machine Learning Process Use?
Voice is complex and highly personal. The very nature of our work is to build a more inclusive and diverse universe of synthetic voice. Our machine learning algorithms consider a broad demographic varying in age, gender, language background, and geography to reflect real-world variation.
How Can It Be Detected?
Synthetic voice has advanced considerably in the past few decades. Most recently, the availability of large datasets and machine learning tools has catapulted the field forward. Soon, synthetic voice will become indistinguishable from real audio. We are working on a multi-pronged detection strategy that encompasses audio steganography (watermarking), voice blending, and countermeasure tools. We began talking about this as the moonshot for voice AI - the need to build voices that are life-like without being deceptive, that unite us rather than divide us. We cannot afford to blur the lines between the virtual and physical worlds to the extent that it compromises the core values of our society. We cannot blindly build technologies without understanding that these new tools will change us fundamentally. For us, that change needs to be a net positive.
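To give a sense of what audio watermarking means in practice, here is a deliberately naive sketch, not VocaliD's actual (undisclosed) scheme. It hides a bit string in the least significant bits of 16-bit PCM samples, a perturbation of one quantization step that is inaudible in typical speech audio. A production watermark would also need to survive compression, resampling, and deliberate removal, which this toy version does not.

```python
def embed_watermark(samples, payload_bits):
    """Hide payload_bits in the least significant bit of successive
    16-bit PCM samples. Each sample changes by at most 1, i.e. one
    quantization step out of 32768 of full scale."""
    marked = list(samples)
    for i, bit in enumerate(payload_bits):
        # Clear the LSB, then set it to the payload bit.
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract_watermark(samples, n_bits):
    """Read the payload back out of the LSBs of the first n_bits samples."""
    return [s & 1 for s in samples[:n_bits]]


audio = [100, -3, 0, 32767, -32768, 5, 6, 7, 8, 9]  # toy PCM samples
payload = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed_watermark(audio, payload)
recovered = extract_watermark(marked, len(payload))
```

A detector that knows the convention can then read the payload back out of suspect audio and flag it as machine-generated, which is the role watermarking plays in a broader detection strategy.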
Just the Beginning
This is an exciting time in voice. Advances in technology are coming at breakneck speed, allowing us to offer world-class synthetic voices that truly represent all of us. For individuals and brands that have relied on the voices of only a few, this is amazing news. As voice-first interfaces continue to expand across all aspects of our lives, from customer service to health care and entertainment, customized synthetic voice will power these user experiences. Protecting synthetic media from potential abuse will require a joint, collaborative effort to create the most advanced and impenetrable barriers that protect us all. We hope you will join the conversation and the AITHOS Coalition.