Engineering the Future: The Promise and Perils of Voice AI
By Rupal Patel
Early days at VocaliD
If we are honest, we started VocaliD to change the world, or at least to change the way voice is represented within the world. We are proud to know that for some individuals, our technology has changed THEIR world and improved their lives and the lives of those around them.
Building a technology company with the goal of changing societal norms comes with significant responsibilities. When you do anything, you impact something. When you bring a new technology into the world, it is impossible to fully predict every impact. Call it chaos theory or the butterfly effect: despite our best intentions, there will be unintended consequences. And so, as a company that began with the goal of empowering every voice to be heard, how do we limit the misuse of our work?
We have asked ourselves hard questions early and often. What do we stand for? Are we opening a Pandora’s box? How will we act responsibly today to mitigate risks in the future? It is never an easy exercise, but nowhere is it as challenging as when building a technology-based company within a nascent space. There isn’t a formula to follow and you have to be comfortable with the discomfort of not knowing.
In the early stages, we were just figuring out who we were and then, as the industry has grown around us, we needed to respond to the changing environment. We needed to decide how and whether to pivot or expand. We asked ourselves whether we can grow and still stay true to ourselves.
When we began this company in 2014, we could only begin to imagine the potential for voice. Our initial focus was on the needs of individuals living without speech, or with impaired speech. It soon became apparent that there was a need for unique synthetic voice beyond assistive technology.
Technological Advances Bring New Opportunities and New Risks
A few years ago, technology didn’t allow for subtle nuances in voice, hence the robotic sameness that existed for so long. Today, we can emulate the unique characteristics of individual voices and create vocal identities that have personality. As synthetic voice approaches ultra-realistic quality, it brings with it both unparalleled opportunities for individuals and businesses and increased risk in the form of fraud and deception. These concerns are on the minds of many technologists, especially those of us working in synthetic media. How can we ensure that the work we do is contributing positively, rather than fueling harmful and nefarious activities?
As we prepared for GDPR compliance last year, we realized that compliance alone was not enough to combat the misappropriation of technologies. We need more proactive engagement by the AI community to ensure we ward off misuse well before it happens.
How do we do this? What are the steps? Well, it is important to note that what we think is adequate today is going to evolve… it will depend on the growth of the industry and the consumers of this technology. While we can’t possibly anticipate all potential misuses and build countermeasures to block them, we believe it is essential to draw attention to these complex issues in the formative stage of the synthetic media industry so that developers think through the unintended consequences and consumers have the awareness to carefully evaluate the products.
Starting the AI Ethics Conversation
"We cannot afford to blur the lines between the virtual and physical world to the extent that compromises the core values of our society. We cannot blindly build technologies without understanding that these new tools will change us fundamentally."
We believe education and open, ongoing communication are the way forward. With this in mind, we’ve partnered with Modulate to form the AITHOS Coalition as a way for all of us working within synthetic media, from technologists and CEOs to sales and marketers, to help shape the field with mindful attention to the overarching issues.
Together, we came up with a self-reflection guide to facilitate the critical discussions we were having within our organizations and amongst others in the field. Some of the topics may not be relevant to all, but we believe they are a starting point in the conversation and are worth considering when building products that have the potential to both disrupt and be misused.
VocaliD’s take on the AITHOS dialog
Why is this Synthetic Media Valuable in the First Place?
When we began, it was obvious. A young girl and a middle-aged man are so inherently different - unique in age, gender, and personality. Of course, they both deserved to have unique vocal identities to speak through their assistive devices. However, this isn’t what was happening. The devices had few voices available and too often, they didn’t represent the person or personality using the prosthesis. VocaliD set out to change that by building personalized synthetic voices.
As we continue to do this important work, we have begun listening to all of the voices around us and we’ve realized that the issue we were solving didn’t only exist in the assistive technology space; it exists in everyday life. All one needs to do is listen, and you will hear very quickly that the synthetic voices we come in contact with in our daily lives are bland, at best. The current voice offerings are not representative of all of us, of our communities, of our needs. For voice actors, this means that only white men or women, or white-sounding men and women, are voicing nearly all of the virtual assistants, IVR systems, alerts and notification systems, and gaming avatars. VocaliD democratizes synthetic voice, creating vocal identities that are as colorful and diverse as the world we live in. Whether it is regional or social dialects or internationally accented English, today’s synthetic voice should sound like us - all of us.
Who or What Should Your Tech be Able to Emulate?
Scientists and artists have been fascinated with emulating speech for centuries. What’s different today is that we are not only emulating speech but also voice: how someone sounds, the actual vocal identity of the speaker, which makes it susceptible to misuse. Adding to the problem, we are now saving and archiving audio data at an unprecedented rate. CEOs, political figures, and influencers all have hours of relatively clean audio that is easily accessible to anyone. Moreover, there is a push to make tools and technologies open source to both fuel innovation and equalize access, but with that come risks. We use proprietary techniques that we purposefully do not open source as a precautionary measure.
Can we build the voices of children, influencers and other vulnerable populations? Yes, we can build anyone’s voice if we have enough data. We have policies in place to protect data and we have guidelines for voice building. From day one, our approach was to blend voices for those with disabilities. Today, as we work with enterprises, we require that the talent has consented to the use of their voice before we start the project. Moreover, we take proactive steps to ensure that should our data or tech fall into the wrong hands, it would be sufficiently obfuscated.
When Should You Share How Your Technology Works with the World?
Given anticipated consequences, we feel it is responsible to be cautious regarding what and how much we share about how the technology works. While we are committed to advancing the field through the dissemination of our technology and findings, we also understand that some aspects of the IP and know-how need to be undisclosed.
We are focused on educating consumers at this stage — before synthetic voices are indiscernible by humans. Engaging in damage control when audio deep fakes proliferate would be a dangerous proposition.
Where Can You Sell Your Technology While Still Ensuring it is Used Responsibly?
Should we sell just because we can? No. This is a matter of principle. We purposefully seek out engagements only with organizations and individuals that are aligned with our values. Our licensing agreements and business contracts also reflect these fundamentals.
What Data Does Your Machine Learning Process Use?
Voice is complex and highly personal. The very nature of our work is to build a more inclusive and diverse universe of synthetic voice. Our machine learning algorithms consider a broad demographic varying in age, gender, language background, and geography to reflect real-world variations.
How can it be detected?
Synthetic voice has advanced considerably in the past few decades. Most recently, the availability of large datasets and machine learning tools has catapulted the field forward. Soon synthetic voice will become indistinguishable from real audio. We are working on a multi-pronged strategy that encompasses audio steganography (watermarking), voice blending and countermeasure tools. We began talking about this as the moonshot for voice AI - the need to build voices that are life-like without being deceptive, that unite us, rather than divide us. We cannot afford to blur the lines between the virtual and physical world to the extent that compromises the core values of our society. We cannot blindly build technologies without understanding that these new tools will change us fundamentally. For us, we believe that change needs to be a net positive.
Just the Beginning
This is an exciting time in Voice. Advances in technology are coming at breakneck speeds, allowing us to offer world-class synthetic voices that truly represent all of us. For individuals and brands that have relied on the voice of only a few, this is amazing news. As voice-first interfaces continue to expand across all aspects of our lives, from customer service to health care and entertainment, customized synthetic voice will power these user experiences. Protecting synthetic media from potential abuse will require a joint, collaborative effort to create the most advanced and impenetrable barriers that protect us all. We hope you join the conversation and the AITHOS coalition.
As Mr. Vedantam begins the podcast, "At some point in our lives, many of us realize that the way we hear our own voice isn't the way others hear us. And we begin to realize that our voices communicate so much more than mere information: they reveal our feelings, our temperament, our identity."
This sets the tone for the next 30 minutes in which voice as identity is looked at from several angles, including a transgender woman's struggle with hearing herself in her voice, a woman who experienced a drastic change in her voice after surgical intubation damaged her vocal cords, and in the case of speech disorders requiring speech generating devices to communicate, how the use of modern speech synthesis technology can provide these individuals with their own unique identifiable vocal identities.
"Voice is about who you are. Our voice signals how old we are. Our voice signals our gender. Our voice signals, you know, things about our personality."
An important part of any technology conversation is "how do you mitigate the unintended outcomes?" and this is something Rupal and Shankar briefly touch upon. With the increase in deep fakes across media, Ms. Patel discussed the vulnerabilities and risks of new voice technologies, from political to financial impacts. She further stated that along with advances, there are ethical responsibilities that companies building these technologies must consider, and how VocaliD has designed ethical AI into our business.
In summary, this podcast is a wonderful introduction into the concept of voice as identity. Be sure to subscribe to Hidden Brain for more fascinating episodes as Shankar Vedantam uses science and storytelling to reveal the unconscious patterns that drive human behavior, shape our choices and direct our relationships.
This one-hour, in-depth interview was a deep dive into VocaliD, as well as the history and science of speech synthesis, providing the listener with a solid understanding of the hows and whys of modern voice AI.
During the podcast, Rupal and Bret delved into the future of computer-generated voice and how it relates to the surge in voice-first products we are seeing (and hearing). The technological advances in machine learning will undoubtedly offer numerous benefits from both a consumer and brand standpoint.
One of the many interesting takeaways was the impact that today's advances in speech synthesis will have on inclusivity and on helping communities feel less disenfranchised.
Rupal explained that if you look at the past, the early days of radio and television broadcasting featured a very limited range of voices and faces. There wasn't much diversity in the beginning, but now you are seeing, and hearing, a far more diverse range of communities in these two mediums. This hadn't yet caught on in the synthetic voice world, however, and Rupal is eager for what will come now that VocaliD can offer unique, high-quality, diverse voices.
"Our world is diverse. From age, gender, sexual orientation, and accents, and we don't hear much of that at all in the synthetic voices we hear around us."
-Rupal Patel, CEO of VocaliD
Wrapping up this educational podcast, Ms. Patel discussed the ethical responsibilities that technology companies hold when creating new technologies that may bring unintended consequences, and how important it is to build safeguards into the design of your technology to mitigate these risks.
In the past few decades, we have seen an explosion of voice interfaces. There are 500M speaking devices today and by 2021 they are projected to outnumber us. We are using voice to access financial accounts, health records, and other personal information. All of this exists today because of a wild idea, a moonshot. Voice is changing, evolving, exploding and the opportunities and risks are limitless. Before we share our founder Rupal Patel’s vision for the Voice AI Moonshot, let’s have a look at the future of Voice AI.
Despite the sheer number of voice interfaces we have access to today, we are still treating voice as a functional modality: a way to transmit information. Even today’s spherical hardware devices that we refer to as conversational agents are merely for timely and topical information exchange. One monolithic voice – a butler of facts. How can we move past this and harness the true power of Voice AI?
"Our Voice AI Moonshot is a world where voice benefits all, not just some."
The future of voice AI lies in tapping into the intrinsic, human characteristics of voice as a social connector.
There is an evolutionary reason that we each have a unique voice. Our voice defines us – our age, size, cultural background, habits, sexual orientation, socioeconomic level and more. Specifically, voice is biometric data that can be used to predict and monitor physical and mental health, while also offering a window into cognition and learning readiness. This is the untapped power of voice.
The future of voice AI is about connection. It is about creating contextually adaptive voices that can calm or inspire with the flip of a bit.
The future of voice AI is about TRUST. It is about creating relatable, compassionate voices that can engage a toddler and the aging lonely. It is important to note that these voices would not substitute for human contact; they would be an augmentation.
The future of voice AI is not one voice for all. It will be a multitude of vocal personas that capture the full range of human expression. Brands will design vocal personas that speak to their diverse audience, not just a few users. As individuals, we will each have our own vocal avatars.
As we harness and emulate this awesome and powerful human trait, we must anticipate the unintended consequences. We must proactively identify and protect against the potential for nefarious use. Voice is identity that cannot be swapped like passwords or PINs; it must be secured from the start.
The Voice AI Moonshot
For this reason, as technologists, it is important that we take seriously our role in the creation of new technologies. While our technology provides great social benefit, with every new advance we proactively create measures to ensure that it cannot be misused. We are committed to being active in the shaping and realization of the Voice AI future and the Voice AI Moonshot.
To our founder, Rupal Patel, the Voice AI Moonshot is a universe of voices that are convincing without being deceptive. The Voice AI future she envisions is one in which voices connect us rather than divide us, and where these voices will celebrate our diverse yet common humanity. Finally, our Voice AI Moonshot is a world where voice benefits all, not just some.
To learn more about Rupal Patel and VocaliD, please read our company page.