Engineering the Future: The Promise and Perils of Voice AI
By Rupal Patel
Early days at VocaliD
If we are honest, we started VocaliD to change the world, at least, to change the way voice is represented within the world. We are proud to know that for some individuals, our technology has changed THEIR world and improved their lives and the lives of those around them.
Building a technology company with the goal to change societal norms comes with significant responsibilities. When you do anything, you impact something. When you bring a new technology into the world, the ability to fully predict every impact is impossible. Chaos theory, the butterfly effect, … despite our best intentions, there will be unintended consequences. And so, as a company that began with the goal of empowering every voice to be heard, how do we limit the misuse of our work?
For us, we have asked ourselves hard questions early and often. What do we stand for? Are we opening up a Pandora’s box? How will we act responsibly today to mitigate risks in the future? It is never an easy exercise, but nowhere is it as challenging as building a technology-based company within a nascent space. There isn’t a formula to follow and you have to be comfortable with the discomfort of not knowing.
In the early stages, we were just figuring out who we were and then, as the industry has grown around us, we needed to respond to the changing environment. We needed to decide how and whether to pivot or expand. We asked ourselves whether we can grow and still stay true to ourselves.
When we began this company in 2014, we could only begin to imagine the potential for voice. Our initial focus was on the needs of individuals living without speech, or with impaired speech. It soon became apparent that there was a need for unique synthetic voice beyond assistive technology.
Technological Advances Bring New Opportunities and New Risks
A few years ago technology didn’t allow for subtle nuances in voice, hence the robotic sameness that existed for so long. Today, we can emulate these unique characteristics within individual voices and create vocal identities that have personality. As the ability to create synthetic voice approaches ultra-realistic quality, it brings with it both unparalleled opportunities for individuals and businesses but also increased risk in the form of fraud and deception. These concerns are on the minds of many technologists, especially those of us working in synthetic media. How can we ensure that the work we do is contributing positively, rather than fueling harmful and nefarious activities?
As we prepared for last year’s GDPR compliance we found ourselves realizing that compliance alone was not enough to combat misappropriation of technologies. We need more proactive engagement by the AI community, to ensure we ward off misuses well before they happen.
How do we do this? What are the steps? Well, it is important to note that what we think is adequate today is going to evolve… it will depend on the growth of the industry and the consumers of this technology. While we can’t possibly anticipate all potential misuses and build countermeasures to block them, we believe it is essential to draw attention to these complex issues in the formative stage of the synthetic media industry so that developers think through the unintended consequences and consumers have the awareness to carefully evaluate the products.
Starting the AI Ethics Conversation
"We cannot afford to blur the lines between the virtual and physical world to the extent that compromises the core values of our society. We cannot blindly build technologies without understanding that these new tools will change us fundamentally."
We believe education and open, ongoing communication are the way forward. With this in mind, we’ve partnered with Modulate to form the AITHOS Coalition as a way for all of us working within synthetic media, from technologists and CEOs to sales and marketers, to help shape the field while staying mindful of the overarching issues.
Together, we came up with a self-reflection guide to facilitate the critical discussions we were having within our organizations and amongst others in the field. Some of the topics may not be relevant to all, but we believe they are a starting point in the conversation and are worth considering when building products that have the potential to both disrupt and be misused.
VocaliD’s take on the AITHOS dialog
Why is this Synthetic Media Valuable in the First Place?
When we began, it was obvious. A young girl and a middle-aged man are so inherently different - unique in age, gender, and personality. Of course, they both deserved to have unique vocal identities to speak through their assistive devices. However, this isn’t what was happening. The devices had few voices available and too often, they didn’t represent the person or personality using the prosthesis. VocaliD set out to change that by building personalized synthetic voices.
As we continue to do this important work, we have begun listening to all of the voices around us and we’ve realized that the issue we were solving didn’t only exist in the assistive technology space; it exists in everyday life. All one needs to do is listen … and you will hear very quickly that the synthetic voices we come in contact with in our daily lives … are bland, at best. The current voice offerings are not representative of all of us, of our communities, of our needs. For voice actors, this means that only white men or women, or white-sounding men and women, are voicing nearly all of the virtual assistants, IVR systems, alerts and notification systems, and gaming avatars. VocaliD democratizes synthetic voice, creating vocal identities that are as colorful and diverse as the world we live in. Whether it is regional or social dialects or internationally accented English, today’s synthetic voice should sound like us - all of us.
Who or What Should Your Tech be Able to Emulate?
Scientists and artists have been fascinated with emulating speech for centuries. What’s different today than in the past is that we are not only emulating speech but also voice… how someone sounds, the actual vocal identity of the speaker, making it susceptible to misuse. What’s adding to the problem is that we are now saving and archiving audio data at an unprecedented rate. CEOs, political figures, influencers all have hours of relatively clean audio that is easily accessible to anyone. Moreover, there is a push to make tools and technologies open source to both fuel innovation and equalize access but with that comes risks. We use proprietary techniques that we purposefully do not open source as a precautionary measure.
Can we build the voices of children, influencers and other vulnerable populations? Yes, we can build anyone’s voice if we have enough data. We have policies in place to protect data and we have guidelines for voice building. From day one, our approach was to blend voices for those with disabilities. Today, as we work with enterprises, we require that the talent has consented to the use of their voice before we start the project. Moreover, we take proactive steps to ensure that should our data or tech fall into the wrong hands, it would be sufficiently obfuscated.
When Should You Share How Your Technology Works with the World?
Given anticipated consequences, we feel it is responsible to be cautious regarding what and how much we share about how the technology works. While we are committed to advancing the field through the dissemination of our technology and findings, we also understand that some aspects of the IP and know-how need to be undisclosed.
We are focused on educating consumers at this stage — before synthetic voices are indiscernible by humans. Engaging in damage control when audio deep fakes proliferate would be a dangerous proposition.
Where Can You Sell Your Technology While Still Ensuring it is Used Responsibly?
Should we sell just because we can? No. This is a matter of principle. We purposefully seek out engagements only with organizations and individuals that are aligned with our values. Our licensing agreements and business contracts also reflect these fundamentals.
What Data Does Your Machine Learning Process Use?
Voice is complex and highly personal. The very nature of our work is to build a more inclusive and diverse universe of synthetic voice. Our machine learning algorithms consider a broad demographic varying in age, gender, language background, geography to reflect real-world variations.
How can it be detected?
Synthetic voice has advanced considerably in the past few decades. Most recently, the availability of large datasets and machine learning tools has catapulted the field. Soon synthetic voice will become indistinguishable from real audio. We are working on a multi-pronged strategy that encompasses audio steganography (watermarking), voice blending and countermeasure tools. We began talking about this as the moonshot for voice AI - the need to build voices that are life-like without being deceptive, that unite us, rather than divide us. We cannot afford to blur the lines between the virtual and physical world to the extent that compromises the core values of our society. We cannot blindly build technologies without understanding that these new tools will change us fundamentally. For us, we believe that change needs to be a net positive.
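For readers curious what audio steganography can look like in practice, here is a minimal sketch of one simple, textbook technique: hiding a short tag in the least-significant bit (LSB) of each audio sample. It is an illustration only and is not a description of VocaliD's watermarking. The function names, the "synthetic" tag, and the test tone are all assumptions made for the example.

```python
import math


def text_to_bits(text):
    """Turn a string into a list of 0/1 bits (UTF-8, most significant bit first)."""
    return [(byte >> shift) & 1
            for byte in text.encode("utf-8")
            for shift in range(7, -1, -1)]


def bits_to_text(bits):
    """Inverse of text_to_bits: pack groups of 8 bits back into a string."""
    data = bytes(
        sum(bit << (7 - j) for j, bit in enumerate(bits[i:i + 8]))
        for i in range(0, len(bits), 8)
    )
    return data.decode("utf-8")


def embed_watermark(samples, bits):
    """Return a copy of samples with bits written into the lowest bit of the first samples."""
    if len(bits) > len(samples):
        raise ValueError("not enough samples to hold the watermark")
    marked = list(samples)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the watermark bit.
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract_watermark(samples, n_bits):
    """Read the lowest bit of the first n_bits samples."""
    return [sample & 1 for sample in samples[:n_bits]]


# Hypothetical input: 0.1 seconds of a 440 Hz tone as 16-bit integer samples at 16 kHz.
samples = [int(10000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(1600)]

tag_bits = text_to_bits("synthetic")
marked = embed_watermark(samples, tag_bits)
recovered = bits_to_text(extract_watermark(marked, len(tag_bits)))
print(recovered)
```

Each sample changes by at most one step out of roughly 65,000, so the tag is inaudible. The tradeoff is fragility: an LSB mark like this is lost as soon as the audio goes through lossy compression or resampling, which is why a real detection strategy would need more robust techniques alongside other countermeasures.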
Just the Beginning
This is an exciting time in Voice. Advances in technology are coming at breakneck speeds, allowing us to offer world-class synthetic voices that truly represent all of us. For individuals and brands that have relied on the voice of only a few, this is amazing news. As voice-first interfaces continue to expand across all aspects of our lives, from customer service to health care and entertainment, customized synthetic voice will power these user experiences. Protecting synthetic media from potential abuse will require a joint, collaborative effort to create the most advanced and impenetrable barriers that protect us all. We hope you join the conversation and the AITHOS coalition.
The Top Leaders in Voice is a recognition bestowed upon those working in the voice space by a panel of journalists and voice technology peers. The 2019 Voice Leaders are segmented into four lists - Visionaries, Designers and Products, Technologists, and Influencers.
"Topping the list in the visionaries category are Jeff Bezos of Amazon and Adam Cheyer from Samsung and Viv Labs.
Designers and products leaders include Google’s Cathy Pearl and Mark Webster from Adobe.
The Technologists category includes another Googler, Brad Abrams, along with John Giannandrea from Apple and Rohit Prasad from Amazon.
Then there are the Influencers that are shaping the way consumers and corporate executives think about voice technology today such as Gary Vaynerchuk of Vayner Media, Dave Isbitski from Amazon, and Noelle LaCharite from Microsoft.
However, these lists are not limited to executives at big companies. Many have an outsized influence on the market today because of the reach and impact of their companies, but also included are startup founders from Audioburst, Audio Analytic, Clinc, Orbita, Pretzel Labs, and VocaliD among others."
Voicebot utilized a variety of methods to narrow down their list and determine the final Top 44 Leaders in Voice, including surveys of voice professionals, panel suggestions, media coverage, and social media activity. None of the judges or writers affiliated with Voicebot were eligible for consideration. Each individual was then scored based on several factors and, if chosen, was then segmented into a leadership category.
Learn more about the selection process and the panel of judges, listed at the bottom of the overview page, linked above.
The Search for Rori's Voice
MICHELE J MARTIN
One of the founding missions of VocaliD is to give voice to the voiceless. It fuels everything we do. Every time we are able to build and deliver one of our personalized synthetic voices to an individual and improve their lives, we are reminded of why we do this.
Chasing the Cure is a new television show that brings the power of crowdsourcing to medical mysteries: cases that have been undiagnosed or misdiagnosed, all desperate for a cure. The site features a list of different case files that visitors can read through and comment on. One of these case files was Rori.
We were so moved by Rori's case. As a speech scientist, our founder, Rupal Patel, was certain she could help, so she recorded a video message that we then posted on Twitter... hoping it would get in front of Rori and the team at Chasing the Cure.
Confident that we could help Rori to regain some of her lost vocal identity and independence, the video offered to build her a unique synthetic voice through the power of crowdsourced voices from contributors around the world. We made the offer and asked her to contact us if interested and then we waited.
On tonight's live premiere episode, our video made it in front of Rori.
As Mr. Vedantam begins the podcast, "At some point in our lives, many of us realize that the way we hear our own voice isn't the way others hear us. And we begin to realize that our voices communicate so much more than mere information: they reveal our feelings, our temperament, our identity."
This sets the tone for the next 30 minutes in which voice as identity is looked at from several angles, including a transgender woman's struggle with hearing herself in her voice, a woman who experienced a drastic change in her voice after surgical intubation damaged her vocal cords, and in the case of speech disorders requiring speech generating devices to communicate, how the use of modern speech synthesis technology can provide these individuals with their own unique identifiable vocal identities.
"Voice is about who you are. Our voice signals how old we are. Our voice signals our gender. Our voice signals, you know, things about our personality."
An important part of any technology conversation is "how do you mitigate the unintended outcomes?" and this is something Rupal and Shankar briefly touch upon. With the increase in deep fakes across media, Ms. Patel discussed the vulnerabilities and risks of new voice technologies, from political to financial impacts. She further stated that along with advances, there are ethical responsibilities that companies building these technologies must consider, and how VocaliD has designed ethical AI into our business.
In summary, this podcast is a wonderful introduction into the concept of voice as identity. Be sure to subscribe to Hidden Brain for more fascinating episodes as Shankar Vedantam uses science and storytelling to reveal the unconscious patterns that drive human behavior, shape our choices and direct our relationships.
BeSpoke Voices Named Finalist in 2019 Index Awards
MICHELE J MARTIN
VocaliD is honored that BeSpoke Voice has been named one of 42 finalists in the 2019 Index Awards. Nominated for the Body Category, BeSpoke Voice finds its place alongside innovative solutions such as Thumy, Petit Pli, MasSpec Pen, and Project Coelicolor.
The Index Award is a biennial award launched in 2005 by The Index Project, a global design event that was established in 2002 by Kigge Hvid. While it was originally crafted as a way to promote Denmark as an innovative design nation, the project quickly took on a more global footprint. Its mission is a simple yet lofty one - "Design to Improve Life".
Since their inception they have given out 42 awards in 5 categories: Body, Home, Work, Play & Learning, Community, and a People's Choice. Winning products and teams span the globe. Some of the previous winners of the Index Awards include Tesla (the only team to win twice - once for the Tesla Roadster and again for the Tesla Powerwall rechargeable battery), Raspberry Pi, Ethereum Foundation, Duolingo, Paperfuge, The Ocean Cleanup Array, Labster, Fresh Paper, and the "Copenhagen Climate Adaption Plan".
"I felt it was important to create a solution that would give every individual a unique voice."
When will we hear if we won? Well, the Index Award's ceremony takes place on September 6th, 2019 in Copenhagen, Denmark. Regardless of whether we win, we are proud to be recognized alongside our esteemed global peers in technology and innovation. We will be cheering everyone on... and hoping for a win, of course!
This one-hour, in-depth interview was a deep dive into VocaliD, as well as the history and science of speech synthesis, providing the listener with a solid understanding of the hows and whys of modern voice AI.
During the podcast, Rupal and Bret delved into the future of computer-generated voice and how it relates to the surge in voice-first products we are seeing (and hearing). The technological advances in machine learning will undoubtedly offer numerous benefits from both a consumer and brand standpoint.
One of the many interesting takeaways was the impact that today's advances in speech synthesis will have on inclusivity and on allowing communities to feel less disenfranchised.
Rupal explained that if you look at the past - the prototypes for radio and television broadcasting were a very limited voice or face. There wasn't much diversity in the beginning, but now you are seeing, and hearing, a far more diverse range of communities in these two mediums. This hadn't yet caught on in the synthetic voice world however, and Rupal is eager for what will come now that VocaliD can offer unique high quality diverse voices.
"Our world is diverse. From age, gender, sexual orientation, and accents, and we don't hear much of that at all in the synthetic voices we hear around us."
-Rupal Patel, CEO of VocaliD
Wrapping up this educational podcast, Ms. Patel discussed the ethical responsibilities that technology companies must be aware that they hold when creating new technologies that may bring unintended consequences - and how it is important to consider ways in which to build safeguards into the design of your technology to mitigate these risks.
The Flexibility of Synthetic Speech & Personalization of Human Talent
CUSTOM AI-BASED VOICES OFFER A DISTINCTIVE SOUND FOR PEOPLE AND BRANDS
By Rupal Patel
Darwinian forces of natural selection favor uniqueness – from birds to people to corporations. Enterprises capture this uniqueness in their brand. A distinctive logo, a preferred font, a color palette, and now more than ever, a vocal identity. With over 2.5 billion voice-enabled devices globally and 8 billion voice assistants projected globally by 2023, the demand for unique voice is skyrocketing. Corporations want to be heard and listeners want variety.
"In the voice-first era, companies that compromise on unique voice, simply will not be heard over their competition, and that’s not good for business."
Want to read the entire article that was published on Voicebot.ai?
The New Role of Voice in Brand Storytelling
MICHELE J MARTIN
Historically when the term voice has been used in business, we have talked about the overall tone and thread of messages across channels. Voice wasn’t literal, it was figurative. The voice-first explosion has changed that.
Until now, it was enough for a brand to consider their words and how they would be perceived and whether they seemed to complement and reflect who the company is. While storytelling is a buzzword that is often overused, when done well, it is a way of taking your audience on a journey with your brand. Each piece of copy and content was a paragraph or a chapter - coming together to make up the larger narrative of the brand story. Over time, with thanks to the advances in different converging technologies, storytelling left the page and went to video, went to music, and now… it has come to voice.
Voice AI is the application of state-of-the-art machine learning to speech blending algorithms to transform human voice recordings into the digital voices brands need today. VocaliD's digital, or ‘synthetic’ voices, have their roots in traditional text-to-speech, but Voice AI isn't yesterday's text-to-speech. Today's voices are expressive and diverse. No longer the robotic speech we’d come to associate with a synthesized voice.
The cutting-edge quality of today’s synthesized digital voices allows brands to extend their reach and be present and voice-consistent in every customer audio touch point... seamlessly.
"Voice AI is not a replacement of voice talent, in fact, it is a way to empower brands and talent to augment their current capabilities in order to meet the changing demands of the voice-first revolution and provide consumers with better Voice Experiences (VX)."
No matter where your brand is, you want your customer to associate positively with each touch point, whether that is written, visual, or auditory. If the digital voice you are using across voice-first devices, in your automated call centers, or in your marketing campaigns doesn’t denote unity, if it creates cognitive dissonance in your audience, you risk losing the attention of your consumer at a vital time in their customer journey.
Many brands today have their spokesperson - Flo, Mayhem, “Can you hear me now?”, and even Poo-Pourri… these brands have all incorporated very specific talent personas for their brands. Voice AI is not a replacement of this talent, in fact, it is a way to empower brands and talent to augment their current capabilities in order to meet the changing demands of the voice-first revolution and provide consumers with better Voice Experiences (VX).
The Possibilities of Voice AI
Imagine having Mayhem deliver real-time dynamic messaging to Allstate customers. Predicting every short script that the actor portraying Mayhem would need to record would be challenging, and the delays in recording and delivery of the files would eliminate the ability to make on-the-fly changes. With licensing and royalty agreements in place, brands could now ensure that their spokesperson was available both in human and digital form, ensuring full omni-channel residency and voice consistency.
These technological advances and new products and applications have created both challenges and amazing opportunities for brands today. New roles are being created - Voice Experience Designers and Voice Strategists are two that we’ve seen emerge in the last year. More and more CMOs and other executives are talking about the need for their brand to develop an extensive Voice Strategy. The brands that choose to take control now and design their Vocal Personas to complement their overall story will rise above, their message amplified above the noise as others quickly get in line to join the voice-first movement.
Founded by acclaimed speech scientist Rupal Patel, VocaliD's team of speech scientists, technologists, and entrepreneurs set out to create a new universe of voices. Voices that felt alive and showcased the identity of the speaker, rather than a random robotic string of words that lacked in personality. Believing that everyone deserved a voice as unique as them, this work profoundly impacted us all and showed us first hand how important the voice experience (VX) is to an individual. Voice experiences didn’t just impact the user experience (UX); they had the power to change lives.
In early 2018, we realized that voice is more complex and layered today than when we first began our mission. With the explosion of voice-first devices and interfaces, voice has become a robust new touch point for businesses to reach their consumers. It has also become a vulnerability requiring a proactive approach to safeguard it against threats. As leaders in the voice synthesis space, we saw opportunities to continue to innovate and impact with our technology by expanding outside the consumer market we first launched in.
Now in 2019, we are bringing voice to brands, institutions, and organizations seeking out cutting-edge voice technology as a way to provide better products, services, and experiences for their customers.
"Certainly, voice technology is still fairly young and a lot of questions remain, however, what is becoming quite clear is that voice experience (VX) is the new strategy for marketing and c-suite to embrace."
At VocaliD, we foresee voice as a key in the unfolding battle to master and deliver great customer experiences. In a world populated with chatbots, smartphones, voice authentication systems, and smart devices, Voice is everywhere. Voice is impacting UX and making, or breaking, customer experiences.
In order to be heard above the noise in this new arena, brands and organizations will need to become thoughtful in the VX of their customers. It is evolving and becoming a pathway for business development and sales, marketing, research, and operations. Voice is quickly becoming the omni-channel tool businesses must build into their CX strategies. Certainly, voice technology is still fairly young and a lot of questions remain, however, what is becoming quite clear is that VX is the new strategy for marketing and c-suite to embrace.
Voice is more than a way to express words. Until recently, voice technology has used digital voice merely as an output modality. A means to an end. Consequently, businesses have underestimated the power of the voice experience. Voice is a social connection that brings people together; it empowers stories and bridges the gaps between businesses and consumers in order to enhance all of our lives.
Voice has power, particularly when it is unique and personal. People are diverse; shouldn’t the voice we use to connect with them be as well? Today, custom voice can create a new and vibrant world of opportunities for businesses and individuals. Tomorrow? We haven’t begun to get a glimpse over that horizon yet to fully appreciate the impact voice-first technologies will have. Certainly, what we have heard, we are excited for!