
OpenAI built a voice cloning tool, but you can’t use it… yet


As deepfakes proliferate, OpenAI is refining the technology used to clone voices, but the company insists it is doing so responsibly.

Today marks the preview debut of OpenAI’s Voice Engine, an expansion of the company’s existing text-to-speech API. In development for about two years, Voice Engine lets users upload any 15-second audio sample to generate a synthetic copy of that voice. But there is no date for a public release yet, giving the company time to respond to how the model is used and abused.

“We want to make sure everyone feels good about how we deploy it — that we understand the landscape in which this technology poses a risk and we have mitigations for that,” Jeff Harris, a member of the product staff at OpenAI, told TechCrunch in an interview.

Model training

The AI model powering Voice Engine has been hiding in plain sight for some time, Harris said.

The model itself powers the voice and “read aloud” capabilities in ChatGPT, OpenAI’s AI-powered chatbot, as well as the preset voices available in OpenAI’s text-to-speech API. And Spotify has been using it since early September to dub podcasts from prominent hosts like Lex Fridman into different languages.
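For context, here is a minimal sketch of what calling that existing text-to-speech API looks like with the official openai Python package. The model name (tts-1) and preset voice (alloy) are the API’s stock options; Voice Engine’s cloning endpoint itself remains unpublished, so this only illustrates the API the new tool extends.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Synthesize speech with one of the stock preset voices. Voice Engine
# extends this same API with cloned voices, but that endpoint is not
# publicly available, so only the existing call is shown here.
response = client.audio.speech.create(
    model="tts-1",   # OpenAI's standard text-to-speech model
    voice="alloy",   # one of the preset voices mentioned above
    input="This sentence was synthesized, not recorded.",
)

# The response body is the audio itself (MP3 by default).
with open("speech.mp3", "wb") as f:
    f.write(response.content)
```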

I asked Harris where the training data for the model came from, a bit of a touchy subject. He would say only that the Voice Engine model was trained on a mix of licensed and publicly available data.

Models like the one powering Voice Engine are trained on an enormous number of examples (in this case, speech recordings), typically sourced from public sites and datasets around the web. Many generative AI vendors see training data as a competitive advantage and therefore keep it, and information relating to it, close to the chest. But training data details are also a potential source of intellectual property lawsuits, another disincentive to reveal much.

OpenAI is in fact being sued over allegations that the company violated intellectual property law by training its AI on copyrighted content, including photos, artwork, code, articles and e-books, without giving the creators or owners credit or payment.

OpenAI has licensing agreements in place with some content providers, such as Shutterstock and the news publisher Axel Springer, and lets webmasters block its web crawler from scraping their sites for training data. OpenAI also lets artists “opt out” of and remove their work from the datasets the company uses to train its image-generating models, including its latest, DALL-E 3.

But OpenAI offers no such opt-out scheme for its other products. And in a recent statement to the U.K. House of Lords, OpenAI suggested that it is “impossible” to create useful AI models without copyrighted material, asserting that fair use, the legal doctrine that allows copyrighted works to be used to make secondary creations as long as they are transformative, shields it where model training is concerned.

Speech synthesis

Surprisingly, Voice Engine is not trained or fine-tuned on user data. That is owed in part to the ephemeral way in which the model, a combination of a diffusion process and a transformer, generates speech.

“We take a small audio sample and text and generate realistic speech that matches the original speaker,” Harris said. “The audio that is used is dropped after the request is complete.”

As he explained it, the model simultaneously analyzes the speech data it pulls from and the text data meant to be read aloud, generating a matching voice without having to build a custom model per speaker.

It’s not new technology. A number of startups have offered voice cloning products for years, from ElevenLabs to Replica Studios to Papercup to Deepdub to Respeecher. So have Big Tech companies like Amazon, Google and Microsoft, the last of which is a major OpenAI investor.

Harris claimed that OpenAI’s approach delivers generally higher-quality speech. TechCrunch was unable to evaluate the claim, however, because OpenAI declined multiple requests to provide access to the model or recordings for publication. Samples will be added if and when the company releases them.

We do know it will be priced aggressively. Although OpenAI removed Voice Engine’s pricing from the marketing materials it published today, in documents viewed by TechCrunch, Voice Engine is listed as costing $15 per one million characters, or roughly 162,500 words. That would fit Dickens’ “Oliver Twist” with a little room to spare. (An “HD” quality option costs twice that, but confusingly, an OpenAI spokesperson told TechCrunch that there is no difference between HD and non-HD voices. Make of that what you will.)

A million characters translates to about 18 hours of audio, which puts the price at roughly $1 per hour. That is indeed cheaper than what one of the more popular rival vendors, ElevenLabs, charges: $11 for 100,000 characters per month. But it does come at the expense of some customization.
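To make the comparison concrete, here is the back-of-the-envelope arithmetic as a short Python sketch. It uses only the figures above; the 18-hours-per-million-characters conversion is this article’s own estimate, not a vendor-published number.

```python
# Rough price comparison using only the figures cited above.
# Assumption: 1,000,000 characters ~= 162,500 words ~= 18 hours of audio.

HOURS_PER_MILLION_CHARS = 18

voice_engine_per_million = 15.00     # $15 per 1M characters (leaked docs)
elevenlabs_per_million = 11.00 * 10  # $11 per 100k characters => $110 per 1M

def dollars_per_hour(price_per_million_chars: float) -> float:
    """Convert a per-million-character price into an approximate hourly rate."""
    return price_per_million_chars / HOURS_PER_MILLION_CHARS

print(f"Voice Engine: ~${dollars_per_hour(voice_engine_per_million):.2f} per hour of audio")
print(f"ElevenLabs:   ~${dollars_per_hour(elevenlabs_per_million):.2f} per hour of audio")
# Voice Engine: ~$0.83 per hour of audio
# ElevenLabs:   ~$6.11 per hour of audio
```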

Voice Engine does not provide controls to adjust the tone, pitch or tempo of a voice. In fact, it offers no knobs or dials at all at the moment, although Harris notes that any expressiveness in the 15-second audio sample will carry through to subsequent generations (for example, if you speak in an excited tone, the resulting synthetic voice will sound consistently excited). We will see how the quality of the readings compares with those of other models when they can be compared directly.

Vocal talent as a commodity

Voice actor salaries on ZipRecruiter range from $12 to $79 per hour, far more expensive than Voice Engine even at the low end (actors with agents will command a much higher price per project). If OpenAI succeeds in commoditizing voice work, where does that leave actors?

The talent industry would not exactly be caught off guard; it has been grappling with the existential threat of generative AI for some time. Voice actors are increasingly being asked to sign away the rights to their voices so that clients can use AI to generate synthetic versions that could eventually replace them. Voice work, particularly cheap, entry-level work, is at risk of being eliminated in favor of AI-generated speech.

Now, some voice AI platforms are trying to strike a balance.

Last year, Replica Studios signed a somewhat contentious deal with SAG-AFTRA to create and license copies of union members’ voices. The organizations said the arrangement established fair and ethical terms and conditions to ensure performer consent while negotiating terms for the use of synthetic voices in new works, including video games.

ElevenLabs, meanwhile, hosts a marketplace for synthetic voices that allows users to create, verify and publicly share voices. When others use those voices, the original creators receive compensation, at a set dollar amount per 1,000 characters.

OpenAI will establish no such union deals or marketplaces, at least not in the near term. It requires only that users obtain “explicit consent” from the people whose voices are cloned, make “clear disclosures” indicating which voices are AI-generated, and agree not to use the voices of minors, deceased people or political figures in their generations.

“How this intersects with the voice actor economy is something we’re watching closely and are curious about,” Harris said. “I think there’s going to be a lot of opportunities to expand your reach as a voice actor through this kind of technology. But these are all things we’ll learn as people deploy the technology and play with it a little bit.”

Ethics and deepfakes

Voice cloning apps can be abused in ways that go beyond simply threatening actors’ livelihoods.

The notorious message board 4chan, known for its conspiratorial content, used ElevenLabs’ platform to share hateful messages mimicking celebrities like Emma Watson. The Verge’s James Vincent was able to use AI tools to maliciously and quickly clone voices, generating samples containing everything from violent threats to racist and transphobic remarks. And at Vice, reporter Joseph Cox documented generating a voice clone convincing enough to fool a bank’s authentication system.

There are fears that bad actors will attempt to sway elections with cloned voices. And they are not unfounded: in January, a phone campaign employing a deepfaked President Biden was used to deter New Hampshire citizens from voting, prompting the FCC to move to make future such campaigns illegal.

So, aside from banning deepfakes at the policy level, what steps, if any, is OpenAI taking to prevent Voice Engine abuse? Harris mentioned a few.

First, Voice Engine is only made available to a very small group of developers — about 100 — to get started. Harris says OpenAI prioritizes “low-risk” and “socially beneficial” use cases, such as those related to healthcare and accessibility, as well as “responsible” synthetic media experiences.

A few early users of Voice Engine include Age of Learning, an educational technology company that uses the tool to create voiceovers from previously cast actors, and HeyGen, a storytelling app that leverages Voice Engine for translation. Livox and Lifespan use Voice Engine to create voices for people with speech difficulties and disabilities, and Dimagi is building a tool based on Voice Engine to provide feedback to health workers in their primary languages.

Here is a sample generated by Lifespan:

[audio sample]

And one from Livox:

[audio sample]

Second, clips created with Voice Engine are watermarked using technology OpenAI developed that embeds inaudible identifiers in recordings. (Other vendors, including Resemble AI and Microsoft, employ similar watermarks.) Harris did not promise that there is no way to circumvent the watermark, but he described it as “tamper-resistant.”

“If there’s an audio clip out there, it’s easy for us to look at that clip and determine that it was created by our system and which developer actually created that generation,” Harris said. “Right now, this isn’t open source; we have it internally for the moment. We’re interested in making it publicly available, but obviously that comes with additional risks in terms of it being exposed and broken.”
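OpenAI has not published how its watermark works. Purely as an illustration of the general idea of embedding an inaudible identifier in audio, here is a toy, keyed spread-spectrum scheme in Python: a low-amplitude pseudorandom signature is mixed into the waveform and later detected by correlation. This is a sketch of the broad technique, not OpenAI’s actual system; a production watermark would be far more robust and psychoacoustically shaped to stay inaudible.

```python
import numpy as np

# Toy keyed watermark: add a low-amplitude pseudorandom signature, then
# detect it later by correlating the clip against the same keyed signature.
# Illustrative only; this is NOT OpenAI's scheme and is trivially removable.

def signature(key: int, n: int) -> np.ndarray:
    """Deterministic +/-1 pseudorandom sequence derived from a secret key."""
    rng = np.random.default_rng(key)
    return rng.choice([-1.0, 1.0], size=n)

def embed(audio: np.ndarray, key: int, strength: float = 0.02) -> np.ndarray:
    """Mix the keyed signature into the audio at low amplitude."""
    return audio + strength * signature(key, len(audio))

def detect(audio: np.ndarray, key: int, threshold: float = 0.01) -> bool:
    """High correlation with the keyed signature means 'watermarked'."""
    score = float(np.dot(audio, signature(key, len(audio))) / len(audio))
    return score > threshold

# Demo on one second of a synthetic 16 kHz "voice" (a 220 Hz sine wave).
sr = 16_000
t = np.arange(sr) / sr
clip = 0.5 * np.sin(2 * np.pi * 220.0 * t)

print(detect(embed(clip, key=42), key=42))  # True: watermark detected
print(detect(clip, key=42))                 # False: clean clip
```

Harris’s point about keeping the detector internal maps onto the secret key here: anyone who obtains the key, or the detector itself, can also learn how to strip the mark.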

Third, OpenAI plans to provide members of its Red Teaming Network, a contracted group of experts who help inform the company’s AI model risk assessment and mitigation strategies, with access to Voice Engine to ferret out malicious uses.

Some experts argue that AI red teaming is not exhaustive enough and that it is incumbent on vendors to develop tools to defend against the harms their AI might cause. OpenAI is not going quite that far with Voice Engine, but Harris stresses that the company’s “overarching principle” is to release the technology safely.

General release

Depending on how the preview and public reception of Voice Engine goes, OpenAI may release the tool to the broader developer base, but for now, the company is hesitant to commit to anything concrete.

Harris did give a sneak peek at Voice Engine’s roadmap, revealing that OpenAI is testing a safety mechanism that has users read randomly generated text as proof that they are present and aware of how their voice is being used. This could give OpenAI the confidence it needs to bring Voice Engine to more people, or it might just be the beginning, Harris said.
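OpenAI has not detailed that mechanism, but the shape of such a liveness check is easy to sketch. The hypothetical Python flow below generates a random challenge phrase and verifies that a transcript of the uploaded audio matches it; the word list, function names and exact-match comparison are all illustrative assumptions, and a real system would transcribe with a speech recognition model and use fuzzier matching.

```python
import secrets

# Hypothetical liveness check in the spirit of what Harris describes: the
# speaker must read back a freshly generated phrase, showing the sample was
# recorded here and now rather than lifted from pre-existing audio.

WORDS = ["amber", "canyon", "drift", "ember", "harbor",
         "lantern", "meadow", "pixel", "quartz", "willow"]  # illustrative list

def challenge_phrase(n_words: int = 6) -> str:
    """Generate a random phrase for the user to read aloud."""
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))

def passes_liveness(transcript: str, phrase: str) -> bool:
    """Naive check: the transcript of the recording must match the phrase."""
    return transcript.strip().lower() == phrase.strip().lower()

phrase = challenge_phrase()
print("Please read aloud:", phrase)
# A real flow would transcribe the uploaded audio with an ASR model and
# apply a tolerant comparison before allowing the voice to be cloned.
```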

“What will continue to move us forward in terms of actual voice matching technology will really depend on what we learn from the pilot, the safety issues that are uncovered and the mitigation measures that we take,” he said. “We don’t want people to confuse artificial sounds with actual human voices.”

On this last point we can agree.
