Using AWS Polly to bring our game to life

Screenshot: Askutron - The host lending his name to the game.

When we decided to build Askutron: The Quiz Show Game, we had to figure out how to get the voice-over for all our questions and answers, which are read aloud by the host. We were planning to have a lot of questions. In fact, at this point we have more than 10,000 of them. Being a tiny studio, we don’t have a lot of money at our disposal, so we were looking for alternatives to traditional voice-over work.

Speech synthesis has improved a lot in the past decade. These days Siri, Google Assistant and Amazon’s Echo are used by many people and speak quite naturally. So perhaps we could reduce costs by using speech synthesis? Besides the lower cost, this brings another big advantage: it lets us have voice-overs even for player-generated content in our quiz. It also means our game’s host can say arbitrary player names. Sounds good! So how do we do this?

Enter: Polly

After some research we stumbled upon Amazon’s Polly. Polly is a service used via a REST API which allows you to synthesise speech in over 20 languages including English, German, Russian and even Japanese. It does this at an astonishingly low price: $4 per 1 million characters, which is roughly 23 hours of speech. Using actual voice actors, this would cost thousands of dollars.
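To put those numbers in perspective, here is a small back-of-the-envelope sketch. The price and the hours per million characters are the figures quoted above; the average of 150 characters per question is an assumption made purely for illustration, and `polly_estimate` is a hypothetical helper name.

```ruby
# Figures quoted in the article.
PRICE_PER_MILLION_CHARS_USD = 4.0
HOURS_PER_MILLION_CHARS = 23.0

# Estimates cost and audio length for a given number of characters.
def polly_estimate(characters)
  millions = characters / 1_000_000.0
  {
    cost_usd: (millions * PRICE_PER_MILLION_CHARS_USD).round(2),
    hours: (millions * HOURS_PER_MILLION_CHARS).round(2)
  }
end

# Assumption: 10,000 questions averaging 150 characters each.
estimate = polly_estimate(10_000 * 150)
puts format("%.2f USD for about %.1f hours of speech", estimate[:cost_usd], estimate[:hours])
```

Even if the assumed average is off by a factor of two or three, the total stays in the range of a few dollars.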

How does it sound? You can listen to some samples here. That’s pretty neat! Generated audio files can be downloaded and re-used as often as you like at no additional cost. We pre-generate the audio for our own questions and bundle it with the game. User-generated quizzes require an online connection and are voiced on the fly using Polly.

In the following video you can hear how it sounds in the game.

That does sound pretty good. In some of our play tests people didn’t even realise that the voice was synthetic. And even when it is obvious that the voice is synthetic, it’s not a problem: that is exactly why we made our game host, Askutron, a robot.

Using Polly

Our backend quiz editor is a web application built using Ruby. In Ruby all you have to do is to add the aws-sdk to your Gemfile.

gem 'aws-sdk', '~> 2.9', '>= 2.9.14'

Once you have that you need an access key to use with the AWS SDK and you’re good to go. Generating audio is as simple as it gets. Here is a code snippet from our backend:

##
# Synthesizes audio for the given text.
#
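# @param text [String] The text to synthesize audio for.
# @param language [String] (optional) Language code ("en", "de" or "ru") used to pick the voice.
# @param output [String|IO] (optional) Path to file to write or IO object to stream to.
# @param ssml [Boolean] (optional) Whether the text is SSML markup rather than plain text.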
# @return [SynthesizeSpeechOutput] A struct containing `audio_stream` (IO) and `content_type` (String).
def say(text, language: nil, output: nil, ssml: false)
  client.synthesize_speech(
    response_target: output,
    output_format: "ogg_vorbis",
    sample_rate: "22050",
    voice_id: language_voice_id(language),
    text: text,
    text_type: ssml ? 'ssml' : 'text'
  )
end

##
# Polly client. Configured through the environment variables
# AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
def client
  @client ||= Aws::Polly::Client.new
end

def language_voice_id(lang)
  case lang
  when "de"
    "Hans"
  when "ru"
    "Maxim"
  when "en"
    "Brian"
  else
    "Brian"
  end
end

The voice is chosen depending on the language. For each supported language Polly offers one or more different voices, typically at least one male and one female. These voices are identified by names such as the obvious Hans for the male German voice. In this example we only support English, German and Russian.
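If you want to offer both a male and a female voice per language, a table-driven lookup is one way to do it. This is only a sketch and not the code from our backend: `VOICES` and `voice_for` are hypothetical names, and the female voice names (Marlene, Tatyana, Amy) are Polly voices we assume you would pick for these languages.

```ruby
# Hypothetical voice table: language code => voices by gender.
VOICES = {
  "de" => { male: "Hans",  female: "Marlene" },
  "ru" => { male: "Maxim", female: "Tatyana" },
  "en" => { male: "Brian", female: "Amy" }
}.freeze

# Picks a voice for a language code such as "de" or "de-DE".
# Unknown or missing languages fall back to English.
def voice_for(lang, gender: :male)
  voices = VOICES.fetch(lang.to_s.downcase[0, 2], VOICES["en"])
  voices.fetch(gender) { voices.values.first }
end

puts voice_for("de")
puts voice_for("ru", gender: :female)
```

Keeping the mapping in a hash means adding a language is a one-line change, and the fallback behaviour lives in a single place.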

Using this all you have to do to download the synthesised speech is:

say "Hello Polly", output: "hello.ogg"

This saves the audio to a file in the current working directory. Alternatively, you can omit the output parameter and read the audio data directly. We use this to stream audio in our API’s on-demand playback endpoint, which, in a Ruby on Rails application, would look something like this:

class AudioController < ApplicationController
  def playback
    ssml = SSMD.to_ssml params.require(:ssmd)
    result = Audio::Polly.say ssml, ssml: true, language: params[:language]

    if result
      send_data result.audio_stream.read, type: "audio/ogg", disposition: "inline"
    else
      render json: { errors: ["invalid SSML"] }, status: 400
    end
  end
end
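For the questions that ship with the game, the audio is pre-generated rather than streamed. A minimal sketch of such a batch step could look like the following. It is not our actual build script: `pregenerate`, the hashed file names and the `synthesize` callable are assumptions for illustration. In a real setup the callable would wrap the `say` method from above.

```ruby
require "digest"
require "fileutils"

# Writes one audio file per text into dir and returns the file paths.
# Files are named after a hash of the text, so unchanged texts are
# skipped on the next run and cost nothing to "regenerate".
def pregenerate(texts, dir:, synthesize:)
  FileUtils.mkdir_p(dir)
  texts.map do |text|
    path = File.join(dir, "#{Digest::SHA256.hexdigest(text)[0, 16]}.ogg")
    synthesize.call(text, path) unless File.exist?(path)
    path
  end
end

# Usage with the `say` helper shown earlier (requires AWS credentials):
#   pregenerate(questions, dir: "audio", synthesize: ->(text, path) { say(text, output: path) })
```

Passing the synthesizer in as a callable keeps the batch logic testable without talking to Polly.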

SSML

Looking through the snippet you may have wondered what SSML and SSMD are. SSML is an XML-based markup language which offers a lot of options to influence the way a phrase is synthesised.

We use SSML for questions where Polly can’t manage on its own to pronounce a phrase the way we want, or where putting emphasis on certain words matters for the question.

For example:

<?xml version="1.0"?>
<speak>
  Jeff: <p xml:lang="es">Puto</p>.
  Jim:  What? <emphasis>What</emphasis> did you just say?
  Jeff: <prosody volume="soft" rate="x-fast" pitch="x-high">I DIDN'T SAY ANYTHING</prosody>.
</speak>

In this example we want to let the speech synthesis engine know that “Puto” is a Spanish word and is supposed to be pronounced accordingly. We also want to put emphasis on the second “What”. Lastly, Jeff’s response is supposed to sound nervous, that is quiet, fast and high-pitched.
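If you assemble SSML like this in code rather than by hand, keep in mind that it is XML: a stray `&` or `<` in a question makes the whole document invalid. The following sketch shows one way to build the prosody part safely. The helper names (`escape_ssml`, `prosody`, `speak`) are hypothetical and not part of the AWS SDK.

```ruby
SSML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escapes characters that have a special meaning in XML.
def escape_ssml(text)
  text.gsub(/[&<>"]/, SSML_ESCAPES)
end

# Wraps text in a <prosody> tag; every attribute defaults to "medium".
def prosody(text, volume: "medium", rate: "medium", pitch: "medium")
  %(<prosody volume="#{volume}" rate="#{rate}" pitch="#{pitch}">#{escape_ssml(text)}</prosody>)
end

# Joins already-escaped fragments into one SSML document.
def speak(*fragments)
  "<speak>#{fragments.join(' ')}</speak>"
end

puts speak(
  escape_ssml("Jeff:"),
  prosody("I DIDN'T SAY ANYTHING", volume: "soft", rate: "x-fast", pitch: "x-high")
)
```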

As you can see, the markup is a bit verbose. It’s not easy for editors to read and is tedious to write. To ease the pain, we went looking for an SSML equivalent of Markdown, which does the same job for HTML: it makes formatted text easier and less verbose to write.

Unfortunately we weren’t able to find anything like that. So we built our own tool for that: SSMD.

SSMD

SSMD is short for Speech Synthesis Markdown. As the name suggests it is inspired by Markdown. Using SSMD we can make the SSML example from above a bit more concise and readable:

Jeff: [Puto](es)
Jim:  What? *What* did you just say?
Jeff: [I DIDN'T SAY ANYTHING](vrp: 255).

SSMD offers short-hand notations for a number of SSML constructs such as emphasis. For other things such as language and prosody, words can be tagged by enclosing them in brackets and writing the desired tags after them in parentheses.

In the first line we can see that any 2-letter tag on its own will be interpreted as an ISO language code. Asterisks, which mark emphasis in Markdown, are used for emphasis in SSMD as well. In the last line the tag “vrp: 255” is used. “vrp” stands for “volume, rate, pitch” and the three digits set each of those in turn, with 1 being the lowest possible value (e.g. the softest volume) and 5 being the highest one (e.g. the highest pitch).
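To make the mapping concrete, here is a toy converter that handles only the three notations described above. It is a simplified illustration and not the actual SSMD implementation: `toy_ssmd_to_ssml` and `PROSODY_LEVELS` are hypothetical names, the digit-to-keyword tables are inferred from the “vrp: 255” example, the emitted tags simply mirror the SSML example from earlier, and XML escaping is left out for brevity.

```ruby
# Digits 1..5 mapped to SSML keywords (assumed from the "vrp: 255" example).
PROSODY_LEVELS = {
  volume: %w[x-soft soft medium loud x-loud],
  rate:   %w[x-slow slow medium fast x-fast],
  pitch:  %w[x-low low medium high x-high]
}.freeze

def toy_ssmd_to_ssml(ssmd)
  # [text](vrp: 255) becomes a <prosody> tag.
  body = ssmd.gsub(/\[([^\]]+)\]\(vrp:\s*([1-5])([1-5])([1-5])\)/) do
    text, *digits = Regexp.last_match.captures
    volume, rate, pitch = PROSODY_LEVELS.keys.zip(digits).map do |key, digit|
      PROSODY_LEVELS[key][digit.to_i - 1]
    end
    %(<prosody volume="#{volume}" rate="#{rate}" pitch="#{pitch}">#{text}</prosody>)
  end

  # [text](es) becomes a language annotation.
  body = body.gsub(/\[([^\]]+)\]\(([a-z]{2})\)/) do
    %(<p xml:lang="#{Regexp.last_match(2)}">#{Regexp.last_match(1)}</p>)
  end

  # *text* becomes <emphasis>.
  body = body.gsub(/\*([^*]+)\*/) { "<emphasis>#{Regexp.last_match(1)}</emphasis>" }

  "<speak>#{body}</speak>"
end

puts toy_ssmd_to_ssml("Jim: What? *What* did you just say?")
```

A handful of regular expressions is enough for a demonstration, but a real converter has to deal with nesting, escaping and combined tags, which is where a dedicated tool pays off.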

All our questions and answers are defined as SSMD. Our tool converts this SSMD text to SSML for the request to Polly to synthesise the audio.

Conclusion

In this article we discussed how we use Amazon Polly to synthesise audio for use in our trivia game Askutron. Questions and answers are written in SSMD, which is converted to SSML and sent to the Polly API to download the synthesised audio files. The generated speech sounds quite natural most of the time, and at a price of $4 per 1 million characters it is really affordable.

Check out how we actually use the audio in our game in Mathias’s article.

P.S.

I originally wrote this article on Medium but have since made an effort to collect all the articles I’ve written here.