How to Make Unity Speak

Using The Tech Behind Amazon Alexa for Text-To-Speech in a Unity Game

Screenshot: A Japanese question in Askutron Quiz Show

This post is about using synthetic speech generated by Amazon Polly to read out text in our game Askutron Quiz Show. To find out how the synthetic speech audio is actually generated, check out this blog post by my brother: Using AWS Polly to bring our game to life.

Making use of the synthetic speech generated on our server using Amazon Polly is relatively easy in Unity. For this we first need a URL that tells us where to download the audio file for each question and answer. While you could download or even stream the audio directly from Amazon’s web service, we’re using a Ruby app on our server as a mediator. It caches audio files once they have been generated, which reduces costs and improves performance.

Our game receives these URLs as part of the quizzes, which are supplied as JSON files that look similar to this:

{
    "text": "Who is the most badass female character in the \"Harry Potter\" novels?",
    "language": "en",
    "mode": "opinion",
    "audio": "/audios/79470/download",                    
    "answers": [
        {
            "text": "Hermione Granger",
            "correct": true,
            "audio": "/audios/31/download"
        },
        {
            "text": "Ginny Weasley",
            "correct": true,
            "audio": "/audios/79471/download"
        },
        {
            "text": "Minerva McGonagall",
            "correct": true,
            "audio": "/audios/4918/download"
        },
        {
            "text": "Bellatrix Lestrange",
            "correct": true,
            "audio": "/audios/741/download"
        }
    ]            
}

For questions that we created ahead of time, the audio files have already been generated and a URI reference to each of them is found in the JSON data. For questions created by users, the audio is generated on the fly. In that case the game dynamically builds a URL such as /audios/playback?ssmd=You+said+*what*?&language=en to query the audio for arbitrary phrases.
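As a rough sketch of how such a playback URL could be assembled: the helper below is hypothetical (the class and method names are made up for this example and are not taken from the game), and it assumes the server decodes standard percent-encoding. It uses Uri.EscapeDataString so that characters such as & or ? inside a phrase cannot be mistaken for query separators.

```csharp
using System;

public static class PlaybackUri
{
    // Hypothetical helper: builds the relative playback URL for an arbitrary phrase.
    // The path and parameter names ("ssmd", "language") follow the example URL above.
    public static string Build(string text, string language)
    {
        // EscapeDataString percent-encodes reserved characters such as '&', '?', '+' and spaces.
        return "/audios/playback?ssmd=" + Uri.EscapeDataString(text)
             + "&language=" + Uri.EscapeDataString(language);
    }
}
```

For example, PlaybackUri.Build("Tom & Jerry", "en") yields /audios/playback?ssmd=Tom%20%26%20Jerry&language=en, so the ampersand in the phrase stays part of the ssmd value.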

The following code is a simplified example of how to load and play the questions this way. It defines a TextToSpeech class that uses our server to query audio as needed.

using System;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

namespace Goldsaucer.Askutron.TTS
{
    public class TextToSpeech
    {
        private const string ServerUrl = "https://example.com";
        
        public readonly string Text;
        public readonly string Language;
        public readonly string AudioUri;

        public AudioClip AudioClip { get; private set; }
        
        public TextToSpeech(string text, string language, string audioUri = null)
        {
            Text = text;
            Language = language;
            AudioUri = audioUri;
        }

        /// <summary>
        /// Generate HTTP headers required to fetch audio from the server.
        /// </summary>
        /// <param name="phrase">text for which to download the audio</param>
        /// <returns>A dictionary with the HTTP headers</returns>
        private static Dictionary<string, string> GenerateRequestHeaders(TextToSpeech phrase)
        {
            var headers = new Dictionary<string, string>();

            // [...] add required authentication headers etc.

            return headers;
        }

        /// <summary>
        /// Load the question's audio
        /// </summary>
        public IEnumerator Load()
        {
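            // Escape the text as a query value so that '&', '?' or '+' in a phrase cannot break the URL.
            var uri = AudioUri ?? $"/audios/playback?ssmd={Uri.EscapeDataString(Text)}&language={Uri.EscapeDataString(Language)}";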
            var headers = GenerateRequestHeaders(this);
            var request = new WWW(ServerUrl + uri, null, headers);

            yield return request;

            AudioClip = request.GetAudioClipCompressed(false, AudioType.OGGVORBIS);
        }

        /// <summary>
        /// Read the question out aloud
        /// </summary>
        public IEnumerator Play()
        {
            AudioSource.PlayClipAtPoint(AudioClip, Vector3.zero);

            yield return new WaitForSeconds(AudioClip.length);
        }
    }
}

A class similar to this is used in Askutron Quiz Show. As you can see, the audio is downloaded either from the supplied audio URI or from a dynamically generated URI that contains the text as a query argument. For security reasons the server only accepts these requests when the right HTTP headers, such as an authentication token, are provided.

Obviously this code works for any audio downloaded from the server, not just synthetic speech. If you don’t want to generate the audio on a server with Ruby as described in Markus’ blog post, there is an AWS C# SDK you can use to do this with C#, even directly in your game. For an example of how to achieve this, see Chris Bitting’s great blog post about using Amazon Polly from .NET / C#.

The AudioClips are downloaded and created using Unity’s built-in WWW class and the GetAudioClipCompressed extension method. It doesn’t matter if the audio is loaded from the server or directly from the file system (using file:/// URLs). During the introduction of each round, all questions and answers are loaded. The audio files are then played in order to read the questions and answers aloud.

public IEnumerator SpeakUp()
{
    var tts = new TextToSpeech("Hello World", "en");
    yield return tts.Load();
    yield return tts.Play();
}

The actual code used in our game is naturally a bit more complicated than that, but this is what it boils down to. Among other things, it adds error handling and, most importantly, a local file cache to avoid downloading the same audio file twice.
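To illustrate the caching idea, here is a minimal sketch. It is not the game's actual code: the AudioCache class, its method names, the SHA-256 based file naming and the synchronous download delegate are all assumptions made for this example. In Unity the download would run inside a coroutine, and the cache directory would typically be something like Application.temporaryCachePath.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class AudioCache
{
    // Hypothetical helper: maps an audio URI to a stable file path inside cacheDirectory.
    // Hashing the URI avoids characters that are not allowed in file names.
    public static string GetCachePath(string cacheDirectory, string audioUri)
    {
        using (var sha = SHA256.Create())
        {
            var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(audioUri));
            var name = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            return Path.Combine(cacheDirectory, name + ".ogg");
        }
    }

    // Returns the cached bytes if the file exists; otherwise calls the
    // supplied download function once, stores the result and returns it.
    public static byte[] GetOrAdd(string cacheDirectory, string audioUri, Func<string, byte[]> download)
    {
        var path = GetCachePath(cacheDirectory, audioUri);
        if (File.Exists(path))
            return File.ReadAllBytes(path);

        var data = download(audioUri);
        Directory.CreateDirectory(cacheDirectory);
        File.WriteAllBytes(path, data);
        return data;
    }
}
```

Since the audio can be loaded from file:/// URLs just as well as from the server, a cached file found this way could be handed to the same loading code instead of triggering a new download.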

Amazon Polly currently supports around 18 languages, plus regional variants of English, Portuguese, Spanish and French. Thanks to this, our game can read content in any of these languages, and we can fully voice even quizzes created by players using the editor that is included with the game.

Overall I’m pretty happy with the quality of the synthetic speech generated by Amazon Polly. There are some rough edges here and there, but at times people didn’t even realize the speech was synthetic when we showcased the game. Mostly it just works. And when it doesn’t, we can still fine-tune the pronunciation using SSMD.

If you want to see this used in action, feel free to check out Askutron Quiz Show on Steam!