Polly Workshop Instructions

Amazon Polly is a service that turns text into lifelike speech.

These instructions and demo code show how to get started with Amazon Polly quickly and integrate it into your existing applications.

What you'll do

We will first use the Polly console to convert sample sentences and SSML to speech.

We will then use the Polly API for programmatic generation of speech.

Permissions

The IAM user you're logged in as needs Amazon Polly permissions, which must be granted by whoever administers your AWS account. For the Polly actions and the permissions they require, see http://docs.aws.amazon.com/polly/latest/dg/api-permissions-reference.html
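As a rough guide, an administrator could attach an identity-based policy along the lines below. This is a minimal sketch, not an official AWS policy: it assumes the workshop only needs polly:SynthesizeSpeech (the call the Ruby script makes) and polly:DescribeVoices (a read-only call for listing voices). Check the permissions reference above for what your account actually requires. The policy is built as a Ruby hash and printed as JSON, the format IAM accepts.

```ruby
require 'json'

# Hypothetical minimal policy for this workshop (an assumption, not an
# official AWS policy): allow speech synthesis and listing voices.
def polly_workshop_policy
  {
    'Version' => '2012-10-17',
    'Statement' => [
      {
        'Effect' => 'Allow',
        'Action' => ['polly:SynthesizeSpeech', 'polly:DescribeVoices'],
        'Resource' => '*'
      }
    ]
  }
end

# Print the policy as JSON so it can be pasted into the IAM console.
puts JSON.pretty_generate(polly_workshop_policy)
```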

Getting Started

Before proceeding, you'll need an AWS account. For this workshop, choose US East (N. Virginia) (us-east-1) as your region.

Polly generates speech from both text and SSML (Speech Synthesis Markup Language). SSML allows you to customize and control things like pronunciation, volume, and speed.
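Because SSML is XML, plain text placed inside a speak element must have XML's reserved characters escaped (for example, & becomes &amp;). The sketch below assumes you want to reuse ordinary text as SSML; to_ssml is a hypothetical helper name, not part of any SDK.

```ruby
# Map each XML-reserved character to its entity reference.
XML_ESCAPES = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&apos;'
}.freeze

# Hypothetical helper: escape plain text in a single pass and wrap it
# in a <speak> root element.
def to_ssml(text)
  "<speak>#{text.gsub(/[&<>"']/, XML_ESCAPES)}</speak>"
end

puts to_ssml('Tom & Jerry say "hi" <loudly>')
# prints <speak>Tom &amp; Jerry say &quot;hi&quot; &lt;loudly&gt;</speak>
```

Escaping in one gsub pass means an & introduced by an earlier replacement is never escaped a second time.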

SSML

Example 1

The first SSML snippet uses tags that read digits as a telephone number and change the speaking rate.

<speak>
  Thank you for calling today. Your phone number is <say-as interpret-as="telephone">2122241555</say-as>. 
  <prosody rate="x-slow"> I know </prosody> this is not a number, hence I will not say 2122241555.
</speak>
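If you generate this kind of SSML from application data, small helpers keep the markup consistent. This sketch rebuilds Example 1 from a phone number. The helper names are hypothetical, and only the named prosody rates (x-slow through x-fast) are accepted; percentage rates, which Polly also supports, are left out for brevity.

```ruby
# Named speaking rates for <prosody rate="...">; percentages such as
# "80%" are omitted from this sketch.
NAMED_RATES = %w[x-slow slow medium fast x-fast].freeze

# Hypothetical helper: read the digits as a telephone number.
def say_as_telephone(number)
  %(<say-as interpret-as="telephone">#{number}</say-as>)
end

# Hypothetical helper: wrap text in a <prosody> element with a named rate.
def with_rate(text, rate)
  raise ArgumentError, "unsupported rate: #{rate}" unless NAMED_RATES.include?(rate)

  %(<prosody rate="#{rate}">#{text}</prosody>)
end

# Rebuild Example 1; as in the original snippet, the second occurrence of
# the number is left without a <say-as> tag.
def example1_ssml(number)
  "<speak>Thank you for calling today. Your phone number is #{say_as_telephone(number)}. " \
    "#{with_rate('I know', 'x-slow')} this is not a number, hence I will not say #{number}.</speak>"
end

puts example1_ssml('2122241555')
```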

Example 2

The second snippet adds tags for a pause, aliasing (speaking "World Wide Web Consortium" in place of the written W3C), volume, and a whispered voice effect.

<speak>
  He was caught up in the game. <break time="1s"/> 
  In the middle of the 10/3/2014 <sub alias="World Wide Web Consortium">W3C</sub> meeting, he shouted <prosody volume="x-loud">Score!</prosody> quite loudly.
  When his boss stared at him, he repeated <amazon:effect name="whispered">"Score"</amazon:effect>, in a whisper.
</speak>
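The same approach works for the tags in Example 2. Each hypothetical helper below wraps one tag: break for a pause, sub for an alias, prosody volume for loudness, and amazon:effect, an Amazon-specific extension rather than standard SSML, for whispering. It is a sketch for building snippets, not a validator.

```ruby
# Hypothetical helper: pause for a whole number of seconds, e.g. <break time="1s"/>.
def pause(seconds)
  %(<break time="#{seconds}s"/>)
end

# Hypothetical helper: speak alias_text in place of the written text.
def substitute(text, alias_text)
  %(<sub alias="#{alias_text}">#{text}</sub>)
end

# Hypothetical helper: x-loud is the loudest named prosody volume.
def loud(text)
  %(<prosody volume="x-loud">#{text}</prosody>)
end

# Hypothetical helper: Amazon-specific whispered effect.
def whisper(text)
  %(<amazon:effect name="whispered">#{text}</amazon:effect>)
end

# Rebuild Example 2 from the helpers above.
def example2_ssml
  sentences = [
    "He was caught up in the game. #{pause(1)}",
    "In the middle of the 10/3/2014 #{substitute('W3C', 'World Wide Web Consortium')} meeting, " \
      "he shouted #{loud('Score!')} quite loudly.",
    "When his boss stared at him, he repeated #{whisper('"Score"')}, in a whisper."
  ]
  "<speak>#{sentences.join(' ')}</speak>"
end

puts example2_ssml
```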

Additional Tags

For the full list of supported SSML tags and what they do, see the Polly documentation: http://docs.aws.amazon.com/polly/latest/dg/supported-ssml.html

API

You can integrate text-to-speech generation into your own applications by calling the Polly API.

You can use any language; this example uses Ruby. To run it, you need a modern version of Ruby and the AWS SDK for Ruby gem (aws-sdk).

Step 1

Make sure you have Ruby installed:

$ ruby --version
ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin16]
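You can also check the version from Ruby itself. This sketch compares RUBY_VERSION against a minimum using RubyGems' Gem::Version. The 2.3.0 floor is an assumption taken from the sample output above, not a documented SDK requirement, so check the aws-sdk gem's requirements for the SDK version you install.

```ruby
require 'rubygems'

# Assumed minimum version (taken from the sample output above, not from
# the SDK's documented requirements).
MIN_RUBY = Gem::Version.new('2.3.0')

# True when current (by default the running Ruby) is at least minimum.
def ruby_new_enough?(minimum = MIN_RUBY, current = RUBY_VERSION)
  Gem::Version.new(current) >= minimum
end

status = ruby_new_enough? ? 'OK' : "older than #{MIN_RUBY}"
puts "Ruby #{RUBY_VERSION}: #{status}"
```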

Step 2

Install the aws-sdk Ruby gem:

$ gem install aws-sdk
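To confirm the install worked, you can ask RubyGems directly. This is a minimal sketch; gem_installed? is a hypothetical helper built on Gem::Specification.find_all_by_name. If you prefer a smaller install, the SDK also publishes per-service gems such as aws-sdk-polly, which you would require as 'aws-sdk-polly' instead of 'aws-sdk'.

```ruby
require 'rubygems'

# True if any version of the named gem is installed for this Ruby.
def gem_installed?(name)
  !Gem::Specification.find_all_by_name(name).empty?
end

if gem_installed?('aws-sdk')
  puts 'aws-sdk is installed'
else
  puts 'aws-sdk not found; run: gem install aws-sdk'
end
```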

Step 3

Create a file called aws.yml and add your AWS credentials to it.

#
# change this to your specific id and secret
#
access_key_id: ABCDEFGHIJKLMNOPQRST 
secret_access_key: ABCDEFGHIJKabcdefghijk/abcdefghABCDEFGHI                   
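A slightly more defensive way to load this file is sketched below. It uses YAML.safe_load, reports which keys are missing, and, as an assumption about your setup, falls back to the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables when aws.yml does not exist. load_aws_credentials is a hypothetical name, not part of the AWS SDK.

```ruby
require 'yaml'
require 'tempfile' # only used by the usage example at the bottom

# Keys the synthesizer script expects to find in aws.yml.
REQUIRED_KEYS = %w[access_key_id secret_access_key].freeze

# Hypothetical helper: load credentials from a YAML file, or from the
# standard AWS environment variables if the file does not exist.
def load_aws_credentials(path = 'aws.yml', env = ENV)
  creds =
    if File.exist?(path)
      YAML.safe_load(File.read(path)) || {} # a comments-only file parses to nil
    else
      {
        'access_key_id' => env['AWS_ACCESS_KEY_ID'],
        'secret_access_key' => env['AWS_SECRET_ACCESS_KEY']
      }
    end

  missing = REQUIRED_KEYS.select { |key| creds[key].to_s.strip.empty? }
  raise ArgumentError, "missing credentials: #{missing.join(', ')}" unless missing.empty?

  creds
end

# Usage: load placeholder values from a temporary YAML file.
Tempfile.create(['aws', '.yml']) do |f|
  f.write("access_key_id: AKIDEXAMPLE\nsecret_access_key: secretEXAMPLE\n")
  f.flush
  p load_aws_credentials(f.path)
end
```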

Step 4

Create a file named synthesizer.rb with the following code:

#!/usr/bin/env ruby

#
# Install the following gems from your command line:
#     gem install 'aws-sdk'
#
require 'aws-sdk'
require 'yaml' # lowercase: require 'YAML' fails on case-sensitive file systems

class Synthesizer
  SSML1 = <<-eod
    <speak>
      Thank you for calling today. Your phone number is <say-as interpret-as="telephone">2122241555</say-as>. <prosody rate="x-slow"> I know </prosody> this is not a number, hence I will not say 2122241555.
    </speak>
  eod

  SSML2 = <<-eod
    <speak>
      He was caught up in the game. <break time="1s"/> In the middle of the 10/3/2014 <sub alias="World Wide Web Consortium">W3C</sub> meeting, he shouted <prosody volume="x-loud">Score!</prosody> quite loudly. When his boss stared at him, he repeated <amazon:effect name="whispered">"Score"</amazon:effect>, in a whisper.
    </speak>
  eod

  def initialize(region: 'us-east-1')
    #
    # N.B. 1: Do not hard-code credentials in source code. 
    # N.B. 2: Do not source control a credentials file.
    #
    creds       = YAML.load(File.read 'aws.yml')
    credentials = Aws::Credentials.new(creds['access_key_id'], creds['secret_access_key'])

    # Polly client
    @polly = Aws::Polly::Client.new(
      region: region,
      credentials: credentials
    )
  end

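  # Method for plain-text-to-speech conversion. It needs its own name: Ruby has
  # no method overloading, so a second `def synthesize` would silently replace it.
  def synthesize_text(text:, response_target: 'speech.mp3', output_format: 'mp3', voice_id: 'Joanna')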
    request = {
      response_target: response_target,
      output_format: output_format,
      voice_id: voice_id,
      text_type: 'text',
      text: text
    }

    @polly.synthesize_speech(request)
  end

  # Method for SSML to speech conversion
  def synthesize(ssml:, response_target: 'speech.mp3', output_format: 'mp3', voice_id: 'Joanna')
    request = {
      response_target: response_target,
      output_format: output_format,
      voice_id: voice_id,
      text_type: 'ssml',
      text: ssml
    }

    @polly.synthesize_speech(request)
  end
end

synth = Synthesizer.new
synth.synthesize response_target: 'speech1.mp3', ssml: Synthesizer::SSML1
synth.synthesize response_target: 'speech2.mp3', ssml: Synthesizer::SSML2
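To see what the script sends without calling AWS, you can build the same request hash offline. This is a sketch with a hypothetical build_request helper. The accepted output formats (mp3, ogg_vorbis, pcm, json) follow the Polly SynthesizeSpeech API reference; the check that SSML begins with a speak element is a simple sanity test added for illustration.

```ruby
# Output formats accepted by Polly's SynthesizeSpeech (json returns speech marks).
OUTPUT_FORMATS = %w[mp3 ogg_vorbis pcm json].freeze
TEXT_TYPES = %w[text ssml].freeze

# Hypothetical helper: build the hash passed to Aws::Polly::Client#synthesize_speech.
def build_request(text:, text_type: 'text', response_target: 'speech.mp3',
                  output_format: 'mp3', voice_id: 'Joanna')
  raise ArgumentError, "bad text_type: #{text_type}" unless TEXT_TYPES.include?(text_type)
  raise ArgumentError, "bad output_format: #{output_format}" unless OUTPUT_FORMATS.include?(output_format)
  # Allow leading whitespace (as in the heredocs above) before <speak>.
  if text_type == 'ssml' && text !~ /\A\s*<speak[\s>]/
    raise ArgumentError, 'SSML input must be wrapped in <speak>'
  end

  {
    response_target: response_target,
    output_format: output_format,
    voice_id: voice_id,
    text_type: text_type,
    text: text
  }
end

# Dry run: inspect the request for an SSML snippet instead of sending it.
p build_request(text: '<speak>Hello <break time="1s"/> world</speak>', text_type: 'ssml')
```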

Step 5

Run the script in either of these ways:

$ ruby synthesizer.rb

or

$ chmod +x synthesizer.rb
$ ./synthesizer.rb

Step 6

The script calls the Polly API and saves the speech for the two SSML snippets to speech1.mp3 and speech2.mp3 in the current directory. Play the files with your computer's built-in audio player.
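If you would rather check and play the results from a script, here is a sketch. The helper names are hypothetical. afplay ships with macOS; mpg123 on Linux is an assumption and may need to be installed first. On other systems the sketch simply tells you to open the files yourself.

```ruby
require 'rbconfig'

# Hypothetical helper: return the files that are missing or empty.
def missing_outputs(paths)
  paths.reject { |path| File.size?(path) }
end

# Hypothetical helper: pick a command-line player for the OS, or nil.
def player_command(path, host_os = RbConfig::CONFIG['host_os'])
  case host_os
  when /darwin/ then ['afplay', path]
  when /linux/ then ['mpg123', '-q', path]
  end
end

outputs = %w[speech1.mp3 speech2.mp3]
missing = missing_outputs(outputs)
if missing.empty?
  outputs.each do |file|
    cmd = player_command(file)
    cmd ? system(*cmd) : puts("Open #{file} in your audio player")
  end
else
  warn "Not generated (or empty): #{missing.join(', ')}"
end
```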