We had good success bringing our Wunderlist skill live (see our blog post here) and want to explore the possibilities of Alexa further.

Alexa can stream music from sources like Spotify or Amazon Music, but a skill to stream from SoundCloud is missing. So I started to implement one!

Here is a demo.

Playing Your First Track With Alexa

Playing a track in response to a user’s intent only requires adding a PlayDirective to the response (the code is written in Kotlin).

// These constructors don't actually exist in the SDK, they are just used for brevity
val stream = Stream(url = "https://test.com/path/to/my/audio.mp3",
                    token = "audio.mp3",
                    offsetInMilliseconds = 0)
val audioItem = AudioItem(stream = stream)
// REPLACE_ALL starts this stream immediately and discards anything already queued
val directive = PlayDirective(audioItem = audioItem, playBehavior = PlayBehavior.REPLACE_ALL)
val response = SpeechletResponse(directives = listOf(directive))

The most important properties here are url and token. The token serves as an identifier for the stream and is included in all further requests while the stream is playing.

And that’s it. Alexa will now start to play the mp3 file.

Continuous Playback

Of course you usually want to listen to more than one track. So let’s say the initial intent of the user was:

Alexa, open SoundCloud and play my favorites

So how and when do we tell Alexa to play the next track? For a skill with the audio player enabled, Alexa automatically sends some special requests to the skill implementation. On the JVM we can react to these by implementing the AudioPlayer interface in our Speechlet. It provides a hook that is called shortly before a track reaches the end of its playback.

class SoundcloudSpeechlet : SpeechletV2, AudioPlayer {

  override fun onPlaybackNearlyFinished(env: SpeechletRequestEnvelope<PlaybackNearlyFinishedRequest>): SpeechletResponse? {
    // AudioPlayer requests carry no session, so the user id comes from the system state
    val state = env.context.getState(SystemInterface::class.java, SystemState::class.java)
    val userId = state.user.userId
    val nextTrack = loadNextTrack(userId)

    val stream = Stream(url = nextTrack.streamUrl,
                        token = nextTrack.id,
                        // Token of the track that is currently playing
                        expectedPreviousToken = env.request.token,
                        offsetInMilliseconds = 0)
    val audioItem = AudioItem(stream = stream)
    val directive = PlayDirective(audioItem = audioItem, playBehavior = PlayBehavior.ENQUEUE)

    return SpeechletResponse(directives = listOf(directive))
  }

  // Other overrides
}

The important parts are

  • Load the next track. Alexa developers will know the concept of a session. They might be tempted to put the list of tracks into a session attribute and make the loadNextTrack method access the session. Sadly, this is not how sessions work: they are only intended for a “conversation” between a user and Alexa. Starting to play music always ends the session. Instead, the persistence of the list has to be implemented manually. See further down for how I did it.
  • Use the ENQUEUE play behavior. This way Alexa will play the next track seamlessly after the first one has finished.
  • For ENQUEUE to work correctly it is important to use the correct values for token and expectedPreviousToken. For a discussion see the Amazon audio player reference.
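
To make the token chaining more concrete, here is a minimal sketch of how the two values relate. The types StreamSpec and PlaySpec and the helpers playNow and enqueueAfter are hypothetical stand-ins invented for this example, not classes from the Alexa SDK. The assumption is that an enqueued stream names the token of the currently playing stream as its expectedPreviousToken, while the first stream of a playback has none.

```kotlin
// Hypothetical stand-ins for the SDK types, for illustration only.
enum class PlayBehavior { REPLACE_ALL, ENQUEUE }

data class StreamSpec(
    val url: String,
    val token: String,
    val expectedPreviousToken: String?,
    val offsetInMilliseconds: Long
)

data class PlaySpec(val stream: StreamSpec, val behavior: PlayBehavior)

/** First track of a playback: replaces the queue, so there is no previous token. */
fun playNow(url: String, token: String): PlaySpec =
    PlaySpec(StreamSpec(url, token, null, 0L), PlayBehavior.REPLACE_ALL)

/** Follow-up track: enqueued behind the stream identified by currentToken. */
fun enqueueAfter(currentToken: String, url: String, token: String): PlaySpec =
    PlaySpec(StreamSpec(url, token, currentToken, 0L), PlayBehavior.ENQUEUE)

fun main() {
    val first = playNow("https://example.com/a.mp3", "track-a")
    // The token of the playing stream becomes expectedPreviousToken of the next one
    val second = enqueueAfter(first.stream.token, "https://example.com/b.mp3", "track-b")
    println(second.stream.expectedPreviousToken) // prints track-a
}
```

In the handler above, the role of currentToken is played by the token of the PlaybackNearlyFinished request.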

So how do we store the list of tracks to play? Luckily we get a unique identifier with each request, even the ones without a session: the user id. In a request that has a session it is available via envelope.session.user.userId. In requests without a session (i.e. everything from the AudioPlayer interface) it is a bit harder to access, at least in the Java SDK. It’s hidden away in the context. Above you can see how to load it from there.

Since my skill implementation is hosted on AWS Lambda I decided to use DynamoDB for the persistence. I used the userId as the primary key and stored the list of tracks to play, the current position and some other metadata under it.
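
The shape of that per-user record could look roughly like the sketch below. It is not the actual skill code: PlaybackState, PlaybackStore, InMemoryPlaybackStore and peekNextTrackId are hypothetical names, and an in-memory map stands in for DynamoDB so the example runs on its own. A DynamoDB-backed class would implement the same small interface.

```kotlin
// Hypothetical shape of one stored item, keyed by the Alexa user id.
data class PlaybackState(
    val userId: String,
    val trackIds: List<String>,
    val position: Int = 0,
    val offsetInMilliseconds: Long = 0
)

// Minimal storage abstraction; the in-memory version below is only a stand-in.
interface PlaybackStore {
    fun load(userId: String): PlaybackState?
    fun save(state: PlaybackState)
}

class InMemoryPlaybackStore : PlaybackStore {
    private val items = mutableMapOf<String, PlaybackState>()
    override fun load(userId: String): PlaybackState? = items[userId]
    override fun save(state: PlaybackState) { items[state.userId] = state }
}

/** Id of the track after the current position, or null at the end of the list. */
fun peekNextTrackId(store: PlaybackStore, userId: String): String? {
    val state = store.load(userId) ?: return null
    return state.trackIds.getOrNull(state.position + 1)
}

fun main() {
    val store = InMemoryPlaybackStore()
    store.save(PlaybackState("user-1", listOf("t1", "t2", "t3")))
    println(peekNextTrackId(store, "user-1")) // prints t2
}
```

Note that peekNextTrackId only reads the state. Advancing the position is left to the handlers that confirm what Alexa is actually playing.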

Pause, Next, Previous, ...

In addition to the special requests from the AudioPlayer interface, there are special intents that the user can trigger without saying the name of the skill that is playing audio. For example, AMAZON.PauseIntent and AMAZON.NextIntent can be triggered simply with

Alexa, pause


Alexa, next

Since these are handled as normal intent requests, the current token and play offset have to be accessed differently. They are stored in the audio player state.

val audioPlayerState = env.context.getState(AudioPlayerInterface::class.java, AudioPlayerState::class.java)

With this information it is possible to determine the next track in an intent request. Of course the user id is also accessible via the envelope.session.user property.
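
As a rough sketch of that lookup, assuming the stored list contains the same track ids that were used as stream tokens: the helpers nextTrackId and previousTrackId are hypothetical, and currentToken stands for the token read from the audio player state.

```kotlin
/** Track following the one identified by currentToken, or null if unknown or last. */
fun nextTrackId(trackIds: List<String>, currentToken: String?): String? {
    if (currentToken == null) return null
    val index = trackIds.indexOf(currentToken)
    return if (index < 0) null else trackIds.getOrNull(index + 1)
}

/** Track preceding the one identified by currentToken, or null if unknown or first. */
fun previousTrackId(trackIds: List<String>, currentToken: String?): String? {
    if (currentToken == null) return null
    val index = trackIds.indexOf(currentToken)
    return if (index < 0) null else trackIds.getOrNull(index - 1)
}

fun main() {
    val favorites = listOf("t1", "t2", "t3")
    println(nextTrackId(favorites, "t2"))     // prints t3
    println(previousTrackId(favorites, "t2")) // prints t1
    println(nextTrackId(favorites, "t3"))     // prints null
}
```

A null result leaves it to the intent handler to decide what to do at the ends of the list, for example stopping playback or telling the user there is nothing more to play.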

Updating the stored player state for a user is best done in the AudioPlayer request handlers, though. Only then can you be sure that Alexa has reacted correctly.

override fun onIntent(env: SpeechletRequestEnvelope<IntentRequest>): SpeechletResponse {
  when (env.request.intent.name) {
    "AMAZON.PauseIntent" -> {
      // Do NOT store the current offset here, just tell Alexa to stop:
      return stopDirectiveResponse()
    }
    // Other intents
  }
}

override fun onPlaybackStopped(env: SpeechletRequestEnvelope<PlaybackStoppedRequest>): SpeechletResponse? {
  // Now we are sure that playback has stopped: Store the offset
  val state = env.context.getState(SystemInterface::class.java, SystemState::class.java)
  storeOffset(state.user.userId, env.request.offsetInMilliseconds)
  return null // AudioPlayer methods may return null as responses
}
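
The division of labor between the two handlers can be illustrated with a small stand-alone sketch. ResumePoint and ResumeStore are hypothetical names, the store is an in-memory map rather than DynamoDB, and storing the token next to the offset is an assumption made here so that a later resume knows which track the offset belongs to.

```kotlin
// Hypothetical record of where a user's playback stopped.
data class ResumePoint(val token: String, val offsetInMilliseconds: Long)

class ResumeStore {
    private val points = mutableMapOf<String, ResumePoint>()

    // Meant to be called from the PlaybackStopped handler, once Alexa confirms the stop.
    fun storeOffset(userId: String, token: String, offsetInMilliseconds: Long) {
        points[userId] = ResumePoint(token, offsetInMilliseconds)
    }

    // Meant to be called when playback should continue where it stopped.
    fun resumePoint(userId: String): ResumePoint? = points[userId]
}

fun main() {
    val store = ResumeStore()
    // The pause intent itself stores nothing ...
    println(store.resumePoint("user-1")) // prints null
    // ... the offset is only written when the PlaybackStopped request arrives.
    store.storeOffset("user-1", token = "track-a", offsetInMilliseconds = 42_000)
    println(store.resumePoint("user-1")?.offsetInMilliseconds) // prints 42000
}
```

The stored offset can later be passed as offsetInMilliseconds of a new stream to continue the track from that point.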

Next Steps

I now have a good grasp of how to handle audio-playing Alexa skills. They are definitely harder than “normal” skills, since tracking the current state of the audio player and deciding what to persist, and when, requires a lot of attention.

The next steps are more traditional: I have to find a way to make music discoverable and provide more functionality, like

  • Play any playlist (not only favorites)
  • Play another user’s stream

If you have further suggestions or questions about Alexa skills please ask away in the comments!