How to build conversations via the Amazon Echo

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

If you read my post about the startup I was at this year, then you know the toughest challenge we faced was building the finite state machine that could handle conversations. So I read this about the Amazon Echo with great interest:

My task would have ended here if creating an event would only require a date and time. But to be useful, I would like to include a duration, a topic, perhaps even a location. To be even more useful I would like to invite one or more people. That becomes:

Alexa, ask FreeBusy to create an event for Thursday at 2 p.m. for 45 minutes regarding weekly team sync in conference room Bethesda with Michael, Cristi, and Brin.

Wow, that’s a mouthful! No one says all of that in a single utterance especially with the diction and cadence required by today’s voice recognition technology: don’t make long pauses! articulate! don’t smirk! stand up straight! Ok, that last one isn’t a requirement for Alexa, it’s just what my father would say. Even if people would be Ok to give long commands like this they won’t remember the particular order in which you recognize and map to the intent slots. So you have to provide permutations of the slots. In my example I have 6 slots which makes for 6! = 720 permutations. I really do have to supply Alexa with utterances for all those permutations because the grammar changes slightly as we change the order. For instance, when we start with startTime we say:

Alexa, ask FreeBusy to create an event at 2 p.m. for on Thursday
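To make the 720 figure concrete, here is a small sketch that enumerates every ordering of six slots. The slot names are my own guesses based on the fields listed in the quote (only startTime is named there), and the template format is purely illustrative, not the real Alexa sample-utterance syntax.

```python
from itertools import permutations
from math import factorial

# Hypothetical slot names; only "startTime" appears in the quoted post.
SLOTS = ["date", "startTime", "duration", "topic", "location", "attendees"]

# Every possible order in which a speaker might mention the six slots.
orderings = list(permutations(SLOTS))
assert len(orderings) == factorial(len(SLOTS)) == 720


def template(order):
    """Build an illustrative utterance skeleton for one slot ordering."""
    return " ".join("{" + slot + "}" for slot in order)


print(len(orderings))          # 720
print(template(orderings[0]))  # {date} {startTime} {duration} {topic} {location} {attendees}
```

Each of those 720 skeletons would still need its own connecting words, which is why the approach does not scale.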

The only path forward is shorter utterances:

So, for a multitude of reasons, we need to break down the command in a conversation with multiple shorter exchanges which eventually accumulate to give us all the needed parameters. How about this:

Alexa, ask FreeBusy to schedule a meeting for Thursday at 2 p.m.
Ok, how long do you want the meeting to be?

45 minutes

Ok, but you have a conflict at 2:30 p.m. titled catch up over coffee with Dan Marino. Do you still want to schedule a new meeting?

Yes

Ok, what’s the topic for this event?

Weekly team sync

Ok, who should I invite to weekly team sync?

Michael, Cristi, and Brin

Which Michael? Michael Heather or Michael Cerney

Michael Cerney

Ok, but it looks like Michael Cerney is busy at that time. Should I not schedule the meeting or should I not invite Michael Cerney?

Don’t invite Michael

Ok, I created an event for Thursday at 2 p.m. titled weekly sync meeting and I invited Cristi and Brin.

This is very similar to the problem we faced with our user-conversation-hashmap at Celolot:

Now, Alexa SDK offers a session object present in the request and response payloads which you can use to track state while still building a stateless service on your side. And we did try to use it according to the SDK samples but it yields unmaintainable, spaghetti code. Two realities make Session by itself unsuitable:

A single intent with an open literal slot serves to collect input of the meeting topic (“weekly team meeting”) and input of the attendees (“Michael, Cristi, Brin”) and that single intent is used in multiple conversations.

A single intent that captures the utterance “yes” (and another one that captures “no”) are used multiple times in the conversation (and across different conversations) and their use will have very different consequences depending where in the conversation they are uttered.
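One way to picture the "yes" problem is to key the handler on the pair (current state, intent) rather than on the intent alone. This is only a sketch under my own assumptions: the state names and intent names below are invented for illustration and are not taken from the Alexa SDK or from the FreeBusy code.

```python
# Hypothetical state and intent names, for illustration only.
# The same intent maps to different next steps depending on the state.
HANDLERS = {
    ("confirm_conflict", "YesIntent"): "ask_topic",
    ("confirm_conflict", "NoIntent"): "cancelled",
    ("confirm_drop_attendee", "YesIntent"): "create_event",
    ("confirm_drop_attendee", "NoIntent"): "cancelled",
}


def next_step(state, intent):
    """Resolve an intent against the current conversation state."""
    try:
        return HANDLERS[(state, intent)]
    except KeyError:
        raise ValueError(f"{intent} makes no sense in state {state}") from None


print(next_step("confirm_conflict", "YesIntent"))       # ask_topic
print(next_step("confirm_drop_attendee", "YesIntent"))  # create_event
```

With the state carried in the session, a bare "yes" stops being ambiguous, and a "yes" that arrives in a state with no matching entry is rejected rather than silently misrouted.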

Their solution comes close to being a finite state machine:

I propose we formally define a conversation as an ordered sequence of intents.

Except a true finite state machine offers 2 extra features:

1.) there is no fixed order, however…

2.) the FSM polices the state transitions and disallows certain transitions

Possibly they handled #1, because they write:

A linear conversation doesn’t care in what order its object model is filled by utterances and shouldn’t concern itself that an IntentRequest might bring part of the fields in its Slots payload and other fields in its Session payload. All it cares about is “do I have everything I need to carry out this command?”.
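The "do I have everything I need?" check can be sketched roughly as follows. I am assuming plain dictionaries in place of the real Slots and Session payloads, and the field names are hypothetical; the point is only that the order in which the fields arrive does not matter.

```python
# Hypothetical required fields for creating an event.
REQUIRED = ("date", "startTime", "duration", "topic", "attendees")


def merge(session_attrs, slots):
    """Combine values remembered from earlier turns with this turn's slots."""
    merged = dict(session_attrs)
    merged.update({k: v for k, v in slots.items() if v is not None})
    return merged


def missing(fields):
    """Return the required fields that have not been supplied yet."""
    return [name for name in REQUIRED if name not in fields]


# Turns arrive in an arbitrary order; each one fills part of the model.
turns = (
    {"duration": "45 minutes"},
    {"date": "Thursday", "startTime": "2 p.m."},
    {"topic": "weekly team sync"},
)
session = {}
for turn in turns:
    session = merge(session, turn)

print(missing(session))  # ['attendees']
```

The next prompt to the user is then driven by whatever `missing` returns, and the command runs once the list is empty.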

I find it somewhat reassuring that they struggled with the same issues and came to the same conclusions that we came to.

They add some code at the end to track which states have been passed through during the session. This amounts to an ad-hoc, informal finite state machine. This is how we started ourselves, and I think it is a good way to start. You dip your toe in to develop a feel for what is needed. Formalizing this as a real finite state machine is the next step.
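For comparison, a more formal version might look something like the sketch below: an explicit transition table, a check that rejects transitions not in the table, and a history of the states passed through. The state names are my own invention, loosely based on the scheduling dialogue quoted earlier; this is not the FreeBusy code or ours.

```python
class ConversationFSM:
    """A tiny finite state machine that polices state transitions."""

    # Hypothetical states, loosely based on the scheduling dialogue.
    TRANSITIONS = {
        "start": {"ask_duration"},
        "ask_duration": {"confirm_conflict", "ask_topic"},
        "confirm_conflict": {"ask_topic", "cancelled"},
        "ask_topic": {"ask_attendees"},
        "ask_attendees": {"disambiguate_attendee", "done"},
        "disambiguate_attendee": {"ask_attendees", "done"},
    }

    def __init__(self):
        self.state = "start"
        self.history = ["start"]  # states passed through this session

    def advance(self, new_state):
        """Move to new_state, or raise if the transition is not allowed."""
        allowed = self.TRANSITIONS.get(self.state, set())
        if new_state not in allowed:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)


fsm = ConversationFSM()
fsm.advance("ask_duration")
fsm.advance("ask_topic")
print(fsm.history)  # ['start', 'ask_duration', 'ask_topic']
```

Calling `fsm.advance("done")` at this point raises a `ValueError`, because the table does not allow jumping from the topic question straight to the end. That refusal is the policing that the ad-hoc version lacks.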

Post external references

  1. https://freebusy.io/blog/building-conversational-alexa-apps-for-amazon-echo