Alexa, design for ambient voice interactions… please?

By Zane Coffin

We at Edgar Allan are sometimes victims of our own excitement. Case in point: We recently took a stab at designing and building our own minimum viable Alexa Skill — one that would walk people through a project review using our Retro Cards. Along the way, we realized we knew a few things about talking but that we had some stuff to learn about how to design for voice.

Conversations from the Uncanny Valley of Voice

Ambient voice interactions have been around for a few years but still feel very new to most of us. Perhaps this is because the reality rarely lives up to the expectations. You can’t just talk to Alexa, Google, or Siri in public without first code-switching so aggressively that it jars you into shamefully thinking, “I’m talking to a robot. Out loud,” when you’d have no problem, say, having a phone conversation with a friend in the same space.

I think it’s the uncanny valley of speech smacking its disembodied lips. You’ve seen the frightening robot face, but I think it’s a disconnection we all feel viscerally, even if there are no visuals. “Is that even a thing?” you might ask. Yes (shiver). It is.

Think of any horror film. Great horror directors inject not just visually unsettling elements but odd speech patterns and sounds meant to add to your discomfort. Though not strictly horror, let’s use a classic scene provided by David Lynch for confirmation.

Yep. That’s uncomfortable. But what does that mean for designing for ambient voice?

Like UX without the UI

Practically speaking, designing and building an ambient voice experience feels like designing a website without the visuals. We used the Skill building interface, courtesy of Voiceflow Creator, and to their credit, it feels very similar to building a typical user flow. You have blocks that represent interaction points and linking flow lines that represent potential paths to the next block. The touch points are less involved than in site-building, but the potential number of paths out of each point can be much higher. Otherwise, the whole thing would be a familiar task for UX designers.

*The Voiceflow interface looks familiar enough to UX designers*

One of our primary goals as designers is to maximize a user’s comfort with the task we want to help them perform. So, given the uncanny voice valley and the relative “newness” of talking to computers out loud, let’s discuss a few specific barriers and opportunities when designing Skills for Alexa.

The first question to ask is whether voice interaction is the right tool for the job.

Tip 1: Minimize cognitive load.

Without a visual reference, a user has to keep a working memory of their options for moving forward while also recalling what was just asked of them. A long list of scripted options will inevitably slow the rate of interaction or, worse, bring it to a full stop. Keep your interactions simple to maintain positive momentum. Making sure that every input and output is easy to say, hear, and remember is a good way to keep these barriers low. A moving bicycle is harder to topple than one that is standing still.
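
To make the "short list" idea concrete, here is a minimal sketch in plain Python of how a spoken prompt might cap the number of options read aloud at once. It is not tied to Voiceflow or the Alexa SDK. The function names, the cap of three options, and the category names in the usage note are all hypothetical choices made for illustration.

```python
from typing import List


def join_spoken(items: List[str]) -> str:
    """Join items the way a person would say them aloud."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} or {items[1]}"
    return ", ".join(items[:-1]) + f", or {items[-1]}"


def build_prompt(options: List[str], max_options: int = 3) -> str:
    """Build a prompt that reads out at most `max_options` choices.

    The cap of three is an assumption for this sketch, not a rule
    from Alexa or Voiceflow.
    """
    if not options:
        raise ValueError("at least one option is required")
    if max_options < 1:
        raise ValueError("max_options must be at least 1")

    shown = options[:max_options]
    prompt = f"You can choose {join_spoken(shown)}."
    if len(options) > max_options:
        # Offer a way to hear the rest instead of reading the whole list.
        prompt += " Or say more to hear other options."
    return prompt
```

With four hypothetical categories, `build_prompt(["wins", "challenges", "surprises", "lessons"])` reads out only the first three and then offers a way to hear the rest, so the listener never has to hold more than a few choices in mind.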

Tip 2: Don’t just ask the user questions to get them where you want them to go.

So, momentum is good. Essential, really. But focusing only on how quickly you can get to the end of an interaction can make the user feel like they are being interrogated. In casual conversation, people acknowledge a response before following up with another question or comment.

In our Alexa Skill example, we asked the user, “Would you like to pick a question category or let fate decide?” For the sake of this example, let’s say they choose to let fate decide. To provide a natural and conversational experience, we acknowledge the selection by saying, “Great choice, let’s get started then.” This accomplishes a couple of things: first, it confirms that their choice was received and validated; second, it provides the user with a more, well, conversational style of conversation.
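
The acknowledge-then-continue pattern can be sketched in a few lines of plain Python. This is an illustration only, not how our Skill or Voiceflow implements it. The `respond` function, the choice keys, and every phrase other than "Great choice, let's get started then." are assumptions invented for the example.

```python
# Hypothetical mapping from a recognized choice to a short acknowledgement.
# Only the "fate" phrase comes from the example above; the rest is invented.
ACKNOWLEDGEMENTS = {
    "fate": "Great choice, let's get started then.",
    "category": "Sure, let's pick a category.",
}

# Neutral fallback used when the choice has no tailored acknowledgement.
DEFAULT_ACKNOWLEDGEMENT = "Okay."


def respond(choice: str, next_prompt: str) -> str:
    """Acknowledge the user's choice, then continue with the next prompt."""
    acknowledgement = ACKNOWLEDGEMENTS.get(choice, DEFAULT_ACKNOWLEDGEMENT)
    return f"{acknowledgement} {next_prompt}"
```

Calling `respond("fate", "Here is your first question.")` puts the acknowledgement ahead of the next prompt, so the user hears their choice confirmed before the conversation moves on.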

Tip 3: Targeting speed of interaction over a perfect interaction is more natural.

This is similar to the concept of breaking large tasks into small ones for typical visual interactions. A commonly used onboarding tactic in web design is breaking a large form up into smaller sections so you don’t overwhelm the user with a giant to-do list. The concept is simply to keep the anticipation of effort or fatigue as low as possible. There is a catch, however. You can become reductionist to the point of making the effort an endless procession of tiny interactions that never feel like they go anywhere. Just like the visual design version of this, it’s all about balance and compromise.

The real difference with voice is that you can’t just draw a line between listed options and place a rewarding “next” button at the end of a spoken statement. So how do you keep the Skill flexible while maintaining its utility?

Honestly, it seems that this lack of a “next” reward is what everyone designing for ambient voice is grappling with. The power of ambient voice is still limited, and the best way to avoid making a bad experience is simply knowing when to opt out.

A Side Note:

But…without visuals, how do we align with our brand, or create one?

It’s a good question. Aside from the seldom seen Skill icon, all we have to work with is the way our Skill sounds. This means custom narration is a must if your project has an associated brand.

Alexa Skill names are not own-able, so it is important, at the very least, to reinforce the name of your Skill often within the experience. Choose a name that is a useful utterance and brand reinforcement will take care of itself. There are other things you can do as well, using diction, word choice and terminology…the point is to get a little creative. Not all robots have to sound alike, creepily or un-creepily.

Web designers have had decades to erect a glossy barrier between the people using their designs and the hard, cold code that sits behind them. It’s to everyone’s great advantage that we don’t have to make friends with the command line. The ambient voice-scape is still green, and we’re pretty excited to talk to that first intuitive Verbal User Interface.