Wednesday, November 29, 2017

Making a Smart Speaker - Part 4

In this exciting episode of "Making a Smart Speaker", we'll work on the AI that will do our bidding. We're already using some massive neural networks and pretrained models just to do voice recognition, but here is where we get to be creative about our AI. What do we want it to do? How freeform should we allow the input to be? And, most importantly... how the hell are we going to do hotword detection?

The answer to that question was a single Google search away: Snowboy, an offline hotword detector designed with the Internet of Things in mind. It's free for non-commercial use and has bindings for our preferred language: JavaScript. So go to the site and make a model with your hotword. It takes no time at all.


The site gave me a model file to feed to Snowboy for hotword detection. Here's the basic flow of how things will work - it's an obvious flow, but it's important to write these things down sometimes.
  1. Device listens for the hotword
  2. When the hotword is received, make a connection to Google Cloud Speech
  3. When the first full line of text is transcribed (or 15 seconds of silence is reached), terminate the connection and process what we have.
  4. Do what was asked (this part is hard but we'll get to this later)
  5. Issue verbal response if required by action
The "do what was asked" part is going to be a doozy, that's for sure. One thing is certain, however, our freeform speech will be a far cry from what Alexa and, of course, what Google can do. They have great AI and massive amounts of data to train on. I have me.

First, let's write the hotword detection feature. Snowboy makes this really easy. Again, I'll be putting screenshots of my code here instead of the code itself; all of the code is linked in the GitHub repo at the end. I'll also note the SHA of each step.
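For reference, the listener is structured roughly like this, using the snowboy Node bindings with node-record-lpcm16 feeding it microphone audio. This is a sketch based on Snowboy's own example rather than my exact code; the model file name and sensitivity are placeholders for whatever you downloaded from the site.

    const record = require('node-record-lpcm16');
    const { Detector, Models } = require('snowboy');

    // Load the personal model downloaded from the Snowboy site.
    const models = new Models();
    models.add({
      file: 'resources/hey_calvin.pmdl',   // placeholder name for your model
      sensitivity: '0.5',
      hotwords: 'hey calvin'
    });

    const detector = new Detector({
      resource: 'resources/common.res',    // ships with the snowboy package
      models: models,
      audioGain: 2.0
    });

    detector.on('hotword', (index, hotword) => {
      console.log('Hotword detected:', hotword);
      // This is where we'll play the "I'm listening" sound and open the speech connection.
    });

    // Pipe raw microphone audio straight into the detector.
    const mic = record.start({ threshold: 0, verbose: false });
    mic.pipe(detector);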

At 1d761d5, the hotword detection worked perfectly. When I said "Hey Calvin," it got it right, and it was surprisingly accurate as well. I even used my USB microphone. It worked really, really well. Too well, in fact. It triggered on me just saying "Calvin". I don't know what quality of model I expected when it was trained on only three samples.

The next step is to make the "I'm listening" sound. I want to play a WAV file when the device is listening, so I made a sound and uploaded it to the device. Playing it was fairly straightforward: load the WAV and stream it to the audio output. This was done at 54ccde1.
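A minimal sketch of that, assuming the wav and speaker npm packages are what's doing the work (the repo may do it a bit differently):

    const fs = require('fs');
    const wav = require('wav');
    const Speaker = require('speaker');

    // Stream a WAV file to the default audio output; call done() when playback finishes.
    function playWav(path, done) {
      const reader = new wav.Reader();
      reader.on('format', format => {
        const speaker = new Speaker(format);   // open the output with the file's own format
        speaker.on('close', () => done && done());
        reader.pipe(speaker);
      });
      fs.createReadStream(path).pipe(reader);
    }

    playWav('sounds/listening.wav');   // placeholder path for the "I'm listening" sound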

The next step was to integrate the speech example from Google into the mix. We've already tested this, and it worked really well. So I reorganized some stuff into functions and added the TTS engine. I'm using Pico - which I think is the same one built into Android - to do the synthesis. It works well, but by default it's a female voice. That won't work if I want to call this thing Calvin. But the initial framework of listening for the hotword, playing a sound to signify it's listening, and echoing a single command back is functional at 14eb4ab.
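On the TTS side, here's a minimal sketch of one way to drive Pico - assuming it's installed as the pico2wave command-line tool and aplay handles playback; the helper name and temp path are made up for illustration:

    const { execFile } = require('child_process');

    // Synthesize `text` with Pico TTS and play the resulting WAV through ALSA.
    function say(text, done) {
      const out = '/tmp/calvin-tts.wav';
      execFile('pico2wave', ['-w', out, text], err => {
        if (err) return done && done(err);
        execFile('aplay', [out], done);
      });
    }

    say("Hello, I'm Calvin.");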

However, I wasn't happy with the lack of control I had over the speech recognition, so I looked for a more home-brewed solution. I decided to look at how Google Chrome does voice recognition, and it turns out they use a full-duplex API. You can see how it works at 02bb737. I stored the API Key in the configuration file, but didn't commit it to the repo. You can find what Chrome's API key is by inspecting HTTP traffic when Chrome is doing voice recognition.
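For the curious, my understanding of that full-duplex API (pieced together from reverse-engineered write-ups, so treat the endpoints and parameters below as assumptions) is that you open two HTTP connections tied together by a random pair key: one POSTs raw audio up, the other streams JSON transcription results down. Something along these lines:

    const https = require('https');
    const crypto = require('crypto');
    const record = require('node-record-lpcm16');

    const API_KEY = process.env.SPEECH_API_KEY;           // kept in the config file, not the repo
    const pair = crypto.randomBytes(8).toString('hex');   // ties the up and down streams together

    // "down" stream: Google pushes JSON transcription results here as they arrive.
    https.get(`https://www.google.com/speech-api/full-duplex/v1/down?pair=${pair}`, res => {
      res.on('data', chunk => console.log('result:', chunk.toString()));
    });

    // "up" stream: POST raw 16 kHz PCM from the microphone.
    const up = https.request({
      method: 'POST',
      host: 'www.google.com',
      path: `/speech-api/full-duplex/v1/up?key=${API_KEY}&pair=${pair}&lang=en-US&output=json`,
      headers: { 'Content-Type': 'audio/l16; rate=16000' }
    });
    record.start({ threshold: 0 }).pipe(up);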

It was at this point that the USB microphone I ordered for this smart speaker arrived. We'll drill a hole for it later, but it's a super duper cheap one. We'll continue development with it instead of my Blue Snowball.


Back to the programming. So the code can listen, transcribe, and speak. Now we have to make it understand and do. This is probably the hardest part of them all.

There are a couple of ways we could go about this, each approach being more tricky than the last. The easiest way is to have predetermined patterns for requests. Things like "play <song name> by <artist name>" or something like that. This allows for no wiggle room in the way you can talk to the device. However, because we're just looking for patterns, it'll be super easy to implement. But that wouldn't make our smart speaker very smart, would it?
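As a sketch of what that first approach looks like (the pattern and handler here are made up purely for illustration):

    // The rigid approach: one regular expression per command, no wiggle room.
    const commands = [
      {
        pattern: /^play (.+) by (.+)$/i,
        handler: (song, artist) => console.log(`Playing "${song}" by ${artist}`)
      }
    ];

    function handle(text) {
      for (const { pattern, handler } of commands) {
        const match = text.match(pattern);
        if (match) return handler(...match.slice(1));
      }
      console.log("Sorry, I didn't understand that.");
    }

    handle('play Karma Police by Radiohead');   // matches the pattern
    handle('could you put on some Radiohead');  // falls flat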

No, so we can get fancier with this. Much like the way the original Siri worked (I think, don't quote me on this), we can categorize requests based on the words they contain. Each category would have a set of words that hint at the request's purpose, and once a category is picked, the parsing within it can be much more rigid. For example, if you ask about a stock price, the words "stock", "asking price", "trading", and maybe even something like "Dow Jones" would all add weight to the "stocks" category. This allows for much more flexibility in what we can say (as long as each category's parser is smart enough). We're not totally into freeform speech yet, but we're close enough.
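A toy version of that scoring idea, with made-up keyword lists, might look like this:

    // Each category gets keywords that add weight when they appear in the request.
    const categories = {
      stocks:  ['stock', 'asking price', 'trading', 'dow jones', 'shares'],
      weather: ['weather', 'rain', 'temperature', 'forecast', 'snow']
    };

    function categorize(text) {
      const lower = text.toLowerCase();
      let best = { name: null, score: 0 };
      for (const [name, words] of Object.entries(categories)) {
        const score = words.filter(w => lower.includes(w)).length;
        if (score > best.score) best = { name, score };
      }
      return best.name;   // null means "no idea" - hand off to a fallback response
    }

    console.log(categorize('how is Apple stock trading today'));   // "stocks"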

The final way to do this is to go total balls-to-the-wall machine learning on a large training dataset, building a model that categorizes requests and picks out their specific attributes. This is almost certainly how Google does it, but they have literally terabytes of data to train on. I do not. The results from building an AI based on machine learning with a small training set would suck (just look at the hotword detection trained on three voice samples).

However, there is another way (that, admittedly, I found out about five minutes before writing this paragraph). I was browsing the Google Cloud Machine Learning pages and found something called Dialogflow. It's a customer service agent AI that is designed for chat bots. And I think... I think it's free to use for text requests (which is all I'll be using since I'm doing my own voice recognition.) This is actually doable - I'm not sure what I was thinking in the above paragraphs. Yeah, sure... I'll just write my own AI engine. Ha.

I added some intents and entities and whatnot. I've included an exported version of the bot in the GitHub repo. I then hooked up the web service to Calvin so it could respond to small talk at 18cc5fa. It was working pretty well (except that it was matching "Will you marry me" to a weather intent rather than to small talk). That was more progress than I thought I'd make for now, so I decided to wire up some services to Calvin so it could answer questions and do things for you.
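For reference, sending a text query to Dialogflow at the time looked roughly like this with the apiai npm package (the V1 client it shipped back then); the token and session id are placeholders:

    const apiai = require('apiai');

    // Client access token comes from the agent's settings page.
    const agent = apiai(process.env.DIALOGFLOW_CLIENT_TOKEN);

    function understand(text) {
      return new Promise((resolve, reject) => {
        const request = agent.textRequest(text, { sessionId: 'calvin-session' });
        request.on('response', response => {
          // result carries the matched intent, extracted parameters, and fulfillment speech.
          resolve(response.result);
        });
        request.on('error', reject);
        request.end();
      });
    }

    understand('will it rain in Denver tomorrow')
      .then(result => console.log(result.metadata.intentName, result.parameters));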

The first thing to hook up was the weather system. When asked a weather question, Dialogflow returned the date, location, and type of precipitation asked about. So simple weather questions (in more or less freeform English) wouldn't be too hard to look up. I structured the files so that any additional modules can be found in the modules/ folder.
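I won't reproduce the real module here, but the shape of one is roughly this. The handler signature, the parameter names, and the use of OpenWeatherMap are all assumptions for illustration, not necessarily what's in the repo:

    // modules/weather.js - turn Dialogflow's weather parameters into a spoken answer.
    const https = require('https');

    module.exports = function weather(params, respond) {
      const city = params.city || 'New York';   // parameter name depends on the Dialogflow intent
      const url = 'https://api.openweathermap.org/data/2.5/weather' +
                  `?q=${encodeURIComponent(city)}&units=metric&appid=${process.env.OWM_KEY}`;

      https.get(url, res => {
        let body = '';
        res.on('data', chunk => body += chunk);
        res.on('end', () => {
          const data = JSON.parse(body);
          respond(`It's ${Math.round(data.main.temp)} degrees and ` +
                  `${data.weather[0].description} in ${city}.`);
        });
      });
    };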

At 30c62ad, I was able to answer questions about the current weather from anywhere in the world. I was still having trouble with the weather forecast, so that will take some debugging. But this post has gone on long enough, I think. At this point it's just a matter of adding data processors and refining the Dialogflow model. Other than that, Calvin is pretty much done as far as the AI architecture goes. So what's next for Calvin?

In the final episode, we'll put everything together. We'll trim wires, hot glue boards, and maybe even add a little color to it. But for now, we at least have a format and a framework that can do stuff, and that's pretty okay if you ask me.
