Home Assistant: Setting up Home Assistant Voice PE and using it in your native language

I finally got interested in Home Assistant Voice when I found out that it supports my native language, Finnish, out of the box. That would make it the first assistant I can use to control Home Assistant in my native language, which is where Google Home and Amazon Echo fail. It would just be so much easier to tell my smart home what to do in Finnish.

Keep reading if you want to find out more about Home Assistant Voice and my experiences with it!

Home Assistant Voice

Essentially, Home Assistant Voice is the Home Assistant project’s initiative to bring robust, privacy-focused voice control into your smart home. It’s about creating a truly conversational and intuitive way to interact with your home in your native language.

Home Assistant Voice can work fully locally or through the Nabu Casa cloud. The cloud is a subscription-based service that makes voice recognition much faster than local processing and also supports more spoken languages.

Key aspects of Home Assistant Voice include:

Privacy First:
- A core principle is to enable local voice processing, meaning your commands can stay within your home network. This reduces reliance on cloud services and enhances privacy.
- The optional Nabu Casa cloud is marketed as privacy-focused, with your data remaining safe.
Flexibility:
- Home Assistant Voice allows you to choose between different speech-to-text and text-to-speech engines, giving you control over how your voice data is processed.
Assist:
- The underlying voice assistant within Home Assistant is called “Assist”. This is the part that takes your spoken words, understands them and then carries out your commands.
Open and Expandable:
- Being part of the Home Assistant ecosystem, it’s designed to be open and customizable, allowing for continuous improvement and expansion by the community.
- All three engines of the Voice feature can be configured separately: text-to-speech, speech-to-text and the conversation agent.

Home Assistant Voice PE

Now, the Home Assistant Voice Preview Edition (PE) is a hardware piece that elevates this voice experience. Think of it as a purpose-built device designed to make using Home Assistant Voice as smooth as possible. Here’s what sets it apart:

Designed for Home Assistant:
- This isn’t a generic smart speaker; it’s specifically engineered to work with Home Assistant. This means streamlined setup and optimal performance.
Enhanced Audio:
- With dual microphones and advanced audio processing, it’s designed to accurately capture your voice, even in noisy environments.
Privacy Features:
- A physical mute switch provides a tangible way to ensure your privacy.
Community Driven:
- This device is very much a product of the Home Assistant community, so it is designed to be open and expandable.
Easy Setup:
- The device is designed for easy plug and play use.
Expandability:
- With features like a Grove port, you can add additional sensors to the device.

Home Assistant Voice PE is based on a versatile ESP32 chip, and its firmware is built entirely on the open-source ESPHome platform.

In essence, Home Assistant Voice is the software, and the Voice Preview Edition is the hardware that brings it to life. It’s about empowering users with a voice control system that’s both powerful and respectful of their privacy.

Setting up the assistant with voice control

Now that we’ve cleared up what Home Assistant Voice and Home Assistant Voice PE are, it’s time to get into action and actually set it up.

Home Assistant Voice PE ships with minimal contents: the device, a sticker and a quick start guide. No power supply or USB-C cable is included.

Once powered up, the device is discovered by Home Assistant via BLE. Bluetooth is only used to set up the WiFi credentials; the actual configuration is done over WiFi.

Areas, Aliases and Entities

Home Assistant Voice relies heavily on areas and entities. Consider saying ‘turn on lights’. How does it know which lights to turn on? Well, if the voice assistant is assigned to an area, it will automatically target the lights of that area. But if you want to turn on the lights in another room, you would say ‘turn on living room lights’. This is recognized through the area assigned to the device (‘turn on [area] [entity]‘). So, if the areas are already set, great; if not, there’s some work to do to assign all required devices to their corresponding areas.

Then there are aliases. A device can have multiple names, including names in multiple languages. I’m using my Home Assistant in English, but I’d like to control the devices in my native language, Finnish. As my entities are named in English, I don’t want to mix English and Finnish, like ‘laita living room lights päälle’ (turn on the living room lights). To overcome this, I’ll just set a Finnish alias on the entity and on the area. Home Assistant Voice will look through all the aliases before deciding on a match.
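To make the alias idea a bit more concrete, here is a minimal Python sketch of how a spoken name could be resolved against both the primary name and the aliases. This is only my illustration of the concept, not how Home Assistant implements it; the entity ids, names and Finnish aliases are made-up examples.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    entity_id: str
    name: str
    aliases: list[str] = field(default_factory=list)

    def all_names(self) -> set[str]:
        # Primary name and aliases are treated equally, case-insensitively.
        return {n.casefold() for n in [self.name, *self.aliases]}


# Hypothetical entities: English names with Finnish aliases.
ENTITIES = [
    Entity("light.living_room_lights", "Living room lights", ["olohuoneen valot"]),
    Entity("light.office_lights", "Office lights", ["työhuoneen valot"]),
]


def find_entity(spoken_name: str) -> Optional[str]:
    """Return the entity id whose name or alias matches, or None."""
    wanted = spoken_name.strip().casefold()
    for entity in ENTITIES:
        if wanted in entity.all_names():
            return entity.entity_id
    return None


print(find_entity("Olohuoneen valot"))
print(find_entity("living room lights"))
```

Both calls resolve to the same entity, which is the whole point: the entity keeps its English name, and the Finnish alias is just another way to reach it.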

Exposing entities

Home Assistant Voice will not see every entity by default; entities need to be exposed to the assistant. To keep the assistant accurate, it’s a good idea to expose only the entities you want to control by voice. If you expose ‘everything’, the context gets huge, which slows down voice processing, mixes up devices and commands, and can even cost more in the form of a bigger token count if you use a 3rd party AI agent.
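A rough way to see why exposing fewer entities helps is to compare the size of the context an AI agent would have to read. The Python sketch below is only an illustration: the entity list is invented, the context format is my own, and the "about 4 characters per token" estimate is a crude rule of thumb, not what any real tokenizer does.

```python
# Hypothetical entities in a Home Assistant instance.
ALL_ENTITIES = {
    "light.living_room_lights": "Living room lights",
    "light.office_lights": "Office lights",
    "sensor.router_uptime": "Router uptime",
    "sensor.printer_ink_level": "Printer ink level",
    "sensor.nas_disk_temperature": "NAS disk temperature",
    "binary_sensor.backup_running": "Backup running",
}

# Only the things I actually want to control by voice.
EXPOSED = {"light.living_room_lights", "light.office_lights"}


def build_context(entities: dict[str, str]) -> str:
    """Render the entity list as plain text, one entity per line."""
    return "\n".join(f"{eid}: {name}" for eid, name in sorted(entities.items()))


def rough_token_count(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


exposed_only = {eid: name for eid, name in ALL_ENTITIES.items() if eid in EXPOSED}

print("everything:", rough_token_count(build_context(ALL_ENTITIES)), "tokens (rough)")
print("exposed only:", rough_token_count(build_context(exposed_only)), "tokens (rough)")
```

With a real installation the difference is far bigger, as a typical instance has hundreds of entities that nobody will ever control by voice.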

Cloud vs Local

One choice to make is whether to use a local-only or a cloud-based assistant. Currently, local processing is only available for a few major languages like English. The local assistant is also much slower and can take as long as 10 seconds to process a command on low-end hardware.

The cloud-based assistant supports more languages, as the recognition is offloaded to the Nabu Casa cloud servers. It is also much faster, as more computing power is available. My ‘only‘ option here is the cloud-based solution, as Finnish is not supported by the local AI at the moment. In theory, one could run one’s own Whisper model fully locally, but that would need some model training and a large amount of local GPU power.

When using the cloud, the spoken audio is transferred to the Nabu Casa cloud, but according to their terms of service, that data is not stored anywhere permanently or used to train AI models. So your data should stay safe.

The cloud subscription costs as little as 7,5€/month (or 75€/annum). It also gives you other benefits, like a secure remote connection to your Home Assistant instance.
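For a quick comparison of the two billing options, the arithmetic looks like this in Python, using only the two prices quoted above:

```python
MONTHLY_EUR = 7.5   # price per month when billed monthly
ANNUAL_EUR = 75.0   # price per year when billed annually

paid_monthly_per_year = MONTHLY_EUR * 12
saving = paid_monthly_per_year - ANNUAL_EUR

print(f"Billed monthly: {paid_monthly_per_year:.2f} EUR/year")
print(f"Billed annually: {ANNUAL_EUR:.2f} EUR/year")
print(f"Saved with annual billing: {saving:.2f} EUR/year")
```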

Using in Native language (Finnish)

Home Assistant Voice uses a wake word to start listening. There are three wake words available: ‘Okay Nabu‘, ‘Hey Mycroft‘ or ‘Hey Jarvis‘. Luckily “Okay Nabu” sounds about the same in Finnish as in English. A custom wake word can’t be used yet (well, it can, as it’s open source, but it requires custom model training and is a tricky process).

First tests proved that Home Assistant Voice actually understands Finnish quite well, and it’s easy to communicate with Home Assistant. However, the sentences need to be quite exact, with no extra words added. A few examples: “laita valot päälle” (turn on the lights) works fine, but “laita valot takaisin päälle” (turn the lights back on) won’t work, as there’s something extra in the middle. The assistant will just inform you that the area ‘takaisin‘ (back) is not found.
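The ‘takaisin‘ behaviour makes sense if you think of the built-in sentences as fixed templates with slots. Here is a small Python sketch of that idea. The templates and area aliases are my own guesses for illustration, not the actual Finnish sentences shipped with Home Assistant, and the regex matching is a simplification of whatever the real matcher does.

```python
import re

# Hypothetical area aliases (Finnish genitive forms) mapped to area ids.
KNOWN_AREAS = {"olohuoneen": "living_room", "työhuoneen": "office"}

# Hypothetical sentence templates; {area} is a one-word slot.
TEMPLATES = [
    "laita valot päälle",
    "laita {area} valot päälle",
    "laita valot {area} päälle",
]


def template_to_regex(template: str) -> re.Pattern:
    """Turn 'laita {area} valot päälle' into a regex with a named group."""
    parts = re.split(r"(\{\w+\})", template)
    pattern = ""
    for part in parts:
        if part.startswith("{"):
            pattern += "(?P<" + part[1:-1] + r">\w+)"
        else:
            pattern += re.escape(part)
    return re.compile(pattern)


def handle(sentence: str) -> str:
    text = sentence.strip().casefold()
    for template in TEMPLATES:
        match = template_to_regex(template).fullmatch(text)
        if match is None:
            continue
        area = match.groupdict().get("area")
        if area is None:
            return "turn on lights in the current area"
        if area in KNOWN_AREAS:
            return f"turn on lights in {KNOWN_AREAS[area]}"
        # The extra word landed in the area slot.
        return f"area '{area}' not found"
    return "no matching sentence"


print(handle("laita valot päälle"))
print(handle("laita olohuoneen valot päälle"))
print(handle("laita valot takaisin päälle"))
```

In the last call the extra word ends up in the area slot, so the sketch complains about an unknown area instead of just ignoring the word, much like what I saw with the real assistant.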

It’s not a big deal once you get to know your assistant and understand the limitations.

Enhancing conversation with 3rd party AI

There’s one more option to make the voice assistant even smarter: connecting a 3rd party LLM (large language model). With a better AI, the assistant can understand context even better. The example I used earlier, ‘laita valot takaisin päälle’ (turn the lights back on), is understood by Google Gemini, and the lights did turn on.

3rd party AI models like Google Gemini or OpenAI ChatGPT are superior to the Home Assistant cloud. That comes at a cost though: every request costs something, depending on the size of the context (tokens). With a small number of devices, one request only costs a fraction of a cent, so it’s not a big deal if you really want to make the conversation with the assistant more natural.

It’s even possible to connect a local AI if you have the GPU power in your system and the willpower to make it work.
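To get a feel for the cost, here is a small Python sketch of the per-request arithmetic. The token counts, the number of requests per day and especially the prices are placeholders I made up for the example, not real prices of any provider; plug in the current numbers from your provider’s price list.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    eur_per_million_input: float,
    eur_per_million_output: float,
) -> float:
    """Cost of one request when pricing is given per million tokens."""
    return (
        input_tokens * eur_per_million_input
        + output_tokens * eur_per_million_output
    ) / 1_000_000


# Placeholder values, NOT real prices or measured token counts.
PRICE_IN = 0.10        # EUR per million input tokens (assumption)
PRICE_OUT = 0.40       # EUR per million output tokens (assumption)
CONTEXT_TOKENS = 2_000  # exposed entities + instructions + the command
REPLY_TOKENS = 50
REQUESTS_PER_DAY = 40

per_request = request_cost(CONTEXT_TOKENS, REPLY_TOKENS, PRICE_IN, PRICE_OUT)
per_month = per_request * REQUESTS_PER_DAY * 30

print(f"One request: {per_request:.5f} EUR")
print(f"One month:   {per_month:.2f} EUR")
```

The input side dominates, because the whole context is sent with every request. That is another reason to keep the list of exposed entities short.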

Currently there are a few shortcomings when using a 3rd party LLM:

- You can’t keep the conversation going. The speaker always requires another wake word, even when you are responding to a follow-up question asked by the AI. The conversation stays alive, but the device just needs to be woken again.
- The chat context can get too big too quickly, and I’ve not yet found a way to clear the chat history. Once the chat history is 4 to 5 conversations long, the AI might start hallucinating a bit and won’t carry out requests properly anymore. There’s surely some kind of history clearing in place, but I haven’t figured out the logic yet.

Setting up custom intents

Basic control works fine with Home Assistant Voice out of the box, in my native language, but how about something more custom? Like telling your robot vacuum to start cleaning a specific room. There’s no Home Assistant entity directly for that, so it needs some more configuration.

Home Assistant implements custom intents through automations. You can set a custom voice sentence to trigger an automation. It’s even possible to pass parameters in the command. The problem, again, is that the spoken command has to be (almost) identical to the trigger phrase. Again, it’s not an issue if you know what to say, but for more conversational commands it’s a no-go currently. I’m sure this will take a huge leap in the next iterations as well.
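The idea of a trigger sentence with a parameter can be sketched in a few lines of Python. To be clear, this is not Home Assistant code: the trigger phrase ‘imuroi {room}’ (vacuum {room}), the start_vacuum function and the word-by-word matching are all my own simplifications, just to show why the spoken command has to line up with the trigger phrase.

```python
from typing import Optional


def match_trigger(template: str, sentence: str) -> Optional[dict]:
    """Compare word by word; words written as {name} capture a parameter."""
    template_words = template.casefold().split()
    spoken_words = sentence.casefold().split()
    if len(template_words) != len(spoken_words):
        return None  # any extra or missing word breaks the match
    slots = {}
    for expected, spoken in zip(template_words, spoken_words):
        if expected.startswith("{") and expected.endswith("}"):
            slots[expected[1:-1]] = spoken
        elif expected != spoken:
            return None
    return slots


def start_vacuum(room: str) -> str:
    # Stand-in for the automation's action (hypothetical).
    return f"vacuum: cleaning {room}"


# Trigger phrase -> action, like a sentence-triggered automation.
AUTOMATIONS = {"imuroi {room}": start_vacuum}


def run(sentence: str) -> Optional[str]:
    for template, action in AUTOMATIONS.items():
        slots = match_trigger(template, sentence)
        if slots is not None:
            return action(**slots)
    return None


print(run("Imuroi keittiö"))      # matches, room = "keittiö"
print(run("imuroi keittiö nyt"))  # extra word, no match
```

The second call fails only because of one extra word (‘nyt’, now), which is exactly the kind of strictness that makes conversational commands hard with plain trigger phrases.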

Conclusion

It’s obvious that Home Assistant Voice is still a work in progress that needs some improvements and has some issues: integrating external services with Voice and AI still gives a headache, and debugging is rather complicated. Luckily, the direction of the development is right on: focusing on privacy and supporting many different languages.

For now I only have one Home Assistant Voice smart speaker, in my office, but if things move forward the way they look, I’m going to be replacing my Google Home speakers with Home Assistant Voice in every room very soon. It’s just so much easier to talk to a voice assistant in my native language instead of English all the time.

There are still open questions I need to find answers to, like controlling music players around the house. Eventually I would like to tell my Sonos speakers to play Finnish artists by name, as currently I can’t do that with the English-only Google Home smart speakers.

The next stop for me with voice is to get to know Music Assistant and use voice control with it. But that will be another full article later on, as it is another big piece of functionality.