Build a Wake Word: “hey chitti” (Simple Guide)
We’re building a small voice assistant that runs on your laptop. It continuously listens to the mic and waits for the wake word “hey chitti.” When you say it (like “hey Alexa”), the app beeps, listens to your sentence, and converts what you say into text. A tiny MFCC-based model detects the wake word quickly, and a simple voice-activity check decides when you start and stop speaking, while Whisper handles the transcription.
- Goal: Say “hey chitti” to wake your app. It beeps, listens to your next sentence, and writes down what you said.
- Code: Full source is in my repo → GitHub link. This post shows only short commands/snippets.
What we will do
- Record short audio clips for training.
- Turn each clip into MFCC features (numbers that describe the sound).
- Train a tiny model to say “wake word” or “not wake word”.
- Run a live detector: slide a 1‑second window over the mic, check every 0.25 s.
- When it hears “hey chitti”, play a beep, then listen for your command.
- Use Whisper to turn that command into text.
Quick setup
- Python 3.9+
- Install packages:
pip install numpy scipy librosa scikit-learn sounddevice matplotlib pandas joblib
- System tools: PortAudio + FFmpeg
Linux: sudo apt install -y portaudio19-dev ffmpeg
macOS: brew install portaudio ffmpeg
Make folders like this:
hey-chitti/
data/positive/ # “hey chitti” clips (1.0–1.2 s each)
data/negative/ # other speech/noise/silence
models/
src/ # all scripts
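If you'd rather create that tree from Python, a quick sketch (run from the directory that should contain hey-chitti/):

```python
from pathlib import Path

# Create the project layout; parents=True builds intermediate dirs,
# exist_ok=True makes it safe to re-run.
for d in ["data/positive", "data/negative", "models", "src"]:
    Path("hey-chitti", d).mkdir(parents=True, exist_ok=True)
```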
Step 1: Record audio
We’ll capture short clips (about 1.0–1.2 s each) for two folders:
- positive/ — you saying “hey chitti”
- negative/ — anything else (other speech, silence, room noise)
First, find your mic index:
python src/record_samples.py --list-devices
1) Manual
You press Enter for each clip, then say “hey chitti.” Best when you want full control.
# 40 positives
python src/record_samples.py --label positive --count 40 --device <IDX> --start manual --seconds 1.2
# 80 negatives
python src/record_samples.py --label negative --count 80 --device <IDX> --start manual --seconds 1.2
2) Countdown (auto-start after 3-2-1)
The script shows a short countdown, then records a fixed-length clip—nice rhythm, fewer mistakes.
# Positive with a 3-second countdown before each clip
python src/record_samples.py --label positive --count 40 --device <IDX> --start auto --countdown 3 --seconds 1.2
# Negative with a 2-second countdown
python src/record_samples.py --label negative --count 80 --device <IDX> --start auto --countdown 2 --seconds 1.2
3) Auto (voice-activated)
Recording starts when voice activity detection (VAD) hears you begin speaking, and a small pre-roll is kept so the first “hey” isn’t cut.
# Positive, voice-activated start, keeps 0.2 s before your voice starts
python src/record_samples.py --label positive --count 40 --device <IDX> --start vad --seconds 1.2 --prepad 0.2
# Negative, voice-activated start (speak anything or make noise)
python src/record_samples.py --label negative --count 80 --device <IDX> --start vad --seconds 1.2 --prepad 0.2
Tips
- Speak at normal volume, mic ~10–20 cm away, slightly off-axis.
- Mix rooms/backgrounds so the model learns to ignore noise.
- Aim for at least 40 positives / 80 negatives to start; add more negatives if you see false triggers.
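The pre-roll in the VAD mode is easy to picture as a small ring buffer: the recorder always holds the last ~0.2 s of frames, and prepends them the moment speech is detected. A minimal numpy sketch, with an illustrative frame size and RMS threshold (not the script's actual values):

```python
from collections import deque
import numpy as np

SR = 16000
FRAME = 320                 # 20 ms frames at 16 kHz
PREPAD_FRAMES = 10          # ~0.2 s of pre-roll
THRESHOLD = 0.02            # illustrative RMS threshold for "speech"

def capture_with_preroll(frames):
    """Return recorded audio starting ~0.2 s before speech is detected."""
    preroll = deque(maxlen=PREPAD_FRAMES)  # ring buffer of recent frames
    recorded = []
    for frame in frames:
        rms = np.sqrt(np.mean(frame ** 2))
        if not recorded and rms < THRESHOLD:
            preroll.append(frame)          # still silent: just remember it
        else:
            if not recorded:
                recorded.extend(preroll)   # speech began: prepend pre-roll
            recorded.append(frame)
    return np.concatenate(recorded) if recorded else np.zeros(0)

# Simulated stream: 15 silent frames, then 20 loud frames
rng = np.random.default_rng(0)
silence = [rng.normal(0, 0.001, FRAME) for _ in range(15)]
speech = [rng.normal(0, 0.1, FRAME) for _ in range(20)]
audio = capture_with_preroll(silence + speech)
# Kept: 20 speech frames plus 10 pre-roll frames (0.2 s), not all 15 silent ones
```

This is why the first “hey” survives even though detection lags the start of your voice by a frame or two.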
Step 2: Make features (MFCC)
Run:
python src/extract_features.py
# creates features.csv
MFCC (mel-frequency cepstral coefficients) converts each clip into a small set of numbers that are easy for a simple model to learn from.
Step 3: Train the model
Run:
python src/train.py
# saves models/scaler.pkl and models/model.pkl
This trains a tiny classifier (logistic regression). It’s fast and good enough for a wake word.
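Under the hood, training amounts to scaling the feature vectors and fitting a logistic regression, then saving both artifacts for the live detector. A self-contained sketch on synthetic stand-in features (train.py reads features.csv instead):

```python
import numpy as np
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins for 13-dim MFCC vectors: 40 positives, 80 negatives
X = np.vstack([rng.normal(1.0, 0.5, (40, 13)),
               rng.normal(-1.0, 0.5, (80, 13))])
y = np.array([1] * 40 + [0] * 80)

scaler = StandardScaler().fit(X)             # zero-mean, unit-variance features
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

# The same two artifacts the script saves under models/
joblib.dump(scaler, "scaler.pkl")
joblib.dump(model, "model.pkl")
print(model.score(scaler.transform(X), y))
```

At detection time the same scaler must be applied before calling the model, which is why both files are saved together.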
Step 4: Detect “hey chitti” live
Run:
python src/detect_rt.py --device <IDX> --threshold 0.6 --consec 2 --window 1.0 --hop 0.25 --cooldown 1.5 --cmd-start-timeout 4.0
Simple meanings:
- window 1.0 → look at 1 second of audio at a time.
- hop 0.25 → move forward 0.25 s each check (windows overlap).
- threshold 0.6, consec 2 → need 2 strong matches in a row.
- cooldown 1.5 → after a detection, ignore old audio for 1.5 s so it doesn’t trigger again.
- cmd-start-timeout 4.0 → if you don’t start talking within 4 s after the beep, cancel and go back to listening.
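The threshold/consec/cooldown logic above can be sketched in a few lines, independent of the audio plumbing. Here the model's per-window probabilities are simulated; in the real detector each probability comes from running the classifier on the latest 1-second window:

```python
def detect(probs, threshold=0.6, consec=2, cooldown_hops=6):
    """Return hop indices where a wake-word detection fires.

    probs: one model probability per 0.25 s hop.
    cooldown_hops: hops to ignore after a fire (1.5 s / 0.25 s = 6).
    """
    fires, streak, cooldown = [], 0, 0
    for i, p in enumerate(probs):
        if cooldown > 0:
            cooldown -= 1
            streak = 0
            continue
        streak = streak + 1 if p >= threshold else 0
        if streak >= consec:          # need N strong windows in a row
            fires.append(i)
            streak = 0
            cooldown = cooldown_hops  # suppress immediate re-triggers
    return fires

# One clean detection; the strong windows inside the cooldown are ignored
probs = [0.1, 0.7, 0.8, 0.9, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1]
print(detect(probs))  # → [2]
```

Requiring two consecutive strong windows is what filters out one-off spikes from a cough or a door slam.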
Step 5: Listen for the command (after the beep)
Right after the wake word, the app waits for your voice to start, then it stops when you pause. It also keeps a small pre‑roll so the first word isn’t cut.
You can tune these in the CLI:
--cmd-prepad 0.3 # keep 0.3 s before your voice starts
--cmd-silence-duration 0.8 # stop when silence lasts 0.8 s
--cmd-max-seconds 6.0 # hard limit
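The end-pointing works the same way in miniature: keep recording until the frame energy stays below a threshold for cmd-silence-duration, or until the hard limit. A sketch over simulated per-frame energies (frame length and threshold are illustrative):

```python
def endpoint(energies, frame_s=0.1, silence_thresh=0.01,
             silence_duration=0.8, max_seconds=6.0):
    """Return how many frames to keep, stopping at a long pause."""
    silence_frames_needed = round(silence_duration / frame_s)  # 8 frames
    max_frames = round(max_seconds / frame_s)                  # 60 frames
    quiet = 0
    for i, e in enumerate(energies):
        quiet = quiet + 1 if e < silence_thresh else 0
        if quiet >= silence_frames_needed:   # 0.8 s of silence: stop
            return i + 1
        if i + 1 >= max_frames:              # hard limit: stop anyway
            return max_frames
    return len(energies)

# 1.5 s of speech followed by silence: stops 0.8 s into the silence
energies = [0.2] * 15 + [0.001] * 30
print(endpoint(energies))  # → 23 frames (1.5 s speech + 0.8 s trailing silence)
```

Raising --cmd-silence-duration lets you pause mid-sentence without being cut off, at the cost of a slower stop.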
Step 6: Transcribe the command (Whisper)
Install whisper.cpp once:
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make -j
./models/download-ggml-model.sh base.en # English only (fast)
# or: ./models/download-ggml-model.sh small # multilingual, higher quality
Run detector with Whisper:
python src/detect_rt.py --device <IDX> --threshold 0.6 --consec 2 --window 1.0 --hop 0.25 --cooldown 1.5 --whisper-bin "<PATH>/whisper.cpp/build/bin/whisper-cli" --whisper-model "<PATH>/whisper.cpp/models/ggml-base.en.bin" --lang en
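Internally, the detector just shells out to whisper-cli on the WAV it saved after the beep. A sketch of how that call can be assembled; build_whisper_cmd is an illustrative helper (not the script's actual function), and the flags follow whisper.cpp's CLI:

```python
import subprocess

def build_whisper_cmd(whisper_bin, model, wav_path, lang="en"):
    """Assemble a whisper.cpp CLI invocation for one command clip."""
    return [
        whisper_bin,
        "-m", model,        # path to the ggml model file
        "-f", wav_path,     # 16 kHz mono WAV captured after the beep
        "-l", lang,         # language hint
        "-nt",              # no timestamps, just the text
    ]

cmd = build_whisper_cmd("whisper-cli", "ggml-base.en.bin", "command.wav")
print(" ".join(cmd))
# The detector would then run it and read stdout, e.g.:
# text = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
```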
What to tweak
- Too many false alarms? Raise --threshold (e.g., 0.65) or require more --consec (e.g., 3). Record more negative clips.
- Detector is slow to react? Lower --hop (e.g., 0.2).
- First word gets chopped? Increase --cmd-prepad (e.g., 0.4).
- Triggers again right away? Increase --cooldown (e.g., 2.0).
- English OK, Tamil bad? Use a bigger multilingual Whisper model (small → medium → large-v3) and keep the room quiet.
Next steps
- Replace the tiny model with a small CNN for even better accuracy.
- Add actions (turn lights on/off, call an API, etc.).
- Publish the full code on GitHub and link it here: <REPO_URL>.
That’s it. Say “hey chitti”, wait for the beep, speak your command, and watch the text appear.