Text is finally starting to feel like a limitation. Most OpenClaw agents read and write strings, but the world they need to act in is full of phone calls, voice notes, podcast clips, meeting recordings, and ambient audio. Over the last few weeks a cluster of new ClawHub skills has shown up in the awesome-openclaw-skills and sundial-org registries that close that gap, giving agents the ability to listen, speak, transcribe, and even sing on cue. Here are five worth installing this week.
1. whisper-stream: Real-time transcription for long-running agents
The whisper-stream skill wraps a streaming build of Whisper-v3-turbo and exposes it to an OpenClaw agent as a single listen() tool. Unlike the older one-shot transcription skills, whisper-stream maintains a rolling context window, so an agent can be invoked once at the start of a meeting and keep transcribing into a notes buffer for hours. It supports speaker diarization out of the box and emits partial hypotheses every 300ms, which is short enough that downstream skills can react to spoken commands in near real time.
npx clawhub@latest install whisper-stream
Source: sundial-org/awesome-openclaw-skills. Best for: meeting assistants, accessibility tooling, and any agent that needs to follow a long human conversation.
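To make the partial-versus-final distinction concrete, here is a minimal sketch of the consuming side. The real skill exposes a single listen() tool; listen_stub() below is a stand-in generator (the tuple shape and speaker labels are my assumptions, not the skill's documented API), so only the note-buffering logic is the point.

```python
def listen_stub():
    """Stand-in for whisper-stream's listen(): yields diarized
    (speaker, is_final, text) tuples, partials roughly every 300ms."""
    yield ("spk_0", False, "let's move the")
    yield ("spk_0", True, "let's move the launch to friday")
    yield ("spk_1", True, "works for me")

def take_notes(stream):
    """Append only finalized, diarized segments to the notes buffer,
    skipping the intermediate partial hypotheses."""
    notes = []
    for speaker, is_final, text in stream:
        if is_final:
            notes.append(f"{speaker}: {text}")
    return notes

print("\n".join(take_notes(listen_stub())))
```

A real meeting assistant would keep the loop running for hours and could also watch the partials for wake phrases, reacting before the segment is finalized.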
2. piper-voice: Local TTS that actually sounds human
If whisper-stream gives your agent ears, piper-voice gives it a mouth. Piper is a small, fast neural text-to-speech engine that runs entirely on CPU, and the ClawHub skill bundles 40+ pretrained voices across a dozen languages. The big win over cloud TTS skills is latency: a typical sentence comes back in well under 200ms on a modern laptop, which is finally fast enough for back-and-forth voice conversations without that awkward two-second pause.
npx clawhub@latest install piper-voice
The skill exposes speak(text, voice) and save_wav(text, path, voice), and ships with a small benchmark script so you can compare voices side by side. Source: VoltAgent/awesome-openclaw-skills.
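If you want a feel for the kind of side-by-side comparison that benchmark script does, here is a runnable sketch. speak_stub() simulates synthesis rather than calling the skill, and the voice names are illustrative guesses, so treat the harness shape as the takeaway, not the numbers.

```python
import time

def speak_stub(text, voice):
    """Stand-in for piper-voice's speak(text, voice): pretend to
    synthesize and return fake audio bytes."""
    time.sleep(0.001)  # simulated synthesis time
    return b"\x00" * len(text)

def benchmark(text, voices, speak=speak_stub):
    """Time each voice on the same sentence and report milliseconds."""
    results = {}
    for voice in voices:
        start = time.perf_counter()
        speak(text, voice)
        results[voice] = (time.perf_counter() - start) * 1000
    return results

timings = benchmark("The quick brown fox.", ["en_US-amy", "en_GB-alan"])
for voice, ms in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{voice}: {ms:.1f} ms")
```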
3. callbridge: Make and receive phone calls from an OpenClaw agent
The callbridge skill plugs an agent into Twilio’s Voice API and gives it three tools: place_call, answer_call, and hangup. Once a call is connected, the audio stream is piped through whisper-stream and piper-voice automatically, so from the agent’s perspective a phone call looks like a regular text conversation with timestamps. Practical uses include appointment-booking bots, after-hours support lines, and the kind of dreary follow-up calls nobody wants to make themselves.
npx clawhub@latest install callbridge
Safety note: callbridge can place outbound calls that cost real money and reach real people. Most teams scope it to a whitelist of phone numbers during development and gate the place_call tool behind a human approval step in production. The skill’s README walks through both patterns. Source: LeoYeAI/openclaw-master-skills.
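The two gating patterns can be sketched in a few lines. Everything here is illustrative: place_call_stub() stands in for the skill's real tool, and the approval hook is a placeholder for whatever human-in-the-loop mechanism your team already uses.

```python
ALLOWED_NUMBERS = {"+15550100", "+15550101"}  # development whitelist

def place_call_stub(number):
    """Stand-in for callbridge's place_call tool."""
    return f"call placed to {number}"

def approve(number):
    """Placeholder for a human approval step; auto-denies here so
    the sketch runs without interaction."""
    return False

def gated_place_call(number, require_approval=False):
    """Enforce the whitelist first, then (in production) block the
    call until a human approves it."""
    if number not in ALLOWED_NUMBERS:
        raise PermissionError(f"{number} is not on the whitelist")
    if require_approval and not approve(number):
        return "call denied: awaiting human approval"
    return place_call_stub(number)

print(gated_place_call("+15550100"))
```

The useful property of wrapping the tool rather than patching it is that the agent never sees an ungated place_call at all.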
4. podscribe: Podcast ingestion and semantic search
podscribe is the skill to install if your agent needs to actually understand the podcast firehose, not just download episodes. Point it at an RSS feed and it will fetch new episodes, transcribe them, chunk by speaker turn, and embed the result into a local vector store. The skill exposes a search_podcast tool that returns timestamped quotes with a deep link back to the original audio, which is incredibly useful for research agents and journalists.
npx clawhub@latest install podscribe
Source: sundial-org/awesome-openclaw-skills. The maintainers note that podscribe will happily eat several hundred gigabytes of disk if you point it at a year of daily news podcasts, so configure the retention window before you turn it loose.
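To picture what search_podcast hands back, here is a dependency-free sketch. The real skill scores chunks with embeddings in a local vector store; this stand-in uses keyword overlap, and the chunk fields and deep-link scheme are assumptions rather than the skill's documented format.

```python
CHUNKS = [
    {"episode": "ep-104", "start": 512.0, "speaker": "host",
     "text": "open weights models are closing the gap fast"},
    {"episode": "ep-104", "start": 734.5, "speaker": "guest",
     "text": "latency is the real product problem for voice agents"},
]

def search_podcast(query, chunks=CHUNKS, top_k=1):
    """Rank speaker-turn chunks by keyword overlap with the query and
    return timestamped quotes with a deep link into the audio."""
    terms = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(terms & set(c["text"].split())),
                    reverse=True)
    return [{"quote": c["text"],
             "timestamp": c["start"],
             "link": f"podscribe://{c['episode']}?t={c['start']}"}
            for c in scored[:top_k]]

print(search_podcast("voice latency"))
```

Swapping the keyword score for cosine similarity over embeddings gives you the semantic version without changing the result shape.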
5. soundscope: Non-speech audio understanding
The most surprising entry in this roundup is soundscope, which gives an agent the ability to classify and describe non-speech audio. Under the hood it uses a CLAP-style audio-language model, and its describe_audio tool returns natural-language descriptions like “a dog barking twice, then a door closing” or “acoustic guitar, fingerstyle, in D major”. Pair it with a microphone feed and your agent can react to environmental events, not just words.
npx clawhub@latest install soundscope
Source: openclaw/skills official archive. Soundscope is still marked experimental and the model weights are large (about 1.4GB), but it is the first ClawHub skill that lets an agent meaningfully perceive the world through a microphone the way it perceives a webpage through the DOM.
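Reacting to environmental events then becomes simple phrase matching on the description. The sketch below stubs describe_audio with one of the example outputs above; the trigger table and action strings are hypothetical, supplied only to show the pattern.

```python
def describe_audio_stub(clip):
    """Stand-in for soundscope's describe_audio tool."""
    return "a dog barking twice, then a door closing"

TRIGGERS = {
    "dog barking": "notify: pet is active",
    "glass breaking": "alert: possible break-in",
}

def react(clip, describe=describe_audio_stub):
    """Fire every action whose trigger phrase appears in the
    natural-language description of the clip."""
    description = describe(clip)
    return [action for phrase, action in TRIGGERS.items()
            if phrase in description]

print(react(b"fake-audio-bytes"))
```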
Putting it all together
The interesting thing about this batch of skills is how naturally they compose. whisper-stream and piper-voice turn any text-mode agent into a voice agent in about ten minutes. Layer callbridge on top and you have a phone agent. Add podscribe and soundscope and you have an agent that can not only converse, but also research what was said on every relevant podcast last week and react to the dog barking in the background while it is talking. None of these are hypothetical; all five are installable from ClawHub today, and all five have been added to at least one of the major awesome-openclaw-skills registries within the last month.
If you are building a voice-first OpenClaw agent, the right starting stack is probably whisper-stream plus piper-voice for the conversation loop, podscribe for memory, and soundscope for ambient awareness, with callbridge added only once you are confident the rest of the system behaves itself. Install them in that order and you will have a surprisingly capable audio agent in an afternoon.
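The conversation loop at the core of that stack is small enough to sketch end to end. All three tools are stubs standing in for whisper-stream, your existing text agent, and piper-voice; the function names and voice label are mine, not any skill's API.

```python
def listen_stub():
    """Ears: stand-in for a finalized whisper-stream segment."""
    return "what time is the launch"

def agent_stub(text):
    """Brain: stand-in for the existing text-mode agent."""
    return f"you asked: {text}"

def speak_stub(text, voice="en_US-amy"):
    """Mouth: stand-in for piper-voice's speak(text, voice)."""
    return f"[{voice}] {text}"

def voice_turn(listen=listen_stub, agent=agent_stub, speak=speak_stub):
    """One turn of the listen -> think -> speak loop."""
    heard = listen()
    reply = agent(heard)
    return speak(reply)

print(voice_turn())
```

Because each stage is just a callable, swapping a stub for the real skill tool changes nothing about the loop itself, which is why the text-to-voice upgrade takes minutes rather than days.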