Overview
This guide explains what an AI-driven presentation coach is and why Vision Agents suits the task, then walks through a practical, reproducible workflow for building a real-time feedback system that evaluates speech, posture and gestures.
What is a Presentation Coach?
An AI presentation coach is a software agent that observes a speaker through video and audio streams, analyses key performance indicators, and delivers concise, actionable feedback while the speaker practices.
- Monitors filler words, speaking pace, vocal variety, and clarity.
- Analyzes body language: posture, hand gestures, and eye contact.
- Provides feedback via text overlay and synthesized voice.
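To make the speech indicators concrete, here is a minimal sketch of how filler words and speaking pace could be measured from a transcript. It is illustrative only and not how Vision Agents or the Realtime model computes them. The filler list, the `SpeechStats` type and the `speech_stats` function are hypothetical names introduced for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical filler list; tune it for the speaker and the language.
# Note that "like" is counted naively, including legitimate uses.
FILLERS = ("um", "uh", "like", "you know", "basically")


@dataclass
class SpeechStats:
    words: int
    fillers: int
    words_per_minute: float


def speech_stats(transcript: str, duration_seconds: float) -> SpeechStats:
    """Count words and filler phrases, and estimate the speaking pace."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    text = transcript.lower()
    words = re.findall(r"[a-z']+", text)
    # Match whole words or phrases only, so "umbrella" is not counted as "um".
    fillers = sum(
        len(re.findall(rf"\b{re.escape(filler)}\b", text)) for filler in FILLERS
    )
    words_per_minute = len(words) * 60 / duration_seconds
    return SpeechStats(len(words), fillers, words_per_minute)


stats = speech_stats(
    "Um, so basically our revenue, uh, grew like twenty percent", 6.0
)
print(stats.words, stats.fillers, stats.words_per_minute)  # 10 4 100.0
```

A coach could compare `words_per_minute` against a target range and mention fillers only when their share of all words passes a chosen limit, which keeps feedback short.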
Why Use Vision Agents?
Vision Agents is an open‑source framework that unifies video transport, multimodal AI models and chat interfaces, making it ideal for real‑time coaching applications.
- Low‑latency video via Stream’s WebRTC edge network.
- Simple integration of large language models (OpenAI Realtime) and computer‑vision models (YOLO pose).
- Agent‑centric architecture that handles streaming, processing and response loops automatically.
How to Build the Coach
The implementation consists of four logical stages: preparation, environment setup, model configuration, and agent programming.
- Prerequisites
  - Python 3.9+ installed locally.
  - Free Stream account (API key & secret).
  - OpenAI API key with Realtime access.
  - Basic familiarity with Python and command-line tools.
- Set Up the Development Environment
  - Create a project folder (e.g., presentation-coach).
  - Install uv (recommended installer) and initialise a virtual environment: uv init && uv venv && source .venv/bin/activate
  - Add required packages (the quotes stop shells such as zsh from treating the brackets as a glob pattern): uv add "vision-agents[getstream,openai,ultralytics]" python-dotenv
- Configure Secrets
  - Create a .env file with:
    STREAM_API_KEY=…
    STREAM_API_SECRET=…
    OPENAI_API_KEY=…
    CALL_ID=practice-room
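A missing or blank secret tends to surface later as an opaque connection error, so it can help to check the settings before the agent starts. The sketch below uses only the standard library. `REQUIRED_KEYS` and `missing_settings` are hypothetical helpers, not part of Vision Agents or python-dotenv. In a real main.py the check would run after dotenv.load_dotenv() has populated os.environ.

```python
import os
from typing import List, Mapping

# The four keys named in the .env step of this guide.
REQUIRED_KEYS = ("STREAM_API_KEY", "STREAM_API_SECRET", "OPENAI_API_KEY", "CALL_ID")


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required keys that are absent or blank in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


# Example with an in-memory mapping instead of the real environment.
example_env = {
    "STREAM_API_KEY": "abc",
    "STREAM_API_SECRET": "",
    "OPENAI_API_KEY": "sk-test",
}
print(missing_settings(example_env))  # ['STREAM_API_SECRET', 'CALL_ID']
```

Calling `missing_settings()` with no argument checks the process environment, so the agent can exit early with a clear message when the returned list is not empty.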
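- Download and Prepare YOLO Pose Model
  - Create download_yolo_pose.py that runs:
    from ultralytics import YOLO
    model = YOLO('yolo11n-pose.pt')  # downloads the weights if they are not present yet
    model.export()  # writes an exported copy of the model (TorchScript by default)
  - Run the script to cache yolo11n-pose.pt locally.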
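Once pose keypoints are available, posture checks can be simple geometry. The sketch below is a hypothetical heuristic, not the processing Vision Agents performs. It assumes the 17-point COCO keypoint layout, in which indices 5 and 6 are the left and right shoulders, and plain (x, y) pixel pairs such as those Ultralytics exposes through results[0].keypoints.xy. The function names and the 10 degree threshold are made up for illustration.

```python
import math
from typing import Sequence, Tuple

# COCO keypoint indices for the shoulders (assumed layout, see lead-in).
LEFT_SHOULDER, RIGHT_SHOULDER = 5, 6

Point = Tuple[float, float]


def shoulder_tilt_degrees(keypoints: Sequence[Point]) -> float:
    """Angle between the shoulder line and the horizontal (0 means level)."""
    left_x, left_y = keypoints[LEFT_SHOULDER]
    right_x, right_y = keypoints[RIGHT_SHOULDER]
    angle = abs(math.degrees(math.atan2(left_y - right_y, left_x - right_x)))
    # Fold into [0, 90] so the result does not depend on which shoulder
    # is higher or on whether the image is mirrored.
    return min(angle, 180.0 - angle)


def shoulders_tilted(keypoints: Sequence[Point], max_tilt: float = 10.0) -> bool:
    """Flag a frame when the shoulder line tilts more than `max_tilt` degrees."""
    return shoulder_tilt_degrees(keypoints) > max_tilt


# One synthetic frame: 17 keypoints, only the shoulders filled in.
frame = [(0.0, 0.0)] * 17
frame[LEFT_SHOULDER] = (220.0, 100.0)
frame[RIGHT_SHOULDER] = (120.0, 100.0)
print(shoulder_tilt_degrees(frame))  # 0.0 for level shoulders
```

Averaging the angle over a second or two of frames before raising a flag would make the check less sensitive to single noisy detections.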
- Write Coaching Instructions
  - In instructions/coach.md, define the agent's personality, feedback cadence and style (short, positive, actionable, silence-aware).
- Implement the Main Agent (main.py)
  - Load environment variables with dotenv.load_dotenv().
  - Instantiate a User object for the coach (name, avatar, ID).
  - Configure the Agent:
    - edge=getstream.Edge() (connects to Stream video)
    - instructions=Path('instructions/coach.md').read_text()
    - llm=OpenAI.Realtime(model='gpt-4o-mini', voice='alloy', frame_rate=6)
    - processors=[YOLOProcessor(model_path='yolo11n-pose.pt')]
  - Join a call with await agent.join_call(CALL_ID) and start the real-time loop with await agent.finish().
- Run and Test
  - Start the agent: python main.py.
  - Open the generated Stream call URL in a browser and enable the webcam and microphone.
  - Deliver a short presentation; observe on-screen text and voice feedback.
Best Practices and Tips
Follow these recommendations to maximise the coach’s usefulness.
- Keep the coach.md instructions concise; the agent respects length limits.
- Use a quiet environment to reduce background noise for the Realtime speech recogniser.
- Adjust the silence detection threshold (3‑5 seconds) if feedback feels too frequent.
- Periodically retrain or fine‑tune the YOLO pose model for specific camera angles.
- Log feedback sessions to a file for later self‑review and progress tracking.
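Two of these tips, the silence threshold and session logging, can be sketched with the standard library alone. The code below is a hypothetical illustration and not a Vision Agents API: `SilenceGate` holds feedback back until the speaker has been quiet for a configurable number of seconds, and `log_feedback` appends each message to a JSON Lines file. All names and the 4 second default are assumptions for the example.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


class SilenceGate:
    """Allow feedback only after `threshold` seconds without speech."""

    def __init__(self, threshold: float = 4.0) -> None:
        self.threshold = threshold
        self._last_speech: Optional[float] = None

    def heard_speech(self, now: float) -> None:
        """Record the time at which speech was last detected."""
        self._last_speech = now

    def may_speak(self, now: float) -> bool:
        # Before any speech has been heard there is nothing to comment on.
        if self._last_speech is None:
            return False
        return now - self._last_speech >= self.threshold


def log_feedback(path: Path, message: str, timestamp: Optional[float] = None) -> None:
    """Append one feedback message to a JSON Lines file for later review."""
    record = {
        "time": time.time() if timestamp is None else timestamp,
        "feedback": message,
    }
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


gate = SilenceGate(threshold=4.0)
gate.heard_speech(now=10.0)
print(gate.may_speak(now=12.0))  # False: only 2 s of silence
print(gate.may_speak(now=14.5))  # True: 4.5 s of silence

# Write to a temporary folder so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    session_log = Path(folder) / "session.jsonl"
    log_feedback(session_log, "Try pausing instead of saying 'um'.", timestamp=14.5)
    print(session_log.read_text(encoding="utf-8").strip())
```

Raising the threshold toward 5 seconds makes the coach interrupt less often. One JSON object per line keeps the log easy to append to during a session and easy to load later for progress tracking.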