Voice Web Agent

Voice-enabled browser automation agent built with Next.js and TypeScript. It parses natural-language commands into a structured plan, executes steps in a live browser session (via Browserbase), and streams events and screenshots to the UI. State is persisted to SQLite for replay and debugging.

Features

  • Natural language to actions: lightweight parser with optional LLM fallback
  • Planner → Executor pipeline with retry and backtrack on failure
  • Live browser via Browserbase (viewer URL + MJPEG streaming endpoint)
  • Streaming event log (SSE) with steps and screenshots
  • Session capture controls (capture on error, capture every N steps)
  • SQLite persistence for sessions, commands, events, executions, and advisory selector memory
  • Fully typed (TypeScript), tested with Vitest and Testing Library

Tech Stack

  • Next.js 14 (App Router), React 18, Tailwind CSS
  • TypeScript, Zod for config validation
  • SQLite via better-sqlite3 and Drizzle ORM (manual schema)
  • Playwright runner abstraction, Browserbase CDP integration
  • Vitest + JSDOM for unit tests

Getting Started

Prerequisites

  • Node.js >= 18.17
  • npm (or pnpm/yarn). Examples below use npm.

Optional (only if you plan to run a local Playwright browser instead of Browserbase):

  • Playwright browsers: npx playwright install --with-deps

Install

  1. Install dependencies
npm install
  2. Configure environment

Copy .env.example to .env and edit as needed. Key variables:

  • Runners
    • PLAYWRIGHT_ENABLED: set to true to enable a local Playwright runner (not the default route).
    • BROWSERBASE_ENABLED: set to true to use Browserbase for the live browser.
  • Browserbase
    • BROWSERBASE_API_KEY: required if using the SDK to create sessions.
    • BROWSERBASE_PROJECT_ID: optional, depends on your project setup.
    • BROWSERBASE_WS_ENDPOINT: optional direct CDP endpoint; if set, SDK is bypassed.
  • Persistence
    • SQLITE_FILE: path to SQLite file (default ./data.sqlite).
  • Screenshots
    • SCREENSHOT_EVERY: number of steps between screenshots (defaults to 0, meaning off, if unset).
  • LLM fallback (optional)
    • LLM_FALLBACK_ENABLED: set true to enable parsing fallback with OpenAI.
    • OPENAI_API_KEY: OpenAI API key for fallback parsing.
    • LLM_MODEL: model name (default gpt-4o-mini).

Note: With BROWSERBASE_ENABLED=false and PLAYWRIGHT_ENABLED=false, the executor uses a no-op runner. You’ll still see planning and events, but no real browsing will occur.
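
To make the variables above concrete, here is a minimal sketch of how they could be read and how the runner could be chosen. The names AppConfig, loadConfig, and selectRunner are hypothetical and not taken from the repository (which uses Zod for config validation); the precedence of Browserbase over Playwright is an assumption based on the note that the default route prefers Browserbase.

```typescript
// Hypothetical sketch: AppConfig, loadConfig and selectRunner are illustrative names.
type Env = Record<string, string | undefined>;

interface AppConfig {
  playwrightEnabled: boolean;
  browserbaseEnabled: boolean;
  sqliteFile: string;
  screenshotEvery: number; // 0 means periodic capture is off
  llmFallbackEnabled: boolean;
  llmModel: string;
}

// Treat only the literal string "true" (any case) as enabled.
const isTrue = (value: string | undefined): boolean =>
  (value ?? "").trim().toLowerCase() === "true";

function loadConfig(env: Env): AppConfig {
  const every = parseInt(env.SCREENSHOT_EVERY ?? "", 10);
  return {
    playwrightEnabled: isTrue(env.PLAYWRIGHT_ENABLED),
    browserbaseEnabled: isTrue(env.BROWSERBASE_ENABLED),
    sqliteFile: env.SQLITE_FILE ?? "./data.sqlite",
    // Unset, non-numeric, or non-positive values all fall back to 0 (off).
    screenshotEvery: !isNaN(every) && every > 0 ? every : 0,
    llmFallbackEnabled: isTrue(env.LLM_FALLBACK_ENABLED),
    llmModel: env.LLM_MODEL ?? "gpt-4o-mini",
  };
}

type RunnerKind = "browserbase" | "playwright" | "noop";

// Assumption: Browserbase wins when both flags are set.
function selectRunner(config: AppConfig): RunnerKind {
  if (config.browserbaseEnabled) return "browserbase";
  if (config.playwrightEnabled) return "playwright";
  return "noop"; // planning and events still work, but no real browsing occurs
}
```

In an actual Next.js process the argument would typically be process.env; passing a plain object keeps the sketch easy to test.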

  3. Run the dev server

This repo does not define dev/build scripts; you can invoke Next directly:

npx next dev -p 3000

Then open http://localhost:3000.

Build/start for production:

npx next build
npx next start -p 3000

Basic Workflow

  1. Create a session using the “New Session” button.
  2. Enter a command, e.g., “search amazon for headphones under 200 and sort by price low to high”.
  3. Review the parsed command and computed plan.
  4. Watch live events and screenshots as steps execute.
  5. If using Browserbase, click “Open Live Browser” to view the cloud session.
  6. Adjust capture settings (capture on error or every N steps) per session.

If the command targets a domain that is not in the session allowlist, the UI prompts for confirmation before planning and execution.
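
The allowlist check can be pictured roughly as follows. This is a sketch under assumptions: hostOf and needsConfirmation are hypothetical helpers, and treating subdomains of an allowlisted domain as allowed is a design choice of this example, not something the README states.

```typescript
// Hypothetical helpers; the repository's actual allowlist logic may differ.
function hostOf(target: string): string | null {
  // Accept bare hosts like "amazon.com" as well as full URLs.
  const match = /^(?:[a-z][a-z0-9+.-]*:\/\/)?([^\/?#:]+)/i.exec(target.trim());
  return match ? match[1].toLowerCase().replace(/^www\./, "") : null;
}

function needsConfirmation(target: string, allowlist: readonly string[]): boolean {
  const host = hostOf(target);
  if (host === null) return true; // unparseable targets are treated as untrusted
  // Allow exact matches and subdomains of an allowlisted domain.
  return !allowlist.some((domain) => {
    const suffix = "." + domain;
    return host === domain || host.slice(-suffix.length) === suffix;
  });
}
```

Matching on a leading dot avoids accepting look-alike hosts such as evil-amazon.com when only amazon.com is allowlisted.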

Architecture Overview

  • Parser (src/lib/parser.ts): heuristics to extract intent, entities (site, query, filters, sort), and safety flags.
  • Planner (src/lib/planner.ts): maps a Command to an ActionPlan of steps (NAVIGATE, WAIT_FOR, FILL, PRESS, APPLY_FILTER, SORT, CLICK).
  • Executor (src/runner/executor.ts): runs steps with retry/backtrack, emits STATE/STEP/SCREENSHOT/ERROR events, and records executions/steps in SQLite.
  • Runner abstraction (src/runner/runner.ts): interface implemented by a Playwright-based runner. In production the route uses Browserbase via createBrowserbaseRunner.
  • Browserbase integration (src/runner/createBrowserbaseRunner.ts): creates or attaches to a Browserbase session (SDK or direct BROWSERBASE_WS_ENDPOINT) and returns a Playwright-backed runner. Persists viewerUrl for the UI.
  • Event pipeline
    • Services (src/server/services.ts) emit events and manage in-memory sessions.
    • SSE API (src/app/api/events/route.ts) replays backlog from SQLite and streams live events.
    • Live MJPEG (src/app/api/live/route.ts) for periodic page screenshots if a live page is registered.
  • Persistence (src/server/persistence/sqlite.ts): minimal schema creation; stores sessions, commands, events, executions, and advisory selector memory.
  • UI (Next.js App Router): session controls, command panel, plan card, event log, screenshots, timeline, and capture settings.
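
As a rough illustration of the Planner → Executor flow described above, the sketch below maps a search command to steps and runs them with a bounded retry. All types and function names here are hypothetical stand-ins for the real ones in src/lib and src/runner. It is synchronous for brevity (a real browser runner would be asynchronous) and leaves out backtracking and persistence.

```typescript
// Illustrative types only; the real Command/ActionPlan shapes may differ.
type StepKind = "NAVIGATE" | "WAIT_FOR" | "FILL" | "PRESS" | "APPLY_FILTER" | "SORT" | "CLICK";
interface Step { kind: StepKind; target?: string; value?: string }
interface Command { site: string; query: string; maxPrice?: number; sort?: string }
type AgentEvent =
  | { type: "STEP"; index: number; kind: StepKind; attempt: number }
  | { type: "ERROR"; index: number; message: string }
  | { type: "STATE"; state: "done" | "failed" };

// Planner: turn a parsed search command into an ordered list of steps.
function planSearch(cmd: Command): Step[] {
  const steps: Step[] = [
    { kind: "NAVIGATE", target: "https://" + cmd.site },
    { kind: "WAIT_FOR", target: "search box" },
    { kind: "FILL", target: "search box", value: cmd.query },
    { kind: "PRESS", value: "Enter" },
  ];
  if (cmd.maxPrice !== undefined) {
    steps.push({ kind: "APPLY_FILTER", target: "price", value: "<=" + cmd.maxPrice });
  }
  if (cmd.sort !== undefined) steps.push({ kind: "SORT", value: cmd.sort });
  return steps;
}

// Executor: run each step up to maxAttempts times and stop at the first
// step that never succeeds. Returns true when every step succeeded.
function executePlan(
  steps: Step[],
  runStep: (step: Step, attempt: number) => void, // hypothetical runner hook; throws on failure
  emit: (event: AgentEvent) => void,
  maxAttempts = 2,
): boolean {
  for (let index = 0; index < steps.length; index++) {
    let succeeded = false;
    for (let attempt = 1; attempt <= maxAttempts && !succeeded; attempt++) {
      emit({ type: "STEP", index, kind: steps[index].kind, attempt });
      try {
        runStep(steps[index], attempt);
        succeeded = true;
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        emit({ type: "ERROR", index, message });
      }
    }
    if (!succeeded) {
      emit({ type: "STATE", state: "failed" });
      return false;
    }
  }
  emit({ type: "STATE", state: "done" });
  return true;
}
```

Passing emit as a callback mirrors the idea that the executor only produces events, while a separate service decides whether to store them or stream them to the UI.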

API Endpoints

  • POST /api/sessions → { sessionId }
  • POST /api/commands body { sessionId, utterance } → { command, plan }
  • POST /api/confirmations body { sessionId, commandId, confirmed, passphrase? } → { status, plan? }
  • GET /api/events?sessionId=...&live=1 → text/event-stream (SSE)
  • POST /api/cancel body { sessionId } → { status: 'cancelling' }
  • GET /api/viewer-url?sessionId=... → { viewerUrl: string | null }
  • GET /api/live?sessionId=... → multipart/x-mixed-replace stream (JPEG frames)
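
In a browser, the events endpoint can be consumed with the standard EventSource API. For other consumers (scripts, tests), the stream has to be split into frames by hand; the sketch below shows one way to do that. The helper names eventsUrl and parseSse are hypothetical, and the assumption that each frame carries a JSON payload is not confirmed by this README.

```typescript
// Assumption: each SSE frame carries its payload on one or more `data:` lines.
interface SseFrame { event: string; data: string }

// Builds the query string shown in the endpoint list above.
function eventsUrl(sessionId: string, live: boolean): string {
  return "/api/events?sessionId=" + encodeURIComponent(sessionId) + (live ? "&live=1" : "");
}

// Splits a text/event-stream buffer into complete frames and returns the
// unparsed tail so the caller can prepend it to the next chunk.
function parseSse(buffer: string): { frames: SseFrame[]; rest: string } {
  const blocks = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = blocks.pop() ?? ""; // the last block may be incomplete
  const frames: SseFrame[] = [];
  for (const block of blocks) {
    let event = "message"; // default event type in the SSE format
    const data: string[] = [];
    for (const line of block.split("\n")) {
      const colon = line.indexOf(":");
      if (colon === 0) continue; // comment or keep-alive line
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.charAt(0) === " ") value = value.slice(1); // strip one leading space
      if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    if (data.length > 0) frames.push({ event, data: data.join("\n") });
  }
  return { frames, rest };
}
```

A caller would keep a running buffer, append each received chunk, call parseSse, handle the returned frames, and carry rest over to the next chunk.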

Testing

  • Run all tests:
npm test
  • Watch mode:
npm run test:watch

The tests exercise API routes and SSE under JSDOM and cover runner logic in isolation. Playwright E2E tests would additionally require provisioning browsers and wiring a direct runner path.

Notes and Tips

  • Browserbase
    • Ensure BROWSERBASE_ENABLED=true and set BROWSERBASE_API_KEY (or provide BROWSERBASE_WS_ENDPOINT).
    • The route persists the session’s viewerUrl so the UI can open “Live Browser”.
  • Playwright (local)
    • The default /api/commands route currently prefers Browserbase. If you want a local browser, you can adapt the route to construct a PlaywrightRunner directly and set PLAYWRIGHT_ENABLED=true.
    • Install browsers: npx playwright install --with-deps.
  • SQLite
    • File is controlled by SQLITE_FILE (default ./data.sqlite). Schema is created on startup if missing.
  • LLM Fallback
    • Enable with LLM_FALLBACK_ENABLED=true and set OPENAI_API_KEY. Used when parser confidence is low or intent is unknown.
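
The LLM fallback rule ("parser confidence is low or intent is unknown") could be expressed as a small predicate. This is a sketch only: the ParseResult and FallbackConfig shapes, the "unknown" intent label, and the threshold value are assumptions, not values taken from the repository.

```typescript
// Hypothetical shapes; field names and the threshold are illustrative.
interface ParseResult { intent: string; confidence: number }
interface FallbackConfig { enabled: boolean; apiKey?: string; minConfidence: number }

function shouldUseLlmFallback(result: ParseResult, config: FallbackConfig): boolean {
  // The fallback needs both the feature flag and an API key.
  if (!config.enabled || !config.apiKey) return false;
  return result.intent === "unknown" || result.confidence < config.minConfidence;
}

// Example: a low-confidence parse triggers the fallback when it is configured.
const fallbackExample = shouldUseLlmFallback(
  { intent: "search", confidence: 0.4 },
  { enabled: true, apiKey: "sk-placeholder", minConfidence: 0.6 },
);
```

Keeping the decision separate from the actual OpenAI call makes it easy to unit test without network access.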

Folder Structure

  • src/app/ Next.js app (pages, components, APIs)
  • src/lib/ parser, planner, types, events
  • src/runner/ runner interface + Playwright integration and execution engine
  • src/server/ services, persistence (SQLite), live streaming
  • src/memory/ advisory selector memory (mem0)

Troubleshooting

  • No live browser view
    • Ensure Browserbase is enabled and configured; check /api/viewer-url returns a URL.
  • No screenshots
    • Set capture policy in the UI (Capture Settings), or set SCREENSHOT_EVERY; screenshots on error are enabled by default.
  • Database errors
    • Verify write permissions to SQLITE_FILE path and that only one process is writing.
  • Playwright errors
    • Install browsers (npx playwright install) and verify your runner path uses a local Playwright page/context.
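
When debugging the "No screenshots" case above, it can help to reason about the capture policy as a single predicate. The sketch below is one plausible reading of "capture on error" and "capture every N steps"; CapturePolicy and shouldCapture are hypothetical names, and counting steps from 1 is an assumption.

```typescript
// Hypothetical policy shape; everyNSteps = 0 disables periodic capture.
interface CapturePolicy { onError: boolean; everyNSteps: number }

// stepNumber is assumed to be 1-based.
function shouldCapture(policy: CapturePolicy, stepNumber: number, failed: boolean): boolean {
  if (failed && policy.onError) return true;
  return policy.everyNSteps > 0 && stepNumber % policy.everyNSteps === 0;
}
```

With onError set to false and everyNSteps set to 0, this predicate never returns true, which matches the symptom of seeing no screenshots at all.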

Design Docs

  • Design summary.md, Detailed-Design.md, Project-Requirements.md
  • Playwright-Runner-Design.md, Browserbase-Integration.md
  • Intents.md for NLP intents and examples

Made for rapid prototyping of voice-driven web automation. Contributions and suggestions welcome.
