What Aura AI is, in three paragraphs.
Aura AI is Sumeru’s AI Concierge — an enterprise-grade conversational presence layer that turns a business’s existing content into a face-driven, voice-led representative. It sells, supports, onboards, and interviews. It is, by design, far beyond a chatbot: an immersive conversational video experience that adapts in real time to who is on the other end.
It exists because customers and internal teams today lose 20–30% of their time navigating fragmented information: websites, PDFs, internal docs, and disconnected tools. Aura collapses that into a single conversation. Its agentic pipeline retrieves the most relevant answer from your knowledge base and delivers it through a lifelike avatar, the same way a great human concierge would.
The platform is a six-layer stack — knowledge, conversation, voice, avatar, visual intelligence, and deployment — designed to clear a sub-second time-to-first-response budget and to render a 30 fps photoreal head on commodity GPUs. It is built for teams who need control, performance, and presence at scale: enterprises that demand SOC 2, GDPR, and HIPAA-aware data handling, and platform teams who want a self-hosted GPU option for sovereign deployments.
Six layers, one representative.
Each layer is independently scalable, observable, and replaceable. The orchestrator binds them with a single representative manifest.
Knowledge
Vectorised content store. URL, PDF, video, and internal-doc sources, refreshed on a schedule or on demand.
Conversation
Agentic routing. A small, fast model handles routing and clarifications; a larger model handles substantive replies.
Voice
Streaming TTS with phoneme timestamps for lip-sync. Sub-300 ms first-audio after token start.
Avatar
GPU-rendered photoreal head, 30 fps, lip-sync locked to voice phonemes, gaze and blink driven by conversation state.
Visual Intelligence
Detects visitor appearance, emotion, and surroundings in real time to hyper-personalize each turn.
Deployment
Hosted page, widget embed, WordPress, Shopify, Flutter SDK, or self-hosted GPU pod.
Your content becomes the representative’s memory.
A representative is grounded by between one and eight sources. URLs are crawled to a depth you control. PDFs and Markdown are parsed with structure preserved — headings, lists, tables. Plain prompts are accepted for cases where the knowledge is the brand voice itself. Each source can be tagged for routing — a sales rep doesn’t read the engineering wiki by default.
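The source rules above could be checked with something like the following. This is a minimal sketch, not Aura's actual schema: the `Source` record, its field names (`kind`, `crawl_depth`, `tags`), and both helper functions are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

MIN_SOURCES, MAX_SOURCES = 1, 8  # limits stated in the text above
ALLOWED_KINDS = {"url", "pdf", "markdown", "prompt"}  # hypothetical kind names


@dataclass
class Source:
    kind: str                     # one of ALLOWED_KINDS
    location: str                 # URL, file path, or inline prompt text
    crawl_depth: int = 0          # only meaningful for "url" sources
    tags: set[str] = field(default_factory=set)  # routing tags, e.g. {"sales"}


def validate_sources(sources: list[Source]) -> None:
    """Raise ValueError if the source list breaks the stated limits."""
    if not MIN_SOURCES <= len(sources) <= MAX_SOURCES:
        raise ValueError(f"expected 1-8 sources, got {len(sources)}")
    for s in sources:
        if s.kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown source kind: {s.kind}")
        if s.crawl_depth < 0:
            raise ValueError("crawl_depth must be non-negative")


def sources_for(sources: list[Source], role: str) -> list[Source]:
    """Return only the sources tagged for the given representative role."""
    return [s for s in sources if role in s.tags]
```

With tags in place, a sales representative would only ever see the sources returned by `sources_for(sources, "sales")`, which is one way to keep the engineering wiki out of its default view.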
Content is chunked at semantic boundaries, embedded into a 1024-dimension vector index, and attributed at retrieval time so every reply can carry a citation. Indexes refresh hourly by default, with on-demand reindex for time-sensitive launches. Pricing pages, inventory, and docs that change daily are first-class sources, not afterthoughts.
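Retrieval with attribution can be illustrated with a toy index. The sketch below assumes a hypothetical `Chunk` record and uses 3-dimension vectors for readability instead of the 1024 dimensions stated above. It ranks by plain cosine similarity, which is a common choice but not confirmed as Aura's scoring method.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str            # where the chunk came from, used as the citation
    vector: list[float]    # toy 3-dim embedding; the text states 1024 dims


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], index: list[Chunk], k: int = 2) -> list[tuple[str, str]]:
    """Return the top-k (text, citation) pairs by cosine similarity."""
    ranked = sorted(index, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    return [(c.text, c.source) for c in ranked[:k]]


index = [
    Chunk("Pro plan pricing details", "https://example.com/pricing#pro", [0.9, 0.1, 0.0]),
    Chunk("API rate limits", "docs/limits.md", [0.0, 0.2, 0.9]),
]
```

Because each chunk carries its `source`, the caller gets the citation back alongside the text without a second lookup.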
Two models, one conversation.
Every user turn first hits a routing pass on a 7B-parameter model. It classifies the turn — clarification, factual question, objection, action request — and selects the slice of the knowledge index to retrieve. A typical routing pass completes in under 80 ms.
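The routing pass described above can be pictured as a classify-then-select step. In this sketch the four turn classes come from the text, but everything else is an assumption: the keyword classifier is only a stand-in for the 7B routing model, and the slice names in `SLICE_FOR_CLASS` are hypothetical.

```python
from enum import Enum


class TurnClass(Enum):
    CLARIFICATION = "clarification"
    FACTUAL = "factual_question"
    OBJECTION = "objection"
    ACTION = "action_request"


# Hypothetical mapping from turn class to the tagged slice of the index.
SLICE_FOR_CLASS = {
    TurnClass.CLARIFICATION: [],                  # answer from history, no retrieval
    TurnClass.FACTUAL: ["docs", "pricing"],
    TurnClass.OBJECTION: ["pricing", "case-studies"],
    TurnClass.ACTION: ["actions"],
}


def classify(turn: str) -> TurnClass:
    """Keyword stand-in for the routing model; a real router would be learned."""
    text = turn.lower()
    if any(w in text for w in ("book", "schedule", "sign me up")):
        return TurnClass.ACTION
    if any(w in text for w in ("too expensive", "not sure", "competitor")):
        return TurnClass.OBJECTION
    if text.startswith(("what do you mean", "sorry", "can you repeat")):
        return TurnClass.CLARIFICATION
    return TurnClass.FACTUAL


def route(turn: str) -> tuple[TurnClass, list[str]]:
    """Classify the turn and pick the index slice to retrieve from."""
    cls = classify(turn)
    return cls, SLICE_FOR_CLASS[cls]
```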
The substantive reply runs on a 70B-parameter model with the retrieved context, the representative’s persona prompt, and the live conversation history. Output is streamed token-by-token to the voice layer so audio synthesis can begin before the reply finishes generating.
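One way to let synthesis start before generation finishes is to flush the token stream to the voice layer at phrase boundaries. The sketch below assumes that approach; the punctuation set and the `fake_reply_stream` stand-in are illustrative, and the source does not specify how tokens are actually batched.

```python
from typing import Iterable, Iterator

PHRASE_END = (".", "!", "?", ",", ";", ":")


def phrases(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into phrases, flushing at punctuation."""
    buffer: list[str] = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip().endswith(PHRASE_END):
            yield "".join(buffer).strip()
            buffer.clear()
    if buffer:  # flush whatever is left when the stream ends
        yield "".join(buffer).strip()


def fake_reply_stream() -> Iterator[str]:
    """Stand-in for the reply model's token stream."""
    yield from ["Sure", ",", " the", " Pro", " plan", " includes", " SSO", "."]
```

Because `phrases` is a generator, the first phrase is available to the voice layer as soon as its closing punctuation arrives, while later tokens are still being generated.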
The orchestrator enforces representative scope. A sales rep cannot answer engineering questions; a support rep cannot quote sales discounts. Out-of-scope turns are gracefully redirected with copy you control. Structured output, such as booking a demo or capturing a qualified lead, is emitted as a JSON tool call that downstream systems can consume.
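Scope enforcement and structured output might look roughly like this. The scope table, the redirect copy, the tool name, and the `{"tool": ..., "arguments": ...}` envelope are all assumptions made for illustration; the source only states that out-of-scope turns are redirected and that actions are emitted as JSON tool calls.

```python
import json

# Hypothetical scope table: which topics each representative may answer.
SCOPE = {
    "sales": {"pricing", "plans", "demo"},
    "support": {"troubleshooting", "account"},
}
REDIRECT_COPY = "I can't help with that here, but I can connect you with the right team."


def in_scope(role: str, topic: str) -> bool:
    """True if the representative role is allowed to handle the topic."""
    return topic in SCOPE.get(role, set())


def handle(role: str, topic: str, tool: str, arguments: dict) -> str:
    """Return a JSON tool call for in-scope actions, else the redirect copy."""
    if not in_scope(role, topic):
        return REDIRECT_COPY
    return json.dumps({"tool": tool, "arguments": arguments}, sort_keys=True)
```

For example, `handle("sales", "demo", "book_demo", {"email": "a@example.com"})` yields a JSON string a downstream system can parse, while `handle("support", "pricing", "quote_discount", {})` returns the redirect copy.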
The face — and the eyes — on top of the model.
Aura is a conversational video experience, not a chat thread with extra steps. Three distinct subsystems give it presence: a streaming voice that speaks in 28 languages, a photoreal avatar that lip-syncs to that voice, and a perception layer that watches the visitor in real time so the conversation can adapt to who is on the other end.
Voice synthesis
Streaming TTS with phoneme timestamps. The first audio chunk leaves the server within 280 ms of the first token from the reply model. We support 28 languages out of the box, with real-time translation between any pair; voice cloning is available for Studio and Enterprise plans with verified consent.
- FIRST AUDIO: ≤ 280 ms
- SAMPLE RATE: 22 kHz · 16-bit PCM
- LANGUAGES: 28 · LIVE TRANSLATE
- VOICE CLONING: Studio +
Avatar rendering
A photoreal head rendered on a single GPU pod, 30 frames per second at 720p. Lip-sync is driven by the voice layer’s phoneme timestamps; gaze and blink are driven by conversation state. The rendered avatar is delivered to the client as a WebRTC stream, with a fallback to MJPEG for restricted networks.
- FRAME BUDGET: ≤ 33 ms
- RESOLUTION: 720p (1080p β)
- TRANSPORT: WebRTC, MJPEG fallback
- GPU PER REP: 1 · autoscaled
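To make the link between phoneme timestamps and the 30 fps render loop concrete, the sketch below maps each frame to the phoneme active at that frame's start time. The `Phoneme` record and the sample timings are hypothetical; the source says only that lip-sync is driven by phoneme timestamps, not how frames are scheduled.

```python
import math
from dataclasses import dataclass

FPS = 30  # frame rate stated above; one frame lasts 1000 / 30, about 33.3 ms


@dataclass
class Phoneme:
    symbol: str
    start_ms: float
    end_ms: float


def frames_for(phonemes: list[Phoneme]) -> dict[int, str]:
    """Map each frame index to the phoneme active at that frame's start time."""
    schedule: dict[int, str] = {}
    for p in phonemes:
        # Frame n starts at n * 1000 / FPS ms; keep frames with start_ms <= t < end_ms.
        first = math.ceil(p.start_ms * FPS / 1000)
        last = math.ceil(p.end_ms * FPS / 1000)  # exclusive upper bound
        for frame in range(first, last):
            schedule[frame] = p.symbol
    return schedule


schedule = frames_for([Phoneme("HH", 0, 100), Phoneme("AH", 100, 250)])
# Frames 0-2 show "HH"; frames 3-7 show "AH".
```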
Visual intelligence
When the visitor consents to camera, Aura perceives presence, emotion, and ambient context — and tunes the conversation accordingly. Reading the room is what separates a concierge from a script. The signals are processed on-device or in your tenant; raw video never leaves the perimeter you choose.
- SIGNALS: presence · emotion · environment
- CONSENT: explicit, per-session
- PROCESSING: on-device or tenant
- RAW VIDEO: never persisted
A second of presence, broken down.
Time-to-first-audio is measured from the end of the user’s last word. Targets are p95 on production traffic.
| Stage | Budget (ms) |
|---|---|
| ASR | 90 |
| ROUTE | 80 |
| FETCH | 70 |
| LLM | 220 |
| TTS | 280 |
| NET | 60 |
| Total | 800 |
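The stage budgets in the table sum to the stated total, which leaves 200 ms of headroom under a one-second ceiling. A small check like the following could flag stages that run over; the `over_budget` helper and the sample measurements are hypothetical.

```python
# Per-stage p95 budgets in milliseconds, copied from the table above.
BUDGET_MS = {"ASR": 90, "ROUTE": 80, "FETCH": 70, "LLM": 220, "TTS": 280, "NET": 60}

TOTAL_MS = sum(BUDGET_MS.values())   # 800
HEADROOM_MS = 1000 - TOTAL_MS        # 200 ms under a one-second ceiling


def over_budget(measured_ms: dict[str, float]) -> dict[str, float]:
    """Return the overshoot in ms for each measured stage that exceeds its budget."""
    return {
        stage: value - BUDGET_MS[stage]
        for stage, value in measured_ms.items()
        if value > BUDGET_MS[stage]
    }
```

For instance, `over_budget({"ASR": 95, "LLM": 200})` reports only the ASR stage, 5 ms over.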
Seven surfaces. One representative manifest.
A representative is described by a single manifest — its sources, persona, voice, avatar, and guardrails — and that manifest is the only thing that needs to travel between surfaces. Every surface listed below pulls from the same backend; switching surfaces does not require re-training, re-indexing, or re-uploading anything.
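As a rough picture of a manifest that travels between surfaces, the sketch below models the five parts named above as one immutable record with a JSON round trip. The field types and the JSON encoding are assumptions; the source does not publish the manifest format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Manifest:
    sources: tuple[str, ...]      # source locations or IDs
    persona: str                  # persona prompt
    voice: str                    # hypothetical voice identifier
    avatar: str                   # hypothetical avatar identifier
    guardrails: tuple[str, ...]   # scope rules, as free-form strings here


def to_json(manifest: Manifest) -> str:
    """Serialise the manifest; tuples are written as JSON arrays."""
    return json.dumps(asdict(manifest), sort_keys=True)


def from_json(payload: str) -> Manifest:
    """Rebuild a manifest, converting JSON arrays back to tuples."""
    data = json.loads(payload)
    return Manifest(
        sources=tuple(data["sources"]),
        persona=data["persona"],
        voice=data["voice"],
        avatar=data["avatar"],
        guardrails=tuple(data["guardrails"]),
    )
```

Since the record is the only thing a surface needs, moving a representative from a widget to a hosted page amounts to handing the same serialised manifest to a different front end.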
Hosted page
A standalone URL we host. Zero-integration option.
Website widget
A floating bubble that opens to a panel on any site.
Inline embed
An iframe-free div that lives inside your existing page.
WordPress plugin
Drop-in plugin, single-shortcode placement.
Shopify app
Storefront app for product Q&A and assisted sale.
Flutter SDK
Mobile-native rendering for iOS and Android apps.
Self-hosted GPU pod
Run the entire stack inside your own VPC. Bring-your-own model weights for Enterprise. Sovereign deployments by request.
Built for teams that have to answer to a CISO.
Data is encrypted in transit (TLS 1.3) and at rest (AES-256). Conversation transcripts are tenant-isolated and retained only as long as your policy requires — Studio defaults to 30 days, Enterprise is configurable down to zero retention. Voice and avatar streams are never recorded by default. PII fields can be redacted before they reach the reply model. We are SOC 2 Type II audited; HIPAA-aware data handling and GDPR data residency in EU and US regions ship today; FedRAMP is on the 2026 roadmap.
SOC 2 Type II
AUDITED · ANNUAL
GDPR
EU & US RESIDENCY
HIPAA-aware
BAA AVAILABLE
AES-256 / TLS 1.3
TRANSIT + REST
SSO / SAML / SCIM
ENTERPRISE
PII Redaction
PRE-MODEL
Self-host option
BYO INFRASTRUCTURE
Configurable Retention
0 — UNLIMITED
Where the conversation goes.
Aura emits structured events for every conversation milestone — turn started, intent detected, lead captured, demo booked, escalation requested. Events fan out to your existing systems without polling. The Intelligent Agent Analytics dashboard gives you live engagement, conversion, and resolution metrics across every representative.
- representative.session.started
- representative.turn.completed
- representative.intent.detected
- representative.lead.captured
- representative.demo.booked
- representative.escalation.requested
- representative.transcript.archived
- representative.session.ended
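A consumer of these events might dispatch on the event type. In the sketch below the event names come from the list above, but the envelope shape (`{"type": ..., "data": ...}`) and the `on` and `dispatch` helpers are assumptions, since the actual schema is not reproduced here.

```python
from typing import Callable

Handler = Callable[[dict], None]
_handlers: dict[str, list[Handler]] = {}


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Register a handler for one event type."""
    def register(handler: Handler) -> Handler:
        _handlers.setdefault(event_type, []).append(handler)
        return handler
    return register


def dispatch(event: dict) -> int:
    """Call every handler registered for event["type"]; return how many ran."""
    handlers = _handlers.get(event["type"], [])
    for handler in handlers:
        handler(event)
    return len(handlers)


captured_leads: list[dict] = []


@on("representative.lead.captured")
def store_lead(event: dict) -> None:
    # Hypothetical sink; a real consumer might write to a CRM instead.
    captured_leads.append(event["data"])
```

Events with no registered handler are ignored, so a consumer can subscribe to only the milestones it cares about.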
What’s shipping, and what’s next.
A non-binding view of the next two quarters. Dates are intent, not contract.
NOW · Q2 2026
- 1080p avatar streams
- Voice cloning, GA
- BYO weights, Enterprise
- Slack & Segment integrations
- EU residency, GA
NEXT · Q3 2026
- On-device avatar (Apple Silicon)
- Multilingual voice cloning
- Self-host operator UI
- Zapier, Pipedrive, Zendesk
- APAC residency
LATER · Q4 2026+
- FedRAMP Moderate
- Real-time translation, 28 ↔ 28
- Multi-rep orchestration
- Direct CRM sync (bidirectional)
- Sovereign cloud partnerships
Engineering questions, answered short.
Can we run Aura in our own VPC?
Yes. The Self-hosted GPU pod option ships the entire stack — knowledge, conversation, voice, avatar — as a Kubernetes operator. You bring the GPUs, we bring the operator. Air-gapped deployments are supported for Enterprise.
Do you train on our conversations?
No. Customer conversations are never used to train shared models. Per-tenant fine-tuning is opt-in and isolated to your tenant.
What happens if a model goes down?
The orchestrator falls back to a smaller backup model and emits a degradation event. The avatar continues to render; the voice layer keeps streaming. End-users see a slightly slower response, not a broken experience.
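The fallback behaviour can be sketched as a try-primary-then-backup wrapper. The function signature and the `representative.model.degraded` event name are hypothetical. The source confirms only that a smaller backup model takes over and that a degradation event is emitted.

```python
from typing import Callable

Model = Callable[[str], str]


def reply_with_fallback(
    prompt: str,
    primary: Model,
    backup: Model,
    emit: Callable[[dict], None],
) -> str:
    """Try the primary model; on failure, emit a degradation event and use the backup."""
    try:
        return primary(prompt)
    except Exception as exc:
        # Event name is an assumption; it is not part of the published event list.
        emit({"type": "representative.model.degraded", "reason": str(exc)})
        return backup(prompt)
```

The voice and avatar layers only see the returned text, which fits the claim that they keep streaming while the reply model is swapped.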
How do you handle PII?
PII fields can be redacted at the orchestrator before any text reaches the reply model. Redaction patterns are configurable per representative. Transcripts can be configured to exclude PII at archive time.
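Pattern-based redaction before the reply model could be as simple as the sketch below. The two regular expressions are deliberately rough illustrations, not Aura's actual patterns, and the `[LABEL]` placeholder format is an assumption.

```python
import re

# Illustrative patterns only; production redaction would need broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str, patterns: dict[str, re.Pattern] = PATTERNS) -> str:
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in patterns.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Passing a different `patterns` dictionary per representative is one way to get the per-representative configuration described above.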
Can we use our own LLM?
Enterprise customers can bring their own model weights for the reply layer. The routing layer remains Aura-managed for orchestration consistency.
What’s the smallest deployment?
One representative, one source, the Free tier. Indexing typically completes in under two minutes for a single-page URL. The hosted page is live the same minute the index completes.
