TECHNICAL BRIEF · v1.0 · MAY 2026

Aura AI, under the hood.

A technical overview for engineering and platform teams. The six-layer system that turns a knowledge base into a face — its architecture, latency budgets, security posture, and the surfaces it deploys to. About a ten-minute read.

LAST UPDATED · MAY 5, 2026 | ≈ 10 MIN READ | INTERNAL REVIEW · COMPLETE

01 · EXECUTIVE SUMMARY

What Aura AI is, in three paragraphs.

Aura AI is Sumeru’s AI Concierge — an enterprise-grade conversational presence layer that turns a business’s existing content into a face-driven, voice-led representative. It sells, supports, onboards, and interviews. It is, by design, far beyond a chatbot: an immersive conversational video experience that adapts in real time to who is on the other end.

It exists because customers and internal teams today lose 20–30% of their time navigating fragmented information: websites, PDFs, internal docs, disconnected tools. Aura collapses that into a single conversation. Using agentic routing and retrieval, it finds the most relevant answer in your knowledge base within a sub-second budget and delivers it through a lifelike avatar, the same way a great human concierge would.

The platform is a six-layer stack — knowledge, conversation, voice, avatar, visual intelligence, and deployment — designed to clear a sub-second time-to-first-response budget and to render a 30 fps photoreal head on commodity GPUs. It is built for teams who need control, performance, and presence at scale: enterprises that demand SOC 2, GDPR, and HIPAA-aware data handling, and platform teams who want a self-hosted GPU option for sovereign deployments.

02 · ARCHITECTURE

Six layers, one representative.

Each layer is independently scalable, observable, and replaceable. The orchestrator binds them with a single representative manifest.

Knowledge

Vectorised content store. URL, PDF, video, and internal-doc sources, refreshed on a schedule or on demand.

Conversation

Agentic routing. A small, fast model handles routing and clarifications; a larger model handles substantive replies.

Voice

Streaming TTS with phoneme timestamps for lip-sync. Sub-300 ms first-audio after token start.

Avatar

GPU-rendered photoreal head, 30 fps, lip-sync locked to voice phonemes, gaze and blink driven by conversation state.

Visual Intelligence

Detects visitor appearance, emotion, and surroundings in real time to hyper-personalize each turn.

Deployment

Hosted page, widget embed, WordPress, Shopify, Flutter SDK, or self-hosted GPU pod.

03 · KNOWLEDGE

Your content becomes the representative’s memory.

A representative is grounded by between one and eight sources. URLs are crawled to a depth you control. PDFs and Markdown are parsed with structure preserved — headings, lists, tables. Plain prompts are accepted for cases where the knowledge is the brand voice itself. Each source can be tagged for routing — a sales rep doesn’t read the engineering wiki by default.

Content is chunked at semantic boundaries, embedded into a 1024-dimension vector index, and attributed at retrieval time so every reply can carry a citation. Indexes refresh hourly by default, with on-demand reindex for time-sensitive launches. Pricing pages, inventory, and docs that change daily are first-class sources, not afterthoughts.
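The attribution step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Aura's implementation: the `Chunk` type, the `retrieve` function, the example sources, and the toy 3-dimension vectors (standing in for 1024-dimension embeddings) are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str                 # kept alongside the text so a reply can cite it
    vector: tuple[float, ...]   # toy embedding; the brief describes 1024 dimensions


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, index, top_k=2):
    """Return the top_k chunks by similarity, each paired with its source."""
    ranked = sorted(index, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return [(c.text, c.source) for c in ranked[:top_k]]


# Hypothetical index contents, for illustration only.
index = [
    Chunk("Pro plan is billed annually.", "https://example.com/pricing", (0.9, 0.1, 0.0)),
    Chunk("Reset a password from Settings.", "support-guide.pdf", (0.0, 0.2, 0.9)),
    Chunk("Discounts apply above 50 seats.", "https://example.com/pricing", (0.8, 0.3, 0.1)),
]

hits = retrieve((1.0, 0.2, 0.0), index)
```

Because each chunk carries its source through ranking, the reply layer can attach a citation without a second lookup.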

SOURCE TYPES: URL · PDF · MARKDOWN · STRUCTURED · PROMPT · RSS / SITEMAP

04 · CONVERSATION

Two models, one conversation.

Every user turn first hits a routing pass on a 7B-parameter model. It classifies the turn — clarification, factual question, objection, action request — and selects the slice of the knowledge index to retrieve. A typical routing pass completes in under 80 ms.
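The classify-then-select shape of the routing pass might look like the sketch below. The keyword matcher is only a stand-in for the routing model, and the tag names in `SLICE_FOR_TURN` are hypothetical; only the four turn classes come from the text above.

```python
from enum import Enum


class TurnType(Enum):
    CLARIFICATION = "clarification"
    FACTUAL_QUESTION = "factual_question"
    OBJECTION = "objection"
    ACTION_REQUEST = "action_request"


# Hypothetical mapping from turn type to the source tags worth retrieving from.
SLICE_FOR_TURN = {
    TurnType.CLARIFICATION: [],                      # conversation history is enough
    TurnType.FACTUAL_QUESTION: ["docs", "pricing"],
    TurnType.OBJECTION: ["pricing", "case-studies"],
    TurnType.ACTION_REQUEST: [],                     # handled by a tool call instead
}


def classify(turn: str) -> TurnType:
    """Keyword stand-in for the routing model; a real router would call the model."""
    text = turn.lower()
    if any(p in text for p in ("book", "schedule", "sign me up")):
        return TurnType.ACTION_REQUEST
    if any(p in text for p in ("too expensive", "not sure", "competitor")):
        return TurnType.OBJECTION
    if any(p in text for p in ("what do you mean", "can you repeat")):
        return TurnType.CLARIFICATION
    return TurnType.FACTUAL_QUESTION


def route(turn: str):
    """Classify the turn, then pick the slice of the index to retrieve from."""
    turn_type = classify(turn)
    return turn_type, SLICE_FOR_TURN[turn_type]
```

For example, `route("That seems too expensive")` yields the objection class and the pricing-oriented tags, so retrieval only touches that slice.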

The substantive reply runs on a 70B-parameter model with the retrieved context, the representative’s persona prompt, and the live conversation history. Output is streamed token-by-token to the voice layer so audio synthesis can begin before the reply finishes generating.
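One way to let synthesis start before generation ends is to cut the token stream into speakable phrases at punctuation and hand each phrase on as soon as it closes. The sketch below assumes that approach; `phrases` and `fake_reply_stream` are hypothetical names, and the brief does not say how Aura actually segments the stream.

```python
from typing import Iterable, Iterator


def phrases(tokens: Iterable[str], boundaries: str = ".,;:!?") -> Iterator[str]:
    """Yield a phrase as soon as a token ends with boundary punctuation."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip().endswith(tuple(boundaries)):
            yield "".join(buffer).strip()
            buffer = []
    if buffer:  # flush whatever remains when the stream ends
        yield "".join(buffer).strip()


def fake_reply_stream():
    # Stand-in for the reply model's token stream.
    yield from ["Sure", ",", " the", " Pro", " plan", " includes", " SSO", ".",
                " Anything", " else", "?"]


spoken = list(phrases(fake_reply_stream()))
```

Because `phrases` is a generator, a consumer can pull the first phrase after only two tokens have arrived, while the rest of the reply is still being produced.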

The orchestrator enforces representative scope. A sales rep cannot answer engineering questions; a support rep cannot quote sales discounts. Out-of-scope turns are gracefully redirected with copy you control. Structured output (book a demo, capture a qualified lead) is emitted as a JSON tool call that downstream systems can consume.
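A scope check and a structured tool call could be sketched as follows. Everything named here is an assumption made for illustration: the `SCOPES` table, the redirect copy, and the JSON field names are not taken from Aura's actual schema.

```python
import json

# Hypothetical scope table: topics each representative may answer.
SCOPES = {
    "sales": {"pricing", "plans", "discounts"},
    "support": {"troubleshooting", "account"},
}

# Redirect copy would be supplied by the customer; this wording is a placeholder.
REDIRECT_COPY = "That is outside what I can help with here. Let me point you to the right team."


def handle_turn(rep: str, topic: str, draft_reply: str) -> str:
    """Return the draft reply if the topic is in scope, else the redirect copy."""
    if topic in SCOPES.get(rep, set()):
        return draft_reply
    return REDIRECT_COPY


def book_demo_call(email: str, slot: str) -> str:
    """Serialize a structured action as a JSON tool call (field names are illustrative)."""
    return json.dumps({"tool": "book_demo", "arguments": {"email": email, "slot": slot}})
```

With this table, `handle_turn("support", "discounts", ...)` returns the redirect copy, while the same topic routed to the sales representative passes through unchanged.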

05 · VOICE, AVATAR & VISION

The face — and the eyes — on top of the model.

Aura is a conversational video experience, not a chat thread with extra steps. Three distinct subsystems give it presence: a streaming voice that speaks in 28 languages, a photoreal avatar that lip-syncs to that voice, and a perception layer that watches the visitor in real time so the conversation can adapt to who is on the other end.

Voice synthesis

Streaming TTS with phoneme timestamps. The first audio chunk leaves the server within 280 ms of the first token from the reply model. We support 28 languages out of the box, with real-time translation between any pair; voice cloning is available for Studio and Enterprise plans with verified consent.

  • FIRST AUDIO: ≤ 280 ms
  • SAMPLE RATE: 22 kHz · 16-bit PCM
  • LANGUAGES: 28 · LIVE TRANSLATE
  • VOICE CLONING: Studio +

Avatar rendering

A photoreal head rendered on a single GPU pod at 30 frames per second and 720p. Lip-sync is driven by the voice layer’s phoneme timestamps; gaze and blink are driven by conversation state. The rendered avatar is delivered to the client as a WebRTC stream, with an MJPEG fallback for restricted networks.

  • FRAME BUDGET: ≤ 33 ms
  • RESOLUTION: 720p (1080p beta)
  • TRANSPORT: WebRTC, MJPEG fallback
  • GPU PER REP: 1 · autoscaled
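To make the link between phoneme timestamps and the 30 fps frame clock concrete, the sketch below assigns each frame the phoneme active at the frame's start time. The timeline format `(phoneme, start_ms, end_ms)` and the function names are assumptions; the brief does not specify the voice layer's timestamp format.

```python
FPS = 30  # at 30 fps each frame lasts about 33.3 ms, hence the per-frame budget

# Hypothetical phoneme timeline from the voice layer: (phoneme, start_ms, end_ms).
timeline = [("HH", 0, 60), ("EH", 60, 180), ("L", 180, 240), ("OW", 240, 400)]


def phoneme_at(t_ms: float, timeline):
    """Return the phoneme active at t_ms, or None during silence."""
    for phoneme, start, end in timeline:
        if start <= t_ms < end:
            return phoneme
    return None


def frames_for(timeline, fps: int = FPS):
    """Assign each video frame the phoneme active at the frame's start time."""
    total_ms = timeline[-1][2]
    n_frames = total_ms * fps // 1000        # integer math avoids float rounding
    return [phoneme_at(i * 1000 / fps, timeline) for i in range(n_frames)]


frames = frames_for(timeline)
```

A real renderer would more likely blend between mouth shapes than switch per frame; the point here is only that the timestamps give the avatar a deterministic target for every frame.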

Visual intelligence

When the visitor consents to camera access, Aura perceives presence, emotion, and ambient context, and tunes the conversation accordingly. Reading the room is what separates a concierge from a script. The signals are processed on-device or in your tenant; raw video never leaves the perimeter you choose.

  • SIGNALS: presence · emotion · env
  • CONSENT: explicit, per-session
  • PROCESSING: on-device or tenant
  • RAW VIDEO: never persisted

06 · LATENCY

A second of presence, broken down.

Time-to-first-audio is measured from the user’s last word. Targets are p95 on production traffic.

p95 · time-to-first-audio · TOTAL · 800 ms

Latency budget per layer (p95)

Stage    Budget (ms)
ASR       90
ROUTE     80
FETCH     70
LLM      220
TTS      280
NET       60
Total    800

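The budget arithmetic can be checked directly, and the same table is a natural basis for flagging stages that overrun. In the sketch below the per-stage budgets are the figures from this section; the `over_budget` helper and the measured values are hypothetical.

```python
# Per-stage p95 budgets in milliseconds, taken from the latency table in this section.
BUDGET_MS = {"ASR": 90, "ROUTE": 80, "FETCH": 70, "LLM": 220, "TTS": 280, "NET": 60}
TOTAL_BUDGET_MS = 800


def over_budget(measured_ms: dict) -> dict:
    """Return the stages whose measured p95 exceeds budget, with the overrun in ms."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in BUDGET_MS.items()
        if measured_ms.get(stage, 0) > limit
    }


# Hypothetical measurements, for illustration only.
measured = {"ASR": 85, "ROUTE": 95, "FETCH": 60, "LLM": 240, "TTS": 270, "NET": 55}
overruns = over_budget(measured)
```

Per-stage p95 values do not strictly add up to the end-to-end p95, so the 800 ms sum is best read as a budget to allocate against rather than a measurement.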
07 · DEPLOYMENT

Seven surfaces. One representative manifest.

A representative is described by a single manifest — its sources, persona, voice, avatar, and guardrails — and that manifest is the only thing that needs to travel between surfaces. Every surface listed below pulls from the same backend; switching surfaces does not require re-training, re-indexing, or re-uploading anything.
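As a rough picture of what such a manifest might hold, the sketch below models the five parts named above as a Python dataclass. The field names, types, and example values are assumptions, not Aura's manifest schema; the one-to-eight source limit comes from the Knowledge section.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    kind: str                                   # e.g. "url", "pdf", "markdown", "prompt"
    location: str
    tags: list = field(default_factory=list)    # used for routing


@dataclass
class Manifest:
    name: str
    persona: str
    voice: str
    avatar: str
    sources: list
    guardrails: list = field(default_factory=list)

    def __post_init__(self):
        # The brief states a representative is grounded by one to eight sources.
        if not 1 <= len(self.sources) <= 8:
            raise ValueError("a representative needs between 1 and 8 sources")


# Hypothetical example values.
manifest = Manifest(
    name="sales-concierge",
    persona="Warm, concise, never pushy.",
    voice="en-US-default",
    avatar="studio-01",
    sources=[Source("url", "https://example.com/pricing", tags=["sales"])],
    guardrails=["no-engineering-questions"],
)
```

Since every surface reads the same object, moving a representative from one surface to another changes where it is rendered, not what it is.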

Hosted page

A standalone URL we host. Zero-integration option.

Website widget

A floating bubble that opens to a panel on any site.

Inline embed

An iframe-free div that lives inside your existing page.

WordPress plugin

Drop-in plugin, single-shortcode placement.

Shopify app

Storefront app for product Q&A and assisted sale.

Flutter SDK

Mobile-native rendering for iOS and Android apps.

Self-hosted GPU pod

Run the entire stack inside your own VPC. Bring-your-own model weights for Enterprise. Sovereign deployments by request.

08 · SECURITY

Built for teams that have to answer to a CISO.

Data is encrypted in transit (TLS 1.3) and at rest (AES-256). Conversation transcripts are tenant-isolated and retained only as long as your policy requires — Studio defaults to 30 days, Enterprise is configurable down to zero retention. Voice and avatar streams are never recorded by default. PII fields can be redacted before they reach the reply model. We are SOC 2 Type II audited; HIPAA-aware data handling and GDPR data residency in EU and US regions ship today; FedRAMP is on the 2026 roadmap.

SOC 2 Type II

AUDITED · ANNUAL

GDPR

EU & US RESIDENCY

HIPAA-aware

BAA AVAILABLE

AES-256 / TLS 1.3

TRANSIT + REST

SSO / SAML / SCIM

ENTERPRISE

PII Redaction

PRE-MODEL

Self-host option

BYO INFRASTRUCTURE

Configurable Retention

0 — UNLIMITED

09 · INTEGRATIONS

Where the conversation goes.

Aura emits structured events for every conversation milestone — turn started, intent detected, lead captured, demo booked, escalation requested. Events fan out to your existing systems without polling. The Intelligent Agent Analytics dashboard gives you live engagement, conversion, and resolution metrics across every representative.

WEBHOOKS
SLACK
SEGMENT
HUBSPOT
SALESFORCE
INTERCOM
ZAPIER
PIPEDRIVE
ZENDESK
Event schema:
representative.session.started
representative.turn.completed
representative.intent.detected
representative.lead.captured
representative.demo.booked
representative.escalation.requested
representative.transcript.archived
representative.session.ended
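A consumer for these events might look like the sketch below. The event names are the ones listed above; the payload shape (`event` and `data` keys), the `on` decorator, and the `dispatch` function are assumptions made for illustration, since the brief does not define the payload format.

```python
import json
from collections import defaultdict

KNOWN_EVENTS = {
    "representative.session.started",
    "representative.turn.completed",
    "representative.intent.detected",
    "representative.lead.captured",
    "representative.demo.booked",
    "representative.escalation.requested",
    "representative.transcript.archived",
    "representative.session.ended",
}

handlers = defaultdict(list)


def on(event_name):
    """Register a handler for one of the known event names."""
    if event_name not in KNOWN_EVENTS:
        raise ValueError(f"unknown event: {event_name}")

    def register(fn):
        handlers[event_name].append(fn)
        return fn

    return register


def dispatch(raw_body: str) -> int:
    """Parse a webhook body and call every handler registered for its event name."""
    event = json.loads(raw_body)
    matched = handlers.get(event["event"], [])
    for fn in matched:
        fn(event.get("data", {}))
    return len(matched)


leads = []


@on("representative.lead.captured")
def store_lead(data):
    leads.append(data["email"])


handled = dispatch(json.dumps(
    {"event": "representative.lead.captured", "data": {"email": "a@example.com"}}
))
```

Rejecting unknown names at registration time catches typos in event names early, before a handler silently never fires.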
10 · ROADMAP

What’s shipping, and what’s next.

A non-binding view of the next two quarters. Dates are intent, not contract.

NOW · Q2 2026

  • 1080p avatar streams
  • Voice cloning, GA
  • BYO weights, Enterprise
  • Slack & Segment integrations
  • EU residency, GA

NEXT · Q3 2026

  • On-device avatar (Apple Silicon)
  • Multilingual voice cloning
  • Self-host operator UI
  • Zapier, Pipedrive, Zendesk
  • APAC residency

LATER · Q4 2026+

  • FedRAMP Moderate
  • Real-time translation, 28 ↔ 28
  • Multi-rep orchestration
  • Direct CRM sync (bidirectional)
  • Sovereign cloud partnerships
11 · FAQ

Engineering questions, answered short.

Can we run Aura in our own VPC?

Yes. The Self-hosted GPU pod option ships the entire stack — knowledge, conversation, voice, avatar — as a Kubernetes operator. You bring the GPUs, we bring the operator. Air-gapped deployments are supported for Enterprise.

Do you train on our conversations?

No. Customer conversations are never used to train shared models. Per-tenant fine-tuning is opt-in and isolated to your tenant.

What happens if a model goes down?

The orchestrator falls back to a smaller backup model and emits a degradation event. The avatar continues to render; the voice layer keeps streaming. End-users see a slightly slower response, not a broken experience.
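The fallback behaviour can be pictured as a small wrapper around the model call. This is a sketch under assumptions: the function names, the callable-based model interface, and the shape of the degradation event are all hypothetical.

```python
def reply_with_fallback(prompt, primary, backup, emit):
    """Try the primary model; on failure, emit a degradation event and use the backup."""
    try:
        return primary(prompt)
    except Exception as error:
        # Event shape is hypothetical; the brief only says a degradation event is emitted.
        emit({"event": "degradation", "reason": str(error)})
        return backup(prompt)


def failing_primary(prompt):
    # Stand-in for an unavailable reply model.
    raise TimeoutError("primary model unavailable")


def small_backup(prompt):
    # Stand-in for the smaller backup model.
    return f"(backup) {prompt}"


events = []
answer = reply_with_fallback(
    "What plans do you offer?", failing_primary, small_backup, events.append
)
```

The caller always receives a reply, and the emitted event gives operators a signal that the answer came from the backup path.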

How do you handle PII?

PII fields can be redacted at the orchestrator before any text reaches the reply model. Redaction patterns are configurable per representative. Transcripts can be configured to exclude PII at archive time.
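Pattern-based redaction before the reply model might look like the sketch below. The two regular expressions are deliberately simple illustrations, not Aura's shipped patterns, and would miss many real-world formats.

```python
import re

# Illustrative patterns only; real rules would be configured per representative.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str, patterns=PATTERNS) -> str:
    """Replace each match with a bracketed label before the text reaches the reply model."""
    for label, pattern in patterns.items():
        text = pattern.sub(f"[{label}]", text)
    return text


clean = redact("Reach me at jane.doe@example.com or +1 415 555 0100.")
```

Replacing matches with a typed label rather than deleting them keeps the sentence readable for the reply model, which can still tell that an email or phone number was provided.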

Can we use our own LLM?

Enterprise customers can bring their own model weights for the reply layer. The routing layer remains Aura-managed for orchestration consistency.

What’s the smallest deployment?

One representative, one source, the Free tier. Indexing typically completes in under two minutes for a single-page URL. The hosted page is live the same minute the index completes.

READY?

You’ve read the brief. Now let’s build a representative.

Talk to engineering for a deep technical fit, or self-serve in under four minutes with the avatar configurator.