
Technology

How Lumyeye works.

Lumyeye combines multimodal AI (vision + voice), real-time streaming, and a deliberately minimalist UX built for blind users. Behind the scenes, here is the stack that lets the app read a label or describe a busy street in under a second.

Multimodal AI core

Each photo or video frame is processed by enterprise-grade large multimodal models (OpenAI, Anthropic, Google) that turn the image into a text answer. We tune prompts for accuracy on documents, labels, faces, scenes, currency, and medications.
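
As a rough sketch of that round trip in Swift (the endpoint, model id, and JSON payload shape below are placeholders, not our production API):

    import UIKit

    // Illustrative only: the URL, model id, and payload shape are placeholders.
    func describeImage(_ image: UIImage, prompt: String) async throws -> String {
        guard let jpeg = image.jpegData(compressionQuality: 0.7) else {
            throw URLError(.cannotDecodeContentData)
        }
        var request = URLRequest(url: URL(string: "https://api.example.com/v1/vision")!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONSerialization.data(withJSONObject: [
            "model": "multimodal-large",                 // placeholder model id
            "prompt": prompt,                            // e.g. a tuned label-reading prompt
            "image_base64": jpeg.base64EncodedString()
        ])
        let (data, _) = try await URLSession.shared.data(for: request)
        let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
        return json?["answer"] as? String ?? ""
    }

The tuning lives in the prompt: the same frame gets a different prompt depending on whether the question is about a document, a medication, or a scene.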

Voice in / voice out

Voice questions are transcribed locally on the iPhone (Apple's Speech framework) or with cloud STT. Answers come back as text, then are read aloud by a premium neural voice (Pro) or the system voice (Classic). End-to-end, the answer takes less than 1.5 seconds on a fast connection.
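
A minimal sketch of both halves of that loop with Apple's Speech and AVFoundation frameworks, assuming speech authorization has already been granted (error handling trimmed):

    import Speech
    import AVFoundation

    // Transcribe a recorded question on-device (the local path; cloud STT is the fallback).
    func transcribe(audioFileURL: URL, completion: @escaping (String) -> Void) {
        guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")) else { return }
        let request = SFSpeechURLRecognitionRequest(url: audioFileURL)
        request.requiresOnDeviceRecognition = true   // keep the audio on the phone
        _ = recognizer.recognitionTask(with: request) { result, _ in
            if let result, result.isFinal {
                completion(result.bestTranscription.formattedString)
            }
        }
    }

    let synthesizer = AVSpeechSynthesizer()

    // Read the answer aloud with the system voice (the Classic path; Pro swaps
    // in a premium neural voice).
    func speak(_ answer: String) {
        let utterance = AVSpeechUtterance(string: answer)
        utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
        synthesizer.speak(utterance)
    }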

Vision Live (Lumyeye Pro)

We stream video frames at 1–2 fps, with adaptive throttling. Each frame is sent to a streaming-capable multimodal model. The conversation is stateless: you can interrupt anytime by tapping once.
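
In outline, the throttle can look like the sketch below. Frames arrive from an AVCaptureVideoDataOutput delegate; the backoff rule (fall to 1 fps on a slow link, return to 2 fps on a fast one) is illustrative rather than our production heuristic:

    import AVFoundation

    final class FrameThrottler: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
        // Caller-supplied upload; calls back with round-trip latency once the answer lands.
        var uploadFrame: (CMSampleBuffer, ((TimeInterval) -> Void)) -> Void = { _, _ in }

        private var lastSent = Date.distantPast
        private var interval: TimeInterval = 0.5      // start at 2 fps

        func captureOutput(_ output: AVCaptureOutput,
                           didOutput sampleBuffer: CMSampleBuffer,
                           from connection: AVCaptureConnection) {
            let now = Date()
            guard now.timeIntervalSince(lastSent) >= interval else { return }  // drop this frame
            lastSent = now
            uploadFrame(sampleBuffer) { [weak self] latency in
                // Adaptive throttling: back off to 1 fps while answers are slow,
                // speed back up to 2 fps once the connection recovers.
                self?.interval = latency > 1.0 ? 1.0 : 0.5
            }
        }
    }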

Zero server storage

Images are sent over TLS, processed, and immediately dropped. No persistence, no training. Voice transcripts are kept in encrypted memory for the active session only.
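
On the device, the same no-persistence principle applies to networking. The sketch below shows the idea with an ephemeral URLSession, which keeps no cache, cookies, or credentials on disk (illustrative, not our exact networking code):

    import Foundation

    // An ephemeral session stores nothing on disk, so a request leaves no
    // local trace once it completes.
    let config = URLSessionConfiguration.ephemeral
    config.requestCachePolicy = .reloadIgnoringLocalAndRemoteCacheData
    config.tlsMinimumSupportedProtocolVersion = .TLSv12   // refuse pre-TLS-1.2 connections
    let session = URLSession(configuration: config)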

VoiceOver / TalkBack first

We design every screen as if it had no visible UI. Labels, focus order, and gestures are validated weekly with blind users. We spent months on Pro to deliver the smoothest VoiceOver experience on the market.
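
In SwiftUI terms, that mindset looks roughly like this (an illustrative screen, not an actual Lumyeye one): every element carries an explicit label and hint, and the reading order is set deliberately instead of being left to the layout.

    import SwiftUI

    struct CaptureScreen: View {
        var body: some View {
            VStack {
                Button(action: { /* capture a photo */ }) {
                    Image(systemName: "camera.fill")
                }
                .accessibilityLabel("Take photo")
                .accessibilityHint("Describes what is in front of you")
                .accessibilitySortPriority(2)   // VoiceOver reads this first

                Button(action: { /* open settings */ }) {
                    Image(systemName: "gear")
                }
                .accessibilityLabel("Settings")
                .accessibilitySortPriority(1)   // read after the main action
            }
        }
    }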

Multilingual

13 input/output languages, including English, Spanish, French, German, Portuguese, Italian, Arabic, Mandarin, and Japanese.

Hosted in Europe

Edge proxies in the EU (OVH) and the US (AWS us-east) keep latency low. All subprocessors are bound by data processing agreements (DPAs).

What we don't do

  • No face recognition database.
  • No emotion detection.
  • No tracking, no resale of personal data.
  • No model training on user data.

Try the technology free.