On-Device Voice Transcription vs Cloud Speech APIs: Privacy, Latency, and Cost for US Enterprise Mobile 2026
Your field technicians are dictating work orders. Your clinicians are documenting patient encounters. Every word is going to a cloud server — and you may not have approved that.
A field technician on your team dictates a work order. That audio — describing what is broken, what parts were used, where the job was — travels to a Google or AWS data center on every single dictation. If you are in healthcare, every word of that clinical note is leaving the device. Most enterprise teams approved a voice feature without reading the data flow.
Key findings
On-device Whisper achieves 95%+ accuracy on English speech at 1.5-3x real-time speed. For standard enterprise documentation use cases, accuracy is comparable to Google Speech-to-Text.
A field service team with 100 technicians each dictating 5 minutes of notes per day pays $160-$640 per month in cloud speech API costs at current per-segment pricing. On-device transcription costs zero per minute after the build investment.
Cloud voice transcription sends audio to a third-party server on every call. In healthcare, that audio is protected health information and requires a BAA. In any regulated environment, it is a compliance event per dictation.
Wednesday shipped on-device Whisper transcription in Off Grid with no server dependency. Audio never leaves the device. The integration is production-tested at scale.
The voice data problem
Voice transcription in enterprise mobile apps is not a new feature. What is new is the scale at which it is being deployed and the regulatory scrutiny now being applied to where audio data goes.
Every cloud speech API call makes a copy of audio on external infrastructure. Google Speech-to-Text, AWS Transcribe, and Azure Speech all process audio on their servers. The transcription is returned to the app. What happens to the audio after that depends on each vendor's retention policy — policies that most enterprise teams have not read and that change periodically.
For a retail app doing voice search, this is inconvenient but manageable. For a clinical app where clinicians dictate patient encounter notes, sending audio to a third-party server is a HIPAA event on every dictation. For a legal app where lawyers dictate case notes, it is attorney-client privilege crossing a wire to a commercial server. For a financial services app, it is customer financial conversations logged on external infrastructure.
The CISO who blocked your cloud AI deployment probably had this in mind.
On-device Whisper: what it delivers
Whisper is an open-source speech recognition model published by OpenAI in 2022 and now maintained by the community. The model weights are public. The inference runs locally. OpenAI itself does not receive audio when you run Whisper on-device.
On current flagship devices — iPhone 15 Pro, Pixel 8 Pro, Galaxy S24, or any device with a Snapdragon 8 Gen 2 or Apple A17 Pro chip — Whisper transcribes English speech at 95%+ word accuracy and 1.5-3x real-time speed. A 2-minute dictation transcribes in 40-80 seconds. A 30-second voice note transcribes in 10-20 seconds.
That speed is fast enough for the common enterprise voice use case: a user dictates, reviews the transcript, and approves or edits. It is not suitable for live real-time captioning where sub-second latency is required — that use case still requires cloud APIs.
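The throughput figures above translate directly into user-facing wait times. A minimal sketch of that arithmetic, using only the 1.5-3x real-time range cited in this article (no actual Whisper model is invoked here):

```python
# Illustrative arithmetic only: estimate on-device transcription wall-clock
# time from the 1.5-3x real-time throughput range cited in the article.

def transcription_time_range(audio_seconds: float,
                             min_speed: float = 1.5,
                             max_speed: float = 3.0) -> tuple[float, float]:
    """Return (fastest, slowest) wall-clock seconds to transcribe the audio."""
    return audio_seconds / max_speed, audio_seconds / min_speed

# A 2-minute dictation: 40-80 seconds, matching the figures above.
fast, slow = transcription_time_range(120)
print(f"2-minute dictation: {fast:.0f}-{slow:.0f} s")

# A 30-second voice note: 10-20 seconds.
fast, slow = transcription_time_range(30)
print(f"30-second note: {fast:.0f}-{slow:.0f} s")
```

The same function makes it easy to sanity-check whether a given dictation length fits the review-and-approve workflow described above.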
Wednesday integrated on-device Whisper into Off Grid. The implementation handles variable background noise, devices without NPU acceleration, and language-switching mid-session. That is not a hello-world Whisper integration. It is a production implementation that has run on 50,000+ devices across a wide device compatibility matrix.
The smallest Whisper model that runs comfortably on current devices uses approximately 150MB of device storage. Larger models with higher accuracy use 500MB-1.5GB. For enterprise apps targeting 2021+ devices, the small Whisper model provides a practical balance of accuracy and storage footprint.
Cloud speech APIs: the accuracy and feature case
Cloud speech APIs from Google, AWS, and Azure have real advantages that matter in specific scenarios.
Specialised vocabulary adaptation. A hospital that needs AI transcription of cardiothoracic surgery procedures can train a custom language model on domain-specific terminology. Cloud providers offer custom vocabulary features that out-of-the-box on-device Whisper cannot match for highly specialised technical language.
Real-time speaker diarisation. Identifying who spoke which words in a multi-speaker recording — "Dr. Smith said X, the patient said Y" — is more capable in cloud APIs than current on-device models. For clinical documentation that needs structured speaker attribution, cloud APIs have an advantage.
Live captioning with sub-second latency. Cloud streaming transcription APIs can return partial results in 200-500ms, enabling real-time captions for live meetings or customer calls. On-device Whisper processes in chunks, with latency that depends on chunk size and device performance.
For enterprise apps that need these specific capabilities, cloud APIs are the right tool. The compliance cost is then to negotiate appropriate data processing terms and, in healthcare, a BAA.
Not sure if your voice transcription use case needs cloud APIs or works on-device? A 30-minute call produces a written recommendation with accuracy and cost estimates.
The full cost model
Cloud speech APIs charge by audio duration. The range across major providers is $0.004-$0.016 per 15 seconds.
To understand what that means at enterprise scale, run the math on a realistic deployment.
A logistics company with 100 field technicians. Each technician dictates an average of 5 minutes of notes per day — work orders, inspection reports, customer handoff notes. That is 500 minutes of audio per day, or 10,000 minutes per month.
At $0.004 per 15 seconds: 10,000 minutes = 40,000 fifteen-second segments = $160 per month. At $0.016 per 15 seconds: the same volume costs $640 per month.
Scale that to a healthcare system with 1,000 clinicians each dictating 15 minutes of clinical notes per day. That is 15,000 minutes per day, 300,000 minutes per month. Cloud API cost: $4,800-$19,200 per month. Per year: $57,600-$230,400.
On-device Whisper costs zero per minute. The entire build investment to integrate on-device Whisper into an existing enterprise mobile app is $30,000-$60,000 in engineering. At the healthcare system scale above, that investment pays back in roughly 2-12 months, depending on which end of the API pricing range applies.
The indirect costs amplify the comparison. Each cloud API call in a healthcare app is a potential HIPAA audit finding. Legal review of cloud speech vendor data terms and negotiation of a BAA costs $8,000-$25,000. Annual re-review when vendor terms change adds ongoing overhead. These costs do not exist with on-device transcription.
Latency: the overlooked variable
Latency in voice transcription matters differently depending on the use case.
For asynchronous documentation — a technician dictates a note, reviews the transcript, taps approve — latency in the 10-30 second range is invisible to users. On-device Whisper at 1.5-3x real-time handles this fine. Cloud APIs handle it fine too.
For synchronous use cases — live captions, real-time meeting transcription, voice commands with immediate response — latency matters in hundreds of milliseconds. Cloud streaming APIs win here. On-device Whisper's chunk-based processing cannot match cloud streaming latency for live use cases.
There is also a network latency component for cloud APIs that varies by environment. In an area with poor connectivity — the field environment most enterprise mobile apps target — cloud speech API latency spikes unpredictably. A 2-minute audio file may take 30 seconds to upload, 2 seconds to process, and 5 seconds to return a transcript over a 3G connection. On-device processing is deterministic regardless of connectivity.
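A back-of-envelope model makes the connectivity risk concrete. The bitrate and link-speed figures below are illustrative assumptions, not benchmarks; the point is that upload time dominates the round trip on a weak uplink:

```python
# Back-of-envelope sketch of cloud speech round-trip time on a poor connection.
# Bitrate and uplink numbers are assumptions for illustration, not measurements.

def cloud_roundtrip_seconds(audio_seconds: float,
                            audio_kbps: float = 32.0,    # compressed speech audio
                            uplink_kbps: float = 128.0,  # congested 3G uplink
                            processing_s: float = 2.0,
                            response_s: float = 5.0) -> float:
    upload_s = (audio_seconds * audio_kbps) / uplink_kbps
    return upload_s + processing_s + response_s

# A 2-minute dictation over a weak 3G link: ~30s upload + 2s + 5s of overhead.
print(round(cloud_roundtrip_seconds(120)))
```

On-device processing has no equivalent term for upload time, which is why its latency stays deterministic as connectivity degrades.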
For field service, healthcare, and logistics — the primary enterprise mobile voice use cases — on-device latency is acceptable and cloud latency is unreliable in the environments where these workers operate.
Decision matrix
| Factor | On-device Whisper | Cloud speech APIs |
|---|---|---|
| English accuracy on clean speech | 95%+ — production quality | 95-99% — marginal improvement |
| Works offline | Yes — full functionality | No — requires connectivity |
| Privacy — audio on external server | Never | Always, per transcription |
| HIPAA BAA required | No | Yes |
| Cost per minute | $0 after build | $0.016-$0.064 per minute |
| Custom vocabulary / domain adaptation | Limited — fine-tuning possible | Strong — cloud vendor feature |
| Real-time streaming captions | Not suitable | Well-suited |
| Speaker diarisation | Basic | Strong |
| Works on mid-range devices (2021+) | Yes — with appropriate model size | Yes |
| Adds latency risk in poor connectivity | No | Yes |
What this means for regulated industries
Healthcare, financial services, and legal are the three enterprise sectors where voice transcription creates the most compliance exposure with cloud APIs.
In healthcare, audio of patient encounters is protected health information. Under HIPAA, it cannot be processed by a business associate without a signed BAA. Most major cloud speech APIs offer HIPAA BAAs, but the process takes time and the terms require legal review. More importantly, a BAA does not guarantee the audio is not used for model training — it only guarantees appropriate data handling under HIPAA's definition. Some cloud vendor terms have allowed model improvement use even under BAA. Audit the specific terms before signing.
In financial services, audio of client conversations about accounts and investments may be subject to FINRA and SEC retention requirements. Sending that audio to a cloud vendor creates questions about who controls the audit trail. On-device transcription keeps audio under direct enterprise control.
In legal, attorney-client privilege may not survive transmission of audio to a commercial cloud server. This is an unsettled area of law, but cautious legal departments treat it as a risk. On-device transcription eliminates the question.
How Wednesday builds voice transcription features
Every voice transcription feature Wednesday scopes starts with the same question: what is the compliance classification of the audio content?
If audio contains regulated data — patient information, financial account details, privileged communications — on-device transcription is the default recommendation. The cost and accuracy case is strong. The compliance case is conclusive.
If audio is not regulated — user preferences, search queries, non-sensitive field notes — and the use case requires specialised vocabulary adaptation or real-time captions, cloud APIs may be appropriate with the right vendor agreements.
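The scoping logic in the two paragraphs above can be sketched as a routing function. The rules and category names here are illustrative, not Wednesday's actual scoping checklist:

```python
# Illustrative routing function for the scoping question above: classify the
# audio, then recommend on-device or cloud transcription. Not an official
# checklist; regulated audio always routes on-device per the article's default.

def recommend_transcription(regulated_audio: bool,
                            needs_live_captions: bool = False,
                            needs_custom_vocabulary: bool = False) -> str:
    if regulated_audio:
        # PHI, financial account details, or privileged communications:
        # compliance outweighs cloud feature advantages.
        return "on-device"
    if needs_live_captions or needs_custom_vocabulary:
        # Non-regulated audio that genuinely needs streaming latency or
        # domain vocabulary adaptation.
        return "cloud (with reviewed vendor data terms)"
    return "on-device"

print(recommend_transcription(regulated_audio=True, needs_live_captions=True))
print(recommend_transcription(regulated_audio=False, needs_custom_vocabulary=True))
```

Note that a regulated-audio use case routes on-device even when it would benefit from cloud features, which mirrors the "compliance case is conclusive" framing above.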
For the majority of enterprise voice transcription use cases, on-device Whisper is the right choice. Wednesday's Off Grid implementation is the reference architecture. Device compatibility, model size selection, background noise handling, and the integration pattern for an existing React Native app are all solved problems, not open questions.
The build cost is bounded — $30,000-$60,000 to integrate on-device Whisper into an existing mobile app. The compliance savings are immediate. The monthly infrastructure savings compound from day one.
Ready to add voice transcription that your CISO and legal team can approve without a BAA negotiation? Book a 30-minute call and get a written scoping estimate.
About the author
Anurag Rathod
Technical Lead, Wednesday Solutions
Anurag built on-device Whisper transcription for Wednesday Solutions and integrated it into Off Grid, which ships voice features to 50,000+ users with no server dependency.