Across 50+ enterprise mobile vendor evaluations, the five dimensions that consistently predict delivery quality are: scale proof, team stability, quality ownership, shipping frequency, and compliance posture. Most evaluation frameworks assess two of the five. The rest of the process is gut feel dressed up as due diligence. Those decisions look defensible in the boardroom and fall apart six months into the engagement. This checklist gives you pass/fail criteria for each dimension so you walk into every agency meeting with the same standard and leave with a score you can actually defend.
Key findings
Most enterprise mobile vendor evaluations assess scale and price, then stop. Team stability, quality ownership, and compliance posture are left to gut feel. Those are the three dimensions most correlated with mid-engagement failure.
Agencies with annual attrition above 25% cannot guarantee team continuity across a 12-month engagement. Most will not volunteer this number. Ask for it directly.
Quality ownership is the single most revealing dimension to probe. Agencies that treat defects as shared responsibility produce longer fix cycles and more contested scope discussions than agencies where quality ownership is structurally assigned.
Compliance that is an add-on costs more and ships slower than compliance that is standard practice. The question to ask: which of your SOC 2 controls were in place before you had a client who required them?
Why most evaluation frameworks fail
Most vendor evaluation frameworks are too subjective to produce a defensible decision because they measure impressions rather than evidence. "Culture fit" and "communication style" are judgment calls that two evaluators from the same company will score differently after the same meeting. Those dimensions are real, but they cannot anchor a decision when a $30K/month contract is on the table and your board wants to know why you picked this agency over the other two.
The second failure mode is incomplete coverage. A standard evaluation covers portfolio quality and price. It skips team stability because attrition data is uncomfortable to ask for. It skips quality posture because the buyer is not sure what question to ask. It skips compliance because the agency's SOC 2 page looks reassuring enough. Those skipped dimensions are exactly where mid-engagement problems originate.
The checklist that follows covers five dimensions: scale proof, team structure, quality posture, speed, and compliance. Each dimension has four line items. Each line item has a pass criterion, a fail criterion, and an interpretation note for what the answer signals about the agency's delivery model. Run this across every vendor on your shortlist and the scores will diverge in ways that gut feel alone would not have caught.
Section 1: Scale proof
Scale proof answers one question: has this agency shipped to users at a volume and complexity similar to your engagement? A portfolio of consumer apps does not predict performance on an internal field operations platform. A portfolio of MVPs does not predict performance on a 200K-user app in regulated data environments. Ask for the specific evidence, not the general claim.
| Checklist item | Pass | Fail | What it signals |
|---|---|---|---|
| At least one active client with 50K+ monthly active users | Agency names a specific active client and offers a reference | Agency references completed projects only, or cannot name active clients at scale | Active clients at scale mean current delivery accountability, not past performance |
| Portfolio includes at least one enterprise internal app (field ops, sales tools, or similar) | Agency can describe the data integrations and offline requirements of a past internal app | Agency portfolio is consumer-facing only | Internal apps carry complexity drivers that consumer app experience does not cover: offline sync, MDM, and directory integration |
| At least one client engagement longer than 18 months | Agency can name an ongoing or multi-year engagement | Longest engagement in the portfolio is under 12 months | Long-running engagements indicate the agency earns renewal. Short engagements may indicate the agency does not. |
| Outcome data attached to portfolio entries (not just delivery dates) | Agency can state a measurable outcome for at least two portfolio entries: crash rate, user retention, or how often the app shipped to users | Portfolio entries list features shipped, not outcomes delivered | An agency tracking outcomes is an agency that understands what the client was actually trying to achieve |
Section 2: Team structure
Team structure predicts continuity. The risk in any agency engagement is not that the agency is bad at the start. The risk is that the team changes six months in and the new team does not have context. Continuity risk compounds in long engagements. Ask about structure and attrition before you ask about team size.
| Checklist item | Pass | Fail | What it signals |
|---|---|---|---|
| Named delivery lead assigned before contract signing | Agency names a specific person and offers to introduce them on the evaluation call | Agency references a "dedicated point of contact" without naming them | A named lead before signing means the agency has thought about your engagement specifically. An unnamed lead means you are buying a process, not a team. |
| Annual engineer attrition rate below 20% | Agency provides the number and can explain the retention practices behind it | Agency declines to share attrition data or gives a non-specific answer | At 20%+ attrition, losing at least one engineer from your team inside a 12-month engagement is the expected case, not the edge case (the arithmetic is sketched after this table). That turnover has a cost. |
| Team composition includes a dedicated QA function | Agency describes a QA role that is separate from the engineering build function | Agency describes engineers who "also handle testing" | Combining QA and engineering responsibilities in the same role produces lower defect detection rates before release |
| Escalation path does not require your delivery lead to be available | Agency describes a named backup and an escalation protocol | Escalation path is described as "contact your delivery lead" with no named alternative | A single point of failure in communication means your engagement stalls whenever that person is unavailable |
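To make the attrition math concrete: a minimal sketch, assuming each engineer leaves independently with the same annual probability. Real attrition is neither independent nor uniform, so read the output as directional, not exact.

```python
# Probability that at least one engineer on your assigned team leaves
# during a 12-month engagement, assuming independent, uniform annual
# attrition (an illustrative simplification).

def p_any_departure(annual_attrition: float, team_size: int) -> float:
    """P(at least one of `team_size` engineers leaves within a year)."""
    return 1 - (1 - annual_attrition) ** team_size

for rate in (0.15, 0.20, 0.25, 0.30):
    for size in (3, 5):
        p = p_any_departure(rate, size)
        print(f"attrition {rate:.0%}, team of {size}: "
              f"{p:.0%} chance of at least one departure in 12 months")
```

At 25% annual attrition, a five-person team has roughly a three-in-four chance of losing someone inside the year, which is why the pass threshold above sits where it does.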
Section 3: Quality posture
Quality posture is about ownership, not process. Every agency has a QA process. The question is who owns a defect when it reaches the user: the engineering team, the QA team, or the client. The answer to that question determines how fast defects get fixed and how contested the scope conversation is when they do.
| Checklist item | Pass | Fail | What it signals |
|---|---|---|---|
| Defects found after release are fixed at agency cost, not billed as new scope | Agency confirms this in writing and points to a contract clause | Agency describes a "bug triage process" without confirming cost ownership | If post-release defects are billable, the agency has an economic incentive not to find them before release |
| Agency can produce crash rate data for at least one active engagement | Agency shares a number and a time range | Agency describes qualitative quality practices without quantitative data | An agency without crash rate data for active clients does not monitor what ships to users |
| Automated testing is standard, not an add-on | Agency describes automated test coverage as part of their default workflow | Agency offers automated testing as a premium tier or an optional add-on | Testing infrastructure that is optional is testing infrastructure that clients often skip to reduce cost |
| Release review process includes a non-engineer sign-off before user release | Agency describes a documented review gate that involves someone outside the engineering team | Agency describes an internal engineering review only | A single-discipline review gate is more likely to miss user-impact issues that non-engineers would catch |
Tell us where your current vendor is failing. We will show you exactly how Wednesday's quality posture compares on the dimensions that produced the failure.
Get my evaluation guide →
Section 4: Speed
How often an agency ships to users is the most predictive metric for roadmap velocity, and the easiest to verify. The difference between shipping every week and shipping every month is not a matter of style. It is a 4x difference in how fast feedback reaches the team and how fast your users see fixes. Every agency will claim they ship frequently. Ask for dates, not descriptions.
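Once the dates arrive, verifying the claim takes a few lines. A minimal sketch in Python; the release dates below are illustrative placeholders, not data from any agency.

```python
# Compute the gaps between the last six user-facing release dates and
# the average interval. Pass criteria from the table below: an average
# of 14 days or less, with no gap of 30 days or more.
from datetime import date

# Illustrative placeholder dates; substitute the ones the agency provides.
releases = [date(2024, 3, 4), date(2024, 3, 18), date(2024, 4, 1),
            date(2024, 4, 15), date(2024, 4, 29), date(2024, 5, 13)]

gaps = [(later - earlier).days
        for earlier, later in zip(releases, releases[1:])]
print("gaps (days):", gaps)                        # [14, 14, 14, 14, 14]
print("average interval:", sum(gaps) / len(gaps))  # 14.0
print("longest gap:", max(gaps))                   # 14
```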
| Checklist item | Pass | Fail | What it signals |
|---|---|---|---|
| Agency can produce the last six release dates for an active client | Agency shares specific dates within 48 hours of being asked | Dates are vague, cover completed projects only, or do not arrive at all | An agency with a real shipping record can produce dates on demand. One without a consistent record cannot. |
| Gap between releases averages two weeks or less | Release dates show a consistent interval of 14 days or less | Release history shows gaps of 30 days or more | A monthly release interval means problems surface to users, sit for weeks, and require a new cycle to fix. That is a structural delay. |
| Agency distinguishes between internal build releases and user-facing releases | Agency describes both cycles separately and can explain what controls the gap between them | Agency conflates internal build frequency with user release frequency | Build frequency and user release frequency are different numbers. An agency that treats them as the same is either misinformed or intentionally conflating them. |
| First working build delivered within two weeks of engagement start | Agency can point to a recent engagement where working software shipped in the first two weeks | Agency describes a multi-week "discovery and planning" phase before any build begins | A two-week delay to first working software is normal. A six-week delay before first build is not a process. It is a billing structure. |
Section 5: Compliance
Compliance that is built into an agency's standard workflow is different from compliance that is an add-on. Add-on compliance costs more, ships slower, and produces gaps in coverage that only surface during an audit. For US enterprise engagements, especially internal apps handling employee data, field operations data, or financial data, the distinction matters.
| Checklist item | Pass | Fail | What it signals |
|---|---|---|---|
| SOC 2 Type II report available, not just "in progress" | Agency provides a completed SOC 2 Type II report on request | Agency is SOC 2 "compliant" but cannot produce a Type II report, or references a Type I report only | Type I reports attest to design. Type II reports attest to operating effectiveness over time. They are not equivalent. |
| Data handling practices are documented and transferable | Agency can provide a data handling policy that covers where data is stored, who has access, and what happens to data when the engagement ends | Agency describes data handling verbally without documentation | Verbal data handling commitments do not survive personnel changes at the agency or in your legal team |
| GDPR and CCPA requirements are handled by default, not on request | Agency includes data privacy handling in their standard contract | Agency offers compliance support as an optional add-on or scoped separately | Privacy compliance that is optional is compliance your team has to remember to request. That is a gap waiting to open. |
| Security review is part of the release process, not a separate engagement | Agency describes automated security scanning as part of their standard build process | Agency offers security reviews as a separate billable engagement | A security posture that requires a separate engagement is a security posture that clients often skip when timelines tighten |
How to score the checklist and what it predicts
Score each checklist item as 1 (pass), 0 (fail), or 0.5 (partial, meaning the agency meets part of the criterion or provides a credible explanation for the gap). Each section contains four items. The maximum score per section is 4. The maximum total score is 20.
16-20: The agency passes, or comes close, on all five dimensions. The remaining differentiators are price and timezone overlap. Both are negotiable. A score in this range means the agency has the infrastructure to run an enterprise engagement without structural risk.
11-15: The agency has meaningful gaps in at least two dimensions. Name the gaps explicitly and ask the agency to explain them before shortlisting. Some gaps are fixable: an agency without SOC 2 Type II today may have one in 90 days. Others are not. An agency with 30% attrition is unlikely to solve that problem inside your engagement timeline.
6-10: The agency fails on two or more critical dimensions. The gaps in this range are structural, not fixable through negotiation. Scale proof gaps mean the agency has not run an engagement like yours. Team structure gaps mean continuity risk is high. A score in this range is a signal to remove the vendor from the shortlist, not to negotiate harder.
0-5: The agency clears barely a single section's worth of items. This score typically indicates either a very early-stage agency or a vendor that is pitching outside their actual capability. Remove from the shortlist.
One note on scoring: a fail on Section 1 (scale proof) or Section 2 (team structure) is more predictive of failure than a fail on Section 5 (compliance). Weight the first two sections accordingly. An agency without scale proof at your level and with high attrition risk is unlikely to succeed on your engagement regardless of how well they score on the remaining three dimensions.
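For teams that want the scoring to be reproducible across evaluators, here is a minimal sketch of the arithmetic in Python. The structural-gap override threshold (below 2.5 of 4 on Section 1 or 2) is one illustrative way to apply the weighting note above, not a number from the checklist.

```python
# Score one vendor: 1 = pass, 0.5 = partial, 0 = fail per item,
# four items per section, 20 points maximum.

SECTIONS = ["Scale proof", "Team structure", "Quality posture",
            "Speed", "Compliance"]

def score_vendor(items: dict[str, list[float]]) -> None:
    section_totals = {s: sum(items[s]) for s in SECTIONS}
    total = sum(section_totals.values())
    if total >= 16:
        band = "shortlist; remaining differentiators are price and timezone"
    elif total >= 11:
        band = "name the gaps and ask the agency to explain them"
    else:
        band = "remove from the shortlist"
    # Weighting applied as an override: a structural gap in scale proof
    # or team structure outweighs an otherwise passable total. The 2.5
    # cutoff is an illustrative choice.
    if min(section_totals["Scale proof"],
           section_totals["Team structure"]) < 2.5:
        band = "remove from the shortlist (structural gap in scale or team)"
    for s in SECTIONS:
        print(f"{s}: {section_totals[s]}/4")
    print(f"total: {total}/20 -> {band}")

score_vendor({
    "Scale proof":     [1, 1, 0.5, 1],
    "Team structure":  [1, 0.5, 1, 1],
    "Quality posture": [1, 1, 1, 0.5],
    "Speed":           [1, 1, 0.5, 1],
    "Compliance":      [0.5, 1, 1, 1],
})
```

Run the same function over every vendor on the shortlist and the comparison is a printout, not a debate.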
The checklist is not a replacement for reference calls or contract review. It is a filter that runs in parallel with both. Use it to narrow a shortlist of five vendors to two, then run the reference calls and contract review on the final two. The agencies that score above 16 are the ones worth the time that the reference call and contract review require.
Bring your shortlist. We will walk you through how Wednesday scores on every dimension in this checklist, with the data to back it.
Book my 30-min call →
Not ready for a call yet? Browse vendor scorecards, switching frameworks, and contract guides for enterprise mobile development.
Read more decision guides →
About the author
Bhavesh Pawar
Technical Lead, Wednesday Solutions
LinkedIn →
Bhavesh is a Technical Lead at Wednesday Solutions with hands-on depth across React Native, iOS, Android, and Flutter. He has shipped mobile products and enterprise AI solutions across edtech, entertainment, and medtech, and reviews architecture across Wednesday engagements.
30 minutes with an engineer. You leave with a squad shape, a monthly cost, and a start date.
Get your start date →