Evaluating frontier models on core B2B loan underwriting tasks.
Real-world financial data by experts from Citi, JP Morgan, Barclays, US Bank, and more.
Flex’s ongoing underwriting history includes ~86,000 loan applications representing ~$2B in requested credit limits.
T1/T2 overall · T3 key risks · T4 narrative
Claude Opus 4.7 is the top performer on Senior Underwriting Agent, scoring 55.4% overall accuracy.
Frontier models systematically under-approve in US B2B loan underwriting.
The hardest task is not identifying bad credit: it is approving creditworthy edge cases that require judgment beyond policy compliance.
Accuracy by Case Type (average across all models)
About Flex Labs
Flex originates a real, continuously growing stream of US B2B credit decisions at scale (>$4B in transaction volume). Flex Labs turns that proprietary decision history into the industry's most grounded benchmark for AI underwriting.
Sample task (1 out of 150)
You are a senior credit underwriter at Flexbase. Base your decision on the information provided, applying sound underwriting judgment and general business knowledge where relevant. Treat null or missing values as data gaps that increase uncertainty, not as neutral or favorable signals.
The following application packet includes: the full Flex credit policy, a precomputed FICO policy check, and structured application signals across 6 data sections. Evaluate independently with no shared context from prior cases.
Based on the application below, provide:
T1: Your decision — "approved" or "declined"
T2: If approved, recommended credit limit (USD)
T3: Top 3 risk factors that most influenced your decision
T4: 2–3 sentence underwriting narrative
--- APPLICATION PACKET: FLEX_016 ---
S1 — Business Info:
Industry: Online Retail / E-Commerce (Consumer Durables & Apparel)
State: FL | Business age: 58 months | Stage: Small business
Owner count: 1 (100% ownership) | KYB: Approved
S2 — Financials:
TTM Revenue: $4,040,347
TTM Net Income: $3,111,847
TTM Gross Profit: $1,857,779
Yearly revenue trend: $517K → $1.94M → $4.06M (strong growth)
Yearly gross margin trend: 40.9% → 43.9% → 46.6% (expanding)
Total assets: $644,798 | Total liabilities: $1,189,297
Total equity: -$544,499 (negative)
Cash at decision: $182,234
Current ratio: 0.55 | Cash ratio: 0.20
Working capital: -$407,481
S3 — Banking:
Plaid avg 60d balance: $145,011
Plaid current balance: $76,325
Rutter avg 60d balance: $87,000
Monthly revenue (last 6): $62K, $268K, $227K, $314K, $1,165K, $318K
Monthly net income (last 6): $69K, $231K, $170K, $187K, $826K, $256K
S4 — Credit:
FICO: 693 [POLICY CHECK: 650 minimum for $3M-$5M ARR tier — MEETS MINIMUM]
Total hard pulls: 0 | Hard pulls last 12mo: 0
Total tradelines: 16 | Open: 12 | Revolving: 11 | Installment: 1
Issues count: 0
Bureau reasons: High revolving utilization; number of accounts with delinquency;
length of time revolving established; proportion of loan balances too high
Frozen: No
S5 — Fraud / KYC:
Sardine customer score: 1 | Customer level: Low
Phone: Low | Bot: Low | Device: Low | Address: Low (valid) | IP: US
Sardine rules fired: 9 | KYC persons verified: 1
KYC documents uploaded: 0
S6 — Request:
Requested limit: $80,000 | Tier: Tier 1 | Signed PG: No
This sample task is provided for illustration only. Domain scores represent the average across 150 held-out anonymized cases. Sample tasks are passed to models with the full Flex credit policy document.
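The liquidity figures in section S2 of the sample packet can be cross-checked against one another. The sketch below is illustrative only and is not part of the benchmark or of Flex's policy: it assumes the standard definitions (equity = assets minus liabilities, working capital = current assets minus current liabilities, current ratio = current assets / current liabilities, cash ratio = cash / current liabilities). The packet does not list current assets or current liabilities, so the code backs them out from the reported working capital and the rounded current ratio. The variable names are hypothetical.

```python
# Figures copied from section S2 of sample packet FLEX_016.
total_assets = 644_798
total_liabilities = 1_189_297
cash_at_decision = 182_234
working_capital = -407_481
current_ratio = 0.55  # reported to two decimals, so derived values are approximate

# Equity is total assets minus total liabilities.
total_equity = total_assets - total_liabilities

# Assumed standard definitions:
#   working_capital = current_assets - current_liabilities
#   current_ratio   = current_assets / current_liabilities
# Solving the two equations gives:
#   current_liabilities = working_capital / (current_ratio - 1)
implied_current_liabilities = working_capital / (current_ratio - 1)
implied_current_assets = implied_current_liabilities + working_capital

# Cash ratio under the assumed definition: cash / current liabilities.
implied_cash_ratio = cash_at_decision / implied_current_liabilities

print("Total equity:", total_equity)  # -544499, matching the packet
print("Implied current liabilities:", round(implied_current_liabilities))
print("Implied current assets:", round(implied_current_assets))
print("Implied cash ratio:", round(implied_cash_ratio, 2))
```

Under these assumptions the implied cash ratio rounds to 0.20, which agrees with the value in the packet, so the reported equity, working capital, current ratio, and cash ratio are mutually consistent.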