Evaluation run: `2026-04-20-gpt-oss-120b-json_schema`

Source: evals/reports/2026-04-20-gpt-oss-120b-json_schema/REPORT.md. Component: nsf-award-notice-extraction-udm.

Date: 2026-04-20 15:52 UTC
Model: openai/gpt-oss-120b
OCR: mindrouter (Mindrouter /v1/ocrmd, dots.OCR backend)
Prompt: prompt.md — sha256 ceec486b1fe2
Temperature: 0.1
Replicates per doc: 5
Documents: 28

1. Run-level headline

Success rate: 138/140 (98.6%) — 0 API errors, 2 JSON-parse errors.
OCR latency: p50 0.0s, p95 0.0s (min 0.0s, max 0.0s).
Chat latency: p50 63.5s, p95 89.7s (min 19.9s, max 98.2s).
Prompt tokens: p50 5661, p95 6306 (min 4697, max 7226).
Completion tokens: p50 7472, p95 10367, max 12256 (cap: 16384). 0 replicates over 80% of cap, 0 over 95%.
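The cap-headroom counts on the completion-token line can be recomputed from per-replicate token counts. The following is a minimal sketch, not the report's own script: the function name `cap_headroom` is hypothetical, and the sample list simply reuses the three summary values reported above (p50, p95, max) as stand-ins for real per-replicate counts. Only the 16384 cap and the 80% / 95% thresholds come from the report.

```python
def cap_headroom(completion_tokens, cap=16384, thresholds=(0.80, 0.95)):
    """Count replicates whose completion length exceeds each fraction of the cap."""
    return {t: sum(1 for n in completion_tokens if n > t * cap) for t in thresholds}


# Stand-in data: the reported p50, p95 and max, not the raw per-replicate counts.
sample = [7472, 10367, 12256]
print(cap_headroom(sample))  # {0.8: 0, 0.95: 0}
```

With a 16384 cap, the 80% threshold is about 13107 tokens, so even the reported maximum of 12256 stays below it.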

[Figure: completion tokens]

2. Structural validity (JSON Schema)

Each of the 138 successfully parsed replicates was validated against ../../../schema.json with jsonschema (Draft 2020-12).

Strict pass rate: 138/138 (100.0%)
Pass rate ignoring top-level extra keys: 138/138 (100.0%) — this isolates structural/type errors from naming drift.

Declared top-level schema keys absent from outputs (none of these can be required, since every replicate passes strict validation):

| schema key missing | # replicates missing it (of 138) |
|---|---|
| `amendment_date` | 138 |
| `sponsor_award_number` | 138 |
| `award_status` | 138 |
| `proposal_number` | 138 |
| `end_date` | 118 |
| `start_date` | 118 |
| `cost_share_approved_amount` | 61 |
| `award_date` | 42 |
| `expenditure_limitation` | 14 |
| `indirect_cost_rate_percent` | 13 |
| `indirect_cost_base` | 12 |
| `total_obligated_to_date` | 7 |
| `award_received_date` | 4 |
| `amount_obligated_this_amendment` | 3 |
| `total_intended_amount` | 2 |
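The missing-key counts above can be tallied with a simple pass over the parsed outputs. This is a sketch under assumptions: `count_missing_keys` is a hypothetical helper, each replicate is assumed to be a parsed JSON object (a Python `dict`), and the list of declared keys is assumed to come from the schema's top-level `properties`.

```python
from collections import Counter


def count_missing_keys(replicates, declared_keys):
    """Return {key: number of replicates whose top level lacks that key}."""
    missing = Counter()
    for output in replicates:
        for key in declared_keys:
            if key not in output:  # a key present with a null value is NOT missing
                missing[key] += 1
    return dict(missing.most_common())


# Hypothetical replicates, for illustration only.
replicates = [{"award_date": None}, {"award_date": "2026-01-01", "start_date": "2026-02-01"}, {}]
print(count_missing_keys(replicates, ["award_date", "start_date"]))
```

Note that a key emitted with a `null` value counts as present here, which is why a field can be rarely missing in this table yet still show low non-null coverage in section 4.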

No document had any invalid replicate.

3. Within-doc consistency (5 replicates per doc)

[Figure: consistency heatmap]

[Figure: per-field agreement]

3a. Per-field agreement rollup (top 15 worst)

| field | docs probed | % docs in full agreement | docs with 2 distinct values | docs with ≥3 distinct values |
|---|---|---|---|---|
| `current_budget_period.period_number` | 12 | 0% | 12 | 0 |
| `cost_share_approved_amount` | 28 | 25% | 21 | 0 |
| `award_date` | 28 | 36% | 18 | 0 |
| `expenditure_limitation` | 28 | 43% | 16 | 0 |
| `start_date` | 28 | 54% | 13 | 0 |
| `end_date` | 28 | 54% | 13 | 0 |
| `indirect_cost_base` | 28 | 68% | 8 | 1 |
| `cfda_name` | 28 | 79% | 5 | 1 |
| `total_obligated_to_date` | 28 | 79% | 6 | 0 |
| `amount_obligated_this_amendment` | 28 | 89% | 3 | 0 |
| `indirect_cost_rate_percent` | 28 | 89% | 3 | 0 |
| `cfda_number` | 28 | 93% | 0 | 2 |
| `funding_opportunity_number` | 28 | 93% | 2 | 0 |
| `total_intended_amount` | 28 | 93% | 2 | 0 |
| `recipient_organization.uei` | 28 | 93% | 2 | 0 |
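The rollup above counts, per field, how many documents produced one, two, or three-plus distinct values across their replicates. A minimal sketch of that tally follows. The helper names and input shape (`{doc_id: [replicate dicts]}`) are hypothetical, and treating an absent key the same as `null` is an assumption; the report does not say how its script handles that case or how it decides which documents count as "probed" for a field.

```python
import json


def distinct_values(replicates, path):
    """Number of distinct values a dotted-path field takes across replicates."""
    seen = set()
    for output in replicates:
        value = output
        for part in path.split("."):
            # Assumption: an absent key is treated the same as null.
            value = value.get(part) if isinstance(value, dict) else None
        seen.add(json.dumps(value, sort_keys=True))  # serialize so any JSON value is hashable
    return len(seen)


def agreement_rollup(docs, path):
    """Return (% docs in full agreement, docs with 2 distinct, docs with >=3 distinct)."""
    counts = [distinct_values(reps, path) for reps in docs.values()]
    full = sum(c == 1 for c in counts)
    return round(100 * full / len(counts)), sum(c == 2 for c in counts), sum(c >= 3 for c in counts)


# Hypothetical documents, for illustration only.
docs = {
    "doc_a": [{"x": {"y": 1}}, {"x": {"y": 1}}],
    "doc_b": [{"x": {"y": 1}}, {"x": {"y": 2}}],
    "doc_c": [{"x": {"y": 1}}, {"x": {"y": 2}}, {}],
}
print(agreement_rollup(docs, "x.y"))  # (33, 1, 1)
```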

3b. Worst docs (most fields disagreeing)

| document | disagreeing / probed | % |
|---|---|---|
| award_01 | 8 / 39 | 20% |
| award_19 | 8 / 37 | 22% |
| award_10 | 7 / 37 | 19% |
| award_15 | 7 / 37 | 19% |
| award_02 | 6 / 39 | 15% |
| award_05 | 6 / 37 | 16% |
| award_11 | 6 / 37 | 16% |
| award_12 | 6 / 37 | 16% |
| award_13 | 6 / 39 | 15% |
| award_18 | 6 / 39 | 15% |

3c. Array-length stability across replicates

| array | mean CV | max CV | % docs stable (CV=0) | worst docs (array lengths per replicate) |
|---|---|---|---|---|
| `budget_categories` | 0.166 | 0.287 | 11% | award_05 ([47, 23, 29, 47, 29]); award_07 ([29, 26, 30, 47, 48]); award_23 ([30, 48, 30, 47, 27]) |
| `special_conditions` | 0.153 | 1.225 | 46% | award_27 ([0, 1, 1, 0, 0]); award_26 ([4, 7, 3]); award_06 ([1, 1, 1, 2, 1]) |
| `terms_and_conditions` | 0.107 | 0.267 | 25% | award_26 ([5, 6, 3]); award_05 ([3, 5, 3, 3, 3]); award_12 ([2, 3, 3, 3, 4]) |
| `linked_awards` | 0.076 | 0.816 | 89% | award_12 ([1, 1, 0, 0, 1]); award_18 ([1, 1, 0, 0, 1]); award_27 ([2, 2, 2, 2, 0]) |
| `project_personnel` | 0.000 | 0.000 | 100% | none |
| `sponsor_contacts` | 0.000 | 0.000 | 100% | none |
| `subawards` | 0.000 | 0.000 | 100% | none |
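CV here is the coefficient of variation of an array's length across a document's replicates. The sketch below assumes it is computed as population standard deviation divided by mean. The report does not state the formula, but this choice reproduces the max CV values in the table from the listed replicate lengths. The function name `length_cv` and the zero-mean convention are assumptions.

```python
from statistics import mean, pstdev


def length_cv(lengths):
    """Coefficient of variation: population std dev / mean (0.0 when the mean is 0)."""
    m = mean(lengths)
    return pstdev(lengths) / m if m else 0.0


print(round(length_cv([47, 23, 29, 47, 29]), 3))  # 0.287  (award_05, budget_categories)
print(round(length_cv([0, 1, 1, 0, 0]), 3))       # 1.225  (award_27, special_conditions)
print(round(length_cv([1, 1, 0, 0, 1]), 3))       # 0.816  (award_12, linked_awards)
```

Because the mean sits in the denominator, short arrays that flip between 0 and 1 elements produce very large CVs, which is why `special_conditions` has the highest max CV despite small absolute differences.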

4. Field coverage

[Figure: coverage bar chart]

4a. Scalar fields, least-populated first

| field | % non-null | n / total |
|---|---|---|
| `sponsor_award_number` | 0% | 0 / 138 |
| `award_status` | 0% | 0 / 138 |
| `proposal_number` | 0% | 0 / 138 |
| `amendment_date` | 0% | 0 / 138 |
| `award_received_date` | 14% | 20 / 138 |
| `start_date` | 14% | 20 / 138 |
| `end_date` | 14% | 20 / 138 |
| `expenditure_limitation` | 42% | 58 / 138 |
| `cost_share_approved_amount` | 56% | 77 / 138 |
| `indirect_cost_rate_percent` | 58% | 80 / 138 |
| `award_date` | 69% | 95 / 138 |
| `indirect_cost_base` | 84% | 116 / 138 |
| `total_obligated_to_date` | 95% | 131 / 138 |
| `fees` | 96% | 132 / 138 |
| `is_research_and_development` | 96% | 133 / 138 |
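Scalar coverage is the share of replicates in which a field is both present and non-null. A minimal sketch, assuming each replicate is a parsed JSON object and using a hypothetical helper name `non_null_coverage`. The synthetic data below is constructed only to match the `start_date` row (20 of 138).

```python
def non_null_coverage(replicates, field):
    """Return (percent, n, total) of replicates where `field` is present and not None."""
    n = sum(1 for output in replicates if output.get(field) is not None)
    total = len(replicates)
    return round(100 * n / total), n, total


# Synthetic replicates: 20 with a start_date, 118 without, mirroring the counts above.
replicates = [{"start_date": "2026-01-01"}] * 20 + [{}] * 118
print(non_null_coverage(replicates, "start_date"))  # (14, 20, 138)
```

Unlike the missing-key tally in section 2, this treats an explicit `null` the same as an absent key, so the two tables need not sum to 138 for a given field.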

4b. Array fields, least-populated first

| array | % non-empty | n / total |
|---|---|---|
| `subawards` | 2% | 3 / 138 |
| `linked_awards` | 25% | 35 / 138 |
| `special_conditions` | 94% | 130 / 138 |
| `budget_categories` | 96% | 133 / 138 |
| `project_personnel` | 100% | 138 / 138 |
| `sponsor_contacts` | 100% | 138 / 138 |
| `terms_and_conditions` | 100% | 138 / 138 |

Reproduction

python scripts/extract_only.py \
  --pdf-dir <local-pdf-dir> \
  --prompt components/nsf-award-notice-extraction-udm/prompt.md \
  --model openai/gpt-oss-120b \
  --ocr mindrouter \
  --replicates 5 \
  --max-tokens 16384 \
  --run-name 2026-04-20-gpt-oss-120b-json_schema
