Evaluation run: `2026-04-18-gpt-oss-120b-dots-r5`

Source: evals/reports/2026-04-18-gpt-oss-120b-dots-r5/REPORT.md. Component: nsf-award-notice-extraction-udm.

Date: 2026-04-18 22:31 UTC
Model: openai/gpt-oss-120b
OCR: mindrouter (Mindrouter /v1/ocrmd, dots.OCR backend)
Prompt: prompt.md — sha256 f828e83e0e8c
Temperature: 0.1
Replicates per doc: 5
Documents: 28

1. Run-level headline

Success rate: 140/140 (100.0%) — 0 API errors, 0 JSON-parse errors.
OCR latency: p50 19.1s, p95 22.6s (min 10.0s, max 29.3s).
Chat latency: p50 66.9s, p95 89.1s (min 18.0s, max 126.6s).
Prompt tokens: p50 5446, p95 6074 (min 4465, max 6994).
Completion tokens: p50 7821, p95 9464, max 14073 (cap: 16384). 1 replicate over 80% of cap, 0 over 95%.
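
The percentile and cap figures above can be recomputed from per-replicate token counts. The sketch below is illustrative only: the report does not state which percentile method it uses, so nearest-rank is an assumption, and the helper names `percentile` and `cap_pressure` are hypothetical. The sample list is made up apart from the reported maximum of 14073.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (assumed method, not confirmed by the report)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Smallest rank such that at least pct% of the data is at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cap_pressure(completion_tokens, cap=16384):
    """Count replicates whose completion length exceeds 80% and 95% of the cap."""
    over_80 = sum(1 for t in completion_tokens if t > 0.80 * cap)
    over_95 = sum(1 for t in completion_tokens if t > 0.95 * cap)
    return over_80, over_95


# Illustrative counts; only 14073 (the reported max) comes from the run.
tokens = [7000, 7800, 8200, 9400, 14073]
p50 = percentile(tokens, 50)
p95 = percentile(tokens, 95)
over_80, over_95 = cap_pressure(tokens)
```

With the reported maximum of 14073 against a cap of 16384, the longest completion sits at roughly 86% of the cap, which is why it counts toward the 80% threshold but not the 95% one.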

[Figure: completion tokens]

2. Structural validity (JSON Schema)

Validated every replicate against ../../../schema.json with jsonschema (Draft 2020-12).

Strict pass rate: 0/140 (0.0%)
Pass rate ignoring top-level extra keys: 78/140 (55.7%) — this isolates structural/type errors from naming drift.
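
The two pass rates differ only in how root-level `additionalProperties` errors are treated. A minimal sketch of that classification follows, assuming each replicate's validation errors have already been reduced to `pointer :: rule` strings like those in the error table of this section. The function names are hypothetical, and in practice the error lists would come from the jsonschema validator rather than being written by hand.

```python
# Error key for an undeclared top-level property, in the report's format.
ROOT_EXTRA_KEYS = "<root> :: additionalProperties"


def passes_strict(error_keys):
    """A replicate passes strictly only if it has no validation errors."""
    return len(error_keys) == 0


def passes_ignoring_root_extras(error_keys):
    """A replicate passes the relaxed check if every error is a root-level extra key."""
    return all(key == ROOT_EXTRA_KEYS for key in error_keys)


def pass_rates(replicate_errors):
    """Return (strict passes, relaxed passes, total) over a list of error lists."""
    total = len(replicate_errors)
    strict = sum(passes_strict(errs) for errs in replicate_errors)
    relaxed = sum(passes_ignoring_root_extras(errs) for errs in replicate_errors)
    return strict, relaxed, total


# Three hypothetical replicates: clean, naming drift only, drift plus a type error.
sample = [
    [],
    ["<root> :: additionalProperties"],
    ["<root> :: additionalProperties", "award_title :: type"],
]
strict, relaxed, total = pass_rates(sample)
```

In this sample, one replicate passes strictly and two pass once root-level extra keys are ignored.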

Top-level naming drift — keys the model emits that the schema does not declare:

extra key emitted	occurrences	likely schema counterpart
`total_intended_award_amount`	135	`total_intended_amount`
`total_amount_obligated_to_date`	126	`total_obligated_to_date`
`total_approved_cost_share_or_matching_amount`	113	—
`period_of_performance_end`	50	—
`period_of_performance_start`	50	`start_date`
`award_period_end_date`	35	`end_date`
`award_period_start_date`	35	`start_date`
`award_period_end`	29	—
`award_period_start`	29	`start_date`
`period_of_performance_end_date`	12	`end_date`
`period_of_performance_start_date`	12	`start_date`
`award_period_of_performance_end_date`	6	`end_date`
`award_period_of_performance_start_date`	6	`start_date`
`award_period_of_performance_end`	5	—
`award_period_of_performance_start`	5	`start_date`
`total_award_amount`	5	—
`total_approved_cost_share`	4	—
`total_approved_cost_share_amount`	4	—
`award_end_date`	2	`end_date`
`award_start_date`	2	`start_date`
`total_cost_share_approved_amount`	2	`cost_share_approved_amount`
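
One way to act on this table is to rename drifted keys to their schema counterparts before validation. The sketch below is an assumption about how such post-processing could look, not something the evaluation pipeline is known to do. `DRIFT_RENAMES` contains only the rows above that list a counterpart; rows marked with a dash are left untouched.

```python
# Drifted key -> schema key, copied from the rows above that name a counterpart.
DRIFT_RENAMES = {
    "total_intended_award_amount": "total_intended_amount",
    "total_amount_obligated_to_date": "total_obligated_to_date",
    "period_of_performance_start": "start_date",
    "award_period_end_date": "end_date",
    "award_period_start_date": "start_date",
    "award_period_start": "start_date",
    "period_of_performance_end_date": "end_date",
    "period_of_performance_start_date": "start_date",
    "award_period_of_performance_end_date": "end_date",
    "award_period_of_performance_start_date": "start_date",
    "award_period_of_performance_start": "start_date",
    "award_end_date": "end_date",
    "award_start_date": "start_date",
    "total_cost_share_approved_amount": "cost_share_approved_amount",
}


def normalize_top_level_keys(record, renames=DRIFT_RENAMES):
    """Return a copy of record with drifted top-level keys renamed.

    Keys that are not in the rename map are copied first, so a value already
    stored under a schema key is never overwritten by a drifted duplicate.
    """
    normalized = {k: v for k, v in record.items() if k not in renames}
    for key, value in record.items():
        if key in renames:
            normalized.setdefault(renames[key], value)
    return normalized


# Hypothetical model output with one drifted key.
fixed = normalize_top_level_keys({"award_title": "X", "award_start_date": "2026-01-01"})
```

A rename pass like this would hide the drift rather than fix it, so it is probably best used to measure how much of the strict failure rate is naming alone.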

Required/declared top-level keys absent from outputs:

schema key missing	# replicates missing it (of 140)
`proposal_number`	140
`end_date`	140
`total_intended_amount`	140
`award_status`	140
`start_date`	140
`sponsor_award_number`	140
`total_obligated_to_date`	133
`amendment_date`	115
`award_date`	32
`expenditure_limitation`	17
`cost_share_approved_amount`	5
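
Both the extra-key and missing-key tallies reduce to set differences between each replicate's top-level keys and the keys the schema declares. A minimal sketch, assuming replicates are parsed JSON objects and `declared_keys` is read from the schema's top-level `properties` (the function name `key_drift` is hypothetical):

```python
from collections import Counter


def key_drift(replicates, declared_keys):
    """Count undeclared keys emitted and declared keys omitted, per key."""
    declared = set(declared_keys)
    extra, missing = Counter(), Counter()
    for record in replicates:
        present = set(record)
        extra.update(present - declared)    # emitted but not in the schema
        missing.update(declared - present)  # in the schema but not emitted
    return extra, missing


# Two hypothetical replicates checked against a two-key schema.
replicates = [
    {"award_id": "A1", "award_start_date": "2026-01-01"},
    {"award_id": "A1", "start_date": "2026-01-01"},
]
extra, missing = key_drift(replicates, ["award_id", "start_date"])
```

Note that this counts a key as missing only when it is absent from the object; a key present with a `null` value is a coverage question rather than a naming one.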

Top validation errors (error-key × occurrences):

pointer :: rule	count
`<root> :: additionalProperties`	140
`linked_awards/0 :: required`	34
`linked_awards/1 :: required`	18
`special_conditions/2/category :: enum`	8
`budget_categories/25/label :: minLength`	8
`budget_categories/26/label :: minLength`	6
`special_conditions/1/category :: enum`	5
`linked_awards/0 :: additionalProperties`	5
`linked_awards/1 :: additionalProperties`	5
`award_title :: type`	5
`current_budget_period/end_date :: type`	5
`current_budget_period/start_date :: type`	5
`expenditure_limitation :: type`	5
`special_conditions/4/category :: enum`	4
`special_conditions/5/category :: enum`	2
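
The error keys in this table combine the location of the failing value with the schema keyword that rejected it. The sketch below shows one way such keys could be built and tallied, assuming each error is available as a (path, rule) pair where the path is a sequence of object keys and array indexes. The helper names are hypothetical.

```python
from collections import Counter


def error_key(path, rule):
    """Format an error as 'pointer :: rule', using '<root>' for an empty path."""
    pointer = "/".join(str(part) for part in path) or "<root>"
    return f"{pointer} :: {rule}"


def tally_errors(errors):
    """Count occurrences of each error key across all replicates."""
    return Counter(error_key(path, rule) for path, rule in errors)


# Hypothetical errors pooled from several replicates.
errors = [
    ([], "additionalProperties"),
    ([], "additionalProperties"),
    (["linked_awards", 0], "required"),
    (["special_conditions", 2, "category"], "enum"),
]
counts = tally_errors(errors)
top = counts.most_common(1)
```

Because array indexes are part of the key, the same underlying problem (for example a bad `category` enum value) is split across several rows, one per position.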

Documents with any schema-invalid replicate:

document	# invalid / 5
award_01	5
award_02	5
award_03	5
award_04	5
award_05	5
award_06	5
award_07	5
award_08	5
award_09	5
award_10	5
award_11	5
award_12	5
award_13	5
award_14	5
award_15	5
award_16	5
award_17	5
award_18	5
award_19	5
award_20	5
award_21	5
award_22	5
award_23	5
award_24	5
award_25	5
award_26	5
award_27	5
award_28	5

3. Within-doc consistency (5 replicates per doc)

[Figure: consistency heatmap]

[Figure: per-field agreement]

3a. Per-field agreement rollup (top 15 worst)

field	docs probed	% full agreement	2 distinct	≥3 distinct
`expenditure_limitation`	28	36%	18	0
`amendment_date`	28	46%	15	0
`award_date`	28	50%	14	0
`cfda_name`	28	82%	1	4
`indirect_cost_base`	28	82%	4	1
`cfda_number`	28	82%	5	0
`cost_share_approved_amount`	28	82%	5	0
`funding_opportunity_number`	28	89%	3	0
`total_obligated_to_date`	28	89%	3	0
`is_collaborative_research`	28	93%	2	0
`indirect_cost_rate_percent`	28	96%	1	0
`award_id`	28	100%	0	0
`award_number`	28	100%	0	0
`sponsor_award_number`	28	100%	0	0
`award_title`	28	100%	0	0
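
The rollup columns can be derived by counting distinct values per field across each document's replicates. A minimal sketch, assuming replicates are parsed JSON objects grouped by document, and assuming an absent key is treated the same as `null` (the report does not say how it handles that case). The function names are hypothetical.

```python
import json


def distinct_values(values):
    """Number of distinct values, comparing by canonical JSON so nested data works."""
    return len({json.dumps(v, sort_keys=True) for v in values})


def field_agreement(docs, field):
    """Bucket documents by how many distinct values the field takes across replicates."""
    full = two = three_plus = 0
    for replicates in docs.values():
        n = distinct_values([r.get(field) for r in replicates])
        if n == 1:
            full += 1
        elif n == 2:
            two += 1
        else:
            three_plus += 1
    return {"docs": len(docs), "full": full, "two": two, "three_plus": three_plus}


# Two hypothetical documents with five replicates each.
docs = {
    "award_a": [{"award_date": "2026-01-01"}] * 5,
    "award_b": [
        {"award_date": "2026-01-01"},
        {"award_date": None},
        {},
        {"award_date": "2026-01-01"},
        {"award_date": "2026-01-01"},
    ],
}
result = field_agreement(docs, "award_date")
```

Under this definition a field that is `null` in every replicate counts as full agreement, which would explain how `sponsor_award_number` can show 100% agreement while never being populated.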

3b. Worst docs (most fields disagreeing)

document	disagreeing / probed	%
award_19	6 / 40	15%
award_02	5 / 40	12%
award_06	4 / 40	10%
award_10	4 / 40	10%
award_16	4 / 40	10%
award_18	4 / 40	10%
award_23	4 / 40	10%
award_05	3 / 40	8%
award_09	3 / 40	8%
award_11	3 / 40	8%

3c. Array-length stability across replicates

array	mean CV	max CV	% docs stable (CV=0)	worst docs
`special_conditions`	0.181	1.225	25%	award_27 ([1, 0, 0, 0, 1]); award_19 ([0, 3, 3, 3, 3]); award_06 ([2, 1, 2, 3, 3])
`terms_and_conditions`	0.140	0.298	11%	award_21 ([2, 4, 2, 3, 4]); award_10 ([2, 2, 3, 4, 3]); award_20 ([3, 3, 3, 4, 2])
`budget_categories`	0.137	0.243	7%	award_07 ([48, 30, 30, 29, 48]); award_05 ([27, 48, 48, 31, 48]); award_02 ([30, 48, 48, 48, 30])
`linked_awards`	0.054	0.500	89%	award_09 ([0, 1, 1, 1, 1]); award_12 ([0, 1, 1, 1, 1]); award_18 ([0, 1, 1, 1, 1])
`project_personnel`	0.000	0.000	100%	—
`sponsor_contacts`	0.000	0.000	100%	—
`subawards`	0.000	0.000	100%	—
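
CV here is the coefficient of variation of an array's length across a document's five replicates, that is, standard deviation divided by mean. The sketch below uses the population standard deviation, which appears consistent with the table: the per-replicate lengths listed for award_27, award_09, and award_21 reproduce the max CV values of 1.225, 0.500, and 0.298. Returning 0 when every replicate has an empty array is an assumption, and `length_cv` is a hypothetical name.

```python
from statistics import mean, pstdev


def length_cv(lengths):
    """Coefficient of variation of array lengths: population std dev / mean."""
    avg = mean(lengths)
    if avg == 0:
        # All replicates produced an empty array; treat as perfectly stable (assumption).
        return 0.0
    return pstdev(lengths) / avg


# Per-replicate lengths copied from the "worst docs" column above.
cv_special_conditions = round(length_cv([1, 0, 0, 0, 1]), 3)    # award_27
cv_linked_awards = round(length_cv([0, 1, 1, 1, 1]), 3)         # award_09
cv_terms_and_conditions = round(length_cv([2, 4, 2, 3, 4]), 3)  # award_21
```

Because CV is relative to the mean, short arrays are penalized heavily: a single replicate flipping between 0 and 1 items yields a larger CV than `budget_categories` swinging between 30 and 48 rows.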

4. Field coverage

[Figure: coverage bar chart]

4a. Scalar fields, least-populated first

field	% non-null	n / total
`sponsor_award_number`	0%	0 / 140
`award_status`	0%	0 / 140
`proposal_number`	0%	0 / 140
`start_date`	0%	0 / 140
`end_date`	0%	0 / 140
`total_intended_amount`	0%	0 / 140
`total_obligated_to_date`	5%	7 / 140
`award_received_date`	14%	20 / 140
`amendment_date`	18%	25 / 140
`expenditure_limitation`	54%	75 / 140
`indirect_cost_rate_percent`	62%	87 / 140
`award_date`	77%	108 / 140
`indirect_cost_base`	92%	129 / 140
`award_title`	96%	135 / 140
`is_research_and_development`	96%	135 / 140

4b. Array fields, least-populated first

array	% non-empty	n / total
`subawards`	4%	5 / 140
`linked_awards`	26%	37 / 140
`special_conditions`	94%	131 / 140
`budget_categories`	96%	135 / 140
`project_personnel`	100%	140 / 140
`sponsor_contacts`	100%	140 / 140
`terms_and_conditions`	100%	140 / 140
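
Coverage in both tables is a simple count over all 140 replicates: non-null for scalar fields, non-empty for arrays. A minimal sketch under the assumption that an absent key counts as unpopulated in both cases (function names are hypothetical):

```python
def scalar_coverage(replicates, field):
    """Count replicates where a scalar field is present and not null."""
    filled = sum(1 for r in replicates if r.get(field) is not None)
    return filled, len(replicates)


def array_coverage(replicates, field):
    """Count replicates where an array field is present and has at least one item."""
    filled = sum(1 for r in replicates if r.get(field))
    return filled, len(replicates)


def percent(n, total):
    """Whole-number percentage, as shown in the tables."""
    return round(100 * n / total) if total else 0


# Hypothetical replicates: one fully populated, one null scalar, one empty array.
replicates = [
    {"award_date": "2026-01-01", "subawards": [{"name": "X"}]},
    {"award_date": None, "subawards": []},
    {"award_date": "2026-01-01"},
]
scalar = scalar_coverage(replicates, "award_date")
array = array_coverage(replicates, "subawards")
```

Scalar coverage deliberately tests for `None` rather than truthiness, so legitimate values such as `0` or `false` still count as populated.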

Reproduction

python scripts/extract_only.py \
  --pdf-dir <local-pdf-dir> \
  --prompt components/nsf-award-notice-extraction-udm/prompt.md \
  --model openai/gpt-oss-120b \
  --ocr mindrouter \
  --replicates 5 \
  --max-tokens 16384 \
  --run-name 2026-04-18-gpt-oss-120b-dots-r5
