
Extraction accuracy — NSF Award Notice Extraction (UDM)

Source: evals/reports/2026-04-20-vandalizer-crossref/REPORT.md. Component: nsf-award-notice-extraction-udm.

Generated: 2026-04-20T20:32:54Z
Ground-truth corpus: 20 distinct (award, amendment) cases at real/nsf_awards/extractions
Headline run (json_schema): matched 24 source docs (0 validated, 24 in progress)
Schema: components/nsf-award-notice-extraction-udm/schema.json

Ground-truth source. Cases marked validated are human-verified extractions used as truth. Cases marked in_progress are awaiting human verification — the reference values come from an automated second-opinion pipeline, so they serve as an agreement signal rather than an accuracy signal. Validation state is tracked in per-case metadata.yaml and partitioned below.

1. Headline — accuracy on validated cases

No cases have completed human validation yet. Every case currently falls into the in-progress partition in §2. As you validate cases (flip validation_state: validated in their metadata.yaml), they will appear here.

2. Pending validation — agreement on in-progress cases

These 24 cases are still awaiting human verification. Numbers below measure agreement with the automated reference pipeline, not accuracy. Use them to spot candidates for review: where the extractor and the reference differ, at least one of them is wrong.

Scalar fields

24 in-progress cases · overall 97.8% agreement (2088/2134 replicate-field pairs).

scalar-field agreement heatmap — in progress

Budget line items

Overall 98.8% budget-line agreement (2574/2605 replicate-slot pairs).

budget-line agreement heatmap — in progress

3. Headline by run mode

| run / mode | matched cases | scalar correct / compared | scalar accuracy | budget correct / compared | budget accuracy |
|---|---|---|---|---|---|
| `none` | 24 of 28 source docs (0 validated, 24 in progress) | 1848 / 1900 | 97.3% | 2578 / 2660 | 96.9% |
| `json_object` | 24 of 28 source docs (0 validated, 24 in progress) | 1916 / 1952 | 98.2% | 2615 / 2658 | 98.4% |
| `json_schema` | 24 of 28 source docs (0 validated, 24 in progress) | 2088 / 2134 | 97.8% | 2574 / 2605 | 98.8% |
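
As a quick arithmetic check, the percentages in this table are consistent with reading each count pair as correct / compared. A minimal sketch, with the counts copied from the rows above; the `rows` dictionary and the `pct` helper are illustrative names, not part of the report tooling:

```python
# Counts copied from the run-mode table. Each pair is read as
# (correct, compared); rows and pct are illustrative names only.
rows = {
    "none": ((1848, 1900), (2578, 2660)),
    "json_object": ((1916, 1952), (2615, 2658)),
    "json_schema": ((2088, 2134), (2574, 2605)),
}


def pct(correct: int, compared: int) -> float:
    """Share of compared pairs that agree, as a percentage with one decimal."""
    return round(100.0 * correct / compared, 1)


for mode, (scalar, budget) in rows.items():
    print(f"{mode}: scalar {pct(*scalar)}%, budget {pct(*budget)}%")
```

Reading the pairs the other way round (compared / correct) would give values above 100%, which is why the first number in each pair is taken as the correct count here.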

4. Incorrect-extraction examples — json_schema

Up to 5 examples per field where the extractor produced a value different from the reference. Includes both validated (definite errors) and in-progress (candidate errors) cases.

cfda_name — 84% (95/113)

| award | truth | extractor | truth raw | extractor raw |
|---|---|---|---|---|
| 2441918 | engineering grants (predominant source of funding for sefa reporting) | engineering grants (predominant source for sefa reporting) | Engineering Grants (Predominant source of funding for SEFA r | Engineering Grants (Predominant source for SEFA reporting) |
| 2514552 | geosciences (predominant source of funding for sefa reporting), 47.076 education and human resources | geosciences (predominant source of funding for sefa reporting); education and human resources | Geosciences (Predominant source of funding for SEFA reportin | Geosciences (Predominant source of funding for SEFA reportin |
| 2514552 | geosciences (predominant source of funding for sefa reporting), 47.076 education and human resources | geosciences (predominant source of funding for sefa reporting) | Geosciences (Predominant source of funding for SEFA reportin | Geosciences (Predominant source of funding for SEFA reportin |
| 2514552 | geosciences (predominant source of funding for sefa reporting), 47.076 education and human resources | geosciences (predominant source of funding for sefa reporting); education and human resources | Geosciences (Predominant source of funding for SEFA reportin | Geosciences (Predominant source of funding for SEFA reportin |
| 2514552 | geosciences (predominant source of funding for sefa reporting), 47.076 education and human resources | geosciences (predominant source of funding for sefa reporting); education and human resources | Geosciences (Predominant source of funding for SEFA reportin | Geosciences (Predominant source of funding for SEFA reportin |

is_collaborative_research — 91% (107/118)

| award | truth | extractor | truth raw | extractor raw |
|---|---|---|---|---|
| 2531886 | False | True | False | True |
| 2531886 | False | True | False | True |
| 2531886 | False | True | False | True |
| 2531886 | False | True | False | True |
| 2531886 | False | True | False | True |

award_title — 92% (108/117)

| award | truth | extractor | truth raw | extractor raw |
|---|---|---|---|---|
| 2511003 | equipment: mri: track 1 acquisition of element aviti system to enable multi-omics research and research training. | equipment: mri: track 1 acquisition of element a viti system to enable multi-omics research and research training | Equipment: MRI: Track 1 Acquisition of Element AVITI System | Equipment: MRI: Track 1 Acquisition of Element A VITI System |
| 2511003 | equipment: mri: track 1 acquisition of element aviti system to enable multi-omics research and research training. | equipment: mri: track 1 acquisition of element a viti system to enable multi-omics research and research training | Equipment: MRI: Track 1 Acquisition of Element AVITI System | Equipment: MRI: Track 1 Acquisition of Element A VITI System |
| 2511003 | equipment: mri: track 1 acquisition of element aviti system to enable multi-omics research and research training. | equipment: mri: track 1 acquisition of element a viti system to enable multi-omics research and research training | Equipment: MRI: Track 1 Acquisition of Element AVITI System | Equipment: MRI: Track 1 Acquisition of Element A VITI System |
| 2511003 | equipment: mri: track 1 acquisition of element aviti system to enable multi-omics research and research training. | equipment: mri: track 1 acquisition of element a viti system to enable multi-omics research and research training | Equipment: MRI: Track 1 Acquisition of Element AVITI System | Equipment: MRI: Track 1 Acquisition of Element A VITI System |
| 2511003 | equipment: mri: track 1 acquisition of element aviti system to enable multi-omics research and research training. | equipment: mri: track 1 acquisition of element a viti system to enable multi-omics research and research training | Equipment: MRI: Track 1 Acquisition of Element AVITI System | Equipment: MRI: Track 1 Acquisition of Element A VITI System |

cfda_number — 95% (107/113)

| award | truth | extractor | truth raw | extractor raw |
|---|---|---|---|---|
| 2514552 | 47.050 | 47.050, 47.076 | 47.050 | 47.050, 47.076 |
| 2514552 | 47.050 | 47.050; 47.076 | 47.050 | 47.050; 47.076 |
| 2514552 | 47.050 | 47.050, 47.076 | 47.050 | 47.050, 47.076 |
| 2514552 | 47.050 | 47.050, 47.076 | 47.050 | 47.050, 47.076 |
| 2514552 | 47.050 | 47.050; 47.076 | 47.050 | 47.050; 47.076 |

funding_opportunity_number — 98% (111/113)

| award | truth | extractor | truth raw | extractor raw |
|---|---|---|---|---|
| 2527135 | nsf 25-509 | 25-509 | NSF 25-509 | 25-509 |
| 2316126 | nsf 22-633 | 22-633 | NSF 22-633 | 22-633 |

5. Appendix — per-field rollup tables across modes

Scalar field accuracy/agreement by mode

| field | `none` | `json_object` | `json_schema` |
|---|---|---|---|
| `sponsor_award_number` | — | — | — |
| `award_status` | — | — | — |
| `proposal_number` | — | — | — |
| `amendment_type` | — | — | — |
| `amendment_date` | — | — | — |
| `amendment_description` | — | — | — |
| `award_received_date` | — | — | — |
| `start_date` | — | — | 100% (19/19) |
| `end_date` | — | — | 100% (19/19) |
| `total_intended_amount` | — | — | 100% (111/111) |
| `expenditure_limitation` | — | — | — |
| `indirect_cost_base` | — | — | — |
| `fees` | — | 100% (85/85) | 100% (82/82) |
| `cfda_name` | 84% (96/115) | 87% (100/115) | 84% (95/113) |
| `cfda_number` | 90% (103/115) | 96% (110/115) | 95% (107/113) |
| `is_collaborative_research` | 91% (109/120) | 92% (110/120) | 91% (107/118) |
| `award_title` | 96% (110/115) | 96% (110/115) | 92% (108/117) |
| `funding_opportunity_number` | 96% (110/115) | 99% (114/115) | 98% (111/113) |
| `award_id` | 100% (120/120) | 100% (120/120) | 100% (118/118) |
| `award_number` | 100% (120/120) | 100% (120/120) | 100% (118/118) |
| `sponsor_name` | 100% (120/120) | 100% (120/120) | 100% (118/118) |
| `managing_division` | 100% (120/120) | 100% (120/120) | 100% (118/118) |
| `award_instrument` | 100% (120/120) | 100% (120/120) | 100% (118/118) |
| `is_research_and_development` | 100% (115/115) | 100% (115/115) | 100% (113/113) |
| `funding_opportunity_title` | 100% (115/115) | 100% (115/115) | 100% (113/113) |
| `amendment_number` | 100% (120/120) | 100% (120/120) | 100% (118/118) |
| `award_date` | 100% (92/92) | 100% (77/77) | 100% (81/81) |
| `amount_obligated_this_amendment` | 100% (115/115) | 100% (115/115) | 100% (110/110) |
| `total_obligated_to_date` | 100% (2/2) | 100% (2/2) | 100% (102/102) |
| `cost_share_approved_amount` | 100% (106/106) | 100% (88/88) | 100% (52/52) |
| `indirect_cost_rate_percent` | 100% (55/55) | 100% (55/55) | 100% (50/50) |

One-sided emissions (coverage asymmetry)

Extractor emitted a value where truth/reference has null (**hallucinated**) vs. truth/reference has a value the extractor emitted as null (**missing**). Per mode.

| field | `none` halluc. / missing | `json_object` halluc. / missing | `json_schema` halluc. / missing |
|---|---|---|---|
| `sponsor_award_number` | — | — | — |
| `award_status` | — | — | — |
| `proposal_number` | — | — | — |
| `amendment_type` | 120 / 0 | 120 / 0 | 118 / 0 |
| `amendment_date` | 20 / 0 | 9 / 0 | — |
| `amendment_description` | 120 / 0 | 120 / 0 | 118 / 0 |
| `award_received_date` | 20 / 0 | 20 / 0 | 20 / 0 |
| `start_date` | 0 / 120 | 0 / 120 | 0 / 99 |
| `end_date` | 0 / 120 | 0 / 120 | 0 / 99 |
| `total_intended_amount` | 0 / 115 | 0 / 115 | 5 / 2 |
| `expenditure_limitation` | — | — | 1 / 0 |
| `indirect_cost_base` | 114 / 0 | 113 / 0 | 103 / 0 |
| `fees` | 0 / 85 | 35 / 0 | 35 / 1 |
| `cfda_name` | 0 / 5 | 0 / 5 | 0 / 5 |
| `cfda_number` | 0 / 5 | 0 / 5 | 0 / 5 |
| `is_collaborative_research` | 0 / 0 | 0 / 0 | 0 / 0 |
| `award_title` | 0 / 5 | 0 / 5 | 0 / 1 |
| `funding_opportunity_number` | 0 / 5 | 0 / 5 | 0 / 5 |
| `award_id` | 0 / 0 | 0 / 0 | 0 / 0 |
| `award_number` | 0 / 0 | 0 / 0 | 0 / 0 |
| `sponsor_name` | 0 / 0 | 0 / 0 | 0 / 0 |
| `managing_division` | 0 / 0 | 0 / 0 | 0 / 0 |
| `award_instrument` | 0 / 0 | 0 / 0 | 0 / 0 |
| `is_research_and_development` | 0 / 5 | 0 / 5 | 0 / 5 |
| `funding_opportunity_title` | 0 / 5 | 0 / 5 | 0 / 5 |
| `amendment_number` | 0 / 0 | 0 / 0 | 0 / 0 |
| `award_date` | 0 / 28 | 0 / 43 | 0 / 37 |
| `amount_obligated_this_amendment` | 5 / 0 | 5 / 0 | 5 / 3 |
| `total_obligated_to_date` | 0 / 108 | 0 / 108 | 9 / 6 |
| `cost_share_approved_amount` | 10 / 4 | 7 / 22 | 8 / 56 |
| `indirect_cost_rate_percent` | 25 / 0 | 25 / 0 | 25 / 3 |

Budget-line accuracy/agreement by mode

Rows are NSF-format `code.subcode` slots. Columns show **correct / compared** replicate-slot pairs per mode.

| slot | `none` | `json_object` | `json_schema` |
|---|---|---|---|
| `A` | 60% (48/80) | 79% (63/80) | 82% (64/78) |
| `B` | — | — | — |
| `B.GraduateStudents` | 100% (103/103) | 100% (105/105) | 100% (101/101) |
| `B.Other` | 100% (107/107) | 100% (107/107) | 100% (104/104) |
| `B.OtherProfessionals` | 100% (98/98) | 100% (100/100) | 100% (97/97) |
| `B.PostDoctoral` | 100% (93/93) | 100% (95/95) | 100% (92/92) |
| `B.SecretarialClerical` | 100% (108/108) | 100% (108/108) | 100% (105/105) |
| `B.UndergraduateStudents` | 100% (108/108) | 100% (110/110) | 100% (106/106) |
| `C` | 56% (65/115) | 77% (88/114) | 85% (94/111) |
| `D` | 100% (115/115) | 100% (115/115) | 100% (113/113) |
| `E.Domestic` | 100% (120/120) | 100% (120/120) | 100% (118/118) |
| `E.International` | 100% (120/120) | 100% (120/120) | 100% (118/118) |
| `F` | — | — | — |
| `F.Other` | 100% (100/100) | 100% (100/100) | 100% (98/98) |
| `F.Stipends` | 100% (115/115) | 100% (115/115) | 100% (113/113) |
| `F.Subsistence` | 100% (120/120) | 100% (120/120) | 100% (118/118) |
| `F.Total` | 100% (62/62) | 100% (57/57) | 100% (63/63) |
| `F.Travel` | 100% (120/120) | 100% (120/120) | 100% (118/118) |
| `G` | — | — | — |
| `G.ComputerServices` | 100% (85/85) | 100% (85/85) | 100% (83/83) |
| `G.ConsultantServices` | 100% (85/85) | 100% (85/85) | 100% (83/83) |
| `G.MaterialsSupplies` | 100% (85/85) | 100% (85/85) | 100% (83/83) |
| `G.Other` | 100% (85/85) | 100% (85/85) | 100% (83/83) |
| `G.Publication` | 100% (85/85) | 100% (85/85) | 100% (83/83) |
| `G.Subawards` | 100% (85/85) | 100% (85/85) | 100% (83/83) |
| `G.Total` | 100% (46/46) | 100% (42/42) | 100% (42/42) |
| `H` | 100% (85/85) | 100% (85/85) | 100% (83/83) |
| `I` | 100% (85/85) | 100% (85/85) | 100% (83/83) |
| `J` | 100% (80/80) | 100% (80/80) | 100% (80/80) |
| `K` | — | — | — |
| `L` | 100% (85/85) | 100% (85/85) | 100% (83/83) |
| `M` | 100% (85/85) | 100% (85/85) | 100% (83/83) |

Methodology

  • Matching. Source documents are matched to ground-truth cases by the tuple (award_number, amendment_number), majority-voted across replicates. This matters for amendment series: three source documents may share an award_number but represent the base award, Mod 1, and Mod 2 respectively.
  • Validation partition. Each ground-truth case carries a sibling <results-N>.meta.yaml file with validation_state: validated | in_progress. Validated cases feed the §1 headline; in-progress cases feed the §2 agreement figures. As cases are validated, they migrate from §2 to §1 without any code change.
  • Scoring. Every (award × field × replicate) tuple is assigned exactly one of: correct, incorrect, hallucinated (extractor emitted a value; truth null), missing (truth has a value; extractor null), correct_absent (both null). Accuracy percentages use correct / (correct + incorrect), that is, only replicates where both sides emit a value. Hallucination and missing counts are surfaced separately (orange/blue cells in the heatmaps, and the one-sided emissions table in the §5 appendix).
  • Scalar scope. All top-level scalar fields declared in the UDM schema.
  • Budget scope. The budget_categories array is compared by (code, subcode) — both sides already emit native UDM shape, so no label mapping is required. The top-level fees scalar is also scored. Nested objects (recipient_organization, current_budget_period) and other arrays (project_personnel, subawards, terms_and_conditions, special_conditions) remain out of scope for this pass.
  • Normalization. Currency ($584,845 → 584845), percentages (50.0000% → 50.0), and US dates (08/18/2025 → 2025-08-18) are coerced. "N/A", "", and null are all treated as null.
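
The normalization and scoring rules above can be sketched as follows. This is a minimal illustration, not the code in scripts/compare_to_vandalizer.py: the function names, the regular expressions, and the lowercasing of free text (inferred from the normalized values shown in §4) are all assumptions.

```python
import re
from datetime import datetime
from typing import List, Optional

NULL_TOKENS = {"", "n/a"}  # assumption: null tokens are matched case-insensitively


def normalize(value) -> Optional[str]:
    """Coerce a raw value to a comparable string, or None for null-like input."""
    if value is None:
        return None
    text = str(value).strip()
    if text.lower() in NULL_TOKENS:
        return None
    if re.fullmatch(r"\$[\d,]+(\.\d+)?", text):      # "$584,845" -> "584845"
        return text.replace("$", "").replace(",", "")
    if re.fullmatch(r"\d+(\.\d+)?%", text):          # "50.0000%" -> "50.0"
        return str(float(text[:-1]))
    if re.fullmatch(r"\d{2}/\d{2}/\d{4}", text):     # "08/18/2025" -> "2025-08-18"
        return datetime.strptime(text, "%m/%d/%Y").date().isoformat()
    return text.lower()  # assumption: free text is compared case-insensitively


def score(truth, extracted) -> str:
    """Assign exactly one of the five outcomes to a (truth, extractor) pair."""
    t, e = normalize(truth), normalize(extracted)
    if t is None and e is None:
        return "correct_absent"
    if t is None:
        return "hallucinated"
    if e is None:
        return "missing"
    return "correct" if t == e else "incorrect"


def accuracy(outcomes: List[str]) -> Optional[float]:
    """correct / (correct + incorrect); one-sided and absent pairs are excluded."""
    correct = outcomes.count("correct")
    compared = correct + outcomes.count("incorrect")
    return correct / compared if compared else None


pairs = [
    ("$584,845", "584845"),      # correct after currency coercion
    ("N/A", ""),                 # correct_absent: both sides null-like
    (None, "47.050"),            # hallucinated
    ("NSF 25-509", "25-509"),    # incorrect
    ("08/18/2025", None),        # missing
]
outcomes = [score(t, e) for t, e in pairs]
print(outcomes, accuracy(outcomes))
```

Because hallucinated, missing, and correct_absent pairs never enter the denominator, a field can show 100% accuracy while still having a large one-sided count, which is why the two are reported in separate tables.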

Comparison script: scripts/compare_to_vandalizer.py
Plotter: scripts/plot_vandalizer_heatmaps.py (in the prompt-library repo) reads summary.json and writes the PNGs under charts/. Regenerate with python scripts/plot_vandalizer_heatmaps.py --summary <path>/summary.json --mode json_schema.
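
The matching rule from the Methodology list (majority vote on the (award_number, amendment_number) tuple across replicates) might look like the sketch below. It is not taken from the comparison script: `majority_key`, `match_case`, the replicate dictionaries, and the amendment numbers in the example are hypothetical, and tie-breaking by first-seen key is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, str]  # (award_number, amendment_number)


def majority_key(replicates: Iterable[dict]) -> Optional[Key]:
    """Return the most common (award_number, amendment_number) across replicates."""
    votes = Counter(
        (str(r["award_number"]), str(r["amendment_number"]))
        for r in replicates
        if r.get("award_number") is not None and r.get("amendment_number") is not None
    )
    if not votes:
        return None
    # Counter.most_common keeps first-seen order among ties (an assumed tie rule).
    return votes.most_common(1)[0][0]


def match_case(replicates: Iterable[dict], truth_index: Dict[Key, str]) -> Optional[str]:
    """Look up the ground-truth case for a source document, or None if unmatched."""
    key = majority_key(replicates)
    return truth_index.get(key) if key is not None else None


# Hypothetical example: three replicates of one source document, one of which
# misreads the amendment number. The case labels are placeholders.
replicates = [
    {"award_number": "2514552", "amendment_number": "001"},
    {"award_number": "2514552", "amendment_number": "001"},
    {"award_number": "2514552", "amendment_number": "000"},
]
truth_index = {
    ("2514552", "000"): "base award",
    ("2514552", "001"): "Mod 1",
}
print(match_case(replicates, truth_index))
```

Keying on the pair rather than on award_number alone is what keeps a base award and its later amendments from collapsing into a single ground-truth case.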