vandalizer-to-udm-translation¶
Version: 0.1.0
Tags: nsf award notice udm translation vandalizer structured-transformation json
Audience: ingest-pipelines, sponsored-programs-staff
Manifestations in repo: prompt.md
Converts a Vandalizer NSF-extraction JSON object (flat key/value form) into a single JSON object conforming to the nsf-award-notice-extraction-udm schema. This is a pure transformation — no information is invented, and fields Vandalizer does not capture emit as null or documented defaults.
Output contract: see schema.json (local wrapper delegating to nsf-award-notice-extraction-udm v1.1.0)
Inputs¶
One JSON object produced by the Vandalizer NSF extraction task. Shape:
- Flat, single-level key/value; all values are strings.
- Standard Vandalizer convention: `"N/A"` denotes absent values.
- US-format dates (`MM/DD/YYYY`), currency with `$` and commas (`"$584,845"`), percentages with `%` (`"50.0000%"`), semicolon-delimited lists for multi-value fields (Co-PI names/emails/organizations).
- Flat NSF-format budget line items keyed by label (`Senior Personnel Amount`, `Post Doctoral Scholars Count`, etc.), mirroring the 18-category table.
- A trailing review-metadata field: `"what data was highlighted yellow in the original document?"`.
See evals/cases/vandalizer-trial-2511003/ for the seed input.
Outputs¶
A single JSON object conforming to the local schema.json wrapper, which delegates to nsf-award-notice-extraction-udm v1.1.0. See prompt.md for the field-by-field translation rules.
Scope and non-scope¶
In scope. Deterministic field-by-field translation. Format normalization (ISO dates, plain currency, plain percents, typed booleans). NSF-format budget line → UDM budget_categories code/subcode assignment. Subaward inference using the UDM rule (Co-PI at non-recipient org + non-zero G.Subawards). Carrying the Vandalizer review-highlight annotation through source_provenance.review_annotations.
Out of scope. Re-extracting missing fields from the original PDF. Vandalizer does not capture amendment metadata (number, type, date, description), recipient address/UEI/email, proposal number, or the email header's received-date. The translator emits null / documented defaults for these and does not attempt to recover them. Downstream systems that need these fields should run the full nsf-award-notice-extraction-udm extractor on the PDF instead, or extend Vandalizer's output schema.
Defaults and data-quality notes¶
- `amendment_number` is required by UDM but absent from Vandalizer output. The translator emits `"000"` (new project / initial obligation) by default. Do not deploy this translator against amendment notices without first adding amendment fields to Vandalizer.
- `recipient_organization.legal_name` is taken from `Principal Investigator Organization`. Address, email, and UEI are always `null`. Ingest consumers should treat recipient records produced by this translator as needing enrichment from a separate organization-resolution step.
- `subawards` entries are always `inferred: true` (Vandalizer never itemizes subrecipients).
- `linked_awards` is always `[]`.
- `fees` is populated from Vandalizer's `Fees` field (schema v1.1.0 scalar).
- `source_provenance.extractor = "vandalizer-to-udm-translation"`, `upstream_extractor = "Vandalizer"`. Vandalizer's `"what data was highlighted yellow in the original document?"` field, when not `"N/A"`, is emitted as a `review_annotations` entry with `label: "highlighted-yellow"`.
Relationship to other components¶
| Concern | nsf-award-notice-extraction-udm | vandalizer-to-udm-translation |
|---|---|---|
| Input | NSF Award Notice PDF (or pasted text) | Vandalizer flat-JSON extraction output |
| Category | extraction | transformation |
| Output schema | owns it | conforms to it |
| Field coverage | Full UDM contract | Subset — amendment metadata, recipient contact info, received-date all null |
| Subaward entries | explicit or inferred | always inferred |
Manifestations¶
- `prompt.md`: canonical, LLM-agnostic prompt
Evals¶
See evals/cases/vandalizer-trial-2511003/ for the seed case — FAIN 2511003 (MRI: Track 1 AVITI System), a Vandalizer extraction of an NSF Standard Grant Amendment 000 notice from 2025.
Provenance¶
Authored 2026-04-20 in response to a trial Vandalizer extraction produced against an NSF-26-508-era award notice at the University of Idaho. The schema v1.1.0 bump (fees, source_provenance) was driven by gaps surfaced during the initial translator spec review.
Contract scope¶
- Output format: `json_object`
- Contract scope: `delegated_repo_local_schema`
- Validation surfaces: `json_schema`, `golden_eval_cases`
- Schema entrypoints: `#`
- Notes: Translator wrapper contract. The local schema delegates to nsf-award-notice-extraction-udm/schema.json so downstream consumers have a concrete contract surface inside this component directory.
- Machine-readable catalog entry: `component_catalog.json`
Triad integration¶
- UDM alignment: `delegated_repo_local_schema`. Output aligns to the repo-local NSF award notice extraction schema, which itself is UDM-aligned but maintained in prompt-library.
- Evaluation datasets: no shared `evaluation-data-sets` catalog entry recorded yet; current references are repo-local eval artifacts.
- Harness notes: Treat this as a transformation component, not a source-of-truth extractor. Validate output against the local wrapper schema and remember that missing fields default to null or documented defaults based on the upstream Vandalizer payload.
- Related component: `nsf-award-notice-extraction-udm` (`delegates_output_schema_to`). This translator's output contract is the award-notice schema.
Prompt body¶
Source: prompt.md.
Vandalizer → UDM NSF Award Notice Translation — Prompt¶
Purpose: Convert a Vandalizer NSF-extraction JSON object (flat key/value form) into a single JSON object conforming to the `nsf-award-notice-extraction-udm` schema v1.1.0. This is a pure transformation: no information is invented, and fields Vandalizer does not capture emit as `null` or documented defaults.

Expected input: One JSON object produced by the Vandalizer NSF extraction task (flat, string-valued fields; `"N/A"` used for missing values).

Expected output: One JSON object validating against `schema.json`, which delegates to `nsf-award-notice-extraction-udm` v1.1.0. No prose, no markdown outside the JSON.
Prompt¶
You are a deterministic translator. Read one Vandalizer NSF extraction object and emit one UDM NSF Award Notice JSON object. Do not paraphrase, re-order, or summarize; map field by field using the rules below. The same input must always produce the same output.
Output contract¶
Emit one JSON object. No preamble, no trailing commentary. If the runtime requires a fenced block, wrap the object in a single ```json ... ``` block and emit nothing else. Every required UDM array (project_personnel, sponsor_contacts, budget_categories, subawards, linked_awards, terms_and_conditions, special_conditions) must be present; emit [] when empty.
Normalization rules¶
- Missing values. Treat any Vandalizer value of `"N/A"`, `""`, or the literal string `"null"` (case-insensitive) as absent. Emit `null` for scalars and omit the would-be list item for arrays.
- Currency. Strip `$`, commas, whitespace. `"$584,845"` → `584845`. `"$0"` → `0` (not `null`).
- Dates. US `MM/DD/YYYY` → ISO `YYYY-MM-DD`. Missing → `null`.
- Percent. Strip `%`. `"50.0000%"` → `50.0`.
- Booleans. `"Yes"` → `true`, `"No"` → `false`, absent → `null`.
- JSON types. Emit `0` not `"0"`, `false` not `"false"`, `null` not `"null"`.
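The normalization rules above can be sketched as a handful of small helpers. This is an illustrative sketch rather than part of the prompt contract: the function names (`clean`, `parse_currency`, `parse_date`, `parse_percent`, `parse_bool`) are hypothetical, and treating `Yes`/`No` case-insensitively is an assumption the rules do not state.

```python
import re
from datetime import datetime
from typing import Optional, Union

ABSENT = {"", "n/a", "null"}  # absent markers, compared case-insensitively


def clean(value: Optional[str]) -> Optional[str]:
    """Trim a raw Vandalizer string; return None when it counts as absent."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in ABSENT else text


def parse_currency(value: Optional[str]) -> Union[int, float, None]:
    """Strip '$', commas, and whitespace; a stated zero stays 0, not None."""
    text = clean(value)
    if text is None:
        return None
    amount = float(re.sub(r"[$,\s]", "", text))
    return int(amount) if amount.is_integer() else amount


def parse_date(value: Optional[str]) -> Optional[str]:
    """Convert US MM/DD/YYYY to ISO YYYY-MM-DD (raises ValueError if malformed)."""
    text = clean(value)
    if text is None:
        return None
    return datetime.strptime(text, "%m/%d/%Y").date().isoformat()


def parse_percent(value: Optional[str]) -> Optional[float]:
    """Strip the percent sign and return a float."""
    text = clean(value)
    if text is None:
        return None
    return float(text.replace("%", "").strip())


def parse_bool(value: Optional[str]) -> Optional[bool]:
    """Map Yes/No to booleans.

    Assumption: matching is case-insensitive and unrecognized strings map to None.
    """
    text = clean(value)
    if text is None:
        return None
    return {"yes": True, "no": False}.get(text.lower())
```

Keeping absence detection in one place (`clean`) means every typed parser applies the same `"N/A"` / `""` / `"null"` rule before it attempts a conversion.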
Scalar mappings¶
- `award_number` ← `Award Number` (required; never null).
- `award_id` ← `"NSF-"` + `Award Number`.
- `sponsor_name` ← `"National Science Foundation"` (constant for this translator).
- `sponsor_award_number` ← `null` (Vandalizer does not distinguish it from `award_number`).
- `award_title` ← `Project Title`.
- `award_instrument` ← `Award Instrument`.
- `managing_division` ← `Managing Division Abbreviation`.
- `award_status` ← `null`.
- `is_research_and_development` ← `Research And Development Award` (Yes/No/absent).
- `is_collaborative_research` ← `true` if and only if `Project Title` starts with `"Collaborative Research:"` (case-insensitive), else `false`.
- `proposal_number` ← `null`.
- `award_date` ← `Award Date`.
- `award_received_date` ← `null` (Vandalizer does not capture the email header date).
- `start_date` ← `Award Period Start Date`.
- `end_date` ← `Award Period End Date`.
- `amount_obligated_this_amendment` ← `Amount Obligated By This Amendment`.
- `total_intended_amount` ← `Total Intended Award Amount`.
- `total_obligated_to_date` ← `Total Amount Obligated To Date`.
- `cost_share_approved_amount` ← `Total Approved Cost Share Or Matching Amount` (emit `0`, not `null`, when input is `"$0"`).
- `expenditure_limitation` ← `Expenditure Limitation`.
- `indirect_cost_rate_percent` ← `Indirect Cost Rate` (percent stripped).
- `indirect_cost_base` ← `"MTDC"` if `Modified Total Direct Costs` holds a numeric amount; otherwise `null`. A populated field implies the base is MTDC; when it is `"N/A"` the base cannot be determined.
- `fees` ← `Fees` (currency stripped). `"$0"` → `0`; absent → `null`.
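Most scalar mappings are direct copies, but three are derived. The sketch below illustrates `award_id`, `is_collaborative_research`, and `indirect_cost_base` only. The helper names and the `AMOUNT` pattern are hypothetical, and raising on a missing `Award Number` is just one assumed way to honor "required; never null".

```python
import re
from typing import Optional

# Assumed shape of a numeric amount string, e.g. "$584,845" or "0.00".
AMOUNT = re.compile(r"^\$?\d[\d,]*(\.\d+)?$")


def _present(record: dict, key: str) -> Optional[str]:
    """Return the trimmed value, or None for "N/A", "" or "null"."""
    value = (record.get(key) or "").strip()
    return None if value.lower() in {"", "n/a", "null"} else value


def derived_scalars(record: dict) -> dict:
    """Compute the scalars that are derived rather than copied."""
    award_number = _present(record, "Award Number")
    if award_number is None:
        # Assumption: fail loudly instead of emitting a null award_number.
        raise ValueError("Award Number is required")
    title = _present(record, "Project Title")
    mtdc = _present(record, "Modified Total Direct Costs")
    return {
        "award_number": award_number,
        "award_id": f"NSF-{award_number}",
        "sponsor_name": "National Science Foundation",
        "is_collaborative_research": bool(
            title and title.lower().startswith("collaborative research:")
        ),
        "indirect_cost_base": "MTDC" if mtdc and AMOUNT.match(mtdc) else None,
    }
```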
Funding Opportunity split¶
Tokenize Funding Opportunity on whitespace. The funding_opportunity_number is the sponsor-style prefix up through the first token that matches ^[A-Z0-9][A-Z0-9\-]*$ after a leading alpha prefix (in practice the first two tokens: "NSF 23-519", "PD 23-221Y"). The remainder is the funding_opportunity_title; strip trailing punctuation (:, .).
- Example: `"NSF 23-519 Major Research Instrumentation Program:"` → number `"NSF 23-519"`, title `"Major Research Instrumentation Program"`.
- Example: `"PD 23-221Y Growing Research Access for Nationally Transformative Equity and Diversity"` → number `"PD 23-221Y"`, title `"Growing Research Access for Nationally Transformative Equity and Diversity"`.
If the input value does not match this pattern, emit the whole stripped string as funding_opportunity_title and funding_opportunity_number as null.
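A minimal sketch of the split, assuming the "first two tokens" reading of the rule: an alphabetic prefix followed by one token matching the stated pattern. The function name is hypothetical, and rejoining tokens with single spaces (which collapses internal whitespace runs in the title) is an assumption.

```python
import re
from typing import Optional, Tuple

NUMBER_TOKEN = re.compile(r"^[A-Z0-9][A-Z0-9\-]*$")


def split_funding_opportunity(value: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (funding_opportunity_number, funding_opportunity_title)."""
    text = (value or "").strip()
    if text.lower() in {"", "n/a", "null"}:
        return None, None
    tokens = text.split()
    if len(tokens) >= 2 and tokens[0].isalpha() and NUMBER_TOKEN.match(tokens[1]):
        number = " ".join(tokens[:2])
        title = " ".join(tokens[2:]).rstrip(":. ")
        return number, title or None
    # No recognizable number: the whole stripped string becomes the title.
    return None, text.rstrip(":. ")
```

Applied to the first example above, `split_funding_opportunity("NSF 23-519 Major Research Instrumentation Program:")` yields `("NSF 23-519", "Major Research Instrumentation Program")`.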
Assistance Listing split¶
Split Assistance Listing Number And Name on the first whitespace run following the leading dotted number (^\d{2}\.\d{3}). The number goes into cfda_number; the remainder, verbatim including any trailing parenthetical annotation, goes into cfda_name.
- Example: `"47.074 Biological Sciences (Predominant source of funding for SEFA reporting)"` → number `"47.074"`, name `"Biological Sciences (Predominant source of funding for SEFA reporting)"`.
Amendment fields¶
Vandalizer does not capture amendment metadata. Emit the following defaults:
- `amendment_number` = `"000"` (required by UDM; represents initial obligation).
- `amendment_type` = `null`.
- `amendment_date` = `null`.
- `amendment_description` = `null`.
This default is correct for new-project notices. If a translator operator later determines the input represents an amendment, the amendment_number must be overridden out-of-band; this prompt does not infer amendment status.
recipient_organization¶
- `legal_name` ← `Principal Investigator Organization`.
- `address`, `email`, `uei` ← `null` (Vandalizer does not extract these).
Fallback: if Principal Investigator Organization is absent, use the first semicolon-separated value of Co Principal Investigator Organization. If still absent, emit legal_name as "UNKNOWN" — the ingest service will surface this as a data-quality issue.
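The fallback chain can be sketched as below. The function name is hypothetical; the keys are the Vandalizer field names cited above.

```python
from typing import Optional


def _present(record: dict, key: str) -> Optional[str]:
    """Return the trimmed value, or None for "N/A", "" or "null"."""
    value = (record.get(key) or "").strip()
    return None if value.lower() in {"", "n/a", "null"} else value


def recipient_legal_name(record: dict) -> str:
    """PI organization, else the first co-PI organization, else "UNKNOWN"."""
    primary = _present(record, "Principal Investigator Organization")
    if primary:
        return primary
    co_orgs = _present(record, "Co Principal Investigator Organization")
    if co_orgs:
        first = co_orgs.split(";")[0].strip()
        if first:
            return first
    return "UNKNOWN"
```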
current_budget_period¶
Populate from the scalars:
- `period_number` = `1`
- `period_label` = `null`
- `start_date` ← `Award Period Start Date`
- `end_date` ← `Award Period End Date`
- `direct_cost` ← `Total Direct Costs`
- `indirect_cost` ← `Indirect Costs`
- `obligated_amount` ← `Amount Obligated By This Amendment`
Emit current_budget_period: null only when Award Period Start Date, Award Period End Date, or Amount Obligated By This Amendment is absent (the UDM schema requires these three).
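A sketch of the period construction and its null guard. The helper names are hypothetical. Note that an obligated amount of `"$0"` parses to `0`, which is present, so only a genuinely absent value collapses the period to `None`.

```python
import re
from datetime import datetime
from typing import Optional


def _present(record: dict, key: str) -> Optional[str]:
    value = (record.get(key) or "").strip()
    return None if value.lower() in {"", "n/a", "null"} else value


def _money(record: dict, key: str):
    text = _present(record, key)
    if text is None:
        return None
    amount = float(re.sub(r"[$,\s]", "", text))
    return int(amount) if amount.is_integer() else amount


def _iso_date(record: dict, key: str) -> Optional[str]:
    text = _present(record, key)
    if text is None:
        return None
    return datetime.strptime(text, "%m/%d/%Y").date().isoformat()


def current_budget_period(record: dict) -> Optional[dict]:
    start = _iso_date(record, "Award Period Start Date")
    end = _iso_date(record, "Award Period End Date")
    obligated = _money(record, "Amount Obligated By This Amendment")
    # The three schema-required fields must all be present.
    if start is None or end is None or obligated is None:
        return None
    return {
        "period_number": 1,
        "period_label": None,
        "start_date": start,
        "end_date": end,
        "direct_cost": _money(record, "Total Direct Costs"),
        "indirect_cost": _money(record, "Indirect Costs"),
        "obligated_amount": obligated,
    }
```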
project_personnel¶
Emit one entry for the PI when Principal Investigator Name is present:
{"role": "PI", "name": <PI Name>, "email": <PI Email or null>, "organization": <PI Org or null>, "is_at_recipient_institution": true}
Then split Co Principal Investigator Name, Co Principal Investigator Email, and Co Principal Investigator Organization on ";", trim whitespace, and zip by index. Emit one role: "co-PI" entry per name.
- When there are fewer emails than names, the trailing entries get `email: null`.
- When there are fewer organizations than names (the common case: Vandalizer often collapses a shared recipient org into a single string), reuse the last organization string for all trailing entries.
- `is_at_recipient_institution`: `true` when the entry's `organization` equals `recipient_organization.legal_name`, compared case-insensitively and stripped of punctuation; `false` otherwise.
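The zip-by-index rules for co-PIs can be sketched as follows. The function names are hypothetical, and two details are assumptions: empty or `"N/A"` items inside a semicolon list keep their position (so indexes stay aligned), and whitespace runs are collapsed before organizations are compared.

```python
import re
from typing import List, Optional

ABSENT = {"", "n/a", "null"}


def _split_list(value: Optional[str]) -> List[Optional[str]]:
    """Split on ';', trim, and keep positions; absent items become None."""
    text = (value or "").strip()
    if text.lower() in ABSENT:
        return []
    parts = [part.strip() for part in text.split(";")]
    return [None if part.lower() in ABSENT else part for part in parts]


def _normalize(text: str) -> str:
    """Case-fold, drop punctuation, and collapse whitespace for comparison."""
    return " ".join(re.sub(r"[^\w\s]", "", text).casefold().split())


def build_co_pis(record: dict, recipient_legal_name: Optional[str]) -> List[dict]:
    names = _split_list(record.get("Co Principal Investigator Name"))
    emails = _split_list(record.get("Co Principal Investigator Email"))
    orgs = _split_list(record.get("Co Principal Investigator Organization"))
    entries = []
    for index, name in enumerate(names):
        if name is None:
            continue
        email = emails[index] if index < len(emails) else None
        if index < len(orgs):
            org = orgs[index]
        else:
            org = orgs[-1] if orgs else None  # reuse the last organization
        at_recipient = (
            org is not None
            and recipient_legal_name is not None
            and _normalize(org) == _normalize(recipient_legal_name)
        )
        entries.append({
            "role": "co-PI",
            "name": name,
            "email": email,
            "organization": org,
            "is_at_recipient_institution": at_recipient,
        })
    return entries
```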
sponsor_contacts¶
For each of the three contact blocks, emit an entry only when Name is not absent:
- `Managing Grants Official Name/Email/Phone` → role `"Managing Grants Official"`
- `Awarding Official Name/Email` → role `"Awarding Official"` (`phone: null`)
- `Managing Program Officer Name/Email/Phone` → role `"Managing Program Officer"`
Emit sponsor_contacts: [] when all three blocks are absent.
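A table-driven sketch of the three contact blocks. It assumes the Vandalizer keys are the role followed by `Name`, `Email`, or `Phone` (as the slash notation above suggests) and that the output entries use `name`, `email`, and `phone` keys; only `phone` is named explicitly in the rules.

```python
from typing import List, Optional

# (role, whether Vandalizer provides a Phone field for this block)
CONTACT_BLOCKS = [
    ("Managing Grants Official", True),
    ("Awarding Official", False),
    ("Managing Program Officer", True),
]


def _present(record: dict, key: str) -> Optional[str]:
    value = (record.get(key) or "").strip()
    return None if value.lower() in {"", "n/a", "null"} else value


def sponsor_contacts(record: dict) -> List[dict]:
    contacts = []
    for role, has_phone in CONTACT_BLOCKS:
        name = _present(record, f"{role} Name")
        if name is None:
            continue  # a block without a name is not emitted
        contacts.append({
            "role": role,
            "name": name,
            "email": _present(record, f"{role} Email"),
            "phone": _present(record, f"{role} Phone") if has_phone else None,
        })
    return contacts
```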
budget_categories¶
Emit the following entries in this order, preserving amount as a number. Skip an entry only when ALL its source fields are absent ("N/A"); a stated $0 or 0.00 is data and must be emitted as 0.
| Source key(s) | code | subcode | label |
|---|---|---|---|
| Senior Personnel Amount (+ Count, Calendar/Academic/Summer Months) | "A" | null | "Senior Personnel" |
| Post Doctoral Scholars Amount/Count/Months | "B" | "PostDoctoral" | "Post Doctoral Scholars" |
| Other Professionals Amount/Count/Months | "B" | "OtherProfessionals" | "Other Professionals" |
| Graduate Students Count/Amount | "B" | "GraduateStudents" | "Graduate Students" |
| Undergraduate Students Count/Amount | "B" | "UndergraduateStudents" | "Undergraduate Students" |
| Secretarial Clerical Count/Amount | "B" | "SecretarialClerical" | "Secretarial - Clerical" |
| Other Personnel Count/Amount | "B" | "Other" | "Other" |
| Fringe Benefits | "C" | null | "Fringe Benefits" |
| Equipment | "D" | null | "Equipment" |
| Travel Domestic | "E" | "Domestic" | "Domestic Travel" |
| Travel International | "E" | "International" | "International Travel" |
| Participant Support Costs Stipends | "F" | "Stipends" | "Participant Support Costs - Stipends" |
| Participant Support Costs Travel | "F" | "Travel" | "Participant Support Costs - Travel" |
| Participant Support Costs Subsistence | "F" | "Subsistence" | "Participant Support Costs - Subsistence" |
| Participant Support Costs Other | "F" | "Other" | "Participant Support Costs - Other" |
| Total Number Of Participants (count only; amount: null) | "F" | "TotalParticipants" | "Total Number of Participants" |
| Total Participant Costs | "F" | "Total" | "Total Participant Costs" |
| Materials Supplies | "G" | "MaterialsSupplies" | "Materials and Supplies" |
| Publication Costs | "G" | "Publication" | "Publication Costs" |
| Consultant Services | "G" | "ConsultantServices" | "Consultant Services" |
| Computer Services | "G" | "ComputerServices" | "Computer Services" |
| Subawards | "G" | "Subawards" | "Subawards" |
| Other Direct Costs Other | "G" | "Other" | "Other" |
| Total Other Direct Costs | "G" | "Total" | "Total Other Direct Costs" |
| Total Direct Costs | "H" | null | "Total Direct Costs" |
| Indirect Costs | "I" | null | "Indirect Costs" |
| Total Direct And Indirect Costs | "J" | null | "Total Direct and Indirect Costs" |
| Total Amount Of Request | "L" | null | "Amount of this Request" |
| Cost Sharing Proposed Level | "M" | null | "Cost Sharing Proposed Level" |
Do not emit entries for Total Salaries And Wages or Total Salaries Wages Fringe Benefits — they are computed rollups, not letter-coded lines in the NSF form, and are recoverable from the component rows.
Do not emit a budget_categories entry for Fees; it lives in the top-level fees scalar. The UDM budget code enum is ^[A-M]$.
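The skip-versus-zero rule is easiest to see in code. The sketch below covers only a few single-key rows from the table and only the `amount` value; handling of counts and months is omitted because their output field names are not given here. The `BUDGET_ROWS` subset and function names are illustrative.

```python
import re
from typing import List, Optional

# (source key, code, subcode, label): a subset of the mapping table.
BUDGET_ROWS = [
    ("Fringe Benefits", "C", None, "Fringe Benefits"),
    ("Equipment", "D", None, "Equipment"),
    ("Travel Domestic", "E", "Domestic", "Domestic Travel"),
    ("Subawards", "G", "Subawards", "Subawards"),
    ("Total Direct Costs", "H", None, "Total Direct Costs"),
]


def _money(value: Optional[str]):
    text = (value or "").strip()
    if text.lower() in {"", "n/a", "null"}:
        return None
    amount = float(re.sub(r"[$,\s]", "", text))
    return int(amount) if amount.is_integer() else amount


def budget_categories(record: dict) -> List[dict]:
    rows = []
    for key, code, subcode, label in BUDGET_ROWS:
        amount = _money(record.get(key))
        if amount is None:
            continue  # skip only when the source field is absent
        rows.append({"code": code, "subcode": subcode, "label": label, "amount": amount})
    return rows
```

Because the guard tests `is None` rather than truthiness, a stated `"$0"` survives as `0` while `"N/A"` drops the row.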
subawards¶
Apply the UDM subaward inference rule using the project_personnel entries produced above and the G.Subawards line:
- If at least one co-PI has `is_at_recipient_institution == false` AND the `G.Subawards` amount is greater than 0, emit one inferred entry per non-recipient co-PI:
{
"subawardee_name": <co-PI's organization>,
"pi_name": <co-PI's name>,
"pi_email": <co-PI's email or null>,
"description": "Implied subaward based on Co-PI <name> at <organization>. Aggregate Subawards line in Budget Category G totals $<amount>; individual subawardee allocation is not broken out in the notice.",
"obligated_amount": null,
"anticipated_amount": null,
"uei": null,
"inferred": true
}
- Otherwise emit `[]`.
Vandalizer never provides explicit subaward enumerations, so inferred is always true in this translator's output. If multiple co-PIs share the same non-recipient organization, emit one entry per co-PI (downstream can dedupe by subawardee_name).
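A sketch of the inference rule, taking the already-built `project_personnel` list and the parsed `G.Subawards` amount as inputs. The function name is hypothetical, and rendering the amount with thousands separators in the description is an assumption.

```python
from typing import List, Optional, Union


def infer_subawards(personnel: List[dict], subawards_amount: Optional[Union[int, float]]) -> List[dict]:
    """Emit one inferred entry per co-PI outside the recipient institution."""
    if subawards_amount is None or subawards_amount <= 0:
        return []
    entries = []
    for person in personnel:
        if person.get("role") != "co-PI":
            continue
        if person.get("is_at_recipient_institution") is not False:
            continue
        name, org = person.get("name"), person.get("organization")
        entries.append({
            "subawardee_name": org,
            "pi_name": name,
            "pi_email": person.get("email"),
            "description": (
                f"Implied subaward based on Co-PI {name} at {org}. "
                "Aggregate Subawards line in Budget Category G totals "
                f"${subawards_amount:,}; individual subawardee allocation "
                "is not broken out in the notice."
            ),
            "obligated_amount": None,
            "anticipated_amount": None,
            "uei": None,
            "inferred": True,
        })
    return entries
```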
linked_awards¶
Emit []. Vandalizer does not capture linked-award references.
terms_and_conditions¶
For each of these Vandalizer fields, emit a terms_and_conditions entry only when the value is not absent:
- `Authority Act` → `{"citation": <value>, "citation_date": null, "url": null, "applicability_notes": null}`
- `Research Terms And Conditions Date` → `{"citation": "Research Terms and Conditions", "citation_date": <ISO date>, "url": null, "applicability_notes": null}`
- `NSF Agency Specific Requirements Date` → `{"citation": "NSF Agency Specific Requirements", "citation_date": <ISO date>, "url": null, "applicability_notes": null}`
Emit [] when all three are absent.
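The three sources can be sketched as one function that emits entries in the order listed. The function and constant names are hypothetical.

```python
from datetime import datetime
from typing import List, Optional

DATED_TERMS = [
    ("Research Terms And Conditions Date", "Research Terms and Conditions"),
    ("NSF Agency Specific Requirements Date", "NSF Agency Specific Requirements"),
]


def _present(record: dict, key: str) -> Optional[str]:
    value = (record.get(key) or "").strip()
    return None if value.lower() in {"", "n/a", "null"} else value


def terms_and_conditions(record: dict) -> List[dict]:
    entries = []
    authority = _present(record, "Authority Act")
    if authority is not None:
        entries.append({"citation": authority, "citation_date": None,
                        "url": None, "applicability_notes": None})
    for key, citation in DATED_TERMS:
        raw = _present(record, key)
        if raw is None:
            continue
        iso = datetime.strptime(raw, "%m/%d/%Y").date().isoformat()
        entries.append({"citation": citation, "citation_date": iso,
                        "url": None, "applicability_notes": None})
    return entries
```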
special_conditions¶
Emit []. Vandalizer does not capture narrative conditions.
source_provenance¶
Always populate source_provenance:
- `extractor`: `"vandalizer-to-udm-translation"`
- `extractor_version`: `"0.1.0"` (this prompt's version)
- `upstream_extractor`: `"Vandalizer"`
- `upstream_extractor_version`: the version identifier when the runtime provides one; otherwise `null`
- `source_document`: the source document identifier the runtime provides (Vandalizer input filename, hash, or URI); otherwise `null`
- `extracted_at`: the timestamp the runtime provides; otherwise `null`
- `notes`: `null` unless the runtime supplies one
- `review_annotations`: see below
If the Vandalizer input's "what data was highlighted yellow in the original document?" field is present and not absent ("N/A", "", or "null"), emit one entry:
{
"label": "highlighted-yellow",
"value": <the verbatim value>,
"target_field": null,
"description": "Reviewer highlighted this content in the original document during Vandalizer extraction."
}
Otherwise review_annotations is [].
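A short sketch of the annotation rule. The function name is hypothetical; the value is passed through verbatim (untrimmed), and only the absence check looks at a trimmed, lowercased copy.

```python
from typing import List

HIGHLIGHT_KEY = "what data was highlighted yellow in the original document?"


def review_annotations(record: dict) -> List[dict]:
    raw = record.get(HIGHLIGHT_KEY)
    if raw is None or raw.strip().lower() in {"", "n/a", "null"}:
        return []
    return [{
        "label": "highlighted-yellow",
        "value": raw,  # verbatim, per the rule above
        "target_field": None,
        "description": (
            "Reviewer highlighted this content in the original document "
            "during Vandalizer extraction."
        ),
    }]
```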
Procedure¶
- Normalize every string scalar (trim whitespace; convert `"N/A"`, `""`, `"null"` to absent).
- Fill all UDM scalars (identity, dates, funding, indirect cost, `fees`).
- Build `recipient_organization` and `current_budget_period`.
- Build `project_personnel` (PI first, then co-PIs from zipped semicolon lists).
- Build `sponsor_contacts`.
- Emit every budget row per the mapping table.
- Apply the subaward inference rule against the `project_personnel` and the `G.Subawards` row.
- Emit `terms_and_conditions` and `special_conditions` (the latter always `[]`).
- Build `source_provenance`, including the `review_annotations` entry for the yellow-highlight field when present.
- Ensure every required array is present; re-check that `amendment_number` is `"000"`.
Quality standards¶
- Determinism. The same Vandalizer input must always produce the same UDM output.
- No fabrication. If Vandalizer lacks a field, emit `null` or the documented default; never invent values.
- Schema conformance. Output validates against `components/nsf-award-notice-extraction-udm/schema.json` v1.1.0.
- Typed fidelity. Numbers as numbers, booleans as booleans, ISO dates as strings.
- Preserve zeros. A stated `$0` or `0.00` remains `0`; it is data, not absence.
- Provenance always emitted. `source_provenance.extractor` and `source_provenance.upstream_extractor` are always populated so downstream ingest can distinguish translator output from direct-extractor output.
Produce the JSON now.
Output schema¶
Source: schema.json.
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://github.com/AI4RA/prompt-library/components/vandalizer-to-udm-translation/schema.json",
"title": "Vandalizer -> UDM NSF Award Notice Translation Output",
"description": "Local wrapper schema for the vandalizer-to-udm-translation component. The output contract delegates to the repo-local nsf-award-notice-extraction-udm schema so downstream consumers can validate translator output from this component directory without inferring the target schema from prose alone.",
"version": "0.1.0",
"target_component": "nsf-award-notice-extraction-udm",
"target_component_version": "1.1.0",
"allOf": [
{
"$ref": "../nsf-award-notice-extraction-udm/schema.json"
}
]
}
Evals¶
Reference cases¶
Golden cases under evals/cases/.
- `vandalizer-trial-2511003`: Vandalizer trial, FAIN 2511003 / MRI Track 1 AVITI System (artifacts: input, expected)
Changelog¶
Source: CHANGELOG.md.
All notable changes to this component. Versions follow semver: MAJOR for output-contract breaks, MINOR for backward-compatible additions, PATCH for wording or clarity with no behavior change expected.
This component conforms to (does not own) nsf-award-notice-extraction-udm/schema.json. When that schema's version changes, bump this component's version in the same PR.
[0.1.0] — 2026-04-20¶
- Initial version.
- Canonical prompt (`prompt.md`) translating Vandalizer flat-JSON NSF extractions to UDM v1.1.0 output.
- Deterministic field-by-field mapping with documented defaults for fields Vandalizer does not capture (`amendment_number` defaults to `"000"`; amendment type/date/description, recipient address/UEI/email, proposal number, and received-date emit `null`).
- NSF-format budget mapping table covering categories A through M, including B/E/F/G subcategories and the explicit skip of form rollups (`Total Salaries And Wages`, `Total Salaries Wages Fringe Benefits`).
- `fees` scalar populated from Vandalizer's `Fees` field (requires schema v1.1.0).
- `source_provenance` always emitted, with `upstream_extractor = "Vandalizer"` and a `review_annotations` entry for the Vandalizer yellow-highlight field when present.
- Subaward inference: always `inferred: true`, since Vandalizer never enumerates subrecipients.
- `linked_awards` and `special_conditions` always `[]`; Vandalizer does not capture them.
- First eval case: `vandalizer-trial-2511003` (MRI Track 1 AVITI System acquisition, Standard Grant, new-project obligation, Co-PI configuration at a single institution).