
vandalizer-to-udm-translation

Slug: vandalizer-to-udm-translation
Version: 0.1.0
Status: experimental
Last fully evaluated: 0.1.0
Eval state: current
Category: transformation
Domain: research-administration
Manifestations: prompt
Created: 2026-04-20
Updated: 2026-04-20

Tags: nsf award notice udm translation vandalizer structured-transformation json

Audience: ingest-pipelines, sponsored-programs-staff

Manifestations in repo: prompt.md

Converts a Vandalizer NSF-extraction JSON object (flat key/value form) into a single JSON object conforming to the nsf-award-notice-extraction-udm schema. This is a pure transformation — no information is invented, and fields Vandalizer does not capture emit as null or documented defaults.

Output contract: see schema.json (local wrapper delegating to nsf-award-notice-extraction-udm v1.1.0)

Inputs

One JSON object produced by the Vandalizer NSF extraction task. Shape:

  • Flat, single-level key/value — all values are strings.
  • Standard Vandalizer convention: "N/A" denotes absent values.
  • US-format dates (MM/DD/YYYY), currency with $ and commas ("$584,845"), percentages with % ("50.0000%"), semicolon-delimited lists for multi-value fields (Co-PI names/emails/organizations).
  • Flat NSF-format budget line items keyed by label (Senior Personnel Amount, Post Doctoral Scholars Count, etc.), mirroring the 18-category table.
  • A trailing review-metadata field: "what data was highlighted yellow in the original document?".

See evals/cases/vandalizer-trial-2511003/ for the seed input.

Outputs

A single JSON object conforming to the local schema.json wrapper, which delegates to nsf-award-notice-extraction-udm v1.1.0. See prompt.md for the field-by-field translation rules.

Scope and non-scope

In scope. Deterministic field-by-field translation. Format normalization (ISO dates, plain currency, plain percents, typed booleans). NSF-format budget line → UDM budget_categories code/subcode assignment. Subaward inference using the UDM rule (Co-PI at non-recipient org + non-zero G.Subawards). Carrying the Vandalizer review-highlight annotation through source_provenance.review_annotations.

Out of scope. Re-extracting missing fields from the original PDF. Vandalizer does not capture amendment metadata (number, type, date, description), recipient address/UEI/email, proposal number, or the email header's received-date. The translator emits null / documented defaults for these and does not attempt to recover them. Downstream systems that need these fields should run the full nsf-award-notice-extraction-udm extractor on the PDF instead, or extend Vandalizer's output schema.

Defaults and data-quality notes

  • amendment_number is required by UDM but absent from Vandalizer output. Translator emits "000" (new project / initial obligation) by default. Do not deploy this translator against amendment notices without first adding amendment fields to Vandalizer.
  • recipient_organization.legal_name is taken from Principal Investigator Organization. Address, email, and UEI are always null. Ingest consumers should treat recipient records produced by this translator as needing enrichment from a separate organization-resolution step.
  • subawards entries are always inferred: true (Vandalizer never itemizes subrecipients).
  • linked_awards is always [].
  • fees is populated from Vandalizer's Fees field (schema v1.1.0 scalar).
  • source_provenance.extractor = "vandalizer-to-udm-translation", upstream_extractor = "Vandalizer". Vandalizer's "what data was highlighted yellow in the original document?" field, when its value is not "N/A", is emitted as a review_annotations entry with label: "highlighted-yellow".

Relationship to other components

| Concern | nsf-award-notice-extraction-udm | vandalizer-to-udm-translation |
|---|---|---|
| Input | NSF Award Notice PDF (or pasted text) | Vandalizer flat-JSON extraction output |
| Category | extraction | transformation |
| Output schema | owns it | conforms to it |
| Field coverage | Full UDM contract | Subset: amendment metadata, recipient contact info, received-date all null |
| Subaward entries | explicit or inferred | always inferred |

Manifestations

  • prompt.md — canonical, LLM-agnostic prompt

Evals

See evals/cases/vandalizer-trial-2511003/ for the seed case — FAIN 2511003 (MRI: Track 1 AVITI System), a Vandalizer extraction of an NSF Standard Grant Amendment 000 notice from 2025.

Provenance

Authored 2026-04-20 in response to a trial Vandalizer extraction produced against an NSF-26-508-era award notice at the University of Idaho. The schema v1.1.0 bump (fees, source_provenance) was driven by gaps surfaced during the initial translator spec review.

Contract scope

  • Output format: json_object

  • Contract scope: delegated_repo_local_schema

  • Validation surfaces: json_schema, golden_eval_cases

  • Schema entrypoints: #

  • Notes: Translator wrapper contract. The local schema delegates to nsf-award-notice-extraction-udm/schema.json so downstream consumers have a concrete contract surface inside this component directory.

  • Machine-readable catalog entry: component_catalog.json

Triad integration

  • UDM alignment: delegated_repo_local_schema — Output aligns to the repo-local NSF award notice extraction schema, which itself is UDM-aligned but maintained in prompt-library.

  • Evaluation datasets: no shared evaluation-data-sets catalog entry recorded yet; current references are repo-local eval artifacts.

  • Harness notes: Treat this as a transformation component, not a source-of-truth extractor. Validate output against the local wrapper schema and remember that missing fields default to null or documented defaults based on the upstream Vandalizer payload.

  • Related component: nsf-award-notice-extraction-udm (delegates_output_schema_to) — This translator's output contract is the award-notice schema.

Prompt body

Source: prompt.md.


Vandalizer → UDM NSF Award Notice Translation — Prompt

Purpose: Convert a Vandalizer NSF-extraction JSON object (flat key/value form) into a single JSON object conforming to the nsf-award-notice-extraction-udm schema v1.1.0. This is a pure transformation — no information is invented, and fields Vandalizer does not capture emit as null or documented defaults.

Expected input: One JSON object produced by the Vandalizer NSF extraction task (flat, string-valued fields; "N/A" used for missing values).

Expected output: One JSON object validating against schema.json, which delegates to nsf-award-notice-extraction-udm v1.1.0. No prose, no markdown outside the JSON.


Prompt

You are a deterministic translator. Read one Vandalizer NSF extraction object and emit one UDM NSF Award Notice JSON object. Do not paraphrase, re-order, or summarize; map field by field using the rules below. The same input must always produce the same output.

Output contract

Emit one JSON object. No preamble, no trailing commentary. If the runtime requires a fenced block, wrap the object in a single ```json ... ``` block and emit nothing else. Every required UDM array (project_personnel, sponsor_contacts, budget_categories, subawards, linked_awards, terms_and_conditions, special_conditions) must be present; emit [] when empty.

Normalization rules

  • Missing values. Treat any Vandalizer value of "N/A", "", or the literal string "null" (case-insensitive) as absent. Emit null for scalars and omit the would-be list item for arrays.

  • Currency. Strip $, commas, whitespace. "$584,845" → 584845. "$0" → 0 (not null).

  • Dates. US MM/DD/YYYY → ISO YYYY-MM-DD. Missing → null.

  • Percent. Strip %. "50.0000%" → 50.0.

  • Booleans. "Yes" → true, "No" → false, absent → null.

  • JSON types. Emit 0 not "0", false not "false", null not "null".
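The normalization rules above can be sketched as a handful of small helpers. This is a minimal illustration, not the component's implementation: the function names are hypothetical, all three absence markers are assumed to match case-insensitively, and unrecognized boolean strings are assumed to map to null.

```python
from datetime import datetime
from typing import Optional, Union

ABSENT_MARKERS = {"n/a", "", "null"}


def is_absent(value: Optional[str]) -> bool:
    # Assumption: "N/A", "", and "null" are all matched case-insensitively.
    return value is None or value.strip().lower() in ABSENT_MARKERS


def parse_currency(value: Optional[str]) -> Union[int, float, None]:
    # Strip "$", commas, and whitespace; a stated zero stays 0, not None.
    if is_absent(value):
        return None
    cleaned = "".join(value.replace("$", "").replace(",", "").split())
    number = float(cleaned)
    return int(number) if number.is_integer() else number


def parse_date(value: Optional[str]) -> Optional[str]:
    # US MM/DD/YYYY -> ISO YYYY-MM-DD.
    if is_absent(value):
        return None
    return datetime.strptime(value.strip(), "%m/%d/%Y").date().isoformat()


def parse_percent(value: Optional[str]) -> Optional[float]:
    # Strip "%" and return a float.
    if is_absent(value):
        return None
    return float(value.replace("%", "").strip())


def parse_bool(value: Optional[str]) -> Optional[bool]:
    # Assumption: anything other than Yes/No is treated as absent.
    if is_absent(value):
        return None
    return {"yes": True, "no": False}.get(value.strip().lower())
```

Each helper returns Python `None` for absent input, which serializes to JSON `null` and keeps the "typed, not stringly" requirement intact.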

Scalar mappings

  • award_number ← Award Number (required; never null).

  • award_id ← "NSF-" + Award Number.

  • sponsor_name ← "National Science Foundation" (constant for this translator).

  • sponsor_award_number ← null (Vandalizer does not distinguish it from award_number).

  • award_title ← Project Title.

  • award_instrument ← Award Instrument.

  • managing_division ← Managing Division Abbreviation.

  • award_status ← null.

  • is_research_and_development ← Research And Development Award (Yes/No/absent).

  • is_collaborative_research ← true iff Project Title starts with "Collaborative Research:" (case-insensitive), else false.

  • proposal_number ← null.

  • award_date ← Award Date.

  • award_received_date ← null (Vandalizer does not capture the email header date).

  • start_date ← Award Period Start Date.

  • end_date ← Award Period End Date.

  • amount_obligated_this_amendment ← Amount Obligated By This Amendment.

  • total_intended_amount ← Total Intended Award Amount.

  • total_obligated_to_date ← Total Amount Obligated To Date.

  • cost_share_approved_amount ← Total Approved Cost Share Or Matching Amount (emit 0, not null, when input is "$0").

  • expenditure_limitation ← Expenditure Limitation.

  • indirect_cost_rate_percent ← Indirect Cost Rate (percent stripped).

  • indirect_cost_base ← "MTDC" when Modified Total Direct Costs holds a numeric amount; otherwise null. Vandalizer has no explicit base field, so a numeric value in that field is the only signal of the base; when it is "N/A" the base cannot be determined.

  • fees ← Fees (currency stripped). "$0" → 0; absent → null.
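Two of the scalar mappings above are derived rather than copied, and a short sketch may make them concrete. The function names are hypothetical, and "numeric amount" is assumed to mean a plain decimal number once "$", commas, and whitespace are removed.

```python
import re
from typing import Optional

_NUMERIC = re.compile(r"-?\d+(\.\d+)?")


def is_collaborative_research(project_title: Optional[str]) -> bool:
    # True iff the title starts with "Collaborative Research:" (case-insensitive).
    if project_title is None:
        return False
    return project_title.strip().lower().startswith("collaborative research:")


def indirect_cost_base(modified_total_direct_costs: Optional[str]) -> Optional[str]:
    # "MTDC" only when the Vandalizer field carries a numeric amount.
    if modified_total_direct_costs is None:
        return None
    cleaned = "".join(
        modified_total_direct_costs.replace("$", "").replace(",", "").split()
    )
    return "MTDC" if _NUMERIC.fullmatch(cleaned) else None
```

Note that `is_collaborative_research` never returns null: an absent title yields false, matching the "else false" wording of the rule.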

Funding Opportunity split

Tokenize Funding Opportunity on whitespace. The funding_opportunity_number runs from the leading alphabetic prefix up through the first subsequent token that matches ^[A-Z0-9][A-Z0-9\-]*$ (in practice the first two tokens: "NSF 23-519", "PD 23-221Y"). The remainder is the funding_opportunity_title; strip trailing punctuation (:, .).

  • Example: "NSF 23-519 Major Research Instrumentation Program:" → number "NSF 23-519", title "Major Research Instrumentation Program".

  • Example: "PD 23-221Y Growing Research Access for Nationally Transformative Equity and Diversity" → number "PD 23-221Y", title "Growing Research Access for Nationally Transformative Equity and Diversity".

If the input value does not match this pattern, emit the whole stripped string as funding_opportunity_title and funding_opportunity_number as null.
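A minimal sketch of the split, under a few assumptions that go beyond the rule text: the prefix is taken to be a single uppercase alphabetic token, the number token is required to contain at least one digit (so an all-caps title word is not mistaken for a number), and the fallback title also has trailing ":" and "." stripped. The function name is hypothetical.

```python
import re
from typing import Optional, Tuple

_NUMBER_TOKEN = re.compile(r"^[A-Z0-9][A-Z0-9\-]*$")
_ABSENT = {"n/a", "", "null"}


def split_funding_opportunity(
    value: Optional[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (funding_opportunity_number, funding_opportunity_title)."""
    if value is None or value.strip().lower() in _ABSENT:
        return None, None
    tokens = value.split()
    if (
        len(tokens) >= 2
        and tokens[0].isalpha()
        and tokens[0].isupper()
        and _NUMBER_TOKEN.match(tokens[1])
        and any(ch.isdigit() for ch in tokens[1])  # assumption, see lead-in
    ):
        number = f"{tokens[0]} {tokens[1]}"
        # Re-joining tokens collapses internal whitespace runs to single spaces.
        title = " ".join(tokens[2:]).rstrip(":. ") or None
        return number, title
    # No recognizable number: the whole string becomes the title.
    return None, value.strip().rstrip(":. ") or None
```

Applied to the first example above, this yields `("NSF 23-519", "Major Research Instrumentation Program")`.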

Assistance Listing split

Split Assistance Listing Number And Name on the first whitespace run following the leading dotted number (^\d{2}\.\d{3}). The number goes into cfda_number; the remainder, verbatim including any trailing parenthetical annotation, goes into cfda_name.

  • Example: "47.074 Biological Sciences (Predominant source of funding for SEFA reporting)" → number "47.074", name "Biological Sciences (Predominant source of funding for SEFA reporting)".
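The assistance-listing split can be sketched with one regular expression. The rule does not say what to do when the value lacks a leading dotted number; the sketch assumes the whole string then goes to cfda_name with cfda_number null. The function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Leading NN.NNN, then optionally a whitespace run and the verbatim remainder.
_LISTING = re.compile(r"^(\d{2}\.\d{3})(?:\s+(.*))?$", re.DOTALL)
_ABSENT = {"n/a", "", "null"}


def split_assistance_listing(
    value: Optional[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (cfda_number, cfda_name)."""
    if value is None or value.strip().lower() in _ABSENT:
        return None, None
    text = value.strip()
    match = _LISTING.match(text)
    if match is None:
        # Assumption: no dotted number found, keep everything as the name.
        return None, text
    return match.group(1), match.group(2) or None
```

The trailing parenthetical is preserved because the remainder is captured verbatim rather than tokenized.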

Amendment fields

Vandalizer does not capture amendment metadata. Emit the following defaults:

  • amendment_number = "000" (required by UDM; represents initial obligation).

  • amendment_type = null.

  • amendment_date = null.

  • amendment_description = null.

This default is correct for new-project notices. If a translator operator later determines the input represents an amendment, the amendment_number must be overridden out-of-band; this prompt does not infer amendment status.

recipient_organization

  • legal_name ← Principal Investigator Organization.

  • address, email, uei ← null (Vandalizer does not extract these).

Fallback: if Principal Investigator Organization is absent, use the first semicolon-separated value of Co Principal Investigator Organization. If still absent, emit legal_name as "UNKNOWN" — the ingest service will surface this as a data-quality issue.
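The fallback chain for legal_name can be sketched as follows. The input keys are the Vandalizer field names quoted above; the function name and the `record` dict are illustrative assumptions.

```python
from typing import Dict, Optional

_ABSENT = {"n/a", "", "null"}


def _is_absent(value: Optional[str]) -> bool:
    return value is None or value.strip().lower() in _ABSENT


def build_recipient_organization(record: Dict[str, str]) -> Dict[str, Optional[str]]:
    legal_name = "UNKNOWN"  # last resort; surfaced downstream as a data-quality issue
    pi_org = record.get("Principal Investigator Organization")
    co_orgs = record.get("Co Principal Investigator Organization")
    if not _is_absent(pi_org):
        legal_name = pi_org.strip()
    elif not _is_absent(co_orgs):
        # First semicolon-separated value of the co-PI organization list.
        first = co_orgs.split(";")[0].strip()
        if not _is_absent(first):
            legal_name = first
    # Vandalizer does not extract address, email, or UEI.
    return {"legal_name": legal_name, "address": None, "email": None, "uei": None}
```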

current_budget_period

Populate from the scalars:

  • period_number = 1

  • period_label = null

  • start_date ← Award Period Start Date

  • end_date ← Award Period End Date

  • direct_cost ← Total Direct Costs

  • indirect_cost ← Indirect Costs

  • obligated_amount ← Amount Obligated By This Amendment

Emit current_budget_period: null if, and only if, any of Award Period Start Date, Award Period End Date, or Amount Obligated By This Amendment is absent (the UDM schema requires all three).
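As a rough sketch of this block, the function below builds current_budget_period and returns None when any of the three required inputs is absent. Helper and function names are hypothetical; the currency and date handling follows the normalization rules stated earlier.

```python
from datetime import datetime
from typing import Any, Dict, Optional

_ABSENT = {"n/a", "", "null"}


def _is_absent(value: Optional[str]) -> bool:
    return value is None or value.strip().lower() in _ABSENT


def _money(value: Optional[str]):
    if _is_absent(value):
        return None
    number = float("".join(value.replace("$", "").replace(",", "").split()))
    return int(number) if number.is_integer() else number


def _iso_date(value: Optional[str]) -> Optional[str]:
    if _is_absent(value):
        return None
    return datetime.strptime(value.strip(), "%m/%d/%Y").date().isoformat()


def build_current_budget_period(record: Dict[str, str]) -> Optional[Dict[str, Any]]:
    start = _iso_date(record.get("Award Period Start Date"))
    end = _iso_date(record.get("Award Period End Date"))
    obligated = _money(record.get("Amount Obligated By This Amendment"))
    # All three are required by the UDM schema; without them emit null.
    if start is None or end is None or obligated is None:
        return None
    return {
        "period_number": 1,
        "period_label": None,
        "start_date": start,
        "end_date": end,
        "direct_cost": _money(record.get("Total Direct Costs")),
        "indirect_cost": _money(record.get("Indirect Costs")),
        "obligated_amount": obligated,
    }
```

A stated "$0" obligation still produces a period object, since zero is data rather than absence.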

project_personnel

Emit one entry for the PI when Principal Investigator Name is present:

{"role": "PI", "name": <PI Name>, "email": <PI Email or null>, "organization": <PI Org or null>, "is_at_recipient_institution": true}

Then split Co Principal Investigator Name, Co Principal Investigator Email, and Co Principal Investigator Organization on ";", trim whitespace, and zip by index. Emit one role: "co-PI" entry per name.

  • When there are fewer emails than names, the trailing entries get email: null.

  • When there are fewer organizations than names (the common case — Vandalizer often collapses a shared recipient org into a single string), reuse the last organization string for all trailing entries.

  • is_at_recipient_institution: true when the entry's organization equals recipient_organization.legal_name compared case-insensitively and stripped of punctuation; false otherwise.
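The zip-by-index behavior for co-PIs, including the two padding rules, might look like the sketch below. Function names are hypothetical, and the organization comparison additionally collapses whitespace runs, which is an assumption beyond "case-insensitive and stripped of punctuation".

```python
import re
from typing import Any, Dict, List, Optional

_ABSENT = {"n/a", "", "null"}


def _split_list(value: Optional[str]) -> List[Optional[str]]:
    # Split on ";" and trim; absent items become None so indexes stay aligned.
    if value is None or value.strip().lower() in _ABSENT:
        return []
    parts = [part.strip() for part in value.split(";")]
    return [None if part.lower() in _ABSENT else part for part in parts]


def _org_key(name: str) -> str:
    # Case-insensitive, punctuation-free comparison key.
    return " ".join(re.sub(r"[^\w\s]", "", name).casefold().split())


def build_co_pis(
    names: Optional[str],
    emails: Optional[str],
    organizations: Optional[str],
    recipient_legal_name: Optional[str],
) -> List[Dict[str, Any]]:
    email_list = _split_list(emails)
    org_list = _split_list(organizations)
    entries = []
    for index, name in enumerate(_split_list(names)):
        if name is None:
            continue
        email = email_list[index] if index < len(email_list) else None
        if index < len(org_list):
            org = org_list[index]
        else:
            # Fewer organizations than names: reuse the last one.
            org = org_list[-1] if org_list else None
        at_recipient = (
            org is not None
            and recipient_legal_name is not None
            and _org_key(org) == _org_key(recipient_legal_name)
        )
        entries.append(
            {
                "role": "co-PI",
                "name": name,
                "email": email,
                "organization": org,
                "is_at_recipient_institution": at_recipient,
            }
        )
    return entries
```

With two names, one email, and one organization, the second entry gets `email: None` and inherits the single organization string.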

sponsor_contacts

For each of the three contact blocks, emit an entry only when Name is not absent:

  • Managing Grants Official Name / Email / Phone → role "Managing Grants Official"

  • Awarding Official Name / Email → role "Awarding Official" (phone: null)

  • Managing Program Officer Name / Email / Phone → role "Managing Program Officer"

Emit sponsor_contacts: [] when all three blocks are absent.

budget_categories

Emit the following entries in this order, preserving amount as a number. Skip an entry only when ALL its source fields are absent ("N/A"); a stated $0 or 0.00 is data and must be emitted as 0.

| Source key(s) | code | subcode | label |
|---|---|---|---|
| Senior Personnel Amount (+ Count, Calendar/Academic/Summer Months) | "A" | null | "Senior Personnel" |
| Post Doctoral Scholars Amount/Count/Months | "B" | "PostDoctoral" | "Post Doctoral Scholars" |
| Other Professionals Amount/Count/Months | "B" | "OtherProfessionals" | "Other Professionals" |
| Graduate Students Count/Amount | "B" | "GraduateStudents" | "Graduate Students" |
| Undergraduate Students Count/Amount | "B" | "UndergraduateStudents" | "Undergraduate Students" |
| Secretarial Clerical Count/Amount | "B" | "SecretarialClerical" | "Secretarial - Clerical" |
| Other Personnel Count/Amount | "B" | "Other" | "Other" |
| Fringe Benefits | "C" | null | "Fringe Benefits" |
| Equipment | "D" | null | "Equipment" |
| Travel Domestic | "E" | "Domestic" | "Domestic Travel" |
| Travel International | "E" | "International" | "International Travel" |
| Participant Support Costs Stipends | "F" | "Stipends" | "Participant Support Costs - Stipends" |
| Participant Support Costs Travel | "F" | "Travel" | "Participant Support Costs - Travel" |
| Participant Support Costs Subsistence | "F" | "Subsistence" | "Participant Support Costs - Subsistence" |
| Participant Support Costs Other | "F" | "Other" | "Participant Support Costs - Other" |
| Total Number Of Participants (count only; amount: null) | "F" | "TotalParticipants" | "Total Number of Participants" |
| Total Participant Costs | "F" | "Total" | "Total Participant Costs" |
| Materials Supplies | "G" | "MaterialsSupplies" | "Materials and Supplies" |
| Publication Costs | "G" | "Publication" | "Publication Costs" |
| Consultant Services | "G" | "ConsultantServices" | "Consultant Services" |
| Computer Services | "G" | "ComputerServices" | "Computer Services" |
| Subawards | "G" | "Subawards" | "Subawards" |
| Other Direct Costs Other | "G" | "Other" | "Other" |
| Total Other Direct Costs | "G" | "Total" | "Total Other Direct Costs" |
| Total Direct Costs | "H" | null | "Total Direct Costs" |
| Indirect Costs | "I" | null | "Indirect Costs" |
| Total Direct And Indirect Costs | "J" | null | "Total Direct and Indirect Costs" |
| Total Amount Of Request | "L" | null | "Amount of this Request" |
| Cost Sharing Proposed Level | "M" | null | "Cost Sharing Proposed Level" |

Do not emit entries for Total Salaries And Wages or Total Salaries Wages Fringe Benefits — they are computed rollups, not letter-coded lines in the NSF form, and are recoverable from the component rows.

Do not emit a budget_categories entry for Fees; it lives in the top-level fees scalar. The UDM budget code enum is ^[A-M]$.
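The skip rule (drop a row only when all of its source fields are absent, keep a stated zero) can be sketched for a few rows of the table. This is an illustration only: it covers five rows, the exact Vandalizer count key names and the output key "count" are assumptions, and month fields are left out.

```python
from typing import Any, Dict, List, Optional

_ABSENT = {"n/a", "", "null"}

# (amount key, count key, code, subcode, label); a small subset of the table.
ROWS = [
    ("Senior Personnel Amount", "Senior Personnel Count", "A", None, "Senior Personnel"),
    ("Post Doctoral Scholars Amount", "Post Doctoral Scholars Count", "B", "PostDoctoral", "Post Doctoral Scholars"),
    ("Equipment", None, "D", None, "Equipment"),
    (None, "Total Number Of Participants", "F", "TotalParticipants", "Total Number of Participants"),
    ("Subawards", None, "G", "Subawards", "Subawards"),
]


def _is_absent(value: Optional[str]) -> bool:
    return value is None or value.strip().lower() in _ABSENT


def _number(value: Optional[str]):
    if _is_absent(value):
        return None
    number = float("".join(value.replace("$", "").replace(",", "").split()))
    return int(number) if number.is_integer() else number


def build_budget_categories(record: Dict[str, str]) -> List[Dict[str, Any]]:
    entries = []
    for amount_key, count_key, code, subcode, label in ROWS:
        raw_amount = record.get(amount_key) if amount_key else None
        raw_count = record.get(count_key) if count_key else None
        # Skip only when every source field for the row is absent.
        if _is_absent(raw_amount) and _is_absent(raw_count):
            continue
        entries.append(
            {
                "code": code,
                "subcode": subcode,
                "label": label,
                "amount": _number(raw_amount),  # "$0" stays 0
                "count": _number(raw_count),
            }
        )
    return entries
```

Because `ROWS` is an ordered list, the output order follows the table order regardless of key order in the input object.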

subawards

Apply the UDM subaward inference rule using the project_personnel entries produced above and the G.Subawards line:

  • If at least one co-PI has is_at_recipient_institution == false AND the G.Subawards amount is greater than 0, emit one inferred entry per non-recipient co-PI:
{
  "subawardee_name": <co-PI's organization>,
  "pi_name": <co-PI's name>,
  "pi_email": <co-PI's email or null>,
  "description": "Implied subaward based on Co-PI <name> at <organization>. Aggregate Subawards line in Budget Category G totals $<amount>; individual subawardee allocation is not broken out in the notice.",
  "obligated_amount": null,
  "anticipated_amount": null,
  "uei": null,
  "inferred": true
}
  • Otherwise emit [].

Vandalizer never provides explicit subaward enumerations, so inferred is always true in this translator's output. If multiple co-PIs share the same non-recipient organization, emit one entry per co-PI (downstream can dedupe by subawardee_name).
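The inference rule can be sketched as a pure function over the already-built project_personnel list and the normalized G.Subawards amount. The function name is hypothetical, and rendering the amount in the description with thousands separators is an assumption; the rule text only shows `$<amount>`.

```python
from typing import Any, Dict, List, Optional, Union


def infer_subawards(
    project_personnel: List[Dict[str, Any]],
    subawards_amount: Optional[Union[int, float]],
) -> List[Dict[str, Any]]:
    # No G.Subawards money means no inferred subawards.
    if subawards_amount is None or subawards_amount <= 0:
        return []
    inferred = []
    for person in project_personnel:
        if person.get("role") != "co-PI":
            continue
        if person.get("is_at_recipient_institution") is not False:
            continue
        name = person.get("name")
        org = person.get("organization")
        inferred.append(
            {
                "subawardee_name": org,
                "pi_name": name,
                "pi_email": person.get("email"),
                "description": (
                    f"Implied subaward based on Co-PI {name} at {org}. "
                    "Aggregate Subawards line in Budget Category G totals "
                    f"${subawards_amount:,}; individual subawardee allocation "
                    "is not broken out in the notice."
                ),
                "obligated_amount": None,
                "anticipated_amount": None,
                "uei": None,
                "inferred": True,
            }
        )
    return inferred
```

One entry is produced per non-recipient co-PI, so two co-PIs at the same outside organization yield two entries with the same subawardee_name.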

linked_awards

Emit []. Vandalizer does not capture linked-award references.

terms_and_conditions

For each of these Vandalizer fields, emit a terms_and_conditions entry only when the value is not absent:

  • Authority Act → {"citation": <value>, "citation_date": null, "url": null, "applicability_notes": null}

  • Research Terms And Conditions Date → {"citation": "Research Terms and Conditions", "citation_date": <ISO date>, "url": null, "applicability_notes": null}

  • NSF Agency Specific Requirements Date → {"citation": "NSF Agency Specific Requirements", "citation_date": <ISO date>, "url": null, "applicability_notes": null}

Emit [] when all three are absent.
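A compact sketch of the three terms_and_conditions mappings, using the Vandalizer field names listed above. The function and helper names are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional

_ABSENT = {"n/a", "", "null"}

# Date-valued fields and the fixed citation text each one produces.
_DATED_TERMS = (
    ("Research Terms And Conditions Date", "Research Terms and Conditions"),
    ("NSF Agency Specific Requirements Date", "NSF Agency Specific Requirements"),
)


def _is_absent(value: Optional[str]) -> bool:
    return value is None or value.strip().lower() in _ABSENT


def _entry(citation: str, citation_date: Optional[str]) -> Dict[str, Any]:
    return {
        "citation": citation,
        "citation_date": citation_date,
        "url": None,
        "applicability_notes": None,
    }


def build_terms_and_conditions(record: Dict[str, str]) -> List[Dict[str, Any]]:
    entries = []
    authority = record.get("Authority Act")
    if not _is_absent(authority):
        # The Authority Act value itself is the citation; it has no date.
        entries.append(_entry(authority.strip(), None))
    for key, citation in _DATED_TERMS:
        raw = record.get(key)
        if not _is_absent(raw):
            iso = datetime.strptime(raw.strip(), "%m/%d/%Y").date().isoformat()
            entries.append(_entry(citation, iso))
    return entries
```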

special_conditions

Emit []. Vandalizer does not capture narrative conditions.

source_provenance

Always populate source_provenance:

  • extractor: "vandalizer-to-udm-translation"

  • extractor_version: "0.1.0" (this prompt's version)

  • upstream_extractor: "Vandalizer"

  • upstream_extractor_version: the version identifier when the runtime provides one; otherwise null

  • source_document: the source document identifier the runtime provides (Vandalizer input filename, hash, or URI); otherwise null

  • extracted_at: the timestamp the runtime provides; otherwise null

  • notes: null unless the runtime supplies one

  • review_annotations: see below

If the Vandalizer input's "what data was highlighted yellow in the original document?" field is present and its value is not absent (that is, not "N/A", "", or "null"), emit one entry:

{
  "label": "highlighted-yellow",
  "value": <the verbatim value>,
  "target_field": null,
  "description": "Reviewer highlighted this content in the original document during Vandalizer extraction."
}

Otherwise review_annotations is [].
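The review_annotations rule reduces to a single conditional. In the sketch below the function name is hypothetical; the value is passed through untrimmed to honor "verbatim".

```python
from typing import Any, Dict, List

HIGHLIGHT_KEY = "what data was highlighted yellow in the original document?"
_ABSENT = {"n/a", "", "null"}


def build_review_annotations(record: Dict[str, str]) -> List[Dict[str, Any]]:
    raw = record.get(HIGHLIGHT_KEY)
    if raw is None or raw.strip().lower() in _ABSENT:
        return []
    return [
        {
            "label": "highlighted-yellow",
            "value": raw,  # verbatim, not normalized
            "target_field": None,
            "description": (
                "Reviewer highlighted this content in the original document "
                "during Vandalizer extraction."
            ),
        }
    ]
```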

Procedure

  1. Normalize every string scalar (trim whitespace; convert "N/A", "", "null" to absent).

  2. Fill all UDM scalars (identity, dates, funding, indirect cost, fees).

  3. Build recipient_organization and current_budget_period.

  4. Build project_personnel (PI first, then co-PIs from zipped semicolon lists).

  5. Build sponsor_contacts.

  6. Emit every budget row per the mapping table.

  7. Apply the subaward inference rule against the project_personnel and the G.Subawards row.

  8. Emit terms_and_conditions and special_conditions (the latter always []).

  9. Build source_provenance, including the review_annotations entry for the yellow-highlight field when present.

  10. Ensure every required array is present; re-check that amendment_number is "000".

Quality standards

  1. Determinism. The same Vandalizer input must always produce the same UDM output.

  2. No fabrication. If Vandalizer lacks a field, emit null or the documented default — never invent values.

  3. Schema conformance. Output validates against components/nsf-award-notice-extraction-udm/schema.json v1.1.0.

  4. Typed fidelity. Numbers as numbers, booleans as booleans, ISO dates as strings.

  5. Preserve zeros. A stated $0 or 0.00 remains 0 — it is data, not absence.

  6. Provenance always emitted. source_provenance.extractor and source_provenance.upstream_extractor are always populated so downstream ingest can distinguish translator output from direct-extractor output.

Produce the JSON now.

Output schema

Source: schema.json.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://github.com/AI4RA/prompt-library/components/vandalizer-to-udm-translation/schema.json",
  "title": "Vandalizer -> UDM NSF Award Notice Translation Output",
  "description": "Local wrapper schema for the vandalizer-to-udm-translation component. The output contract delegates to the repo-local nsf-award-notice-extraction-udm schema so downstream consumers can validate translator output from this component directory without inferring the target schema from prose alone.",
  "version": "0.1.0",
  "target_component": "nsf-award-notice-extraction-udm",
  "target_component_version": "1.1.0",
  "allOf": [
    {
      "$ref": "../nsf-award-notice-extraction-udm/schema.json"
    }
  ]
}

Evals

Reference cases

Golden cases under evals/cases/.

  • vandalizer-trial-2511003 — Vandalizer trial — FAIN 2511003 / MRI Track 1 AVITI System (artifacts: input, expected)

Changelog

Source: CHANGELOG.md.

All notable changes to this component. Versions follow semver: MAJOR for output-contract breaks, MINOR for backward-compatible additions, PATCH for wording or clarity with no behavior change expected.

This component conforms to (does not own) nsf-award-notice-extraction-udm/schema.json. When that schema's version changes, bump this component's version in the same PR.

[0.1.0] — 2026-04-20

  • Initial version.
  • Canonical prompt (prompt.md) translating Vandalizer flat-JSON NSF extractions to UDM v1.1.0 output.
  • Deterministic field-by-field mapping with documented defaults for fields Vandalizer does not capture (amendment_number defaults to "000"; amendment type/date/description, recipient address/UEI/email, proposal number, and received-date emit null).
  • NSF-format budget mapping table covering categories A through M, including B/E/F/G subcategories and the explicit skip of form rollups (Total Salaries And Wages, Total Salaries Wages Fringe Benefits).
  • fees scalar populated from Vandalizer's Fees field (requires schema v1.1.0).
  • source_provenance always emitted, with upstream_extractor = "Vandalizer" and a review_annotations entry for the Vandalizer yellow-highlight field when present.
  • Subaward inference: always inferred: true, since Vandalizer never enumerates subrecipients.
  • linked_awards and special_conditions always [] — Vandalizer does not capture them.
  • First eval case: vandalizer-trial-2511003 — MRI Track 1 AVITI System acquisition, Standard Grant, new-project obligation, Co-PI configuration at a single institution.