document-type-classifier-udm¶
Version: 1.0.0
Tags: classification routing triage udm research-administration document-type
Audience: ingest-pipelines, sponsored-programs-staff, routing-layer
Manifestations in repo: prompt.md · skill/SKILL.md
Classifies a research-administration document into a controlled type vocabulary so downstream pipelines can route it to the correct extractor or reviewer. One call in, one small JSON object out.
Output contract: see schema.json
Inputs¶
Text of the document, typically the first N pages (or the full body for short files) as OCR markdown or plain text. The component is text-only; upstream is responsible for PDF-to-text conversion.
Optional metadata the caller may include inline at the top of the input (the prompt ignores these when they conflict with the visible content):
- `Filename:` for the originating filename
- `Source:` for the originating system (email attachment, Grants.gov download, internal upload)
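A minimal sketch of how a caller might prepend these hints, assuming one hint per line followed by a blank line before the document text. The helper name `build_classifier_input` and that exact layout are illustrative assumptions; the source only says the hints appear inline at the top of the input.

```python
def build_classifier_input(document_text, filename=None, source=None):
    """Prefix optional Filename:/Source: hint lines to the document text."""
    header = []
    if filename:
        header.append(f"Filename: {filename}")
    if source:
        header.append(f"Source: {source}")
    if not header:
        return document_text
    # Assumed layout: a blank line separates the hints from the visible content.
    return "\n".join(header) + "\n\n" + document_text


example = build_classifier_input(
    "NOTICE OF AWARD\nPeriod of performance and obligation amount follow.",
    filename="noa.pdf",            # made-up filename
    source="email attachment",
)
print(example)
```

Because the prompt lets the visible content win over the hints, a caller can pass them even when the filename is unreliable.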
Outputs¶
One JSON object with:
- `document_type`: the top-ranked code from the controlled vocabulary
- `confidence`: 0–1
- `evidence_excerpt`: a verbatim quote from the input that grounds the classification
- `rationale`: one short sentence explaining the decision
- `secondary_candidates`: zero or more runner-up types, each with `document_type`, `confidence`, and `rationale`. Emit entries only when the top confidence is below 0.8 or when a plausible alternative exceeds 0.3.
See schema.json for the authoritative definition.
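Several of the output rules relate one field to another (the excerpt must come from the input, runner-ups must differ from and not outrank the top choice), so a caller may want a check alongside schema validation. The following is a hedged sketch of such a check; the function name `check_cross_field_rules` and the sample letter are made up for illustration.

```python
def check_cross_field_rules(output, input_text):
    """Return a list of rule violations; an empty list means the object passes."""
    problems = []
    top_code = output["document_type"]
    top_conf = output["confidence"]
    secondaries = output["secondary_candidates"]

    # evidence_excerpt must be copyable from the input, not paraphrased.
    if output["evidence_excerpt"] not in input_text:
        problems.append("evidence_excerpt is not a verbatim substring of the input")

    confs = [s["confidence"] for s in secondaries]
    if any(s["document_type"] == top_code for s in secondaries):
        problems.append("a secondary candidate repeats the top-level document_type")
    if any(c > top_conf for c in confs):
        problems.append("a secondary confidence exceeds the top-level confidence")
    if confs != sorted(confs, reverse=True):
        problems.append("secondary_candidates are not sorted by confidence descending")
    # Entries are allowed only if top < 0.8 or some alternative exceeds 0.3.
    if secondaries and top_conf >= 0.8 and not any(c > 0.3 for c in confs):
        problems.append("secondary_candidates emitted for an unambiguous classification")
    return problems


# Made-up input and output, for illustration only.
text = "To Whom It May Concern: We are pleased to collaborate with Dr. Rivera."
result = {
    "document_type": "letter_support",
    "confidence": 0.6,
    "evidence_excerpt": "We are pleased to collaborate with Dr. Rivera.",
    "rationale": "Salutation plus a collaboration statement, but no proposal is named.",
    "secondary_candidates": [
        {"document_type": "other", "confidence": 0.35,
         "rationale": "Could be general correspondence."}
    ],
}
print(check_cross_field_rules(result, text))  # prints []
```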
Controlled vocabulary¶
| Code | Covers |
|---|---|
| `solicitation` | RFP, RFA, FOA, NOFO, program announcement |
| `proposal_narrative` | Project description, research plan, specific aims |
| `biosketch` | NSF or NIH biographical sketch |
| `current_pending` | Current and Pending (Other) Support |
| `facilities` | Facilities, Equipment, and Other Resources |
| `data_mgmt` | Data Management Plan or Data Management & Sharing plan |
| `letter_support` | Letter of support, commitment, or collaboration |
| `budget` | Budget workbook, budget table, or budget form |
| `budget_justification` | Narrative budget justification |
| `award_notice` | Initial award notice / Notice of Award |
| `award_amendment` | Amendment or modification to an existing award |
| `jit_response` | Just-in-Time submission |
| `closeout_letter` | Closeout notification or final report package |
| `other` | None of the above |
The code values here are shared with other prompt-library components used in UDM-oriented pipelines. solicitation is what solicitation-doc-modifications-udm classifies on; award_notice / award_amendment gate award-document-extraction-udm; the proposal-component codes (biosketch, current_pending, facilities, data_mgmt, letter_support, budget_justification) align with sponsor-doc-defaults-udm.
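One way a routing layer could use this shared vocabulary is a plain lookup table. The sketch below treats the relationships named above as dispatch targets, which is an interpretation rather than something the source spells out; the `human-review` fallback, the `MIN_CONFIDENCE` threshold, and the `route` function are hypothetical.

```python
# Code-to-component mapping taken from the relationships described above.
ROUTES = {
    "solicitation": "solicitation-doc-modifications-udm",
    "award_notice": "award-document-extraction-udm",
    "award_amendment": "award-document-extraction-udm",
}
# Proposal-component codes that align with sponsor-doc-defaults-udm.
for _code in ("biosketch", "current_pending", "facilities",
              "data_mgmt", "letter_support", "budget_justification"):
    ROUTES[_code] = "sponsor-doc-defaults-udm"

HUMAN_REVIEW = "human-review"  # hypothetical fallback queue
MIN_CONFIDENCE = 0.7           # hypothetical threshold, not from the source


def route(output):
    """Pick a downstream target for one classifier output object."""
    if output["confidence"] < MIN_CONFIDENCE:
        return HUMAN_REVIEW
    # Codes with no listed component also fall back to review.
    return ROUTES.get(output["document_type"], HUMAN_REVIEW)


print(route({"document_type": "award_amendment", "confidence": 0.92}))
```

Codes without a named sibling component (for example `budget` or `other`) fall through to the fallback in this sketch; a real pipeline would decide that per deployment.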
Manifestations¶
- `prompt.md`: canonical, LLM-agnostic prompt
- `skill/SKILL.md`: Claude Skill form
Schema¶
schema.json is a JSON Schema (draft 2020-12) defining the full output contract. Any conforming validator can check classifier output against it; routing layers should gate on the result.
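Where a JSON Schema validator is not available, the structural part of the contract can be approximated by hand. The sketch below mirrors the required keys, the code list, the confidence range, and the non-empty strings using only the standard library; it is a stand-in under those assumptions, not a replacement for validating against schema.json, and the function name `passes_structural_gate` is hypothetical.

```python
CODES = {
    "solicitation", "proposal_narrative", "biosketch", "current_pending",
    "facilities", "data_mgmt", "letter_support", "budget",
    "budget_justification", "award_notice", "award_amendment",
    "jit_response", "closeout_letter", "other",
}
TOP_KEYS = {"document_type", "confidence", "evidence_excerpt",
            "rationale", "secondary_candidates"}
SECONDARY_KEYS = {"document_type", "confidence", "rationale"}


def _is_code(value):
    return isinstance(value, str) and value in CODES


def _is_confidence(value):
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return 0 <= value <= 1


def _is_text(value):
    return isinstance(value, str) and len(value) >= 1


def passes_structural_gate(obj):
    """True when obj has exactly the required keys with well-typed values."""
    if not isinstance(obj, dict) or set(obj) != TOP_KEYS:
        return False
    if not (_is_code(obj["document_type"]) and _is_confidence(obj["confidence"])):
        return False
    if not (_is_text(obj["evidence_excerpt"]) and _is_text(obj["rationale"])):
        return False
    candidates = obj["secondary_candidates"]
    if not isinstance(candidates, list):
        return False
    for entry in candidates:
        if not isinstance(entry, dict) or set(entry) != SECONDARY_KEYS:
            return False
        if not (_is_code(entry["document_type"])
                and _is_confidence(entry["confidence"])
                and _is_text(entry["rationale"])):
            return False
    return True
```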
Evals¶
See evals/ for reference inputs and known-good outputs. The initial set exercises one clear solicitation, one clear award notice from a non-NSF sponsor, and one deliberately ambiguous short letter that should produce a dominant choice plus a plausible secondary candidate.
Provenance¶
Schema designed 2026-04-18 in response to issue #8. Vocabulary derived from the set of document types already consumed by sibling UDM components so that the classifier's output can route directly into those extractors and reviewers.
Contract scope¶
- Output format: `json_object`
- Contract scope: `repo_local_shared_component_vocabulary`
- Validation surfaces: `json_schema`, `golden_eval_cases`
- Schema entrypoints: `#`
- Notes: Repo-local routing contract. The document_type enum is shared with sibling prompt-library components and ingest pipelines, but it is not a shared AI4RA-UDM table or externally versioned schema.
- Machine-readable catalog entry: `component_catalog.json`
Triad integration¶
- UDM alignment: `repo_local_shared_component_vocabulary`. The classifier supports UDM-oriented routing workflows, but its enum is maintained in this repo as a local cross-component contract rather than in the shared UDM repository.
- Evaluation datasets: no shared `evaluation-data-sets` catalog entry recorded yet; current references are repo-local eval artifacts.
- Harness notes: Treat prompt.md as the canonical invocation surface, validate output against schema.json, and use the repo-local eval cases as the current reference set until a shared dataset entry is registered.
- Related component: `sponsor-doc-defaults-udm` (`shares_document_code_vocabulary`). The classifier emits codes that downstream sponsor/defaults workflows recognize.
- Related component: `solicitation-doc-modifications-udm` (`shares_document_code_vocabulary`). Shared code values let solicitation outputs route by the same document labels.
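Before a harness can validate output against schema.json, it has to turn the model reply into an object. The prompt allows the object to arrive either bare or wrapped in a single fenced json block, so a parser can accept both. This is a sketch under that assumption; `parse_reply` is a hypothetical name, and the fence marker is built indirectly only so the example can sit inside a fenced block itself.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built indirectly for display purposes
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*)\n{FENCE}\s*$", re.DOTALL)


def parse_reply(reply):
    """Parse a model reply that is either bare JSON or one fenced json block."""
    text = reply.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    obj = json.loads(text)  # raises a ValueError subclass on malformed JSON
    if not isinstance(obj, dict):
        raise ValueError("expected a single JSON object")
    return obj


fenced = FENCE + 'json\n{"document_type": "other"}\n' + FENCE
print(parse_reply(fenced))
```

Anything other than a single object (prose around the JSON, an array, two objects) is rejected here, which matches the "emit nothing else" instruction in the prompt.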
Prompt body¶
Source: prompt.md.
Document Type Classifier — UDM¶
Purpose: Given the text of a single research-administration document, emit one JSON object identifying its type from a controlled vocabulary. The output is used upstream of extraction pipelines to pick the right downstream component.
Expected input: Plain text or OCR markdown of the document. Typically the first N pages; for short files (under ~3 pages) usually the whole body. May be preceded by optional `Filename:` / `Source:` hints from the caller; do not over-weight these, because the content always wins.
Expected output: A single JSON object that validates against `schema.json`. No prose, no markdown outside the JSON.
Prompt¶
You are a document type classifier for research administration. Read the provided document text and identify which of the controlled vocabulary codes best describes the document. Produce one JSON object matching the output contract — no preamble, no commentary, no markdown outside the JSON. If the runtime requires a fenced block, wrap the object in a single ```json ... ``` block and emit nothing else.
Output contract¶
Emit one object with these fields:
- `document_type`: the top-ranked code.
- `confidence`: number in `[0, 1]`, calibrated per the rule below.
- `evidence_excerpt`: a verbatim substring of the input (copy-paste, not paraphrase) that grounds the classification. Roughly one sentence or one header line.
- `rationale`: one short sentence naming the indicator that tipped the classification.
- `secondary_candidates`: array of runner-up objects, possibly empty. Emit entries only when the top confidence is below 0.8 OR when a plausible alternative scores above 0.3. Each entry has `document_type` (must differ from the top-level code), `confidence` (≤ top-level confidence), and a one-sentence `rationale`. Sort by confidence descending.
Controlled vocabulary¶
Classify into exactly one of these codes:
| Code | Covers | Typical positive indicators |
| --- | --- | --- |
| solicitation | RFP / RFA / FOA / NOFO / program announcement | Sponsor name followed by a program identifier (e.g., PD 23-221Y, PA-25-123, RFA-AI-24-456); sections titled "Program Description", "Eligibility", "Proposal Preparation", "Merit Review Criteria", "Due Dates"; prose directed at prospective applicants ("Proposals are invited…", "This solicitation supports…"). |
| proposal_narrative | Project description, research plan, specific aims | Sections titled "Project Description", "Research Plan", "Specific Aims", "Background/Introduction/Methods/Broader Impacts"; written in the proposer's voice, citing the literature; proposes work to be done. |
| biosketch | NSF or NIH biographical sketch | NSF: "Professional Preparation", "Appointments and Positions", "Products" (with subheads "Products Most Closely Related" / "Other Significant Products"), "Synergistic Activities". NIH: "Positions and Honors", "Contribution to Science", "Personal Statement", "Complete List of Published Work in MyBibliography". Single-person focused, chronological. |
| current_pending | Current and Pending (Other) Support | Tabular per-person listing of active and pending awards, each with project title, sponsor, support type, total award amount, start and end dates, person-months committed. May span multiple people. Often generated by SciENcv. |
| facilities | Facilities, Equipment, and Other Resources | Descriptions of laboratory space, shared instrumentation, computing, libraries, clinical resources, office space; no proposed work; written to demonstrate institutional capacity. Often titled "Facilities, Equipment, and Other Resources". |
| data_mgmt | Data Management Plan or Data Management & Sharing plan | Discussion of data types to be generated, formats, metadata standards, retention periods, sharing plans, repositories, privacy/IRB considerations, timelines for release. NSF: "Data Management Plan" (short). NIH: "Data Management and Sharing Plan" (longer, structured). |
| letter_support | Letter of support, commitment, or collaboration | Letterhead, dated, salutation ("Dear Dr. X" or "To Whom It May Concern"), a short commitment statement referencing a specific proposal or PI, signature block. Usually one page. |
| budget | Structured budget workbook or table | Dense numeric tabular layout with budget category rows (A-M for NSF; personnel / fringe / travel / supplies / etc. for general) and dollar-amount columns, usually across project years. Often exported from an institutional budget tool. Little or no narrative. |
| budget_justification | Narrative budget justification | Prose walking category-by-category through the budget, naming personnel and effort percentages, explaining equipment and travel, and citing the indirect cost rate and base. Often structured as sections A through H (NSF). |
| award_notice | Initial Notice of Award | Sponsor-branded notice block, "Award Number" / "FAIN", period of performance, obligation amount, recipient information, terms and conditions citations. For NSF, "Amendment Number" = "000". For NIH, "Notice of Award" header with a "Type 1" or "New" action code. |
| award_amendment | Modification or amendment | Same structural family as award_notice but explicitly a modification: "Amendment Number" other than "000", "Administrative Amendment", "No-Cost Extension", "Supplemental", or an NIH "Type 3" / "Type 4" / "Type 5" revision. References an existing award. |
| jit_response | Just-in-Time submission | Explicit "Just-in-Time" or "JIT" label; pre-award request for updated other support, IRB/IACUC approval documentation, revised budget, or human subjects information; directed to a program officer or grants management specialist. |
| closeout_letter | Closeout notification or final-report package | "Closeout", "Final Report", "Final Federal Financial Report (FFR)", "Final Invention Statement", "Final Progress Report", "Property Report"; sponsor-initiated notification or recipient-initiated submission marking the end of a project. |
| other | None of the above apply | Use when no positive indicator fits. Prefer other over a speculative match for an unrelated document (emails that are not award notices, internal memos, scholarly manuscripts, contracts for services, general correspondence). |
Classification procedure¶
- Scan the first 500–1000 characters for a self-identifying phrase: a boxed header, a title line, or an explicit label. Those are the highest-signal indicators. Record the exact text for `evidence_excerpt` if it resolves the classification.
- If no explicit label is present, look for the structural indicators listed above. Combine multiple weak indicators before committing.
- Apply the sponsor-agnostic rule. The vocabulary is sponsor-neutral. Do not refuse to classify a document just because the sponsor is one you have not seen. An NIH Notice of Award is `award_notice`, the same as an NSF one.
- Apply the initial-vs-amendment rule for award documents: if the document is an award notice and explicitly shows an amendment / modification / no-cost-extension marker, classify as `award_amendment`. Otherwise, `award_notice`.
- Prefer `other` over a guess when the indicators are weak. Low-confidence `other` is better than a speculative routing decision downstream.
Confidence calibration¶
- 0.9–1.0: the document carries a canonical header or explicit self-identifying phrase that names the type (e.g., "NSF Award Notice", "Data Management Plan", "Biographical Sketch"). `evidence_excerpt` quotes that phrase.
- 0.7–0.9: strong structural indicators present (section headings, tabular layouts, or writing voice) but no literal self-identification.
- 0.5–0.7: multiple types plausibly fit; structural cues are partial or mixed. Emit `secondary_candidates` in this band.
- Below 0.5: evidence is weak; prefer `document_type: "other"` with a short rationale, and list the most-plausible coded candidates under `secondary_candidates`.
secondary_candidates rule¶
- Emit entries only when top confidence is below 0.8 or when a plausible alternative exceeds 0.3.
- Each secondary entry's `document_type` must differ from the top-level `document_type`.
- Each secondary entry's `confidence` must not exceed the top-level confidence.
- Sort descending by `confidence`.
- Emit an empty array (`[]`), not `null`, when the classification is unambiguous.
Quality standards¶
- Evidence-grounded: `evidence_excerpt` is a verbatim substring of the input. A classification without a quotable indicator should be low-confidence `other`.
- Controlled vocabulary only: `document_type` must be one of the enumerated codes. Do not invent new codes; if no code fits, use `other`.
- Calibrated confidence: follow the bands above. Over-confident classification is a worse failure than cautious `other`.
- Sponsor-agnostic: the vocabulary does not encode sponsor identity. An NIH NOA, an NSF NOA, and a DoE award letter are all `award_notice`.
- No fabrication: do not infer a type from absence of information. If the document is truncated or unreadable, classify as `other` and say so in `rationale`.
- Schema conformance: output validates against `schema.json`.
Produce the JSON now.
Output schema¶
Source: schema.json.
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://github.com/AI4RA/prompt-library/components/document-type-classifier-udm/schema.json",
"title": "Document Type Classifier \u2014 UDM Output",
"description": "Single JSON object classifying a research-administration document into a controlled type vocabulary. Consumed by routing layers that dispatch the document to the appropriate downstream extractor or reviewer. One object per input document.",
"version": "1.0.0",
"type": "object",
"additionalProperties": false,
"required": [
"document_type",
"confidence",
"evidence_excerpt",
"rationale",
"secondary_candidates"
],
"properties": {
"document_type": {
"$ref": "#/$defs/documentTypeCode",
"description": "Top-ranked document type. Emit 'other' rather than a guess when the input does not match any positive indicator."
},
"confidence": {
"type": "number",
"minimum": 0,
"maximum": 1,
"description": "Model confidence in the top-ranked document_type. Calibrated per the prompt's confidence rule: 0.9+ for documents with a canonical header or explicit self-identifying phrase, 0.7\u20130.9 for strong structural indicators without a literal label, 0.5\u20130.7 when several types plausibly fit, below 0.5 when the evidence is weak (prefer 'other' with low confidence over a speculative code)."
},
"evidence_excerpt": {
"type": "string",
"minLength": 1,
"description": "A verbatim substring of the input that grounds the classification. Must be copyable from the source text; do not paraphrase. Keep it short (roughly one sentence or a short header line)."
},
"rationale": {
"type": "string",
"minLength": 1,
"description": "One short sentence explaining the decision, referencing the indicator that tipped the classification."
},
"secondary_candidates": {
"type": "array",
"description": "Runner-up document types. Emit entries only when the top confidence is below 0.8 or a plausible alternative exceeds 0.3. Sorted by confidence descending. Empty array when the classification is unambiguous.",
"items": {
"type": "object",
"additionalProperties": false,
"required": [
"document_type",
"confidence",
"rationale"
],
"properties": {
"document_type": {
"$ref": "#/$defs/documentTypeCode",
"description": "Runner-up code. Must differ from the top-level document_type."
},
"confidence": {
"type": "number",
"minimum": 0,
"maximum": 1,
"description": "Must be less than or equal to the top-level confidence."
},
"rationale": {
"type": "string",
"minLength": 1,
"description": "One short sentence explaining why this type remains plausible."
}
}
}
}
},
"$defs": {
"documentTypeCode": {
"type": "string",
"enum": [
"solicitation",
"proposal_narrative",
"biosketch",
"current_pending",
"facilities",
"data_mgmt",
"letter_support",
"budget",
"budget_justification",
"award_notice",
"award_amendment",
"jit_response",
"closeout_letter",
"other"
],
"description": "Controlled vocabulary of research-administration document types. 'solicitation' covers RFP / RFA / FOA / NOFO / program announcement. 'proposal_narrative' is the technical narrative of a proposal. 'biosketch' is an NSF or NIH biographical sketch. 'current_pending' is the Current and Pending (Other) Support disclosure. 'facilities' is the Facilities, Equipment, and Other Resources section. 'data_mgmt' is the Data Management (and Sharing) Plan. 'letter_support' is a letter of support, commitment, or collaboration. 'budget' is a structured budget workbook or table. 'budget_justification' is the narrative justification for a budget. 'award_notice' is an initial Notice of Award. 'award_amendment' is a modification or amendment to an existing award. 'jit_response' is a Just-in-Time submission. 'closeout_letter' is a closeout notification or final-report package. 'other' is any document not matching the positive indicators of the above types."
}
}
}
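The `confidence` description in the schema encodes four calibration bands in prose only, so a consumer that wants to log or branch on the band has to derive it. The sketch below does that under one stated assumption: the documented bands share their endpoints (0.9, 0.7, 0.5), and here each shared endpoint is assigned to the higher band. The band labels and the function name `confidence_band` are hypothetical.

```python
def confidence_band(confidence):
    """Map a confidence value to a label for the calibration band it falls in."""
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= 0.9:
        return "explicit_label"      # canonical header or self-identifying phrase
    if confidence >= 0.7:
        return "strong_structure"    # strong structural cues, no literal label
    if confidence >= 0.5:
        return "mixed_signals"       # several types plausibly fit
    return "weak_evidence"           # the prompt prefers "other" in this band


for value in (0.95, 0.75, 0.55, 0.3):
    print(value, confidence_band(value))
```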
Evals¶
Reference cases¶
Golden cases under evals/cases/.
-
ambiguous-letter— Short ambiguous letter — weak letter of support vs. general correspondence (artifacts: input, expected) -
nih-noa— NIH Notice of Award header block (artifacts: input, expected) -
nsf-pd-23-221y-solicitation— NSF program solicitation first-page excerpt (artifacts: input, expected)
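A harness comparing a live output with a case's expected artifact probably should not demand exact equality, since confidence values and rationale wording can drift between runs. The following sketch assumes the expected artifact has the same shape as the output contract; the tolerance value, the subset rule for runner-up codes, and the name `matches_expected` are illustrative assumptions, and the sample objects are made up rather than copied from the eval cases.

```python
def matches_expected(actual, expected, tolerance=0.15):
    """Compare one classifier output with a golden case's expected object."""
    # The top-ranked code must match exactly.
    if actual["document_type"] != expected["document_type"]:
        return False
    # Confidence may drift, but only within the (assumed) tolerance.
    if abs(actual["confidence"] - expected["confidence"]) > tolerance:
        return False
    # Every expected runner-up code should appear; extras are tolerated.
    actual_codes = {c["document_type"] for c in actual["secondary_candidates"]}
    expected_codes = {c["document_type"] for c in expected["secondary_candidates"]}
    return expected_codes <= actual_codes


# Made-up objects for illustration.
expected = {"document_type": "award_notice", "confidence": 0.95,
            "secondary_candidates": []}
actual = {"document_type": "award_notice", "confidence": 0.9,
          "secondary_candidates": []}
print(matches_expected(actual, expected))
```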
Changelog¶
Source: CHANGELOG.md.
All notable changes to this component. Versions follow semver: MAJOR for output-contract breaks (schema changes that drop or rename fields, or remove / rename vocabulary codes), MINOR for backward-compatible additions (new vocabulary codes, new optional fields, new manifestations), PATCH for wording or clarity with no behavior change expected.
The schema.json version is kept in lockstep with the component version.
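The versioning rules above can be sketched as a small decision function over the vocabulary codes and output field names before and after a change. This is an illustration only: `required_bump` is a hypothetical helper, it assumes any added field is optional (the only kind of field addition the rules list as MINOR), and it does not model new manifestations.

```python
def required_bump(old_codes, new_codes, old_fields, new_fields):
    """Return the semver level implied by a change to codes or output fields."""
    old_codes, new_codes = set(old_codes), set(new_codes)
    old_fields, new_fields = set(old_fields), set(new_fields)
    # A rename shows up as one removal plus one addition, so removals cover both.
    if old_codes - new_codes or old_fields - new_fields:
        return "MAJOR"
    # Pure additions are backward compatible (assuming added fields are optional).
    if new_codes - old_codes or new_fields - old_fields:
        return "MINOR"
    # No contract change at all: wording or clarity only.
    return "PATCH"


fields = ["document_type", "confidence", "evidence_excerpt",
          "rationale", "secondary_candidates"]
# Adding a hypothetical new code is a MINOR change.
print(required_bump(["solicitation", "other"],
                    ["solicitation", "other", "progress_report"],
                    fields, fields))
```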
[1.0.0] — 2026-04-18¶
- Initial version.
- JSON Schema (`schema.json`) defining a single output object with `document_type`, `confidence`, `evidence_excerpt`, `rationale`, and a `secondary_candidates` array.
- Controlled 14-code vocabulary covering the most common research-administration document types (solicitation, proposal components, budget, award notice / amendment / JIT / closeout, plus `other`). Codes chosen to align with the document codes consumed by sibling UDM components so classifier output can route directly into those extractors.
- Canonical prompt (`prompt.md`) with per-type indicator cues, a confidence calibration rule, and an explicit rule for when to emit `secondary_candidates`.
- Claude Skill manifestation (`skill/SKILL.md`) tuned for "what kind of document is this" triggers.
- First three golden eval cases: a clear NSF solicitation, a clear NIH Notice of Award (proving sponsor generality beyond NSF), and an ambiguous short letter that exercises the `secondary_candidates` rule.