document-type-classifier-udm

Slug: document-type-classifier-udm
Version: 1.0.0
Status: experimental
Last fully evaluated: 1.0.0
Eval state: current
Category: classification
Domain: research-administration
Manifestations: prompt, skill
Created: 2026-04-18
Updated: 2026-04-18

Tags: classification routing triage udm research-administration document-type

Audience: ingest-pipelines, sponsored-programs-staff, routing-layer

Manifestations in repo: prompt.md · skill/SKILL.md

Classifies a research-administration document into a controlled type vocabulary so downstream pipelines can route it to the correct extractor or reviewer. One call in, one small JSON object out.

Output contract: see schema.json

Inputs

Text of the document, typically the first N pages (or the full body for short files) as OCR markdown or plain text. The component is text-only; upstream is responsible for PDF-to-text conversion.

Optional metadata the caller may include inline at the top of the input (the prompt ignores these when they conflict with the visible content):

  • Filename: — the originating filename
  • Source: — the originating system (email attachment, Grants.gov download, internal upload)
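
As a minimal sketch of how a caller might assemble the input, the helper below prepends the optional hint lines to the document text. The function name `build_classifier_input` and the blank line between hints and body are assumptions for illustration; the source only states that the hints appear inline at the top of the input.

```python
from typing import Optional


def build_classifier_input(
    document_text: str,
    filename: Optional[str] = None,
    source: Optional[str] = None,
) -> str:
    """Prepend optional Filename:/Source: hint lines to the document text."""
    header = []
    if filename:
        header.append(f"Filename: {filename}")
    if source:
        header.append(f"Source: {source}")
    if not header:
        # No hints supplied: pass the document text through unchanged.
        return document_text
    # Assumed layout: one blank line separates the hints from the body.
    return "\n".join(header) + "\n\n" + document_text
```

Because the prompt lets visible content win over conflicting hints, a caller can pass whatever filename it has without sanitizing misleading names first.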

Outputs

One JSON object with:

  • document_type — the top-ranked code from the controlled vocabulary
  • confidence — 0–1
  • evidence_excerpt — a verbatim quote from the input that grounds the classification
  • rationale — one short sentence explaining the decision
  • secondary_candidates — zero or more runner-up types, each with document_type, confidence, and rationale. Emit entries only when the top confidence is below 0.8 or when a plausible alternative exceeds 0.3.

See schema.json for the authoritative definition.
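
Several of the output rules above (the verbatim excerpt, and the ordering and bounds on runner-up entries) are cross-field rules that a JSON Schema alone does not express. The sketch below shows one way a caller might check them after schema validation. The function name `check_semantic_rules` and the wording of the messages are hypothetical.

```python
from typing import Any, Dict, List


def check_semantic_rules(output: Dict[str, Any], source_text: str) -> List[str]:
    """Return a list of rule violations; an empty list means every check passed."""
    problems: List[str] = []
    top_type = output["document_type"]
    top_conf = output["confidence"]

    # The excerpt must be copyable from the input, not a paraphrase.
    if output["evidence_excerpt"] not in source_text:
        problems.append("evidence_excerpt is not a verbatim substring of the input")

    candidates = output["secondary_candidates"]
    for cand in candidates:
        if cand["document_type"] == top_type:
            problems.append("secondary candidate repeats the top-level document_type")
        if cand["confidence"] > top_conf:
            problems.append("secondary candidate confidence exceeds the top-level confidence")

    confs = [c["confidence"] for c in candidates]
    if confs != sorted(confs, reverse=True):
        problems.append("secondary_candidates are not sorted by confidence descending")

    # Emission rule: entries are allowed only when the top confidence is below 0.8
    # or at least one alternative exceeds 0.3.
    if candidates and top_conf >= 0.8 and all(c <= 0.3 for c in confs):
        problems.append("secondary_candidates emitted without meeting the emission rule")

    return problems
```

Returning a list rather than raising on the first failure lets a routing layer log every violation for a rejected output.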

Controlled vocabulary

| Code | Covers |
| --- | --- |
| solicitation | RFP, RFA, FOA, NOFO, program announcement |
| proposal_narrative | Project description, research plan, specific aims |
| biosketch | NSF or NIH biographical sketch |
| current_pending | Current and Pending (Other) Support |
| facilities | Facilities, Equipment, and Other Resources |
| data_mgmt | Data Management Plan or Data Management & Sharing plan |
| letter_support | Letter of support, commitment, or collaboration |
| budget | Budget workbook, budget table, or budget form |
| budget_justification | Narrative budget justification |
| award_notice | Initial award notice / Notice of Award |
| award_amendment | Amendment or modification to an existing award |
| jit_response | Just-in-Time submission |
| closeout_letter | Closeout notification or final report package |
| other | None of the above |

The code values here are shared with other prompt-library components used in UDM-oriented pipelines. solicitation is what solicitation-doc-modifications-udm classifies on; award_notice / award_amendment gate award-document-extraction-udm; the proposal-component codes (biosketch, current_pending, facilities, data_mgmt, letter_support, budget_justification) align with sponsor-doc-defaults-udm.
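
The shared codes make a simple dispatch table possible. The sketch below is one hypothetical routing layer: the component names come from the paragraph above, while the `manual-review` fallback, the 0.7 confidence threshold, and the choice to send proposal-component codes to sponsor-doc-defaults-udm are assumptions, since the source only says those codes align with that component.

```python
# Codes that the text above ties to a specific downstream component.
ROUTES = {
    "solicitation": "solicitation-doc-modifications-udm",
    "award_notice": "award-document-extraction-udm",
    "award_amendment": "award-document-extraction-udm",
}

# Proposal-component codes described as aligning with sponsor-doc-defaults-udm.
PROPOSAL_COMPONENT_CODES = {
    "biosketch", "current_pending", "facilities",
    "data_mgmt", "letter_support", "budget_justification",
}

MANUAL_REVIEW = "manual-review"  # hypothetical fallback queue


def route(document_type: str, confidence: float, min_confidence: float = 0.7) -> str:
    """Pick a downstream target for a classified document."""
    if confidence < min_confidence:
        # Assumed policy: low-confidence results go to a human reviewer.
        return MANUAL_REVIEW
    if document_type in ROUTES:
        return ROUTES[document_type]
    if document_type in PROPOSAL_COMPONENT_CODES:
        return "sponsor-doc-defaults-udm"
    # Codes with no named consumer in this document (for example budget or other).
    return MANUAL_REVIEW
```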

Manifestations

Schema

schema.json is a JSON Schema (draft 2020-12) defining the full output contract. Classifier output can be checked with any conforming draft 2020-12 validator, and routing layers should gate on that check.
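
A real pipeline would run a conforming JSON Schema validator against schema.json. As a dependency-free stand-in, the sketch below hand-checks the same structural constraints (required keys, no extra properties, enum membership, confidence range, non-empty strings). The function names are hypothetical, and this is an illustration of the gate, not a replacement for schema validation.

```python
from typing import Any, List

DOCUMENT_TYPE_CODES = {
    "solicitation", "proposal_narrative", "biosketch", "current_pending",
    "facilities", "data_mgmt", "letter_support", "budget",
    "budget_justification", "award_notice", "award_amendment",
    "jit_response", "closeout_letter", "other",
}
TOP_LEVEL_KEYS = {"document_type", "confidence", "evidence_excerpt",
                  "rationale", "secondary_candidates"}
CANDIDATE_KEYS = {"document_type", "confidence", "rationale"}
STRING_KEYS = {"evidence_excerpt", "rationale"}


def _entry_errors(entry: Any, expected_keys: set, label: str) -> List[str]:
    """Check one object (top level or a secondary candidate)."""
    if not isinstance(entry, dict):
        return [f"{label}: must be an object"]
    errors = []
    # required + additionalProperties: false means the key set must match exactly.
    if set(entry) != expected_keys:
        errors.append(f"{label}: keys must be exactly {sorted(expected_keys)}")
    code = entry.get("document_type")
    if not isinstance(code, str) or code not in DOCUMENT_TYPE_CODES:
        errors.append(f"{label}: document_type is not a vocabulary code")
    conf = entry.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        errors.append(f"{label}: confidence must be a number in [0, 1]")
    for key in sorted(STRING_KEYS & expected_keys):
        value = entry.get(key)
        if not isinstance(value, str) or not value:
            errors.append(f"{label}: {key} must be a non-empty string")
    return errors


def structural_errors(output: Any) -> List[str]:
    """Return structural violations; an empty list means the gate passes."""
    errors = _entry_errors(output, TOP_LEVEL_KEYS, "top level")
    candidates = output.get("secondary_candidates") if isinstance(output, dict) else None
    if not isinstance(candidates, list):
        errors.append("secondary_candidates: must be an array")
        return errors
    for i, cand in enumerate(candidates):
        errors.extend(_entry_errors(cand, CANDIDATE_KEYS, f"secondary_candidates[{i}]"))
    return errors
```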

Evals

See evals/ for reference inputs and known-good outputs. The initial set exercises one clear solicitation, one clear award notice from a non-NSF sponsor, and one deliberately ambiguous short letter that should produce a dominant choice plus a plausible secondary candidate.

Provenance

Schema designed 2026-04-18 in response to issue #8. Vocabulary derived from the set of document types already consumed by sibling UDM components so that the classifier's output can route directly into those extractors and reviewers.

Contract scope

  • Output format: json_object

  • Contract scope: repo_local_shared_component_vocabulary

  • Validation surfaces: json_schema, golden_eval_cases

  • Schema entrypoints: #

  • Notes: Repo-local routing contract. The document_type enum is shared with sibling prompt-library components and ingest pipelines, but it is not a shared AI4RA-UDM table or externally versioned schema.

  • Machine-readable catalog entry: component_catalog.json

Triad integration

  • UDM alignment: repo_local_shared_component_vocabulary — The classifier supports UDM-oriented routing workflows, but its enum is maintained in this repo as a local cross-component contract rather than in the shared UDM repository.

  • Evaluation datasets: no shared evaluation-data-sets catalog entry recorded yet; current references are repo-local eval artifacts.

  • Harness notes: Treat prompt.md as the canonical invocation surface, validate output against schema.json, and use the repo-local eval cases as the current reference set until a shared dataset entry is registered.

  • Related component: sponsor-doc-defaults-udm (shares_document_code_vocabulary) — The classifier emits codes that downstream sponsor/defaults workflows recognize.

  • Related component: solicitation-doc-modifications-udm (shares_document_code_vocabulary) — Shared code values let solicitation outputs route by the same document labels.

Prompt body

Source: prompt.md.

Document Type Classifier — UDM

Purpose: Given the text of a single research-administration document, emit one JSON object identifying its type from a controlled vocabulary. The output is used upstream of extraction pipelines to pick the right downstream component.

Expected input: Plain text or OCR markdown of the document. Typically the first N pages; for short files (under ~3 pages) usually the whole body. May be preceded by optional Filename: / Source: hints from the caller — do not over-weight these; the content always wins.

Expected output: A single JSON object that validates against schema.json. No prose, no markdown outside the JSON.


Prompt

You are a document type classifier for research administration. Read the provided document text and identify which of the controlled vocabulary codes best describes the document. Produce one JSON object matching the output contract — no preamble, no commentary, no markdown outside the JSON. If the runtime requires a fenced block, wrap the object in a single ```json ... ``` block and emit nothing else.

Output contract

Emit one object with these fields:

  • document_type — the top-ranked code.

  • confidence — number in [0, 1], calibrated per the rule below.

  • evidence_excerpt — a verbatim substring of the input (copy-paste, not paraphrase) that grounds the classification. Roughly one sentence or one header line.

  • rationale — one short sentence naming the indicator that tipped the classification.

  • secondary_candidates — array of runner-up objects, possibly empty. Emit entries only when the top confidence is below 0.8 OR when a plausible alternative scores above 0.3. Each entry has document_type (must differ from the top-level code), confidence (≤ top-level confidence), and a one-sentence rationale. Sort by confidence descending.

Controlled vocabulary

Classify into exactly one of these codes:

| Code | Covers | Typical positive indicators |
| --- | --- | --- |
| solicitation | RFP / RFA / FOA / NOFO / program announcement | Sponsor name followed by a program identifier (e.g., PD 23-221Y, PA-25-123, RFA-AI-24-456); sections titled "Program Description", "Eligibility", "Proposal Preparation", "Merit Review Criteria", "Due Dates"; prose directed at prospective applicants ("Proposals are invited…", "This solicitation supports…"). |
| proposal_narrative | Project description, research plan, specific aims | Sections titled "Project Description", "Research Plan", "Specific Aims", "Background/Introduction/Methods/Broader Impacts"; written in the proposer's voice, citing the literature; proposes work to be done. |
| biosketch | NSF or NIH biographical sketch | NSF: "Professional Preparation", "Appointments and Positions", "Products" (with subheads "Products Most Closely Related" / "Other Significant Products"), "Synergistic Activities". NIH: "Positions and Honors", "Contribution to Science", "Personal Statement", "Complete List of Published Work in MyBibliography". Single-person focused, chronological. |
| current_pending | Current and Pending (Other) Support | Tabular per-person listing of active and pending awards, each with project title, sponsor, support type, total award amount, start and end dates, person-months committed. May span multiple people. Often generated by SciENcv. |
| facilities | Facilities, Equipment, and Other Resources | Descriptions of laboratory space, shared instrumentation, computing, libraries, clinical resources, office space; no proposed work; written to demonstrate institutional capacity. Often titled "Facilities, Equipment, and Other Resources". |
| data_mgmt | Data Management Plan or Data Management & Sharing plan | Discussion of data types to be generated, formats, metadata standards, retention periods, sharing plans, repositories, privacy/IRB considerations, timelines for release. NSF: "Data Management Plan" (short). NIH: "Data Management and Sharing Plan" (longer, structured). |
| letter_support | Letter of support, commitment, or collaboration | Letterhead, dated, salutation ("Dear Dr. X" or "To Whom It May Concern"), a short commitment statement referencing a specific proposal or PI, signature block. Usually one page. |
| budget | Structured budget workbook or table | Dense numeric tabular layout with budget category rows (A-M for NSF; personnel / fringe / travel / supplies / etc. for general) and dollar-amount columns, usually across project years. Often exported from an institutional budget tool. Little or no narrative. |
| budget_justification | Narrative budget justification | Prose walking category-by-category through the budget, naming personnel and effort percentages, explaining equipment and travel, and citing the indirect cost rate and base. Often structured as sections A through H (NSF). |
| award_notice | Initial Notice of Award | Sponsor-branded notice block, "Award Number" / "FAIN", period of performance, obligation amount, recipient information, terms and conditions citations. For NSF, "Amendment Number" = "000". For NIH, "Notice of Award" header with a "Type 1" or "New" action code. |
| award_amendment | Modification or amendment | Same structural family as award_notice but explicitly a modification: "Amendment Number" other than "000", "Administrative Amendment", "No-Cost Extension", "Supplemental", or an NIH "Type 3" / "Type 4" / "Type 5" revision. References an existing award. |
| jit_response | Just-in-Time submission | Explicit "Just-in-Time" or "JIT" label; pre-award request for updated other support, IRB/IACUC approval documentation, revised budget, or human subjects information; directed to a program officer or grants management specialist. |
| closeout_letter | Closeout notification or final-report package | "Closeout", "Final Report", "Final Federal Financial Report (FFR)", "Final Invention Statement", "Final Progress Report", "Property Report"; sponsor-initiated notification or recipient-initiated submission marking the end of a project. |
| other | None of the above apply | Use when no positive indicator fits. Prefer other over a speculative match for an unrelated document (emails that are not award notices, internal memos, scholarly manuscripts, contracts for services, general correspondence). |

Classification procedure

  1. Scan the first 500–1000 characters for a self-identifying phrase: a boxed header, a title line, or an explicit label. Those are the highest-signal indicators. Record the exact text for evidence_excerpt if it resolves the classification.

  2. If no explicit label is present, look for the structural indicators listed above. Combine multiple weak indicators before committing.

  3. Apply the sponsor-agnostic rule. The vocabulary is sponsor-neutral. Do not refuse to classify a document just because the sponsor is one you have not seen. An NIH Notice of Award is award_notice, just as an NSF one is.

  4. Apply the initial-vs-amendment rule for award documents: if the document is an award notice and explicitly shows an amendment / modification / no-cost-extension marker, classify it as award_amendment. Otherwise, classify it as award_notice.

  5. Prefer other over a guess when the indicators are weak. Low-confidence other is better than a speculative routing decision downstream.

Confidence calibration

  • 0.9–1.0 — the document carries a canonical header or explicit self-identifying phrase that names the type (e.g., "NSF Award Notice", "Data Management Plan", "Biographical Sketch"). evidence_excerpt quotes that phrase.

  • 0.7–0.9 — strong structural indicators present (section headings, tabular layouts, or writing voice) but no literal self-identification.

  • 0.5–0.7 — multiple types plausibly fit; structural cues are partial or mixed. Emit secondary_candidates in this band.

  • Below 0.5 — evidence is weak; prefer document_type: "other" with a short rationale, and list the most-plausible coded candidates under secondary_candidates.

secondary_candidates rule

  • Emit entries only when top confidence is below 0.8 or when a plausible alternative exceeds 0.3.

  • Each secondary entry's document_type must differ from the top-level document_type.

  • Each secondary entry's confidence must not exceed the top-level confidence.

  • Sort descending by confidence.

  • Emit an empty array ([]) — not null — when the classification is unambiguous.

Quality standards

  1. Evidence-grounded: evidence_excerpt is a verbatim substring of the input. A classification without a quotable indicator should be low-confidence other.

  2. Controlled vocabulary only: document_type must be one of the enumerated codes. Do not invent new codes; if no code fits, use other.

  3. Calibrated confidence — follow the bands above. Over-confident classification is a worse failure than cautious other.

  4. Sponsor-agnostic — the vocabulary does not encode sponsor identity. An NIH NOA, an NSF NOA, and a DoE award letter are all award_notice.

  5. No fabrication — do not infer a type from absence of information. If the document is truncated or unreadable, classify as other and say so in rationale.

  6. Schema conformance — output validates against schema.json.

Produce the JSON now.
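
On the caller's side, the prompt's output rule (a bare JSON object, or a single json fenced block when the runtime requires one) suggests a tolerant parser. The sketch below accepts both shapes. The name `parse_classifier_output` is hypothetical, and the fence marker is built from single backticks only to keep this example readable inside its own code block.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # the three-backtick fence marker

# Matches a response that is exactly one fenced block, optionally tagged json.
_FENCED = re.compile(
    r"\A" + FENCE + r"(?:json)?\s*\n(.*?)\n" + FENCE + r"\s*\Z",
    re.DOTALL,
)


def parse_classifier_output(raw: str) -> Dict[str, Any]:
    """Parse a model response that is a bare JSON object or one fenced JSON block."""
    text = raw.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    parsed = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(parsed, dict):
        raise ValueError("expected a single JSON object")
    return parsed
```

Anything other than these two shapes (leading prose, several fences) fails to parse, which matches the prompt's instruction to emit nothing else.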

Output schema

Source: schema.json.

{

  "$schema": "https://json-schema.org/draft/2020-12/schema",

  "$id": "https://github.com/AI4RA/prompt-library/components/document-type-classifier-udm/schema.json",

  "title": "Document Type Classifier \u2014 UDM Output",

  "description": "Single JSON object classifying a research-administration document into a controlled type vocabulary. Consumed by routing layers that dispatch the document to the appropriate downstream extractor or reviewer. One object per input document.",

  "version": "1.0.0",

  "type": "object",

  "additionalProperties": false,

  "required": [

    "document_type",

    "confidence",

    "evidence_excerpt",

    "rationale",

    "secondary_candidates"

  ],

  "properties": {

    "document_type": {

      "$ref": "#/$defs/documentTypeCode",

      "description": "Top-ranked document type. Emit 'other' rather than a guess when the input does not match any positive indicator."

    },

    "confidence": {

      "type": "number",

      "minimum": 0,

      "maximum": 1,

      "description": "Model confidence in the top-ranked document_type. Calibrated per the prompt's confidence rule: 0.9+ for documents with a canonical header or explicit self-identifying phrase, 0.7\u20130.9 for strong structural indicators without a literal label, 0.5\u20130.7 when several types plausibly fit, below 0.5 when the evidence is weak (prefer 'other' with low confidence over a speculative code)."

    },

    "evidence_excerpt": {

      "type": "string",

      "minLength": 1,

      "description": "A verbatim substring of the input that grounds the classification. Must be copyable from the source text; do not paraphrase. Keep it short (roughly one sentence or a short header line)."

    },

    "rationale": {

      "type": "string",

      "minLength": 1,

      "description": "One short sentence explaining the decision, referencing the indicator that tipped the classification."

    },

    "secondary_candidates": {

      "type": "array",

      "description": "Runner-up document types. Emit entries only when the top confidence is below 0.8 or a plausible alternative exceeds 0.3. Sorted by confidence descending. Empty array when the classification is unambiguous.",

      "items": {

        "type": "object",

        "additionalProperties": false,

        "required": [

          "document_type",

          "confidence",

          "rationale"

        ],

        "properties": {

          "document_type": {

            "$ref": "#/$defs/documentTypeCode",

            "description": "Runner-up code. Must differ from the top-level document_type."

          },

          "confidence": {

            "type": "number",

            "minimum": 0,

            "maximum": 1,

            "description": "Must be less than or equal to the top-level confidence."

          },

          "rationale": {

            "type": "string",

            "minLength": 1,

            "description": "One short sentence explaining why this type remains plausible."

          }

        }

      }

    }

  },

  "$defs": {

    "documentTypeCode": {

      "type": "string",

      "enum": [

        "solicitation",

        "proposal_narrative",

        "biosketch",

        "current_pending",

        "facilities",

        "data_mgmt",

        "letter_support",

        "budget",

        "budget_justification",

        "award_notice",

        "award_amendment",

        "jit_response",

        "closeout_letter",

        "other"

      ],

      "description": "Controlled vocabulary of research-administration document types. 'solicitation' covers RFP / RFA / FOA / NOFO / program announcement. 'proposal_narrative' is the technical narrative of a proposal. 'biosketch' is an NSF or NIH biographical sketch. 'current_pending' is the Current and Pending (Other) Support disclosure. 'facilities' is the Facilities, Equipment, and Other Resources section. 'data_mgmt' is the Data Management (and Sharing) Plan. 'letter_support' is a letter of support, commitment, or collaboration. 'budget' is a structured budget workbook or table. 'budget_justification' is the narrative justification for a budget. 'award_notice' is an initial Notice of Award. 'award_amendment' is a modification or amendment to an existing award. 'jit_response' is a Just-in-Time submission. 'closeout_letter' is a closeout notification or final-report package. 'other' is any document not matching the positive indicators of the above types."

    }

  }

}

Evals

Reference cases

Golden cases under evals/cases/.

  • ambiguous-letter — Short ambiguous letter — weak letter of support vs. general correspondence (artifacts: input, expected)

  • nih-noa — NIH Notice of Award header block (artifacts: input, expected)

  • nsf-pd-23-221y-solicitation — NSF program solicitation first-page excerpt (artifacts: input, expected)
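
One way a harness might compare a model output with a golden expected object is sketched below. The comparison policy is an assumption, not the repo's documented method: the top-level code must match exactly, the confidence must fall in the same calibration band, and every expected runner-up code must appear. The band names are hypothetical, and shared boundaries (0.9, 0.7, 0.5) are assigned to the higher band.

```python
from typing import Any, Dict


def confidence_band(confidence: float) -> str:
    """Map a confidence to a calibration band (boundaries go to the higher band)."""
    if confidence >= 0.9:
        return "explicit_label"
    if confidence >= 0.7:
        return "strong_structure"
    if confidence >= 0.5:
        return "mixed"
    return "weak"


def matches_expected(actual: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Compare an output with a golden case under the assumed policy above."""
    if actual["document_type"] != expected["document_type"]:
        return False
    if confidence_band(actual["confidence"]) != confidence_band(expected["confidence"]):
        return False
    actual_codes = {c["document_type"] for c in actual["secondary_candidates"]}
    expected_codes = {c["document_type"] for c in expected["secondary_candidates"]}
    # Extra runner-ups are tolerated; missing expected ones are not.
    return expected_codes <= actual_codes
```

Comparing bands rather than exact numbers keeps the check stable across runs where the model's confidence drifts by a few hundredths.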

Changelog

Source: CHANGELOG.md.

All notable changes to this component. Versions follow semver: MAJOR for output-contract breaks (schema changes that drop or rename fields, or remove / rename vocabulary codes), MINOR for backward-compatible additions (new vocabulary codes, new optional fields, new manifestations), PATCH for wording or clarity with no behavior change expected.

The schema.json version is kept in lockstep with the component version.

[1.0.0] — 2026-04-18

  • Initial version.
  • JSON Schema (schema.json) defining a single output object with document_type, confidence, evidence_excerpt, rationale, and a secondary_candidates array.
  • Controlled 14-code vocabulary covering the most common research-administration document types (solicitation, proposal components, budget, award notice / amendment / JIT / closeout, plus other). Codes chosen to align with the document codes consumed by sibling UDM components so classifier output can route directly into those extractors.
  • Canonical prompt (prompt.md) with per-type indicator cues, a confidence calibration rule, and an explicit rule for when to emit secondary_candidates.
  • Claude Skill manifestation (skill/SKILL.md) tuned for "what kind of document is this" triggers.
  • First three golden eval cases: a clear NSF solicitation, a clear NIH Notice of Award (proving sponsor generality beyond NSF), and an ambiguous short letter that exercises the secondary_candidates rule.