The AI4RA Evaluation Ecosystem

The AI4RA evaluation ecosystem is a coordinated triad plus a shared schema foundation. This page maps the roles so prompt-library work, dataset work, and harness work do not drift into undocumented assumptions.

At a glance

flowchart LR
    PL["AI4RA/prompt-library<br/><i>prompts · skills · agents · schemas · workflows</i>"]
    DS["AI4RA/evaluation-data-sets<br/><i>datasets · artifacts · scoring refs</i>"]
    HARNESS["AI4RA/evaluation-harness<br/><i>discovery · execution · scoring</i>"]
    UDM["ui-insight/AI4RA-UDM<br/><i>shared data-model foundation</i>"]
    REPORTS["evaluation reports<br/><i>published evidence</i>"]

    PL --> HARNESS
    DS --> HARNESS
    UDM -. semantic alignment .-> PL
    UDM -. semantic alignment .-> DS
    HARNESS --> REPORTS
    REPORTS --> PL

Solid arrows are concrete data flows. Dotted arrows show semantic alignment to the shared UDM foundation rather than ownership of the same checked-in schema files.

The roles

`AI4RA/prompt-library`

What it is. The versioned catalog of prompts, skills, agents, schemas, component contracts, and Vandalizer workflows. Each component carries its own manifestations, changelog, and eval artifacts; workflows layer on top as authored manifest.yaml sources that generate uploadable .vandalizer.json exports. The repo-level component_catalog.json is the harness-facing discovery surface and carries both components and workflows.

What it is not. The dataset store. The scoring corpus. The canonical shared UDM repository.

`AI4RA/evaluation-data-sets`

What it is. The dataset leg of the triad: synthetic and real corpora, rendered artifacts, and scoring references. Its dataset_catalog.json is the harness-facing discovery surface for datasets and validation policy.

What it is not. The prompt catalog or the runtime that executes components.

`AI4RA/evaluation-harness`

What it is. The runner layer of the triad. It discovers components from component_catalog.json, discovers datasets from dataset_catalog.json, executes evaluation campaigns, validates outputs against the declared contract surfaces, scores results, and publishes run artifacts back into the prompt library. Its repo-level harness_catalog.json is the machine-readable discovery surface for current harness capabilities and integration scope.
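The discovery step described above could be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the catalog layouts and every field name used here (`components`, `workflows`, `datasets`, `slug`, `version`, `id`, `validation_policy`) are assumptions made for the example, and the real `component_catalog.json` and `dataset_catalog.json` schemas may differ.

```python
import json

# Hypothetical catalog contents; the real schemas may differ.
COMPONENT_CATALOG = json.loads("""
{
  "components": [{"slug": "example-extractor", "version": "1.2.0"}],
  "workflows": [{"slug": "example-workflow", "version": "0.1.0"}]
}
""")
DATASET_CATALOG = json.loads("""
{
  "datasets": [{"id": "example-dataset", "validation_policy": "strict"}]
}
""")


def index_by(entries, key):
    """Index a list of catalog entries by one field, rejecting duplicates."""
    index = {}
    for entry in entries:
        value = entry[key]
        if value in index:
            raise ValueError(f"duplicate {key}: {value}")
        index[value] = entry
    return index


def discover(component_catalog, dataset_catalog):
    """Return (components, workflows, datasets) lookups from the two catalogs."""
    components = index_by(component_catalog.get("components", []), "slug")
    workflows = index_by(component_catalog.get("workflows", []), "slug")
    datasets = index_by(dataset_catalog.get("datasets", []), "id")
    return components, workflows, datasets


components, workflows, datasets = discover(COMPONENT_CATALOG, DATASET_CATALOG)
```

Keeping the two catalogs as separate inputs mirrors the ownership split: the harness reads both, but neither catalog is authored in the harness repo.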

What it is not. The source-of-truth location for component contracts or dataset provenance. Those stay in their respective repos.

`ui-insight/AI4RA-UDM`

What it is. The shared UDM foundation, and the home for any data-model contract that genuinely spans repos.

What it is not. A synonym for every -udm component in prompt-library. Many prompt-library schemas are repo-local contracts that align to shared UDM semantics without being copies of the shared UDM repo.

How a campaign should work

1. The harness pins a prompt-library component by commit and component version.
2. The harness pins a dataset by commit and dataset ID.
3. The harness validates outputs against the component's declared contract surface.
4. The harness honors each dataset's validation policy when turning outputs into scores.
5. The harness publishes run artifacts and summaries back into components/<slug>/evals/reports/<run-id>/ when the prompt-library repo is the evidence home.
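The pinning and publishing steps above can be illustrated with a small sketch. The class names `ComponentPin` and `DatasetPin`, the helper `report_dir`, and the requirement that a commit be a full 40-character SHA are all assumptions made for this example; only the `components/<slug>/evals/reports/<run-id>/` layout comes from this page.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath

# Assumption: a pin must be a full 40-character hex commit SHA.
_SHA = re.compile(r"[0-9a-f]{40}")


def _require_sha(commit):
    """Reject floating refs such as branch names."""
    if not _SHA.fullmatch(commit):
        raise ValueError(f"not a full commit SHA: {commit!r}")


@dataclass(frozen=True)
class ComponentPin:
    """Hypothetical record: a component pinned by commit and version."""
    slug: str
    version: str
    commit: str

    def __post_init__(self):
        _require_sha(self.commit)


@dataclass(frozen=True)
class DatasetPin:
    """Hypothetical record: a dataset pinned by commit and dataset ID."""
    dataset_id: str
    commit: str

    def __post_init__(self):
        _require_sha(self.commit)


def report_dir(component, run_id):
    """Build the report location described on this page for one run."""
    return PurePosixPath("components") / component.slug / "evals" / "reports" / run_id
```

For example, `report_dir(ComponentPin("example-extractor", "1.2.0", "a" * 40), "run-001")` yields `components/example-extractor/evals/reports/run-001`, while constructing a pin with `"main"` as the commit raises `ValueError`.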

Cross-repo contract pinning

This repo records observed upstream refs in component_catalog.json so cross-repo links are not silently interpreted as “whatever is on main today.” When a change depends on AI4RA/evaluation-harness, AI4RA/evaluation-data-sets, or ui-insight/AI4RA-UDM, update the observed ref and the human documentation in the same change.
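A check along these lines could flag upstream links that are missing or still floating. It is a sketch under stated assumptions: the `observed_upstream_refs` key and its repo-to-ref mapping are hypothetical (this page does not specify how `component_catalog.json` stores observed refs), and the check treats only hex commit SHAs as pinned, which would reject tags.

```python
import re

# The three upstream repos named on this page.
UPSTREAMS = (
    "AI4RA/evaluation-harness",
    "AI4RA/evaluation-data-sets",
    "ui-insight/AI4RA-UDM",
)

# Assumption: a pinned ref is a 7 to 40 character hex commit SHA.
_SHA = re.compile(r"[0-9a-f]{7,40}")


def unpinned_upstreams(catalog):
    """Return upstream repos whose observed ref is missing or not a commit SHA."""
    # "observed_upstream_refs" is a hypothetical key used for illustration.
    refs = catalog.get("observed_upstream_refs", {})
    problems = []
    for repo in UPSTREAMS:
        ref = refs.get(repo)
        if ref is None or not _SHA.fullmatch(ref):
            problems.append(repo)
    return problems


catalog = {
    "observed_upstream_refs": {
        "AI4RA/evaluation-harness": "0123abc",
        "AI4RA/evaluation-data-sets": "main",  # floating branch name
    }
}
flagged = unpinned_upstreams(catalog)
```

Here `flagged` lists `AI4RA/evaluation-data-sets` (a branch name rather than a commit) and `ui-insight/AI4RA-UDM` (no ref recorded).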

Typical change flows

- New task family: the component lands in prompt-library; a matching dataset lands in evaluation-data-sets when shared evaluation inputs are needed; harness wiring is updated to consume both catalogs.
- Shared UDM implication: proposal and discussion happen in ui-insight/AI4RA-UDM; after agreement, prompt-library and dataset contracts update with pinned observed refs.
- Evaluation hardening: new golden cases or shared datasets land first, then the harness uses them to revalidate a newer component version and publish reports.