Evals¶

Evals let you run a team of agents against a set of predefined tasks and automatically score the results. Each eval case specifies what the final assistant message must (or must not) contain, how many turns the session may take, and whether the session must succeed. Together, the cases give you a repeatable regression suite for your agent configs.

Quick start¶

Scaffold a suite, then run it:

fuseraft eval init                  # interactive wizard → .fuseraft/evals/suite.yaml
fuseraft eval run                   # runs suite.yaml against the default team config

Commands¶

`fuseraft eval init [output]`¶

Scaffolds a new eval suite YAML with annotated example cases.

Argument or flag	Description
`[output]`	Path to write the suite (default: `.fuseraft/evals/suite.yaml`)
`-n, --name <name>`	Suite name embedded in the file
`-c, --config <path>`	Default team config path to embed
`--no-interactive`	Skip prompts and use supplied options and defaults

fuseraft eval init my-evals/suite.yaml --name "Smoke Tests" --config .fuseraft/config/orchestration.yaml

`fuseraft eval run [suite]`¶

Runs every case in a suite and prints pass/fail per case, then a summary.

Argument or flag	Description
`[suite]`	Path to the suite file (default: `.fuseraft/evals/suite.yaml`)
`-c, --config <path>`	Override the suite-level team config
`-o, --output <path>`	Write per-case results as JSONL to this file
`--filter <value>`	Run only cases whose `id` or any of whose `tags` contains this substring (case-insensitive)
`--timeout <seconds>`	Per-case timeout; `0` = no timeout (default)
`--no-banner`	Skip the suite header line
`--ci`	Exit with code `1` if any case fails (for CI pipelines)

fuseraft eval run                              # run all cases
fuseraft eval run --filter smoke               # run only cases whose id or tags contain "smoke"
fuseraft eval run --ci --output results.jsonl  # CI mode with JSONL output
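
The `--filter` rule (case-insensitive substring match against a case's `id` or `tags`) can be modelled in a few lines of Python. This is only an illustrative sketch of the documented behaviour, not fuseraft's implementation; `matches_filter` is a hypothetical helper and cases are represented as plain dicts.

```python
def matches_filter(case: dict, value: str) -> bool:
    """Return True if `value` is a case-insensitive substring of the id or any tag."""
    needle = value.lower()
    candidates = [case.get("id", "")] + list(case.get("tags", []))
    return any(needle in candidate.lower() for candidate in candidates)


cases = [
    {"id": "smoke-basic", "tags": ["smoke"]},
    {"id": "code-generation", "tags": ["coding"]},
    {"id": "no-refusal", "tags": ["quality"]},
]

selected = [c["id"] for c in cases if matches_filter(c, "SMOKE")]
print(selected)  # ['smoke-basic']
```

Because the match is a substring match, a short value such as `co` would select `code-generation` through both its `id` and its `coding` tag, so more specific filter values give more predictable selections.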

Suite file format¶

Suites are YAML (or JSON) files with a top-level name, a default config path, and a list of cases.

name: My Eval Suite
config: .fuseraft/config/orchestration.yaml   # suite-level default; overridable per case

cases:
  - id: smoke-basic
    task: "Say hello and confirm you are ready."
    must_succeed: true
    expect_keywords:
      - hello
    max_turns: 3
    tags:
      - smoke
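
Since suites may also be JSON, the same suite can be written as a JSON document. The sketch below builds it with Python's standard `json` module; it assumes, without confirmation from this page, that the JSON form uses the same key names as the YAML form.

```python
import json

# Same content as the YAML suite above, expressed as a Python dict.
# Assumption: JSON suites use the same keys as YAML suites.
suite = {
    "name": "My Eval Suite",
    "config": ".fuseraft/config/orchestration.yaml",
    "cases": [
        {
            "id": "smoke-basic",
            "task": "Say hello and confirm you are ready.",
            "must_succeed": True,
            "expect_keywords": ["hello"],
            "max_turns": 3,
            "tags": ["smoke"],
        }
    ],
}

suite_json = json.dumps(suite, indent=2)
print(suite_json)
```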

Top-level fields¶

Field	Type	Description
`name`	string	Human-readable suite name shown in the banner
`config`	string	Default team config path used by all cases unless overridden
`cases`	list	Ordered list of eval cases

Case fields¶

Field	Type	Default	Description
`id`	string	—	Unique identifier used in reports and `--filter`
`task`	string	—	Inline task prompt sent to the orchestrator
`task_file`	string	—	Path to a file whose contents become the task (mutually exclusive with `task`)
`config`	string	suite default	Per-case team config override
`must_succeed`	bool	`true`	Fail the case when the session does not complete successfully
`expect_keywords`	list\<string>	`[]`	All strings must appear (case-insensitive) in the final assistant message
`expect_regex`	list\<string>	`[]`	All patterns must match (case-insensitive) against the final assistant message
`forbidden_keywords`	list\<string>	`[]`	None of these strings may appear (case-insensitive) in the final assistant message
`max_turns`	int	`0`	Fail if the session exceeds this many agent turns; `0` = unlimited
`tags`	list\<string>	`[]`	Labels used with `--filter`
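
To make the defaults and the `task` / `task_file` exclusivity concrete, here is a minimal Python sketch that fills in the documented defaults for a case and rejects a case that sets both fields. The helper name `normalize_case` is hypothetical, and requiring exactly one of the two fields is an assumption (the table only states that they are mutually exclusive); fuseraft's own validation may differ.

```python
# Defaults taken from the "Case fields" table.
CASE_DEFAULTS = {
    "must_succeed": True,
    "expect_keywords": [],
    "expect_regex": [],
    "forbidden_keywords": [],
    "max_turns": 0,
    "tags": [],
}


def normalize_case(raw: dict) -> dict:
    """Return a copy of `raw` with documented defaults applied."""
    if not raw.get("id"):
        raise ValueError("case is missing an id")
    has_task = "task" in raw
    has_file = "task_file" in raw
    # Assumption: exactly one of task / task_file must be present.
    if has_task == has_file:
        raise ValueError(f"case {raw['id']!r}: set exactly one of task or task_file")
    # Copy list defaults so cases never share the same list object.
    case = {k: (list(v) if isinstance(v, list) else v) for k, v in CASE_DEFAULTS.items()}
    case.update(raw)
    return case


case = normalize_case({"id": "smoke-basic", "task": "Say hello."})
print(case["must_succeed"], case["max_turns"], case["tags"])  # True 0 []
```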

Scoring¶

Each case is scored after the session finishes. A case passes only when all of the following hold:

- If `must_succeed: true`, the session completed without error.
- Every string in `expect_keywords` appears in the final assistant message.
- Every pattern in `expect_regex` matches the final assistant message.
- No string in `forbidden_keywords` appears in the final assistant message.
- If `max_turns > 0`, the session did not exceed that many turns.

Failure reasons are printed per-case and included in the JSONL output.
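
The scoring rules above can be sketched as a single Python function. This is a model of the documented rules, not fuseraft's code: `score_case` and its parameters are hypothetical, patterns are assumed to match anywhere in the message (search rather than full match), Python's `re` syntax may differ from the regex engine fuseraft uses, and only the "expected keyword not found" reason text is taken from the JSONL example on this page; the other reason strings are invented for illustration.

```python
import re


def score_case(case: dict, succeeded: bool, final_message: str, total_turns: int):
    """Return (passed, failure_reasons) for one finished session."""
    reasons = []
    text = final_message.lower()

    if case.get("must_succeed", True) and not succeeded:
        reasons.append("session did not complete successfully")

    for keyword in case.get("expect_keywords", []):
        if keyword.lower() not in text:
            reasons.append(f'expected keyword not found: "{keyword}"')

    for pattern in case.get("expect_regex", []):
        # Assumption: a pattern passes if it matches anywhere in the message.
        if re.search(pattern, final_message, re.IGNORECASE) is None:
            reasons.append(f'expected pattern did not match: "{pattern}"')

    for keyword in case.get("forbidden_keywords", []):
        if keyword.lower() in text:
            reasons.append(f'forbidden keyword found: "{keyword}"')

    max_turns = case.get("max_turns", 0)
    if max_turns > 0 and total_turns > max_turns:
        reasons.append(f"exceeded max_turns: {total_turns} > {max_turns}")

    return (not reasons, reasons)


case = {
    "id": "code-generation",
    "expect_keywords": ["def reverse_string", "return"],
    "expect_regex": [r"def reverse_string\("],
    "max_turns": 5,
}
answer = "def reverse_string(s):\n    return s[::-1]"
passed, reasons = score_case(case, succeeded=True, final_message=answer, total_turns=2)
print(passed, reasons)  # True []
```

Collecting every failed check, rather than stopping at the first, mirrors the `failure_reasons` list in the JSONL output, which can hold more than one entry.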

Config resolution¶

The team config used for a case is resolved in this order:

1. `config` field on the case
2. `-c`/`--config` CLI flag
3. `config` field at the suite level
4. `.fuseraft/config/orchestration.yaml` (hardcoded fallback)
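
The resolution order amounts to a "first non-empty value wins" lookup. A minimal Python sketch, with `resolve_config` as a hypothetical helper name:

```python
FALLBACK_CONFIG = ".fuseraft/config/orchestration.yaml"


def resolve_config(case_config=None, cli_config=None, suite_config=None) -> str:
    """Return the first config path that is set, in documented priority order."""
    for candidate in (case_config, cli_config, suite_config):
        if candidate:
            return candidate
    return FALLBACK_CONFIG


# A per-case config wins even when -c/--config is passed on the command line.
print(resolve_config(case_config="specialist.yaml", cli_config="cli.yaml"))  # specialist.yaml
# Without a per-case config, the CLI flag overrides the suite-level default.
print(resolve_config(cli_config="cli.yaml", suite_config="suite.yaml"))      # cli.yaml
# With nothing set, the hardcoded fallback is used.
print(resolve_config())  # .fuseraft/config/orchestration.yaml
```

One consequence worth noting: `-c`/`--config` does not affect cases that set their own `config` field.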

JSONL output¶

When `--output <path>` is given, one JSON object per case is written to that file after the suite completes:

{"case_id":"smoke-basic","session_id":"a1b2c3d4","passed":true,"failure_reasons":[],"total_turns":2,"duration_ms":3120,"total_input_tokens":841,"total_output_tokens":53,"error_message":null}
{"case_id":"code-generation","session_id":"e5f6a7b8","passed":false,"failure_reasons":["expected keyword not found: \"def reverse_string\""],"total_turns":5,"duration_ms":9870,"total_input_tokens":2103,"total_output_tokens":198,"error_message":null}

Field	Description
`case_id`	The `id` from the suite
`session_id`	Short random ID for this run
`passed`	`true` if all scoring criteria passed
`failure_reasons`	List of human-readable failure descriptions
`total_turns`	Number of agent turns used
`duration_ms`	Wall-clock time for this case
`total_input_tokens`	Sum of input tokens across all turns
`total_output_tokens`	Sum of output tokens across all turns
`error_message`	Exception message if the orchestrator threw, otherwise `null`
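
Because each line is a standalone JSON object, the results file is easy to post-process. The sketch below summarises a run using only the fields documented above; `summarize` is a hypothetical helper, and the sample data is the two example records from this page.

```python
import json

# The two example records shown above, one JSON object per line.
SAMPLE = r'''
{"case_id":"smoke-basic","session_id":"a1b2c3d4","passed":true,"failure_reasons":[],"total_turns":2,"duration_ms":3120,"total_input_tokens":841,"total_output_tokens":53,"error_message":null}
{"case_id":"code-generation","session_id":"e5f6a7b8","passed":false,"failure_reasons":["expected keyword not found: \"def reverse_string\""],"total_turns":5,"duration_ms":9870,"total_input_tokens":2103,"total_output_tokens":198,"error_message":null}
'''


def summarize(lines) -> dict:
    """Aggregate pass/fail counts, token totals, and duration from JSONL lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    failed = [r for r in records if not r["passed"]]
    return {
        "total": len(records),
        "passed": len(records) - len(failed),
        "failed_ids": [r["case_id"] for r in failed],
        "input_tokens": sum(r["total_input_tokens"] for r in records),
        "output_tokens": sum(r["total_output_tokens"] for r in records),
        "duration_ms": sum(r["duration_ms"] for r in records),
    }


summary = summarize(SAMPLE.splitlines())
print(summary["passed"], "of", summary["total"], "passed")  # 1 of 2 passed
print(summary["failed_ids"])                                # ['code-generation']
```

For a real run, the same function accepts an open file, for example `summarize(open("results.jsonl"))`, since iterating a file yields its lines.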

CI integration¶

Pass `--ci` to make `fuseraft eval run` exit with code `1` if any case fails. Combined with `--output`, this gives you a full audit trail:

# .github/workflows/eval.yml
- name: Run evals
  run: fuseraft eval run .fuseraft/evals/suite.yaml --ci --output eval-results.jsonl

- name: Upload results
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: eval-results
    path: eval-results.jsonl
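
If you want failures to show up in the GitHub Actions UI as well as in the artifact, the JSONL file can be turned into `::error` workflow commands. The following is a sketch under stated assumptions: `annotations` and `escape_message` are hypothetical helpers, nothing here is provided by fuseraft, and the script would need to run in a step of its own after the eval step.

```python
import json


def escape_message(text: str) -> str:
    """Escape characters that GitHub workflow commands treat specially in messages."""
    return text.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def annotations(lines) -> list:
    """Build one ::error workflow command per failed case."""
    commands = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record["passed"]:
            continue
        detail = (
            "; ".join(record["failure_reasons"])
            or record["error_message"]
            or "no reason recorded"
        )
        message = escape_message(f"{record['case_id']}: {detail}")
        commands.append(f"::error title=Eval case failed::{message}")
    return commands


sample = [
    '{"case_id":"smoke-basic","passed":true,"failure_reasons":[],"error_message":null}',
    '{"case_id":"no-refusal","passed":false,"failure_reasons":["forbidden keyword found"],"error_message":null}',
]
for command in annotations(sample):
    print(command)  # ::error title=Eval case failed::no-refusal: forbidden keyword found
```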

Example suite¶

The file below is the annotated example generated by `fuseraft eval init`. It covers the five main case patterns:

name: Example Eval Suite
config: .fuseraft/config/orchestration.yaml

cases:
  # Smoke test — quick sanity check that the team responds at all.
  - id: smoke-basic
    task: "Say hello and confirm you are ready."
    must_succeed: true
    expect_keywords:
      - hello
    max_turns: 3
    tags:
      - smoke

  # Keyword + regex check — verify code generation output.
  - id: code-generation
    task: "Write a Python function named reverse_string that returns the reverse of its input."
    must_succeed: true
    expect_keywords:
      - def reverse_string
      - return
    expect_regex:
      - "def reverse_string\\("
    max_turns: 5
    tags:
      - coding

  # Forbidden-keyword guard — catch undesirable response patterns.
  - id: no-refusal
    task: "List three benefits of automated testing."
    must_succeed: true
    forbidden_keywords:
      - "I cannot"
      - "I'm unable"
      - "I am unable"
    tags:
      - quality

  # Task from file — useful for long or multi-line prompts.
  - id: file-task
    task_file: .fuseraft/evals/tasks/my-task.txt
    must_succeed: true
    max_turns: 10
    tags:
      - file-task

  # Per-case config override — run against a different team.
  - id: specialist-check
    config: .fuseraft/config/specialist.yaml
    task: "Explain the role of a load balancer in two sentences."
    must_succeed: true
    expect_keywords:
      - load balancer
    tags:
      - routing