Benchmarks

This page presents evaluation results for the SMT2 and JSON backends on five logical reasoning datasets, using an Azure GPT-5 deployment.

Methodology

The evaluation follows a consistent methodology across all datasets.

Model: Azure GPT-5 deployment

Configuration:

  • max_attempts=3 (retry with error feedback)
  • verify_timeout=10000ms
  • optimize_timeout=100000ms (JSON backend only)
  • num_workers=10 (ThreadPoolExecutor for parallel processing)

Metrics (computed via sklearn.metrics; see the sketch after this list):

  • Accuracy: accuracy_score(y_true, y_pred)
  • Precision: precision_score(y_true, y_pred, zero_division=0)
  • Recall: recall_score(y_true, y_pred, zero_division=0)
  • F1: 2 * (precision * recall) / (precision + recall)
  • Success Rate: (total - failed) / total
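
A minimal sketch of the metric computation, assuming boolean labels and predictions collected from the successfully executed samples (the example values below are illustrative):

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [True, False, True, True]   # ground-truth answers (illustrative)
y_pred = [True, False, False, True]  # predicted answers (illustrative)
total, failed = 100, 2               # attempted samples vs. failed generations (illustrative)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0.0
success_rate = (total - failed) / total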

Execution: The experiments_pipeline.py script runs all benchmarks sequentially, modifying the BACKEND variable in each benchmark/bench_*.py script via regex substitution.
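
A hedged sketch of that substitution step (the exact pattern used in experiments_pipeline.py may differ):

import re
from pathlib import Path

def set_backend(script_path: str, backend: str) -> None:
    """Rewrite the module-level BACKEND assignment in a benchmark script."""
    source = Path(script_path).read_text()
    source = re.sub(
        r'^BACKEND\s*=\s*["\'].*?["\']',
        f'BACKEND = "{backend}"',
        source,
        flags=re.MULTILINE,
    )
    Path(script_path).write_text(source)

set_backend("benchmark/bench_folio.py", "smt2")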

Results

Results from the most recent benchmark run.

Last Updated: 2025-10-16 18:14:07

Benchmark      Backend  Samples  Accuracy  Precision  Recall  F1 Score  Success Rate
ProntoQA       SMT2     100      100.00%   1.0000     1.0000  1.0000    100.00%
FOLIO          SMT2     100      69.00%    0.6949     0.7736  0.7321    99.00%
ProofWriter    SMT2     96       98.96%    1.0000     1.0000  1.0000    98.96%
ConditionalQA  SMT2     100      83.00%    0.9375     0.8219  0.8759    100.00%
StrategyQA     SMT2     100      84.00%    0.8205     0.7805  0.8000    100.00%
ProntoQA       JSON     100      99.00%    1.0000     0.9815  0.9907    100.00%
FOLIO          JSON     100      76.00%    0.7619     0.9412  0.8421    94.00%
ProofWriter    JSON     96       95.83%    1.0000     1.0000  1.0000    95.83%
ConditionalQA  JSON     100      76.00%    0.9180     0.8750  0.8960    89.00%
StrategyQA     JSON     100      68.00%    0.7500     0.7895  0.7692    86.00%

Dataset Characteristics

Each dataset tests different aspects of logical reasoning.

ProntoQA

ProntoQA features synthetic first-order logic problems with deterministic inference.

Example:

Facts: "Stella is a lion. All lions are brown."
Question: "Is Stella brown?"
Answer: True
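
For intuition, here is a minimal sketch using the Z3 Python API of how such an example reduces to a satisfiability check by refuting the negated conjecture; it only illustrates the idea and is not the program either backend generates:

from z3 import Bool, Implies, Not, Solver, unsat

stella_is_lion = Bool("stella_is_lion")
stella_is_brown = Bool("stella_is_brown")

s = Solver()
s.add(stella_is_lion)                            # "Stella is a lion."
s.add(Implies(stella_is_lion, stella_is_brown))  # "All lions are brown." (instantiated for Stella)
s.add(Not(stella_is_brown))                      # negate the conjecture
print("Answer: True" if s.check() == unsat else "Answer: not provable")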

Performance:

  • SMT2: 100% (100/100)
  • JSON: 99% (99/100)

Both backends achieve near-perfect results, making this the simplest dataset in the benchmark suite.

FOLIO

FOLIO presents first-order logic problems derived from Wikipedia articles.

Characteristics: complex nested quantifiers and longer inference chains.

Performance:

  • SMT2: 69% (69/100)
  • JSON: 76% (76/100)

JSON outperforms SMT2 by 7 percentage points on this dataset, the most challenging in the suite. However, JSON's lower success rate (94% vs. 99%) indicates greater difficulty in program generation.

ProofWriter

ProofWriter tests deductive reasoning over explicit facts and rules.

Example:

Facts: "The bear is red. If something is red, then it is kind."
Question: "Is the bear kind?"
Answer: True

Performance:

  • SMT2: 98.96% (95/96)
  • JSON: 95.83% (92/96)

Both backends achieve high accuracy on this dataset, with SMT2 holding an edge of roughly 3 percentage points.

ConditionalQA

ConditionalQA focuses on conditional reasoning with if-then statements.

Performance:

  • SMT2: 83% (83/100)
  • JSON: 76% (76/100)

SMT2 demonstrates better accuracy (+7 percentage points) and a higher success rate (100% vs. 89%).

StrategyQA

StrategyQA tests multi-hop reasoning that requires implicit world knowledge.

Example:

Question: "Would a vegetarian eat a burger made of plants?"
Answer: True (requires knowing: vegetarians avoid meat, plant burgers have no meat)

Performance:

  • SMT2: 84% (84/100)
  • JSON: 68% (68/100)

This dataset shows the largest performance gap, with SMT2 leading by 16 percentage points. SMT2 also maintains a 100% success rate, while JSON's drops to 86%.

Analysis

Accuracy Summary

Aggregating results across all datasets:

  • SMT2: 87.0% average accuracy
  • JSON: 83.0% average accuracy

SMT2 leads on 4 of the 5 datasets; FOLIO is the exception, where JSON leads by 7 percentage points.
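
These averages can be recomputed directly from the results table above (unweighted mean over the five datasets):

smt2_acc = [100.00, 69.00, 98.96, 83.00, 84.00]  # per-dataset SMT2 accuracies (%)
json_acc = [99.00, 76.00, 95.83, 76.00, 68.00]   # per-dataset JSON accuracies (%)
print(f"SMT2: {sum(smt2_acc) / len(smt2_acc):.1f}%")  # 87.0%
print(f"JSON: {sum(json_acc) / len(json_acc):.1f}%")  # 83.0%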

Success Rate Summary

The success rate measures program generation and execution reliability:

  • SMT2: 99.6% average (range: 98.96-100%)
  • JSON: 93.0% average (range: 86-100%)

SMT2 demonstrates more reliable program generation and execution overall. The greater variance in JSON's success rate points to LLM program-generation difficulties on certain datasets.

Failure Modes

Understanding failure modes helps identify areas for improvement.

SMT2 failures:

  • Program extraction from markdown: regex mismatch
  • Z3 subprocess timeout (rare with 10s limit)
  • Invalid SMT-LIB syntax (caught by Z3 parser)

JSON failures:

  • JSON parsing errors after extraction
  • Invalid sort references (e.g., undefined Person sort)
  • Expression evaluation errors in ExpressionParser.parse_expression()
  • Z3 Python API exceptions

Reproducing Results

Full benchmark suite

To run the complete benchmark suite:

python experiments_pipeline.py

This generates:

  • results/benchmark_results.json - Raw metrics data
  • results/benchmark_results.md - Formatted markdown table
  • Updates README.md between the <!-- BENCHMARK_RESULTS_START --> and <!-- BENCHMARK_RESULTS_END --> markers (a sketch of this step follows)
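
A hedged sketch of the README update step; the actual code in experiments_pipeline.py may differ:

import re
from pathlib import Path

def update_readme(table_md: str, readme_path: str = "README.md") -> None:
    """Replace the content between the benchmark markers with a fresh table."""
    text = Path(readme_path).read_text()
    pattern = r"<!-- BENCHMARK_RESULTS_START -->.*?<!-- BENCHMARK_RESULTS_END -->"
    replacement = (
        "<!-- BENCHMARK_RESULTS_START -->\n"
        + table_md
        + "\n<!-- BENCHMARK_RESULTS_END -->"
    )
    text = re.sub(pattern, lambda _: replacement, text, flags=re.DOTALL)
    Path(readme_path).write_text(text)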

Single benchmark

To run just one benchmark:

python benchmark/bench_strategyqa.py

You'll need to set the BACKEND variable in the script to either smt2 or json.
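
For example (the exact placement of the assignment within the script is illustrative):

BACKEND = "smt2"  # switch to "json" to run the JSON backend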

Custom evaluation

For custom evaluation on your own dataset:

from utils.azure_config import get_client_config
from z3adapter.reasoning import ProofOfThought, EvaluationPipeline

config = get_client_config()  # dict that includes the configured Azure LLM client
pot = ProofOfThought(llm_client=config["llm_client"], backend="smt2")  # or backend="json"
evaluator = EvaluationPipeline(proof_of_thought=pot)

result = evaluator.evaluate(
    dataset="data/strategyQA_train.json",
    max_samples=100
)

print(f"Accuracy: {result.metrics.accuracy:.2%}")
print(f"Precision: {result.metrics.precision:.4f}")
print(f"Recall: {result.metrics.recall:.4f}")
print(f"F1: {result.metrics.f1_score:.4f}")

Dataset Sources

The benchmark datasets are located in the data/ directory:

  • ProntoQA: data/prontoqa_test.json
  • FOLIO: data/folio_test.json
  • ProofWriter: data/proof_writer_test.json
  • ConditionalQA: data/conditionalQA_test.json
  • StrategyQA: data/strategyQA_train.json

All datasets follow the same format: JSON arrays with question and answer fields (boolean values).
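
A minimal illustrative example of this shared format (the two entries are adapted from the examples above; real files may contain additional fields):

import json

samples = json.loads("""[
  {"question": "Is Stella brown?", "answer": true},
  {"question": "Is the bear kind?", "answer": true}
]""")
print(samples[0]["question"], samples[0]["answer"])  # Is Stella brown? True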

Implementation Notes

Parallel Processing

Benchmark scripts use num_workers=10 with ThreadPoolExecutor for parallel processing. Note that ProcessPoolExecutor cannot be used due to ProofOfThought being unpicklable.
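
A minimal sketch of this pattern, assuming the ProofOfThought instance exposes a query(question) method (the method name is an assumption, not confirmed by this page):

from concurrent.futures import ThreadPoolExecutor, as_completed

def run_parallel(pot, samples, num_workers=10):
    """Evaluate samples concurrently; all threads share the single ProofOfThought instance."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = {executor.submit(pot.query, sample["question"]): idx
                   for idx, sample in enumerate(samples)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return [results[idx] for idx in range(len(samples))]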

Caching

Setting skip_existing=True enables resumption of interrupted runs (see the sketch after this list). Results are cached as:

  • output/{backend}_evaluation_{dataset}/{sample_id}_result.json
  • output/{backend}_programs_{dataset}/{sample_id}_program.{ext}
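
A hedged sketch of the resumption logic implied by skip_existing; the helper name and exact check are illustrative:

import os

def should_skip(backend: str, dataset: str, sample_id: str,
                skip_existing: bool = True) -> bool:
    """Skip a sample if its cached result file from a previous run already exists."""
    result_path = os.path.join(
        "output", f"{backend}_evaluation_{dataset}", f"{sample_id}_result.json"
    )
    return skip_existing and os.path.exists(result_path)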

Timeout Handling

The experiments_pipeline.py script sets a 1-hour subprocess timeout for each benchmark. Individual Z3 verification calls timeout at 10 seconds, while optimization calls timeout at 100 seconds.
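
A hedged sketch of how such a per-benchmark cap can be enforced with the standard library; the actual invocation in experiments_pipeline.py may differ:

import subprocess

try:
    subprocess.run(
        ["python", "benchmark/bench_strategyqa.py"],
        check=True,
        timeout=3600,  # 1-hour cap per benchmark subprocess
    )
except subprocess.TimeoutExpired:
    print("Benchmark exceeded the 1-hour limit; continuing with the next one.")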