Examples¶
This page demonstrates common usage patterns through example scripts.
All examples are located in the examples/ directory and should be run from the project root:
python examples/{script}.py
Basic Query¶
The simplest way to use ProofOfThought is through a single query.
File: examples/simple_usage.py
import os
from openai import OpenAI
from z3adapter.reasoning import ProofOfThought
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pot = ProofOfThought(llm_client=client, model="gpt-4o")
result = pot.query("Would Nancy Pelosi publicly denounce abortion?")
print(result.answer) # False
Azure OpenAI¶
For Azure OpenAI deployments, use the provided configuration utility.
File: examples/azure_simple_example.py
from utils.azure_config import get_client_config
from z3adapter.reasoning import ProofOfThought
config = get_client_config()
pot = ProofOfThought(llm_client=config["llm_client"], model=config["model"])
result = pot.query("Can fish breathe underwater?")
print(result.answer) # True
Backend Comparison¶
You can compare how the two backends perform on the same question.
File: examples/backend_comparison.py
config = get_client_config()
question = "Can fish breathe underwater?"
pot_json = ProofOfThought(llm_client=config["llm_client"], backend="json")
pot_smt2 = ProofOfThought(llm_client=config["llm_client"], backend="smt2")
result_json = pot_json.query(question)
result_smt2 = pot_smt2.query(question)
print(f"JSON: {result_json.answer}")
print(f"SMT2: {result_smt2.answer}")
Batch Evaluation¶
For evaluating multiple questions from a dataset, use the evaluation pipeline.
File: examples/batch_evaluation.py
import os
from openai import OpenAI
from z3adapter.reasoning import EvaluationPipeline, ProofOfThought
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pot = ProofOfThought(llm_client=client)
evaluator = EvaluationPipeline(proof_of_thought=pot, output_dir="results/")
result = evaluator.evaluate(
    dataset="data/strategyQA_train.json",
    question_field="question",
    answer_field="answer",
    max_samples=100
)
print(f"Accuracy: {result.metrics.accuracy:.2%}")
print(f"F1 Score: {result.metrics.f1_score:.4f}")
Azure + SMT2 Evaluation¶
This example combines Azure OpenAI with the SMT2 backend for batch evaluation.
File: examples/batch_evaluation_smt2_azure.py
config = get_client_config()
pot = ProofOfThought(
    llm_client=config["llm_client"],
    model=config["model"],
    backend="smt2"
)
evaluator = EvaluationPipeline(proof_of_thought=pot)
result = evaluator.evaluate("data/strategyQA_train.json", max_samples=50)
Full Benchmark Suite¶
For comprehensive benchmarking, the experiments pipeline runs all datasets with both backends.
File: experiments_pipeline.py
This script runs all 5 benchmarks (ProntoQA, FOLIO, ProofWriter, ConditionalQA, StrategyQA) with both backends:
python experiments_pipeline.py
Implementation details:
- Modifies benchmark/bench_*.py files to set the backend via regex substitution
- Runs each benchmark script as a subprocess with a 1-hour timeout
- Collects metrics from output/{backend}_evaluation_{benchmark}/ directories
- Generates a markdown table and updates README.md with results
A simplified sketch of this substitute-and-run flow appears after the configuration below.
Configuration (experiments_pipeline.py:29-41):
BENCHMARKS = {
    "prontoqa": "benchmark/bench_prontoqa.py",
    "folio": "benchmark/bench_folio.py",
    "proofwriter": "benchmark/bench_proofwriter.py",
    "conditionalqa": "benchmark/bench_conditionalqa.py",
    "strategyqa": "benchmark/bench_strategyqa.py",
}
BACKENDS = ["smt2", "json"]
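The snippet below is a simplified sketch of that substitute-and-run flow, not the actual experiments_pipeline.py code: the exact BACKEND assignment pattern and the result handling are assumed here for illustration.
import re
import subprocess
# Simplified sketch: rewrite each benchmark's BACKEND assignment, then run it.
# The regex pattern below is an assumption about how the scripts declare BACKEND.
for backend in BACKENDS:
    for benchmark, script in BENCHMARKS.items():
        with open(script) as f:
            source = f.read()
        source = re.sub(r'BACKEND\s*=\s*"[^"]*"', f'BACKEND = "{backend}"', source)
        with open(script, "w") as f:
            f.write(source)
        # Each benchmark runs as a subprocess with a 1-hour timeout.
        subprocess.run(["python", script], timeout=3600, check=True)
        # Metrics are then collected from output/{backend}_evaluation_{benchmark}/.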
Benchmark Script Structure¶
Individual benchmark scripts follow a common pattern, illustrated here with StrategyQA.
File: benchmark/bench_strategyqa.py
config = get_client_config()
pot = ProofOfThought(
    llm_client=config["llm_client"],
    model=config["model"],
    backend=BACKEND,  # Modified by experiments_pipeline.py
    max_attempts=3,
    cache_dir=f"output/{BACKEND}_programs_strategyqa",
)
evaluator = EvaluationPipeline(
    proof_of_thought=pot,
    output_dir=f"output/{BACKEND}_evaluation_strategyqa",
    num_workers=10,  # ThreadPoolExecutor for parallel processing
)
result = evaluator.evaluate(
    dataset="data/strategyQA_train.json",
    id_field="qid",
    max_samples=100,
    skip_existing=True,  # Resume interrupted runs
)
Dataset Format¶
Datasets should be formatted as JSON arrays of objects:
[
  {
    "question": "Can fish breathe underwater?",
    "answer": true
  },
  {
    "question": "Do humans have wings?",
    "answer": false
  }
]
You can optionally include an ID field:
{"qid": "sample_123", "question": "...", "answer": true}
Use the question_field, answer_field, and id_field parameters to specify custom field names.
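For example, a dataset whose records use query, label, and id keys (hypothetical names, as is the data/my_dataset.json path) could be evaluated like this:
result = evaluator.evaluate(
    dataset="data/my_dataset.json",
    question_field="query",
    answer_field="label",
    id_field="id",
    max_samples=100,
)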
Saving Programs¶
To save generated programs to disk for inspection:
result = pot.query(
    "Can fish breathe underwater?",
    save_program=True,
    program_path="output/my_program.smt2"
)
If you don't specify a path, the default is: {cache_dir}/{auto_generated}{ext}
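Since the saved program is a plain text file, you can print it for a quick look; this simply reads back the program_path used above.
with open("output/my_program.smt2") as f:
    print(f.read())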
Advanced Configuration¶
For more control over the reasoning process, you can customize various parameters:
pot = ProofOfThought(
    llm_client=client,
    model="gpt-5",
    backend="smt2",
    max_attempts=5,        # More retries
    verify_timeout=20000,  # 20s timeout
    z3_path="/custom/z3"   # Custom Z3 binary
)