ProofOfThought

ProofOfThought provides LLM-guided translation of natural language questions into formal logic, which is then verified using the Z3 theorem prover.

Architecture

The system follows a multi-stage pipeline to transform questions into verifiable answers:

Question (NL)
    ↓
LLM Translation (few-shot prompting)
    ↓
Formal Program (SMT-LIB 2.0 or JSON DSL)
    ↓
Z3 Execution
    ↓
SAT/UNSAT → Boolean Answer
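
As a toy illustration of the last two stages, the sketch below builds a tiny formal program directly with the Z3 Python API (the z3-solver package) and maps the solver's verdict to a boolean. The propositions and rule are invented for illustration; they are not what the LLM actually generates.

from z3 import Bool, Implies, Not, Solver

# Invented toy facts and rule standing in for a generated formal program
supports_abortion_rights = Bool("supports_abortion_rights")
denounces_abortion = Bool("denounces_abortion")

solver = Solver()
solver.add(supports_abortion_rights)                                    # background fact
solver.add(Implies(supports_abortion_rights, Not(denounces_abortion)))  # rule
solver.add(denounces_abortion)                                          # conjecture under test

verdict = solver.check()          # unsat: the conjecture contradicts the facts
answer = str(verdict) == "sat"    # False, matching the mapping above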

Components

The architecture consists of several key components that work together:

Z3ProgramGenerator (z3adapter.reasoning.program_generator)
Provides the LLM interface for program generation. It extracts formal programs from markdown code blocks using regex and supports error feedback through multi-turn conversations.
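
A minimal sketch of the extraction step, assuming the program arrives in a fenced markdown code block; the regex, the smt2/json tags, and the function name are illustrative rather than the package's exact code:

import re

def extract_program(llm_response: str) -> str | None:
    # Capture the body of the first fenced code block (optionally tagged smt2/json)
    match = re.search(r"```(?:smt2|json)?\s*\n(.*?)```", llm_response, re.DOTALL)
    return match.group(1).strip() if match else None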

Backend (z3adapter.backends.abstract)
Defines an abstract interface with execute(program_path) → VerificationResult (sketched after the list below). Two concrete implementations are available:

  • SMT2Backend: Subprocess call to Z3 CLI. Parses stdout/stderr for sat/unsat via regex (?<!un)\bsat\b and \bunsat\b.
  • JSONBackend: Python API execution via Z3JSONInterpreter. Returns structured SAT/UNSAT counts.
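
The interface can be pictured roughly as follows; the VerificationResult fields shown are illustrative assumptions rather than the package's exact attributes:

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class VerificationResult:
    sat_count: int      # number of verifications that came back sat
    unsat_count: int    # number of verifications that came back unsat
    output: str         # raw solver output / error text for feedback

class Backend(ABC):
    @abstractmethod
    def execute(self, program_path: str) -> VerificationResult:
        """Run the formal program stored at program_path through Z3."""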

Z3JSONInterpreter (z3adapter.interpreter)
Implements a multi-stage pipeline for processing the JSON DSL (the first stage is sketched after the list):

  1. SortManager: Performs topological sorting of type dependencies and creates Z3 sorts
  2. ExpressionParser: Evaluates expressions using eval() with restricted globals for security
  3. Verifier: Runs solver.check(condition) for each verification
  4. Finally, the interpreter returns the aggregate SAT/UNSAT counts across all verifications
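
To illustrate the first stage: a sort that depends on others (for example an ArraySort) must be created only after its domain and range sorts exist. A minimal sketch using Python's graphlib, with hypothetical sort names:

from graphlib import TopologicalSorter
import z3

# Hypothetical dependency graph: each sort lists the sorts it depends on
sort_dependencies = {
    "Person": [],
    "Score": [],
    "Scores": ["Person", "Score"],   # ArraySort(Person, Score)
}

sorts: dict[str, z3.SortRef] = {}
for name in TopologicalSorter(sort_dependencies).static_order():
    if name == "Person":
        sorts[name] = z3.DeclareSort("Person")   # uninterpreted custom sort
    elif name == "Score":
        sorts[name] = z3.IntSort()               # built-in sort
    else:
        sorts[name] = z3.ArraySort(sorts["Person"], sorts["Score"])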

ProofOfThought (z3adapter.reasoning.proof_of_thought)
Provides the high-level API with a retry loop (default max_attempts=3) and error feedback. Answer determination follows: SAT only → True, UNSAT only → False, both/neither → None.
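
The control flow can be sketched roughly as follows, reusing the hypothetical VerificationResult fields from the backend sketch above; generate_program and backend are passed in as stand-ins for the generator and backend components, and the exception-based feedback shown is an assumption, not the package's exact mechanism:

def query(question, generate_program, backend, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        program_path = generate_program(question, feedback)   # LLM translation step
        try:
            result = backend.execute(program_path)            # Z3 execution step
        except Exception as exc:                              # e.g. malformed program
            feedback = str(exc)                               # feed the error back to the LLM
            continue
        if result.sat_count > 0 and result.unsat_count == 0:
            return True       # SAT only
        if result.unsat_count > 0 and result.sat_count == 0:
            return False      # UNSAT only
        return None           # both or neither
    return None               # all attempts failed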

Quick Start

from openai import OpenAI
from z3adapter.reasoning import ProofOfThought

client = OpenAI(api_key="...")
pot = ProofOfThought(llm_client=client, backend="smt2")
result = pot.query("Would Nancy Pelosi publicly denounce abortion?")
# result.answer: False (UNSAT)

Benchmark Results

ProofOfThought has been evaluated on multiple reasoning datasets using the following configuration:

  • Datasets: ProntoQA, FOLIO, ProofWriter, ConditionalQA, StrategyQA
  • Model: GPT-5 (Azure deployment)
  • Config: max_attempts=3, verify_timeout=10000ms

Backend   Avg Accuracy   Success Rate
SMT2      86.8%          99.4%
JSON      82.8%          92.8%

The SMT2 backend outperforms JSON on 4 out of 5 datasets. For detailed results, see Benchmarks.

Design Rationale

Several key design decisions shape the architecture:

  • Why use an external theorem prover? LLMs lack deductive closure, meaning they cannot guarantee sound logical inference. Z3 provides this soundness by formally verifying the logical reasoning.

  • Why offer two backends? The choice trades off portability (SMT-LIB is a widely-supported standard) against LLM generation reliability (structured JSON is easier for models to produce correctly).

  • Why use iterative refinement? Single-shot generation is often insufficient for complex reasoning. By incorporating error feedback, the system significantly improves its success rate.

Implementation Notes

Each backend has distinct implementation characteristics:

SMT2 Backend:

  • Runs Z3 as a subprocess with the -T:timeout flag
  • Parses output using regex patterns on stdout/stderr
  • Uses standard SMT-LIB 2.0 S-expressions
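
A rough sketch of that flow, assuming the z3 binary is on PATH and that -T takes seconds (hence the millisecond conversion); the helper name is illustrative:

import re
import subprocess

def run_smt2(program_path: str, timeout_ms: int = 10000) -> str:
    proc = subprocess.run(
        ["z3", f"-T:{timeout_ms // 1000}", program_path],   # hard timeout in seconds
        capture_output=True,
        text=True,
    )
    output = proc.stdout + proc.stderr
    if re.search(r"\bunsat\b", output):
        return "unsat"
    if re.search(r"(?<!un)\bsat\b", output):
        return "sat"
    return "unknown"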

JSON Backend:

  • Leverages the Python Z3 API through the z3-solver package
  • Evaluates expressions using restricted eval() with ExpressionValidator
  • Supports built-in sorts: BoolSort, IntSort, RealSort
  • Supports custom sorts: DeclareSort, EnumSort, BitVecSort, ArraySort
  • Handles quantifiers: ForAll and Exists with proper variable binding
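
The same constructs are available directly through the z3-solver API. The toy example below (an invented sort and predicate, not part of the DSL) shows an EnumSort together with a universally quantified axiom:

from z3 import BoolSort, Const, EnumSort, ForAll, Function, Solver

# Custom enumerated sort and a boolean-valued function over it
Animal, (cat, dog) = EnumSort("Animal", ["cat", "dog"])
is_pet = Function("is_pet", Animal, BoolSort())

x = Const("x", Animal)               # bound variable for the quantifier
solver = Solver()
solver.add(ForAll([x], is_pet(x)))   # universally quantified axiom
solver.add(is_pet(dog))              # consistent instance

print(solver.check())                # sat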

Security:

The JSON backend employs ExpressionValidator.safe_eval() with a whitelist of allowed Z3 operators, preventing arbitrary code execution.
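
The idea can be sketched as follows; the whitelist contents and the function signature are illustrative, not the actual ExpressionValidator code:

import z3

# Illustrative whitelist of Z3 operators the evaluator may reference
ALLOWED_OPERATORS = {
    "And": z3.And, "Or": z3.Or, "Not": z3.Not,
    "Implies": z3.Implies, "ForAll": z3.ForAll, "Exists": z3.Exists,
}

def safe_eval(expression: str, declared_symbols: dict):
    namespace = {"__builtins__": {}}      # cut off access to Python builtins
    namespace.update(ALLOWED_OPERATORS)
    namespace.update(declared_symbols)    # constants, functions, bound variables
    return eval(expression, namespace)    # only whitelisted names resolve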

For more details, see Backends and API Reference.