Backends¶

ProofOfThought supports two execution backends for Z3: the standard SMT-LIB 2.0 format and a custom JSON DSL.

SMT2Backend¶

The SMT2 backend leverages Z3's standard command-line interface.

Implementation: z3adapter/backends/smt2_backend.py

Execution¶

subprocess.run([z3_path, f"-T:{timeout_seconds}", program_path])

The execution process involves:

Running Z3 as a CLI subprocess with a timeout flag
Applying a hard timeout of timeout_seconds + 10 to prevent hanging
Capturing output from both stdout and stderr

Result Parsing¶

sat_pattern = r"(?<!un)\bsat\b"      # Negative lookbehind to exclude "unsat"
unsat_pattern = r"\bunsat\b"

The parser counts occurrences in Z3 output and applies the following answer logic:

sat_count > 0, unsat_count == 0 → True
unsat_count > 0, sat_count == 0 → False
Otherwise → None

Prompt Template¶

The prompt template guides LLM program generation.

Source: z3adapter/reasoning/smt2_prompt_template.py

The template provides instructions for generating SMT-LIB 2.0 programs with these key requirements: - All commands as S-expressions: (command args...) - Declare sorts before use - Single (check-sat) per program - Semantic: sat = constraint satisfiable, unsat = contradicts knowledge base

File Extension¶

.smt2

JSON Backend¶

The JSON backend uses Z3's Python API for direct programmatic access.

Implementation: z3adapter/backends/json_backend.py

Execution Pipeline¶

interpreter = Z3JSONInterpreter(program_path, verify_timeout, optimize_timeout)
interpreter.run()
sat_count, unsat_count = interpreter.get_verification_counts()

Z3JSONInterpreter Pipeline¶

The interpreter processes JSON programs through three main stages:

Step 1: SortManager (z3adapter/dsl/sorts.py)

First, the system topologically sorts type definitions to handle dependencies, then creates Z3 sorts:

Built-in: BoolSort(), IntSort(), RealSort() (pre-defined)
Custom: DeclareSort(name), EnumSort(name, values), BitVecSort(n), ArraySort(domain, range)

For example, an ArraySort creates dependencies:

{"name": "IntArray", "type": "ArraySort(IntSort, IntSort)"}

This requires IntSort to be defined first (fortunately, it's built-in) before creating IntArray.

Step 2: ExpressionParser (z3adapter/dsl/expressions.py)

Next, the parser evaluates logical expressions from strings using a restricted eval():

safe_globals = {**Z3_OPERATORS, **functions}
context = {**functions, **constants, **variables, **quantified_vars}
ExpressionValidator.safe_eval(expr_str, safe_globals, context)

Only whitelisted operators are permitted:

Z3_OPERATORS = {
    "And", "Or", "Not", "Implies", "If", "Distinct",
    "Sum", "Product", "ForAll", "Exists", "Function", "Array", "BitVecVal"
}

Step 3: Verifier (z3adapter/verification/verifier.py)

Finally, the verifier tests each verification condition:

result = solver.check(condition)  # Adds condition as hypothesis to KB
if result == sat:
    sat_count += 1
elif result == unsat:
    unsat_count += 1

Verification Semantics: When calling solver.check(φ), the system asks: "Is KB ∧ φ satisfiable?"

SAT: φ is consistent with the knowledge base (possible scenario)
UNSAT: φ contradicts the knowledge base (impossible scenario)

Prompt Template¶

The JSON prompt template is more comprehensive than its SMT2 counterpart.

Source: z3adapter/reasoning/prompt_template.py

This 546-line specification of the JSON DSL includes these key sections:

Sorts:

{"name": "Person", "type": "DeclareSort"}

Functions:

{"name": "supports", "domain": ["Person", "Issue"], "range": "BoolSort"}

Constants:

{"persons": {"sort": "Person", "members": ["nancy_pelosi"]}}

Variables: Free variables for quantifier binding:

{"name": "p", "sort": "Person"}

Knowledge Base:

["ForAll([p], Implies(is_democrat(p), supports_abortion(p)))"]

Verifications: The DSL supports three types of verifications:

Simple constraint:

{"name": "test", "constraint": "supports_abortion(nancy)"}

Existential:

{"name": "test", "exists": [{"name": "x", "sort": "Int"}], "constraint": "x > 0"}

Universal:

{"name": "test", "forall": [{"name": "x", "sort": "Int"}],
 "implies": {"antecedent": "x > 0", "consequent": "x >= 1"}}

Critical constraint: The prompt enforces a single verification per question to avoid ambiguous results from testing both φ and ¬φ.

File Extension¶

.json

Benchmark Performance¶

Performance comparison across datasets reveals notable differences between the backends.

Results from experiments_pipeline.py (100 samples per dataset, GPT-5, max_attempts=3):

Dataset	SMT2 Accuracy	JSON Accuracy	SMT2 Success	JSON Success
ProntoQA	100%	99%	100%	100%
FOLIO	69%	76%	99%	94%
ProofWriter	99%	96%	99%	96%
ConditionalQA	83%	76%	100%	89%
StrategyQA	84%	68%	100%	86%

Success Rate represents the percentage of queries that complete without error (including both generation and execution).

Overall, SMT2 achieves higher accuracy on 4 out of 5 datasets, while JSON shows greater success rate variance (86-100% compared to SMT2's 99-100%).

Implementation Differences¶

The backends differ in several implementation details.

Program Generation¶

SMT2: Extracts programs from markdown via:

pattern = r"```smt2\s*([\s\S]*?)\s*```"

JSON: Extracts and parses via:

pattern = r"```json\s*(\{[\s\S]*?\})\s*```"
json.loads(match.group(1))

Error Handling¶

Error handling varies significantly between backends.

SMT2: - Subprocess timeout → TimeoutExpired - Parse errors → regex mismatch → answer=None - Z3 errors in stderr → still parsed

JSON: - JSON parse error → extraction failure - Z3 Python API exception → caught in try/except - Invalid sort reference → ValueError during SortManager - Expression eval error → ValueError during ExpressionParser

Timeout Configuration¶

Timeout handling differs between the two backends.

SMT2: - Uses a single timeout parameter: verify_timeout (ms) - Converts to seconds for Z3 CLI: verify_timeout // 1000 - Applies a hard subprocess timeout: timeout_seconds + 10

JSON: - Uses two separate timeouts: verify_timeout (ms) and optimize_timeout (ms) - Sets timeout via solver.set("timeout", verify_timeout) in Verifier - Timeout applies per individual solver.check() call

Backend Selection Code¶

The system selects backends at runtime based on configuration:

if backend == "json":
    from z3adapter.backends.json_backend import JSONBackend
    backend_instance = JSONBackend(verify_timeout, optimize_timeout)
else:  # smt2
    from z3adapter.backends.smt2_backend import SMT2Backend
    backend_instance = SMT2Backend(verify_timeout, z3_path)

File: z3adapter/reasoning/proof_of_thought.py:78-90

Prompt Selection¶

The appropriate prompt template is chosen based on the selected backend:

if self.backend == "json":
    prompt = build_prompt(question)
else:  # smt2
    prompt = build_smt2_prompt(question)

File: z3adapter/reasoning/program_generator.py:78-81

Both prompts include few-shot examples and format specifications. The SMT2 prompt emphasizes S-expression syntax, while the JSON prompt provides detailed guidance on variable scoping and quantifier semantics.