
SpecParse

Status: active

Structurally indexed REPL sessions for agent-driven document analysis. Docling converts PDFs into provenance-rich blocks with section hierarchy, table structure, and element IDs. The calling agent navigates the indexed artifacts, dispatches sub-LLM calls for parallel extraction, and assembles structured output — no rigid schemas, no chunking, no embedding retrieval. The agent decides what to look at.

Claude · Codex · MCP · Docling · Modal · Python

Why Not Direct PDF or RAG

Engineering specifications are table-heavy, cross-referenced, and multi-document. The two standard approaches each fail in a specific way; structural indexing avoids both failure modes:

DIRECT PDF PARSING

No persistent index. A 919-page document requires ~46 sequential reads at 20 pages each, with no way to skip to section 9 without scanning. Cross-referencing a design value on page 5 against a contractual limit on page 600 means re-reading both. No structured table extraction — complex engineering tables with merged cells, units in headers, and footnotes are interpreted visually, not structurally.

RAG (CHUNK + EMBED + RETRIEVE)

Chunking destroys table structure — a 20-row design parameter table becomes 20 disconnected fragments that lose their column headers and units. Retrieval ranks by embedding similarity, not document structure: when you need the complete “Membrane Bioreactor” section as a coherent unit, you get the top-5 most similar 500-token fragments scattered across 3 documents. No native cross-document comparison for the same parameter.
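A toy illustration of the first failure mode (not SpecParse code): fixed-size chunking over a flattened design parameter table leaves only the first chunk with the header row, so every later fragment carries values with no column names or units.

```python
# Toy demo: naive fixed-size chunking severs table rows from their header.
table = (
    "Parameter | Unit | Design Value\n"
    "BOD       | mg/L | 250\n"
    "COD       | mg/L | 600\n"
    "TSS       | mg/L | 300\n"
)

chunk_size = 40  # arbitrary small window, stands in for a ~500-token chunk
chunks = [table[i:i + chunk_size] for i in range(0, len(table), chunk_size)]

# Only the chunk containing the header row still says what the numbers mean.
chunks_with_header = [c for c in chunks if "Parameter" in c]
```

Every chunk after the first is a bag of numbers and units with no column context, which is exactly what an embedding retriever then ranks and returns.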

STRUCTURAL INDEXING (THIS APPROACH)

One-time Docling ingestion converts PDFs into blocks with page, section path, element ID, and table headers preserved. The agent queries by section (search_sections("MBR")), by page, or by element — instant, precise, no approximation. Sub-LLM calls process sections in parallel. The engineer decides what matters, not an embedding model.
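The block model above can be sketched in a few lines. Names here are illustrative, not SpecParse's actual API: each Docling element becomes a block carrying its provenance, and section lookup is a plain substring match over section paths rather than a semantic search.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    element_id: str
    page: int
    section_path: list[str]  # e.g. ["9", "9.6", "9.6.2 Sewage Quality"]
    kind: str                # "text", "table", ...
    text: str

@dataclass
class DocIndex:
    blocks: list[Block] = field(default_factory=list)

    def search_sections(self, query: str) -> list[Block]:
        """Case-insensitive match against section paths: structural, not semantic."""
        q = query.lower()
        return [b for b in self.blocks
                if any(q in part.lower() for part in b.section_path)]

    def get_page(self, page: int) -> list[Block]:
        """Exact page lookup; no scanning, no approximation."""
        return [b for b in self.blocks if b.page == page]

# Usage sketch with two hypothetical blocks:
idx = DocIndex(blocks=[
    Block("tbl-41", 5, ["9", "9.6", "9.6.2 Sewage Quality"], "table", "BOD 250 mg/L"),
    Block("par-88", 600, ["12", "12.1 Performance Guarantees"], "text", "..."),
])
hits = idx.search_sections("sewage")
```

Because the match is over section paths, a hit returns a whole section's blocks as a coherent unit, with table headers intact.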

Architecture

An MCP server exposing 5 tools. The calling agent controls the analysis policy — SpecParse provides indexed access to documents and sub-LLM routing, not a fixed extraction pipeline.

INGESTION

Docling with TableFormerMode.ACCURATE — structural table extraction with column headers and units. Local for small docs, Modal (64 CPU / 256 GB) for documents over 400 pages. Cached extractions load instantly on re-open.
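The routing decision described above can be sketched as pure logic. The threshold comes from the text; the cache layout and function names are assumptions, not SpecParse internals:

```python
import hashlib
from pathlib import Path

MODAL_PAGE_THRESHOLD = 400  # per the text: over 400 pages goes to Modal

def extraction_route(pdf_path: str, page_count: int,
                     cache_dir: str = ".specparse_cache") -> str:
    """Pick where a document gets parsed: cache hit, Modal worker, or local."""
    # Hypothetical cache key: hash of the filename, one JSON artifact per doc.
    digest = hashlib.sha256(Path(pdf_path).name.encode()).hexdigest()[:16]
    if (Path(cache_dir) / f"{digest}.json").exists():
        return "cache"   # re-open: load the prior Docling extraction instantly
    if page_count > MODAL_PAGE_THRESHOLD:
        return "modal"   # heavy docs go to the 64 CPU / 256 GB worker
    return "local"       # small docs parse in-process
```

The 958-page tender described below would route to Modal on first open and to the cache on every re-open.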

REPL SESSION

Persistent Python environment with indexed artifacts, sub-LLM access (llm_query_batched for concurrent calls), and full doc reader toolkit. 1-hour TTL, up to 5 concurrent sessions.

# Open session — Docling ingests all PDFs, builds artifact index
session_open(folder_path="/docs", modal_extract_url="https://...")

# Navigate — section search, not embedding retrieval
search_sections("membrane bioreactor")   # instant hits across all docs
get_section("9.6.2 Sewage Quality")      # full table with headers + units

# Extract — agent-controlled parallel sub-LLM calls
results = llm_query_batched([
    prompt_stp_hydraulic,     # 5K chars, focused on STP flows
    prompt_cetp_quality,      # 7K chars, focused on CETP influent
    prompt_mbr_process,       # 10K chars, MBR sizing parameters
])                            # concurrent — 8 calls in ~160s
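One plausible shape for llm_query_batched is a thread-pool fan-out; the real implementation and the bound sub-LLM client are not shown on this page, so the client is an explicit parameter here:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def llm_query_batched(prompts: list[str], call_fn: Callable[[str], str],
                      max_workers: int = 8) -> list[str]:
    """Dispatch prompts concurrently; results return in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_fn, prompts))

# Usage sketch with a stand-in for the sub-LLM call:
results = llm_query_batched(["stp prompt", "cetp prompt"], str.upper)
```

Since the calls are I/O-bound API requests, threads suffice; wall time approaches the slowest single call rather than the sum, which is how 8 calls complete in roughly the time of one or two.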

In Practice

First production use: extracting a process engineering Basis of Design from a 958-page industrial wastewater tender across 3 documents.

958 pages ingested · 666 items extracted · <45m end-to-end

Structural ingestion in <30 minutes (Modal), analysis in <15 additional minutes. 472 process parameters, 194 contractor risk items, 16 cross-document findings (conflicts, gaps, supersessions). At ~250 words/page, a senior engineer reviewing 240K words at 30 pages/hour would need ~32 hours. Machine-readable JSON output for downstream agent consumption, plus a PDF report for human review.

  • EPC tender review — process Basis of Design extraction from technical specifications and design basis reports
  • Cross-document reconciliation — detect conflicts, gaps, and supersessions between design basis and contractual specifications
  • Contractor risk analysis — performance guarantees, liability, warranty, testing obligations, scope boundaries