Making Tabular Data AI-Friendly

The problem hiding in every spreadsheet

Due diligence lives in tables. Questionnaire responses arrive as spreadsheets. Financial summaries land as HTML reports. Risk scorecards sit in neatly formatted grids. For a human reader, the meaning of any cell is obvious — you glance at the row header, check the column header, and you understand what “40” represents.

Large language models cannot glance. When a table is fed into an LLM — whether for summarisation, embedding, or retrieval-augmented generation — the model receives a flat sequence of tokens. The spatial relationships that make a table intelligible to a person are lost. The value “40” arrives stripped of the context that gives it meaning: that it is Revenue, for North America, in Q1.

This is not a minor inconvenience. In vendor assessment and compliance work, tables are the primary artefact. If your AI pipeline cannot parse them faithfully, it cannot answer questions about them, cannot compare them, and cannot embed them for retrieval. The table becomes a black hole in an otherwise functional system.

Why existing approaches fall short

The standard response is to serialise the table — convert it to CSV, or dump it as Markdown, and hope the model works it out. For simple, flat tables this can suffice. But real-world tables are rarely simple. They have hierarchical headers that span multiple rows or columns. They have merged cells. They have row groups where a single label governs several sub-rows.

Consider a financial summary where “Revenue” spans three rows — North America, Europe, and Asia Pacific — while “Q1” through “Q4” run across the top. A CSV export loses the spanning. Markdown loses it too. The model sees a grid of values with blank cells where the spanning header should repeat, and must infer the structure. Sometimes it guesses correctly. Often it does not.
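
To make the loss concrete, here is a minimal sketch using Python's standard csv module. The CSV text is an assumed export of a cut-down version of that summary (two regions, two quarters, illustrative values), in which the merged "Revenue" cell keeps its label only on the first row it spans.

```python
import csv
import io

# Assumed CSV export of a table where "Revenue" is a merged cell spanning
# two rows. Only the first row of the span keeps the label.
exported = (
    ",,Q1,Q2\n"
    "Revenue,North America,40,50\n"
    ",Europe,60,70\n"
)

rows = list(csv.reader(io.StringIO(exported)))
header, data = rows[0], rows[1:]

for row in data:
    group, region, *values = row
    print(repr(group), repr(region), values)
```

The Europe row comes back with an empty first field, so nothing in the serialised text ties its values to "Revenue" any more. That link is what the model has to guess.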

Rule-based extraction fares little better. You can write custom parsers for specific table layouts, but each new format demands new rules. In a consultancy handling dozens of clients, each with their own reporting templates, this approach does not scale.

Untabulate: headers projected onto every cell

We built Untabulate to solve this directly. It is a small, open-source Python library — available on PyPI — that takes a table and produces, for every data cell, a complete semantic path: the chain of row and column headers that govern it.

Given a table like:

                          Q1    Q2
Revenue                   100   120
          North America   40    50
          Europe          60    70

Untabulate produces:

Revenue → Q1: 100
Revenue → Q2: 120
Revenue → North America → Q1: 40
Revenue → North America → Q2: 50
Revenue → Europe → Q1: 60
Revenue → Europe → Q2: 70

Each output string is self-contained. It carries its full meaning without reference to any surrounding structure. That is precisely what an embedding model needs: a chunk of text whose semantics are complete.

How it works

The core algorithm builds what we call a projection grid. For each cell marked as a header, the library records its position and its span, that is, how many rows or columns it covers. When you query a data cell, the library walks leftward through the row headers and upward through the column headers, collecting every header whose span covers that cell’s row or column. The result is an ordered path from the most general to the most specific header, followed by the cell’s value.

This handles hierarchical and merged headers naturally. A header with rowspan=3 applies to three rows of data. A header with colspan=4 applies to four columns. The running time is linear in the number of cells, at roughly a million cells per second on ordinary hardware, so it scales comfortably to large workbooks.
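
The idea can be sketched in a few lines of plain Python. This is an illustration only, not the library's implementation: the Cell class, header_path and flatten are hypothetical names, and the sketch scans every header for every data cell rather than using a linear-time grid. It also assumes that "Revenue" in the earlier example is a header spanning all three data rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    # Hypothetical cell record: grid position, text, header flag, spans.
    row: int
    col: int
    text: str
    is_header: bool = False
    rowspan: int = 1
    colspan: int = 1


def header_path(cell, headers):
    """Return the header labels governing `cell`, general to specific."""
    # Row headers lie to the left and their row span covers the cell's row.
    row_headers = [
        h for h in headers
        if h.col + h.colspan <= cell.col
        and h.row <= cell.row < h.row + h.rowspan
    ]
    # Column headers lie above and their column span covers the cell's column.
    col_headers = [
        h for h in headers
        if h.row + h.rowspan <= cell.row
        and h.col <= cell.col < h.col + h.colspan
    ]
    row_headers.sort(key=lambda h: h.col)  # outermost row header first
    col_headers.sort(key=lambda h: h.row)  # topmost column header first
    return [h.text for h in row_headers + col_headers]


def flatten(cells):
    """Turn every data cell into a 'path: value' string."""
    headers = [c for c in cells if c.is_header]
    data = sorted(
        (c for c in cells if not c.is_header),
        key=lambda c: (c.row, c.col),
    )
    return [" → ".join(header_path(c, headers)) + f": {c.text}" for c in data]


# The example table, with "Revenue" assumed to span rows 1 to 3.
cells = [
    Cell(0, 2, "Q1", is_header=True),
    Cell(0, 3, "Q2", is_header=True),
    Cell(1, 0, "Revenue", is_header=True, rowspan=3),
    Cell(2, 1, "North America", is_header=True),
    Cell(3, 1, "Europe", is_header=True),
    Cell(1, 2, "100"), Cell(1, 3, "120"),
    Cell(2, 2, "40"), Cell(2, 3, "50"),
    Cell(3, 2, "60"), Cell(3, 3, "70"),
]

for line in flatten(cells):
    print(line)
```

Under those assumptions the sketch prints the same six strings listed earlier, from "Revenue → Q1: 100" through "Revenue → Europe → Q2: 70".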

The library accepts input from three sources: HTML tables (parsed via lxml), Excel spreadsheets (via openpyxl), or any custom data source that can supply cell positions, spans, and header flags. A command-line interface is included for quick inspection.

Why this matters for assessment platforms

At Fluvial, questionnaire responses and reference data frequently arrive as tables — Excel exports from clients, HTML reports from data feeds, structured grids from regulatory submissions. Our document automation system needs to ingest this material, associate it with the correct fields, and make it available for search, comparison, and generation.

Without semantic flattening, a table embedded in a document is opaque to the AI layer. With it, every cell becomes a searchable, embeddable, comparable statement of fact. “Revenue → North America → Q1: 40” can be matched against a question like “What was North American revenue in the first quarter?” in a way that a bare “40” never could.
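
A toy comparison makes the difference visible. The sketch below uses shared-word counting as a crude stand-in for an embedding model, and the tokens and overlap helpers are hypothetical. Word overlap cannot connect "first quarter" to "Q1", which is the kind of link a real embedding model is there to make. The sketch only shows that the flattened string gives a matcher something to work with, while the bare value gives it nothing.

```python
import re


def tokens(text):
    """Lowercase word tokens; a crude stand-in for an embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(question, chunk):
    """Number of distinct tokens the question and the chunk share."""
    return len(tokens(question) & tokens(chunk))


question = "What was North American revenue in the first quarter?"
flattened = "Revenue → North America → Q1: 40"
bare = "40"

print(overlap(question, flattened))  # shares "north" and "revenue"
print(overlap(question, bare))       # shares nothing
```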

This matters equally for any organisation building RAG pipelines over document collections that contain tables — which is to say, nearly every organisation doing knowledge work.

Designed for pipelines, not notebooks

Untabulate is deliberately minimal. It does one thing: flatten tables into semantic strings. It does not visualise, does not transform, does not model. It sits at the boundary between document ingestion and AI processing, turning structured layout into structured language.

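from untabulate import untabulate_html

# `html` is assumed to hold the markup of a single table as a string.
# With format="strings", each item is one flattened statement.
for item in untabulate_html(html, format="strings"):
    print(item)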

Three output formats are available — strings for direct embedding, dictionaries for structured metadata, and tuples for lightweight path-value pairs — so it slots into whatever pipeline you are building.

Open source and available now

Untabulate is MIT-licensed and published on PyPI. Install it with:

pip install untabulate

Add the [lxml] extra for HTML support, the [openpyxl] extra for Excel, or both, for example pip install "untabulate[lxml,openpyxl]". The source is on GitHub, and contributions are welcome.

Tables are the most common structured data format in professional work, yet they have been largely invisible to the current generation of AI tooling. Untabulate is a small correction to that — a library that gives language models what they need to read a table the way a person does: header by header, cell by cell, with nothing left to guess.
