automation · bookkeeping · AI · data

The Foundation Nobody Wants to Build

April 15, 2026 · 7 min read

Every AI demo looks impressive until you ask where the data came from.

The demo shows an LLM summarizing invoices, flagging anomalies, drafting reports, and answering questions about the business in plain English. What the demo doesn't show is the two months of cleanup work that happened before any of that was possible — the folder structure that had to be standardized, the chart of accounts that had to be reorganized, the transaction categories that had to be made consistent before a model could make sense of them.

That work isn't sexy. It also isn't optional.

What an LLM actually needs

A language model is, at its core, a pattern-recognition engine. Feed it consistent, well-structured data and it can find patterns you'd never see manually. Feed it inconsistent, poorly labeled, half-organized data and it will find patterns in the noise — which is worse than finding nothing, because it looks right.

The quality ceiling for any AI-assisted workflow is set by the data underneath it. You can't automate your way past a disorganized file system. You can't analyze trends in a general ledger where the same expense hits three different accounts depending on who entered it. You can't build reliable automations on top of subledger data that doesn't reconcile to the GL.
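The "same expense, three different accounts" problem is easy to detect mechanically. Here's a minimal sketch in Python; the vendor names, account names, and the idea of keying on vendor are all illustrative, not taken from any particular GL export:

```python
from collections import defaultdict

# Hypothetical GL export: (vendor, account) pairs for posted transactions.
transactions = [
    ("Acme Insurance", "Insurance Expense"),
    ("Acme Insurance", "Office Expense"),
    ("Acme Insurance", "Prepaid Insurance"),
    ("City Utilities", "Utilities Expense"),
    ("City Utilities", "Utilities Expense"),
]

def inconsistent_vendors(txns):
    """Return vendors whose transactions hit more than one GL account."""
    accounts_by_vendor = defaultdict(set)
    for vendor, account in txns:
        accounts_by_vendor[vendor].add(account)
    return {v: sorted(a) for v, a in accounts_by_vendor.items() if len(a) > 1}

print(inconsistent_vendors(transactions))
```

A real version would need to tolerate legitimate splits (a vendor that genuinely bills two expense types), but even this crude check surfaces where the category discipline is breaking down.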

The model isn't the bottleneck. The data almost always is.

The hierarchy matters

Data organization in a business has a natural hierarchy, and it pays to work through it in order.

File systems come first. Not because they're the most important, but because they're the most foundational. A consistent folder structure — for clients, for vendors, for projects, for periods — is the scaffolding that everything else attaches to. When documents live in predictable places with predictable names, every downstream process gets easier. When they don't, every downstream process involves someone spending five minutes finding the right version of the right file before they can do the actual work.
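"Predictable places with predictable names" can be enforced, not just hoped for. Here's a sketch of a conformance check against a made-up convention (Clients/ClientName/YYYY/YYYY-MM/document); the pattern itself is an assumption, since the post doesn't prescribe one:

```python
import re

# Hypothetical convention: Clients/<Client Name>/<YYYY>/<YYYY-MM>/<file>
PATTERN = re.compile(r"^Clients/[A-Za-z0-9 ]+/\d{4}/\d{4}-\d{2}/[^/]+$")

def conforms(path):
    """True if a file path follows the documented folder convention."""
    return bool(PATTERN.match(path))

print(conforms("Clients/Acme Co/2026/2026-03/invoice-0042.pdf"))  # follows the convention
print(conforms("Clients/acme/invoices/final_v2 (1).pdf"))          # does not
```

Run something like this over the whole tree once a month and the drift shows up as a list of offending paths instead of as five lost minutes per file.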

The general ledger comes next. The GL is the single source of truth for what happened financially in a business. Every transaction, properly categorized, flows through it. A clean GL means your P&L reflects reality. It means your cash flow report is usable. It means when you pull a number to make a decision, you can trust it.

Detail modules feed the GL. Accounts receivable, accounts payable, payroll, inventory — these are where transactions originate. Each one has its own data structure, its own reconciliation requirements, its own points of failure. When the detail data is clean and the posting rules are consistent, the GL stays clean automatically. When the detail is a mess, you spend the end of every month trying to figure out why things don't balance.
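The "figure out why things don't balance" step is, at its core, one comparison: the subledger's open items should sum to the control account balance in the GL. A minimal sketch, with made-up numbers:

```python
# Hypothetical AP data: unpaid vendor invoices in the subledger, and the
# Accounts Payable control account balance per the GL.
subledger_open_items = [250.00, 1_100.00, 432.50]
gl_control_balance = 1_782.50

def reconciles(items, control, tolerance=0.005):
    """True if the subledger detail sums to the GL control balance."""
    return abs(sum(items) - control) <= tolerance

print(reconciles(subledger_open_items, gl_control_balance))
```

When the posting rules are consistent, this check passes without anyone touching it. When they aren't, the difference it reports is where the month-end days go.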

Years ago, when I worked in Accounts Payable, the GL Accountant told us: "Put anything you want me to adjust each month into Accounts Payable." So instead of, say, debiting Prepaid Insurance and journaling the expense portion into Insurance Expense each month, everything got debited and credited through Accounts Payable. Needless to say, she spent days reconciling Accounts Payable every month. She was a skilled accountant. The problem wasn't her; it was that the structure of the data made her job harder than it needed to be.
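The pattern she should have been allowed to use is mechanical: book the annual premium to Prepaid Insurance once, then move one-twelfth to Insurance Expense each month. A sketch, with illustrative amounts and account names:

```python
def amortize_prepaid(total, months):
    """Split a prepaid amount into equal monthly journal entries.

    Each entry debits the expense account and credits the prepaid asset;
    the last month absorbs any rounding remainder so the asset fully clears.
    """
    monthly = round(total / months, 2)
    entries = []
    for m in range(1, months + 1):
        amount = monthly if m < months else round(total - monthly * (months - 1), 2)
        entries.append({
            "month": m,
            "debit": ("Insurance Expense", amount),
            "credit": ("Prepaid Insurance", amount),
        })
    return entries

entries = amortize_prepaid(1200.00, 12)
print(entries[0])
```

Twelve predictable entries against the right accounts, instead of twelve surprises buried in Accounts Payable.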

Structure first. Automate second.

This is the sequence people get backwards most often.

The appeal of automation is that it removes manual work. So the instinct is to automate as quickly as possible — before you've established that the underlying process is worth automating, before you've confirmed the data is structured consistently enough to automate reliably.

Automating a broken process doesn't fix the process. It runs the broken process faster and at higher volume, which usually makes the downstream cleanup worse, not better.

The right sequence is:

  1. Define the structure. What does a clean version of this data look like? What are the categories, the naming conventions, the reconciliation checkpoints? Document it, even briefly.

  2. Run it manually until it's stable. A process that can't be run consistently by a human can't be automated reliably by a machine. Manual operation surfaces the edge cases, the exceptions, the things that don't fit the structure you thought you had.

  3. Automate the stable parts. Once the process is consistent and the data is clean, automation is straightforward — you're just removing the human from a loop that already works.

  4. Use the data to analyze and improve. Now the interesting work starts. Clean, consistent, automated data is what lets you ask real questions about the business: where are margins improving, where are costs creeping, what's driving the variance between this quarter and last.
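Step 3 in miniature: automate only what's already stable, and keep the exceptions visible. The rules table below is hypothetical; in practice it would come out of the documented structure in step 1 and the manual runs in step 2:

```python
# Hypothetical categorization rules, learned from stable manual runs.
RULES = {
    "ACME INSURANCE": "Prepaid Insurance",
    "CITY UTILITIES": "Utilities Expense",
}

def categorize(descriptions):
    """Auto-categorize known vendors; queue everything else for a human."""
    auto, review = [], []
    for desc in descriptions:
        account = RULES.get(desc.upper())
        if account:
            auto.append((desc, account))
        else:
            review.append(desc)  # edge cases stay visible, not silently guessed
    return auto, review

auto, review = categorize(["Acme Insurance", "City Utilities", "New Vendor LLC"])
```

The design choice that matters is the review queue: anything the rules don't cover goes back to a person, so automating never means running the broken path faster.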

Years ago, on my first day at a company that was working toward its ISO 9001 certification (you have procedures that say what you do, and you do them consistently), someone mentioned that if we updated SOP-PR-101 (Standard Operating Procedure, Production Department, Document Number 101), it would also affect WIN-PR-920. Everyone in the room stroked their imaginary hairless cat and said, "Yes! That would affect WIN-PR-920!" I thought these people were crazy and rigid, but it turned out to be the most flexible company I ever worked for. If you ever made a good case for changing a procedure, they would change it, document it, and everyone would implement the change at the same time. By the way, WIN-PR-920 stands for Work Instruction Document, Production Department, Document Number 920.

A hairless cat, stroked thoughtfully

Why accounting is the right place to start

Accounting data is not glamorous. Nobody puts "cleaned up the chart of accounts" in their marketing materials. The ROI on a well-organized GL is real but invisible — you see it in decisions that get made faster, in problems that get caught earlier, in the hours that don't get spent on reconciliation at month-end.

Compare that to the ROI on a new CRM or a marketing automation platform, which is highly visible and immediately attributable. It's not surprising that accounting infrastructure gets deprioritized.

But accounting data has a property that most other business data doesn't: it's comprehensive. Every dollar that moves through the business touches the GL. Every client, every vendor, every expense, every piece of revenue. When that data is organized and trustworthy, you have a complete picture of the business. When it's not, every analysis has an asterisk.

Accounting data is also inherently accurate. Why? Because an accounting transaction must be measurable. You do not add to the value of a business in its GL every month based on goodwill and your feelings about it; but when the business is sold at a premium, that premium is a measurable transaction, and it gets recorded. For that reason, accounting data, when analyzed by an LLM, is more likely to produce accurate analysis than hallucination.

Marketing and sales data tells you what's coming in. Accounting data tells you what's left after everything goes out. Both matter, but one of them is more honest.

The compounding effect — again

I wrote in an earlier post about how clean, connected systems compound over time. The same logic applies here, but the starting point matters even more.

A business that builds on clean accounting data — organized files, reconciled GL, consistent subledger processes — gets better data for every automation it builds afterward. Every analysis it runs is grounded in something real. Every decision it makes is based on numbers it can trust.

A business that skips that foundation and goes straight to the interesting parts — the AI tools, the dashboards, the automations — eventually hits a wall where the system is only as reliable as the data feeding it. And the data is a mess.

Building the foundation isn't the exciting part. It's the part that makes everything else work.
