Batch convert PDFs to CSV on Mac (2026 workflow)

Turn a folder of PDFs into one consolidated CSV on Mac — one prompt, one pass, source-file provenance baked in, on-device on macOS 14.4+.


You have a folder of PDFs. Twelve monthly statements, forty vendor invoices, a quarter of expense receipts — or all three, piled up because the one-at-a-time workflow never survives contact with a real month-end.

Converting one PDF to CSV is a solved problem. Doing it across a folder, in a way that leaves you with one spreadsheet rather than forty individual CSVs you then have to merge, is where most tools fall apart. The free web converters hit a rate limit at file three. The Python scripts you wrote last year break on the one invoice that used a different template. The AppleScript from 2015 predates every OCR engine worth using.

This guide walks through the workflow that actually works in 2026: batch convert PDFs to CSV on Mac, on-device on macOS 14.4+, with source-file provenance on every row and a single consolidated output.

Why batch is harder than “just loop over files”

The naïve plan is: take your one-file solution and loop. This fails for three reasons that show up within the first real batch:

  1. Heterogeneous templates. The twelve monthly statements might be from two banks that changed their PDF format mid-year. The forty invoices are from forty vendors. A prompt tuned to one layout produces junk on another. You need an extraction approach that generalizes, not one that memorizes coordinates.
  2. No provenance. If you loop and concatenate, the output is one giant CSV where every row looks the same and you have no way to trace a suspicious number back to its source PDF. Three months later, when a bookkeeper asks “where did $3,412.50 come from?”, you re-process everything. Every output row needs a source_file — ideally a source_page too — and the loop has to inject it automatically.
  3. Partial failure. One PDF in forty is corrupted, password-protected, or a 200MB scan that blows the model’s context window. A dumb loop either dies on that file or silently skips it. You need a batch that reports which files failed, why, and lets you re-run just the failures without touching the 39 that worked.
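
To make the three requirements concrete, here is a minimal sketch of a loop that tags every row with its source file and records failures instead of dying on them. The `run_batch` helper and the `fake_extract` stand-in are hypothetical names used only for illustration; swap in whatever extractor you actually use.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_batch(paths: Iterable[Path], extract: Callable[[Path], list]):
    """Run `extract` on each file, tag rows with provenance, collect failures."""
    rows = []
    failures = []  # (filename, reason) pairs you can re-run later
    for path in paths:
        try:
            extracted = extract(path)
        except Exception as exc:  # one bad file must not kill the batch
            failures.append((path.name, f"{type(exc).__name__}: {exc}"))
            continue
        for row in extracted:
            # Inject provenance automatically so no row is untraceable.
            rows.append({**row, "source_file": path.name})
    return rows, failures


def fake_extract(path: Path) -> list:
    """Hypothetical stand-in for a real extractor."""
    if path.name == "locked.pdf":
        raise ValueError("password-protected")
    return [{"amount": "3412.50"}]


rows, failures = run_batch([Path("a.pdf"), Path("locked.pdf")], fake_extract)
```

Because failures come back as a list of filenames and reasons, re-running only the broken files is a matter of passing those paths to `run_batch` again.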

A real batch workflow solves all three. The rest of this post is the Mac-native version.

Method 1: ignitai on Mac (the on-device way)

ignitai is designed for this. The whole pitch is that extraction is a language task — you describe what you want, the model finds it across every file in the batch, and the output is one spreadsheet with provenance baked in. The full flow, end to end:

  1. Drag the folder into ignitai. Or drag a selection of files. Or a mix of folders. ignitai flattens them into a single batch queue — up to 500 PDFs in one pass on an M-series Mac. Mixed scan / text-PDF / image inputs are fine; the app routes each file through the right pipeline automatically.
  2. Describe what to extract, once. Plain English, for the whole batch. Examples that work well:
    • “For each transaction, return date, description, amount (negative for debits), and running balance. Skip account summaries and marketing pages.”
    • “For each line item, return description, quantity, unit price, and line total. In a separate sheet, return invoice number, issue date, due date, vendor name, and grand total.”
    • “For each receipt, return date, merchant, category, amount, and tax. If the category isn’t printed, infer from the merchant name.”
  3. Pick CSV. Or XLSX if you want multiple sheets (line items + header metadata) in one file. CSV is the right choice if you’re piping into QuickBooks, Xero, or a custom ledger.
  4. Hit Extract. ignitai runs each PDF through the on-device model (macOS 14.4+ / Apple Silicon), streams results into a consolidated output, and shows a live progress view with per-file status. A 40-file invoice batch typically takes 2–4 minutes.
  5. Review the consolidated output. Every row includes a source_file column with the original filename, and (for multi-page documents) a source_page column. You can filter by source_file to sanity-check any one vendor’s rows without reopening the PDF.
  6. Export and re-run failures. If any files failed (corrupted, empty, OCR’d to nonsense), they’re listed separately. You can fix them — rotate a scan, unlock a password — and re-run just those, then append to the same spreadsheet.

The whole batch lives on your Mac. The PDFs never leave the device on macOS 14.4+. For a bookkeeper or finance operator who runs this weekly, the time savings are the kind that change what’s possible — a weekend of statement reconciliation becomes a coffee.

Method 2: Automator + Folder Actions (the macOS-native DIY)

If you want to build it yourself, macOS gives you most of the pieces:

  1. Use Automator to create a Folder Action: watch an /inbox/ folder, and when a PDF lands, run a Shell Script.
  2. The shell script calls pdftotext -layout "$1" - (requires brew install poppler), pipes the output to a Python or Node script you wrote, and appends the parsed rows to a master CSV in an /out/ folder.
  3. Source-file provenance is your responsibility — prepend the filename to each row before writing.
  4. Failures: append || echo "$1" >> failures.log to the pdftotext command in the shell script, so any file that fails is logged and you can re-process it later.
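
The parsing script in step 2 might look roughly like this. It is a sketch under loud assumptions: the three-column layout (date, description, amount), the MM/DD/YYYY date format, and the function names are all hypothetical, and a real statement template will need its own rules.

```python
import csv
import re
from pathlib import Path

# Assumed date format for transaction lines; adjust to your bank's template.
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")


def parse_layout_text(text: str) -> list:
    """Pull transaction rows out of `pdftotext -layout` output."""
    rows = []
    for line in text.splitlines():
        # -layout keeps columns apart with runs of spaces, so split on 2+.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 3 and DATE_RE.match(fields[0]):
            rows.append(fields)
    return rows


def append_rows(master_csv: Path, source_file: str, rows: list) -> int:
    """Append rows to the master CSV, prepending the source filename."""
    is_new = not master_csv.exists()
    with master_csv.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(["source_file", "date", "description", "amount"])
        for row in rows:
            writer.writerow([source_file, *row])
    return len(rows)
```

A thin wrapper that reads `sys.stdin.read()` and takes the filename from `sys.argv[1]` is enough to connect this to the Folder Action's shell script.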

This is a valid path if you have hundreds of identically-structured PDFs — say, monthly statements from one bank, where the template is stable — and you want zero ongoing cost. It’s the wrong path if:

  • Your PDFs vary in structure (multiple vendors, multiple formats).
  • Any of them are scans. pdftotext can’t OCR; you’d need to swap in Tesseract and the accuracy drops sharply on real-world invoice fonts.
  • You don’t enjoy maintaining parsing scripts when a vendor changes their billing software.

For most solo operators, the time to build and maintain this pipeline exceeds the cost of an app that just does it.

Method 3: Python / CLI loop (for developers with uniform inputs)

If you’re comfortable in Python and your batch is genuinely uniform:

brew install poppler
pip install pdfplumber pandas

Then a script that opens each PDF with pdfplumber, extracts tables with page.extract_tables(), concatenates into a pandas DataFrame with a source_file column, and writes one CSV:

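import pandas as pd
import pdfplumber
from pathlib import Path

rows = []
for pdf_path in sorted(Path("./inbox").glob("*.pdf")):
    with pdfplumber.open(pdf_path) as pdf:
        # Number pages from 1 so source_page matches what a PDF viewer shows.
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                for row in table:
                    # Append provenance to every extracted row.
                    rows.append([*row, pdf_path.name, page_number])

# The data column names depend on your PDFs, so this uses placeholders
# (col_1, col_2, ...); rename them to match your statement layout.
n_data = len(rows[0]) - 2 if rows else 0
columns = [f"col_{i + 1}" for i in range(n_data)] + ["source_file", "source_page"]

# Assumes every table is as wide as the first row. pandas raises on wider
# rows; narrower rows get padded and misaligned, so spot-check the output.
df = pd.DataFrame(rows, columns=columns)
df.to_csv("out.csv", index=False)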

This is fine for clean text-based PDFs with identical table structures. It breaks on:

  • Scans. pdfplumber sees no text. You’d layer in pytesseract and your accuracy drops.
  • Variable layouts. extract_tables() infers grids from lines. A different vendor with a different border style returns a different shape.
  • Multi-line cells. pdfplumber splits them, and you need custom logic to re-join.
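
As an illustration of the re-join logic for multi-line cells, here is one possible heuristic: treat any row whose key column (assumed here to be the first, such as the date) is empty as a continuation of the row above. The `merge_continuation_rows` name and the key-column rule are assumptions; tables that split differently will need a different rule.

```python
def merge_continuation_rows(table, key_col=0):
    """Fold rows with an empty key column into the previous row."""
    merged = []
    for row in table:
        # extract_tables() can return None for empty cells; normalize to "".
        cells = [(cell or "").strip() for cell in row]
        if not cells:
            continue
        if merged and not cells[key_col]:
            previous = merged[-1]
            for i, cell in enumerate(cells):
                if cell and i < len(previous):
                    previous[i] = f"{previous[i]} {cell}".strip()
        else:
            merged.append(cells)
    return merged


table = [
    ["2026-01-05", "ACME Corp", "-12.50"],
    [None, "Invoice 4411", None],  # wrapped description line
    ["2026-01-06", "Rent", "-900.00"],
]
cleaned = merge_continuation_rows(table)
```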

For a stable monthly-statement feed, it’s a one-time cost. For a real-world bookkeeping batch, you’ll spend more time fixing edge cases than processing invoices.

Method 4: Web batch endpoints (and the tradeoffs)

Paid tiers of Smallpdf, iLovePDF, and similar offer batch endpoints. The flow is: upload ZIP, wait, download ZIP. They work, with the same caveats the one-file versions have, amplified:

  • You’re uploading the whole folder. For bank statements, vendor invoices, or anything with financial detail, this is a much bigger surface than uploading one file.
  • Per-file cost scales linearly. Most paid tiers cap at 500–1000 files/month; past that, enterprise tiers kick in.
  • Invoice-aware extraction is rare. The generic “convert PDF to Excel” endpoint isn’t parsing your invoice as an invoice; it’s looking for table grids. Expect the same header-metadata loss as in the one-file version.

For public document batches — research corpora, open-data dumps — they’re fine. For your operational data, on-device is the better trade.

Prompt design for heterogeneous batches

The single highest-leverage step in a real batch is writing a prompt that generalizes. Principles that work:

  1. Describe the fields, not the layout. “Return date, description, amount” generalizes. “Return the value in the third column” doesn’t.
  2. Be explicit about sign conventions. “Debits are negative, credits are positive” removes an entire class of silent errors that will show up as a reconciliation mismatch later.
  3. Give the model permission to infer. “If the category isn’t printed, infer it from the merchant name” produces a usable category column when some receipts have one and some don’t. Without it, half your rows have blanks.
  4. Specify a separate sheet for header metadata. A single flat CSV that mixes invoice_number and line_item_description in the same row structure is miserable to pivot on. Tell the extractor to produce two sheets.
  5. State what to skip. “Skip cover pages, terms-and-conditions pages, and marketing inserts” saves you from hundreds of junk rows.

A prompt that nails these five across a heterogeneous 40-file batch is worth more than a faster model. Save it as a preset in ignitai; reuse it next month.
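
One way to catch a sign-convention slip before it becomes a reconciliation mismatch is to check the running balance against the amounts. This is a sketch, assuming the output has `amount`, `balance`, and `source_file` columns, that rows appear in statement order (oldest first), and that amounts are plain decimal strings with no thousands separators; `balance_mismatches` is a hypothetical helper name.

```python
from decimal import Decimal


def balance_mismatches(rows):
    """Return (source_file, row_index) where balance != previous balance + amount."""
    problems = []
    previous = {}  # source_file -> last balance seen in that file
    for index, row in enumerate(rows):
        source = row["source_file"]
        amount = Decimal(row["amount"])
        balance = Decimal(row["balance"])
        if source in previous and previous[source] + amount != balance:
            problems.append((source, index))
        previous[source] = balance
    return problems


rows = [
    {"source_file": "jan.pdf", "amount": "-20.00", "balance": "980.00"},
    {"source_file": "jan.pdf", "amount": "30.00", "balance": "950.00"},  # debit extracted as positive
    {"source_file": "jan.pdf", "amount": "50.00", "balance": "1000.00"},
]
problems = balance_mismatches(rows)
```

A flagged row is a prompt to open that one PDF, not proof of an error: a statement that lists newest transactions first would trip the same check.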

Provenance: the column that saves you

Whatever batch approach you use, every output row must include a source_file column. Ideally a source_page too.

Three reasons this is non-negotiable:

  • Audit. Three months later, a number looks wrong. With source_file, you open one PDF. Without it, you open forty.
  • Partial re-run. One vendor changes their template and corrupts a hundred rows in your next batch. With provenance, you filter those out, re-process that vendor, and append. Without it, you re-process everything.
  • Review speed. When your accountant asks “what’s this $3,412 entry?”, the answer should take 20 seconds, not 20 minutes of PDF-digging.

ignitai adds these columns automatically. If you roll your own, don’t skip the step.
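
If you do roll your own, the partial re-run described above reduces to a filter and an append. A minimal sketch, assuming rows are dictionaries with a `source_file` key; the `replace_rows_for_files` helper and the sample filenames are hypothetical.

```python
def replace_rows_for_files(rows, reprocessed, files):
    """Drop old rows from the given source files, then append re-processed ones."""
    files = set(files)
    kept = [row for row in rows if row["source_file"] not in files]
    fresh = [row for row in reprocessed if row["source_file"] in files]
    return kept + fresh


rows = [
    {"source_file": "acme-01.pdf", "amount": "999999"},  # corrupted by a template change
    {"source_file": "globex-01.pdf", "amount": "3412.50"},
]
reprocessed = [{"source_file": "acme-01.pdf", "amount": "120.00"}]
fixed = replace_rows_for_files(rows, reprocessed, {"acme-01.pdf"})
```

Without the `source_file` column there is nothing to filter on, which is why the only fallback is re-processing the whole batch.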

When batch breaks

Honest edge cases for any batch pipeline:

  • Password-protected PDFs. Strip the password first (Preview → Export → uncheck “Encrypt”). Batching over encrypted files fails silently or loudly depending on the tool; either way, not what you want.
  • Massive scans. A 200-page 400-DPI scan of an old archive can blow past even a large model’s context. Split into chapters with Preview’s page-extract tool before batching.
  • Mixed languages in one batch. Either run two batches with language-specific prompts, or add “preserve original-language descriptions” explicitly. Inconsistent handling across a batch is worse than either pure choice.
  • One file that’s actually not a PDF. A .heic that got renamed .pdf, a .docx that’s in the folder by mistake. ignitai surfaces these as explicit failures; a shell loop dies. Worth checking the folder before hitting Extract.
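
Checking the folder first can be automated. The sketch below flags files that lack a `%PDF-` marker near the start and files above a size limit. The 200 MB threshold and the `preflight` name are assumptions, and the check does not detect password protection, which would need a PDF library.

```python
from pathlib import Path

MAX_BYTES = 200 * 1024 * 1024  # assumed limit; tune to what your tool handles


def preflight(folder: Path, max_bytes: int = MAX_BYTES):
    """Split a folder into files worth batching and files to look at first."""
    ok, problems = [], []
    for path in sorted(folder.iterdir()):
        # Skip subfolders and hidden files such as .DS_Store.
        if not path.is_file() or path.name.startswith("."):
            continue
        with path.open("rb") as fh:
            head = fh.read(1024)
        # Real PDFs carry a %PDF- marker at (or very near) the start.
        if b"%PDF-" not in head:
            problems.append((path.name, "not a PDF (no %PDF- header)"))
        elif path.stat().st_size > max_bytes:
            problems.append((path.name, "larger than the size limit"))
        else:
            ok.append(path)
    return ok, problems
```

Calling `preflight(Path("./inbox"))` before a batch gives you the renamed .heic and the stray .docx by name, rather than as a crash halfway through.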

Bottom line

For a folder of PDFs that needs to end up as one CSV on Mac: install ignitai, drag the folder in, write the prompt once, pick CSV, hit Extract. For stable monthly batches over hundreds of identically-formatted PDFs, a pdftotext-plus-Python pipeline is a valid alternative if you want the full DIY path and don’t mind maintaining it. For anything with scans, mixed vendors, or documents you’d rather not upload, the native on-device path is the shortest distance from folder to spreadsheet.

The single-file Mac workflow is covered in the Mac PDF-to-CSV guide; the iPad-native equivalent for invoices is the iPad invoice walkthrough; the iPhone version for bank statements is here. The app is the same across all three; presets and output formats sync via iCloud so the Mac is the batch engine and the mobile devices are the capture surfaces.

Get ignitai on the App Store — free download, $19.99/mo unlocks unlimited batch extractions after the 3-day trial.