The PDF Tools Everyone Should Know

Most “best PDF tool” lists rank products. That is the wrong axis. The PDF operations you actually need - compress, merge, split, convert, OCR, sign - each have a few competing implementations, and the right choice depends on constraints you can name: file confidentiality, fidelity tolerance, batch size, and whether you control the environment. Below I weigh the trade-offs the way I would in a code review.

Compression: lossless rewrites vs lossy resampling

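Compression is two unrelated problems wearing the same button. The first is a lossless structural rewrite: deduplicate repeated resources, drop unused objects, pack small objects into object streams, and recompress streams with Flate. qpdf covers the structural part without touching page content, so text and images render identically. Ghostscript can reach similar sizes, but it regenerates the whole file and may downsample images unless you configure it not to. Linearizing is often bundled into the same pass, though it reorders the file for faster web display rather than shrinking it. Expect roughly 5 to 30 percent off a bloated export at zero quality cost.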
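As a rough sketch of the lossless pass, the helper below only builds a qpdf command line; it does not run anything. The function name and the choice of flags are my own illustration, and it assumes a reasonably recent qpdf is on the PATH.

```python
from pathlib import Path


def qpdf_lossless_args(src: Path, dst: Path, linearize: bool = False) -> list[str]:
    """Build a qpdf command that rewrites structure without touching image data."""
    args = [
        "qpdf",
        "--object-streams=generate",  # pack small objects into compressed object streams
        "--compress-streams=y",       # compress streams that are currently uncompressed
        "--recompress-flate",         # re-deflate streams that are already Flate-coded
        "--compression-level=9",      # highest zlib effort: slower, slightly smaller
    ]
    if linearize:
        args.append("--linearize")    # fast web view; a layout choice, not a size win
    return args + [str(src), str(dst)]
```

Pass the result to `subprocess.run(args)`. qpdf exits with status 3 when it finishes with warnings, so a batch job probably wants to inspect the return code rather than treat every non-zero status as a failure.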

The second is lossy image resampling: downsample embedded scans from 600 to 150 DPI and re-encode as JPEG. That is where the dramatic 10x reductions come from, and where you destroy detail you may need later. Ghostscript’s -dPDFSETTINGS switch is exactly this knob: /ebook targets roughly 150 DPI images, /screen roughly 72.
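To make the size of that knob concrete, here is a small sketch: one function for the pixel arithmetic and one that builds (but does not run) a Ghostscript command. The function names are hypothetical, and the sketch deliberately accepts only the two presets discussed here.

```python
def pixel_reduction(src_dpi: int, dst_dpi: int) -> float:
    """Factor by which the pixel count shrinks when both dimensions are downsampled."""
    return (src_dpi / dst_dpi) ** 2


def gs_lossy_args(src: str, dst: str, preset: str = "ebook") -> list[str]:
    """Build a Ghostscript command that regenerates the PDF with a lossy preset."""
    if preset not in {"screen", "ebook"}:
        # Ghostscript has other presets; this sketch only allows the two named above.
        raise ValueError(f"unexpected preset: {preset}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",          # write a new PDF rather than rasterize
        f"-dPDFSETTINGS=/{preset}",   # the lossy quality preset
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        f"-sOutputFile={dst}",
        src,
    ]


print(pixel_reduction(600, 150))  # 16.0
```

Going from 600 to 150 DPI discards 15 of every 16 pixels. The file-size ratio will differ from the pixel ratio because it also depends on the JPEG quality and on how much of the file is images in the first place.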

The trade-off rule I use: run the lossless pass on everything by default, and only reach for resampling when a human has confirmed the output target (email attachment, web preview) tolerates it. Never let an automated pipeline silently resample archival documents.
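One way to encode that rule so a pipeline cannot resample by accident is a small gate function. This is a sketch under my own naming: `Mode`, `LOSSY_OK_TARGETS`, and `choose_compression` are all hypothetical, and the target strings simply mirror the two examples above.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    LOSSLESS = "lossless"
    LOSSY = "lossy"


# Hypothetical labels for output targets a human has agreed can tolerate resampling.
LOSSY_OK_TARGETS = {"email_attachment", "web_preview"}


def choose_compression(archival: bool, confirmed_target: Optional[str]) -> Mode:
    """Return LOSSY only when a human-confirmed target tolerates it."""
    if archival:
        return Mode.LOSSLESS   # archival documents are never resampled
    if confirmed_target in LOSSY_OK_TARGETS:
        return Mode.LOSSY
    return Mode.LOSSLESS       # default when nobody has confirmed a target
```

The useful property is that the lossy branch needs an explicit argument; leaving `confirmed_target` as `None` always falls back to the lossless pass.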

Client-side vs server-side: who sees the bytes

This is the decision evaluators underweight. Browser-based tools built on pdf-lib (for editing) or pdf.js (for parsing and rendering) can keep the file in the tab, so nothing leaves the machine, provided the tool really does its work locally and does not upload in the background. For merge, split, rotate, and page reordering, that is strictly better: those operations are pure structure edits and need no heavy native dependency.

Server-side processing wins when the work demands binaries you cannot ship to a browser: Ghostscript for real compression, an OCR engine, font subsetting at scale. The cost is custody. The moment a document crosses the network, you inherit a retention, logging, and deletion problem. For anything regulated or confidential, prefer client-side, and if you must use a server, confirm it deletes inputs and does not log them.
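The same reasoning can be written down as a tiny routing function. Everything here is illustrative: the operation labels, the function name, and the wording of the checks are my assumptions, not part of any tool.

```python
# Operations the text treats as pure structure edits that run fine in a browser.
STRUCTURAL_OPS = {"merge", "split", "rotate", "reorder"}


def choose_location(operation: str, confidential: bool) -> tuple[str, list[str]]:
    """Return where to run an operation and what must be confirmed first."""
    if operation in STRUCTURAL_OPS:
        return "client", []
    # Heavy work (compression, OCR, font subsetting) needs server-side binaries.
    if confidential:
        return "server", [
            "server deletes inputs after processing",
            "server does not log document contents",
        ]
    return "server", []
```

For example, `choose_location("ocr", confidential=True)` returns `"server"` together with the two confirmations to collect before any file is uploaded.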

Merge, split, and convert: the boring operations that corrupt silently

Merging looks trivial until two source files declare conflicting fonts or form fields with identical names. A naive concatenation can produce a file that opens fine in one viewer and breaks in another. pdf-lib and pdftk will both merge such files, but neither promises a conflict-free result, so test the output in more than one reader before trusting a batch job.
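A cheap pre-flight check for the form-field case is to look for names that appear in more than one source file before merging. The sketch below assumes the field names have already been extracted with whatever library you use; the function name and input shape are hypothetical.

```python
from collections import defaultdict


def field_name_collisions(fields_by_file: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each form-field name used by more than one file to the files using it."""
    owners: dict[str, set[str]] = defaultdict(set)
    for filename, names in fields_by_file.items():
        for name in names:
            owners[name].add(filename)
    # Keep only names claimed by at least two different files.
    return {name: sorted(files) for name, files in owners.items() if len(files) > 1}


print(field_name_collisions({"a.pdf": ["name", "date"], "b.pdf": ["name"]}))
# {'name': ['a.pdf', 'b.pdf']}
```

An empty result does not prove the merge is safe, but a non-empty one is a good reason to rename fields or flatten forms first.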

Conversion is where fidelity quietly dies. PDF-to-Word is reconstruction, not extraction - the tool guesses at paragraphs and tables from absolute glyph positions. Treat the result as a draft, always. Office-to-PDF via LibreOffice headless is far more reliable because it renders from a real layout engine rather than reverse-engineering one.
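For the Office-to-PDF direction, a headless LibreOffice call is short enough to sketch. The helper only builds the argument list; the function name is mine, and the binary may be installed as `soffice` or `libreoffice` depending on the platform.

```python
def soffice_to_pdf_args(src: str, outdir: str) -> list[str]:
    """Build a headless LibreOffice command that converts one document to PDF."""
    return [
        "soffice",            # assumed binary name; some systems use "libreoffice"
        "--headless",         # no GUI
        "--convert-to", "pdf",
        "--outdir", outdir,   # output keeps the source file's base name
        src,
    ]
```

Run it with `subprocess.run(soffice_to_pdf_args("report.docx", "out"), check=True)` and look for `out/report.pdf` afterwards.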

OCR and signing: correctness over convenience

For OCR, the open default is Tesseract, ideally producing a “sandwich” PDF: the original scan image stays visible, and an invisible, searchable text layer is aligned with it. That preserves the visual and gives you selectable text. Accuracy depends heavily on input quality: feed it 300 DPI, deskew first, and results improve more than swapping engines would.
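A minimal sketch of the Tesseract invocation for a searchable PDF follows. It assumes deskewing has already been done by a separate step, and the function name and defaults are my own.

```python
def tesseract_pdf_args(image: str, out_base: str, lang: str = "eng", dpi: int = 300) -> list[str]:
    """Build a Tesseract command that writes <out_base>.pdf with a hidden text layer."""
    return [
        "tesseract",
        image,              # input scan, assumed already deskewed
        out_base,           # output base name; Tesseract appends ".pdf"
        "-l", lang,         # recognition language
        "--dpi", str(dpi),  # tell the engine the scan resolution
        "pdf",              # output config: image plus invisible text
    ]
```

The `--dpi` flag matters most when the image file carries no resolution metadata, which is common for scans exported from phone apps.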

Signing splits into two meanings people conflate. A drawn-image signature is decoration; it proves nothing. A cryptographic digital signature signs a digest of a specific byte range with the signer’s private key and ties it to a certificate, so any later change to those bytes is detectable. If a workflow needs legal weight, only the second one counts, and you want a tool that reports signature validity, not one that just renders the image.
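The tamper-detection half of that is easy to illustrate with a toy example. The sketch below hashes only the spans named in an offset/length list, in the style of a PDF /ByteRange array, and uses a made-up byte string rather than a real PDF. It covers just the digest step; real validation also has to parse the signature container, check it against the certificate, and validate the chain, and the hash algorithm is whatever the signature specifies (SHA-256 is assumed here).

```python
import hashlib


def byte_range_digest(pdf_bytes: bytes, byte_range: list[int]) -> str:
    """SHA-256 over the spans listed as [offset, length, offset, length, ...]."""
    if len(byte_range) % 2 != 0:
        raise ValueError("byte_range must hold offset/length pairs")
    h = hashlib.sha256()
    for offset, length in zip(byte_range[0::2], byte_range[1::2]):
        h.update(pdf_bytes[offset:offset + length])
    return h.hexdigest()


# Toy stand-in for a signed file: the middle span plays the role of the
# embedded signature value, which is excluded from the digest.
head, sig, tail = b"HEADER-AND-BODY", b"<SIGNATURE>", b"TRAILER"
data = head + sig + tail
covered = [0, len(head), len(head) + len(sig), len(tail)]

original = byte_range_digest(data, covered)
tampered = byte_range_digest(data.replace(b"BODY", b"B0DY"), covered)
print(original != tampered)  # True
```

Changing one covered byte changes the digest, while changing bytes inside the excluded signature span does not. That is why a validator must also confirm the byte range covers everything except the signature itself.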

How I would choose

Default to lossless and client-side. Escalate to lossy or server-side only when a named constraint forces it, and write that reason down. The best PDF tool is the one whose trade-offs you chose on purpose rather than inherited by accident.