Paperwork as a data problem: solutions for data-centric businesses

By Simone Timen

Published on January 11, 2022

PDFs have been the transitional technology bridging the old (paper) with the new (digital), but their limitations have become clear as information sharing has moved online. While PDFs are here to stay, we can all benefit from reconceptualizing them as just another destination for data, rather than its origin.

The PDF was introduced in 1993. It was the first step in the digital evolution of paperwork, preparing people for a world where documents could be handled online. Much has happened since then. Smartphones became ubiquitous. Cloud storage became the new standard for storing and sharing data. And technology-first companies now dominate almost every industry.

While the PDF played a crucial role in modernizing paperwork in the 1990s, its limitations have become clear in a data-forward world. Today, the business models that attract massive investments or promise to keep legacy companies competitive are data-driven. Without large, normalized datasets, ideas like AI and machine learning have little to work with.

PDF data is unusable in its raw form

The “data” on a PDF isn’t really data in the modern sense at all: it’s unstructured and immobile. And attempting to transform that information into valuable datasets can be a Sisyphean task. This puts technology-first companies within industries that require PDFs in a tough spot. In order to make their data usable, they have to find ways to unlock it and structure it.

Most companies have tried to tackle this challenge in one of two ways. Some rely on people to extract the data, either their own employees or outsourced data entry services. But manual data entry is error-prone, slow, and costly.

Other companies have invested in software, such as OCR, that can scan PDF images and guess what information to extract. But because OCR-based extraction often relies on fixed templates, it doesn’t work well with the unstructured data found on PDFs. It frequently captures incorrect information and still requires manual quality checks.

Neither approach solves the underlying problem: relying on PDFs as the origin of the information.

A data-first mindset needs to extend to paperwork

While the PDF remains an indispensable artifact in many industries, these companies need not remain limited by outdated technology. There are new ways of collecting, sharing, and managing information while preserving the PDF as a stamp of legitimacy. Whether you’re a growth-stage startup looking to disrupt or a legacy company looking to capitalize on your market share, you stand to benefit from reconceptualizing paperwork as “data-first”.

Data capture is not doomed to start with the PDF in its unstructured form. Instead, think about how you would want data to flow if there were no PDFs. It would start with structured data. Maybe you already have it in a CRM. Or maybe you need to ask for it via a webform with field validation. It probably wouldn't start with open-text fields on a PDF.
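To make "structured data with field validation" concrete, here is a minimal sketch of what capturing a form submission as typed, validated data might look like. The record type, field names, and validation rules are all hypothetical and chosen only for illustration; a real form would have its own fields and stricter checks.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional, Tuple

# Deliberately simple pattern for illustration; real email validation is stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass(frozen=True)
class ApplicantRecord:
    """Hypothetical structured record captured from a webform."""
    full_name: str
    email: str
    birth_date: date


def validate_submission(raw: Dict[str, str]) -> Tuple[Optional[ApplicantRecord], Dict[str, str]]:
    """Return (record, {}) when every field is valid, else (None, errors by field)."""
    errors: Dict[str, str] = {}

    full_name = str(raw.get("full_name", "")).strip()
    if not full_name:
        errors["full_name"] = "required"

    email = str(raw.get("email", "")).strip().lower()
    if not EMAIL_PATTERN.match(email):
        errors["email"] = "invalid email address"

    birth_date = None
    try:
        birth_date = date.fromisoformat(str(raw.get("birth_date", "")))
    except ValueError:
        errors["birth_date"] = "expected YYYY-MM-DD"

    if errors:
        return None, errors
    return ApplicantRecord(full_name, email, birth_date), {}
```

Because problems are caught at the point of entry, whatever comes out of this step is already normalized: names are trimmed, emails are lowercased, and dates are real dates rather than free text that someone has to interpret later.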

Anvil’s technology is built to empower any business using PDFs to build “data-first” processes. Anvil’s Workflows and APIs, for example, enable you to capture information with webforms and instantly share that data to PDFs, your CRM, or any other piece of your tech stack. Since structured data capture and PDF creation happen simultaneously, information is instantly shareable. You can maintain PDFs for compliance, but they no longer have to be the origin of your data; they’re just another destination.
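The idea of the PDF as "just another destination" can be sketched in a few lines. The example below is not Anvil's API; the function names, field labels, and payload shapes are hypothetical, and actually sending the CRM payload or filling the PDF is left out. It only shows one structured record being mapped to several destination formats at once.

```python
from typing import Callable, Dict

Record = Dict[str, str]


def to_crm_payload(record: Record) -> dict:
    """Shape the record the way a hypothetical CRM expects a contact."""
    return {"contact": {"name": record["full_name"], "email": record["email"]}}


def to_pdf_fields(record: Record) -> dict:
    """Map the record onto the labeled fields of a hypothetical PDF form."""
    return {
        "Name": record["full_name"],
        "Email": record["email"],
        "Date of birth": record["birth_date"],
    }


# Each destination is just a function of the same structured record.
DESTINATIONS: Dict[str, Callable[[Record], dict]] = {
    "crm": to_crm_payload,
    "pdf": to_pdf_fields,
}


def fan_out(record: Record) -> Dict[str, dict]:
    """Build the output for every destination from one source record."""
    return {name: build(record) for name, build in DESTINATIONS.items()}


if __name__ == "__main__":
    submission = {
        "full_name": "Ada Lovelace",
        "email": "ada@example.com",
        "birth_date": "1815-12-10",
    }
    for name, payload in fan_out(submission).items():
        print(name, payload)
```

In this arrangement the PDF mapping sits alongside the CRM mapping instead of upstream of it, so adding another destination means adding one more function rather than another extraction step.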

The PDF has been the gold standard of data collection for decades, but the cultural shift towards an online, data-based world has exposed its shortcomings. Although PDFs are here to stay, we can all benefit from thinking about them as secondary repositories of information rather than sources of data.

If you have questions, we'd love to hear from you at hello@useanvil.com.
