Brief Overview of the Portable Document Format (PDF) and Some Challenges for Text Extraction

Note from the Editor: PDF files are so ubiquitous in business and in academia that few people give any thought to the problems that arise in extracting text from a PDF to incorporate into a search index. Tim Allison is a consultant working at the NASA Jet Propulsion Laboratory and has been at the forefront of understanding the causes of these problems and finding solutions. As you will see, this very detailed analysis contains a substantial number of images, so I have decided to publish the introduction in HTML and provide a link to the full paper as (ironically!) a PDF file.

Tim began working in natural language processing in the early 2000s. Since the early 2010s, he has focused on content/metadata extraction (and evaluation), advanced search and relevance tuning. Tim is the founder of Rhapsode Consulting LLC, and he currently works as a data scientist at NASA’s Jet Propulsion Laboratory, California Institute of Technology. Tim is also a member of numerous Apache software projects, including Apache Tika, PDFBox, POI, OpenNLP and Lucene/Solr. He holds a Ph.D. in Classical Studies, and he started his career as a professor of Latin and ancient Greek.

Disclaimer

“The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. Copyright 2021 California Institute of Technology©. U.S. Government sponsorship acknowledged.

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.

The author would like to thank Peter Wyatt, Chief Technology Officer of the PDF Association, and other colleagues for their feedback on this article. All errors and omissions are the author’s.

The following represents the viewpoints of the author and does not represent the funding agencies or reviewers.”

Introduction

The Portable Document Format (PDF) is one of the most common document file formats used in industry, academia and government. PDF files make up a significant share of the files on the internet (see “PDF’s Popularity Online”). For non-technical users, PDF files may seem straightforward and largely reliable. In practice, however, PDF files present a rich set of challenges for tools that extract text to enable search or other natural language processing tasks. The goal of this article is to offer a general overview of some of the challenges in extracting text from PDFs for technically oriented people who may be new to PDF. Specifically, this article is intended for those who process PDFs “in the wild”, which is to say, developers or development teams that do not have control over the generation of the PDFs they are processing. Those who are able to influence how the PDFs they process are generated are encouraged to focus on the final section of this article.

The full text of the 16-page article can be downloaded from OverviewOfTextExtractionFromPDFs
