Created on 2011-01-16
Last modified on 2011-09-14

PDF check utility

This tool allows you to check PDF properties/contents in a batch mode. Originally I developed this as a tool for a library to do a check on huge collection of PDF files from an output of OCR software or outsourced company.

This tool is able to check PDF properties/contents in a batch mode. It currently checks the following properties of PDF files:

The tool currently checks the following properties of PDF files:

News

2010-09-14
Version 20110914 released.
New TSV format introduced for filter-utility friendly.
Add a check for protection permissions.
Add a support for some TIFF formats (JBIG2DECODE filter).
Add supports for some more errors.
Add a support for encrypted PDF files.
2010-01-16
Released the first version (20110116).

Download

pdf-checker is a free software licensed under GNU AGPL version 3.

Source code is available at http://github.com/masao/pdf-checker/.

How to use

This tool is written in Java. You can use the tool in various environments such as Windows/Mac OS/Unix.

At first, download the binary package and unpack it. And then run the jar file with specifying the targeted PDF files on command line:

 % unzip pdf-checker-YYYYMMDD.zip
 % java -jar PdfChecker.jar pdf/2010J00*.pdf
 pdf/2010J0001.pdf	version	3
 pdf/2010J0001.pdf	encryption	false
 pdf/2010J0001.pdf	creationdate	D:20060627211618
 pdf/2010J0001.pdf	producer	PDFlib 4.0.3 + PDI (SunOS 5.8)
 pdf/2010J0001.pdf	pages	8
 pdf/2010J0001.pdf	page1	pagesize	Rectangle: 595.0x842.0 (rot: 0 degrees)
 pdf/2010J0001.pdf	page1	imagetype	png
 pdf/2010J0001.pdf	page1	dpi-x	346.8101
 pdf/2010J0001.pdf	page1	dpi-y	346.06174
 pdf/2010J0001.pdf	page1	text length	83
 pdf/2010J0001.pdf	page2	pagesize	Rectangle: 595.0x842.0 (rot: 0 degrees)
 pdf/2010J0001.pdf	page2	imagetype	png
 pdf/2010J0001.pdf	page2	dpi-x	346.8101
 pdf/2010J0001.pdf	page2	dpi-y	346.06174
 pdf/2010J0001.pdf	page2	text length	87
 pdf/2010J0001.pdf	page3	pagesize	Rectangle: 595.0x842.0 (rot: 0 degrees)
 pdf/2010J0001.pdf	page3	imagetype	png
 pdf/2010J0001.pdf	page3	dpi-x	346.8101
 pdf/2010J0001.pdf	page3	dpi-y	346.06174
 pdf/2010J0001.pdf	page3	text length	0
 pdf/2010J0001.pdf	page4	pagesize	Rectangle: 595.0x842.0 (rot: 0 degrees)
 pdf/2010J0001.pdf	page4	imagetype	png
 pdf/2010J0001.pdf	page4	dpi-x	346.8101
 pdf/2010J0001.pdf	page4	dpi-y	346.06174
 pdf/2010J0001.pdf	page4	text length	1794
 .....

An example above means that the file parsed is PDF version 3, is not encrypted, is created at June 27th 2006, is produced by a tool called PDFLib, and has 8 pages. Each page of this file has a size of "595x842" without rotation, an embedded (roughly) 300 DPI resolution image with PNG-style compression, and has corresponding textual contents of 80-1800 characters.

The tool can check multiple files by specifying them as arguments. When specifying multiple files, the first column shows each filename.

As output format is a simple text file with tab-separated, you can check it further via other applications like Excel.

TODO

There are some known limitations/bugs in this tool as of September 2011.

Error handling
Reading from an invalid PDF file will causes an exception and hung on it.
Image handling assumes a scanned document and it cause weired results on normal document created by document tools such as Word/LaTeX etc.

Acknowledgment

This tool uses and bundles iText PDF Library and Legion of the Bouncy Castle Java cryptography APIs. Source codes and detailed information is available under:

Development of this tool is partially based on work in DRF Technical Workshop at Karuizawa (Sep. 2010), sponsored by CSI 2010 program of NII.


高久雅生 (Masao Takaku)
https://masao.jpn.org/, tmasao@acm.org