I recently found myself on a project involving PDF file organization. I’ve always known there to be countless open source PDF manipulation tools, but I’ve never really used many myself, and especially not via a Linux shell.
Specifically, what I needed to do was:
- split a multi-page PDF into individual single-page files
- create a thumbnail image to preview each page
- extract all readable text from each page for searching
Splitting PDF Pages
For this job, I decided to use PDFTK (PDF Tool Kit). The syntax is a little muddy, but easy enough to figure out. For example, given input file blah.pdf:
pdftk A=blah.pdf cat A1 output blah-pg1.pdf
This snippet assigns the handle “A” to the input file, then issues the “cat” command for page 1 of file A (i.e., “A1”), and writes the result to a new file, blah-pg1.pdf.
This is fine for a single page, but if you want to split every page out of the input file, you will have to execute that command repeatedly, once for each page. There are various ways to determine the number of pages automatically, and one way is to use another PDFTK command:
pdftk blah.pdf dump_data
This will dump various info about the specified file, including the number of pages, so this can be captured into a string and parsed.
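Putting those two pieces together, here is a minimal sketch of a splitting script. The `page_count` helper is a name I made up; the script assumes pdftk is on the PATH and that dump_data prints a `NumberOfPages:` line (which it does in the versions I've used).

```shell
# Pull the page count out of pdftk's dump_data output, which
# includes a line like "NumberOfPages: 12".
page_count() {
  awk '/^NumberOfPages:/ { print $2 }'
}

# Split every page of blah.pdf into blah-pg1.pdf, blah-pg2.pdf, ...
# (guarded so the sketch is a no-op if pdftk or the file is missing)
if command -v pdftk >/dev/null 2>&1 && [ -f blah.pdf ]; then
  pages=$(pdftk blah.pdf dump_data | page_count)
  i=1
  while [ "$i" -le "$pages" ]; do
    pdftk A=blah.pdf cat "A$i" output "blah-pg$i.pdf"
    i=$((i + 1))
  done
fi
```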
PDF Page to Thumbnail
Now that each PDF page has been created, this is one of the more obvious, straightforward jobs. With ImageMagick installed, you can issue a command like the following to create a 200px tall JPEG for a single PDF page:
convert -resize x200 blah-pg1.pdf blah-pg1.jpg
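To thumbnail every split page in one go, a small loop does the trick. `make_thumbs` is just a name I picked; it assumes ImageMagick's convert is installed and that the pages follow the blah-pgN.pdf naming from the split step.

```shell
# Create a 200px-tall JPEG thumbnail for each split page.
make_thumbs() {
  for pdf in blah-pg*.pdf; do
    [ -e "$pdf" ] || continue            # glob matched nothing; skip
    convert -resize x200 "$pdf" "${pdf%.pdf}.jpg"   # blah-pg1.pdf -> blah-pg1.jpg
  done
}
```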
Extracting Text – Text-Based
If the PDF file was created with digital text (e.g., printed from a text editing application), then the full source text can be extracted easily. One of the packages you will find very useful is Poppler-Utils, which among a few other utilities includes an app called pdftotext.
pdftotext blah-pg1.pdf blah-pg1.txt
This dumps any available text from the PDF file into a text file. If there is no text available, the text file will be almost empty. In this case you’d need to…
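To detect that near-empty case programmatically, a small check works well. `needs_ocr` is my own helper name; note that pdftotext still emits a form feed and stray whitespace for image-only pages, so test for any non-whitespace character rather than a zero-byte file.

```shell
# Succeeds (exit 0) when the extracted text file contains nothing
# but whitespace, i.e. the page probably needs the OCR fallback.
needs_ocr() {
  ! grep -q '[^[:space:]]' "$1"
}
```

Usage would look something like `needs_ocr blah-pg1.txt && echo "run OCR on this page"`.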
Extracting Text – Image-Based / Scanned OCR
If the text content is unavailable, it’s usually because the PDF was created by a scanner or some other image-generating source. In this case, OCR (optical character recognition) must be used to visually scan the image and attempt to recognize each character. There are a few options available, but most of them do a pretty terrible job on real-world content. The best option, under heavy development in part by Google, is Tesseract OCR.
Tesseract does a really great job at recognizing nearly all the characters in an image. The only trick is converting the source file into a format Tesseract accepts, because it is very particular: it only likes TIFF files with the .tif extension and a maximum depth of 8 bpp. For that conversion, we go back to ImageMagick.
convert -background white -flatten +matte -colorspace Gray -depth 8 -density 600x600 -resample 300x300 blah-pg1.pdf blah-pg1.tif
This command will convert the source PDF page into a grayscale TIFF image with 8 bpp and 300 dpi. Now that we have that, we can invoke Tesseract as follows:
tesseract blah-pg1.tif blah-pg1
This will create file blah-pg1.txt containing all the characters it recognized (alphanumeric and otherwise). Thus far, I haven’t been able to figure out Tesseract’s configuration options, but there are things I’d like to modify, such as only accepting alphanumeric and punctuation characters.
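In the meantime, one workaround is to post-filter Tesseract's output down to the characters you actually want. `clean_ocr` is a name I made up, and this is a filter applied after the fact, not a Tesseract option.

```shell
# Strip the OCR output down to alphanumerics, punctuation, and
# whitespace, deleting any other (often garbage) characters.
clean_ocr() {
  tr -cd '[:alnum:][:punct:][:space:]' < "$1"
}
```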
All the applications I’ve discussed above are available in almost all Linux distributions, including as packages in Ubuntu (installable via apt-get). There is a lot more that can be done with these tools (such as extracting embedded images from PDFs), but with just the uses I described above, you should be able to create very powerful software to upload, manage, and index PDF files.