#StackBounty: #python #bash #pdf #mojibake How to identify likely broken PDF pages before extracting their text?

Bounty: 50

TL;DR

My workflow:

  1. Download PDF
  2. Split it into pages using pdftk
  3. Extract text of each page using pdftotext
  4. Classify text and add metadata
  5. Send it to client in a structured format

I need the extracted text to be consistent in order to move from step 3 to step 4. If the text is garbled, I have to OCR its page, but OCRing every page is out of the question. How can I identify beforehand which pages should be OCRed? I've tried running pdffonts and pdftohtml on each page. Isn't it expensive to call subprocess.run twice per page?
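For context, steps 1 to 3 look roughly like this in Python (a minimal sketch; the directory layout and function name are made up, and error handling is omitted):

import subprocess
from pathlib import Path

def split_and_extract(pdf_path: str, workdir: str = "pages") -> dict:
    """Split a PDF into single pages with pdftk, then run pdftotext on each page."""
    out = Path(workdir)
    out.mkdir(exist_ok=True)
    # pdftk's burst operation writes one PDF per page following the output pattern
    subprocess.run(
        ["pdftk", pdf_path, "burst", "output", str(out / "pg_%04d.pdf")],
        check=True,
    )
    texts = {}
    for number, page_pdf in enumerate(sorted(out.glob("pg_*.pdf")), start=1):
        # "-" makes pdftotext write the extracted text to stdout instead of a file
        completed = subprocess.run(
            ["pdftotext", "-layout", str(page_pdf), "-"],
            capture_output=True, text=True, check=True,
        )
        texts[number] = completed.stdout
    return texts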

What do I mean by broken page?

A PDF page whose text cannot be extracted from its source, possibly because of a missing or broken ToUnicode mapping.

Description

I'm building an application that relies on extracting text from around a thousand PDF files every day. The layout of the text in each PDF is somewhat structured, so calling pdftotext from Python works well in most cases. However, some PDF files from one or two sources contain pages with problematic fonts, which results in garbled text. I think that running OCR only on the problematic pages would be enough to overcome the issue. So my problem is how to identify, before extracting the text, which pages are likely to produce gibberish.

First, I tried to identify garbled text after extracting it, using a regex (\p{Cc} or unlikely characters outside the Latin alphabet), but it did not work because I also found corrupted text made of perfectly valid characters and digits, e.g. AAAAABS12 54c] $( JJJJ Pk.
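For reference, the check I tried looked more or less like this (a sketch using the third-party regex module, since the stdlib re has no \p{...} support; the threshold is arbitrary):

import regex  # third-party module; the stdlib `re` does not support \p{...}

SUSPICIOUS = regex.compile(r"[\p{Cc}\p{Co}]|[^\p{Latin}\p{Common}]")

def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """Flag text whose share of control/private-use/non-Latin characters is high."""
    stripped = "".join(text.split())  # drop whitespace, which would count as \p{Cc}
    if not stripped:
        return False
    hits = len(SUSPICIOUS.findall(stripped))
    return hits / len(stripped) > threshold

# Fails on garbage made of perfectly valid characters:
print(looks_garbled("AAAAABS12 54c] $( JJJJ Pk"))  # False, yet the text is meaningless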

Second, I tried to identify garbled text by calling pdffonts on each page – to get the name, encoding, embeddedness and existence of a ToUnicode map for every font – and parsing its output. In my tests it works reasonably well. But I also found it necessary to count how many characters use the likely problematic fonts; pdftohtml – which can display each text block in a p tag along with its font name – saved the day here. @LMC helped me figure out how to do it, take a look at the answer. The bad part is that I ended up calling subprocess.run twice for each PDF page, which is very expensive. It would be cheaper if I could just bind those tools.
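A sketch of the counting idea, using pdftohtml's -xml output instead of the HTML output used in that answer, since the <fontspec>/<text font=...> structure is easier to parse (assuming that structure holds for your build of pdftohtml):

import subprocess
from collections import Counter
from lxml import etree

def chars_per_font(pdf_path: str, page: int) -> dict:
    """Count extracted characters per font name on a single page."""
    xml = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-f", str(page), "-l", str(page),
         "-stdout", pdf_path],
        capture_output=True, check=True,
    ).stdout
    # recover=True tolerates the occasional invalid character in broken pages
    root = etree.fromstring(xml, parser=etree.XMLParser(recover=True))
    # <fontspec id="0" ... family="DIIDPF+ArialMT"/> maps font ids to names
    fonts = {f.get("id"): f.get("family") for f in root.iter("fontspec")}
    counts = Counter()
    for node in root.iter("text"):
        # <text ... font="0">...</text> carries the text rendered with that font
        counts[fonts.get(node.get("font"), "unknown")] += len("".join(node.itertext()))
    return dict(counts)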

I'd like to know whether it's possible and feasible to look at the PDF source and validate the CMAPs (a ToUnicode entry present and not a Custom encoding), if any, or use some other heuristic to find problematic fonts before extracting text or falling back to OCR.
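As an illustration of what "looking at the PDF source" could mean, here is a rough sketch with pikepdf, which is not one of the tools above; the rule it applies is only my current guess, and the dictionary keys are the standard PDF ones (/ToUnicode, /Encoding, /Subtype, /BaseFont):

import pikepdf

BAD_ENCODINGS = {"/Identity-H", "/Identity-V"}

def fonts_without_tounicode(pdf_path: str) -> dict:
    """Map page number (1-based) -> fonts that may not be text-extractable."""
    flagged = {}
    with pikepdf.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            resources = page.obj.get("/Resources")
            fonts = resources.get("/Font") if resources is not None else None
            if fonts is None:
                continue
            bad = []
            for name in fonts.keys():
                font = fonts[name]
                if "/ToUnicode" in font:
                    continue  # a ToUnicode CMap lets pdftotext map glyphs to Unicode
                encoding = str(font.get("/Encoding", ""))
                subtype = str(font.get("/Subtype", ""))
                if encoding in BAD_ENCODINGS or subtype == "/Type3" or not encoding:
                    bad.append(str(font.get("/BaseFont", name)))
            if bad:
                flagged[number] = bad
    return flagged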

Example of garbled text in one of my PDF files:

0n1n2n3n4n2n0n3n0n5 6n6nÿn89 ÿn4nx0en3nÿnx0fx10nx11nx12nÿn5nÿn6n6nx13nx11nx11nx146n2n2nx15nx11nx16nx12nx15nx10nx11nx0enx11nx17nx12nx18nx0enx17nx19x0enx1anx16n2 x11nx10nx1bx12nx1cnx10nx10nx15nx1d29 2nx18nx10nx16n89 x0enx14nx13nx14nx1enx14nx1fn5 x11x1fnx15nx10n! x1cn89 x1fn5n3n4n"n1n1n5 x1cn89n#x15nx1dx1fn5n5n1n3n5n$n5n1 5n2n5n%8&&#'#(8&)n*+n'#&*,nÿn(*ÿn-n./0)n1n*n*//#//8&)n*ÿn#/2#%)n*,nÿn(*/ÿn/#&3#40)n*/ÿn#50&*-n.()n%)n*)n/ÿn+nÿn*#/#n&x19nx12nÿnx1cÿn,x1dnx12nx1bx10nx15nx116nÿnx15n7nÿn8n9n4n6nÿn%x10nx15nx11nx166nÿn:x12x10;n2n*,n%#26nÿn<n$n3n0n3n+n3n8n3nÿn+nÿn=x15nx10n6nÿn>n9n0n?nÿn4n3n3n1n+n8n9n3n<n@AnBnCnDnEÿnGHnInÿnJnJnKnLnJnMnJnNnOnPnOnQnIn#x1bÿn0n1nÿnx1cnx10nÿn*x1anx16nx18nÿnx1cnx10nÿn0n3n0n5nx0en/x10nx15nx13x16nx12nÿn/x10nx16nx1dx1cx16nx12n6nÿn* x19nx15nx116nÿnx12nx19nx11nx19nx12nx16nÿnx15ÿn/*-nx0enÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿn(x10nÿx16nx1cnx10nx1bÿnx1cnx12nÿn%x13nx10n9nx10nÿnx1cnx10nÿn'x12nx1ax15nx10nx11nx10nÿnx1cnx12nÿn%x16nx16nx10nRnx10nx1cx16nx12nÿn'x10nx16nx12nx18nÿnx1cnx12nÿn-nx19x11n1nx12nÿnx1cÿn#x11nx12nx1cÿnx1cnx10nÿn*x18nx12nRx126nÿn/x16nx12nx0en& x10nx12nx15nx12nÿn%x10nx18x11nx16nx10nÿn:x12x13nx12nx1cx0enÿn*x19nx11nx19nx10n+x10nÿnx10nÿn&x10nRx11nx16nx10n+x10nÿnx15ÿn/*-n2n2'<nÿn+nÿn#Snx11nx16nx12nx17nx19nx1c x12nx18nÿn*x1cnx1bx15x11nx16nx12nx11nx1dx0enÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿnÿn*x11nx10nx15 x12nx1bx10nx15nx11nx10n6nTUnVnWUnXÿnYXÿnTUnVnWnXnXYZUn[UnT\]X\UnWnXnVDn^n_n`nÿnabnÿnXGbncnE^ndnOnPnOnQnPnenOnfnPnfnJnfnPnengnGbnh_nEGIniaAnYjTknXlm@ YjTknXlmX] ]jTk@[Yj] UnZk]UnZUn] X]noUnWnX] W@Vn\nX]nÿn89nÿn89np ÿnqn(x10x14nx12x13n8rnIOVx11x03x14n(VWHx03GRFXPHQWRx03px03FySLDx03GRx03RULJLQDOx03DVVLQDGRx03GLJLWDOPHQWHx03SRUx03(00$18(/$x030$5,$x03&$/$'2x03'(x03)$5,$6x036,/9$x11x033DUDx03FRQIHULUx03Rx03RULJLQDOx0fx03DFHVVHx03Rx03VLWHx03x0fx03LQIRUPHx03Rx03SURFHVVRx03x13x13x13x13x16x17x18x10x1ax18x11x15x13x15x14x11x1bx11x13x15x11x13x13x1ax16x03Hx03Rx03nFyGLJRx03x17(x14x14x16x14x13x11x03

The text above was extracted from page 25 of this document using pdftotext.

For that page, pdffonts outputs:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no      13  0
DIIDPF+ArialMT                       CID TrueType      Identity-H       yes yes yes    131  0
DIIEDH+Arial                         CID TrueType      Identity-H       yes yes no     137  0
DIIEBG+TimesNewRomanPSMT             CID TrueType      Identity-H       yes yes yes    142  0
DIIEDG+Arial                         CID TrueType      Identity-H       yes yes no     148  0
Arial                                TrueType          WinAnsi          yes no  no     159  0

It's easy to identify the font named [none] as problematic. My take so far, given the data I've analysed, is to mark fonts with a Custom or Identity-H encoding, with no ToUnicode map, or named [none] as likely problematic. But, as I said, I also found problematic cases with a ToUnicode table and a non-Custom encoding. As far as I know, it's also possible that only a single character is defined for a broken font and does not affect the overall readability of the page, so it may not be necessary to OCR that page. In other words, if a font on a given page has no ToUnicode conversion, that does not mean the text of the whole page is affected.
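Expressed as code, one reading of that heuristic over the pdffonts output would be (a sketch; the column slicing relies on the ruler line under the header):

import subprocess

COLUMNS = ["name", "type", "encoding", "emb", "sub", "uni", "object ID"]

def page_font_report(pdf_path: str, page: int) -> list:
    """Parse `pdffonts -f N -l N file.pdf` into a list of font records."""
    lines = subprocess.run(
        ["pdffonts", "-f", str(page), "-l", str(page), pdf_path],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    header, ruler, *rows = lines
    # The ruler line ("---- ----- ...") gives the width of each column
    widths = [len(chunk) for chunk in ruler.split(" ")]
    fonts = []
    for row in rows:
        cells, pos = [], 0
        for width in widths[:-1]:
            cells.append(row[pos:pos + width].strip())
            pos += width + 1
        cells.append(row[pos:].strip())  # last column: object ID
        fonts.append(dict(zip(COLUMNS, cells)))
    return fonts

def page_probably_needs_ocr(fonts: list) -> bool:
    """Flag [none] fonts, or Custom/Identity-H fonts with no ToUnicode map."""
    return any(
        f["name"] == "[none]"
        or (f["uni"] == "no" and f["encoding"] in {"Custom", "Identity-H", "Identity-V"})
        for f in fonts
    )

As noted above, this both over- and under-triggers: a single broken glyph can flag a whole page, and some pages whose fonts do have a ToUnicode table still come out garbled.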

I'm looking for a solution that works better than running a regex over the garbled text.

Examples of PDF pages that I had to OCR

All pages below contain text in Portuguese, but if you copy the text and paste it somewhere you will see it is gibberish in any language.

What I’ve done so far

I've avoided calling subprocess twice per page by writing a bash script that iterates over the pages and merges the pdftohtml and pdffonts output for each one into a single HTML document:

#!/bin/sh

# Usage: ./font_report.sh -a 1 -b 100 -c foo.pdf


while getopts "a:b:c:" arg; do
    case $arg in
        a) FIRST_PAGE=$OPTARG;;
        b) LAST_PAGE=$OPTARG;;
        c) FILENAME=$OPTARG;;
        *)
            echo 'Error: invalid options' >&2
            exit 1
    esac
done

: "${FIRST_PAGE:?Missing -a}"
: "${LAST_PAGE:?Missing -b}"
: "${FILENAME:?Missing -c}"

if ! [ -f "$FILENAME" ]; then
    echo "Error: $FILENAME does not exist" >&2
    exit 1
fi

echo "<html xmlns='http://www.w3.org/1999/xhtml' lang='' xml:lang=''>" ;

for page in $(seq "$FIRST_PAGE" "$LAST_PAGE")
do
   { 
       echo "<page number=$page>" ; 
       echo "<pdffonts>" ; 
       pdffonts -f "$page" -l "$page" "$FILENAME" ; 
       echo "</pdffonts>" ;  
       (
           pdftohtml -f "$page" -l "$page" -s -i -fontfullname -hidden "$FILENAME" -stdout | 
           tail -n +35 |  # skips head tag and its content
           head -n -1  # skips html ending tag
        ) ;
       echo "</page>"
    }
done

echo "</html>"

The code above lets me call subprocess only once per file and then parse the resulting HTML with lxml page by page (using the <page> tag). But I still need to look at the text content itself to get an idea of whether the text is broken.
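For completeness, a sketch of the parsing I do on the Python side, assuming the script's output was saved to a file and that lxml's HTML parser keeps the custom <page> and <pdffonts> tags nested as written:

from lxml import html

def fonts_by_page(report_path: str) -> dict:
    """Map page number -> raw pdffonts output, as emitted by font_report.sh."""
    tree = html.parse(report_path)  # lxml's HTML parser tolerates the custom tags
    result = {}
    for page in tree.iter("page"):
        number = int(page.get("number"))
        pdffonts_node = page.find(".//pdffonts")
        result[number] = pdffonts_node.text_content() if pdffonts_node is not None else ""
    return result

The pdffonts text for each page can then be fed to the same column-slicing parser and heuristic shown earlier.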

