#StackBounty: #python #python-tesseract #pdf-extraction Pytesseract hindi extraction not accurate

Bounty: 50

I am trying to extract Hindi text from a pdf. I tried all the methods to exract from the pdf but none of them worked. There are explanations why it doesn’t work but no answers as such. So, I decided to convert the pdf to an image and then use PyTesseract to exract text. I have downloaded the hindi trained data, however that also gives highly inaccurate text.

import fitz

filepath = "D:\BADI KA BANS-Ward No-002.pdf"

doc = fitz.open(filepath)
page = doc.loadPage(3)  # number of page
pix = page.getPixmap()
output = "outfile.png"
pix.writePNG(output)
from PIL import Image
import pytesseract

# Include tesseract executable in your path
pytesseract.pytesseract.tesseract_cmd = r"C:Program FilesTesseract-OCRtesseract.exe"

# Create an image object of PIL library
image = Image.open('outfile.png')

# pass image into pytesseract module
# pytesseract is trained in many languages
image_to_text = pytesseract.image_to_string(image, lang='hin')

# Print the text
print(image_to_text)

Output Sample:-

कार बिता देवी व ०... नाम बाइुनान िक०क नाक तो
पति का नाव: रवजी लात. “50९... पिला का सामशामाव.... “पति का नाम: बादुलल
कान सब: 43 लसमनंध्या: 93९. मकान ंब्या: 3९
आप: 29 _ लिंग सी. | आइ 57 लिंग पुरुष आप: 62 लिंग सी
एजगल्णब्णस्य (बन्द जगाख्मिणण्य
नमः बायगी बसों ०४... नि बयावर्णो ०५०... निफर सनक नी
चिता का नामजबूजल वर्ष.“ ००० | पिला का नामब्राइलाल वर्षो... 0 2... | पिता कामामशुल चब्द .... “20०
|सकानसंब्या: 43९ बसवकंब्या: 43९. कान संब्या: 44
जाए: 27 लिंग सो कई: 27 नि खी मा लिंग पुरुष

Actual Hindi:-

Actual Hindi

There is an answer to this question I want to scrape a Hindi(Indian Langage) pdf file with python, which seems to tell how to do this but provides no explanation whatsoever.

A link to the pdf :-

https://documentcloud.adobe.com/link/review?uri=urn:aaid:scds:US:cbf06634-41ae-4061-b548-53e63b91d4da

Anyway to do this other than to train the language model myself?


Get this bounty!!!

#StackBounty: #python #python-tesseract Using pytesseract ocr in pythonanywhere for non english languages

Bounty: 50

I am creating a website in pythonanywhere for OCR.In this user can upload text-images and download it in editable format. For english language it is working perfectly, but while i try to include some additional languages (south Indian languages) it showing some error messages.

i put my additional traineddata in folder "/home/wiltomalayalamocr/mysite/langfiles" it contains "mal.traineddata" file

and in my code

        pytesseract.pytesseract.tesseract_cmd = r"/usr/bin/tesseract"
        custom_oem_psm_config = '-l {} --psm {} --tessdata-dir "/home/wiltomalayalamocr/mysite/langfiles"'.format(lang,6)
        text = pytesseract.image_to_string(Image.open(filename) , config=custom_oem_psm_config)

in which lang="mal"
but i am getting the error

pytesseract.pytesseract.TesseractError: (1, 'Tesseract Open Source OCR Engine v3.04.01 with Leptonica Error opening data file /usr/share/tesseract-ocr/tessdata/mal.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language 'mal' Tesseract couldn't load any languages! Could not initialize tesseract.')

i am using python-Flask framework

Anybody can help me ….


Get this bounty!!!

#StackBounty: #python #python-3.x #ocr #python-tesseract Why can't get string with PIL and pytesseract?

Bounty: 150

It is a simple Optical Character Recognition (OCR) program in Python 3 to get string, I have uploaded the target gif file here, please download it and save it as /tmp/target.gif.

enter image description here

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('/tmp/target.gif')))

I paste all the error info here, please fix it to get the characters from image.

/usr/lib/python3/dist-packages/PIL/Image.py:925: UserWarning: Couldn't allocate palette entry for transparency
  "for transparency")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 309, in image_to_string
    }[output_type]()
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 308, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 208, in run_and_get_output
    temp_name, input_filename = save_image(image)
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 136, in save_image
    image.save(input_file_name, format=img_extension, **image.info)
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 1728, in save
    save_handler(self, fp, filename)
  File "/usr/lib/python3/dist-packages/PIL/GifImagePlugin.py", line 407, in _save
    _get_local_header(fp, im, (0, 0), flags)
  File "/usr/lib/python3/dist-packages/PIL/GifImagePlugin.py", line 441, in _get_local_header
    transparency = int(transparency)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'

I convert it with convert command in bash.

convert  "/tmp/target.gif"   "/tmp/target.jpg"

I show /tmp/target.gif and /tmp/target.jpg here.
enter image description here

Then execute the above python code again.

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('/tmp/target.jpg')))

Nothing can i get with the pytesseract.image_to_string(Image.open('/tmp/target.jpg')),i get blank character.

enter image description here


Get this bounty!!!