Geospatial Solutions Expert: Using pyTesseract to extract text from picture

Sunday, February 25, 2024

Using pyTesseract to extract text from picture

Here we have a set of screenshots of text taken from web pages.

We need to extract specific text from the images (in this case, lines that contain the strings 'Address' or 'Location'), so we make use of PIL and pytesseract.

import glob
import pytesseract
from PIL import Image
# Download tesseract.exe: https://github.com/UB-Mannheim/tesseract/wiki
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


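# Collect the paths of all PNG screenshots in the folder...
# (a plain raw string is enough here; the f-prefix had nothing to interpolate)
images = glob.glob(r'C:\Users\`HYJ7\Documents\Jupyter_Notebooks\Naveda Company Scrapping\imgs\*.png')
print(len(images))  # number of screenshots found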

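# Read one image using PIL and extract its text using pyTesseract
# Read img (index 67 is just the screenshot picked for this example)...
image = Image.open(images[67])
# Extract text...
extracted_text = pytesseract.image_to_string(image)
# Split the OCR output into individual lines...
clean_txt = extracted_text.strip().splitlines()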

# Print only the lines that mention an address or location (case-sensitive match)...
for c in clean_txt:
    if any(substring in c for substring in ['Address', 'Location']):
        print(c)

print('Done...')
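
The snippet above reads a single screenshot and matches 'Address' and 'Location' with exact capitalisation. As a minimal sketch of how the same idea could be applied to every screenshot with case-insensitive matching, the helpers below may be useful. The names `matching_lines` and `scan_images` are hypothetical (they are not part of pytesseract or PIL), and the OCR step is passed in as a function so the filtering logic can be tried without Tesseract installed.

```python
from typing import Callable, Dict, Iterable, List


def matching_lines(text: str, keywords: Iterable[str]) -> List[str]:
    """Return the non-empty lines of `text` that contain any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and any(k in line.lower() for k in lowered)
    ]


def scan_images(
    paths: Iterable[str],
    ocr: Callable[[str], str],
    keywords: Iterable[str] = ("address", "location"),
) -> Dict[str, List[str]]:
    """Run `ocr` on every path and collect the matching lines per path.

    `ocr` is any function that takes an image path and returns its text
    (hypothetical hook; see the commented usage below).
    """
    keywords = list(keywords)  # allow generators to be reused across paths
    return {path: matching_lines(ocr(path), keywords) for path in paths}


# Possible usage with the objects from the post (assumes Tesseract is set up):
# results = scan_images(images, lambda p: pytesseract.image_to_string(Image.open(p)))
# for path, lines in results.items():
#     print(path, lines)
```

Passing the OCR function as an argument is only one design choice; calling `pytesseract.image_to_string` directly inside the loop would work just as well if the helper never needs to be tested on its own.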

That is it!
