Sunday, February 25, 2024

Using pytesseract to extract text from pictures

Here we have screenshots of text taken from web pages.

We need to extract specific text from the images (in this case, lines that contain the strings 'Address' or 'Location'), so we use PIL and pytesseract.




import glob
import pytesseract
from PIL import Image
# Download tesseract.exe: https://github.com/UB-Mannheim/tesseract/wiki
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


# Collect the paths of all PNG screenshots...
images = glob.glob(r'C:\Users\`HYJ7\Documents\Jupyter_Notebooks\Naveda Company Scrapping\imgs\*.png')
len(images)

# Read one image using PIL and extract its text using pytesseract
# Read img (index 67 is just one example; any valid index works)...
image = Image.open(images[67])
# Extract text...
extracted_text = pytesseract.image_to_string(image)
clean_txt = extracted_text.strip().split('\n')

# Print only the lines that mention an address or a location (case-sensitive match)
for c in clean_txt:
    if any(substring in c for substring in ['Address', 'Location']):
        print(c)

print('Done...')
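
The snippet above handles a single image and matches 'Address' and 'Location' with exact capitalization. Below is a minimal sketch of how the same idea could be extended to ignore case and to run over a whole folder. The helper names find_keyword_lines and scan_folder are hypothetical (they are not part of pytesseract or PIL), and the sample text is made up for illustration.

```python
from pathlib import Path


def find_keyword_lines(text, keywords=("address", "location")):
    """Return the non-empty lines of `text` that contain any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    matches = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and any(k in stripped.lower() for k in lowered):
            matches.append(stripped)
    return matches


def scan_folder(folder, pattern="*.png"):
    """Run OCR on every matching image in `folder`; map file name -> matching lines."""
    # Imported here so find_keyword_lines can be used without the OCR dependencies.
    import pytesseract
    from PIL import Image

    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        with Image.open(path) as image:
            text = pytesseract.image_to_string(image)
        results[path.name] = find_keyword_lines(text)
    return results


# Made-up sample text standing in for OCR output.
sample = "Acme Ltd\nADDRESS: 12 Example Road\n\nPhone: 000\nlocation - Springfield"
print(find_keyword_lines(sample))
```

To process real screenshots, call scan_folder with the folder that holds the images, after setting pytesseract.pytesseract.tesseract_cmd as shown earlier if Tesseract is not on your PATH.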




That is it!
