Sunday, February 25, 2024

Using pyTesseract to extract text from picture

 Here we got screenshots of text from web pages as seen below.

There is need to extract specific text from the images (in this case text that contain 'address' or 'location' strings), so we make use of PIL and pytesseract




import glob
import pytesseract
from PIL import Image
# Download tesseract.exe: https://github.com/UB-Mannheim/tesseract/wiki
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


# Extract text from image...
images = glob.glob(fr'C:\Users\`HYJ7\Documents\Jupyter_Notebooks\Naveda Company Scrapping\imgs\*.png')
len(images)

# Read image using PIL and extract text using pyTesseract
# Read img...
image = Image.open(images[67])
# Extrxct text...
extracted_text = pytesseract.image_to_string(image)
clean_txt = extracted_text.strip().split('\n')

for c in clean_txt:
    if any(substring in c for substring in ['Address', 'Location']):
        print(c)

print('Done...')




That is it!

Friday, February 23, 2024

Wednesday, February 14, 2024

Making a fantasy contour map for background screens

 If you ever wanted to make custom contour images for use in your desktop background as I usually do, then you have come to the right post.

I will show you how to make background contour image like the one below without having to obtain X, Y, Z data and without using a sophisticated GIS tool.


The software we will use is the free graphics software call 'Inkscape'. Download and install it. Then follow these steps:-

  1. Launch the Inkscape software
  2. Draw a shape (example: rectangle, square, circle, polygon etc)
  3. Go to 'Fill and Stroke'
  4. Select 'Pattern Fill' >> 'Nature patterns'
  5. Select the pattern named ''Terrain"


Then play around with Scale X, Scale Y, Orientation, Offset X, Offset Y, Gap X, Gap Y, Color etc






The final result will depend on your creativity.

Thank you for reading!

Friday, February 9, 2024

Extract images from PDF file

 The code snippet below will read a PDF file and extract the images in every page of the file into a folder.

The file name is structured like so: Image-{x}_{index}.png where x is the PDF page number while index is an arbitrary number that increment to make the names unique for each file.

from spire.pdf.common import *
from spire.pdf import *


# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile(r"dermatology-atlas-for-skin-color_compress.pdf")

for x in range(0, 305): # 305 is the expected number of pages
    print('Processing...', x)
    # Get a specific page
    page = doc.Pages[x]

    # Extract images from the page
    images = []
    for image in page.ExtractImages():
        images.append(image)

    # Save images to specified location with specified format extension
    index = 0
    for image in images:
        imageFileName = f'image_for_PDF/Image-{x}_{index}.png'
        index += 1
        image.Save(imageFileName, ImageFormat.get_Png())
        
doc.Close()


The output result of the PDF file: dermatology-atlas-for-skin-color_compress.pdf is as shown below:-

That is it!

Monday, February 5, 2024

Algorithm to Shift last two elements of a python list and pad in between

 In this scenario, I got a list that most always contain six items. If the items contained are less that six, always move the last two items to the end and pad between with empty space.

For example this ['A', 'B', 'E', 'F'] list will becomes this: ['A', 'B', '', '', 'E', 'F']

The code:

# Shift last two elements of a list and pad in between with a '<empty string>'
input_list = ['A', 'B', 'E', 'F']
print('Original list: ', input_list)

last_two_elem = input_list[-2:] # Get last two elements
del input_list[-2:] # Delete last two elements
print(f'Last two elements of the list: {last_two_elem}, the lsit after removing last two elements: {input_list}')

# Pad the list to nth time...
pad_value = 4
input_list += [''] * (pad_value - len(input_list)) # for 6 len list...
print(f'The lsit after padding: {input_list}')

# Extend the list with the last two elements...
input_list.extend(last_two_elem)
print(f'The lsit after extending: {input_list}')



That is it!