A Python automation tool: accurately extract table data from images and say goodbye to tedious manual input

This article combines OCR and computer vision techniques to extract tables from images. The goal is to keep the recognized text accurate while preserving the original table structure, which can save a great deal of time for teams that do much of their work over WeChat.

At work and in daily life, we often run into the same scenario: friends, customers, colleagues, or managers send us table data as a screenshot because it is the quickest way to share it. The image format is inconvenient, though, especially when we need to keep editing the data. There are automatic extraction tools on the market, but they often extract only the text and lose the original structure of the table, so we end up typing the contents of the picture into Excel cell by cell.

Manual entry is tedious, slow, and error-prone, and it drags down productivity.

With Python automation tools, however, this problem becomes much easier to solve.

Python has many mature libraries and tools that can help us extract tabular data from images automatically. By combining computer vision and OCR, we can write a program that identifies the table in an image and converts it into an editable Excel file.

Specifically, we can use the OpenCV library to process the image, applying preprocessing steps such as grayscale conversion, binarization, and noise reduction to improve recognition accuracy. The Tesseract OCR engine then recognizes the text in the image and returns it as a string.
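
To make the binarization step concrete, here is a dependency-free sketch of Otsu's method, a common way to pick the black/white threshold automatically. It is only an illustration: the function names otsu_threshold and binarize are hypothetical, and the image is assumed to be a plain list of rows of 0-255 gray values. In practice OpenCV offers the same idea through cv2.threshold with the cv2.THRESH_OTSU flag.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        # Pixels with value <= level form the "dark" class.
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        weight_light = total - weight_dark
        if weight_dark == 0:
            continue
        if weight_light == 0:
            break
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two classes are far apart.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image, threshold):
    """Map every pixel to 0 (ink) or 255 (background) using the threshold."""
    return [[255 if value > threshold else 0 for value in row] for row in image]
```

A typical call would flatten the grayscale image, compute the threshold once, and then pass the image and the threshold to binarize before handing the result to the OCR engine.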

Next comes the crucial step: identifying the table structure. We need algorithms and heuristics that recover the rows, columns, and cells of the table so that the data stays accurate and complete. Handling the variety of table layouts and formats may also require some natural language processing and machine learning techniques.
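
One simple way to recover rows, columns, and cells is to work from the position of each recognized word rather than from plain text. The sketch below assumes each word arrives as a dictionary with left, top, width, height, and text fields (the same fields that pytesseract.image_to_data reports), and that columns are left-aligned. All function names and pixel tolerances are hypothetical and would need tuning for real screenshots; merged cells are not handled.

```python
def group_rows(boxes, row_tolerance=10):
    """Cluster word boxes into rows by the vertical center of each box."""
    rows = []
    for box in sorted(boxes, key=lambda b: b["top"] + b["height"] / 2):
        center = box["top"] + box["height"] / 2
        if rows and abs(center - rows[-1]["center"]) <= row_tolerance:
            rows[-1]["boxes"].append(box)
        else:
            rows.append({"center": center, "boxes": [box]})
    # Within a row, read the words from left to right.
    return [sorted(row["boxes"], key=lambda b: b["left"]) for row in rows]


def merge_words_into_cells(row, max_gap=25):
    """Join neighboring words into one cell when the gap between them is small."""
    cells = []
    for box in row:
        if cells and box["left"] - cells[-1]["right"] <= max_gap:
            cells[-1]["text"] += " " + box["text"]
            cells[-1]["right"] = box["left"] + box["width"]
        else:
            cells.append({
                "left": box["left"],
                "right": box["left"] + box["width"],
                "text": box["text"],
            })
    return cells


def find_column_starts(rows_of_cells, column_tolerance=20):
    """Cluster the left edges of all cells into column start positions."""
    starts = []
    for left in sorted(cell["left"] for row in rows_of_cells for cell in row):
        if not starts or left - starts[-1] > column_tolerance:
            starts.append(left)
    return starts


def boxes_to_table(boxes):
    """Place every cell into a rectangular grid; missing cells stay empty."""
    rows_of_cells = [merge_words_into_cells(row) for row in group_rows(boxes)]
    starts = find_column_starts(rows_of_cells)
    table = []
    for row in rows_of_cells:
        line = [""] * len(starts)
        for cell in row:
            # Assign the cell to the column whose start is closest to its left edge.
            column = min(range(len(starts)), key=lambda i: abs(starts[i] - cell["left"]))
            line[column] = cell["text"]
        table.append(line)
    return table
```

Because every cell is assigned to the nearest column start, a row with a missing value still lines up with the header instead of shifting left, which is the main weakness of splitting plain OCR text on whitespace.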

Once the table structure has been identified, we can organize the extracted data into rows and columns and load it into a DataFrame object using the pandas library. DataFrame is the core data structure in pandas: it makes tabular data easy to store and manipulate, and it can be exported directly to an Excel file.
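
As a minimal sketch of this step, the following assumes the extracted table is a list of rows (each a list of strings) and that the first row holds the column headers. The helper names table_to_dataframe and save_dataframe are hypothetical; DataFrame.to_excel is the pandas call that writes the file, and it relies on an Excel engine such as openpyxl being installed.

```python
import pandas as pd


def table_to_dataframe(table_data):
    """Build a DataFrame, treating the first row as the column headers."""
    if not table_data:
        return pd.DataFrame()
    # Pad short rows so every row has the same number of cells.
    width = max(len(row) for row in table_data)
    padded = [list(row) + [""] * (width - len(row)) for row in table_data]
    header, *body = padded
    return pd.DataFrame(body, columns=header)


def save_dataframe(df, output_path):
    """Write the DataFrame to an .xlsx file without the index column."""
    df.to_excel(output_path, index=False)
```

Going through a DataFrame instead of writing cells directly is mainly useful when the data needs cleaning first, for example converting numeric columns or dropping empty rows before export.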

Finally, we save the DataFrame object as an Excel file, giving us a spreadsheet that mirrors the structure of the table in the original image. From there we can edit, analyze, and process the data as usual, which saves a great deal of time.

Core sample code

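import re

from PIL import Image
import pytesseract
from openpyxl import Workbook


def load_image(image_path):
    """Open the image file with Pillow."""
    return Image.open(image_path)


def convert_to_grayscale(image):
    """Convert to 8-bit grayscale ("L" mode) to simplify the input for OCR."""
    return image.convert("L")


def extract_text(image):
    """Run Tesseract on the image and return the recognized text."""
    # --psm 6 treats the image as one uniform block of text, and
    # preserve_interword_spaces=1 keeps the wide gaps between columns.
    config = "--psm 6 -c preserve_interword_spaces=1"
    return pytesseract.image_to_string(image, config=config)


def extract_table_data(text):
    """Split the OCR text into rows and cells."""
    # Assumption: cells are separated by a tab or by two or more spaces,
    # since Tesseract's plain-text output normally contains no tabs.
    rows = [row for row in text.splitlines() if row.strip()]
    return [re.split(r"\t|\s{2,}", row.strip()) for row in rows]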


def save_as_excel(table_data, output_path):
    """Write the rows to the active worksheet and save the workbook."""
    workbook = Workbook()
    sheet = workbook.active

    # openpyxl rows and columns are 1-indexed.
    for row_index, row_data in enumerate(table_data, start=1):
        for column_index, cell_data in enumerate(row_data, start=1):
            sheet.cell(row=row_index, column=column_index, value=cell_data)

    workbook.save(output_path)


# Example usage
image_path = "table_image.jpg"
output_path = "table_data.xlsx"

image = load_image(image_path)
grayscale_image = convert_to_grayscale(image)
text = extract_text(grayscale_image)
table_data = extract_table_data(text)
save_as_excel(table_data, output_path)

For clean images with clearly separated columns, the text extracted this way keeps the rows and columns of the original table.

In conclusion, Python automation gives us an efficient way to extract tabular data from images and convert it into an editable Excel file. It removes most of the tedium of entering table data by hand while helping to keep the data accurate and complete. Let's embrace Python automation and say goodbye to tedious manual input!