
Python Office Automation: PDF to Excel Table Extraction Practice

Author: Artificial intelligence learns from people

In today's digital office, we often need to convert between file formats and extract data from documents. Python handles this kind of task efficiently and accurately. Today I'm sharing a hands-on case: extracting tables from a PDF file into Excel with Python.

A friend of mine, a senior executive, recently had to work with a PDF employee handbook template that contained several tables. Because of how PDFs store content, copying and pasting the tables straight into Excel did not work, and rebuilding them by hand would have been both time-consuming and error-prone. So I decided to help her with Python.


First, I chose two Python libraries: tabula-py and pandas. tabula-py is a Python wrapper around the Java library Tabula (so it needs a Java runtime installed) and extracts tabular data from PDFs, while pandas is a data processing library that handles tabular data and can save it in Excel format.

Next, I wrote a short Python script that extracts the tables from the PDF and saves each one as an Excel file:

Python code

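import tabula
import pandas as pd

# Path to the PDF file (the name means "performance appraisal tables")
pdf_path = '绩效考核表格.pdf'

# Read every table on every page of the PDF.
# read_pdf returns a list with one pandas DataFrame per detected table.
tables = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)

# Loop over the tables and save each one
for i, table in enumerate(tables):
    # Each item is already a DataFrame; this call only makes that explicit
    df = pd.DataFrame(table)

    # Clean the data; extra steps may be needed depending on the table structure,
    # for example dropping empty rows or fixing the column names
    # df = df.dropna(how='all')  # drop rows that are entirely empty
    # df.columns = ['Column1', 'Column2', ...]  # set column names

    # Save the cleaned DataFrame as an Excel file
    # (pandas needs an Excel engine such as openpyxl to write .xlsx)
    excel_path = f'extracted_table_{i}.xlsx'
    df.to_excel(excel_path, index=False)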

           

This code first reads all the tables in the PDF file with the tabula.read_pdf function, which returns them as a list, and then iterates over that list. Each table already arrives as a pandas DataFrame, so the pd.DataFrame(table) call simply makes that explicit. Inside the loop we can clean the data to suit the structure of the specific table, for example by removing blank rows or setting column names. Finally, the df.to_excel method saves the cleaned DataFrame as an Excel file.
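The cleaning step is only hinted at in the commented-out lines of the script, so here is a minimal sketch of what it might look like. The helper name clean_table and the sample data are hypothetical, made up for illustration; real tables extracted from a PDF will likely need rules of their own.

```python
import pandas as pd


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Drop fully empty rows and columns and tidy the header labels."""
    # Remove rows, then columns, in which every cell is missing
    cleaned = df.dropna(how="all").dropna(axis=1, how="all")
    # Collapse stray whitespace and line breaks inside header cells
    cleaned.columns = [" ".join(str(c).split()) for c in cleaned.columns]
    return cleaned.reset_index(drop=True)


# Made-up data standing in for one extracted table
raw = pd.DataFrame(
    {
        " Name ": ["Alice", None, "Bob"],
        "Score\r(Q1)": [90, None, 85],
        "Unnamed: 0": [None, None, None],
    }
)
tidy = clean_table(raw)
print(tidy)
```

In the loop of the script, a call such as df = clean_table(df) would go just before df.to_excel.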

By running this script, my friend extracted all the tables from the PDF employee handbook template into Excel files, with the table data carried over as it appeared in the PDF. This greatly improved her productivity and avoided the errors that come with manual re-entry.

This case shows how useful Python can be for office automation. With the right libraries and a short script, file format conversion and data extraction become much easier. If you face a similar challenge, give Python office automation a try; it may pleasantly surprise you.
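The script writes one .xlsx file per table. If a single workbook with one sheet per table is more convenient, a variation along the following lines might work. This is a sketch rather than part of the original solution: the names sheet_name and write_tables_to_workbook are hypothetical, and it assumes an Excel engine such as openpyxl is installed.

```python
import pandas as pd


def sheet_name(index: int, prefix: str = "Table") -> str:
    """Build a sheet name that fits Excel's 31-character limit."""
    suffix = f"_{index}"
    return prefix[: 31 - len(suffix)] + suffix


def write_tables_to_workbook(tables, excel_path):
    """Write each DataFrame in `tables` to its own sheet of one workbook."""
    with pd.ExcelWriter(excel_path) as writer:
        for i, df in enumerate(tables, start=1):
            df.to_excel(writer, sheet_name=sheet_name(i), index=False)


print(sheet_name(1))
```

With the list returned by tabula.read_pdf, the call would be write_tables_to_workbook(tables, 'extracted_tables.xlsx'), where the output file name is again only an example.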
