Python PDF-related modules

Table of contents

  • 1. pdfplumber
  • pdfplumber.Page table-extraction methods
  • 2. PyMuPDF: image-related module
  • 3. Some examples
  • 3.1 Plain pdf2txt
  • 3.2 Extracting PDF text and signature information
  • 3.3 Example: e_invoice text extraction
  • 4. pdf2word

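1. pdfplumber

pdfplumber is a PDF parsing library built on pdfminer and written entirely in Python. It exposes detailed information about every character, rectangle, and line on a page, and it can also extract text and tables. pdfplumber currently works only with text-based (machine-generated) PDFs, not scanned ones. [GitHub repository]

Compared with pdfminer:

1. Both give access to the details of each character, rectangle, line, and similar object, but pdfplumber wraps and post-processes pdfminer's output so that the resulting objects are easier to work with.

2. Both can parse text, but the layout of pdfminer's text output can differ considerably from the original, whereas text extracted by pdfplumber tends to stay closer to the original layout.

pdfplumber also implements table extraction: it locates and recognizes tables in a PDF from the positions of basic objects such as characters and lines.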

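The pdfplumber.PDF class has two main properties:

  • .metadata: a dictionary of metadata key/value pairs
  • .pages: a list with one pdfplumber.Page instance per page

Each pdfplumber.Page has the following properties:

  • .page_number: the page number
  • .width: the page width
  • .height: the page height
  • .objects / .chars / .lines / .rects: .chars, .lines, and .rects are each a list with one dictionary per object on the page, and .objects groups these lists by object type. Each dictionary describes one object (line, character, rectangle, and so on), including its position.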
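Because every object is just a dictionary with position keys, selecting objects by region is plain dictionary filtering. The following is a minimal sketch on hypothetical character records (the sample values and the helper names `overlaps` and `inside` are assumptions for illustration, not part of pdfplumber). It shows the difference between keeping objects that lie partly inside a box and keeping only those that lie entirely inside it.

```python
# Hypothetical character records shaped like pdfplumber's char dicts
# (only the positional keys used here are included).
chars = [
    {"text": "A", "x0": 10.0, "x1": 18.0, "top": 20.0, "bottom": 30.0},
    {"text": "B", "x0": 45.0, "x1": 55.0, "top": 20.0, "bottom": 30.0},
    {"text": "C", "x0": 80.0, "x1": 88.0, "top": 20.0, "bottom": 30.0},
]


def overlaps(obj, bbox):
    """True if obj lies at least partly inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] < x1 and obj["x1"] > x0
            and obj["top"] < bottom and obj["bottom"] > top)


def inside(obj, bbox):
    """True if obj lies entirely inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] >= x0 and obj["x1"] <= x1
            and obj["top"] >= top and obj["bottom"] <= bottom)


box = (0.0, 0.0, 50.0, 100.0)
partly = [c["text"] for c in chars if overlaps(c, box)]
fully = [c["text"] for c in chars if inside(c, box)]
print(partly)  # ['A', 'B']
print(fully)   # ['A']
```

"B" straddles the right edge of the box, so it survives the partial test but not the strict one.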

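Some commonly used methods:

  • .extract_text(): extracts the page's text by collating all character objects into a single string
  • .extract_words(): returns all words together with their related information
  • .extract_tables(): extracts the tables on the page
  • .to_image(): for visual debugging; returns an instance of the PageImage class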

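Method / Description

.crop(bounding_box)
    Returns a version of the page cropped to the bounding box, which is given as a 4-tuple (x0, top, x1, bottom). The cropped page keeps objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the box.

.within_bbox(bounding_box)
    Similar to .crop, but keeps only objects that fall entirely within the bounding box.

.filter(test_function)
    Returns a version of the page containing only the .objects for which test_function(obj) returns True.

.extract_text(x_tolerance=0, y_tolerance=0)
    Collates all of the page's character objects into a single string.
    - Adds a space wherever the difference between the x1 of one character and the x0 of the next is greater than x_tolerance.
    - Adds a newline wherever the difference between the doctop of one character and the doctop of the next is greater than y_tolerance.

.extract_words(x_tolerance=0, y_tolerance=0, horizontal_ltr=True, vertical_ttb=True)
    Returns a list of all word-like objects and their bounding boxes. A word is a sequence of characters in which (for "upright" characters) the difference between the x1 of one character and the x0 of the next is at most x_tolerance, and the difference between the doctop of one character and the doctop of the next is at most y_tolerance. Non-upright characters are handled similarly, except that the vertical rather than the horizontal distance between them is measured. The horizontal_ltr and vertical_ttb parameters indicate whether words should be read left to right (for horizontal words) and top to bottom (for vertical words).

.extract_tables(table_settings)
    Extracts table data from the page. See "pdfplumber.Page table-extraction methods" below for details.

.to_image(**conversion_kwargs)
    Returns an instance of the PageImage class. See the "Visual debugging" section of the pdfplumber documentation for details, including the accepted conversion_kwargs.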
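The tolerance rule described for .extract_words can be modelled in a few lines. This is a simplified sketch for upright, left-to-right text only, written on hypothetical character records; `group_words` is an illustrative helper, not pdfplumber's implementation, and it assumes characters on one line share the same doctop.

```python
def group_words(chars, x_tolerance=3, y_tolerance=3):
    """Group char dicts into word strings using the gap tolerances."""
    # Reading order: top to bottom, then left to right.
    ordered = sorted(chars, key=lambda c: (c["doctop"], c["x0"]))
    words, current = [], []
    for ch in ordered:
        if current:
            prev = current[-1]
            same_line = abs(ch["doctop"] - prev["doctop"]) <= y_tolerance
            close = ch["x0"] - prev["x1"] <= x_tolerance
            if not (same_line and close):
                # Gap too large: close the current word and start a new one.
                words.append("".join(c["text"] for c in current))
                current = []
        current.append(ch)
    if current:
        words.append("".join(c["text"] for c in current))
    return words


# Hypothetical characters: "PDF", a 5-point gap, "ok", then "x" on the next line.
sample = [
    {"text": "P", "x0": 0.0, "x1": 5.0, "doctop": 100.0},
    {"text": "D", "x0": 5.5, "x1": 10.0, "doctop": 100.0},
    {"text": "F", "x0": 10.2, "x1": 15.0, "doctop": 100.0},
    {"text": "o", "x0": 20.0, "x1": 25.0, "doctop": 100.0},
    {"text": "k", "x0": 25.5, "x1": 30.0, "doctop": 100.0},
    {"text": "x", "x0": 0.0, "x1": 5.0, "doctop": 115.0},
]
print(group_words(sample, x_tolerance=3))  # ['PDF', 'ok', 'x']
print(group_words(sample, x_tolerance=6))  # ['PDFok', 'x']
```

Raising x_tolerance above the 5-point gap merges the two words, which is the usual reason to tune this parameter when words come out split or glued together.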

pdfplumber.Page table-extraction methods

Extracting tables: [example]

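Method / Description

.find_tables(table_settings={})
    Returns a list of Table objects. A Table object provides access to the .cells, .rows, and .bbox properties, as well as the .extract(x_tolerance=3, y_tolerance=3) method.

.extract_tables(table_settings={})
    Returns the text extracted from all tables found on the page, represented as a list of lists of lists with the structure table -> row -> cell.

.extract_table(table_settings={})
    Returns the text extracted from the largest table on the page, represented as a list of lists with the structure row -> cell. (If several tables have the same size, measured by number of cells, this method returns the one closest to the top of the page.)

.debug_tablefinder(table_settings={})
    Returns an instance of the TableFinder class, with access to the .edges, .intersections, .cells, and .tables properties.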
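Since a single extracted table is a plain list of rows, turning it into records needs no PDF library at all. The sketch below works on a hypothetical row -> cell result; the sample data, the use of None for an empty cell, and the helper name `table_to_records` are assumptions for illustration.

```python
# Hypothetical result with the row -> cell structure described above.
table = [
    ["Item", "Qty", "Price"],
    ["Pen", "2", "3.50"],
    ["Notebook", None, "12.00"],  # None stands for an empty cell (assumption)
]


def table_to_records(rows):
    """Use the first row as the header and map each later row to a dict."""
    header, *body = rows
    return [
        {key: (cell or "").strip() for key, cell in zip(header, row)}
        for row in body
    ]


records = table_to_records(table)
print(records[0])  # {'Item': 'Pen', 'Qty': '2', 'Price': '3.50'}
```

For the table -> row -> cell structure, the same helper can be applied to each table in turn.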
from io import BytesIO

import pdfplumber
from PIL import Image


class PDF_Text_Image:
    def pdf_extract_word(self, pdf_path):
        # Keep all page access inside the with-block: the file is closed on exit.
        with pdfplumber.open(pdf_path) as pdf:
            print("Metadata", pdf.metadata)
            print("Total_pages", len(pdf.pages))
            for page in pdf.pages:
                pg_width = page.width
                pg_height = page.height
                pg_num = page.page_number
                # Collect the position and text of every word on the page
                raw_data = {"x0": [], "x1": [], "top": [], "bottom": [], "text": []}
                for word in page.extract_words():
                    for key in raw_data:
                        raw_data[key].append(word[key])
                pg_tables = page.find_tables()
                pg_images = page.images
                try:
                    # Save the first embedded image if its raw stream can be decoded
                    save_name = './aaaa.'
                    img = Image.open(BytesIO(pg_images[0]["stream"].rawdata))
                    img.save(save_name + img.format, quality=95)
                except Exception:
                    print("image error!")

    def extract_jpg_from_pdf(self, path):
        # Scan the raw bytes for JPEG data embedded in PDF stream objects
        with open(path, "rb") as f:
            pdf = f.read()
        start_mark = b"\xff\xd8"  # JPEG start-of-image marker
        start_fix = 0
        end_mark = b"\xff\xd9"    # JPEG end-of-image marker
        end_fix = 2

        i = 0
        n_jpg = 0

        while True:
            is_stream = pdf.find(b"stream", i)
            if is_stream < 0:
                break

            is_start = pdf.find(start_mark, is_stream, is_stream + 20)
            if is_start < 0:
                i = is_stream + 20
                continue

            is_end = pdf.find(b"endstream", is_start)
            if is_end < 0:
                raise Exception("Didn't find end of stream !")
            is_end = pdf.find(end_mark, is_end - 20)
            if is_end < 0:
                raise Exception("Didn't find end of JPG!")

            is_start += start_fix
            is_end += end_fix

            print("JPG %d from %d to %d" % (n_jpg, is_start, is_end))
            jpg = pdf[is_start:is_end]

            print("Extracting image pic_%d.jpg" % n_jpg)
            with open("./pic_%d.jpg" % n_jpg, "wb") as f:
                f.write(jpg)

            n_jpg += 1
            i = is_end

Extracting images from a PDF without third-party packages

def extract_jpg_from_pdf(path):
    pdf = open(path, "rb").read()
 
    start_mark = b"\xff\xd8"
    start_fix = 0
    end_mark = b"\xff\xd9"
    end_fix = 2
 
    i = 0
    n_jpg = 0
 
    while True:
        is_stream = pdf.find(b"stream", i)
        if is_stream < 0:
            break
 
        is_start = pdf.find(start_mark, is_stream, is_stream + 20)
        if is_start < 0:
            i = is_stream + 20
            continue
 
        is_end = pdf.find(b"endstream", is_start)
        if is_end < 0:
            raise Exception("Didn't find end of stream !")
        is_end = pdf.find(end_mark, is_end - 20)
        if is_end < 0:
            raise Exception("Didn't find end of JPG!")
 
        is_start += start_fix
        is_end += end_fix
 
        print("JPG %d from %d to %d" % (n_jpg, is_start, is_end))
        jpg = pdf[is_start:is_end]
 
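        print("Extracting image pic_%d.jpg" % n_jpg)
        with open("pic_%d.jpg" % n_jpg, "wb") as jpg_file:
            jpg_file.write(jpg)

        # Move on to the next stream; without this the loop never terminates
        n_jpg += 1
        i = is_end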
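The marker-scanning idea can be checked without a real PDF. The sketch below applies the same search to an in-memory byte string and returns offsets instead of writing files; `find_jpeg_spans` and the synthetic `fake_pdf` bytes are assumptions made up for illustration, and the fake payload is not a valid image.

```python
def find_jpeg_spans(data):
    """Return (start, end) offsets of JPEG payloads found inside streams."""
    spans = []
    pos = 0
    while True:
        stream_at = data.find(b"stream", pos)
        if stream_at < 0:
            break
        # A JPEG payload starts with FF D8 shortly after the "stream" keyword.
        start = data.find(b"\xff\xd8", stream_at, stream_at + 20)
        if start < 0:
            pos = stream_at + 20
            continue
        end_stream = data.find(b"endstream", start)
        if end_stream < 0:
            break
        # The FF D9 end marker sits just before "endstream".
        end = data.find(b"\xff\xd9", max(end_stream - 20, 0))
        if end < 0:
            break
        end += 2  # include the two marker bytes
        spans.append((start, end))
        pos = end
    return spans


fake_jpeg = b"\xff\xd8" + b"JPEGDATA" + b"\xff\xd9"
fake_pdf = b"%PDF-1.4\n1 0 obj\nstream\n" + fake_jpeg + b"\nendstream\nendobj\n"
spans = find_jpeg_spans(fake_pdf)
print(spans)  # [(24, 36)]
```

This approach only finds images stored as raw JPEG streams; images compressed with other filters are not detected.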

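2. PyMuPDF: image-related module

Install with pip install PyMuPDF (package page: https://pypi.org/project/PyMuPDF/). See the official documentation, in particular its image-related pages. This module seems to focus more on converting between PDFs and images.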

pdf2image

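import fitz  # PyMuPDF

pdf_path2 = "sample.pdf"  # placeholder: path of the PDF to convert
doc = fitz.open(pdf_path2)
for page in doc:
    # ============ Method 1 ============
    # Render the page directly to an image.
    # (Older PyMuPDF releases spelled these calls getPixmap / writePNG.)
    pix = page.get_pixmap(alpha=False)
    pix.save("../../page-%i_1.png" % page.number)

    # ============ Method 2 ============
    zoom_x = 2.0
    zoom_y = 2.0
    mat = fitz.Matrix(zoom_x, zoom_y)
    pix = page.get_pixmap(matrix=mat)  # render at 2x zoom in both directions
    pix.save("../../page-%i_2.png" % page.number)
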
import os
import fitz  # PyMuPDF is imported as fitz


if __name__ == '__main__':
    base_path = input("Enter the folder containing the PDFs to convert: ")
    filenames = os.listdir(base_path)      # list the files in the folder
    for filename in filenames:
        if not filename.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        full_path = os.path.join(base_path, filename)  # full path of the PDF
        doc = fitz.open(full_path)  # Document object; indexing it yields pages
        rotate = 0    # rotation angle of the output image
        zoom_x = 2.0  # scale factor along the X axis relative to the PDF
        zoom_y = 2.0  # scale factor along the Y axis relative to the PDF
        trans = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
        print("%s: converting..." % filename)
        new_full_name = os.path.splitext(full_path)[0]  # keep the base file name
        if doc.page_count > 1:
            for pg in range(doc.page_count):
                page = doc[pg]  # page with index pg
                pm = page.get_pixmap(matrix=trans, alpha=False)  # rasterize the page
                # PNG is used here; check which formats are supported before switching to another one
                pm.save("%s%s.png" % (new_full_name, pg))
        else:
            page = doc[0]
            pm = page.get_pixmap(matrix=trans, alpha=False)
            pm.save("%s.png" % new_full_name)
        print("%s: done!" % filename)

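The zoom factors translate directly into output resolution: PDF page sizes are measured in points (1/72 inch), and at zoom 1.0 PyMuPDF renders one pixel per point, so a zoom of 2.0 corresponds to 144 dpi. The following sketch does that arithmetic; `rendered_size` and `effective_dpi` are hypothetical helpers, and the pixel size is approximate because the library applies its own rounding.

```python
def rendered_size(width_pt, height_pt, zoom_x=1.0, zoom_y=1.0):
    """Approximate pixel size of a page rendered with the given zoom factors."""
    return round(width_pt * zoom_x), round(height_pt * zoom_y)


def effective_dpi(zoom, base_dpi=72):
    """Resolution that a zoom factor corresponds to, starting from 72 dpi."""
    return zoom * base_dpi


# An A4 page is roughly 595 x 842 points.
print(rendered_size(595, 842, 2.0, 2.0))  # (1190, 1684)
print(effective_dpi(2.0))                 # 144.0
```

Doubling the zoom quadruples the pixel count, so memory use and file size grow quickly with larger factors.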
import fitz  # PyMuPDF
from PIL import Image
import pandas as pd

# request_bytes is assumed to hold the raw bytes of a PDF; it is not defined in this snippet
doc = fitz.open(stream=request_bytes, filetype="pdf")
txtblocks = {"text": [], "bbox_x0": [], "bbox_y0": [], "bbox_x1": [], "bbox_y1": []}
page = doc[0]
page_num = doc.page_count
pix = page.get_pixmap(alpha=False)
img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
d = page.get_text("dict")
blocks = d["blocks"]
for block in blocks:
  if 0 == block["type"]:
    for i in range(len(block["lines"])):
      txtblocks["text"].append(block["lines"][i]["spans"][0]["text"])
      bbox = block["lines"][i]["spans"][0]["bbox"]
      txtblocks["bbox_x0"].append(bbox[0])
      txtblocks["bbox_y0"].append(bbox[1])
      txtblocks["bbox_x1"].append(bbox[2])
      txtblocks["bbox_y1"].append(bbox[3])
pd_data = pd.DataFrame(txtblocks)      
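The loop above records only the first span of each line. A line can contain several spans (for example when the font changes mid-line), so a small generator that walks every level of the nesting may be more robust. This sketch runs on a hypothetical dictionary that mimics the blocks -> lines -> spans layout; the sample values and the name `iter_spans` are assumptions.

```python
# Hypothetical sample mimicking the nested "dict" layout used above.
page_dict = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [
                {"text": "Invoice ", "bbox": (10, 10, 60, 22)},
                {"text": "No. 42", "bbox": (60, 10, 100, 22)},
            ]},
        ]},
        {"type": 1},  # image block: it has no "lines" key
    ]
}


def iter_spans(d):
    """Yield (text, bbox) for every span of every text block."""
    for block in d["blocks"]:
        if block["type"] != 0:  # skip non-text blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                yield span["text"], tuple(span["bbox"])


rows = list(iter_spans(page_dict))
print(rows)
```

The resulting list of tuples can be passed to a DataFrame constructor in the same way as the dictionary of lists built above.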

3. Some examples

3.1 Plain pdf2txt

Install the dependencies: pip install pdfplumber and pip install pdfminer

This produces plain text only; the layout cannot be preserved, so the results are mediocre.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
import io


class PDFUtils():
    def __init__(self):
        pass
    def pdf2txt(self, path):
        output = io.StringIO()
        with open(path, 'rb') as f:        # open in binary read mode
            parser = PDFParser(f)          # parser that reads from the file
            doc = PDFDocument(parser)      # document object linked to the parser

            if not doc.is_extractable:
                raise PDFTextExtractionNotAllowed

            pdfrm = PDFResourceManager()    # resource manager for shared resources
            laparams = LAParams()           # layout-analysis parameters

            device = PDFPageAggregator(pdfrm, laparams=laparams)  # device that aggregates each page's layout

            interpreter = PDFPageInterpreter(pdfrm, device)  # PDF page interpreter

            for page in PDFPage.create_pages(doc):
                interpreter.process_page(page)
                layout = device.get_result()
                for x in layout:
                    if hasattr(x, "get_text"):
                        content = x.get_text()
                        output.write(content)

        content = output.getvalue()
        output.close()
        return content


if __name__ == '__main__':
    path = './t3.pdf'
    pdf_utils = PDFUtils()
    print (pdf_utils.pdf2txt(path))      

3.2 Extracting PDF text and signature information

from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf  # process_pdf comes from the pdfminer3k distribution
from endesive import pdf  # signature verification
 
def read_pdf(pdf):
    # resource manager
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # device
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # split the text into lines
    lines = str(content).split("\n")
    return lines
 
 
 
if __name__ == '__main__':
    with open(r'D:\data\pdf\111.pdf', "rb") as my_pdf:
        print(read_pdf(my_pdf))     # text content
        my_pdf.seek(0)              # rewind before reading the raw bytes
        data = my_pdf.read()
        print(pdf.verify(data))     # signature information
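One detail in the script above deserves care: the same file object is handed to the text extractor and then read again for signature verification. A file object keeps a current position, so a second read returns nothing unless the position is moved back to the start. A minimal sketch with an in-memory stream (the byte content is a made-up placeholder):

```python
import io

stream = io.BytesIO(b"%PDF-1.4 placeholder bytes")

first = stream.read()    # reads everything and leaves the position at the end
second = stream.read()   # nothing left to read
stream.seek(0)           # rewind to the beginning
third = stream.read()    # the full content again

print(len(first), len(second), len(third))
```

The same applies to files opened with open(..., "rb"): call seek(0) before reading the bytes that are passed on for verification.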

PDF file information

from PyPDF2 import PdfFileReader

def extract_information(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
    txt = f"""
    Information about {pdf_path}: 
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
    return information

if __name__ == '__main__':
    path = r'D:\data\pdf\111.pdf'
    extract_information(path)

PDF digital signatures (signing): https://github.com/m32/endesive

3.3 Example: e_invoice text extraction

Reposted from http://www.yooongchun.com/2019/12/18/invoiceextractor/; the code has been tested and works.

import os
import re

import pandas as pd
import pdfplumber as pb


class Extractor(object):
    def __init__(self, path):
        self.file = path if os.path.isfile(path) else None

    def _load_data(self):
        if self.file and os.path.splitext(self.file)[1] == '.pdf':
            pdf = pb.open(self.file)
            page = pdf.pages[0]
            words = page.extract_words(x_tolerance=5)
            lines = page.lines
            # convert coordinates
            for index, word in enumerate(words):
                words[index]['y0'] = word['top']
                words[index]['y1'] = word['bottom']
            for index, line in enumerate(lines):
                lines[index]['x1'] = line['x0']+line['width']
                lines[index]['y0'] = line['top']
                lines[index]['y1'] = line['bottom']
            return {'words': words, 'lines': lines}
        else:
            print("file %s can't be opened." % self.file)
            return None

    def _fill_line(self, lines):
        hlines = [line for line in lines if line['width'] > 0]  # keep horizontal lines
        hlines = sorted(hlines, key=lambda h: h['width'], reverse=True)[:-2]  # drop the two shortest
        vlines = [line for line in lines if line['height'] > 0]  # keep vertical lines
        vlines = sorted(vlines, key=lambda v: v['y0'])  # sort by y coordinate
        # find the corners of the outer border
        hx0 = hlines[0]['x0']  # left
        hx1 = hlines[0]['x1']  # right
        vy0 = vlines[0]['y0']  # top
        vy1 = vlines[-1]['y1']  # bottom

        thline = {'x0': hx0, 'y0': vy0, 'x1': hx1, 'y1': vy0}  # top horizontal line
        bhline = {'x0': hx0, 'y0': vy1, 'x1': hx1, 'y1': vy1}  # bottom horizontal line
        lvline = {'x0': hx0, 'y0': vy0, 'x1': hx0, 'y1': vy1}  # left vertical line
        rvline = {'x0': hx1, 'y0': vy0, 'x1': hx1, 'y1': vy1}  # right vertical line

        hlines.insert(0, thline)
        hlines.append(bhline)
        vlines.insert(0, lvline)
        vlines.append(rvline)
        return {'hlines': hlines, 'vlines': vlines}

    def _is_point_in_rect(self, point, rect):
        '''Check whether a point lies inside a rectangle.'''
        px, py = point
        p1, p2, p3, p4 = rect
        if p1[0] <= px <= p2[0] and p1[1] <= py <= p3[1]:
            return True
        else:
            return False

    def _find_cross_points(self, hlines, vlines):
        points = []
        delta = 1
        for vline in vlines:
            vx0 = vline['x0']
            vy0 = vline['y0']
            vy1 = vline['y1']
            for hline in hlines:
                hx0 = hline['x0']
                hy0 = hline['y0']
                hx1 = hline['x1']
                if (hx0-delta) <= vx0 <= (hx1+delta) and (vy0-delta) <= hy0 <= (vy1+delta):
                    points.append((int(vx0), int(hy0)))
        return points

    def _find_rects(self, cross_points):
        # build a grid of crossing points
        X = sorted(set([int(p[0]) for p in cross_points]))
        Y = sorted(set([int(p[1]) for p in cross_points]))
        df = pd.DataFrame(index=Y, columns=X)
        for p in cross_points:
            x, y = int(p[0]), int(p[1])
            df.loc[y, x] = 1
        df = df.fillna(0)
        # search for rectangles
        rects = []
        COLS = len(df.columns)-1
        ROWS = len(df.index)-1
        for row in range(ROWS):
            for col in range(COLS):
                p0 = df.iat[row, col]  # anchor point of a candidate rectangle
                cnt = col+1
                while cnt <= COLS:
                    p1 = df.iat[row, cnt]
                    p2 = df.iat[row+1, col]
                    p3 = df.iat[row+1, cnt]
                    if p0 and p1 and p2 and p3:
                        rects.append(((df.columns[col], df.index[row]), (df.columns[cnt], df.index[row]), (
                            df.columns[col], df.index[row+1]), (df.columns[cnt], df.index[row+1])))
                        break
                    else:
                        cnt += 1
        return rects

    def _put_words_into_rect(self, words, rects):
        # assign each word to the rectangle that contains it, based on coordinates
        groups = {}
        delta = 2
        for word in words:
            p = (int(word['x0']), int((word['y0']+word['y1'])/2))
            flag = False
            for r in rects:
                if self._is_point_in_rect(p, r):
                    flag = True
                    groups[('IN', r[0][1], r)] = groups.get(
                        ('IN', r[0][1], r), [])+[word]
                    break
            if not flag:
                y_range = [
                    p[1]+x for x in range(delta)]+[p[1]-x for x in range(delta)]
                out_ys = [k[1] for k in list(groups.keys()) if k[0] == 'OUT']
                flag = False
                for y in set(y_range):
                    if y in out_ys:
                        v = out_ys[out_ys.index(y)]
                        groups[('OUT', v)].append(word)
                        flag = True
                        break
                if not flag:
                    groups[('OUT', p[1])] = [word]
        return groups

    def _find_text_by_same_line(self, group, delta=1):
        words = {}
        group = sorted(group, key=lambda x: x['x0'])
        for w in group:
            bottom = int(w['bottom'])
            text = w['text']
            k1 = [bottom-i for i in range(delta)]
            k2 = [bottom+i for i in range(delta)]
            k = set(k1+k2)
            flag = False
            for kk in k:
                if kk in words:
                    words[kk] = words.get(kk, '')+text
                    flag = True
                    break
            if not flag:
                words[bottom] = words.get(bottom, '')+text
        return words

    def _split_words_into_diff_line(self, groups):
        groups2 = {}
        for k, g in groups.items():
            words = self._find_text_by_same_line(g, 3)
            groups2[k] = words
        return groups2

    def _index_of_y(self, x, rects):
        for index, r in enumerate(rects):
            if x == r[2][0][0]:
                return index+1 if index+1 < len(rects) else None
        return None

    def _find_outer(self, k, words):
        df = pd.DataFrame()
        for pos, text in words.items():
            if re.search(r'发票$', text):  # invoice title
                df.loc[0, '发票名称'] = text
            elif re.search(r'发票代码', text):  # invoice code
                num = ''.join(re.findall(r'[0-9]+', text))
                df.loc[0, '发票代码'] = num
            elif re.search(r'发票号码', text):  # invoice number
                num = ''.join(re.findall(r'[0-9]+', text))
                df.loc[0, '发票号码'] = num
            elif re.search(r'开票日期', text):  # issue date
                date = ''.join(re.findall(
                    r'[0-9]{4}年[0-9]{1,2}月[0-9]{1,2}日', text))
                df.loc[0, '开票日期'] = date
            elif '机器编号' in text and '校验码' in text:  # machine number and check code
                text1 = re.search(r'校验码:\d+', text)[0]
                num = ''.join(re.findall(r'[0-9]+', text1))
                df.loc[0, '校验码'] = num
                text2 = re.search(r'机器编号:\d+', text)[0]
                num = ''.join(re.findall(r'[0-9]+', text2))
                df.loc[0, '机器编号'] = num
            elif '机器编号' in text:
                num = ''.join(re.findall(r'[0-9]+', text))
                df.loc[0, '机器编号'] = num
            elif '校验码' in text:
                num = ''.join(re.findall(r'[0-9]+', text))
                df.loc[0, '校验码'] = num
            elif re.search(r'收款人', text):
                items = re.split(r'收款人:|复核:|开票人:|销售方:', text)
                items = [item for item in items if re.sub(
                    r'\s+', '', item) != '']
                df.loc[0, '收款人'] = items[0] if items and len(items) > 0 else ''
                df.loc[0, '复核'] = items[1] if items and len(items) > 1 else ''
                df.loc[0, '开票人'] = items[2] if items and len(items) > 2 else ''
                df.loc[0, '销售方'] = items[3] if items and len(items) > 3 else ''
        return df

    def _find_and_sort_rect_in_same_line(self, y, groups):
        same_rects_k = [k for k, v in groups.items() if k[1] == y]
        return sorted(same_rects_k, key=lambda x: x[2][0][0])

    def _find_inner(self, k, words, groups, groups2, free_zone_flag=False):
        df = pd.DataFrame()
        sort_words = sorted(words.items(), key=lambda x: x[0])
        text = [word for k, word in sort_words]
        context = ''.join(text)
        if '购买方' in context or '销售方' in context:
            y = k[1]
            x = k[2][0][0]
            same_rects_k = self._find_and_sort_rect_in_same_line(y, groups)
            target_index = self._index_of_y(x, same_rects_k)
            target_k = same_rects_k[target_index]
            group_context = groups2[target_k]
            prefix = '购买方' if '购买方' in context else '销售方'
            for pos, text in group_context.items():
                if '名称' in text:
                    name = re.sub(r'名称:', '', text)
                    df.loc[0, prefix+'名称'] = name
                elif '纳税人识别号' in text:
                    tax_man_id = re.sub(r'纳税人识别号:', '', text)
                    df.loc[0, prefix+'纳税人识别号'] = tax_man_id
                elif '地址、电话' in text:
                    addr = re.sub(r'地址、电话:', '', text)
                    df.loc[0, prefix+'地址电话'] = addr
                elif '开户行及账号' in text:
                    account = re.sub(r'开户行及账号:', '', text)
                    df.loc[0, prefix+'开户行及账号'] = account
        elif '密码区' in context:
            y = k[1]
            x = k[2][0][0]
            same_rects_k = self._find_and_sort_rect_in_same_line(y, groups)
            target_index = self._index_of_y(x, same_rects_k)
            target_k = same_rects_k[target_index]
            words = groups2[target_k]
            context = [v for k, v in words.items()]
            context = ''.join(context)
            df.loc[0, '密码区'] = context
        elif '价税合计' in context:
            y = k[1]
            x = k[2][0][0]
            same_rects_k = self._find_and_sort_rect_in_same_line(y, groups)
            target_index = self._index_of_y(x, same_rects_k)
            target_k = same_rects_k[target_index]
            group_words = groups2[target_k]
            group_context = ''.join([w for k, w in group_words.items()])
            items = re.split(r'[((]小写[))]', group_context)
            b = items[0] if items and len(items) > 0 else ''
            s = items[1] if items and len(items) > 1 else ''
            df.loc[0, '价税合计(大写)'] = b
            df.loc[0, '价税合计(小写)'] = s
        elif '备注' in context:
            y = k[1]
            x = k[2][0][0]
            same_rects_k = self._find_and_sort_rect_in_same_line(y, groups)
            target_index = self._index_of_y(x, same_rects_k)
            if target_index:
                target_k = same_rects_k[target_index]
                group_words = groups2[target_k]
                group_context = ''.join([w for k, w in group_words.items()])
                df.loc[0, '备注'] = group_context
            else:
                df.loc[0, '备注'] = ''
        else:
            if free_zone_flag:
                return df, free_zone_flag
            y = k[1]
            x = k[2][0][0]
            same_rects_k = self._find_and_sort_rect_in_same_line(y, groups)
            if len(same_rects_k) == 8:
                free_zone_flag = True
                for kk in same_rects_k:
                    yy = kk[1]
                    xx = kk[2][0][0]
                    words = groups2[kk]
                    words = sorted(words.items(), key=lambda x: x[0]) if words and len(
                        words) > 0 else None
                    key = words[0][1] if words and len(words) > 0 else None
                    val = [word[1] for word in words[1:]
                           ] if key and words and len(words) > 1 else ''
                    val = '\n'.join(val) if val else ''
                    if key:
                        df.loc[0, key] = val
        return df, free_zone_flag

    def extract(self):
        data = self._load_data()
        words = data['words']
        lines = data['lines']

        lines = self._fill_line(lines)
        hlines = lines['hlines']
        vlines = lines['vlines']

        cross_points = self._find_cross_points(hlines, vlines)
        rects = self._find_rects(cross_points)

        word_groups = self._put_words_into_rect(words, rects)
        word_groups2 = self._split_words_into_diff_line(word_groups)

        df = pd.DataFrame()
        free_zone_flag = False
        for k, words in word_groups2.items():
            if k[0] == 'OUT':
                df_item = self._find_outer(k, words)
            else:
                df_item, free_zone_flag = self._find_inner(
                    k, words, word_groups, word_groups2, free_zone_flag)
            df = pd.concat([df, df_item], axis=1)
        return df

if __name__=="__main__":
    path=r't3.pdf'
    data = Extractor(path).extract()
    print(data)      
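The field-matching step of the extractor can be tried in isolation, without a PDF. The sketch below applies the same regular expressions to a few hypothetical lines of text; the sample values are invented and `parse_outer_line` is an illustrative helper, not part of the reposted code.

```python
import re


def parse_outer_line(text):
    """Pull invoice header fields out of one line of text."""
    fields = {}
    if "发票代码" in text:  # invoice code
        fields["发票代码"] = "".join(re.findall(r"[0-9]+", text))
    elif "发票号码" in text:  # invoice number
        fields["发票号码"] = "".join(re.findall(r"[0-9]+", text))
    elif "开票日期" in text:  # issue date
        match = re.search(r"[0-9]{4}年[0-9]{1,2}月[0-9]{1,2}日", text)
        fields["开票日期"] = match.group(0) if match else ""
    return fields


# Hypothetical lines, not taken from a real invoice.
sample_lines = [
    "发票代码:012345678901",
    "发票号码:87654321",
    "开票日期:2019年12月18日",
]
result = {}
for line in sample_lines:
    result.update(parse_outer_line(line))
print(result)
```

Testing the patterns this way makes it easier to adapt them when an invoice layout uses different labels or punctuation.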


4. pdf2word

GitHub repository: https://github.com/dothinking/pdf2docx
Installation: pip install pdf2docx -i https://pypi.tuna.tsinghua.edu.cn/simple
Documentation: https://dothinking.github.io/pdf2docx/quickstart.html

pdf2docx

from pdf2docx import Converter

pdf_file = '/path/to/sample.pdf'
docx_file = 'path/to/sample.docx'

# convert pdf to docx
cv = Converter(pdf_file)
cv.convert(docx_file, start=0, end=None)
cv.close()

# Alternatively, use the parse() helper
from pdf2docx import parse

pdf_file = '/path/to/sample.pdf'
docx_file = 'path/to/sample.docx'

# convert pdf to docx
parse(pdf_file, docx_file, start=0, end=None)      

Extracting tables

from pdf2docx import Converter

pdf_file = '/path/to/sample.pdf'

cv = Converter(pdf_file)
tables = cv.extract_tables(start=0, end=1)
cv.close()

for table in tables:
    print(table)      

Command-line interface

$ pdf2docx --help

NAME
    pdf2docx - Command line interface for pdf2docx.

SYNOPSIS
    pdf2docx COMMAND | -

DESCRIPTION
    Command line interface for pdf2docx.

COMMANDS
    COMMAND is one of the following:

    convert
      Convert pdf file to docx file.

    debug
      Convert one PDF page and plot layout information for      

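By page range

Specify a page range with --start (the first page to convert; defaults to the first page if omitted) and --end (where to stop; defaults to the last page if omitted).

Page indexes are zero-based by default; this can be turned off with --zero_based_index=False, in which case the first page has index 1.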

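Convert all pages:

$ pdf2docx convert test.pdf test.docx

Convert from the second page to the last:

$ pdf2docx convert test.pdf test.docx --start=1

Convert from the first page to the third (index = 2):

$ pdf2docx convert test.pdf test.docx --end=3

Convert the second and third pages:

$ pdf2docx convert test.pdf test.docx --start=1 --end=3

Convert the first and second pages with zero-based indexing turned off: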

$ pdf2docx convert test.pdf test.docx --start=1 --end=3 --zero_based_index=False      
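The examples above are consistent with a half-open range: --start is included and --end is excluded, and with zero-based indexing turned off both values are shifted down by one. The sketch below models that reading; `selected_pages` is a hypothetical helper inferred from the examples, not pdf2docx's own code.

```python
def selected_pages(page_count, start=0, end=None, zero_based_index=True):
    """Return the zero-based page indexes a start/end pair would select."""
    if not zero_based_index:
        # One-based input: shift both bounds to zero-based values.
        start = max(start - 1, 0)
        if end is not None:
            end -= 1
    return list(range(page_count))[start:end]


# A hypothetical five-page document.
print(selected_pages(5))                                       # [0, 1, 2, 3, 4]
print(selected_pages(5, start=1))                              # [1, 2, 3, 4]
print(selected_pages(5, end=3))                                # [0, 1, 2]
print(selected_pages(5, start=1, end=3))                       # [1, 2]
print(selected_pages(5, start=1, end=3, zero_based_index=False))  # [0, 1]
```

Each printed list matches the description of the corresponding command: all pages, second to last, first to third, second and third, and first and second.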

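By page number

Convert the first, third, and fifth pages:

$ pdf2docx convert test.pdf test.docx --pages=0,2,4

Enable multi-processing with the default CPU count:

$ pdf2docx convert test.pdf test.docx --multi_processing=True

Specify the number of CPUs:

$ pdf2docx convert test.pdf test.docx --multi_processing=True --cpu_count=4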