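Table of Contents
- 1. pdfplumber
- pdfplumber.Page table-extraction methods
- 2. PyMuPDF image-related modules
- 3. Some examples
- 3.1 Plain pdf2txt
- 3.2 Extracting PDF text and signature information
- 3.3 Example: e_invoice text extraction
- 4. pdf2word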
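1. pdfplumber
pdfplumber is a PDF parsing library built on pdfminer and written entirely in Python. It exposes detailed information about every character, rectangle, and line, and it can also extract text and tables. At present pdfplumber only works on PDFs that carry a text layer (machine-generated files), not on scanned images. [GitHub repository]
Compared with pdfminer:
1. Both expose detailed information about each character, rectangle, line, and similar object, but pdfplumber wraps and post-processes pdfminer's output, so the resulting objects are easier to work with.
2. Both can extract text, but the layout of pdfminer's output may differ considerably from the original, whereas text extracted by pdfplumber tends to stay closer to it.
pdfplumber also implements table extraction: it locates and recognizes tables in a PDF from the positions of basic objects such as characters and lines.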
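- The pdfplumber.PDF class has two main properties:
- .metadata: a dictionary of the document's metadata key/value pairs
- .pages: a list with one pdfplumber.Page instance per page
- The pdfplumber.Page class has the following main properties:
- .page_number: the page number
- .width: the page width
- .height: the page height
- .objects / .chars / .lines / .rects: the objects found on the page
.chars, .lines, and .rects are each a list of dictionaries, one per object (.objects groups these lists by object type). Each dictionary describes one object on the page, such as a line, a character, or a rectangle, including its position.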
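To make the shape of these dictionaries concrete, here is a minimal sketch that selects character dictionaries by position, using only the x0, top, x1, and bottom keys. The sample data and the helper name inside_bbox are hypothetical and made up for illustration; real pdfplumber dictionaries carry many more keys.

```python
def inside_bbox(obj, bbox):
    """True if the object's box lies entirely inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (
        obj["x0"] >= x0
        and obj["top"] >= top
        and obj["x1"] <= x1
        and obj["bottom"] <= bottom
    )


# Hypothetical character dictionaries, shaped like entries of page.chars.
chars = [
    {"text": "A", "x0": 10.0, "top": 20.0, "x1": 16.0, "bottom": 30.0},
    {"text": "B", "x0": 18.0, "top": 20.0, "x1": 24.0, "bottom": 30.0},
    {"text": "C", "x0": 90.0, "top": 200.0, "x1": 96.0, "bottom": 210.0},
]

header_area = (0, 0, 50, 50)  # assumed region of interest near the page top
selected = [c["text"] for c in chars if inside_bbox(c, header_area)]
print(selected)
```

The same kind of test works on entries of .lines or .rects, since they also carry these four position keys.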
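Some commonly used methods:
- .extract_text(): extracts the page's text by collating all of its character objects into a single string
- .extract_words(): returns all words on the page together with their position information
- .extract_tables(): extracts the tables on the page
- .to_image(): returns an instance of the PageImage class, for visual debugging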
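| Method | Description |
| .crop(bounding_box) | Returns a version of the page cropped to the bounding box, expressed as a 4-tuple (x0, top, x1, bottom). The cropped page keeps objects that fall at least partly inside the box; an object that falls only partly inside is sliced to fit the box. |
| .within_bbox(bounding_box) | Similar to .crop, but keeps only objects that fall entirely inside the bounding box. |
| .filter(test_function) | Returns a version of the page containing only the objects for which test_function(obj) returns True. |
| .extract_text(x_tolerance, y_tolerance) | Collates all of the page's character objects into a single string. A space is added where the gap between one character's x1 and the next character's x0 is greater than x_tolerance. A newline is added where the difference between one character's doctop and the next character's doctop is greater than y_tolerance. |
| .extract_words(x_tolerance, y_tolerance, horizontal_ltr, vertical_ttb) | Returns a list of all word-like objects together with their bounding boxes. A word is a sequence of characters in which (for upright characters) the gap between one character's x1 and the next character's x0 is at most x_tolerance, and the difference between one character's doctop and the next character's doctop is at most y_tolerance. Non-upright characters are handled similarly, except that the vertical rather than horizontal distance between them is measured. The horizontal_ltr and vertical_ttb parameters indicate whether words should be read left to right (horizontal words) and top to bottom (vertical words). |
| .extract_tables(table_settings) | Extracts tabular data from the page. See "Extracting tables" below for details. |
| .to_image(**conversion_kwargs) | Returns an instance of the PageImage class. See "Visual debugging" in the pdfplumber documentation for details, including the accepted conversion_kwargs. |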
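The tolerance rule in the table is easier to follow with a small example. The sketch below groups the characters of a single line into words whenever the horizontal gap stays within x_tolerance. It is not pdfplumber's implementation: the function name group_chars_into_words, the default tolerance of 3, and the sample characters are all assumptions made for illustration.

```python
def group_chars_into_words(chars, x_tolerance=3):
    """Join characters left to right; start a new word when the gap exceeds x_tolerance."""
    groups = []
    current = []
    for ch in sorted(chars, key=lambda c: c["x0"]):
        # Gap between the previous character's right edge and this one's left edge.
        if current and ch["x0"] - current[-1]["x1"] > x_tolerance:
            groups.append(current)
            current = []
        current.append(ch)
    if current:
        groups.append(current)
    return [
        {"text": "".join(c["text"] for c in g), "x0": g[0]["x0"], "x1": g[-1]["x1"]}
        for g in groups
    ]


# Hypothetical characters on one text line (only the keys the sketch needs).
line = [
    {"text": "P", "x0": 0, "x1": 5},
    {"text": "D", "x0": 5.5, "x1": 10},
    {"text": "F", "x0": 10.5, "x1": 15},
    {"text": "o", "x0": 25, "x1": 30},
    {"text": "k", "x0": 30.5, "x1": 35},
]
words = group_chars_into_words(line)
print([w["text"] for w in words])
```

Raising x_tolerance merges words that sit further apart, which is why the invoice example later in this article passes x_tolerance=5 to extract_words().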
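pdfplumber.Page table-extraction methods
Extracting tables: example
| Method | Description |
| .find_tables(table_settings) | Returns a list of Table objects. Each Table object gives access to the .cells, .rows, and .bbox properties, as well as the .extract() method. |
| .extract_tables(table_settings) | Returns the text extracted from all tables on the page, as a list of lists of lists with the structure table -> row -> cell. |
| .extract_table(table_settings) | Returns the text extracted from the largest table on the page, as a list of lists with the structure row -> cell. (If several tables have the same size, measured by number of cells, the one closest to the top of the page is returned.) |
| .debug_tablefinder(table_settings) | Returns an instance of the TableFinder class, with access to the .edges, .intersections, .cells, and .tables properties. |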
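Since an extracted table is just a list of rows, each a list of cells, a little post-processing usually follows. Here is a minimal sketch that turns such a row -> cell structure into a list of dictionaries keyed by the header row. The helper name table_to_records and the sample table are hypothetical; the sketch also assumes that empty cells may come back as None.

```python
def table_to_records(table):
    """Turn a row -> cell table into a list of dicts keyed by the header row."""
    header, *rows = table
    keys = [(cell or "").strip() for cell in header]
    return [
        {key: (cell or "").strip() for key, cell in zip(keys, row)}
        for row in rows
    ]


# Hypothetical result of page.extract_table(): a list of rows, each a list of cells.
table = [
    ["Item", "Qty", "Price"],
    ["Paper", "2", "3.50"],
    ["Ink", None, "12.00"],
]
records = table_to_records(table)
for record in records:
    print(record)
```

A list of dictionaries like this can be passed straight to pandas.DataFrame if a tabular view is needed.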
from io import BytesIO

import pdfplumber
from PIL import Image


class PDF_Text_Image:
    def pdf_extract_word(self, pdf_path):
        """Print document metadata, then read words, tables and images page by page."""
        with pdfplumber.open(pdf_path) as pdf:
            metadata_info = pdf.metadata
            pages_info = pdf.pages
            print("Create_modDate", metadata_info)
            print("Total_pages", len(pages_info))
            for page in pages_info:
                pg_width = page.width
                pg_height = page.height
                pg_num = page.page_number
                pg_words = page.extract_words()
                raw_data = {"x": [], "y": [], "top": [], "bottom": [], "text": []}
                for word in pg_words:
                    x0 = word["x0"]
                    x1 = word["x1"]
                    top = word["top"]
                    bottom = word["bottom"]
                    text = word["text"]
                pg_tables = page.find_tables()
                pg_images = page.images
                try:
                    save_name = "./aaaa."
                    # Only works when the first image stream is in a format
                    # Pillow can open directly (for example JPEG).
                    img = Image.open(BytesIO(pg_images[0]["stream"].rawdata))
                    img.save(save_name + img.format, quality=95)
                except Exception as exc:
                    print("image error!", exc)

    def extract_jpg_from_pdf(self, path):
        """Scan the raw PDF bytes for embedded JPEG streams and save each one."""
        with open(path, "rb") as f:
            pdf = f.read()
        start_mark = b"\xff\xd8"  # JPEG start-of-image marker
        start_fix = 0
        end_mark = b"\xff\xd9"  # JPEG end-of-image marker
        end_fix = 2
        i = 0
        n_jpg = 0
        while True:
            is_stream = pdf.find(b"stream", i)
            if is_stream < 0:
                break
            is_start = pdf.find(start_mark, is_stream, is_stream + 20)
            if is_start < 0:
                i = is_stream + 20
                continue
            is_end = pdf.find(b"endstream", is_start)
            if is_end < 0:
                raise Exception("Didn't find end of stream !")
            is_end = pdf.find(end_mark, is_end - 20)
            if is_end < 0:
                raise Exception("Didn't find end of JPG!")
            is_start += start_fix
            is_end += end_fix
            print("JPG %d from %d to %d" % (n_jpg, is_start, is_end))
            jpg = pdf[is_start:is_end]
            print("Extracting image " + "pic_%d.jpg" % n_jpg)
            with open("./pic_%d.jpg" % n_jpg, "wb") as f:
                f.write(jpg)
            n_jpg += 1
            i = is_end
Extracting images from a PDF without third-party packages
def extract_jpg_from_pdf(path):
    with open(path, "rb") as f:
        pdf = f.read()
    start_mark = b"\xff\xd8"  # JPEG start-of-image marker
    start_fix = 0
    end_mark = b"\xff\xd9"  # JPEG end-of-image marker
    end_fix = 2
    i = 0
    n_jpg = 0
    while True:
        is_stream = pdf.find(b"stream", i)
        if is_stream < 0:
            break
        # A JPEG payload starts within a few bytes of the "stream" keyword.
        is_start = pdf.find(start_mark, is_stream, is_stream + 20)
        if is_start < 0:
            i = is_stream + 20
            continue
        is_end = pdf.find(b"endstream", is_start)
        if is_end < 0:
            raise Exception("Didn't find end of stream !")
        is_end = pdf.find(end_mark, is_end - 20)
        if is_end < 0:
            raise Exception("Didn't find end of JPG!")
        is_start += start_fix
        is_end += end_fix
        print("JPG %d from %d to %d" % (n_jpg, is_start, is_end))
        jpg = pdf[is_start:is_end]
        print("Extracting image " + "pic_%d.jpg" % n_jpg)
        with open("pic_%d.jpg" % n_jpg, "wb") as jpg_file:
            jpg_file.write(jpg)
        # Move on to the next stream.
        n_jpg += 1
        i = is_end
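The marker-scanning idea can be tried without any real PDF. The sketch below applies the same search (find a "stream" keyword, look for the JPEG start marker right after it, then look for the end marker just before "endstream") to a hand-built byte string and returns byte offsets instead of writing files. The function name find_jpeg_spans and the fake payload are assumptions for illustration only. Scanning raw bytes like this can only recover images that are stored as plain JPEG streams.

```python
def find_jpeg_spans(data: bytes, window: int = 20):
    """Return (start, end) offsets of JPEG payloads found inside PDF streams."""
    spans = []
    pos = 0
    while True:
        stream_at = data.find(b"stream", pos)
        if stream_at < 0:
            break
        # The start-of-image marker must appear shortly after "stream".
        start = data.find(b"\xff\xd8", stream_at, stream_at + window)
        if start < 0:
            pos = stream_at + window
            continue
        end_stream = data.find(b"endstream", start)
        if end_stream < 0:
            break
        # The end-of-image marker is expected just before "endstream".
        end = data.find(b"\xff\xd9", max(start, end_stream - window))
        if end < 0:
            break
        end += 2  # include the two marker bytes
        spans.append((start, end))
        pos = end
    return spans


# A fake PDF object holding a fake JPEG payload (not a valid image).
fake_jpeg = b"\xff\xd8" + b"JFIF-payload-bytes-0123456789" + b"\xff\xd9"
sample = (
    b"1 0 obj\n<< /Filter /DCTDecode >>\nstream\n"
    + fake_jpeg
    + b"\nendstream\nendobj\n"
)
spans = find_jpeg_spans(sample)
print(spans)
```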
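2. PyMuPDF image-related modules
Installation
pip install PyMuPDF
PyPI page: https://pypi.org/project/PyMuPDF/. Official documentation: [link]. Image-related modules: [link]. This library seems more focused on PDF-to-image work.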
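pdf2image
import fitz  # PyMuPDF

doc = fitz.open(pdf_path2)  # pdf_path2: path to the input PDF, defined elsewhere
for page in doc:
    # ============ Method 1 ============
    pix = page.get_pixmap(alpha=False)  # render the page directly at the default scale
    # ============ Method 2 ============
    zoom_x = 2.0
    zoom_y = 2.0
    mat = fitz.Matrix(zoom_x, zoom_y)
    pix = page.get_pixmap(matrix=mat)  # render at 2x zoom in both directions
    # Older PyMuPDF releases spell these calls getPixmap() and writePNG().
    pix.save("../../page-%i_1.png" % page.number)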
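It helps to know what the zoom factors mean in pixels. PDF page sizes are measured in points (1/72 inch), so a zoom of 1 corresponds to 72 dpi and a zoom of 2 to 144 dpi. The sketch below is plain arithmetic and does not call PyMuPDF; the helper names and the rounded A4 size of 595 x 842 points are assumptions for illustration, and the real pixmap size may differ by a pixel because of rounding.

```python
def rendered_size(width_pt, height_pt, zoom_x=1.0, zoom_y=1.0):
    """Approximate pixel size of a page rendered with the given zoom factors."""
    return round(width_pt * zoom_x), round(height_pt * zoom_y)


def zoom_for_dpi(dpi):
    """Zoom factor for a target resolution, given 72 points per inch."""
    return dpi / 72.0


size_2x = rendered_size(595, 842, 2.0, 2.0)  # roughly an A4 page at 2x zoom
zoom_300 = zoom_for_dpi(300)                 # zoom needed for about 300 dpi
print(size_2x, round(zoom_300, 3))
```

A value computed this way can be passed as both arguments of fitz.Matrix when a specific resolution is wanted.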
import os

import fitz  # PyMuPDF is imported under the name fitz

if __name__ == '__main__':
    base_path = input("Enter the folder containing the PDFs to convert: ")
    filenames = os.listdir(base_path)  # list the files in that folder
    for filename in filenames:
        full_path = os.path.join(base_path, filename)  # build the full path of the PDF
        doc = fitz.open(full_path)  # open the PDF; doc is a Document that can be indexed by page
        rotate = 0  # rotation angle of the output image
        zoom_x = 2.0  # scale factor along the X axis relative to the PDF
        zoom_y = 2.0  # scale factor along the Y axis relative to the PDF
        trans = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
        print("%s: conversion started..." % filename)
        new_full_name = os.path.splitext(full_path)[0]  # keep the original base name
        if doc.page_count > 1:  # number of pages in the PDF
            for pg in range(doc.page_count):
                page = doc[pg]  # get page number pg
                pm = page.get_pixmap(matrix=trans, alpha=False)  # rasterize the page
                # PNG is the format used throughout the PyMuPDF documentation; check
                # that a format is supported before writing any other image type.
                pm.save("%s%s.png" % (new_full_name, pg))
        else:
            page = doc[0]
            pm = page.get_pixmap(matrix=trans, alpha=False)
            pm.save("%s.png" % new_full_name)
        print("%s: conversion finished!" % filename)
import fitz  # PyMuPDF
from PIL import Image
import pandas as pd

# request_bytes: the raw bytes of a PDF, for example received in a request (defined elsewhere)
doc = fitz.open(stream=request_bytes, filetype="pdf")
txtblocks = {"text": [], "bbox_x0": [], "bbox_y0": [], "bbox_x1": [], "bbox_y1": []}
page = doc[0]
page_num = doc.page_count
pix = page.get_pixmap(alpha=False)
img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
d = page.get_text("dict")
blocks = d["blocks"]
for block in blocks:
    if block["type"] == 0:  # 0 = text block
        for line in block["lines"]:
            first_span = line["spans"][0]  # only the first span of each line is kept
            txtblocks["text"].append(first_span["text"])
            bbox = first_span["bbox"]
            txtblocks["bbox_x0"].append(bbox[0])
            txtblocks["bbox_y0"].append(bbox[1])
            txtblocks["bbox_x1"].append(bbox[2])
            txtblocks["bbox_y1"].append(bbox[3])
pd_data = pd.DataFrame(txtblocks)
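The snippet above keeps only the first span of every line, so a line that switches font or style halfway loses the rest of its text. As a minimal sketch of an alternative, the function below walks the same blocks -> lines -> spans structure and joins all spans of a line. It works on a plain dictionary, so it runs without PyMuPDF; the function name lines_from_text_dict and the sample dictionary are hypothetical, trimmed down to the keys the sketch needs.

```python
def lines_from_text_dict(page_dict):
    """Return (text, bbox) per text line, joining every span in the line."""
    rows = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip non-text blocks such as images
            continue
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"])
            rows.append((text, tuple(line["bbox"])))
    return rows


# Hypothetical, trimmed-down stand-in for the result of page.get_text("dict").
sample = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "bbox": (10, 10, 120, 22),
                    "spans": [
                        {"text": "Invoice ", "bbox": (10, 10, 60, 22)},
                        {"text": "No. 42", "bbox": (60, 10, 120, 22)},
                    ],
                }
            ],
        },
        {"type": 1, "bbox": (0, 30, 100, 130)},
    ]
}
rows = lines_from_text_dict(sample)
print(rows)
```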
3. Some examples
3.1 Plain pdf2txt
Install the dependencies:
pip install pdfplumber
pip install pdfminer
This produces plain txt output only; the layout cannot be preserved, so the result is not great.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams


class PDFUtils():
    def __init__(self):
        pass

    def pdf2txt(self, path):
        output = io.StringIO()
        with open(path, 'rb') as f:  # open in binary read mode
            parser = PDFParser(f)  # parser that reads from the file
            doc = PDFDocument(parser)  # document object linked to the parser
            if not doc.is_extractable:
                raise PDFTextExtractionNotAllowed
            pdfrm = PDFResourceManager()  # resource manager for shared resources
            laparams = LAParams()  # layout-analysis parameters
            device = PDFPageAggregator(pdfrm, laparams=laparams)  # PDF device object
            interpreter = PDFPageInterpreter(pdfrm, device)  # PDF interpreter object
            for page in PDFPage.create_pages(doc):
                interpreter.process_page(page)
                layout = device.get_result()
                for x in layout:
                    if hasattr(x, "get_text"):  # keep only layout objects that carry text
                        content = x.get_text()
                        output.write(content)
        content = output.getvalue()
        output.close()
        return content


if __name__ == '__main__':
    path = './t3.pdf'
    pdf_utils = PDFUtils()
    print(pdf_utils.pdf2txt(path))
3.2 Extracting PDF text and signature information
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
# process_pdf ships with pdfminer3k and older pdfminer releases.
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from endesive import pdf  # signature verification


def read_pdf(pdf_file):
    # resource manager
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # device
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf_file)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # split the text into lines
    lines = str(content).split("\n")
    return lines


if __name__ == '__main__':
    with open(r'D:\data\pdf\111.pdf', "rb") as my_pdf:
        print(read_pdf(my_pdf))  # text content
        my_pdf.seek(0)  # rewind: read_pdf() has already consumed the stream
        data = my_pdf.read()
    print(pdf.verify(data))  # signature information
PDF file information
from PyPDF2 import PdfFileReader  # API of PyPDF2 releases before 3.0


def extract_information(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
    txt = f"""
    Information about {pdf_path}:
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
    return information


if __name__ == '__main__':
    path = r'D:\data\pdf\111.pdf'
    extract_information(path)
PDF digital signatures: https://github.com/m32/endesive
3.3 Example: e_invoice text extraction
Reposted from http://www.yooongchun.com/2019/12/18/invoiceextractor/; the code has been tested and works.
import os
import re

import pandas as pd
import pdfplumber as pb


class Extractor(object):
    def __init__(self, path):
        self.file = path if os.path.isfile(path) else None

    def _load_data(self):
        if self.file and os.path.splitext(self.file)[1] == '.pdf':
            with pb.open(self.file) as pdf:
                page = pdf.pages[0]
                words = page.extract_words(x_tolerance=5)
                lines = page.lines
            # normalize coordinates: add y0/y1 (and x1 for lines) measured from the page top
            for index, word in enumerate(words):
                words[index]['y0'] = word['top']
                words[index]['y1'] = word['bottom']
            for index, line in enumerate(lines):
                lines[index]['x1'] = line['x0']+line['width']
                lines[index]['y0'] = line['top']
                lines[index]['y1'] = line['bottom']
            return {'words': words, 'lines': lines}
        else:
            print("file %s can't be opened." % self.file)
            return None
    def _fill_line(self, lines):
        hlines = [line for line in lines if line['width'] > 0]  # keep horizontal lines
        hlines = sorted(hlines, key=lambda h: h['width'], reverse=True)[:-2]  # drop the two shortest
        vlines = [line for line in lines if line['height'] > 0]  # keep vertical lines
        vlines = sorted(vlines, key=lambda v: v['y0'])  # sort by y coordinate
        # find the corners of the outer frame
        hx0 = hlines[0]['x0']  # left
        hx1 = hlines[0]['x1']  # right
        vy0 = vlines[0]['y0']  # top
        vy1 = vlines[-1]['y1']  # bottom
        thline = {'x0': hx0, 'y0': vy0, 'x1': hx1, 'y1': vy0}  # top horizontal line
        bhline = {'x0': hx0, 'y0': vy1, 'x1': hx1, 'y1': vy1}  # bottom horizontal line
        lvline = {'x0': hx0, 'y0': vy0, 'x1': hx0, 'y1': vy1}  # left vertical line
        rvline = {'x0': hx1, 'y0': vy0, 'x1': hx1, 'y1': vy1}  # right vertical line
        hlines.insert(0, thline)
        hlines.append(bhline)
        vlines.insert(0, lvline)
        vlines.append(rvline)
        return {'hlines': hlines, 'vlines': vlines}
    def _is_point_in_rect(self, point, rect):
        '''Return True if the point lies inside the rectangle.'''
        px, py = point
        p1, p2, p3, p4 = rect
        if p1[0] <= px <= p2[0] and p1[1] <= py <= p3[1]:
            return True
        else:
            return False

    def _find_cross_points(self, hlines, vlines):
        points = []
        delta = 1  # tolerance for lines that stop just short of each other
        for vline in vlines:
            vx0 = vline['x0']
            vy0 = vline['y0']
            vy1 = vline['y1']
            for hline in hlines:
                hx0 = hline['x0']
                hy0 = hline['y0']
                hx1 = hline['x1']
                if (hx0-delta) <= vx0 <= (hx1+delta) and (vy0-delta) <= hy0 <= (vy1+delta):
                    points.append((int(vx0), int(hy0)))
        return points
    def _find_rects(self, cross_points):
        # build a grid: rows are y values, columns are x values, 1 marks a crossing
        X = sorted(set([int(p[0]) for p in cross_points]))
        Y = sorted(set([int(p[1]) for p in cross_points]))
        df = pd.DataFrame(index=Y, columns=X)
        for p in cross_points:
            x, y = int(p[0]), int(p[1])
            df.loc[y, x] = 1
        df = df.fillna(0)
        # search for rectangles
        rects = []
        COLS = len(df.columns)-1
        ROWS = len(df.index)-1
        for row in range(ROWS):
            for col in range(COLS):
                p0 = df.iat[row, col]  # anchor point: must be a crossing for a rectangle to exist
                cnt = col+1
                while cnt <= COLS:
                    p1 = df.iat[row, cnt]
                    p2 = df.iat[row+1, col]
                    p3 = df.iat[row+1, cnt]
                    if p0 and p1 and p2 and p3:
                        rects.append(((df.columns[col], df.index[row]), (df.columns[cnt], df.index[row]), (
                            df.columns[col], df.index[row+1]), (df.columns[cnt], df.index[row+1])))
                        break
                    else:
                        cnt += 1
        return rects
    def _put_words_into_rect(self, words, rects):
        # place each word into the rectangle that contains it;
        # words outside every rectangle are grouped by their y position
        groups = {}
        delta = 2
        for word in words:
            p = (int(word['x0']), int((word['y0']+word['y1'])/2))
            flag = False
            for r in rects:
                if self._is_point_in_rect(p, r):
                    flag = True
                    groups[('IN', r[0][1], r)] = groups.get(
                        ('IN', r[0][1], r), [])+[word]
                    break
            if not flag:
                y_range = [
                    p[1]+x for x in range(delta)]+[p[1]-x for x in range(delta)]
                out_ys = [k[1] for k in list(groups.keys()) if k[0] == 'OUT']
                flag = False
                for y in set(y_range):
                    if y in out_ys:
                        v = out_ys[out_ys.index(y)]
                        groups[('OUT', v)].append(word)
                        flag = True
                        break
                if not flag:
                    groups[('OUT', p[1])] = [word]
        return groups
    def _find_text_by_same_line(self, group, delta=1):
        words = {}
        group = sorted(group, key=lambda x: x['x0'])
        for w in group:
            bottom = int(w['bottom'])
            text = w['text']
            k1 = [bottom-i for i in range(delta)]
            k2 = [bottom+i for i in range(delta)]
            k = set(k1+k2)
            flag = False
            for kk in k:
                if kk in words:
                    words[kk] = words.get(kk, '')+text
                    flag = True
                    break
            if not flag:
                words[bottom] = words.get(bottom, '')+text
        return words

    def _split_words_into_diff_line(self, groups):
        groups2 = {}
        for k, g in groups.items():
            words = self._find_text_by_same_line(g, 3)
            groups2[k] = words
        return groups2

    def _index_of_y(self, x, rects):
        for index, r in enumerate(rects):
            if x == r[2][0][0]:
                return index+1 if index+1 < len(rects) else None
        return None
    def _find_outer(self, k, words):
        df = pd.DataFrame()
        for pos, text in words.items():
            if re.search(r'发票$', text):  # invoice title
                df.loc[0, '发票名称'] = text
            elif re.search(r'发票代码', text):  # invoice code
                num = ''.join(re.findall(r'[0-9]+', text))
                df.loc[0, '发票代码'] = num
            elif re.search(r'发票号码', text):  # invoice number
                num = ''.join(re.findall(r'[0-9]+', text))
                df.loc[0, '发票号码'] = num
            elif re.search(r'开票日期', text):  # issue date
                date = ''.join(re.findall(
                    r'[0-9]{4}年[0-9]{1,2}月[0-9]{1,2}日', text))
                df.loc[0, '开票日期'] = date
            elif '机器编号' in text and '校验码' in text:  # machine number and check code on one line
                text1 = re.search(r'校验码:\d+', text)[0]
                num = ''.join(re.findall(r'[0-9]+', text1))
                df.loc[0, '校验码'] = num
                text2 = re.search(r'机器编号:\d+', text)[0]
                num = ''.join(re.findall(r'[0-9]+', text2))
                df.loc[0, '机器编号'] = num
            elif '机器编号' in text:  # machine number
                num = ''.join(re.findall(r'[0-9]+', text))
                df.loc[0, '机器编号'] = num
            elif '校验码' in text:  # check code
                num = ''.join(re.findall(r'[0-9]+', text))
                df.loc[0, '校验码'] = num
            elif re.search(r'收款人', text):  # payee, reviewer, issuer, seller
                items = re.split(r'收款人:|复核:|开票人:|销售方:', text)
                items = [item for item in items if re.sub(
                    r'\s+', '', item) != '']
                df.loc[0, '收款人'] = items[0] if items and len(items) > 0 else ''
                df.loc[0, '复核'] = items[1] if items and len(items) > 1 else ''
                df.loc[0, '开票人'] = items[2] if items and len(items) > 2 else ''
                df.loc[0, '销售方'] = items[3] if items and len(items) > 3 else ''
        return df

    def _find_and_sort_rect_in_same_line(self, y, groups):
        same_rects_k = [k for k, v in groups.items() if k[1] == y]
        return sorted(same_rects_k, key=lambda x: x[2][0][0])
    def _find_inner(self, k, words, groups, groups2, free_zone_flag=False):
        df = pd.DataFrame()
        sort_words = sorted(words.items(), key=lambda x: x[0])
        text = [word for k, word in sort_words]
        context = ''.join(text)
        if '购买方' in context or '销售方' in context:
            y = k[1]
            x = k[2][0][0]
            same_rects_k = self._find_and_sort_rect_in_same_line(y, groups)
            target_index = self._index_of_y(x, same_rects_k)
            target_k = same_rects_k[target_index]
            group_context = groups2[target_k]
            prefix = '购买方' if '购买方' in context else '销售方'
            for pos, text in group_context.items():
                if '名称' in text:
                    name = re.sub(r'名称:', '', text)
                    df.loc[0, prefix+'名称'] = name
                elif '纳税人识别号' in text:
                    tax_man_id = re.sub(r'纳税人识别号:', '', text)
                    df.loc[0, prefix+'纳税人识别号'] = tax_man_id
                elif '地址、电话' in text:
                    addr = re.sub(r'地址、电话:', '', text)
                    df.loc[0, prefix+'地址电话'] = addr
                elif '开户行及账号' in text:
                    account = re.sub(r'开户行及账号:', '', text)
                    df.loc[0, prefix+'开户行及账号'] = account
        elif '密码区' in context:
            y = k[1]
            x = k[2][0][0]
            same_rects_k = self._find_and_sort_rect_in_same_line(y, groups)
            target_index = self._index_of_y(x, same_rects_k)
            target_k = same_rects_k[target_index]
            words = groups2[target_k]
            context = [v for k, v in words.items()]
            context = ''.join(context)
            df.loc[0, '密码区'] = context
        elif '价税合计' in context:
            y = k[1]
            x = k[2][0][0]
            same_rects_k = self._find_and_sort_rect_in_same_line(y, groups)
            target_index = self._index_of_y(x, same_rects_k)
            target_k = same_rects_k[target_index]
            group_words = groups2[target_k]
            group_context = ''.join([w for k, w in group_words.items()])
            items = re.split(r'[((]小写[))]', group_context)
            b = items[0] if items and len(items) > 0 else ''
            s = items[1] if items and len(items) > 1 else ''
            df.loc[0, '价税合计(大写)'] = b
            df.loc[0, '价税合计(小写)'] = s
        elif '备注' in context:
            y = k[1]
            x = k[2][0][0]
            same_rects_k = self._find_and_sort_rect_in_same_line(y, groups)
            target_index = self._index_of_y(x, same_rects_k)
            if target_index:
                target_k = same_rects_k[target_index]
                group_words = groups2[target_k]
                group_context = ''.join([w for k, w in group_words.items()])
                df.loc[0, '备注'] = group_context
            else:
                df.loc[0, '备注'] = ''
        else:
            if free_zone_flag:
                return df, free_zone_flag
            y = k[1]
            x = k[2][0][0]
            same_rects_k = self._find_and_sort_rect_in_same_line(y, groups)
            if len(same_rects_k) == 8:
                free_zone_flag = True
                for kk in same_rects_k:
                    yy = kk[1]
                    xx = kk[2][0][0]
                    words = groups2[kk]
                    words = sorted(words.items(), key=lambda x: x[0]) if words and len(
                        words) > 0 else None
                    key = words[0][1] if words and len(words) > 0 else None
                    val = [word[1] for word in words[1:]
                           ] if key and words and len(words) > 1 else ''
                    val = '\n'.join(val) if val else ''
                    if key:
                        df.loc[0, key] = val
        return df, free_zone_flag
    def extract(self):
        data = self._load_data()
        if data is None:  # the file is missing or is not a PDF
            return pd.DataFrame()
        words = data['words']
        lines = data['lines']
        lines = self._fill_line(lines)
        hlines = lines['hlines']
        vlines = lines['vlines']
        cross_points = self._find_cross_points(hlines, vlines)
        rects = self._find_rects(cross_points)
        word_groups = self._put_words_into_rect(words, rects)
        word_groups2 = self._split_words_into_diff_line(word_groups)
        df = pd.DataFrame()
        free_zone_flag = False
        for k, words in word_groups2.items():
            if k[0] == 'OUT':
                df_item = self._find_outer(k, words)
            else:
                df_item, free_zone_flag = self._find_inner(
                    k, words, word_groups, word_groups2, free_zone_flag)
            df = pd.concat([df, df_item], axis=1)
        return df


if __name__ == "__main__":
    path = r't3.pdf'
    data = Extractor(path).extract()
    print(data)
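The core of the extractor is geometric: horizontal and vertical rules are intersected, and the crossing points are turned into cells. The following minimal sketch reproduces that idea on a synthetic grid so it can be run without an invoice file. It mirrors the crossing test used in _find_cross_points, but the cell-building step is a simplified assumption that every crossing of the grid exists, which the class above does not assume. The line dictionaries are made up for illustration.

```python
def find_cross_points(hlines, vlines, delta=1):
    """Return (x, y) for every vertical/horizontal pair that crosses within delta."""
    points = []
    for v in vlines:
        for h in hlines:
            crosses_x = h["x0"] - delta <= v["x0"] <= h["x1"] + delta
            crosses_y = v["y0"] - delta <= h["y0"] <= v["y1"] + delta
            if crosses_x and crosses_y:
                points.append((int(v["x0"]), int(h["y0"])))
    return points


# Synthetic grid: two horizontal rules crossed by three vertical rules.
hlines = [{"x0": 0, "x1": 100, "y0": y} for y in (0, 50)]
vlines = [{"x0": x, "y0": 0, "y1": 50} for x in (0, 40, 100)]

points = find_cross_points(hlines, vlines)

# Simplified cell construction, valid only for a complete grid.
xs = sorted({p[0] for p in points})
ys = sorted({p[1] for p in points})
cells = [
    (xs[i], ys[j], xs[i + 1], ys[j + 1])
    for j in range(len(ys) - 1)
    for i in range(len(xs) - 1)
]
print(cells)
```

Real invoices have merged cells, which is why the class checks each candidate rectangle corner by corner instead of assuming a complete grid.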
4. pdf2word
GitHub: https://github.com/dothinking/pdf2docx. Installation:
pip install pdf2docx -i https://pypi.tuna.tsinghua.edu.cn/simple
Documentation: https://dothinking.github.io/pdf2docx/quickstart.html
pdf2docx
from pdf2docx import Converter
pdf_file = '/path/to/sample.pdf'
docx_file = 'path/to/sample.docx'
# convert pdf to docx
cv = Converter(pdf_file)
cv.convert(docx_file, start=0, end=None)
cv.close()
from pdf2docx import parse
pdf_file = '/path/to/sample.pdf'
docx_file = 'path/to/sample.docx'
# convert pdf to docx
parse(pdf_file, docx_file, start=0, end=None)
Extracting tables
from pdf2docx import Converter

pdf_file = '/path/to/sample.pdf'
cv = Converter(pdf_file)
tables = cv.extract_tables(start=0, end=1)
cv.close()
for table in tables:
    print(table)
Command-line interface
$ pdf2docx --help
NAME
pdf2docx - Command line interface for pdf2docx.
SYNOPSIS
pdf2docx COMMAND | -
DESCRIPTION
Command line interface for pdf2docx.
COMMANDS
COMMAND is one of the following:
convert
Convert pdf file to docx file.
debug
Convert one PDF page and plot layout information for
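By page range
Specify the page range with --start (the first page to convert; defaults to the first page if omitted) and --end (where to stop; defaults to the last page if omitted). Page indexes are zero-based by default; pass --zero_based_index=False to turn this off, so that the first page has index 1.
Convert all pages:
$ pdf2docx convert test.pdf test.docx
Convert from the second page to the last:
$ pdf2docx convert test.pdf test.docx --start=1
Convert from the first page to the third (index = 2):
$ pdf2docx convert test.pdf test.docx --end=3
Convert the second and third pages:
$ pdf2docx convert test.pdf test.docx --start=1 --end=3
Convert the first and second pages with zero-based indexing turned off:
$ pdf2docx convert test.pdf test.docx --start=1 --end=3 --zero_based_index=False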
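The index rules are easy to get wrong by one, so here is a small sketch that reproduces the examples above in plain Python: --end is exclusive, and turning off zero-based indexing shifts both bounds down by one. This is not pdf2docx's own code; the function name selected_pages and the five-page document are assumptions used only to check the arithmetic.

```python
def selected_pages(total, start=0, end=None, zero_based_index=True):
    """Zero-based page indexes selected by start/end, with end exclusive."""
    if not zero_based_index:
        # One-based input: shift both bounds down by one.
        start = max(start - 1, 0)
        end = None if end is None else end - 1
    stop = total if end is None else min(end, total)
    return list(range(start, stop))


total_pages = 5  # hypothetical document length
print(selected_pages(total_pages))                      # all pages
print(selected_pages(total_pages, start=1))             # second page to the last
print(selected_pages(total_pages, end=3))               # first to third page
print(selected_pages(total_pages, start=1, end=3))      # second and third pages
print(selected_pages(total_pages, start=1, end=3, zero_based_index=False))
```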
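By page numbers
Convert the first, third, and fifth pages:
$ pdf2docx convert test.pdf test.docx --pages=0,2,4
Multi-processing
Enable multi-processing with the default CPU count:
$ pdf2docx convert test.pdf test.docx --multi_processing=True
Specify the number of CPUs to use:
$ pdf2docx convert test.pdf test.docx --multi_processing=True --cpu_count=4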