
Using Python to convert a scanned A3 booklet PDF (two pages per side, out of order) into an A4 PDF in reading order

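Purpose: use Python to convert a PDF scanned from a booklet print (A3 sheets, two pages per side, pages out of order) into an A4 PDF with the pages in reading order.

        Problem: the booklet was printed double-sided on A3 paper and saddle-stitched down the middle. The staples were removed and the loose sheets were scanned into a single PDF file in one pass on a copier.

        The result is awkward to read: the pages are out of order, and every A3 page holds two booklet pages side by side.

        This program converts the scanned A3 PDF into an A4 PDF with one booklet page per PDF page, in the correct order.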
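The page shuffle follows the usual saddle-stitch layout, and the script further down applies the same rule. As a minimal sketch, assuming the sides were scanned in order (outer sheet first, front before back): scanned side k out of N carries page k on one half and page 2N+1-k on the other, with the higher number on the left for odd k and on the right for even k. The function name `booklet_pages` is made up for illustration and is not part of the original program.

```python
def booklet_pages(side_no, total_sides):
    """Return (left_page, right_page) for one scanned A3 side.

    side_no is 1-based; total_sides is the number of A3 pages in the scan.
    Hypothetical helper for illustration only.
    """
    total_pages = 2 * total_sides
    # The page that shares this A3 side with page number side_no.
    partner = total_pages + 1 - side_no
    if side_no % 2 == 1:
        return partner, side_no   # odd side: higher page number on the left
    return side_no, partner       # even side: higher page number on the right
```

For a 10-side scan, `booklet_pages(1, 10)` gives `(20, 1)` and `booklet_pages(2, 10)` gives `(2, 19)`, so the first sheet holds pages 20, 1, 2 and 19.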

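Approach: PDF file ==> one PNG image per page (uses pdf2image, which calls pdftoppm.exe from poppler)

               ==> split each A3-sized image into two A4-sized images (uses Image from PIL)

               ==> combine the images into a PDF file (uses img2pdf.convert(pngList) from img2pdf)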
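The middle step comes down to computing two crop rectangles. A rough sketch, assuming each scanned image is one landscape A3 side that should be cut exactly down the middle; `half_boxes` is a hypothetical helper name, and the tuples use the (left, upper, right, lower) order that PIL's `Image.crop` expects:

```python
def half_boxes(width, height):
    """Return the crop boxes (left, upper, right, lower) for the left and right halves."""
    half = width // 2
    left_box = (0, 0, half, height)
    # If the width is odd, the last pixel column is dropped.
    right_box = (half, 0, 2 * half, height)
    return left_box, right_box
```

With Pillow installed, `Image.open(name).crop(box)` could then be applied to each box in turn to get the two A4 halves.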

import os,sys
import img2pdf
'''
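Purpose: use Python to convert a PDF scanned from a booklet print (A3 sheets, two pages per side, pages out of order) into an A4 PDF with the pages in reading order.
        Problem: the booklet was printed double-sided on A3 paper and saddle-stitched down the middle. The staples were removed and the loose sheets were scanned into a single PDF file in one pass on a copier.
        The result is awkward to read: the pages are out of order, and every A3 page holds two booklet pages side by side.
        This program converts the scanned A3 PDF into an A4 PDF with one booklet page per PDF page, in the correct order.
Approach: PDF file ==> one PNG image per page (uses pdf2image, which calls pdftoppm.exe from poppler)
               ==> split each A3-sized image into two A4-sized images (uses Image from PIL)
               ==> combine the images into a PDF file (uses img2pdf.convert(pngList) from img2pdf)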

Note: in testing, the A4 PDF built from PNG images came out comparatively small.

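Important notes:
(1) The program is installed in d:\leader on drive d: (or e:).
(2) Extract poppler-0.68.0_x86 and copy the files from its bin directory to d:\leader\bin;
    then add d:\leader\bin to the Windows PATH variable. (The batch file below takes care of this.)
(3) The batch file can look like this:
    rem main.cmd
    path d:\leader\bin;%path%
    d:
    cd \leader
    python main.prg %1

(4) To run it, either pass the directory: main d:\A3pdf-directory
    or run main with no argument, in which case the default A3 PDF directory d:\leader\pdf is used.
    To use it, just copy the A3 PDF files to be converted into d:\leader\pdf.
    Converted files are written to the d:\leader\pdf\A4 subdirectory.
    Intermediate image files are written to the d:\leader\pdf\PNG subdirectory and can be deleted afterwards.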

Version 0.1
    Version 01: uses global variables; all functions live in a single file.

Libraries used by the program:
pip install pillow
pip install PyPdf3
pip install pdf2image
pip install img2pdf

The program also uses poppler-0.68.0_x86.
pdf2image is only a wrapper; poppler does the actual conversion.


Author: 叶照清 [email protected]
Date: 2021.01.25

=============
Poppler for Windows
I have been using the Poppler library for some time, over a series of various projects. It’s an open source set of libraries and command line tools, very useful for dealing with PDF files. Poppler is targeted primarily for the Linux environment, but the developers have included Windows support as well in the source code. Getting the executables (exe) and/or dlls for the latest version however is very difficult on Windows. So after years of pain, I jumped on oDesk and contracted Ilya Kitaev, to both compile with Microsoft Visual Studio, and also prepare automated tools for easy compiling in the future. Update: MSVC isn’t very well supported, these days the download is based off MinGW.

So now, you can run the following utilities from Windows!

PDFToText – Extract all the text from PDF document. I suggest you use the -Layout option for getting the content in the right order.
PDFToHTML – Which I use with the -xml option to get an XML file listing all of the text segments’ text, position and size, very handy for processing in C#
PDFToCairo – For exporting to images types, including SVG!
Many more smaller utilities
Download

Latest binary : poppler-0.68.0_x86
http://blog.alivate.com.au/wp-content/uploads/2018/10/poppler-0.68.0_x86.7z
'''

from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
from pdf2image import convert_from_path
import os,sys,PyPDF3
from PIL import Image

def pdf2img(pdf_file):
    # Render every page of the A3 PDF to <basename>_<idx>.png in path_png (a global).
    basename=os.path.splitext(os.path.basename(pdf_file))[0]
    try:
        images = convert_from_path(pdf_file)
        for idx, img in enumerate(images):
            path=os.path.join(path_png, f'{basename}_{idx:02d}.png')
            img.save(path)
    except Exception as e:
        print(e)

def pic_half(filename1,No,MaxPage):
    # filename1: path of the A3 PNG without the '.png' extension, ending in '_NN'
    # No:        1-based number of the scanned A3 side
    # MaxPage:   total number of A4 pages in the booklet
    basename=os.path.basename(filename1)[:-3]   # drop the trailing '_NN'
    basefileName= path_png+'\\'+basename
    MaxPage+=1   # after this, MaxPage-No is the page that shares the A3 side with page No

    img = Image.open(filename1+'.png')
    # Width and height of each half after cutting the image down the middle
    width = img.size[0] // 2
    height = img.size[1]
    for i in range(2):   # i == 0: left half, i == 1: right half
        box = (width * i, 0, width * (i + 1), height)
        imgHalf = img.crop(box)
        # Odd sides carry the higher page number on the left,
        # even sides carry it on the right.
        if (No%2==1) == (i==0):
            fsave= basefileName+f'_A4_{(MaxPage-No):02d}.png'
        else:
            fsave= basefileName+f'_A4_{No:02d}.png'
        imgHalf.save(fsave)
    img.close()

def one_pdf(pdf_file1):
    # Convert one A3 PDF to PNGs, then split every page image into two A4 halves.
    # Tested with a 10-page A3 PDF, which yields 20 A4 pages.
    # maxP (the A3 page count) is a global set by the main loop before this call.
    pdf2img(pdf_file1)

    basename=os.path.basename(pdf_file1)[:-4]

    for i in range(1,maxP+1):
        A3_png=path_png+f'\\{basename}_{i-1:02d}'
        print(A3_png)
        No=i
        pic_half(A3_png,No,maxP*2)

def doImg2Pdf(fileName):
    # Assemble the A4 PNGs into one PDF in page order.
    # Relies on the globals pdf_b_name, A4_dir and maxP, and on the current
    # directory being the PNG directory (the main loop calls os.chdir first).
    bb=pdf_b_name[:-4]
    pngList = []
    for ii in range(1,maxP*2+1):
        pngName =f'{bb}_A4_{ii:02d}.png'
        print('\t'+pngName)
        pngList.append(pngName)
    with open(f"{A4_dir}\\{bb}_A4.pdf", "wb") as f:
        f.write(img2pdf.convert(pngList))
    print(f"{A4_dir}\\{bb}_A4.pdf conversion finished.\n")


################################       
root = os.path.abspath(os.path.dirname(__file__))
dd=''
if root.find(":") == 1:
    print(__file__)
    dd=root[:2]
path =dd+r'\LEADER\PDF'
path_a4 = path+r'\A4'
path_png = path+r'\PNG'
PDF_list=[]
maxP=10

if len(sys.argv)>1:
    path = sys.argv[1]
    
if os.path.isdir(path):
    path_a4 = path+'\\A4'
    path_png = path+'\\PNG'
    if not os.path.exists(path_a4) : os.makedirs(path_a4)
    if not os.path.exists(path_png): os.makedirs(path_png)                
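    Dir_l = os.listdir(path)
    for ff in Dir_l[:]:
        # Keep only files whose extension is .pdf (case-insensitive)
        if not ff.lower().endswith('.pdf'):
            Dir_l.remove(ff)
    print("Files to convert:")
    for i in range(len(Dir_l)):
        print(f'{i:04d}\t{Dir_l[i]}')
    PDF_list = Dir_l
    # Use len() here: the loop variable i is undefined when the list is empty
    print(f"\nNumber of files to convert: {len(PDF_list):04d}")
else:
    print( f'{path}: not a valid directory')
    exit(1)

if len(PDF_list) == 0:
    print("No PDF files found!")
    exit(0)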



#Main_loop
for pdf_file in PDF_list:
    full_fileName = path+'\\'+pdf_file
    print(f'Converting {full_fileName} ...')

    try:
        pdf_stream = open(full_fileName,'rb')
        pdf = PyPDF3.PdfFileReader(pdf_stream)
    except Exception:
        print(f"{full_fileName} is not a valid PDF file!")
        exit(1)

    maxP=pdf.numPages   # number of scanned A3 pages; read as a global by the functions above
    pdf_stream.close()
    del pdf
    
    one_pdf(full_fileName)

    pdf_file = full_fileName   #"e:\TEST\叶照清.pdf"
    png_dir = os.path.dirname(pdf_file)+'\\PNG'
    A4_dir  = os.path.dirname(pdf_file)+'\\A4'
    pdf_b_name = os.path.basename(pdf_file)

    os.chdir(png_dir)

    doImg2Pdf(png_dir)

#eof