In modern society, parents pay more and more attention to their children's education. With summer vacation approaching, my friend Xiao Wang is facing an anxiety common to many parents: how to prepare a child who is about to enter fifth grade. Xiao Wang's child is bright, lively, and curious about new things, but Xiao Wang is worried that problems with the electronic version of the textbook will spoil the child's learning experience.
It turned out that the electronic textbook Xiao Wang had found was a Word document containing a large number of pictures (exactly one picture per page). The pictures were meant to support the lessons, but when printed they had obvious black frames around the edges, which looked very unsightly. Xiao Wang didn't want his child to be put off learning by this, so he turned to me, his friend who can program, to see whether the problem could be solved by technical means.
I thought about it for a while and decided to write a Python program that extracts the images from a Word document and saves them as JPG files. Xiao Wang can then print the extracted pictures, giving his child clearer and better-looking learning material.
Core code
import docx
import os, re
word_path = 'E:\\code\\plan_work\\Demo.docx'
result_path = "./img_result"
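def get_pictures(word_path, result_path):
    """
    Extract the pictures embedded in a Word document.
    :param word_path: path to the Word (.docx) document
    :param result_path: folder where the extracted pictures are saved
    :return: None
    """
    try:
        doc = docx.Document(word_path)
        # Each relationship of the main document part links to a resource such as an image.
        for rel in doc.part.rels.values():
            # Skip linked (external) images: they have no embedded data to save.
            if rel.is_external or "image" not in rel.target_ref:
                continue
            os.makedirs(result_path, exist_ok=True)
            # target_ref typically looks like "media/image1.jpeg"; keep only the file name.
            img_name = rel.target_ref.split('/')[-1]
            # Document name without folder or extension, for either path separator.
            new_name = re.split(r'[\\/]', os.path.splitext(word_path)[0])[-1]
            img_name = f'{new_name}-{img_name}'
            # The image bytes are written unchanged, so each file keeps its original format.
            with open(os.path.join(result_path, img_name), "wb") as f:
                f.write(rel.target_part.blob)
    except Exception as e:
        # Report the failure instead of silently ignoring it.
        print(f'Failed to extract pictures from {word_path}: {e}')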
if __name__ == '__main__':
    # To process every Word document in a folder instead, set the folder path and uncomment:
    # os.chdir(r"D:\Demo")
    # for name in os.listdir(os.getcwd()):
    #     if name.endswith(".docx"):
    #         get_pictures(name, os.getcwd())
    get_pictures(word_path, result_path)
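If installing python-docx is not convenient, the same idea can be tried with the standard library alone. A .docx file is an ordinary ZIP archive, and Word normally stores embedded pictures under the word/media/ folder inside it. The sketch below relies on that assumption; the function name extract_media and its return value are my own choices for illustration and are not part of the program above.

```python
import os
import zipfile


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx archive into out_dir.

    Hypothetical helper for illustration; returns the list of files written.
    """
    # Prefix output names with the document name so that pictures from
    # different documents do not overwrite each other.
    doc_name = os.path.splitext(os.path.basename(docx_path))[0]
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for member in archive.namelist():
            # Assumption: embedded pictures live under word/media/.
            if not member.startswith("word/media/") or member.endswith("/"):
                continue
            os.makedirs(out_dir, exist_ok=True)
            target = os.path.join(out_dir, f"{doc_name}-{os.path.basename(member)}")
            # The bytes are copied unchanged, so each picture keeps its format.
            with open(target, "wb") as f:
                f.write(archive.read(member))
            written.append(target)
    return written
```

Calling extract_media("Demo.docx", "./img_result") (placeholder paths) would write one file per embedded picture and return their paths. Unlike the python-docx version, this sketch copies everything found in word/media/, including pictures the document no longer references.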