Extracting a word list from the Oxford Advanced Learner's English-Chinese Dictionary with Python

2023-01-29 12:09:26

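The copy of the Oxford Advanced Learner's English-Chinese Dictionary that I downloaded from the web is stored as plain-text files in folders named A through Z, and each folder holds several files.

Folder A, for example, contains:

[Image: listing of the files in folder A]

The other folders likewise mostly contain several files.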

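The basic idea is to walk the folder tree to find every file, then apply a regular expression to each file in turn and extract the word list. After several days of learning, and climbing out of one pitfall after another, I can now offer a reasonably mature script.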
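The traversal step can be sketched on its own. The standard library's os.walk visits a folder and all of its subfolders, so a helper like the hypothetical walk_files below (an illustration only, not part of the author's script) should return the same kind of file list as a hand-written recursion:

```python
import os

def walk_files(root):
    """Return the paths of all files under root, subfolders included."""
    paths = []
    # os.walk yields one (folder, subfolder names, file names) tuple per folder.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    # Sort so the order does not depend on the file system.
    return sorted(paths)
```

Assuming the dictionary lives in a folder named "oxford-dict", as in the script in this post, the call would be walk_files("oxford-dict").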

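import os
import re

# A candidate word: 2 to 40 lowercase letters or hyphens, an optional
# whitespace character, then a Windows line ending (CRLF).
p = re.compile(r"\b[-a-z]{2,40}\s?\r\n")

# Walk the folder and all of its subfolders,
# returning every file, including those inside subfolders.
def list_all_files(root):
    _files = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if os.path.isdir(path):
            _files.extend(list_all_files(path))
        elif os.path.isfile(path):
            _files.append(path)
    return _files

# Running count of the words written.
count = 0
files = list_all_files("oxford-dict")
# An explicit encoding keeps the output independent of the platform default.
with open("listofwords.txt", "w", encoding="utf-8") as f:
    for file in files:
        f.write("\n" + file + "\n")
        with open(file, "rb") as fr:
            # Don't forget "ignore"; this is a big pitfall. Without it, any
            # bytes that are not valid GBK raise UnicodeDecodeError.
            text = fr.read().decode("gbk", "ignore")
        words = re.findall(p, text)
        # Remove duplicate words within this file, keeping first-seen order.
        word_remove_duplication = list(dict.fromkeys(words))
        # Write the words to the text file.
        for s in word_remove_duplication:
            f.write(s)
            count = count + 1
    f.write("Approximate number of words: {}".format(count))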
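The "ignore" pitfall is easy to reproduce. The sketch below uses made-up bytes (not taken from the dictionary files) with one stray 0xFF byte, which cannot start a GBK character; the helper name decode_lenient is hypothetical. Strict decoding, the default, raises an error, while "ignore" silently drops the offending byte:

```python
# Hypothetical input: two ASCII words with a stray, undecodable byte between them.
raw = b"abandon\r\n\xff\r\nabbey\r\n"

def decode_lenient(data):
    """Decode GBK bytes, dropping any bytes that cannot be decoded."""
    return data.decode("gbk", "ignore")

try:
    raw.decode("gbk")  # strict error handling is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

text = decode_lenient(raw)
print(strict_failed)
print(repr(text))
```

The trade-off is that dropped bytes vanish without a trace, which is acceptable here because only the ASCII headwords are of interest.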

The output text looks like this:

[Images: two screenshots of the output file]
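As a rough check of what ends up in the output, the sketch below runs the same pattern on a made-up sample (the sample text and the helper name extract_words are assumptions for illustration, not dictionary content). Because the pattern is not anchored to the start of a line, the last word of any line that ends in lowercase letters also matches, which is one reason the final count can only be approximate:

```python
import re

p = re.compile(r"\b[-a-z]{2,40}\s?\r\n")

# Hypothetical sample: two headword lines, one definition line, a line with a
# single capital letter, and a final line that ends in LF instead of CRLF.
sample = "abandon\r\nto leave somebody\r\nwell-known \r\nA\r\nabbey\n"

def extract_words(text):
    """Return every match of p with the trailing whitespace stripped."""
    return [m.strip() for m in p.findall(text)]

print(extract_words(sample))
```

On this sample, "somebody" is picked up from the end of the definition line, while "A" (uppercase, one letter) and "abbey" (no CRLF after it) are skipped.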
