Removing URL Tags with BeautifulSoup

The original text is shown in the figure below:

[Figure: original text before processing]

The processed text is shown in the figure below:

[Figure: text after processing]

The processing code is as follows (Python 3.5):

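# -*- coding: utf-8 -*-
import time

from bs4 import BeautifulSoup

t1 = time.time()
lines = []
# Read the input as bytes and decode each line as UTF-8.
with open('undergraduatePOI.txt', 'rb') as f:
    for eachLine in f:
        t = eachLine.strip().decode('utf8')
        # Name the parser explicitly so bs4 does not warn or pick one at random.
        soup = BeautifulSoup(t, 'html.parser')
        # get_text() keeps only the text nodes and drops every tag.
        text = soup.get_text()
        print(text)
        lines.append(text)
# Write each cleaned line preceded by a newline.
with open('Puser2.txt', 'w', encoding='utf-8') as f:
    f.write(''.join('\n' + line for line in lines))
print("\n"+">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>"+"\n"+"Printing finished")
t2 = time.time()
print("Time taken to remove URLs: "+str(t2-t1)+" seconds")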
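
To see what the script does to a single line, here is a minimal sketch that applies the same `get_text()` call to an in-memory string. The helper name `strip_tags` and the sample line are hypothetical, since the actual contents of `undergraduatePOI.txt` are not shown in the post.

```python
from bs4 import BeautifulSoup


def strip_tags(line: str) -> str:
    """Return the text of `line` with all HTML tags removed."""
    # 'html.parser' is the parser bundled with Python; no extra install needed.
    return BeautifulSoup(line, "html.parser").get_text()


# Hypothetical sample line, made up for illustration only.
sample = 'Library <a href="http://example.com/map">campus map</a> entrance'
print(strip_tags(sample))
```

Note that `get_text()` removes every tag, not only links: the `<a>` element and its `href` attribute disappear while the link text stays. A bare URL written as plain text, outside any tag, would be left untouched.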
           
