[Code] One-click download of today's arXiv PDFs

Copy the code below and run it as is.

arXiv listing page: https://arxiv.org/list/astro-ph/new

Here is the code:

import os
import re
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup
 
html = urlopen("https://arxiv.org/list/astro-ph/new")
bsObj = BeautifulSoup(html, "lxml")
 
# Decide whether to download the PDF:
# True if the pattern matches the title or the abstract.
def decide(title, abstract, regular):
    s1 = re.search(regular, title.get_text())
    s2 = re.search(regular, abstract.get_text())
    return s1 is not None or s2 is not None
 
# Fall back to the default pattern when nothing is entered.
regular = input("input regular expression:")
if regular == '':
    regular = "GRB|FRB|GW"
 
# Save the PDFs under path.
# The '20' prefix assumes the heading ends with a month and a two-digit year.
dateline = bsObj.find("h3")
year = '20' + dateline.get_text().split(' ')[-1] + '/'
month = dateline.get_text().split(' ')[-2] + '/'
path = 'F:/ArXiv/' + year + month
# Create the directory if it does not exist yet:
os.makedirs(path, exist_ok=True)
 
titleList = bsObj.findAll("div", {"class":"list-title mathjax"})
for title in titleList:
    abstract = title.parent.find("p", {"class":"mathjax"})
    if abstract is not None:
        if decide(title, abstract, regular):      # decide whether to download this paper
            download = title.parent.parent.previous_sibling.previous_sibling.find("a", {"title":"Download PDF"}).attrs['href']
            fileUrl = 'https://arxiv.org' + download
            savePath = path + download[5:] + '.pdf'
            if os.path.isfile(savePath):
                os.remove(savePath)             # overwrite the existing file
            urlretrieve(fileUrl, savePath)
            print('%s is done!' %title.get_text())
print('Finished')
           

If your environment is not set up yet, install the required packages first:

pip install pandas bs4 lxml
           

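Papers are saved to: F:\ArXiv\year\month

PDFs are sorted automatically by year and month under the F:\ArXiv folder; change path in the code to save them elsewhere.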
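The year/month folder logic can be tried on its own, without touching F:\ArXiv. The following is only a sketch: the helper names year_month_from_heading and make_save_dir are hypothetical, and both heading strings are made-up examples rather than text copied from arXiv. It assumes the heading ends with a month followed by either a two-digit or a four-digit year, so check the real heading on the listing page before relying on it.

```python
import os
import re
import tempfile


def year_month_from_heading(heading):
    """Return (year, month) taken from the last two words of a heading."""
    parts = heading.split()
    if len(parts) < 2:
        raise ValueError(f"heading too short: {heading!r}")
    month, year = parts[-2], parts[-1]
    if not re.fullmatch(r"\d{2}|\d{4}", year):
        raise ValueError(f"unrecognized year in heading: {heading!r}")
    if len(year) == 2:
        year = "20" + year  # same assumption as the script: years 2000-2099
    return year, month


def make_save_dir(root, heading):
    """Create root/year/month if needed and return the directory path."""
    year, month = year_month_from_heading(heading)
    path = os.path.join(root, year, month)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


# Demo in a temporary directory; both heading strings are made-up examples.
with tempfile.TemporaryDirectory() as root:
    for heading in ("New submissions for Mon, 3 Jun 19",
                    "Showing new listings for Friday, 3 January 2025"):
        print(make_save_dir(root, heading))
```

Using os.path.join instead of string concatenation keeps the path valid on both Windows and Unix-like systems.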

Enter a regular expression at the prompt:

input regular expression:
           

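Press Enter to accept the default, which searches for papers related to GRB, FRB, and GW; edit regular in the code to change it. You can also enter any regular expression that suits your needs.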
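To see what a pattern will and will not catch before running the downloader, the matching step can be tested on plain strings. This is a minimal sketch: the helper name matches and the sample titles and abstracts are invented for illustration, and it takes strings rather than the BeautifulSoup tags used in the script. Note that re.search is a case-sensitive substring search unless the pattern says otherwise.

```python
import re


def matches(pattern, title, abstract):
    """True if the pattern is found in either the title or the abstract."""
    return any(re.search(pattern, text) for text in (title, abstract))


default_pattern = "GRB|FRB|GW"
# Hypothetical variant: the inline (?i) flag makes the whole pattern case-insensitive.
loose_pattern = "(?i)GRB|FRB|GW"

# Made-up (title, abstract) pairs for illustration only.
samples = [
    ("Gravitational waves from binaries", "We study GW170817."),
    ("Stellar populations in dwarf galaxies", "No transient here."),
    ("A fast radio burst (frb) survey", ""),
]

for title, abstract in samples:
    print(title,
          matches(default_pattern, title, abstract),
          matches(loose_pattern, title, abstract))
```

A case-insensitive pattern catches more spellings but also more false positives, since short tokens such as "gw" can appear inside unrelated words.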

When Finished is printed, the downloads are complete.
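In the script as written, a single network error raises an exception and stops the loop before Finished is printed. One possible way to keep going is sketched below; the function download_all and its fetch parameter are hypothetical, not part of the original script. It relies on the fact that urllib's URLError and HTTPError are subclasses of OSError.

```python
from urllib.request import urlretrieve


def download_all(jobs, fetch=urlretrieve):
    """Try every (url, save_path) pair and return the URLs that failed.

    fetch defaults to urlretrieve; passing another callable makes the
    function easy to try without network access.
    """
    failed = []
    for url, save_path in jobs:
        try:
            fetch(url, save_path)
        except OSError as err:  # covers URLError, HTTPError, and file errors
            print(f"failed: {url} ({err})")
            failed.append(url)
        else:
            print(f"{url} is done!")
    return failed
```

The returned list can be passed through download_all again to retry only the papers that failed.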

Reference blog: One-click download of today's arXiv PDFs
