[Code] One-click download of today's arXiv PDFs

2023-06-24 13:24:39


Copy the code below and run it as is.

arXiv URL: https://arxiv.org/list/astro-ph/new

Here is the code:

import os
import re
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup
import numpy as np   # unused below
import pandas as pd  # unused below; aliasing pandas as np would shadow numpy
 
html = urlopen("https://arxiv.org/list/astro-ph/new")
bsObj = BeautifulSoup(html, "lxml")
 
# Decide whether to download the PDF:
# True if the pattern matches either the title or the abstract.
def decide(title, abstract, regular):
    s1 = re.search(regular, title.get_text())
    s2 = re.search(regular, abstract.get_text())
    return s1 is not None or s2 is not None
 
# Fall back to the default pattern when nothing is entered
regular = input("input regular expression:")
if regular == '':
    regular = "GRB|FRB|GW"
 
# Save the PDFs under `path`, built from the date heading on the page:
dateline = bsObj.find("h3")
year = '20' + dateline.get_text().split(' ')[-1] + '/'
month = dateline.get_text().split(' ')[-2] + '/'
path = 'F:/ArXiv/' + year + month
# Create the directory if it does not exist yet:
os.makedirs(path, exist_ok=True)
 
titleList = bsObj.find_all("div", {"class": "list-title mathjax"})
for title in titleList:
    abstract = title.parent.find("p", {"class": "mathjax"})
    if abstract is not None:
        if decide(title, abstract, regular):      # decide whether to download this paper
            download = title.parent.parent.previous_sibling.previous_sibling.find("a", {"title": "Download PDF"}).attrs['href']
            fileUrl = 'https://arxiv.org' + download
            savePath = path + download[5:] + '.pdf'   # download[5:] drops the leading '/pdf/'
            if os.path.isfile(savePath):
                os.remove(savePath)             # overwrite the existing file
            urlretrieve(fileUrl, savePath)
            print('%s is done!' % title.get_text())
print('Finished')
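
The slice download[5:] in the script assumes that every PDF link starts with '/pdf/'. The sketch below makes that assumption explicit and fails loudly when it does not hold. The helper name pdf_url_and_filename and the sample identifier are hypothetical, not part of the original script; replacing '/' in the identifier is an extra precaution for identifiers that contain a slash.

```python
def pdf_url_and_filename(href, base_url="https://arxiv.org"):
    """Turn a relative PDF link into (full URL, local file name).

    Assumes links look like '/pdf/<identifier>', as the script above does.
    """
    prefix = "/pdf/"
    if not href.startswith(prefix):
        raise ValueError("unexpected PDF link: %r" % href)
    arxiv_id = href[len(prefix):]
    # A '/' inside the identifier would otherwise be read as a subdirectory.
    filename = arxiv_id.replace("/", "_") + ".pdf"
    return base_url + href, filename


# Hypothetical identifier, used only for illustration.
url, name = pdf_url_and_filename("/pdf/2306.12345")
print(url)   # https://arxiv.org/pdf/2306.12345
print(name)  # 2306.12345.pdf
```

If the page layout changes and the link no longer has this form, the ValueError points at the offending link instead of silently writing a badly named file.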

If your environment is not set up yet, install the required packages first:

pip install pandas bs4 lxml

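Papers are saved to: F:\ArXiv\year\month

They are sorted automatically by year and month under the F:\ArXiv folder. To save them elsewhere, change path in the code.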
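
As a minimal sketch of how the year and month folders are derived, the function below repeats the script's logic (last word of the date heading is the two-digit year, second to last is the month) with the base folder as a parameter. The function name build_save_dir and the sample heading text are assumptions made for illustration; the actual heading on the arXiv page may be worded differently.

```python
from pathlib import PurePosixPath


def build_save_dir(dateline_text, base="F:/ArXiv"):
    """Build <base>/<year>/<month> from the text of the date heading."""
    parts = dateline_text.split()   # split() also drops stray whitespace
    year = "20" + parts[-1]         # two-digit year, e.g. '23' -> '2023'
    month = parts[-2]               # month name as it appears in the heading
    return PurePosixPath(base) / year / month


# Hypothetical heading text, used only for illustration.
print(build_save_dir("New submissions for Fri, 23 Jun 23"))
print(build_save_dir("New submissions for Fri, 23 Jun 23", base="D:/papers"))
```

PurePosixPath is used here only so the printed result looks the same on every platform; passing str(build_save_dir(...)) to os.makedirs works as in the script.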

Enter a regular expression at the prompt:

input regular expression:

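Pressing Enter without typing anything uses the default, which searches for papers related to GRB, FRB, and GW; the default can be changed at regular in the code. You can also enter any regular expression that suits your needs.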
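
To show how the pattern behaves, here is a small standalone sketch of the same title-or-abstract check applied to plain strings. The helper name matches and the sample titles are hypothetical. Note that the bare default pattern also matches inside longer words; adding \b word boundaries is one way to restrict it to whole words, at the cost of no longer matching plurals such as GRBs.

```python
import re

DEFAULT_PATTERN = "GRB|FRB|GW"  # same default as in the script


def matches(pattern, title, abstract):
    """Return True if the pattern is found in the title or the abstract."""
    return any(re.search(pattern, text) is not None for text in (title, abstract))


# Hypothetical titles and abstracts, used only for illustration.
print(matches(DEFAULT_PATTERN, "A repeating FRB in a dwarf galaxy", "We report a source."))  # True
print(matches(DEFAULT_PATTERN, "Dust in nearby galaxies", "We study dust grains."))          # False

# The bare pattern matches 'GW' inside 'GWAS'; word boundaries avoid that.
print(matches(DEFAULT_PATTERN, "GWAS methods", ""))          # True
print(matches(r"\b(GRB|FRB|GW)\b", "GWAS methods", ""))      # False
```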

When Finished is printed, the download is complete.

Reference blog post: One-click download of today's arXiv PDFs
