Filtering Web Crawlers with Nginx

2023-04-21 12:11:41

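Web crawlers are increasingly common, and many of them are written by beginners. Unlike search engine crawlers, these do not throttle their request rate, so they often consume large amounts of server resources and waste bandwidth. Some crawlers help a site get indexed, such as Baiduspider. Others ignore robots rules, put load on the server, and bring no traffic to the site, such as YisouSpider. (Update: YisouSpider has since been acquired by UC's Shenma search, so this article no longer blocks it.)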

Go to the conf directory under the nginx installation directory and save the following code as agent_deny.conf:

# cd /usr/local/nginx/conf

# vi agent_deny.conf

# Block scraping by tools such as Scrapy (case-insensitive match)
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}

# Block the listed UAs and requests with an empty UA (case-sensitive match)
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|^$" ) {
    return 403;
}

# Block requests that use any method other than GET, HEAD, or POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
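The first rule uses ~* (case-insensitive) while the second uses ~ (case-sensitive), which is easy to overlook. The following is a rough offline sketch for checking how a given User-Agent string would fare against the two UA rules. It uses grep -E rather than nginx's own regex engine, so it is only an approximation. The function name ua_would_be_blocked is hypothetical, and UA_PATTERN repeats only part of the list for brevity.

```shell
# Case-insensitive rule (nginx operator ~*).
TOOL_PATTERN='Scrapy|Curl|HttpClient'
# Case-sensitive rule (nginx operator ~); only a subset of the full list is shown.
UA_PATTERN='WinHttp|WebZIP|FetchURL|AhrefsBot|MJ12bot|YYSpider|YandexBot|^$'

# Hypothetical helper: exit status 0 if the UA in $1 would match either rule.
ua_would_be_blocked() {
    # Rule 1: ignore case, like ~* in nginx.
    if printf '%s\n' "$1" | grep -Eiq "$TOOL_PATTERN"; then
        return 0
    fi
    # Rule 2: exact case, like ~ in nginx; ^$ catches an empty UA.
    printf '%s\n' "$1" | grep -Eq "$UA_PATTERN"
}
```

For example, ua_would_be_blocked 'curl/8.0.1' succeeds because the first rule ignores case, while ua_would_be_blocked 'yyspider' fails because the second rule does not.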

Then insert the following line into the server block of the site's configuration:

include agent_deny.conf;

After saving, run the following command to reload nginx gracefully:

/usr/local/nginx/sbin/nginx -s reload
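A syntax error in agent_deny.conf would make the reload fail, so it can help to test the configuration first with nginx -t. The sketch below wraps both steps. The function name safe_reload and the NGINX_BIN variable are hypothetical, and the default path simply follows the install location used in this article.

```shell
# Path to the nginx binary; override NGINX_BIN if nginx is installed elsewhere.
NGINX_BIN="${NGINX_BIN:-/usr/local/nginx/sbin/nginx}"

# Hypothetical helper: reload only if the configuration test passes.
safe_reload() {
    if "$NGINX_BIN" -t; then
        "$NGINX_BIN" -s reload
    else
        echo "configuration test failed; nginx was not reloaded" >&2
        return 1
    fi
}
```

Calling safe_reload then replaces the bare reload command.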

Testing

Use curl -A to simulate a crawler request, for example:

curl -I -A 'YYSpider' www.test.com

Simulate a request with an empty UA:

curl -I -A '' www.test.com

Simulate a request from Baiduspider:

curl -I -A 'Baiduspider' www.test.com
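The three checks above can be combined into one loop that prints only the HTTP status code for each UA. This is a minimal sketch. The function name check_uas is hypothetical, and www.test.com is the placeholder host used above.

```shell
# Hypothetical helper: check_uas HOST UA...
# Prints each UA followed by the status code nginx returned for it.
check_uas() {
    host="$1"
    shift
    for ua in "$@"; do
        # -s: no progress output, -o /dev/null: discard headers,
        # -I: send a HEAD request, -w: print only the status code.
        code=$(curl -s -o /dev/null -I -A "$ua" -w '%{http_code}' "$host")
        printf '%s\t%s\n' "${ua:-(empty)}" "$code"
    done
}
```

Running check_uas www.test.com 'YYSpider' '' 'Baiduspider' should show 403 for the first two entries if the rules are active, and whatever status the site normally returns for Baiduspider.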

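Appendix: UA list

FeedDemon: content scraping

BOT/0.1 (BOT for JCE): SQL injection

CrawlDaddy: SQL injection

Java: content scraping

Jullo: content scraping

Feedly: content scraping

UniversalFeedParser: content scraping

ApacheBench: CC attack tool

Swiftbot: useless crawler

YandexBot: useless crawler

AhrefsBot: useless crawler

YisouSpider: useless crawler (acquired by UC's Shenma search, so this spider can be allowed)

JikeSpider: useless crawler

MJ12bot: useless crawler

ZmEu: phpMyAdmin vulnerability scanner

WinHttp: scraping and CC attacks

EasouSpider: useless crawler

HttpClient: TCP attack

Microsoft URL Control: scanner

YYSpider: useless crawler

jaunty: WordPress brute-force scanner

oBot: useless crawler

Python-urllib: content scraping

Indy Library: scanner

FlightDeckReports Bot: useless crawler

Linguee Bot: useless crawler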
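A list like this is usually built from a site's own access log. The sketch below counts the most frequent User-Agent strings in a log file. It assumes nginx's default "combined" log format, where the User-Agent is the sixth field when a line is split on double quotes. The function name top_user_agents is hypothetical.

```shell
# Hypothetical helper: top_user_agents LOGFILE [COUNT]
# Prints the COUNT most frequent User-Agent strings (default 20), most frequent first.
top_user_agents() {
    awk -F'"' '{ print $6 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-20}"
}
```

For the install location used in this article, the call would likely be top_user_agents /usr/local/nginx/logs/access.log 10, although the actual log path depends on the access_log setting.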
