A relative supply-and-demand balancing strategy for heavy spider and crawler traffic on top of poor code quality

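Requirements analysis:

For a variety of reasons, when spider traffic and crawl volume are high, the back-end database comes under heavy load, which affects access by normal users and by the English-language platform. The usually recommended approach is a robots.txt file, but the SEO side does not want any limit on spider crawl speed or page coverage: a classic case of too many monks and too little porridge. Another option is to use Oracle resource plans to cap the number of session connections per database user, but that could affect normal users as well. So the idea is a relatively smart script that applies suitable limits to crawlers, allowing as much crawler access as possible while keeping the database server's load normal. Of course this treats the symptoms rather than the cause and is only a temporary fix. The proper solution is still to optimize the database SQL, to serve static pages through caching or similar means, or failing that to split database reads and writes. The SQL generated by the Hibernate framework is not fit for human eyes.

Strategy principle:

Use the firewall to temporarily restrict spider access and reduce the load on the Oracle database. Every 3 minutes the firewall rules are flushed automatically, to keep the impact on spider access as small as possible.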

Part 1: Scheduled tasks and scripts on the front-end web server

[root@server195 ~]# crontab -l
*/3 * * * * /usr/local/scripts/flush_iptables.sh

[root@server195 ~]# cat /usr/local/scripts/flush_iptables.sh
#!/bin/sh
/sbin/iptables -F
/sbin/iptables -X
/sbin/iptables -Z
/sbin/iptables -t nat -F
/sbin/iptables -t nat -X
/sbin/iptables -t nat -Z
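The flush script above clears every rule in the filter and nat tables, not only the spider blocks. If the web server carries other firewall rules, one possible alternative is to keep the spider rules in a dedicated chain and flush only that chain. The sketch below is an assumption, not part of the original setup: the chain name SPIDER_BLOCK is hypothetical, and the functions only print the iptables commands instead of running them.

```shell
#!/bin/sh
# Sketch: emit iptables commands that keep spider blocks in a dedicated chain.
# SPIDER_BLOCK is a hypothetical chain name; commands are printed, not executed.
CHAIN=SPIDER_BLOCK

# One-time setup: create the chain and send port-80 traffic through it.
setup_cmds() {
    echo "iptables -N $CHAIN"
    echo "iptables -I INPUT -p tcp --dport 80 -j $CHAIN"
}

# Read one IP per line on stdin and print a DROP rule for each.
block_cmds() {
    while read -r ip; do
        if [ -n "$ip" ]; then
            echo "iptables -A $CHAIN -s $ip/32 -j DROP"
        fi
    done
}

# Releasing the spiders empties only the dedicated chain.
flush_cmds() {
    echo "iptables -F $CHAIN"
}
```

Piping the output of these functions to `sh` as root would apply the rules; printing them first makes it easy to review what would change.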

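The firewall script samples spider IPs dynamically for 60 seconds. It is invoked from the Oracle server, so SSH key trust must be configured between the web server and the Oracle server.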

[root@server195 ~]# cat /usr/local/scripts/deny_spider.sh
#!/bin/sh
#function: deny some spider to protect oracle server
#author:lw.yang
#modify_time:2011-12-20

DATE=$(date +%Y%m%d)

rm -rf /tmp/soso_spider_ip.txt /tmp/youdao_spider_ip.txt /tmp/sogou_spider_ip.txt /tmp/baidu_spider_ip.txt

# reset the firewall before sampling
service iptables stop

# follow today's access log and collect the client IP (field 3 in this log format) of each spider
tail -f /data/apache_logs/access_log_$DATE |grep -i 'spider'|grep -i 'soso' |awk -F ' ' '{print $3}' >> /tmp/soso_spider_ip.txt &
tail -f /data/apache_logs/access_log_$DATE |grep -i 'spider'|grep -i 'youdao' |awk -F ' ' '{print $3}' >> /tmp/youdao_spider_ip.txt &
tail -f /data/apache_logs/access_log_$DATE |grep -i 'spider'|grep -i 'sogou' |awk -F ' ' '{print $3}' >> /tmp/sogou_spider_ip.txt &
tail -f /data/apache_logs/access_log_$DATE |grep -i 'spider'|grep -i 'baidu' |awk -F ' ' '{print $3}' >> /tmp/baidu_spider_ip.txt &

# sample for 60 seconds, then stop the collectors
sleep 60
killall -9 tail

# drop port-80 traffic from every distinct IP that was collected
for i in $(cat /tmp/soso_spider_ip.txt /tmp/youdao_spider_ip.txt /tmp/sogou_spider_ip.txt /tmp/baidu_spider_ip.txt|sort -u);
do
iptables -A INPUT -s $i/32 -p tcp --dport 80 -j DROP
done
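The extraction step of the script above can also be written as a single filter that reads log lines on stdin, which avoids the four background `tail -f` processes and the `killall`. This is only a sketch: the function name `spider_ips` is made up for illustration, and it assumes, as the script above does, that the client IP is the third space-separated field of the log line.

```shell
#!/bin/sh
# Sketch: print the distinct client IPs of the listed spiders from log lines on stdin.
# Assumes the client IP is the third space-separated field, as in the script above.
spider_ips() {
    grep -i 'spider' | grep -iE 'soso|youdao|sogou|baidu' | awk '{print $3}' | sort -u
}

# Hypothetical usage: inspect only the most recent 5000 lines of today's log.
# tail -n 5000 /data/apache_logs/access_log_$(date +%Y%m%d) | spider_ips
```

Reading a fixed number of recent lines trades the 60-second live sample for an immediate result; which window is appropriate depends on how fast the log grows.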

Part 2: Scheduled tasks and scripts on the database server

[root@server199 ~]# crontab -l
*/1 * * * * /usr/local/scripts/check_load.sh

[root@server199 ~]# cat /usr/local/scripts/check_load.sh
#!/bin/sh
#function: trigger deny spider scripts on web server

# integer part of the load average, parsed from the uptime output
LOAD=$(uptime |awk -F ',' '{print $4}' |awk -F '.' '{print $1}' |cut -d ':' -f 2)

if [ $LOAD -lt 10 ];then
# load is back to normal: clear the flag so blocking can be triggered again
rm -rf /tmp/begin_deny_spider.txt
elif [ -f /tmp/begin_deny_spider.txt ];then
# blocking was already triggered and the load is still elevated
exit
elif [ $LOAD -gt 18 ];then
date > /tmp/begin_deny_spider.txt
ssh -p 2007 192.168.1.195 /usr/local/scripts/deny_spider.sh &
fi
Part 3: Results

Before the restriction

After the restriction

Firewall rule list on the web server

Part 4: Reference for restricting crawlers with robots.txt and Apache configuration

# cat robots.txt
User-agent: *
Robot-version: 2.0
Crawl-delay: 10
Request-rate: 90/1m
Disallow: /WEB-INF/

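Apache configuration:

SetEnvIfNoCase User-Agent "^Sogou" bad_bot
SetEnvIfNoCase User-Agent "^Sosospider" bad_bot
SetEnvIfNoCase User-Agent "^qihoobot" bad_bot
SetEnvIfNoCase User-Agent "^CollapsarWEB" bad_bot

# the directory path is a placeholder; use the site's document root
<Directory "/path/to/docroot">
Options FollowSymLinks
AllowOverride None
# allow,deny: Deny is evaluated last, so requests tagged bad_bot are rejected
Order allow,deny
Allow from all
Deny from env=bad_bot
</Directory>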

Update, 2011-12-22: the check_load script now runs as a resident background process, and the crontab entry has been removed

[root@server199 ~]# cat /usr/local/scripts/check_load.sh
#!/bin/sh
#modify_time:2011-12-22

while true
do
LOAD=$(uptime |awk -F ',' '{print $4}' |awk -F '.' '{print $1}' |cut -d ':' -f 2)

# high load: block spiders on the web server
if [ $LOAD -gt 8 ];then
ssh -p 2007 192.168.1.195 /usr/local/scripts/deny_spider.sh &
fi

# low load: release the blocks
if [ $LOAD -lt 4 ];then
ssh -p 2007 192.168.1.195 /usr/local/scripts/flush_iptables.sh &
fi

sleep 3
done
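Because the loop polls every 3 seconds while deny_spider.sh samples for 60 seconds, a sustained high load would start many overlapping runs. One way to avoid that is to remember whether blocking is currently active and act only on a state change. The following is a sketch under that assumption: the function name `next_action` and the state values "open" and "blocking" are hypothetical, and the thresholds 8 and 4 are taken from the script above.

```shell
#!/bin/sh
# Sketch: decide what a single poll should do, given the load and the current state.
# next_action LOAD STATE prints "deny", "flush" or "none".
# STATE is "blocking" (spiders currently blocked) or "open" (no blocks in place).
next_action() {
    load=$1
    state=$2
    if [ "$load" -gt 8 ] && [ "$state" = "open" ]; then
        echo deny
    elif [ "$load" -lt 4 ] && [ "$state" = "blocking" ]; then
        echo flush
    else
        echo none
    fi
}
```

In the polling loop, "deny" would run deny_spider.sh and set the state to "blocking", and "flush" would run flush_iptables.sh and set it back to "open"; loads between the two thresholds leave the state unchanged.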

This article is reposted from the 斬月 blog on 51CTO. Original link: http://blog.51cto.com/ylw6006/746553. To repost it, please contact the original author.
