
A Casual Chat About the Robots Protocol

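To be honest, I came to search-engine topics fairly late. I first heard of the robots protocol through the 2012 "3B war", the dispute between 360 and Baidu.

In 2012, 360 launched its own search engine, "360 Search". Shortly after release it became the second-largest search engine in China, overtaking Sogou and trailing only Baidu.

Baidu, however, pointed out that its robots.txt did not permit the 360 crawler, yet the 360 crawler kept fetching content from Baidu sites such as Baidu Zhidao and Baidu Baike.

In Baidu's view this violated the internationally recognized robots protocol. For background, see: http://baike.baidu.com/view/9230864.htm. That was how I first learned about the robots protocol.

A quick Baidu search turned up the following: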

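The robots protocol (also called the crawler protocol, crawler rules, or the robot protocol) takes the form of a robots.txt file. Through it, a website tells search engines which pages may be crawled and which may not. The robots protocol is a widely accepted ethical norm of the international internet community; its purpose is to protect website data and sensitive information and to keep users' personal information and privacy from being violated. Because it is not a command, search engines have to comply with it voluntarily. Malicious programs (malware) often ignore the robots protocol in order to harvest a site's back-end data and personal information.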

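robots.txt is a plain-text file that can be created and edited with any common text editor, such as Notepad on Windows. robots.txt is a protocol, not a command. It is the first file a search engine looks at when visiting a website, and it tells the spider which files on the server may be viewed.

When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the spider determines its crawl scope from the contents of that file; if it does not, every spider can access all pages on the site that are not password-protected. Baidu's official advice is to use robots.txt only when your site contains content you do not want search engines to index. If you want search engines to index everything on your site, do not create a robots.txt file.

If you think of a website as a room in a hotel, robots.txt is the "Do Not Disturb" or "Please Make Up Room" sign the occupant hangs on the door. The file tells visiting search engines which rooms they may enter and look around, and which rooms are closed to them because they hold valuables or may involve the privacy of residents and visitors. But robots.txt is not a command and not a firewall, just as a doorman cannot stop a thief or other malicious intruder.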
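The lookup described above can be sketched in Python with the standard library. This is only an illustration: the helper names `robots_url` and `may_crawl`, the agent name `ExampleBot`, and the `example.com` rules are all made up for the example, and a missing robots.txt is modelled by passing `None`.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the fixed robots.txt location in the root of the page's site."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def may_crawl(robots_txt, agent: str, page_url: str) -> bool:
    """Decide whether `agent` may fetch `page_url`.

    `robots_txt` is the text of the site's robots.txt, or None when the
    site has no such file (in which case everything is allowed).
    """
    if robots_txt is None:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, page_url)


# Hypothetical rules for an imaginary site.
RULES = "User-agent: *\nDisallow: /private/\n"

print(robots_url("http://example.com/a/b.html?x=1"))
print(may_crawl(RULES, "ExampleBot", "http://example.com/index.html"))
print(may_crawl(RULES, "ExampleBot", "http://example.com/private/data.html"))
print(may_crawl(None, "ExampleBot", "http://example.com/private/data.html"))
```

Note that nothing here enforces anything: the check only works because the crawler chooses to run it before fetching a page.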

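That is Baidu's explanation. Robots is only a protocol: if a crawler chooses not to follow it, nothing technically stops it, and the dispute can only be settled in court.

Let's look at the robots.txt files of a few major websites.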

www.baidu.com/robots.txt

User-agent: Baiduspider
Disallow: /w?

User-agent: Googlebot
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: MSNBot
Allow: /

User-agent: Baiduspider-image
Disallow: /w?

User-agent: YoudaoBot
Allow: /

User-agent: Sogou web spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou inst spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou spider2
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou blog
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou News Spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou Orion spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: JikeSpider
Allow: /

User-agent: Sosospider
Allow: /

User-agent: YYspider
Allow: /

User-agent: PangusoSpider
Allow: /

User-agent: yisouspider
Allow: /

User-agent: EasouSpider
Allow: /

User-agent: *
Disallow: /
           

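The meaning should be fairly clear: the name after User-agent is the name of a web crawler, and the Disallow and Allow lines under it say which paths that crawler may not or may fetch.

Just as Baidu said, 360spider is indeed not given permission to crawl. It has no group of its own, so it falls under the final "User-agent: *" group, whose "Disallow: /" closes off the whole site.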
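To see how the per-crawler groups play out, here is a small sketch that feeds a shortened excerpt of the Baidu listing above to Python's `urllib.robotparser`. The excerpt keeps only three groups, and the helper `check` is a name invented for this example; real crawlers may match user-agent names and paths somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the www.baidu.com/robots.txt listing quoted above.
BAIDU_EXCERPT = """\
User-agent: Baiduspider
Disallow: /w?

User-agent: MSNBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BAIDU_EXCERPT.splitlines())


def check(agent: str, path: str) -> bool:
    """Ask the parsed rules whether `agent` may fetch `path` on www.baidu.com."""
    return parser.can_fetch(agent, "http://www.baidu.com" + path)


# A crawler with its own group uses that group; everyone else gets "*".
for agent in ("Baiduspider", "MSNBot", "360spider"):
    print(agent, check(agent, "/index.html"))
```

In this excerpt only the crawlers that are named explicitly get through; any crawler without its own group, 360spider included, is turned away by the catch-all group.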

www.google.com/robots.txt

User-agent: *
Disallow: /search
Disallow: /sdch
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Allow: /catalogs/about
Allow: /catalogs/p?
Disallow: /catalogues
Disallow: /news
Allow: /news/directory
Disallow: /nwshp
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /imgres
Disallow: /imglanding
Disallow: /sbd
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /default
Disallow: /m?
Disallow: /m/
Disallow: /wml?
Disallow: /wml/?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local?
Disallow: /local_url
Disallow: /shihui?
Disallow: /shihui/
Disallow: /froogle?
Disallow: /products?
Disallow: /products/
Disallow: /froogle_
Disallow: /product_
Disallow: /products_
Disallow: /products;
Disallow: /print
Disallow: /books/
Disallow: /bkshp?*q=*
Disallow: /books?*q=*
Disallow: /books?*output=*
Disallow: /books?*pg=*
Disallow: /books?*jtp=*
Disallow: /books?*jscmd=*
Disallow: /books?*buy=*
Disallow: /books?*zoom=*
Allow: /books?*q=related:*
Allow: /books?*q=editions:*
Allow: /books?*q=subject:*
Allow: /books/about
Allow: /booksrightsholders
Allow: /books?*zoom=1*
Allow: /books?*zoom=5*
Disallow: /ebooks/
Disallow: /ebooks?*q=*
Disallow: /ebooks?*output=*
Disallow: /ebooks?*pg=*
Disallow: /ebooks?*jscmd=*
Disallow: /ebooks?*buy=*
Disallow: /ebooks?*zoom=*
Allow: /ebooks?*q=related:*
Allow: /ebooks?*q=editions:*
Allow: /ebooks?*q=subject:*
Allow: /ebooks?*zoom=1*
Allow: /ebooks?*zoom=5*
Disallow: /patents?
Disallow: /patents/download/
Disallow: /patents/pdf/
Disallow: /patents/related/
Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Disallow: /complete
Disallow: /s?
Disallow: /sponsoredlinks
Disallow: /videosearch?
Disallow: /videopreview?
Disallow: /videoprograminfo?
Allow: /maps?hq=http://maps.google.com/help/maps/directions/biking/mapleft.kml&ie=UTF8&ll=37.687624,-122.319717&spn=0.346132,0.727158&z=11&lci=bike&dirflg=b&f=d
Allow: /maps/api/js?
Disallow: /maps?
Disallow: /mapstt?
Disallow: /mapslt?
Disallow: /maps/stk/
Disallow: /maps/br?
Disallow: /mapabcpoi?
Disallow: /maphp?
Disallow: /mapprint?
Disallow: /maps/api/js/
Disallow: /maps/api/staticmap?
Disallow: /mld?
Disallow: /staticmap?
Disallow: /places/
Allow: /places/$
Disallow: /maps/preview
Disallow: /maps/place
Disallow: /help/maps/streetview/partners/welcome/
Disallow: /help/maps/indoormaps/partners/
Disallow: /lochp?
Disallow: /center
Disallow: /ie?
Disallow: /sms/demo?
Disallow: /katrina?
Disallow: /blogsearch?
Disallow: /blogsearch/
Disallow: /blogsearch_feeds
Disallow: /advanced_blog_search
Disallow: /uds/
Disallow: /chart?
Disallow: /transit?
Disallow: /mbd?
Disallow: /extern_js/
Disallow: /xjs/
Disallow: /calendar/feeds/
Disallow: /calendar/ical/
Disallow: /cl2/feeds/
Disallow: /cl2/ical/
Disallow: /coop/directory
Disallow: /coop/manage
Disallow: /trends?
Disallow: /trends/music?
Disallow: /trends/hottrends?
Disallow: /trends/viz?
Disallow: /notebook/search?
Disallow: /musica
Disallow: /musicad
Disallow: /musicas
Disallow: /musicl
Disallow: /musics
Disallow: /musicsearch
Disallow: /musicsp
Disallow: /musiclp
Disallow: /browsersync
Disallow: /call
Disallow: /archivesearch?
Disallow: /archivesearch/url
Disallow: /archivesearch/advanced_search
Disallow: /base/reportbadoffer
Disallow: /urchin_test/
Disallow: /movies?
Disallow: /codesearch?
Disallow: /codesearch/feeds/search?
Disallow: /wapsearch?
Disallow: /safebrowsing
Allow: /safebrowsing/diagnostic
Allow: /safebrowsing/report_badware/
Allow: /safebrowsing/report_error/
Allow: /safebrowsing/report_phish/
Disallow: /reviews/search?
Disallow: /orkut/albums
Allow: /jsapi
Disallow: /views?
Disallow: /c/
Disallow: /cbk
Allow: /cbk?output=tile&cb_client=maps_sv
Disallow: /recharge/dashboard/car
Disallow: /recharge/dashboard/static/
Disallow: /translate_a/
Disallow: /translate_c
Disallow: /translate_f
Disallow: /translate_static/
Disallow: /translate_suggestion
Disallow: /profiles/me
Allow: /profiles
Disallow: /s2/profiles/me
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
Disallow: /s2
Allow: /s2/search/social
Disallow: /transconsole/portal/
Disallow: /gcc/
Disallow: /aclk
Disallow: /cse?
Disallow: /cse/home
Disallow: /cse/panel
Disallow: /cse/manage
Disallow: /tbproxy/
Disallow: /imesync/
Disallow: /shenghuo/search?
Disallow: /support/forum/search?
Disallow: /reviews/polls/
Disallow: /hosted/images/
Disallow: /ppob/?
Disallow: /ppob?
Disallow: /ig/add?
Disallow: /adwordsresellers
Disallow: /accounts/o8
Allow: /accounts/o8/id
Disallow: /topicsearch?q=
Disallow: /xfx7/
Disallow: /squared/api
Disallow: /squared/search
Disallow: /squared/table
Disallow: /toolkit/
Allow: /toolkit/*.html
Disallow: /globalmarketfinder/
Allow: /globalmarketfinder/*.html
Disallow: /qnasearch?
Disallow: /app/updates
Disallow: /sidewiki/entry/
Disallow: /quality_form?
Disallow: /labs/popgadget/search
Disallow: /buzz/post
Disallow: /compressiontest/
Disallow: /analytics/reporting/
Disallow: /analytics/admin/
Disallow: /analytics/web/
Disallow: /analytics/feeds/
Disallow: /analytics/settings/
Disallow: /alerts/
Disallow: /ads/search
Disallow: /phone/compare/?
Allow: /alerts/manage
Allow: /alerts/remove
Disallow: /travel/clk
Disallow: /hotelfinder/rpc
Disallow: /hotels/rpc
Disallow: /flights/rpc
Disallow: /commercesearch/services/
Disallow: /evaluation/
Disallow: /chrome/browser/mobile/tour
Disallow: /compare/*/apply*
Disallow: /forms/perks/
Disallow: /baraza/*/search
Disallow: /baraza/*/report
Disallow: /shopping/suppliers/search
Disallow: /ct/
Disallow: /edu/cs4hs/
Disallow: /trustedstores/s/
Disallow: /trustedstores/tm2
Disallow: /trustedstores/verify
Disallow: /adwords/proposal
Disallow: /shopping/product/
Disallow: /shopping/seller
Disallow: /shopping/reviewer
Sitemap: http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Sitemap: http://www.google.com/sitemaps_webmasters.xml
Sitemap: http://www.gstatic.com/sitemaps/websearch_hreflang/sitemap_index.xml
Sitemap: http://www.google.com/ventures/sitemap_ventures.xml
Sitemap: http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml
Sitemap: http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml
           

Hm? Why is there an extra Sitemap field here?

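Crawlers discover new pages by following the links inside pages they already know. But what about pages that nothing links to? Or dynamic pages generated from user input? Can site administrators tell search engines which pages on their site are available for crawling? That is what a sitemap is for. The simplest form of a Sitemap is an XML file that lists the URLs of a site along with other data about each URL (when it was last updated, how often it changes, how important it is relative to other URLs on the site, and so on). Search engines can use this information to crawl the site more intelligently.

Sitemaps are a topic of their own, big enough for a separate article, so I won't go into detail here. Interested readers can read up on sitemaps separately.

That raises a new question: how does a crawler know whether a site provides a sitemap file? Or, if the administrator has generated a sitemap (possibly several files), how does the crawler know where they are?

Because the location of robots.txt is fixed, people came up with the idea of putting the sitemap's location in robots.txt. That is how Sitemap became a new member of robots.txt.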
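For a feel of what such an XML file looks like, here is a minimal sketch that parses a tiny hand-written sitemap with Python's standard library. The `example.com` URLs and their values are invented; the element names (`loc`, `lastmod`, `changefreq`, `priority`) and the namespace come from the public sitemaps.org protocol, and `list_urls` is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# A tiny, hypothetical sitemap: only <loc> is required for each URL.
SAMPLE = f"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="{SITEMAP_NS}">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2014-01-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://example.com/about</loc>
  </url>
</urlset>
"""


def list_urls(xml_text: str) -> list[dict]:
    """Return one dict per <url> entry; missing optional fields become None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    ns = {"sm": SITEMAP_NS}
    fields = ("loc", "lastmod", "changefreq", "priority")
    return [
        {name: url.findtext(f"sm:{name}", namespaces=ns) for name in fields}
        for url in root.findall("sm:url", ns)
    ]


for entry in list_urls(SAMPLE):
    print(entry)
```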
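Pulling the sitemap locations out of a robots.txt is a simple line scan. The sketch below is one possible way to do it, not how any particular search engine does it; `sitemap_urls` is a made-up helper, and the sample text reuses two Sitemap lines from the Google listing above plus one ordinary rule.

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Collect the values of all 'Sitemap:' lines, ignoring case and comments."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


SAMPLE = """\
User-agent: *
Disallow: /search
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Sitemap: http://www.google.com/sitemaps_webmasters.xml
"""

for url in sitemap_urls(SAMPLE):
    print(url)
```

Sitemap lines are independent of the User-agent groups, so the scan does not need to track which group it is in.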

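The entries above all point to XML files; feel free to open them and take a look. A Sitemap line can also point to a plain-text file, for example:

[Image: example of a Sitemap entry pointing to a .txt file]

It can also point to a compressed archive. Let's look at Amazon's: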

http://www.amazon.cn/robots.txt

User-agent: *
Disallow: /buycar
Disallow: /cart
Disallow: /checkout
Disallow: /class
Disallow: /com
Disallow: /common
Disallow: /css
Disallow: /dll
Disallow: /doc
Disallow: /dp/e-mail-friend/
Disallow: /dp/manual-submit/
Disallow: /dp/product-availability/
Disallow: /dp/rate-this-item/
Disallow: /dp/shipping/
Disallow: /dp/twister-update/
Disallow: /gp/aws/ssop
Disallow: /gp/cart
Disallow: /gp/css/homepage.html
Disallow: /gp/customer-reviews/common/du
Disallow: /gp/flex
Disallow: /gp/gfix
Disallow: /gp/history
Disallow: /gp/item-dispatch
Disallow: /gp/music/clipserve
Disallow: /gp/music/wma-pop-up
Disallow: /gp/offer-listing
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/recsradio
Disallow: /gp/slredirect
Disallow: /gp/twitter/
Disallow: /gp/vote
Disallow: /gp/voting/
Disallow: /gp/yourstore
Disallow: /inc
Disallow: /js
Disallow: /lib
Disallow: /mn/bookLookInsideApp
Disallow: /mn/checkInitApp
Disallow: /mn/checkoutAlertMsgApp
Disallow: /mn/checkoutredirectApp
Disallow: /mn/giftCardApp
Disallow: /mn/loginApplication
Disallow: /mn/loyaltyApp
Disallow: /mn/orderAddrApp
Disallow: /mn/orderCfmApp
Disallow: /mn/orderDetailApp
Disallow: /mn/orderFailApp
Disallow: /mn/orderHistoryApp
Disallow: /mn/orderModifyApp
Disallow: /mn/orderSummaryApp
Disallow: /mn/paymentRedriveApp
Disallow: /mn/recommendReviewApp
Disallow: /mn/releaseReviewApp
Disallow: /mn/reviewVoteApplication
Disallow: /mn/selectPaymentMethodApp
Disallow: /mn/selectShippingOpptionApplication
Disallow: /mn/shipmentTraceApp
Disallow: /mn/shoppingCartApplication
Disallow: /mn/tellFriend
Disallow: /mn/thankYouApplication
Disallow: /mn/virtualAccountApp
Disallow: /mn/yourAccountApp
Disallow: /paper
Disallow: /xml
Disallow: /youraccount
Disallow: /ap/signin
Disallow: /gp/registry/wishlist/
Disallow: /wishlist/
Allow: /wishlist/universal*
Allow: /wishlist/vendor-button*
Allow: /wishlist/get-button*
Disallow: /gp/wishlist/
Allow: /gp/wishlist/universal*
Allow: /gp/wishlist/vendor-button*
Allow: /gp/wishlist/ipad-install*
Disallow: /registry/wishlist/
Disallow: /gp/help/customer/display.html*nodeId=200843370
Disallow: /gp/help/customer/display.html*nodeId=200877580
Disallow: /gp/help/customer/display.html*nodeId=200877590
Disallow: /gp/help/customer/display.html*nodeId=200879080
Disallow: /gp/help/customer/display.html*nodeId=200879100
Disallow: /gp/help/customer/display.html*nodeId=200879120
Disallow: /gp/help/customer/display.html*nodeId=200879160
Disallow: /gp/help/customer/display.html*nodeId=200879140
Disallow: /gp/help/customer/display.html*nodeId=200877610
Disallow: /gp/help/customer/display.html*nodeId=200878960
Disallow: /gp/help/customer/display.html*nodeId=200878980
Disallow: /gp/help/customer/display.html*nodeId=200879000
Disallow: /gp/help/customer/display.html*nodeId=200879040
Disallow: /gp/help/customer/display.html*nodeId=200879020
Disallow: /gp/help/customer/display.html*nodeId=200877630
Disallow: /gp/help/customer/display.html*nodeId=200879200
Disallow: /gp/help/customer/display.html*nodeId=200879220
Disallow: /gp/help/customer/display.html*nodeId=200879240
Disallow: /gp/help/customer/display.html*nodeId=200879280
Disallow: /gp/help/customer/display.html*nodeId=200879260
Disallow: /gp/help/customer/display.html*nodeId=200877650
Disallow: /gp/help/customer/display.html*nodeId=200879320
Disallow: /gp/help/customer/display.html*nodeId=200879340
Disallow: /gp/help/customer/display.html*nodeId=200879360
Disallow: /gp/help/customer/display.html*nodeId=200879400
Disallow: /gp/help/customer/display.html*nodeId=200879380
Disallow: /gp/help/customer/display.html*nodeId=200877560
Disallow: /gp/help/customer/display.html*nodeId=200843460
Disallow: /gp/help/customer/display.html*nodeId=200843440
Disallow: /gp/help/customer/display.html*nodeId=200899270
Disallow: /gp/help/customer/display.html*nodeId=200879440
Disallow: /gp/help/customer/display.html*nodeId=200899330
Disallow: /gp/help/customer/display.html*nodeId=200899350
Disallow: /gp/help/customer/display.html*nodeId=200899390
Disallow: /gp/help/customer/display.html*nodeId=200899410
Disallow: /gp/help/customer/display.html*nodeId=200899430
Disallow: /gp/help/customer/display.html*nodeId=200899220
Disallow: /gp/help/customer/display.html*nodeId=200899450
Disallow: /gp/help/customer/display.html*nodeId=200899670
Disallow: /gp/help/customer/display.html*nodeId=200899530
Disallow: /gp/help/customer/display.html*nodeId=200899470
Disallow: /gp/help/customer/display.html*nodeId=200899550
Disallow: /gp/help/customer/display.html*nodeId=200899570
Disallow: /gp/help/customer/display.html*nodeId=200899590
Disallow: /gp/help/customer/display.html*nodeId=200899490
Disallow: /gp/help/customer/display.html*nodeId=200899510
Disallow: /gp/help/customer/display.html*nodeId=200899610
Disallow: /gp/help/customer/display.html*nodeId=200899630
Disallow: /gp/help/customer/display.html*nodeId=200899650
Disallow: /gp/help/customer/display.html*nodeId=200879180
Disallow: /gp/help/customer/display.html*nodeId=200879060
Disallow: /gp/help/customer/display.html*nodeId=200879300
Disallow: /gp/help/customer/display.html*nodeId=200879420
Disallow: /gp/help/customer/display.html*nodeId=200899290
Disallow: /gp/help/customer/display.html*nodeId=200899310
Disallow: /gp/help/customer/display.html*nodeId=200843380
Disallow: /gp/help/customer/display.html*nodeId=200843420
Disallow: /gp/help/customer/display.html*nodeId=200899230
Disallow: /gp/help/customer/display.html*nodeId=200899250
Disallow: /gp/help/customer/display.html*nodeId=200899370
Disallow: /gp/help/contact-us/general-questions.html*?type&email&skip=true
Disallow: /gp/help/customer/accessibility?ie=UTF8&initialIssue=forgotpw&skip=true
Disallow: /gp/registry/search.html
Disallow: /gp/orc/rml/
Disallow: /gp/digital/fiona/manage
Disallow: /gp/entity-alert/external
Disallow: /gp/customer-reviews/dynamic/sims-box
Disallow: /review/dynamic/sims-box
Disallow: /gp/redirect.html

# Sitemap files
Sitemap: http://www.amazon.cn/sitemap_feed_index1.xml
Sitemap: http://www.amazon.cn/sitemaps.f3053414d236e84.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.cn/sitemaps.1946f6b8171de60.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.cn/sitemaps.c21f969b5f03d33.SitemapIndex_0.xml.gz


           

We can download one of the compressed files and open it; inside is an XML file.

[Image: contents of the extracted XML file]
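Unpacking a .xml.gz sitemap takes only a few lines of Python. To keep the sketch self-contained it does not download anything: it compresses a small invented sitemap index in memory and then reads it back the way one would read a downloaded file. The `example.com` entries and the helper name `locs_from_gz` are assumptions for illustration, not Amazon's actual data.

```python
import gzip
import xml.etree.ElementTree as ET

LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"

# Hypothetical sitemap index standing in for the contents of a .xml.gz file.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>
"""


def locs_from_gz(data: bytes) -> list[str]:
    """Decompress gzip bytes and return the text of every <loc> element."""
    root = ET.fromstring(gzip.decompress(data))
    return [el.text.strip() for el in root.iter(LOC_TAG)]


compressed = gzip.compress(SAMPLE_XML)  # stands in for the downloaded bytes
for loc in locs_from_gz(compressed):
    print(loc)
```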

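Let's look at one more:

[Image: a robots.txt file that contains an ia_archiver entry]

Hm? What kind of crawler is ia_archiver? I had never seen it before.

A quick Baidu search: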

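ia_archiver is one of Alexa's crawler programs. According to Baidu, it is used to detect whether a website is cheating on its Alexa ranking. The ia_archiver program crawls the internet automatically and probes the traffic information of each web page. In particular, when a site's traffic exceeds a threshold set by Alexa, ia_archiver promptly crawls that site's server to analyze whether the traffic is normal and whether any cheating is going on.

Inviting ia_archiver to visit

Just register on the Alexa official website.

Blocking ia_archiver

ia_archiver is a moderately aggressive crawler. If you feel it uses too many server resources and you don't care about your site's Alexa ranking, you can block it. To do so, create a robots.txt in the site's root directory on the server with the following content:

User-agent: ia_archiver
Disallow: /

This blocks ia_archiver from the entire site. Alternatively, block it from a single directory:

User-agent: ia_archiver
Disallow: /somedir/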
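Before deploying rules like these it can be worth checking that they do what you expect. The following sketch generates both variants and tests them with Python's `urllib.robotparser`; `block_rules`, `allowed`, the agent name `OtherBot`, and the `example.com` URLs are all invented for the example.

```python
from urllib.robotparser import RobotFileParser


def block_rules(agent: str, paths: list[str]) -> str:
    """Build a robots.txt group that disallows `paths` for one crawler."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {path}" for path in paths]
    return "\n".join(lines) + "\n"


def allowed(rules: str, agent: str, url: str) -> bool:
    """Parse `rules` and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


whole_site = block_rules("ia_archiver", ["/"])
one_dir = block_rules("ia_archiver", ["/somedir/"])

print(whole_site)
print(allowed(whole_site, "ia_archiver", "http://example.com/index.html"))
print(allowed(whole_site, "OtherBot", "http://example.com/index.html"))
print(allowed(one_dir, "ia_archiver", "http://example.com/somedir/a.html"))
print(allowed(one_dir, "ia_archiver", "http://example.com/index.html"))
```

Because the group names only ia_archiver, other crawlers are unaffected by either variant.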

That's basically it.

There are some other fun ones too; see: http://lusongsong.com/reed/732.html

That's all for the robots protocol.
