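The rise of big data has, to some extent, driven the rise of web crawlers. Some companies do not produce data themselves and can only scrape it from the web. I have followed this topic for some time and have written several series on crawlers. I have now collected these frameworks as an index for anyone who is interested in crawlers or wants to study them further. You are welcome to star, fork, or open an issue at any time, so that we can improve this awesome series together.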

GitHub repository

Awesome-crawler

A collection of awesome web crawlers, spiders, and resources in different languages.

Python

Scrapy - A fast high-level screen scraping and web crawling framework.
pyspider - A powerful spider system.
cola - A distributed crawling framework.
Demiurge - PyQuery-based scraping micro-framework.
feedparser - Universal feed parser.
Grab - Site scraping framework.
MechanicalSoup - A Python library for automating interaction with websites.
portia - Visual scraping for Scrapy.
crawley - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
MSpider - A simple, easy spider using gevent and JS rendering.

Java

Apache Nutch - Highly extensible, highly scalable web crawler for production environment.
Crawler4j - Simple and lightweight web crawler.
JSoup - Scrapes, parses, manipulates and cleans HTML.
websphinx - Website-Specific Processors for HTML INformation eXtraction.
Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
Gecco - An easy-to-use, lightweight web crawler.
WebCollector - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
Webmagic - A scalable crawler framework.
Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
SeimiCrawler - An agile, distributed crawler framework.

C#

ccrawler - Built with C# 3.5. It contains a simple extension of a web content categorizer, which can separate web pages depending on their content.
SimpleCrawler - Simple spider based on multithreading and regular expressions.
Abot - C# web crawler built for speed and flexibility.
Hawk - Advanced Crawler and ETL tool written in C#/WPF.

JavaScript

simplecrawler - Event driven web crawler.
node-crawler - Node-crawler has a clean, simple API.
js-crawler - Web crawler for Node.js; both HTTP and HTTPS are supported.

PHP

Goutte - A screen scraping and web crawling library for PHP.
- laravel-goutte - Laravel 5 Facade for Goutte.
dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
pspider - Parallel web crawler written in PHP.
php-spider - A configurable and extensible PHP web spider.

C++

open-source-search-engine - A distributed open source search engine and spider/crawler written in C/C++.

Ruby

wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.

Go

gocrawl - Polite, slim and concurrent web crawler.
fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

Scala

crawler - Scala DSL for web crawling.
scrala - Scala crawler (spider) framework, inspired by Scrapy.
ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.

This list is still being updated. For the latest resources, see the GitHub repository: https://github.com/BruceDone/awesome-crawler

[Crawler Resources] A roundup of major crawler resources: building our own awesome series
