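Crawler Basics 1: About Web Crawlers

Crawler Basics 2: The WebMagic Crawler Framework

Crawler Basics 3: Crawler in Practice

3 Crawler in Practice

3.1 Requirements

Every day at a fixed time, crawl articles from the **** blog and store them in the article database.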

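3.2 Data Model Preparation

The channel URLs of **** are listed below:

[Image: list of channel URLs]

Two tables are prepared first.

Channel table:

[Image: structure of the channel table]

Article table:

[Image: structure of the article table]

Add records to the tb_channel table:

[Image: records inserted into tb_channel]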

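3.3 Writing the Code

3.3.1 Writing the Submodule

(1) Create a Spring Boot project in IDEA (not covered in detail here), create the submodule article_crawler, and add the dependencies: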

<dependencies>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
</dependencies>

(2) Create the configuration file application.yml

server:
  port: 9015
spring:
  application:
    name: article-crawler # service name
  datasource:
    driverClassName: com.mysql.jdbc.Driver
    url: jdbc:mysql://****:3306/test_article?characterEncoding=UTF8
    username: ****
    password: ****
  jpa:
    database: MySQL
    show-sql: true
  redis:
    host: ****
    password: ****

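(3) Create the startup class

import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.scheduling.annotation.EnableScheduling;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;
import us.codecraft.webmagic.scheduler.RedisScheduler;
// IdWorker is the project's own ID generator; import it from its package in your code base.

@SpringBootApplication
@EnableScheduling // enables the @Scheduled task in section 3.3.4
public class ArticleCrawlerApplication {

    public static void main(String[] args) {
        SpringApplication.run(ArticleCrawlerApplication.class, args);
    }

    @Value("${spring.redis.host}")
    private String redisHost;

    @Value("${spring.redis.password}")
    private String redisPassword;

    @Bean
    public IdWorker idWorker() {
        return new IdWorker(1, 1);
    }

    // WebMagic scheduler that keeps the URL queue and the visited-URL set in Redis
    @Bean
    public RedisScheduler redisScheduler() {
        JedisPoolConfig config = new JedisPoolConfig(); // connection pool configuration
        config.setMaxTotal(100); // maximum number of connections
        config.setMaxIdle(10);   // maximum number of idle connections
        JedisPool jedisPool = new JedisPool(config, redisHost, 6379, 20000, redisPassword);
        return new RedisScheduler(jedisPool);
    }
}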

(4) Entity classes and data access interfaces (not covered in detail here)
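
The IdWorker bean registered in the startup class is not shown in this article. As a rough illustration of what such a class might look like, here is a minimal Snowflake-style generator. The class name SimpleIdWorker, the bit layout, and the epoch value are all assumptions made for this sketch; the project's real IdWorker may differ.

```java
// Hypothetical stand-in for the project's IdWorker: a Snowflake-style ID generator.
// Layout (an assumption): timestamp | 5-bit datacenter | 5-bit worker | 12-bit sequence.
final class SimpleIdWorker {
    private static final long EPOCH = 1_600_000_000_000L; // arbitrary fixed epoch in ms (assumption)
    private static final long WORKER_BITS = 5L;
    private static final long DATACENTER_BITS = 5L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_WORKER = ~(-1L << WORKER_BITS);
    private static final long MAX_DATACENTER = ~(-1L << DATACENTER_BITS);
    private static final long SEQUENCE_MASK = ~(-1L << SEQUENCE_BITS);

    private final long workerId;
    private final long datacenterId;
    private long sequence = 0L;
    private long lastTimestamp = -1L;

    public SimpleIdWorker(long workerId, long datacenterId) {
        if (workerId < 0 || workerId > MAX_WORKER) {
            throw new IllegalArgumentException("workerId out of range");
        }
        if (datacenterId < 0 || datacenterId > MAX_DATACENTER) {
            throw new IllegalArgumentException("datacenterId out of range");
        }
        this.workerId = workerId;
        this.datacenterId = datacenterId;
    }

    // Returns an ID that is unique and strictly increasing for this worker.
    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            throw new IllegalStateException("Clock moved backwards");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {
                // Sequence exhausted for this millisecond: wait for the next one.
                while ((now = System.currentTimeMillis()) <= lastTimestamp) {
                    // busy-wait
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (DATACENTER_BITS + WORKER_BITS + SEQUENCE_BITS))
                | (datacenterId << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    public static void main(String[] args) {
        SimpleIdWorker worker = new SimpleIdWorker(1, 1);
        System.out.println(worker.nextId());
        System.out.println(worker.nextId());
    }
}
```

The two constructor arguments mirror the `new IdWorker(1, 1)` call in the startup class; giving each crawler instance a different pair is what would keep IDs from colliding across machines in a scheme like this.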

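3.3.2 The Page Processor

Create the article page processor, ArticleProcessor

import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * Article page processor
 */
@Component
public class ArticleProcessor implements PageProcessor {
    @Override
    public void process(Page page) {
        // Queue every article link found on the page.
        // [0-9]+ is used instead of [0-9]{8} so that longer article IDs are not cut off.
        page.addTargetRequests(page.getHtml().links()
                .regex("https://blog\\.csdn\\.net/[a-zA-Z0-9_-]+/article/details/[0-9]+").all());
        // Article title; text() keeps only the heading text, without the <h1> tag
        String title = page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div/div/div[1]/h1/text()").get();
        // Article body, kept as HTML
        String content = page.getHtml().xpath("//*[@id=\"article_content\"]/div[2]").get();
        if (title != null && content != null) {
            page.putField("title", title);
            page.putField("content", content);
        } else {
            page.setSkip(true); // not an article page: skip the pipeline
        }
    }

    @Override
    public Site getSite() {
        // Retry a failed download up to 3 times and wait 100 ms between requests.
        // (The original value of 3000 retries looks like a typo.)
        return Site.me().setRetryTimes(3).setSleepTime(100);
    }
}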
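
To see which links a pattern of this shape would follow, it can be tried outside WebMagic with plain java.util.regex. The sketch below is only an illustration: the class name, the sample URLs, and the exact character classes are assumptions, and it uses a full match, whereas WebMagic's regex selector extracts the matching part of each link.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Standalone check of an article-link pattern, using only the JDK.
final class ArticleLinkPatternDemo {

    // Assumed shape: https://blog.csdn.net/<user>/article/details/<numeric id>
    static final Pattern ARTICLE_LINK = Pattern.compile(
            "https://blog\\.csdn\\.net/[a-zA-Z0-9_-]+/article/details/[0-9]+");

    // Returns only the links that match the article pattern in full.
    static List<String> filter(List<String> links) {
        List<String> matched = new ArrayList<>();
        for (String link : links) {
            if (ARTICLE_LINK.matcher(link).matches()) {
                matched.add(link);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "https://blog.csdn.net/some_user/article/details/12345678", // hypothetical article URL
                "https://blog.csdn.net/nav/ai",                             // channel page
                "https://blog.csdn.net/some_user");                         // profile page
        // Only the first URL has the /article/details/<id> shape.
        System.out.println(filter(links));
    }
}
```

Checking the pattern this way before a crawl makes it easier to notice when the site's URL format no longer matches what the processor expects.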

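3.3.3 The Database Pipeline

Create the pipeline class ArticleDbPipeline, which stores the crawled data in the database

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
// Article, ArticleDao and IdWorker are the project's own classes from step (4).

@Component
public class ArticleDbPipeline implements Pipeline {

    @Autowired
    private ArticleDao articleDao;

    @Autowired
    private IdWorker idWorker;

    private String channelId; // channel ID, set by the task before the spider starts

    public void setChannelId(String channelId) {
        this.channelId = channelId;
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");     // title put by ArticleProcessor
        String content = resultItems.get("content"); // body put by ArticleProcessor
        Article article = new Article();
        article.setId(idWorker.nextId() + "");
        article.setChannelid(channelId);
        article.setTitle(title);
        article.setContent(content);
        articleDao.save(article);
    }
}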
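
The pipeline relies on the caller to set channelId by hand. Since the task in this article crawls https://blog.csdn.net/nav/ai and stores the articles under the channel ID "ai", one option is to derive the ID from the channel URL instead. The helper below is a hypothetical sketch of that idea; the class name and the assumption that every channel URL has the form /nav/<id> are not from the original project.

```java
import java.net.URI;

// Hypothetical helper: derives a channel ID from a channel URL of the form .../nav/<id>.
final class ChannelIds {

    private static final String NAV_PREFIX = "/nav/";

    static String fromNavUrl(String url) {
        String path = URI.create(url).getPath();
        if (path == null || !path.startsWith(NAV_PREFIX)) {
            throw new IllegalArgumentException("Not a channel URL: " + url);
        }
        String id = path.substring(NAV_PREFIX.length());
        if (id.endsWith("/")) {
            id = id.substring(0, id.length() - 1); // tolerate a trailing slash
        }
        if (id.isEmpty() || id.contains("/")) {
            throw new IllegalArgumentException("Not a channel URL: " + url);
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(fromNavUrl("https://blog.csdn.net/nav/ai"));
    }
}
```

With a helper like this, a task could call `articleDbPipeline.setChannelId(ChannelIds.fromNavUrl(startUrl))` so the start URL and the stored channel ID cannot drift apart.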

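3.3.4 The Task Class

Create the task class; the @Scheduled annotation sets when the crawl runs

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.RedisScheduler;

/**
 * Task class
 */
@Component
public class ArticleTask {

    @Autowired
    private ArticleProcessor articleProcessor;

    @Autowired
    private ArticleDbPipeline articleDbPipeline;

    @Autowired
    private RedisScheduler redisScheduler;

    /**
     * Crawl AI articles
     */
    @Scheduled(cron = "0 15 15 * * ?") // every day at 15:15:00
    public void aiTask() {
        System.out.println("Start crawling CSDN articles");
        Spider spider = Spider.create(articleProcessor);
        spider.addUrl("https://blog.csdn.net/nav/ai"); // start from the AI channel page
        articleDbPipeline.setChannelId("ai");
        spider.addPipeline(articleDbPipeline);
        spider.setScheduler(redisScheduler);
        spider.start(); // runs asynchronously in its own thread
    }
}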
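
The cron expression "0 15 15 * * ?" uses Spring's six fields (second, minute, hour, day of month, month, day of week), so it fires once a day at 15:15:00 in the server's time zone. The sketch below reproduces that schedule with plain java.time to show when the next run would happen; the class and method names are made up for this illustration and are not part of Spring.

```java
import java.time.LocalDateTime;
import java.time.LocalTime;

// Illustrates a "once a day at a fixed time" schedule without any scheduling library.
final class DailyRunTime {

    // Returns the next occurrence of the given time of day strictly after 'now'.
    static LocalDateTime nextRun(LocalDateTime now, LocalTime at) {
        LocalDateTime candidate = now.toLocalDate().atTime(at);
        return candidate.isAfter(now) ? candidate : candidate.plusDays(1);
    }

    public static void main(String[] args) {
        LocalTime crawlTime = LocalTime.of(15, 15); // matches "0 15 15 * * ?"
        // In the morning, the next run is later the same day.
        System.out.println(nextRun(LocalDateTime.of(2024, 1, 1, 10, 0), crawlTime));
        // In the evening, the next run is on the following day.
        System.out.println(nextRun(LocalDateTime.of(2024, 1, 1, 18, 0), crawlTime));
    }
}
```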

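Run the Spring Boot project and query the database; the crawled articles should now appear in the article table.

This is only a simple introductory crawler project. A production crawler also needs more complex measures, such as proxy IPs and getting past CAPTCHAs. They are not covered here; interested readers can explore them on their own.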
