A MongoDB storage layer with simple deduplication

2023-08-05 07:55:23

In my previous crawler article I introduced a MongoCache class. Here is the source code with a short introduction.

This storage app was written to serialize and compress the data, but since I am not crawling much content, I commented out both steps in actual use.
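
For readers who want to turn those two steps back on, here is a minimal sketch of the serialize-then-compress round trip using only the standard library. The helper names `encode_record` and `decode_record` and the sample `page` dict are assumptions for illustration, not part of the original class. With MongoDB, the compressed bytes would additionally be wrapped in `Binary` before being stored.

```python
import pickle
import zlib


def encode_record(value):
    """Serialize a Python object with pickle, then compress the bytes with zlib."""
    return zlib.compress(pickle.dumps(value))


def decode_record(blob):
    """Reverse encode_record: decompress first, then unpickle."""
    return pickle.loads(zlib.decompress(blob))


# Hypothetical sample page; the repeated filler makes the compression gain visible.
page = {"url": "https://example.com/jobs/1", "html": "<html>" + "x" * 10_000 + "</html>"}

blob = encode_record(page)
restored = decode_record(blob)

print(len(pickle.dumps(page)), "bytes pickled,", len(blob), "bytes compressed")
print(restored == page)
```

Compression pays off mainly for large, repetitive payloads such as raw HTML. Note that `pickle.loads` should only be used on data you wrote yourself, because unpickling untrusted bytes can execute arbitrary code.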

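I implemented a few basic create, read, update, and delete operations here. It is a bit crude, but it should be enough for ordinary data. The code is as follows: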

import pickle  # object serialization
import zlib  # data compression
from datetime import datetime, timedelta  # cache expiry interval
from pymongo import MongoClient
from bson.binary import Binary  # MongoDB's binary storage type

class MongoCache:

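    def __init__(self, client=None, expires=timedelta(days=30)):
        # Reuse the client passed in; otherwise connect to the local MongoDB server
        self.client = MongoClient('localhost') if client is None else client
        self.db = self.client.LaGou
        # TTL index: MongoDB deletes a document once its 'timestamp' is older than `expires`
        self.db.lagou.create_index('timestamp', expireAfterSeconds=int(expires.total_seconds()))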

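    def __setitem__(self, key, value):
        # Optional: serialize with pickle, compress with zlib, wrap in Binary, and stamp with UTC time
        # record={'result':Binary(zlib.compress(pickle.dumps(value))),'timestamp':datetime.utcnow()}
        # The URL is the key and goes into MongoDB's default _id field; value must be a dict of fields
        # The UTC timestamp is what the TTL index uses to expire the document
        self.db.lagou.update_one({'_id': key}, {'$set': dict(value, timestamp=datetime.utcnow())}, upsert=True)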


    def __getitem__(self, item):  # look up a record by key
        record = self.db.lagou.find_one({'_id': item})
        if record:
            # Values are stored as plain fields (serialization is disabled), so return the document itself
            return record
        else:
            raise KeyError(item + ' does not exist')

    def __contains__(self, item):
        """
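        Called for `in` and `not in` to check whether data for the given URL is already in the database.
        :param item: the URL used as the key
        :return: True if a record exists, otherwise False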
        """
        try:
            self[item]
        except KeyError:
            return False
        else:
            return True

    def clear(self):
        self.db.lagou.drop()
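
The deduplication itself comes from the `in` check, and it can be sketched without a running MongoDB server. The example below is a hedged illustration only: `DictCache` is a hypothetical in-memory stand-in that exposes the same `__setitem__`, `__getitem__`, and `__contains__` interface as `MongoCache`, and `crawl` and `fake_fetch` are made-up helpers rather than code from the original crawler.

```python
class DictCache:
    """Hypothetical in-memory stand-in with the same interface as MongoCache."""

    def __init__(self):
        self._store = {}

    def __setitem__(self, key, value):
        self._store[key] = value

    def __getitem__(self, key):
        return self._store[key]  # raises KeyError when the key is missing

    def __contains__(self, key):
        return key in self._store


def crawl(urls, cache, fetch):
    """Fetch each URL at most once and return the URLs that were actually fetched."""
    fetched = []
    for url in urls:
        if url in cache:  # dispatches to __contains__
            continue
        cache[url] = fetch(url)  # dispatches to __setitem__
        fetched.append(url)
    return fetched


def fake_fetch(url):
    # Placeholder for a real HTTP request.
    return {"url": url, "html": "<html></html>"}


cache = DictCache()
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]

first_pass = crawl(urls, cache, fake_fetch)
second_pass = crawl(urls, cache, fake_fetch)

print(first_pass)
print(second_pass)
```

The first pass fetches each distinct URL once and the second pass fetches nothing. Swapping `DictCache()` for a `MongoCache()` instance should give the same behavior across program restarts, assuming a MongoDB server is reachable on localhost.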

I am a beginner, so there are surely many shortcomings here. Corrections are welcome. Thank you!
