
Web Crawler: Scraping Smart Card Company Data and Saving to CSV

Lately I have been tormented by all kinds of encode/decode problems: saving to JSON, CSV, or TXT kept producing garbled text, which is maddening. Some strings printed perfectly fine in PyCharm yet turned into mojibake the moment they were saved, so today I am writing down the pitfalls I ran into.

Using the yktworld (一卡通世界) official site as the example (saving to CSV):

[Screenshots: the fields to be saved]

The fields shown above are what we need to save.

The code is as follows:

import requests
from lxml import etree
import csv

s = requests.Session()

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
}
s.headers = headers

# newline='' stops csv.writer from emitting blank rows on Windows;
# utf-8-sig prepends a BOM so Excel recognizes the encoding (see the note below)
with open('Pyw.csv', 'w', newline='', encoding='utf-8-sig') as csvfile:

    spamwriter = csv.writer(csvfile, dialect='excel')
    # write the header row: company name, agent, main products, contact,
    # address, telephone, mobile phone, website
    spamwriter.writerow(["企業名", "代理商", "主營産品", "聯系人", "位址", "電話", "手機", "網址"])
    for page in range(1, 277):
        r = s.get('http://company.yktworld.com/comapny_search.asp?page={0}&tdsourcetag=s_pcqq_aiomsg'.format(page))

        # the site serves GB-encoded pages; force the encoding before parsing
        r.encoding = 'gb2312'

        print(r.text)  # debug: confirm the page decodes correctly

        selector1 = etree.HTML(r.text)

        # each <div> under this node is one entry on the listing page
        list1 = selector1.xpath('/html/body/div[4]/div[1]/div[2]/div')

        # the first two divs and the last one are not company entries, so slice them off
        for item in list1[2:-1]:
            name = item.xpath('b/a/text()')[0]
            url = item.xpath('b/a/@href')[0]
            agent = item.xpath('font/text()')[0]
            product = item.xpath('div/text()[1]')[0]
            # fetch the company's detail page, which is also GB-encoded
            r = s.get(url)
            r.encoding = 'gb2312'
            selector2 = etree.HTML(r.text)
            # each piece of contact information sits in its own text node
            contacts = selector2.xpath('/html/body/div[4]/div[1]/div[6]/text()[2]')
            contacts = contacts[0] if contacts else ''
            address = selector2.xpath('/html/body/div[4]/div[1]/div[6]/text()[3]')
            address = address[0] if address else ''
            tel = selector2.xpath('/html/body/div[4]/div[1]/div[6]/text()[4]')
            tel = tel[0] if tel else ''
            phone = selector2.xpath('/html/body/div[4]/div[1]/div[6]/text()[5]')
            # this slot holds either '手機' (mobile) or '網址' (website), depending on the entry
            url1 = '暫無'
            try:
                label = phone[0].split(':')[0].strip()
                if label == '手機':
                    phone = phone[0]
                elif label == '網址':
                    url1 = phone[0]
                    phone = '暫無'
                else:
                    phone = '暫無'
            except IndexError:
                phone = '暫無'
            url2 = selector2.xpath('/html/body/div[4]/div[1]/div[6]/text()[6]')
            if url2:
                url2 = url2[0]
                if url2.split(':')[0].strip() == '網址':
                    url1 = url2
                else:
                    url2 = ''
            else:
                url2 = ''
            print(name, agent, product, contacts, address, tel, phone, url1)
            # append this company's record to the CSV file
            spamwriter.writerow([name, agent, product, contacts, address, tel, phone, url1])
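
One pitfall worth calling out: requests guesses the encoding from the HTTP headers, and that guess is often wrong for older GB-encoded Chinese sites, which is why the script forces r.encoding = 'gb2312'. When the output looks garbled, a quick sanity check like the sketch below (reusing the listing URL from above) compares the inferred encoding with what requests detects from the page body:

import requests

r = requests.get('http://company.yktworld.com/comapny_search.asp?page=1')
print(r.encoding)            # encoding inferred from the response headers
print(r.apparent_encoding)   # encoding detected from the body bytes (chardet/charset_normalizer)
# if the two disagree and r.text is garbled, force the detected one
r.encoding = r.apparent_encoding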
           

Note:

with open('Pyw.csv', 'w', encoding='utf-8') as csvfile
           

這樣插入csv的資料将會報錯。

Change it to the following:

with open('Pyw.csv', 'w', encoding='utf-8-sig') as csvfile
           

With this change the CSV data is saved correctly.

The official explanation is:

UTF-8 uses the byte as its code unit, and its byte order is the same on every system, so it has no byte-order problem and does not actually need a BOM ("Byte Order Mark"); UTF-8 with BOM, i.e. utf-8-sig, prepends that BOM anyway. Excel relies on the BOM to recognize a file as UTF-8: without it, Excel falls back to the local ANSI code page and displays the Chinese text as mojibake.

From now on, just default to 'utf-8-sig' and you will be fine.
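
To see concretely what utf-8-sig changes, here is a minimal sketch (the file names are arbitrary) that writes the same row with both encodings and inspects the first three bytes of each file:

row = '企業名,代理商\n'

with open('no_bom.csv', 'w', encoding='utf-8') as f:
    f.write(row)
with open('with_bom.csv', 'w', encoding='utf-8-sig') as f:
    f.write(row)

print(open('no_bom.csv', 'rb').read(3))    # b'\xe4\xbc\x81' -- the first character's own bytes
print(open('with_bom.csv', 'rb').read(3))  # b'\xef\xbb\xbf' -- the UTF-8 BOM that Excel looks for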

Here is a screenshot of the data successfully saved to CSV:

[Screenshot: the resulting CSV file]
