Scraping Shixiseng internship listings with Python: getting past font-based anti-scraping

Reference blog post: http://www.cnblogs.com/eastonliu/p/9925652.html

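The Shixiseng job site uses font-based anti-scraping: the page renders normally in the browser, but when you inspect the source the key information appears as garbled characters.

The raw page source (View Source) does not show the key information either.

It turns out that CSS3 supports custom fonts. The Shixiseng developers replaced some characters with glyphs from a custom font, so the browser can display them, but they are unreadable in the underlying source.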

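1. First, find where these fonts are defined.

Right-click to view the page source and search for font-face; you will see the font data (the encoded payload is very long).

The font source is base64-encoded (an encoding, not encryption). Decode it with the base64 library and save the decoded font to shixi.ttf. Then download a font editor such as FontCreator. Link: https://pan.baidu.com/s/1BPRhWYvOs6KFrgNQ4h7m_g Extraction code: 1fa4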

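import base64

def parse_ttf():
    # Placeholder: paste the base64 payload from the page's font-face rule here
    font_face = " font-face payload from the page source"
    b = base64.b64decode(font_face)
    # Write the decoded bytes out as a TrueType font file
    with open('shixi.ttf', 'wb') as f:
        f.write(b)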
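Copying the payload by hand gets tedious, so here is a minimal sketch of pulling it out of the page source with a regular expression. It assumes the font is embedded as a data: URI inside an @font-face rule; the pattern, the function name extract_font_bytes, and the sample markup are illustrative assumptions, not taken from the actual site.

```python
import base64
import re

# Assumed shape: @font-face { ... src: url("data:...;base64,<payload>") ... }
FONT_RE = re.compile(r'@font-face\s*\{.*?base64,([A-Za-z0-9+/=]+)', re.S)

def extract_font_bytes(page_source):
    """Return the decoded font bytes from the first @font-face data URI, or None."""
    match = FONT_RE.search(page_source)
    if match is None:
        return None
    return base64.b64decode(match.group(1))

# Made-up sample markup; the payload is only the first 12 bytes of a font header.
sample = (
    '<style>@font-face { font-family: myFont; '
    'src: url("data:application/octet-stream;base64,AAEAAAALAIAAAwAw"); }</style>'
)
font_bytes = extract_font_bytes(sample)
print(font_bytes[:4])  # a TrueType file starts with 00 01 00 00
```

The returned bytes can be written to shixi.ttf exactly as parse_ttf() does.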

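Open the font file in FontCreator. You can right-click, choose Captions, then Codepoints to pick how the glyphs are labelled.

These are the glyphs the page substitutes, for example e588 → 1 and ebbc → 5.

2. Next, work out the mapping between these glyphs and the characters at the corresponding positions in the page source.

A .ttf file cannot be opened directly as text. Either convert it to an XML file, or load it with the fontTools library (from fontTools.ttLib import TTFont).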

from fontTools.ttLib import TTFont

def font_dict():
    font = TTFont('shixi.ttf')
    # Dump the font tables to XML so they can be read in a text editor
    font.saveXML('shixi.xml')

Open shixi.xml and find the cmap table, which stores the mapping between character codes and glyphs.
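To see what the cmap entries look like without opening an editor, the XML dump can be read with the standard library. This is only a sketch: the fragment below is hand-written in the shape of a fontTools dump, and its values are sample data in the spirit of the mapping shown later in this post, not the real font.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment shaped like the <cmap> section of shixi.xml (sample values).
SAMPLE_TTX = """
<ttFont>
  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="0" platEncID="3" language="0">
      <map code="0x78" name="x"/>
      <map code="0xe06b" name="uni5929"/>
      <map code="0xe0d2" name="uni70"/>
    </cmap_format_4>
  </cmap>
</ttFont>
"""

def read_cmap(xml_text):
    """Collect {code: glyph name} from every <map> element in the dump."""
    root = ET.fromstring(xml_text)
    return {m.get('code'): m.get('name') for m in root.iter('map')}

print(read_cmap(SAMPLE_TTX))
```

For the real file, ET.parse('shixi.xml').getroot() would replace ET.fromstring.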

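The next step is to extract this mapping. The code attribute in the XML has the same form as the codes in the page source, but getBestCmap() returns the code points as integers (printed in decimal), so hex() is needed to convert each one back into a hexadecimal string in its original form.

One pitfall: the first map entry (code 0x78) is of no use, presumably because its glyph name is not in the uniXXXX form, and the parsing that follows fails unless it is removed.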

from fontTools.ttLib import TTFont

def font_dict():
    font = TTFont('shixi.ttf')
    font.saveXML('shixi.xml')
    # getBestCmap() returns {code point (int): glyph name}
    ccmap = font['cmap'].getBestCmap()
    print("ccmap:\n", ccmap)
    newmap = {}
    for key, value in ccmap.items():
        # Convert the code point back to the hex form used in the page source
        key = hex(key)
        # Strip the 'uni' prefix and left-pad to four hex digits
        value = value.replace('uni', '')
        a = 'u' + '0' * (4 - len(value)) + value
        newmap[key] = a
    print("newmap:\n", newmap)
    # Remove the first, unusable entry
    newmap.pop('0x78', None)
    # Turn each 'uXXXX' string into the actual Unicode character
    for i, j in newmap.items():
        newmap[i] = chr(int(j[1:], 16))
    print("newmap:\n", newmap)

    new_dict = {}
    # Rewrite the keys in the form the codes take in the page source
    for key, value in newmap.items():
        key_ = key.replace('0x', '&#x')
        new_dict[key_] = value

    return new_dict

This gives the mapping between the codes in the page source and the actual characters, for example:

'0xe06b': '天', '0xe0ce': '個', '0xe0d2': 'p', '0xe0d4': 'K', '0xe109': 's'
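The pitfall with the unusable entry can also be handled without hard-coding its key. The sketch below skips any glyph whose name is not of the form uniXXXX; the helper names and the sample cmap are assumptions made for illustration, and the input has the {int: glyph name} shape that getBestCmap() returns.

```python
def glyph_name_to_char(name):
    """Turn a glyph name like 'uni5929' into the character it draws, else None."""
    if not name.startswith('uni'):
        return None
    try:
        return chr(int(name[3:], 16))
    except ValueError:
        return None

def build_entity_map(cmap):
    """Map '&#x....' strings to real characters, skipping unusable entries."""
    entity_map = {}
    for code, name in cmap.items():
        char = glyph_name_to_char(name)
        if char is not None:
            # format(code, 'x') gives the hex digits without the '0x' prefix
            entity_map['&#x' + format(code, 'x')] = char
    return entity_map

# Sample data only; a real cmap comes from font['cmap'].getBestCmap().
sample_cmap = {0x78: 'x', 0xe06b: 'uni5929', 0xe0d2: 'uni70'}
print(build_entity_map(sample_cmap))
```

With this approach the 0x78 entry is dropped because its glyph name is plain 'x', so no explicit pop() is needed.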

3. Replace the codes in the page and extract the correct information.

The full source follows:
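As a small illustration of the replacement step, the sketch below swaps every mapped code in one pass with re.sub instead of looping over the dictionary. It assumes the codes are four hex digits with an optional trailing semicolon; the function name decode_entities and the sample HTML are made up for the example.

```python
import re

# Assumption: obfuscated codes look like &#xe06b, optionally followed by ';'
ENTITY_RE = re.compile(r'&#x([0-9a-fA-F]{4});?')

def decode_entities(html, entity_map):
    """Replace each mapped '&#x....' code with its real character."""
    def swap(match):
        key = '&#x' + match.group(1).lower()
        # Leave codes that are not in the mapping untouched
        return entity_map.get(key, match.group(0))
    return ENTITY_RE.sub(swap, html)

mapping = {'&#xe06b': '天', '&#xe0d2': 'p'}
print(decode_entities('<span>5&#xe06b/周</span>', mapping))
```

Codes missing from the mapping pass through unchanged, which makes a stale mapping easy to spot in the output.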

#coding=utf-8
#2018/11/8 17:01
import base64

import requests
from lxml import etree
from fontTools.ttLib import TTFont


def parse_ttf():
    # Paste the base64 payload from the page's font-face rule here
    font_face = ""
    b = base64.b64decode(font_face)
    with open('shixi.ttf', 'wb') as f:
        f.write(b)


# Process the font and return the dictionary mapping page codes to characters
def font_dict():
    font = TTFont('shixi.ttf')
    font.saveXML('shixi.xml')
    # getBestCmap() returns {code point (int): glyph name}
    ccmap = font['cmap'].getBestCmap()
    print("ccmap:\n", ccmap)
    newmap = {}
    for key, value in ccmap.items():
        # Convert the code point back to the hex form used in the page source
        key = hex(key)
        # Strip the 'uni' prefix and left-pad to four hex digits
        value = value.replace('uni', '')
        a = 'u' + '0' * (4 - len(value)) + value
        newmap[key] = a
    print("newmap:\n", newmap)
    # Remove the first, unusable entry
    newmap.pop('0x78', None)
    # Turn each 'uXXXX' string into the actual Unicode character
    for i, j in newmap.items():
        newmap[i] = chr(int(j[1:], 16))
    print("newmap:\n", newmap)

    new_dict = {}
    # Rewrite the keys in the form the codes take in the page source
    for key, value in newmap.items():
        key_ = key.replace('0x', '&#x')
        new_dict[key_] = value

    return new_dict


# Crawl one page and substitute the obfuscated characters
def crawl(url, new_dict):
    headers = {
        # The header name is 'User-Agent' (hyphen, not underscore)
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    }
    response = requests.get(url, headers=headers)
    html = response.text

    # Replace every obfuscated code with its real character
    for key, value in new_dict.items():
        html = html.replace(key, value)

    html = etree.HTML(html)
    result = html.xpath("//ul[@class='position-list']//li")

    # Extract position name, location, company name, salary and link
    result_data = []
    for element in result:
        try:
            link = 'https://www.shixiseng.com' + element.xpath(".//div[1]//div[1]//a/@href")[0]
            position_name = element.xpath(".//div[1]//div[1]//a/text()")[0]
            company_name = element.xpath(".//div[1]//div[2]//a/text()")[0]
            location = element.xpath(".//div[2]//div[1]/text()")[0]
            salary = element.xpath(".//div[2]//div[2]//span[1]/text()")[0]
            week = element.xpath(".//div[2]//div[2]//span[2]/text()")[0]
            month = element.xpath(".//div[2]//div[2]//span[3]/text()")[0]
        except IndexError:
            # A field is missing: skip this item instead of reusing stale values
            print('wrong!')
            continue
        print(position_name, location, company_name, salary, link, week, month)
        data = {
            'position_name': position_name,
            'company_name': company_name,
            'location': location,
            'salary': salary,
            'week': week,
            'month': month,
            'link': link,
        }
        result_data.append(data)
    print(result_data)
    return result_data


if __name__ == '__main__':
    url = 'https://www.shixiseng.com/interns?k=Python&p='
    parse_ttf()
    data = font_dict()
    print(data)
    result = []
    # Crawl the first two result pages
    for i in range(2):
        result.extend(crawl(url + str(i + 1), data))
    print(result)

The font-face changes frequently, so this data has to be kept up to date.

Corrections are welcome.
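One way to cope with a changing font is to rebuild the mapping only when the embedded payload actually differs from the last one seen. The sketch below is a hypothetical helper, not part of the original script: mapping_for and the build_mapping callback are assumed names, and build_mapping stands in for whatever decodes the payload and builds the dictionary.

```python
import hashlib

# Remembers the last payload digest and the mapping built from it.
_cache = {'digest': None, 'mapping': None}

def mapping_for(font_b64, build_mapping):
    """Return the cached mapping, rebuilding it only if the font payload changed."""
    digest = hashlib.sha256(font_b64.encode('ascii')).hexdigest()
    if digest != _cache['digest']:
        _cache['mapping'] = build_mapping(font_b64)
        _cache['digest'] = digest
    return _cache['mapping']
```

Calling mapping_for() with the payload found on each fetched page keeps the dictionary in step with the site without re-parsing an unchanged font.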
