天天看點

58 字型反爬

以二手車為例:https://sz.58.com/ershouche/pn2/?PGTID=0d100000-0000-4e81-5801-e3cfbaae2802&ClickID=120

此連結是深圳的二手車:

58 字型反爬

爬取一頁中的所有二手車詳情連結,從中爬取基本資訊,其中的交易價字型加密了,需要處理才能獲得正确的數字:

58 字型反爬

檢視源代碼,找到這個數字,但源代碼中和右鍵檢檢視到的不一樣,就是說它先死字型加密後變為源碼裡面的樣子,在以繁體這樣的形式渲染到前端:

58 字型反爬

我們就來破解這個字型加密,得到正确的數字。

首先,字型反爬,是通過字型檔案加密映射的,一般字尾為 ttf,woff,我們在源碼中搜尋:

58 字型反爬
58 字型反爬

看到了一個 base64 字段,表示是 base64 編碼過的,這段内容就是字型加密的映射内容,從 src:url(' 這裡開始,一直到 format 前面這是一條 url ,請求通路它,直接可以下載下傳字型檔案,但還可以通過解碼 base64 後面這段,得到二進制内容,寫入檔案,這樣做的可以減少網絡請求,節省一點時間。

可用正規表達式,比對出 base64 加密的内容,然後進行解碼,我們這裡隻講怎麼破解。是以就直接複制了。

base64 庫解碼,fontTools 打開字型檔案并存為 xml 檔案,存為 xml 檔案的目的是便于分析映射關系:

import base64
from fontTools.ttLib import TTFont

code = 'import base64
from fontTools.ttLib import TTFont

code = 'AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8N/ZAAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQYmXfRAAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOGsqepfDzz1AAsIAAAAAADaiRZgAAAAANqJFmAAAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAABAAHAAYAAwAKAAUACQABAAIACAAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAEAACVjwAAlY8AAAAHAACZPAAAmTwAAAAGAACaSwAAmksAAAADAACeOgAAnjoAAAAKAACeowAAnqMAAAAFAACfZAAAn2QAAAAJAACfkgAAn5IAAAABAACfpAAAn6QAAAACAACfpQAAn6UAAAAIAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA'
fontdata = base64.b64decode(code)
file = open('58.woff', 'wb')
file.write(fontdata)
file.close()
onlineFonts = TTFont('58.woff')
onlineFonts.saveXML('test.xml')'
fontdata = base64.b64decode(code)
file = open('58.woff', 'wb')
file.write(fontdata)
file.close()
onlineFonts = TTFont('58.woff')
onlineFonts.saveXML('test.xml')
           

運作上面代碼後,有兩個檔案,一個 woff 字型檔案,一個 轉換後的 xml 檔案:

58 字型反爬

打開 xml 檔案,有兩個重要部分,這兩個部分就是對應的字型映射:

58 字型反爬

我們剛剛檢視源代碼,裡面的是這樣的格式:驋閏.齤

&# 對應 code 屬性中的 0,我們需要替換一下,第一個字元替換後就是 0x9a4b;

它對應的 name 屬性是 glyph00003

glyph00003 對應的 id 屬性是 3,但需要減一才是我們真正的數字,id 實際的值是從 1 開始的

xml 庫讀取 xml檔案的,我們讀取來轉換相應的映射:

import xml.etree.ElementTree as et

content='驋閏.齤'
root=et.parse('test.xml').getroot()
con=root.find('cmap').find('cmap_format_4').findall('map')
for i in con:
    num=i.attrib['name']
    code=i.attrib['code'].replace('0x','&#x')+';'
    content=content.replace(code,str(int(num[-2:])-1))
print(content)
           
58 字型反爬
58 字型反爬

這樣就得到正确的字型了。

爬取二手車的代碼,可掃關注小編公衆号,回複【58二手車】:

58 字型反爬