BeautifulSoup

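A few gripes before the article proper. Python is great in most respects: it is simple and intuitive, which suits a programming newcomer like me very well. Its one fatal pitfall is how fast it changes. The two versions in common use today are 2.7 and 3.5, and some things that work in 2.7 do not work unchanged in 3.5. For example, after import sys you can call reload(sys) directly in 2.7, but not in 3.5, where reload has moved into importlib. Likewise, reduce is a built-in in 2.7 but has to be imported from functools in 3.5. It was an eye-opener for me that a newer version would refuse to accommodate an older one!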
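For readers hitting the same two errors, here is a minimal sketch of the Python 3 spellings. The json module is used only as a stand-in for "some module that is already imported"; it is an illustration and not something the setdefaultencoding-style reload(sys) trick from 2.7 needs or replaces.

```python
import importlib
import json  # stand-in module for the reload demonstration (an assumption)
from functools import reduce  # reduce is no longer a built-in in Python 3

# Fold a list into a single value, as the 2.7 built-in reduce did.
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
print(total)

# importlib.reload re-executes a module and returns the module object.
reloaded = importlib.reload(json)
print(reloaded is json)
```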

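Yet many people say 3.5 is where things are heading, even though 2.7 is still more convenient in places. Online courses are everywhere now, and many instructors dive straight into Python. That would be fine, except that they talk for ages without ever mentioning which browser they use or which Python interpreter version they run. And then there is that wretched PyCharm, which seems to update hundreds of times a year; every update changes not only how the page looks but also where everything is located. For beginners this is extremely disorienting.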

OK, rant over. However much I complain, life goes on, so let's just get over the hurdles.

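Once the BeautifulSoup module is installed, load it with from bs4 import BeautifulSoup. After that you can hand it the "content" you want to search. BeautifulSoup is mainly aimed at web page files: because the tags in a page's source come in matching pairs, BeautifulSoup can locate things quickly and accurately.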
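As a quick illustration of how the paired tags become a navigable tree, here is a minimal sketch. The one-line snippet is made up for this example and is not part of the page used in the rest of the article.

```python
from bs4 import BeautifulSoup

# A tiny made-up document: <b> is nested inside <p>, which is inside <div>.
snippet = "<div><p>Hello, <b>soup</b>!</p></div>"
soup = BeautifulSoup(snippet, "html.parser")

bold = soup.find("b")
print(bold.get_text())    # the text between <b> and </b>
print(bold.parent.name)   # the name of the tag that encloses <b>
```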

Suppose the content we want to search is:

html_doc="""<html>

<head>

</head>

<body>

<div class="topic"><a href="www.51cto.com/welcome.html">歡迎來到這裡！</a>

<url>

<li><a href="http://www.51cto.com/1.html">這是第一頁</a></li>

<li><a href="http://www.51cto.com/2.html">這是第二頁</a></li>

<li><a href="http://www.51cto.com/3.html">這是第三頁</a></li>

</url>

</div>

</body>

</html>"""

Once that is entered, we need to create a soup object:

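soup = BeautifulSoup(html_doc, "html.parser")

Here html_doc is the "range to search" and "html.parser" is the parser used to process it. An optional from_encoding="utf-8" argument declares the encoding of the markup, but it only applies when html_doc is a byte string; for an ordinary Python 3 str it is ignored with a warning, so it is left out here.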
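To see where from_encoding does matter, here is a minimal sketch that parses the same markup once as text and once as UTF-8 bytes. The variable names are made up for this example.

```python
from bs4 import BeautifulSoup

text_markup = "<p>歡迎來到這裡！</p>"
byte_markup = text_markup.encode("utf-8")  # what you would get from a raw download

# str input: already decoded, so no encoding hint is needed.
soup_from_text = BeautifulSoup(text_markup, "html.parser")

# bytes input: from_encoding tells BeautifulSoup how to decode it.
soup_from_bytes = BeautifulSoup(byte_markup, "html.parser", from_encoding="utf-8")

print(soup_from_text.p.get_text())
print(soup_from_bytes.p.get_text())
```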

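Looking at the document, every link has the form <a href="blablabla">text</a>. So let's collect all the links in a variable named links.

links = soup.find_all("a")

We use find_all rather than find because find_all works a bit like re.findall: it keeps digging until it has every match. find is more like re.search: it stops as soon as it has one hit.

for each in links:
    print(each.name, each["href"], each.get_text())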

The output will be:

a www.51cto.com/welcome.html 歡迎來到這裡！

a http://www.51cto.com/1.html 這是第一頁

a http://www.51cto.com/2.html 這是第二頁

a http://www.51cto.com/3.html 這是第三頁
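The difference between find and find_all is easiest to see side by side. The following is a minimal sketch on a shortened copy of the page; the "table" lookup is added here only to show what each call returns when nothing matches.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="topic"><a href="www.51cto.com/welcome.html">歡迎來到這裡！</a>
<li><a href="http://www.51cto.com/1.html">這是第一頁</a></li>
<li><a href="http://www.51cto.com/2.html">這是第二頁</a></li>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("a")               # a single Tag: the first match only
every = soup.find_all("a")           # a list-like result holding every match
missing = soup.find("table")         # None when nothing matches
missing_all = soup.find_all("table") # an empty result when nothing matches

print(first["href"])
print(len(every))
print(missing)
print(len(missing_all))
```

Because find returns None on a miss, it is worth checking the result before indexing into it with ["href"].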

=================================Divider=====================================

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Pinpoint the one <a> tag whose href is "http://example.com/tillie"
tillie = soup.find("a", href="http://example.com/tillie")

print(tillie.get_text())

Running this program prints Tillie.
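href is not the only attribute you can filter on. As a minimal sketch on the same three links, the lookups below use id, class_ (with the trailing underscore, since class is a Python keyword) and an attrs dictionary; the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

by_id = soup.find("a", id="link2")                # match on the id attribute
by_class = soup.find_all("a", class_="sister")    # match on the CSS class
by_attrs = soup.find("a", attrs={"href": "http://example.com/elsie"})

names = [a.get_text() for a in by_class]
print(by_id.get_text())
print(names)
print(by_attrs["id"])
```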

================================Divider======================================

# html_doc is the same document as in the first experiment; its key lines are:
#   <div class="topic"><a href="www.51cto.com/welcome.html">歡迎來到這裡！</a>
#   <li><a href="http://www.51cto.com/1.html">這是第一頁</a></li>
#   <li><a href="http://www.51cto.com/2.html">這是第二頁</a></li>
#   <li><a href="http://www.51cto.com/3.html">這是第三頁</a></li>
soup = BeautifulSoup(html_doc, "html.parser")
for each in soup.find_all("div"):
    print(each.name, each.get_text())

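This program has the same structure as the first experiment, except that the keyword "a" has been changed to "div". Run it and see what the result is.

If the keyword is then changed from "div" to "url", what is the result? Compare the result for "a" with the result for "url": why do they differ?

And if the last line is print(each.name,each["href"],each.get_text()), what is the result then?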
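One way to check your answers is to run all three keywords in a single loop. This is a minimal sketch, not the only way to do it: it uses tag.get("href"), which returns None when the attribute is absent, in place of each["href"], which raises KeyError on a tag such as <div> that has no href. The summary dictionary is a made-up helper for this example.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="topic"><a href="www.51cto.com/welcome.html">歡迎來到這裡！</a>
<url>
<li><a href="http://www.51cto.com/1.html">這是第一頁</a></li>
<li><a href="http://www.51cto.com/2.html">這是第二頁</a></li>
<li><a href="http://www.51cto.com/3.html">這是第三頁</a></li>
</url>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")

summary = {}
for keyword in ("a", "div", "url"):
    matches = soup.find_all(keyword)
    # get_text(" ", strip=True) joins all nested text with single spaces.
    summary[keyword] = [
        (tag.name, tag.get("href"), tag.get_text(" ", strip=True))
        for tag in matches
    ]
    for row in summary[keyword]:
        print(keyword, row)
```

The "div" and "url" searches each match a single container tag, so get_text() gathers the text of everything nested inside it, while "a" matches each link separately.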

This article is reposted from the 51CTO blog of 蘇幕遮618. Original link: http://blog.51cto.com/chenx1242/1730367
