
Python XPath problem summary

Summary of problems encountered when processing web pages with XPath

<div class="name">
    <div class="title">
        <div class="price">
            <span>
                <a href="" target="_blank" rel="external nofollow">網頁欣賞</a>
            </span>
        </div>
    </div>
</div>
           
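  1. When the content sits several levels deep, text() does not return it directly
    1. text() on the outer div yields whitespace-only strings such as '\n     '. The div's own text nodes are just the indentation between tags; with //text() the real text is still in the returned list, but only after those whitespace entries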
      data.xpath('//div[@class="name"]//text()')
                 
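    2. xpath('string(.)') concatenates all descendant text into one string, so the content is matched correctly (call strip() on the result to drop the surrounding whitespace)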
      data.xpath('//div[@class="name"]')[0].xpath('string(.)')
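The difference between the two approaches above can be checked with a small sketch. It assumes the sample markup is inlined as a string (the name SAMPLE is only illustrative) rather than loaded from a file, and it also shows normalize-space(.), a standard XPath function the original notes do not mention.

```python
from lxml import etree

# Same nested structure as the sample at the top of this post (inlined for the demo).
SAMPLE = """
<div class="name">
    <div class="title">
        <div class="price">
            <span>
                <a href="">網頁欣賞</a>
            </span>
        </div>
    </div>
</div>
"""

root = etree.HTML(SAMPLE)
div = root.xpath('//div[@class="name"]')[0]

# Direct child text nodes only: just the whitespace between the tags.
direct = div.xpath('./text()')

# All descendant text nodes: whitespace entries plus the real text.
all_nodes = div.xpath('.//text()')
cleaned = [t.strip() for t in all_nodes if t.strip()]

# string(.) concatenates every descendant text node into one string.
joined = div.xpath('string(.)').strip()

# normalize-space(.) also trims and collapses runs of whitespace.
normalized = div.xpath('normalize-space(.)')

print(direct)
print(cleaned)
print(joined)
print(normalized)
```

Filtering the //text() list is useful when each text node is needed separately; string(.) is simpler when only the combined text matters.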
                 
  2. Calling xpath again on a div element returned by xpath gives an unexpected result
    1. The div is a <class 'lxml.etree._Element'>; running a second xpath on it to match its children returns nothing
    2. Code that reproduces the problem:
          from lxml import etree

          html = etree.parse('./test_xpath.html', etree.HTMLParser())
          strings = etree.HTML(etree.tounicode(html))
      
          pp = strings.xpath('//div[@class="name"]')[0]
          print(type(strings))  # <class 'lxml.etree._Element'>
          print(type(pp))  # <class 'lxml.etree._Element'>
          print(pp.xpath('/div[@class="title"]'))  # [] because '/' is an absolute path
                 

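      - Why does the match fail?

      - pp.xpath('/div[@class="title"]') starts with '/' rather than '//'. A leading '/' is an absolute path evaluated from the document root, whose only child is <html>, so nothing matches. To search relative to pp, use './div[@class="title"]' for direct children or './/div[@class="title"]' for any depth. A leading '//' also finds the div, but it scans the whole document, not just pp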
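To make the path rules concrete, here is a minimal sketch. It assumes a small inline document (SAMPLE is an illustrative name) with one extra div outside pp, so that the whole-document and relative searches give different counts.

```python
from lxml import etree

# Hypothetical markup: one "title" div inside "name", one outside it.
SAMPLE = """
<div class="name">
    <div class="title"><span>inner</span></div>
</div>
<div class="title">outside</div>
"""

root = etree.HTML(SAMPLE)
pp = root.xpath('//div[@class="name"]')[0]

# Absolute path: starts at the document root, whose child is <html>, not <div>.
absolute = pp.xpath('/div[@class="title"]')

# Relative paths: evaluated from pp itself.
children = pp.xpath('./div[@class="title"]')
descendants = pp.xpath('.//div[@class="title"]')

# '//' without a leading dot searches the entire document, even when called on pp.
whole_doc = pp.xpath('//div[@class="title"]')

print(len(absolute), len(children), len(descendants), len(whole_doc))
```

Starting the expression with a dot is the habit that avoids both the empty result and the accidental whole-document search.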

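  3. When an HTML file is parsed with lxml and serialized back, the Chinese text may come out escaped instead of as readable characters
    from lxml import etree

    html = etree.parse('./test_xpath.html', etree.HTMLParser())
    strings = etree.tostring(html)
    print(strings)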
    
    # the serialized page looks like this
    '''
    b'<!DOCTYPE html>\n<html >&#13;\n<head>&#13;\n    <meta charset="UTF-8"/>&#13;\n    <title>Title</title>&#13;\n</head>&#13;\n<body>&#13;\n<div class="name">&#13;\n    <div class="title">&#13;\n        <div class="price">&#13;\n            <span>&#13;\n                <a href="" target="_blank" rel="external nofollow">&#32593;&#39029;&#27427;&#36175;</a>&#13;\n            </span>&#13;\n        </div>&#13;\n    </div>&#13;\n</div>&#13;\n</body>&#13;\n</html>'
    '''
    # decoding the bytes to str
    html = etree.parse('./test_xpath.html', etree.HTMLParser())
    strings = etree.tostring(html).decode()
    print(strings)
    '''
    <!DOCTYPE html>
    <html >&#13;
    <head>&#13;
        <meta charset="UTF-8"/>&#13;
        <title>Title</title>&#13;
    </head>&#13;
    <body>&#13;
    <div class="name">&#13;
        <div class="title">&#13;
            <div class="price">&#13;
                <span>&#13;
                    <a href="" target="_blank" rel="external nofollow">&#32593;&#39029;&#27427;&#36175;</a>&#13;
                </span>&#13;
            </div>&#13;
        </div>&#13;
    </div>&#13;
    </body>&#13;
    </html>
    
    '''
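    This is still wrong: the Chinese text remains escaped as numeric character references (&#32593; and so on). etree.tostring() serializes to ASCII by default, and decode() only turns the bytes into a str without unescaping the references. The &#13; entries are escaped carriage returns, presumably from Windows (CRLF) line endings in the file.
               
    1. etree.tostring() with its default encoding escapes every non-ASCII character
    2. etree.tounicode() returns a str with the Chinese text intact, so no decode() is needed. Current lxml documents it as deprecated in favor of etree.tostring(html, encoding='unicode'), which gives the same result
    3. For example:
          html = etree.parse('./test_xpath.html', etree.HTMLParser())
          strings = etree.tounicode(html)
          print(strings)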
      
      
      <!DOCTYPE html>
      <html >&#13;
      <head>&#13;
          <meta charset="UTF-8"/>&#13;
          <title>Title</title>&#13;
      </head>&#13;
      <body>&#13;
      <div class="name">&#13;
          <div class="title">&#13;
              <div class="price">&#13;
                  <span>&#13;
                      <a href="" target="_blank" rel="external nofollow">網頁欣賞</a>&#13;
                  </span>&#13;
              </div>&#13;
          </div>&#13;
      </div>&#13;
      </body>&#13;
      </html>
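The serialization behavior described above can be reproduced without the test file. The sketch below assumes a short inline snippet in place of ./test_xpath.html and compares the default call with the encoding argument of etree.tostring().

```python
from lxml import etree

# Inline stand-in for the parsed test file (an assumption made for this demo).
root = etree.HTML('<div class="name"><a href="">網頁欣賞</a></div>')

# Default: ASCII bytes, so non-ASCII characters become numeric character references.
default_bytes = etree.tostring(root)

# encoding='unicode' returns a str with the characters intact.
as_text = etree.tostring(root, encoding='unicode')

# encoding='utf-8' returns UTF-8 bytes that decode back to readable text.
as_utf8 = etree.tostring(root, encoding='utf-8').decode('utf-8')

print(default_bytes)
print(as_text)
print(as_utf8)
```

Either encoding argument keeps the text readable; encoding='unicode' is the more direct choice when a str is wanted.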
                 
Another query that showed the same whitespace-only output:
movie_div = strings.xpath("//div[contains(@class,'doulist-item')]//text()")
'\n    '