
Getting tag content in Python: extracting the data between two tags


<h3>
<a>
Granular computing based
<span>data</span>
<span>mining</span>
in the views of rough set and fuzzy set
</a>
</h3>

Using Python, I want to get the text inside the anchor tag, which should be: Granular computing based data mining in the views of rough set and fuzzy set

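I tried using lxml:

from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
# Select the anchor's direct text nodes plus the text inside its <span> children
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)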

and got the following output:

['\r\n\t\t', '\r\n\t\t\t\t\t\t\t\t\tgranular computing based', 'data', 'mining', 'in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']

Solution

You could use the text_content method:

import lxml.html as LH

html = '''
<h3>
<a>
Granular computing based
<span>data</span>
<span>mining</span>
in the views of rough set and fuzzy set
</a>
</h3>
'''

root = LH.fromstring(html)

for elt in root.xpath('//a'):
    # text_content() returns the text of the element and all of its descendants
    print(elt.text_content())

yields

Granular computing based

data

mining

in the views of rough set and fuzzy set

or, to remove whitespace, you could use

print(' '.join(elt.text_content().split()))

to obtain

Granular computing based data mining in the views of rough set and fuzzy set
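
The split-and-join idiom works because str.split() with no argument splits on any run of whitespace (spaces, tabs, \r, \n) and discards empty pieces, so joining the pieces with a single space collapses every run. A minimal sketch on a hand-written string; the sample value and the name normalize_whitespace are illustrative assumptions, not part of the original answer:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    # split() with no argument splits on any whitespace run and drops empty pieces
    return ' '.join(text.split())


# Hand-written sample resembling the raw text of the anchor (assumed value)
raw = ('\r\n\t\tGranular computing based\r\n\t\tdata\r\n\t\tmining'
       '\r\n\t\tin the views of rough set and fuzzy set\r\n\t')
clean = normalize_whitespace(raw)
print(clean)
```

With lxml it would be applied as normalize_whitespace(elt.text_content()).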

Here is another option which you might find useful:

print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')]))

yields

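Granular computing based data  mining in the views of rough set and fuzzy set

(Note, however, that this leaves an extra space between data and mining: the whitespace-only text node between them is stripped to an empty string but still joined.)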
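
One way to avoid that extra space is to drop the fragments that become empty after stripping before joining them. The following is only a sketch: join_fragments is a hypothetical helper, and the fragments list is a hand-written stand-in for what the XPath query might return when whitespace separates the two inner elements.

```python
def join_fragments(fragments):
    """Strip each text fragment, drop the empty ones, join with single spaces."""
    stripped = (fragment.strip() for fragment in fragments)
    return ' '.join(fragment for fragment in stripped if fragment)


# Assumed sample of text nodes; the third one is whitespace only and is
# the source of the doubled space when it is not filtered out.
fragments = ['\nGranular computing based\n', 'data', '\n', 'mining',
             '\nin the views of rough set and fuzzy set\n']
result = join_fragments(fragments)
print(result)
```

In the answer's code this would be called as join_fragments(root.xpath('//a/descendant-or-self::text()')).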

'//a/descendant-or-self::text()' is a more general version of "//a/child::text() | //a/span/child::text()": it selects the text nodes at every depth below the anchor (children, grandchildren, and so on), not only those in the explicitly listed elements.