Granular computing based
data
mining
in the views of rough set and fuzzy set
Using Python, I want to get the text of the anchor tag above, which should be: Granular computing based data mining in the views of rough set and fuzzy set
I tried using lxml:
from io import StringIO
from lxml import etree
# html holds the page source as a string
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)
and got the following output, a list of fragments with stray whitespace rather than a single string:
['\r\n\t\t', '\r\n\t\t\t\t\t\t\t\t\tgranular computing based', 'data', 'mining', 'in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']
Solution
You could use the text_content method:
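import lxml.html as LH
# Markup reconstructed from the XPath in the question (h3 > a > span).
# The original tags and attributes were lost, so treat it as an assumption.
html = '''
<h3><a>
Granular computing based
<span>data</span>
<span>mining</span>
in the views of rough set and fuzzy set
</a></h3>
'''
root = LH.fromstring(html)
for elt in root.xpath('//a'):
    print(elt.text_content())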
yields
Granular computing based
data
mining
in the views of rough set and fuzzy set
Or, to collapse the runs of whitespace into single spaces, you could use
print(' '.join(elt.text_content().split()))
to obtain
Granular computing based data mining in the views of rough set and fuzzy set
Here is another option which you might find useful:
print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')]))
yields
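Granular computing based data  mining in the views of rough set and fuzzy set
(Note, however, that it leaves an extra space between data and mining: a whitespace-only text node between the two span elements strips to an empty string, and join still puts a space on either side of it.)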
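One way to avoid that extra space is to drop the text nodes that are empty after stripping, or to let XPath's normalize-space() function do the collapsing. The following is only a sketch: the html string is a hypothetical reconstruction of the markup (h3 > a > span, as implied by the XPath in the question), and the names filtered and normalized are made up for illustration.

```python
import lxml.html as LH

# Hypothetical markup, reconstructed from the XPath in the question.
html = '''
<h3><a>
Granular computing based
<span>data</span>
<span>mining</span>
in the views of rough set and fuzzy set
</a></h3>
'''
root = LH.fromstring(html)

# Option 1: strip each text node and drop the ones that end up empty.
parts = (t.strip() for t in root.xpath('//a/descendant-or-self::text()'))
filtered = ' '.join(p for p in parts if p)

# Option 2: let XPath's normalize-space() collapse the whitespace.
# It uses the string value of the first matching a element only.
normalized = root.xpath('normalize-space(//a)')

print(filtered)
print(normalized)
```

Note that the second option only covers the first a element in the document, so the first option is probably the better fit when looping over several anchors.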
'//a/descendant-or-self::text()' is a more general version of
"//a/child::text() | //a/span/child::text()". It selects every text node under the a element at any depth (children, grandchildren, and so on), not just those that are direct children of a or of a span.
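To illustrate the difference, here is a small sketch using made-up markup in which one word sits two levels below the anchor. The markup and the variable names are hypothetical and are not taken from the question.

```python
import lxml.html as LH

# Hypothetical markup with text nested two levels below the anchor.
nested = LH.fromstring('<h3><a>one <span>two <b>three</b></span> four</a></h3>')

# The question's union only reaches direct children of a and of a/span.
shallow = nested.xpath('//h3/a/child::text() | //h3/a/span/child::text()')
shallow_texts = [str(t) for t in shallow]

# descendant-or-self::text() reaches text nodes at any depth.
deep = nested.xpath('//h3/a/descendant-or-self::text()')
deep_texts = [str(t) for t in deep]

print(shallow_texts)
print(deep_texts)
```

The word inside the b element appears only in the second list, because neither branch of the union looks below a/span.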