3.3 XPath
XPath, the XML Path Language, is a language for addressing parts of an XML document.
An XML document (HTML documents can be treated the same way) is a tree made up of nodes, for example:
<html>
    <body>
        <div>
            <p>Hello world</p>
            <a href="/home">Click here</a>
        </div>
    </body>
</html>
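The nodes of an XML document come in several types. The most commonly used are:
● Root node: the root of the whole document tree.
● Element nodes: html, body, div, p, a.
● Attribute nodes: href.
● Text nodes: Hello world, Click here.
Nodes are related to one another in the following ways:
● Parent/child: body is a child of html, and p and a are children of div. Conversely, div is the parent of p and a.
● Sibling: p and a are sibling nodes.
● Ancestor/descendant: body, div, p, and a are all descendants of html. Conversely, html is an ancestor of body, div, p, and a.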
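The relationships listed above can be checked programmatically. The following is a minimal sketch that uses Python's standard-library xml.etree.ElementTree rather than Scrapy, purely to walk the tree. The variable names (doc, parent_of, and so on) are illustrative and not part of the original text.

```python
import xml.etree.ElementTree as ET

# The sample document from above, written as well-formed XML.
doc = """
<html>
  <body>
    <div>
      <p>Hello world</p>
      <a href="/home">Click here</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(doc)  # the html element

# ElementTree elements keep no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

div = root.find("body/div")
p = div.find("p")
a = div.find("a")

children = [child.tag for child in div]           # direct children of div
descendants = [el.tag for el in root.iter()][1:]  # every element below html

print(children)                       # ['p', 'a']
print(parent_of[p].tag)               # div
print(parent_of[p] is parent_of[a])   # True: p and a are siblings
print(descendants)                    # ['body', 'div', 'p', 'a']
print(a.attrib, p.text)               # {'href': '/home'} Hello world
```

Here the element tags correspond to element nodes, a.attrib to the attribute node, and p.text to a text node.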
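3.3.1 Basic Syntax
Table 3-1 lists the basic XPath syntax in common use.
Table 3-1 Common basic XPath syntax
Next, we demonstrate the use of XPath through some examples.
First, create an HTML document for the demonstration and use it to construct an HtmlResponse object: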
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> # The markup below is indented with tab characters, hence the \t in later output.
>>> body = '''
... <html>
... 	<head>
... 		<base href='http://example.com/'/>
... 		<title>Example website</title>
... 	</head>
... 	<body>
... 		<div id='images'>
... 			<a href='image1.html'>Name: My image 1 <br/><img src='image1.jpg'/></a>
... 			<a href='image2.html'>Name: My image 2 <br/><img src='image2.jpg'/></a>
... 			<a href='image3.html'>Name: My image 3 <br/><img src='image3.jpg'/></a>
... 			<a href='image4.html'>Name: My image 4 <br/><img src='image4.jpg'/></a>
... 			<a href='image5.html'>Name: My image 5 <br/><img src='image5.jpg'/></a>
... 		</div>
... 	</body>
... </html>
... '''
>>> response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')
● /: describes an absolute path starting from the root.
>>> response.xpath('/html')
[<Selector xpath='/html' data='<html>\n\t<head>\n\t\t<base href="http://exam'>]
>>> response.xpath('/html/head')
[<Selector xpath='/html/head' data='<head>\n\t\t<base href="http://example.com/'>]
● E1/E2: selects all E2 elements among the children of E1.
# Select all a elements that are children of div
>>> response.xpath('/html/body/div/a')
[<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='/html/body/div/a' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='/html/body/div/a' data='<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='/html/body/div/a' data='<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='/html/body/div/a' data='<a href="image5.html">Name: My image 5 <'>]
● //E: selects all E elements in the document, wherever they appear.
# Select all a elements in the document
>>> response.xpath('//a')
[<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]
● E1//E2: selects all E2 elements among the descendants of E1, at any depth.
# Select all img elements among the descendants of body
>>> response.xpath('/html/body//img')
[<Selector xpath='/html/body//img' data='<img src="image1.jpg">'>,
 <Selector xpath='/html/body//img' data='<img src="image2.jpg">'>,
 <Selector xpath='/html/body//img' data='<img src="image3.jpg">'>,
 <Selector xpath='/html/body//img' data='<img src="image4.jpg">'>,
 <Selector xpath='/html/body//img' data='<img src="image5.jpg">'>]
● E/text(): selects the text nodes that are children of E.
# Select the text of every a element
>>> sel = response.xpath('//a/text()')
>>> sel
[<Selector xpath='//a/text()' data='Name: My image 1 '>,
 <Selector xpath='//a/text()' data='Name: My image 2 '>,
 <Selector xpath='//a/text()' data='Name: My image 3 '>,
 <Selector xpath='//a/text()' data='Name: My image 4 '>,
 <Selector xpath='//a/text()' data='Name: My image 5 '>]
>>> sel.extract()
['Name: My image 1 ',
 'Name: My image 2 ',
 'Name: My image 3 ',
 'Name: My image 4 ',
 'Name: My image 5 ']
● E/*: selects all element nodes that are children of E.
# Select all element children of html
>>> response.xpath('/html/*')
[<Selector xpath='/html/*' data='<head>\n\t\t<base href="http://example.com/'>,
 <Selector xpath='/html/*' data='<body>\n\t\t<div id="images">\n\t\t\t<a href="i'>]
# Select all descendant element nodes of div
>>> response.xpath('/html/body/div//*')
[<Selector xpath='/html/body/div//*' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='/html/body/div//*' data='<br>'>,
 <Selector xpath='/html/body/div//*' data='<img src="image1.jpg">'>,
 <Selector xpath='/html/body/div//*' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='/html/body/div//*' data='<br>'>,
 <Selector xpath='/html/body/div//*' data='<img src="image2.jpg">'>,
 <Selector xpath='/html/body/div//*' data='<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='/html/body/div//*' data='<br>'>,
 <Selector xpath='/html/body/div//*' data='<img src="image3.jpg">'>,
 <Selector xpath='/html/body/div//*' data='<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='/html/body/div//*' data='<br>'>,
 <Selector xpath='/html/body/div//*' data='<img src="image4.jpg">'>,
 <Selector xpath='/html/body/div//*' data='<a href="image5.html">Name: My image 5 <'>,
 <Selector xpath='/html/body/div//*' data='<br>'>,
 <Selector xpath='/html/body/div//*' data='<img src="image5.jpg">'>]
● */E: selects all E elements among the grandchildren.
# Select all img elements among the grandchildren of div
>>> response.xpath('//div/*/img')
[<Selector xpath='//div/*/img' data='<img src="image1.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image2.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image3.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image4.jpg">'>,
 <Selector xpath='//div/*/img' data='<img src="image5.jpg">'>]
● E/@ATTR: selects the ATTR attribute of E.
# Select the src attribute of every img
>>> response.xpath('//img/@src')
[<Selector xpath='//img/@src' data='image1.jpg'>,
 <Selector xpath='//img/@src' data='image2.jpg'>,
 <Selector xpath='//img/@src' data='image3.jpg'>,
 <Selector xpath='//img/@src' data='image4.jpg'>,
 <Selector xpath='//img/@src' data='image5.jpg'>]
● //@ATTR: selects every ATTR attribute in the document.
# Select every href attribute
>>> response.xpath('//@href')
[<Selector xpath='//@href' data='http://example.com/'>,
 <Selector xpath='//@href' data='image1.html'>,
 <Selector xpath='//@href' data='image2.html'>,
 <Selector xpath='//@href' data='image3.html'>,
 <Selector xpath='//@href' data='image4.html'>,
 <Selector xpath='//@href' data='image5.html'>]
● E/@*: selects all attributes of E.
# Get all attributes of the img under the first a (here src is the only one)
>>> response.xpath('//a[1]/img/@*')
[<Selector xpath='//a[1]/img/@*' data='image1.jpg'>]
● .: selects the current node; used to write relative paths.
# Get the selector object for the first a
>>> sel = response.xpath('//a')[0]
>>> sel
<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>
# Suppose we want all img elements among the descendants of this a.
# The following is wrong and finds every img in the document,
# because //img is an absolute path: the search starts from the
# document root, not from the current a.
>>> sel.xpath('//img')
[<Selector xpath='//img' data='<img src="image1.jpg">'>,
 <Selector xpath='//img' data='<img src="image2.jpg">'>,
 <Selector xpath='//img' data='<img src="image3.jpg">'>,
 <Selector xpath='//img' data='<img src="image4.jpg">'>,
 <Selector xpath='//img' data='<img src="image5.jpg">'>]
# Use .//img to describe all img elements among the descendants of the current node
>>> sel.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image1.jpg">'>]
● ..: selects the parent of the current node; used to write relative paths.
# Select the parent node of every img
>>> response.xpath('//img/..')
[<Selector xpath='//img/..' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='//img/..' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='//img/..' data='<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='//img/..' data='<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='//img/..' data='<a href="image5.html">Name: My image 5 <'>]
● node[predicate]: a predicate is used to find a specific node, or nodes that contain a specific value.
# Select the third a (more precisely, every a that is the third a child of its parent)
>>> response.xpath('//a[3]')
[<Selector xpath='//a[3]' data='<a href="image3.html">Name: My image 3 <'>]
# Use the last() function to select the last one
>>> response.xpath('//a[last()]')
[<Selector xpath='//a[last()]' data='<a href="image5.html">Name: My image 5 <'>]
# Use the position() function to select the first three
>>> response.xpath('//a[position()<=3]')
[<Selector xpath='//a[position()<=3]' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='//a[position()<=3]' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='//a[position()<=3]' data='<a href="image3.html">Name: My image 3 <'>]
# Select every div that has an id attribute
>>> response.xpath('//div[@id]')
[<Selector xpath='//div[@id]' data='<div id="images">\n\t\t\t<a href="image1.htm'>]
# Select every div whose id attribute equals "images"
>>> response.xpath('//div[@id="images"]')
[<Selector xpath='//div[@id="images"]' data='<div id="images">\n\t\t\t<a href="image1.htm'>]
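3.3.2 Common Functions
XPath also provides many functions, covering numbers, strings, node sets, and aggregation such as count() and sum(). (Date and time functions were added in XPath 2.0; Scrapy's selectors are built on lxml, which implements XPath 1.0, so they are not available here.) The examples above already used position() and last(). For reasons of space, only two very common string functions are introduced below.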
● string(arg): returns the string value of the argument.
>>> from scrapy.selector import Selector
>>> text = '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
>>> sel = Selector(text=text)
>>> sel
<Selector xpath=None data='<html><body><a href="#">Click here to go'>
# The following gives the same result as sel.xpath('/html/body/a/strong/text()')
>>> sel.xpath('string(/html/body/a/strong)').extract()
['Next Page']
# To get the whole string 'Click here to go to the Next Page' inside a,
# text() is not enough, because 'Click here to go to the ' and 'Next Page'
# belong to different elements.
# The following yields two substrings
>>> sel.xpath('/html/body/a//text()').extract()
['Click here to go to the ', 'Next Page']
# In this case, use the string() function
>>> sel.xpath('string(/html/body/a)').extract()
['Click here to go to the Next Page']
● contains(str1, str2): tests whether str1 contains str2 and returns a boolean.
>>> text = '''
... <div>
...     <p class="small info">hello world</p>
...     <p class="normal info">hello scrapy</p>
... </div>
... '''
>>> sel = Selector(text=text)
# Select p elements whose class attribute contains "small"
>>> sel.xpath('//p[contains(@class, "small")]')
[<Selector xpath='//p[contains(@class, "small")]' data='<p class="small info">hello world</p>'>]
# Select p elements whose class attribute contains "info"
>>> sel.xpath('//p[contains(@class, "info")]')
[<Selector xpath='//p[contains(@class, "info")]' data='<p class="small info">hello world</p>'>,
 <Selector xpath='//p[contains(@class, "info")]' data='<p class="normal info">hello scrapy</p>'>]
This concludes our introduction to XPath. For more detail, see the XPath specification: https://www.w3.org/TR/xpath/.