精通Scrapy网络爬虫 (Mastering Scrapy Web Crawling)

3.3 XPath

XPath (XML Path Language) is a language for addressing parts of an XML document.

An XML document (an HTML document can be handled in the same way) is a tree made up of nodes. For example:

        <html>
          <body>
              <div>
                <p>Hello world</p>
                <a href="/home">Click here</a>
              </div>
          </body>
        </html>

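The nodes of an XML document come in several types. The most commonly used are:

● Root node: the root of the whole document tree.

● Element nodes: html, body, div, p, a.

● Attribute nodes: href.

● Text nodes: Hello world, Click here.

The relationships between nodes are as follows:

● Parent/child: body is a child of html, and p and a are children of div. Conversely, div is the parent of p and a.

● Sibling: p and a are siblings.

● Ancestor/descendant: body, div, p, and a are all descendants of html. Conversely, html is an ancestor of body, div, p, and a.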
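
The node types and relationships above can be explored with a short script. The sketch below is only an illustration: it uses Python's standard-library xml.etree.ElementTree instead of Scrapy, and ElementTree exposes attributes and text as properties of an element (attrib, text) rather than as separate node objects. The variable names are chosen here for the example.

```python
import xml.etree.ElementTree as ET

doc = '''
<html>
  <body>
    <div>
      <p>Hello world</p>
      <a href="/home">Click here</a>
    </div>
  </body>
</html>
'''

# fromstring() returns the top-level element (html); in XPath terms the
# root node is the invisible parent of this element.
root = ET.fromstring(doc)

# Parent/child: the element children of div are p and a.
div = root.find('body/div')
children = [child.tag for child in div]

# Ancestor/descendant: every element below html, in document order.
descendants = [el.tag for el in root.iter() if el is not root]

# Attribute and text of the a element.
a = div.find('a')

# ".." steps from a back up to its parent.
parent_of_a = root.find('.//a/..')

print(children)         # ['p', 'a']
print(descendants)      # ['body', 'div', 'p', 'a']
print(a.attrib, a.text)
print(parent_of_a.tag)  # div
```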

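3.3.1 Basic Syntax

Table 3-1 lists the basic XPath syntax in common use.

Table 3-1 Commonly used basic XPath syntax

Next, we demonstrate how XPath is used through some examples.

First, create an HTML document for the demonstration and use it to construct an HtmlResponse object: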

    >>> from scrapy.selector import Selector
    >>> from scrapy.http import HtmlResponse
    >>> body = '''
    ... <html>
    ...    <head>
    ...        <base href='http://example.com/'/>
    ...        <title>Example website</title>
    ...    </head>
    ...    <body>
    ...        <div id='images'>
    ...           <a href='image1.html'>Name: My image 1 <br/><img src='image1.jpg'/></a>
    ...           <a href='image2.html'>Name: My image 2 <br/><img src='image2.jpg'/></a>
    ...           <a href='image3.html'>Name: My image 3 <br/><img src='image3.jpg'/></a>
    ...           <a href='image4.html'>Name: My image 4 <br/><img src='image4.jpg'/></a>
    ...           <a href='image5.html'>Name: My image 5 <br/><img src='image5.jpg'/></a>
    ...        </div>
    ...    </body>
    ... </html>
    ... '''
    >>> response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')

● /: describes an absolute path starting from the root.

    >>> response.xpath('/html')
    [<Selector xpath='/html' data='<html>\n\t<head>\n\t\t<base href="http://exam'>]
    >>> response.xpath('/html/head')
    [<Selector xpath='/html/head' data='<head>\n\t\t<base href="http://example.com/'>]

● E1/E2: selects all E2 among the child nodes of E1.

    # Select all a among the children of div
    >>> response.xpath('/html/body/div/a')
    [<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name: My image 1 <'>,
     <Selector xpath='/html/body/div/a' data='<a href="image2.html">Name: My image 2 <'>,
     <Selector xpath='/html/body/div/a' data='<a href="image3.html">Name: My image 3 <'>,
     <Selector xpath='/html/body/div/a' data='<a href="image4.html">Name: My image 4 <'>,
     <Selector xpath='/html/body/div/a' data='<a href="image5.html">Name: My image 5 <'>]

● //E: selects all E in the document, wherever they are.

    # Select all a in the document
    >>> response.xpath('//a')
    [<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>,
     <Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>,
     <Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>,
     <Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>,
     <Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]

● E1//E2: selects all E2 among the descendants of E1, at any depth.

    # Select all img among the descendants of body
    >>> response.xpath('/html/body//img')
    [<Selector xpath='/html/body//img' data='<img src="image1.jpg">'>,
     <Selector xpath='/html/body//img' data='<img src="image2.jpg">'>,
     <Selector xpath='/html/body//img' data='<img src="image3.jpg">'>,
     <Selector xpath='/html/body//img' data='<img src="image4.jpg">'>,
     <Selector xpath='/html/body//img' data='<img src="image5.jpg">'>]

● E/text(): selects the text child nodes of E.

    # Select the text of all a
    >>> sel = response.xpath('//a/text()')
    >>> sel
    [<Selector xpath='//a/text()' data='Name: My image 1 '>,
     <Selector xpath='//a/text()' data='Name: My image 2 '>,
     <Selector xpath='//a/text()' data='Name: My image 3 '>,
     <Selector xpath='//a/text()' data='Name: My image 4 '>,
     <Selector xpath='//a/text()' data='Name: My image 5 '>]
    >>> sel.extract()
    ['Name: My image 1 ',
     'Name: My image 2 ',
     'Name: My image 3 ',
     'Name: My image 4 ',
     'Name: My image 5 ']

● E/*: selects all element children of E.

    # Select all element children of html
    >>> response.xpath('/html/*')
    [<Selector xpath='/html/*' data='<head>\n\t\t<base href="http://example.com/'>,
     <Selector xpath='/html/*' data='<body>\n\t\t<div id="images">\n\t\t\t<a href="i'>]

    # Select all descendant elements of div
    >>> response.xpath('/html/body/div//*')
    [<Selector xpath='/html/body/div//*' data='<a href="image1.html">Name: My image 1 <'>,
     <Selector xpath='/html/body/div//*' data='<br>'>,
     <Selector xpath='/html/body/div//*' data='<img src="image1.jpg">'>,
     <Selector xpath='/html/body/div//*' data='<a href="image2.html">Name: My image 2 <'>,
     <Selector xpath='/html/body/div//*' data='<br>'>,
     <Selector xpath='/html/body/div//*' data='<img src="image2.jpg">'>,
     <Selector xpath='/html/body/div//*' data='<a href="image3.html">Name: My image 3 <'>,
     <Selector xpath='/html/body/div//*' data='<br>'>,
     <Selector xpath='/html/body/div//*' data='<img src="image3.jpg">'>,
     <Selector xpath='/html/body/div//*' data='<a href="image4.html">Name: My image 4 <'>,
     <Selector xpath='/html/body/div//*' data='<br>'>,
     <Selector xpath='/html/body/div//*' data='<img src="image4.jpg">'>,
     <Selector xpath='/html/body/div//*' data='<a href="image5.html">Name: My image 5 <'>,
     <Selector xpath='/html/body/div//*' data='<br>'>,
     <Selector xpath='/html/body/div//*' data='<img src="image5.jpg">'>]

● */E: selects all E among the grandchildren.

    # Select all img among the grandchildren of div
    >>> response.xpath('//div/*/img')
    [<Selector xpath='//div/*/img' data='<img src="image1.jpg">'>,
     <Selector xpath='//div/*/img' data='<img src="image2.jpg">'>,
     <Selector xpath='//div/*/img' data='<img src="image3.jpg">'>,
     <Selector xpath='//div/*/img' data='<img src="image4.jpg">'>,
     <Selector xpath='//div/*/img' data='<img src="image5.jpg">'>]

● E/@ATTR: selects the ATTR attribute of E.

    # Select the src attribute of all img
    >>> response.xpath('//img/@src')
    [<Selector xpath='//img/@src' data='image1.jpg'>,
     <Selector xpath='//img/@src' data='image2.jpg'>,
     <Selector xpath='//img/@src' data='image3.jpg'>,
     <Selector xpath='//img/@src' data='image4.jpg'>,
     <Selector xpath='//img/@src' data='image5.jpg'>]

● //@ATTR: selects every ATTR attribute in the document.

    # Select all href attributes
    >>> response.xpath('//@href')
    [<Selector xpath='//@href' data='http://example.com/'>,
     <Selector xpath='//@href' data='image1.html'>,
     <Selector xpath='//@href' data='image2.html'>,
     <Selector xpath='//@href' data='image3.html'>,
     <Selector xpath='//@href' data='image4.html'>,
     <Selector xpath='//@href' data='image5.html'>]

● E/@*: selects all attributes of E.

    # Get all attributes of the img under the first a (here there is only one, src)
    >>> response.xpath('//a[1]/img/@*')
    [<Selector xpath='//a[1]/img/@*' data='image1.jpg'>]

● .: selects the current node; used to describe relative paths.

    # Get the selector object for the first a
    >>> sel = response.xpath('//a')[0]
    >>> sel
    <Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>
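    # Suppose we want all img among the descendants of this a. The following is wrong:
    # it finds every img in the document,
    # because //img is an absolute path that searches from the document root, not from the current a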
    >>> sel.xpath('//img')
    [<Selector xpath='//img' data='<img src="image1.jpg">'>,
     <Selector xpath='//img' data='<img src="image2.jpg">'>,
     <Selector xpath='//img' data='<img src="image3.jpg">'>,
     <Selector xpath='//img' data='<img src="image4.jpg">'>,
     <Selector xpath='//img' data='<img src="image5.jpg">'>]
    # Use .//img to describe all img among the descendants of the current node
    >>> sel.xpath('.//img')
    [<Selector xpath='.//img' data='<img src="image1.jpg">'>]

● ..: selects the parent of the current node; used to describe relative paths.

    # Select the parent nodes of all img
    >>> response.xpath('//img/..')
    [<Selector xpath='//img/..' data='<a href="image1.html">Name: My image 1 <'>,
     <Selector xpath='//img/..' data='<a href="image2.html">Name: My image 2 <'>,
     <Selector xpath='//img/..' data='<a href="image3.html">Name: My image 3 <'>,
     <Selector xpath='//img/..' data='<a href="image4.html">Name: My image 4 <'>,
     <Selector xpath='//img/..' data='<a href="image5.html">Name: My image 5 <'>]

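● node[predicate]: a predicate is used to find a specific node or a node that contains a specific value.

        # Select the 3rd a (here, the 3rd a child of its parent div)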
        >>> response.xpath('//a[3]')
        [<Selector xpath='//a[3]' data='<a href="image3.html">Name: My image 3 <'>]

        # Use the last() function to select the last one
        >>> response.xpath('//a[last()]')
        [<Selector xpath='//a[last()]' data='<a href="image5.html">Name: My image 5 <'>]

        # Use the position() function to select the first 3
        >>> response.xpath('//a[position()<=3]')
        [<Selector xpath='//a[position()<=3]' data='<a href="image1.html">Name: My image 1 <'>,
         <Selector xpath='//a[position()<=3]' data='<a href="image2.html">Name: My image 2 <'>,
         <Selector xpath='//a[position()<=3]' data='<a href="image3.html">Name: My image 3 <'>]

        # Select all div that have an id attribute
        >>> response.xpath('//div[@id]')
        [<Selector xpath='//div[@id]' data='<div id="images">\n\t\t\t<a href="image1.htm'>]

        # Select all div whose id attribute has the value "images"
        >>> response.xpath('//div[@id="images"]')
        [<Selector xpath='//div[@id="images"]' data='<div id="images">\n\t\t\t<a href="image1.htm'>]

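3.3.2 Common Functions

XPath also provides many functions for working with numbers, strings, times, dates, aggregates, and so on. Note that the date and time functions belong to XPath 2.0 and later, while Scrapy's selectors are built on lxml, which implements XPath 1.0. In the examples above we have already used the functions position() and last(). Because space is limited, only two very commonly used string functions are introduced below.

● string(arg): returns the string value of the argument.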

        >>> from scrapy.selector import Selector
        >>> text='<a href="#">Click here to go to the <strong>Next Page</strong></a>'
        >>> sel = Selector(text=text)
        >>> sel
        <Selector xpath=None data='<html><body><a href="#">Click here to go'>
        # The following gives the same result as sel.xpath('/html/body/a/strong/text()')
        >>> sel.xpath('string(/html/body/a/strong)').extract()
        ['Next Page']
        # To get the whole string 'Click here to go to the Next Page' inside a,
        # text() will not do, because 'Click here to go to the ' and 'Next Page' belong to different elements.
        # The following returns two substrings
        >>> sel.xpath('/html/body/a//text()').extract()
        ['Click here to go to the ', 'Next Page']
        # In this case the string() function can be used
        >>> sel.xpath('string(/html/body/a)').extract()
        ['Click here to go to the Next Page']

● contains(str1, str2): tests whether str1 contains str2 and returns a boolean.

        >>> text = '''
        ... <div>
        ...    <p class="small info">hello world</p>
        ...    <p class="normal info">hello scrapy</p>
        ... </div>
        ... '''
        >>> sel = Selector(text=text)
        >>> sel.xpath('//p[contains(@class, "small")]')  # select p elements whose class attribute contains "small"
        [<Selector xpath='//p[contains(@class, "small")]' data='<p class="small info">hello world</p>'>]
        >>> sel.xpath('//p[contains(@class, "info")]')  # select p elements whose class attribute contains "info"
        [<Selector xpath='//p[contains(@class, "info")]' data='<p class="small info">hello world</p>'>,
         <Selector xpath='//p[contains(@class, "info")]' data='<p class="normal info">hello scrapy</p>'>]

That concludes this introduction to XPath. For more detail, see the XPath specification: https://www.w3.org/TR/xpath/