Python Scrapy: a complete guide to locating and filtering nodes with XPath

By 墨香-15607781945 · Published July 11, 2022 · Updated July 11, 2022

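Summary of XPath location expressions in Scrapy

Parent element: //DDD/parent::* selects the parent of every DDD node
Ancestor node: ancestor::BOOK[1] selects the BOOK ancestor nearest to the current context node
Child node: /child::AAA is equivalent to /AAA
Sibling node: following-sibling::div[@class='pct'] selects the following siblings that are div elements with class pct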

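Example:
current.xpath("ancestor::div[@class='pi']").xpath("following-sibling::div[@class='pct']").xpath("descendant::td[@class='t_f']/text()").extract()

This starts at the current node, finds its div[@class='pi'] ancestors, moves to their following div[@class='pct'] siblings, and returns the text content of the td[@class='t_f'] descendants of those siblings.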

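XPath filtering expressions in Python Scrapy

The uppercase element names below (NODE, A, BBB and so on) are placeholders. XPath is case-sensitive and parsed HTML element names are lowercase, so write //a rather than //A against real HTML pages.

1. //NODE[not(@class)] all NODE elements that have no class attribute

2. //NODE[@class and @id] all NODE elements that have both a class attribute and an id attribute

3. //NODE[contains(text(),"substring")] all NODE elements whose text contains substring. For example, //A[contains(text(),"下一页")] selects all links whose text contains "下一页" ("next page")

4. //A[contains(@title,"文章标题")] all links whose title attribute contains "文章标题" ("article title")

5. //NODE[@id="myid"]/text() all direct text children of the NODE elements whose id attribute is myid

6. BOOK[author/degree] all BOOK elements that contain an author child which itself has at least one degree child

7. AUTHOR[.="Matthew Bob"] all AUTHOR elements whose value is "Matthew Bob"

8. //*[count(BBB)=2] all elements that have exactly two BBB children

9. //*[count(*)=2] all elements that have exactly two child elements

10. //*[name()='BBB'] all elements named BBB, equivalent to //BBB

11. //*[starts-with(name(),'B')] all elements whose name starts with the letter B

12. //*[contains(name(),'C')] all elements whose name contains the letter C

13. //*[string-length(name()) = 3] all elements whose name is three characters long

14. //CCC | //BBB all CCC elements and all BBB elements

15. /child::AAA equivalent to /AAA

16. //CCC/descendant::* all elements that have a CCC ancestor

17. //DDD/parent::* the parent of every DDD element

18. //BBB[position() mod 2 = 0] the BBB elements at even positions

19. AUTHOR[not(last-name = "Bob")] all AUTHOR elements that have no last-name child with the value Bob

20. P/text()[2] the second text node of each P child of the current context node

21. ancestor::BOOK[1] the BOOK ancestor nearest to the current context node

22. //A[text()="next"] the A elements whose anchor text equals next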

Using CSS selectors in Python Scrapy

The response object exposes a Selector instance through its .selector attribute:

>>> response.selector.xpath('//span/text()').get()
'good'

Querying responses with XPath and CSS is so common that responses include two shortcuts, response.xpath() and response.css():

>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'

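Scrapy selectors are instances of the Selector class, constructed by passing either a TextResponse object or markup as a string (in the text argument).

You usually do not need to construct Scrapy selectors by hand: the response object is available in spider callbacks, so in most cases it is more convenient to use the response.css() and response.xpath() shortcuts. Using response.selector or one of these shortcuts also ensures that the response body is parsed only once.

If needed, though, Selector can be used directly. Constructing from text: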

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Constructing from a response. HtmlResponse is one of the TextResponse subclasses (an encoding is passed because body is a str here):
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').get()
'good'

Usage example

Sample HTML

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

Selector examples

Build an XPath that selects the text inside the title tag:
response.xpath('//title/text()')

To actually extract the text data, call the selector's .get() or .getall() method:
response.xpath('//title/text()').getall()
response.xpath('//title/text()').get()

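.get() always returns a single result: if there are several matches, it returns the content of the first one; if there are none, it returns None. (Its alias .extract_first() is common in older code.)
.getall() returns a list with all results.

Note that CSS selectors can select text or attribute nodes using CSS3 pseudo-element syntax: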
response.css('title::text').get()

As you can see, the .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used to quickly select nested data:
response.css('img').xpath('@src').getall()

Result:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

.get() returns None if no element is found:
response.xpath('//div[@id="not-exists"]/text()').get() is None
True

A default return value can be passed as an argument, to be used instead of None:
response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')

Instead of an XPath such as '@src', attributes can be read through the .attrib property of a Selector:
[img.attrib['src'] for img in response.css('img')]

As a shortcut, .attrib is also available directly on a SelectorList; it returns the attributes of the first matching element:
response.css('img').attrib['src']

This is most useful when only one result is expected, for example when selecting by id or selecting an element that is unique on the page:
response.css('base').attrib['href']

Now we get the base URL and some image links:
response.xpath('//base/@href').get()
response.css('base::attr(href)').get()
response.css('base').attrib['href']
response.xpath('//a[contains(@href, "image")]/@href').getall()
response.css('a[href*=image]::attr(href)').getall()
response.xpath('//a[contains(@href, "image")]/img/@src').getall()
response.css('a[href*=image] img::attr(src)').getall()

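Scrapy extensions to CSS selectors

To select text nodes, use ::text
To select attribute values, use ::attr(name), where name is the name of the attribute whose value you want
Warning: these pseudo-elements are specific to Scrapy/parsel. They will most likely not work with other libraries such as lxml or PyQuery.

Examples

title::text selects the child text nodes of a descendant <title> element: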
response.css('title::text').get()

*::text selects all descendant text nodes of the current selector context:
response.css('#images *::text').getall()

foo::text returns no results if the foo element exists but contains no text (i.e. its text is empty):
response.css('img::text').getall()

a::attr(href) selects the href attribute value of descendant links:
response.css('a::attr(href)').getall()

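Nesting selectors in Python Scrapy

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example: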
links = response.xpath('//a[contains(@href, "image")]')
links.getall()

for index, link in enumerate(links):
     href_xpath = link.xpath('@href').get()
     img_xpath = link.xpath('img/@src').get()
     print(f'Link number {index} points to url {href_xpath!r} and image {img_xpath!r}')

Selecting element attributes with Scrapy selectors

There are several ways to get the value of an attribute. First, you can use XPath syntax:
response.xpath("//a/@href").getall()

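The XPath syntax has a few advantages: it is a standard XPath feature, and @attributes can be used in other parts of an XPath expression, for example to filter by attribute value.

Scrapy also provides an extension to CSS selectors (::attr(...)) that returns attribute values: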
response.css('a::attr(href)').getall()

In addition, there is the .attrib property of Selector. Use it if you prefer to look up attributes in Python code rather than through XPath or CSS extensions:
[a.attrib['href'] for a in response.css('a')]
Result:
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

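This property is also available on SelectorList; it returns a dictionary with the attributes of the first matching element. It is convenient when a selector is expected to give a single result (for example, when selecting by element ID or selecting an element that is unique on the page):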
response.css('base').attrib
Result:
{'href': 'http://example.com/'}

The .attrib property of an empty SelectorList is an empty dictionary:
response.css('foo').attrib

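Using selectors with regular expressions in Python Scrapy

Selector also has a .re() method for extracting data with regular expressions. However, unlike the .xpath() and .css() methods, .re() returns a list of strings, so you cannot chain nested .re() calls.

Here is an example that extracts the image names from the HTML code above: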

response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')

There is also a helper named .re_first() that is to .re() what .get() (and its alias .extract_first()) is to the selection methods. Use it to extract only the first matching string:
response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')

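Selector examples on an HTML response in Python Scrapy

Here are some Selector examples that illustrate several concepts. In all cases, we assume a Selector has already been instantiated with an HtmlResponse object like this: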
sel = Selector(html_response)


Select all <h1> elements from the HTML response body, returning a list of Selector objects (i.e. a SelectorList object):
sel.xpath("//h1")

Extract the text of all <h1> elements, returning a list of strings:
sel.xpath("//h1").getall() # this includes the h1 tag
sel.xpath("//h1/text()").getall() # this excludes the h1 tag

Iterate over all <p> tags and print their class attribute:
for node in sel.xpath("//p"):
    print(node.attrib['class'])

Reference links:

https://www.csdn.net/tags/MtTaEg0sMDY5ODAwLWJsb2cO0O0O.html

http://www.xoxxoo.com/index/index/article/id/238.html
