Python: unable to access some reddit data through the scrapy shell

python web-scraping scrapy

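I am new to scrapy and am trying to scrape posts on reddit. To help with that, I have been working in the scrapy shell, but I am struggling to extract the posts.

I have looked at the source of the page I am working with and found the following data that I want to access:

class="usertext-body may-blank-within md-container">

It seems to me that the senator was using the term "alternative facts" the opposite way Conway used them. He used them to discredit e.t.c

Why do I get an empty array when I type response.xpath('//div[@class="md"]').extract()? I also get empty arrays when I try to access much of the other data on this page through the shell.

Thanks in advance.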

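If you want to access the text of each post, you can use the following xpath:

response.xpath('//form[contains(@id, "form-t1")]//div//div//p/text()').extract()

You can learn more about XPath here:

Finally, if you want to test your XPath, there is a very useful online tool for that: paste the HTML you want to parse into the text area on the left and the XPath into the one on the right. You can then easily test your code.

Hope this helps.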

Try this with both response.css and response.xpath, and avoid relying on the form id, because it seems to change:

>>> response.css('div.entry form div.usertext-body div.md p ::text').extract_first()
'It seems to me that the senator was using the term "alternative facts" the opposite way Conway used them. He used them to discredit the interpretation of said "facts" as lies, insisting that many of the homicides being counted as extra-judicial killings were just regular homicides.'
>>> 
>>> response.xpath("//div[contains(@class, 'entry')]/form/div/div/p[1]/text()").extract_first()
'It seems to me that the senator was using the term "alternative facts" the opposite way Conway used them. He used them to discredit the interpretation of said "facts" as lies, insisting that many of the homicides being counted as extra-judicial killings were just regular homicides.'
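The reason for avoiding the form id can be sketched offline. The example below is only an illustration under stated assumptions: the markup is a hypothetical, heavily simplified stand-in for the reddit page, the ids `form-t1_aaa111` and `form-t1_bbb222` are made up, and it uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) rather than Scrapy, so that it runs without a network request.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real reddit page is far larger.
PAGE_TEMPLATE = """
<html><body>
  <div class="entry">
    <form id="{form_id}">
      <div class="usertext-body">
        <div class="md"><p>First comment text.</p></div>
      </div>
    </form>
  </div>
</body></html>
"""


def comment_texts(markup, path):
    """Return the text of every <p> element matched by `path`."""
    root = ET.fromstring(markup.strip())
    return [p.text for p in root.findall(path)]


# Two "page loads" that differ only in the generated form id (made-up ids).
first_load = PAGE_TEMPLATE.format(form_id="form-t1_aaa111")
second_load = PAGE_TEMPLATE.format(form_id="form-t1_bbb222")

# Path tied to the exact id seen on the first load.
ID_PATH = ".//form[@id='form-t1_aaa111']/div/div/p"
# Path that follows the page structure and ignores the id.
STRUCTURE_PATH = ".//div[@class='entry']/form/div/div/p"

print(comment_texts(first_load, ID_PATH))          # matches on the first load
print(comment_texts(second_load, ID_PATH))         # empty once the id changes
print(comment_texts(second_load, STRUCTURE_PATH))  # still matches
```

In the scrapy shell the same idea applies to `response.xpath` and `response.css`: a selector that follows the page structure keeps working across reloads, while one copied with a generated id does not.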

Hey, thanks a lot for your help. When I type response.xpath('/*[@id="form-t1_dhbsy4cbgx"]/div/div//p') I still get an empty array.

I updated my answer. That xpath does not work because the site generates new tag ids every time the page loads.

Thanks a lot, this works for me. It seems I thought I knew more than I do. Do you know why using //div[@class='md'] does not work?

If you run //div[@class='md'] in the shell, you get the result you expect along with the html tags and other comments, which you need to filter out of the results; use text() to strip the tags. I just pointed the xpath at the result you expected.

Ah, yes. I was running into a problem because I used double quotes: response.xpath("//div[@class="md"]"). Thanks a ton for your help :)
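The double-quote problem from the last comment is a Python quoting issue rather than an XPath one. The sketch below uses only built-ins; the variable names are illustrative. It checks that the line with nested double quotes does not parse, and shows three ways to write the same expression that do.

```python
# Nested double quotes end the Python string early, so the line is
# rejected by the parser before Scrapy ever sees the XPath.
broken_line = 'response.xpath("//div[@class="md"]")'
try:
    compile(broken_line, "<shell>", "eval")
    parses = True
except SyntaxError:
    parses = False

# Ways to quote the same expression so that Python accepts it.
single_inside = "//div[@class='md']"    # single quotes inside double quotes
double_inside = '//div[@class="md"]'    # double quotes inside single quotes
escaped_inside = "//div[@class=\"md\"]"  # escaped double quotes

print(parses)
print(double_inside == escaped_inside)
```

XPath accepts either quote character around a string literal, so `single_inside` and `double_inside` select the same elements when passed to `response.xpath`.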