Python Scrapy: problem scraping data that is commented out

python web-scraping scrapy

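After hours of troubleshooting, I finally worked out why I can't scrape this data: the most important part is commented out in the HTML, and JavaScript has to load it. Printing the response does show it, but Scrapy will not extract that data.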

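XPath has `comment()` to get comments, but it returns each comment as plain text. You have to remove the `<!--` and `-->` markers and parse the remaining text so that you can search in it as HTML. In Scrapy you can use the class `Selector()` to parse it.

Minimal working code: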

from scrapy.selector import Selector

sel = Selector(text='''
<div>
<!--
<div class="outer">
<div class="inner">Hello World</div>
</div>
-->
</div>''')

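# //comment() selects comment nodes; .get() returns the first one as a string
comment = sel.xpath('//comment()').get()
print(comment)

# Strip the leading '<!--' (4 characters) and the trailing '-->' (3 characters)
#html = comment.replace('<!--', '').replace('-->', '')
html = comment[4:-3]
print(html)

# Parse the extracted text as HTML so it can be queried like any other page
sel = Selector(text=html)

divs = sel.xpath('//div').getall()
print(divs)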

Result:

<!--
<div class="outer">
<div class="inner">Hello World</div>
</div>
-->

<div class="outer">
<div class="inner">Hello World</div>
</div>

['<div class="outer">\n<div class="inner">Hello World</div>\n</div>', '<div class="inner">Hello World</div>']
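The slice `comment[4:-3]` assumes the string really starts with `<!--` and ends with `-->`. As a minimal sketch, a small helper can check that first and leave other strings untouched. The function name `strip_comment_markers` and the sample string are made up for illustration; they are not part of Scrapy.

```python
def strip_comment_markers(comment: str) -> str:
    """Return the text between '<!--' and '-->'.

    Hypothetical helper: if the string is not wrapped in comment
    markers, it is returned unchanged instead of being sliced blindly.
    """
    if comment.startswith('<!--') and comment.endswith('-->'):
        # Drop the 4-character opening marker and the 3-character closing marker
        return comment[4:-3]
    return comment


# Sample string in the same shape that .xpath('//comment()').get() printed above
raw = '<!--\n<div class="inner">Hello World</div>\n-->'
inner = strip_comment_markers(raw)
print(repr(inner))
```

Unlike the `replace()` variant, slicing only touches the two ends of the string, so a `-->` that happens to appear inside the commented-out markup is left alone.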


What have you tried? Did you search Google for how to get comments from HTML, e.g. with XPath? I just searched and found that XPath has `comment()`. You will probably still need to parse the result as text, e.g. with the class `Selector()` in Scrapy or a module such as BeautifulSoup.

What does this line do? `html = comment[4:-3]`

It takes a substring: it drops the first 4 characters (`<!--`) and the last 3 characters (`-->`). BTW: next time just use `print()`. Run `print(comment)` and `print(comment[4:-3])`, compare the output, and you won't have to ask :)

My problem is that it isn't a single block; there are a dozen commented-out sections. But this pointed me in the right direction and I figured it out. Thanks!

If you use `.getall()` instead of `.get()`, you get a list with all the comments.