Python: extracting all text under an h2 tag with scrapy

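I need to find an h2 tag with a specific value and extract all the text that follows it, up to the next h2 tag or the end of the page. So if the page is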

<h1 id="DDPSupport-InternalResources"><span style="color: rgb(0,51,102);"><strong>Internal Resources</strong></span></h1>
<h2 id="DDPSupport-GeneralInformation">General Information</h2>
<ul><li><a href="/display/ladtechtme/DDP+overview">DDP overview</a></li>
<li><a href="/display/ladtechtme/DDP+Configuration+guide">DDP Config guide</a></li>
<li><a href="/pages/viewpage.action?pageId=1338281922">Custom DPR</a></li>
<li><a href="/display/ladtechtme/Build+custom+package">Build custom package</a></li>
<li><a href="/display/ladtechtme/Unit+testing">Unit testing</a></li>
<li><a href="/display/ladtechtme/FAQ">FAQ </a></li>
<li><a href="/display/ladtechtme/Misc+BKMs">Misc BKMs</a></li></ul>
<h2 id="DDPSupport-UseCases">Use Cases</h2>
<ul><li><a href="/pages/viewpage.action?pageId=1338281922">Custom DPR </a></li>...
I am using the following code:

for head in response.xpath("//div[@class='wiki-content']/h2"):
   sub=str(head.xpath("text()").extract())
   sub = sub.replace("[","")
   sub = sub.replace("'","")
   sub = sub.replace("]","")
   if sub == 'General Information':
        lines = head.xpath("//following-sibling::*[count(following-sibling::h2)=1]//text()").extract()
        print(str(lines))
I get some results, but not the ones I want: my output consists of the text under the next h2 tag.
Any help would be appreciated.

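I see that you are using // instead of / in head.xpath(), which is probably unintentional and the reason you get unexpected results.

@Gallaecio Thanks. I checked, and it still does not give the desired result. I even tried
//following-sibling::*[count(following-sibling::h2[text()=sub_next])///text()
where sub_next is 'Use Cases', just to try something different. Still the same.

//following-sibling::ul[1]//text()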