Python web scraping question

Basically, I have a large HTML document that I want to scrape. A very simplified example of a similar document looks like this:

<a name = 'ID_0'></a>
<span class='c2'>Date</span>
<span class='c2'>December 12,2005</span>
<span class='c2'>Source</span>
<span class='c2'>NY Times</span>
<span class='c2'>Author</span>
<span class='c2'>John</span>

<a name = 'ID_1'></a>
<span class='c2'>Date</span>
<span class='c2'>January 21,2008</span>
<span class='c2'>Source</span>
<span class='c2'>LA Times</span>

<a name = 'ID_2'></a>
<span class='c2'>Source</span>
<span class='c2'>Wall Street Journal</span>
<span class='c2'>Author</span>
<span class='c2'>Jane</span>
However, because some IDs are missing an author or a date, the scraper grabs the next available author or date from the following ID: ID_1 ends up with ID_2's author, ID_2 ends up with ID_3's date, and so on. My first thought was to somehow track the index of each tag and append null once the index passes the next 'a' tag's index. Is there a better solution?
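
A quick illustration of that failure mode, assuming the scraper advances with find_next() (as the answer below implies) and that the sample above is stored in a string named html_doc (hypothetical name):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

# ID_1 has no Author label, so searching forward from its anchor
# walks straight into ID_2's record
anchor = soup.find("a", {"name": "ID_1"})
label = anchor.find_next("span", string="Author")
print(label.find_next("span").text)  # prints "Jane" - ID_2's author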

I would use find_next_siblings() to collect all of the tags up to the next a link (or the end of the document), rather than find_next(). Roughly:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc: the document above (hypothetical name)

links = soup.find_all('a', {"name": True})
data = []
columns = {'Date', 'Source', 'Author'}

for link in links:
    item = [link["name"]]
    for elm in link.find_next_siblings():
        if elm.name == "a":
            break  # hit the next "a" element - this record is complete

        if elm.text in columns:
            # the value span immediately follows its label span
            item.append(elm.find_next().text)

    data.append(item)
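
If missing fields should be recorded explicitly, as the "append null" idea in the question suggests, one variation (my sketch, not part of the original answer) collects each record into a dict and backfills absent columns with None:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # same hypothetical html_doc as above
columns = {'Date', 'Source', 'Author'}

rows = []
for link in soup.find_all('a', {"name": True}):
    row = {'id': link["name"]}
    for elm in link.find_next_siblings():
        if elm.name == "a":
            break  # next record begins
        if elm.text in columns:
            row[elm.text] = elm.find_next().text
    for col in columns:
        row.setdefault(col, None)  # placeholder for fields this ID lacks
    rows.append(row)

# rows[1] == {'id': 'ID_1', 'Date': 'January 21,2008', 'Source': 'LA Times', 'Author': None}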

Alternatively, use lxml and XPath.
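
A minimal sketch of that approach (my code, assuming the same hypothetical html_doc string):

from lxml import html

tree = html.fromstring(html_doc)
data = []
columns = {'Date', 'Source', 'Author'}

for link in tree.xpath('//a[@name]'):
    item = [link.get('name')]
    # following-sibling::* yields the later siblings in document order;
    # stop as soon as the next <a> anchor begins a new record
    for elm in link.xpath('following-sibling::*'):
        if elm.tag == 'a':
            break
        if elm.text in columns:
            item.append(elm.getnext().text)  # value element right after the label
    data.append(item)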