Python BeautifulSoup: handling a table-like website structure and returning a dictionary

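I have some HTML that looks somewhat like a dictionary:

Manufacturer website: URL

Headquarters: location, etc.

Each section is contained in its own div (so findAll with the div class name).

Is there an elegant and simple way to extract this kind of markup into a dictionary? Or do I have to iterate over each div, find the two text items, and assume that the first is the dictionary key and the second is the value for that same entry?

Sample site code: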

    car = '''
     <div class="info flexbox">
       <div class="infoEntity">
        <span class="manufacturer website">
         <a class="link" href="http://www.ford.com" rel="nofollow noreferrer" target="_blank">
          www.ford.com
         </a>
        </span>
       </div>
       <div class="infoEntity">
        <label>
         Headquarters
        </label>
        <span class="value">
         Dearbord, MI
        </span>
       </div>
       <div class="infoEntity">
        <label>
         Model
        </label>
        <span class="value">
         Mustang
        </span>
       </div>
    '''

    from bs4 import BeautifulSoup

    car_soup = BeautifulSoup(car, 'lxml')
    print(car_soup.prettify())

    # Each key/value pair lives in its own div with class "infoEntity".
    elements = car_soup.findAll('div', class_='infoEntity')
    for x in elements:
        # From here, iterate over x with BeautifulSoup to find the value of each element.
        print(x)

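By the way, I have already done this the inelegant way a few times. I am just wondering whether I am missing something and whether there is a better way to do it. Thanks in advance.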

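The current HTML structure is quite generic: it contains multiple infoEntity divs, and their child content can be formatted in several ways. To handle this, you can iterate over the infoEntity divs and apply a formatting rule to each child, as follows: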

    from bs4 import BeautifulSoup as soup

    result, label = {}, None
    for i in soup(car, 'html.parser').find_all('div', {'class': 'infoEntity'}):
        for b in i.find_all(['span', 'label']):
            if b.name == 'label':
                # Remember the label text; it becomes the key for the next span.
                label = b.get_text(strip=True)
            elif b.name == 'span' and label is not None:
                # A span that follows a label holds that label's value.
                result[label] = b.get_text(strip=True)
                label = None
            else:
                # A span with no preceding label: use its class names as the key.
                result[' '.join(b['class'])] = b.get_text(strip=True)

Output:

    {'manufacturer website': 'www.ford.com', 'Headquarters': 'Dearbord, MI', 'Model': 'Mustang'}
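
The pairing idea described in the question (the first text item is the key, the second is the value) can also be sketched with stripped_strings, which yields the whitespace-stripped text fragments of a tag. This is only a rough sketch: the names entities_to_dict and SAMPLE_HTML and the "unlabeled" fallback key are hypothetical, the sample markup is a shortened copy of the one in the question, and a div with a single text item is assumed to take its key from the span's class names.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (hypothetical name).
SAMPLE_HTML = '''
<div class="info flexbox">
  <div class="infoEntity">
    <span class="manufacturer website">
      <a class="link" href="http://www.ford.com">www.ford.com</a>
    </span>
  </div>
  <div class="infoEntity">
    <label>Headquarters</label>
    <span class="value">Dearbord, MI</span>
  </div>
  <div class="infoEntity">
    <label>Model</label>
    <span class="value">Mustang</span>
  </div>
</div>
'''

def entities_to_dict(html, parser="html.parser"):
    """Map each infoEntity div to one key/value pair (illustrative helper)."""
    soup = BeautifulSoup(html, parser)
    result = {}
    for entity in soup.find_all("div", class_="infoEntity"):
        texts = list(entity.stripped_strings)
        if len(texts) >= 2:
            # First text item is the key; the rest is joined into the value.
            key, value = texts[0], " ".join(texts[1:])
        elif len(texts) == 1:
            # No label present: fall back to the span's class names as the key.
            span = entity.find("span")
            key = " ".join(span.get("class", [])) if span else "unlabeled"
            value = texts[0]
        else:
            continue  # Skip divs that contain no text at all.
        result[key] = value
    return result

print(entities_to_dict(SAMPLE_HTML))
```

Because this relies only on text order, it would mis-pair fields if a div ever contained extra text nodes, so the label-based approaches are probably safer for less regular pages.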

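Alternatively, to make things more generic and simpler, you can split the field handling into two parts: the labelled fields and the manufacturer website link: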

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(car, 'lxml')

    car_info = soup.select_one('.info')
    # Labelled fields: the label text is the key, the next sibling tag holds the value.
    data = {
        label.get_text(strip=True): label.find_next_sibling().get_text(strip=True)
        for label in car_info.select('.infoEntity label')
    }
    # The website entry has no label, so it is handled separately.
    data['manufacturer website'] = car_info.select_one('.infoEntity a').get_text(strip=True)

    print(data)

Prints:

    {'Headquarters': 'Dearbord, MI',
     'Model': 'Mustang',
     'manufacturer website': 'www.ford.com'}
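
The question describes the website field as a URL, while both answers store the link text. If the actual address is wanted, a variation along the following lines might work. It is a sketch under assumptions: the sample markup is a shortened copy of the one in the question, html.parser is used only to avoid the lxml dependency, and the None checks are an added precaution for pages where a label has no value span or the link is missing.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
car = '''
<div class="info flexbox">
  <div class="infoEntity">
    <span class="manufacturer website">
      <a class="link" href="http://www.ford.com">www.ford.com</a>
    </span>
  </div>
  <div class="infoEntity">
    <label>Headquarters</label>
    <span class="value">Dearbord, MI</span>
  </div>
  <div class="infoEntity">
    <label>Model</label>
    <span class="value">Mustang</span>
  </div>
</div>
'''

soup = BeautifulSoup(car, "html.parser")

data = {}
for label in soup.select(".infoEntity label"):
    # Only record the field if a value span actually follows the label.
    value = label.find_next_sibling("span")
    if value is not None:
        data[label.get_text(strip=True)] = value.get_text(strip=True)

# Store the href attribute rather than the visible link text.
link = soup.select_one(".infoEntity a[href]")
if link is not None:
    data["manufacturer website"] = link["href"]

print(data)
```

With the sample above, the website entry would hold http://www.ford.com instead of www.ford.com; the other fields are unchanged.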

Thank you both for your replies. The short one looks really elegant! Now I just need to reimplement it throughout my code :)