Python web scraping with Beautiful Soup: assigning h4 header ids to list elements

python web-scraping

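I'm doing some web scraping. The page has several h4 tags, each followed by a list. I want to scrape the elements of each list and associate them with the id of the h4 tag above it. Here is the HTML: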

<h4 class="dataHeaderWithBorder" id="Production" name="production">Production</h4>
<ul class="simpleList">
<li><a href="/company/co0308?ref_=xtco_co_1">Red Claw </a></li>
<li><a href="/company/co0386?ref_=xtco_co_2">Haven </a></li>
<li><a href="/company/co0487?ref_=xtco_co_3">Frame</a></li>
</ul>
<h4 class="dataHeaderWithBorder" id="Distribution" name="Distribution">Distribution</h4>
<ul class="simpleList">
<li><a href="/company/co0017?ref_=xtco_co_1">Broadside Attractions</a> </li>
<li><a href="/company/co0208?ref_=xtco_co_2"> Global Acquisitions</a></li>
</ul>

I can get all the elements of both lists, but I can't get the id. My code looks like this:

for h4 in soup.find_all('h4', attrs={'class':'dataHeaderWithBorder'}):
    id = h4.get_text()
    #print(id)
    for ul in h4.find_all('ul', attrs={'class':'simpleList'}):
        #print(ul)
        # Find the items that mention a budget
        productionCompany = ul.find_all('a')
        for company in productionCompany:
            text = company.get_text()
            print(id, text)
            productionComps.append(id, text)

I don't know how to get the id from each h4 tag. If I delete the first two lines and replace h4.find_all with soup.find_all, my output looks like this:

Red Claw
Haven
Frame
Broadside Attractions
Global Acquisition

id is not the element's text; it is an attribute. In Beautiful Soup, an element's attributes can be accessed like a dictionary. Try this:

item_id = h4['id']
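
For context, here is a minimal sketch of how that attribute lookup might fit into the original loop. It assumes the markup from the question is held in a string named `html` (a name chosen here for illustration, with the markup shortened) and that each `ul` follows its `h4` as a sibling. That sibling relationship is also the likely reason `h4.find_all('ul')` finds nothing: the list is not nested inside the header, so the sketch uses `find_next_sibling` to reach it.

```python
from bs4 import BeautifulSoup

# Assumption: `html` holds a shortened copy of the markup from the question.
html = """
<h4 class="dataHeaderWithBorder" id="Production" name="production">Production</h4>
<ul class="simpleList">
<li><a href="/company/co0308?ref_=xtco_co_1">Red Claw </a></li>
<li><a href="/company/co0386?ref_=xtco_co_2">Haven </a></li>
</ul>
<h4 class="dataHeaderWithBorder" id="Distribution" name="Distribution">Distribution</h4>
<ul class="simpleList">
<li><a href="/company/co0017?ref_=xtco_co_1">Broadside Attractions</a> </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
production_comps = []

for h4 in soup.find_all("h4", class_="dataHeaderWithBorder"):
    # Dictionary-style attribute access; h4.get("id") would return None if absent.
    item_id = h4["id"]
    # The <ul> is a sibling of the <h4>, not a child, so search sideways.
    ul = h4.find_next_sibling("ul", class_="simpleList")
    if ul is None:
        continue
    for a in ul.find_all("a"):
        # strip=True removes the stray whitespace around the link text.
        production_comps.append((item_id, a.get_text(strip=True)))

print(production_comps)
```

One caveat with this sketch: if a header had no list of its own, `find_next_sibling` would pick up the list belonging to a later header, so it suits pages where every `h4` is followed by its own `ul`.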

You can use itertools.groupby:

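from itertools import groupby
from bs4 import BeautifulSoup as soup
import re

# data is the HTML string shown in the question
# Collect [tag name, text] pairs for every h4 and a tag, in document order
d = [[i.name, i.text] for i in soup(data, 'html.parser').find_all(re.compile('h4|a'))]
# Split into alternating runs: one h4 run, then the run of links below it
new_d = [list(b) for _, b in groupby(d, key=lambda x:x[0] == 'h4')]
# Pair each header's text with the link texts from the run that follows it
grouped = [[new_d[i][0][-1], [a for _, a in new_d[i+1]]] for i in range(0, len(new_d), 2)]
result = '\n'.join('\n'.join(f'{a}, {i}' for i in b) for a, b in grouped)
print(result)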

Output:

Production, Red Claw
Production, Haven
Production, Frame
Distribution, Broadside Attractions
Distribution, Global Acquisitions
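
Since the groupby one-liners above are dense, here is a step-by-step sketch of the same idea on a hand-written list rather than parsed HTML. `groupby` merges only consecutive items that share a key, so keying on "is this an h4?" yields alternating runs: a header run, then the run of links under it. The list `d` and the names `runs`, `header`, and `companies` are illustrative, not part of the original answer.

```python
from itertools import groupby

# Illustrative [tag name, text] pairs in document order (hand-written, not scraped).
d = [
    ["h4", "Production"],
    ["a", "Red Claw"],
    ["a", "Haven"],
    ["h4", "Distribution"],
    ["a", "Broadside Attractions"],
]

# groupby merges only consecutive items with equal keys, so the result
# alternates between a header run and the run of links that follows it.
runs = [list(group) for _, group in groupby(d, key=lambda pair: pair[0] == "h4")]

# Walk the runs two at a time: header run at index i, link run at index i + 1.
grouped = [
    (runs[i][0][-1], [text for _, text in runs[i + 1]])
    for i in range(0, len(runs), 2)
]

for header, companies in grouped:
    for company in companies:
        print(f"{header}, {company}")
```

The pairing by index assumes the sequence starts with a header and that every header is followed by at least one link; two adjacent headers would merge into a single run and throw the pairing off.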

Use zip:

h4_list = soup.find_all('h4', attrs={'class':'dataHeaderWithBorder'})
ul_list = soup.find_all('ul', attrs={'class':'simpleList'})
productionComps = []
# Pair each header with the list at the same position
for h4, ul in zip(h4_list, ul_list):
    id_ = h4.get_text()  # header text; h4['id'] gives the id attribute itself
    productionCompany = ul.find_all('a')
    for company in productionCompany:
        text = company.get_text()
        print(id_, text)
        productionComps.append((id_, text))
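
As a possible variation, if a mapping from each header id to its companies is more convenient than a flat list of pairs, the zipped pairs can feed a dict comprehension. This is only a sketch: the `html` string is a shortened copy of the question's markup, and `companies_by_id` is a name introduced here. Note that `zip` pairs purely by position, so this relies on every `h4` having exactly one matching `ul`, in the same order.

```python
from bs4 import BeautifulSoup

# Assumption: shortened copy of the markup from the question.
html = """
<h4 class="dataHeaderWithBorder" id="Production" name="production">Production</h4>
<ul class="simpleList">
<li><a href="/company/co0308?ref_=xtco_co_1">Red Claw </a></li>
<li><a href="/company/co0386?ref_=xtco_co_2">Haven </a></li>
</ul>
<h4 class="dataHeaderWithBorder" id="Distribution" name="Distribution">Distribution</h4>
<ul class="simpleList">
<li><a href="/company/co0208?ref_=xtco_co_2"> Global Acquisitions</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
h4_list = soup.find_all("h4", class_="dataHeaderWithBorder")
ul_list = soup.find_all("ul", class_="simpleList")

# Key on the id attribute; strip=True trims whitespace around each link text.
companies_by_id = {
    h4["id"]: [a.get_text(strip=True) for a in ul.find_all("a")]
    for h4, ul in zip(h4_list, ul_list)
}

print(companies_by_id)
```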

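Thanks, this was helpful, but it doesn't solve the problem.
Wow, super simple. Thank you so much.
This works, but it's really complicated and I can't explain why it works. Thanks for your help.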