Python: How to extract data from HTML using Beautiful Soup


python html web-scraping


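I am trying to scrape a web page and store the results in a CSV/Excel file. I am using Beautiful Soup.

I am trying to extract data from a soup with the find_all function, but I am not sure how to capture the data held in a field name or title.

The HTML has the following format: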

<h3 class="font20">
 <span itemprop="position">36.</span> 
 <a class="font20 c_name_head weight700 detail_page" 
 href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank" 
 title="Nimblechapps Pvt. Ltd."> 
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>

I tried the following:

Input: cont.h3.a.span
Output: <span itemprop="name">Nimblechapps Pvt. Ltd.</span>

I want to extract the company name, "Nimblechapps Pvt. Ltd."

You can use a list comprehension:

from bs4 import BeautifulSoup as BS
import requests

page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all('a' , attrs = {'class':'font20 c_name_head weight700 detail_page'})
print([n.text for n in names])

You will get:

['Nimblechapps Pvt. Ltd.', (..) , 'InnoApps Technologies Pvt. Ltd', 'Umbrella IT', 'iQlance Solutions', 'getyoteam', 'JetRuby Agency LTD.', 'ONLINICO', 'Dedicated Developers', 'Appingine', 'webnexs']
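Since the question also asks about storing the results in a CSV file, here is a minimal sketch using the standard csv module. The helper name write_names_csv, the "name" column header, and the file name companies.csv are assumptions made for illustration; the sample list is a shortened copy of the output above.

```python
import csv
import io


def write_names_csv(names, fileobj):
    """Write one company name per row, under a single 'name' header."""
    writer = csv.writer(fileobj)
    writer.writerow(["name"])
    for name in names:
        # Strip stray whitespace left over from the HTML.
        writer.writerow([name.strip()])


# Shortened sample of the scraped names shown above.
names = ['Nimblechapps Pvt. Ltd.', 'Umbrella IT', 'iQlance Solutions']

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_names_csv(names, buffer)
print(buffer.getvalue())

# To write a real file instead (companies.csv is a hypothetical name):
# with open("companies.csv", "w", newline="", encoding="utf-8") as f:
#     write_names_csv(names, f)
```

Passing a file object rather than a path lets the same helper write to a real file or to a buffer. A CSV file written this way can be opened directly in Excel.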

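The same thing, but using the descendant combinator (a space, " ") to combine the type selector a with the attribute=value selector [itemprop="name"], which gives the selector a [itemprop="name"].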
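To see what the descendant combinator does, here is a small offline sketch. The markup is a simplified version of the question's HTML, and the extra paragraph with a span outside any link is a hypothetical addition to show that it is not matched.

```python
from bs4 import BeautifulSoup

# Simplified markup based on the question; the <p> element is hypothetical.
html = """
<h3><span itemprop="position">36.</span>
  <a class="detail_page" href="/companies/view/1033/nimblechapps-pvt-ltd">
    <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
  </a>
</h3>
<p><span itemprop="name">Not inside a link</span></p>
"""

cont = BeautifulSoup(html, "html.parser")

# 'a [itemprop="name"]' matches elements with itemprop="name"
# that sit anywhere inside an <a> element.
names = [item.text.strip() for item in cont.select('a [itemprop="name"]')]
print(names)
```

Only the span nested inside the link is selected, so the position number and the span outside the link are both skipped.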

Try not to use compound classes in your script, because they break easily. The following script should also get you the required content:

import requests
from bs4 import BeautifulSoup

link = "https://www.goodfirms.co/directory/platform/app-development/iphone?page=2"

res = requests.get(link)
soup = BeautifulSoup(res.text, 'html.parser')
for items in soup.find_all(class_="commoncompanydetail"):
    names = items.find(class_='detail_page').text
    print(names)
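A short offline sketch of why a compound class string is fragile. The markup is hypothetical: it is the link from the question with its classes listed in a different order, as could happen after a site redesign.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes as in the question, different order.
html = (
    '<a class="detail_page font20 c_name_head weight700" href="#">'
    'Nimblechapps Pvt. Ltd.</a>'
)
soup = BeautifulSoup(html, "html.parser")

# A multi-class string is compared against the exact class attribute value,
# so a change in class order makes the search come back empty.
exact = soup.find_all(
    "a", attrs={"class": "font20 c_name_head weight700 detail_page"}
)

# A single class name matches regardless of the other classes or their order.
single = soup.find_all("a", class_="detail_page")

print(len(exact), len(single))
```

Matching on one stable class such as detail_page keeps working as long as that class stays on the element.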

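Post the code you have tried, and the specific problem with it.

@ScottHunter Done! Please check the edited version of the question.

Do you want cont.h3.a.span.text? To get a tag attribute, use tag[attr]; to get the tag text, use tag.text. Note that .find_all() returns a list of elements. If you only want the first one, use .find() or select by index.

Simple: select the text of each element, e.g. for tag in cont.find_all("span", itemprop="name"): print(tag.text)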
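The points made in these comments can be tried offline against the markup from the question. This is only a sketch; the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<h3 class="font20">
 <span itemprop="position">36.</span>
 <a class="font20 c_name_head weight700 detail_page"
    href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank"
    title="Nimblechapps Pvt. Ltd.">
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
 </a> </h3>
"""

cont = BeautifulSoup(html, "html.parser")

link = cont.h3.a                          # first <a> inside the first <h3>
name_from_text = link.span.text.strip()   # tag text, surrounding whitespace trimmed
name_from_attr = link["title"]            # tag attribute via tag[attr]
href = link["href"]

first_span = cont.find("span", itemprop="name")      # a single Tag, or None
all_spans = cont.find_all("span", itemprop="name")   # a list of Tags
names = [tag.text.strip() for tag in all_spans]

print(name_from_text, name_from_attr, href, names)
```

The company name is available both as the text of the nested span and as the title attribute of the link, so either route works for this markup.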

Code for the descendant combinator answer, using the selector a [itemprop="name"]:

names = [item.text for item in cont.select('a [itemprop="name"]')]
