Scraping data from a website into csv file format using python and beautifulsoup
I'm trying to get all the graphics card details into a csv file, but I'm unable to scrape the data (I'm doing this as a project to learn scraping). I'm new to python and html. I'm using the requests and beautifulsoup libraries.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=graphics+card&N=-1&isNodeId=1'
uClient = uReq(my_url)
Negg = uClient.read()
uClient.close()
Complete_Graphics_New_Egg = soup(Negg,"html.parser")
Container_Main = Complete_Graphics_New_Egg.findAll("div",{"class":"item-container"})
Container_Main5 = str(Container_Main[5])
path_file='C:\\Users\\HP\\Documents\\Python\\Container_Main5.txt'
file_1 = open(path_file,'w')
file_1.write(Container_Main5)
file_1.close()
##Container_Main_details = Container_Main5.a
#div class="item-badges"
Container_5_1 = str(Container_Main[5].findAll("ul",{"class":"item-features"}))
path_file='C:\\Users\\HP\\Documents\\Python\\Container_test_5_1.txt'
file_5_1 = open(path_file,'w')
file_5_1.write(Container_5_1)
file_5_1.close()
##Container_5_1.li  ## Container_5_1 is a plain str here, so .li would raise AttributeError
Container_5_2 = str(Container_Main[5].findAll("p",{"class":"item-promo"}))
path_file='C:\\Users\\HP\\Documents\\Python\\Container_test_5_2.txt'
file_5_2 = open(path_file,'w')
file_5_2.write(Container_5_2)
file_5_2.close()
##p class="item-promo"
##div class="item-info"
This should get you started. I'll also break it down for you so you can modify and experiment with it as you learn. I'd also recommend using Pandas: it's a popular data-manipulation library, and if you aren't using it already, you will be in the near future. First, I initialize a results dataframe to store all the data you're about to parse:
import bs4
import requests
import pandas as pd
results = pd.DataFrame()
Next, fetch the html from the site and pass it into BeautifulSoup:
my_url = 'https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=graphics+card&N=-1&isNodeId=1'
response = requests.get(my_url)
html = response.text
soup = bs4.BeautifulSoup(html, 'html.parser')
Then you have it find all the tags you're interested in. The only thing I added is having it iterate over each tag/element it finds:
Container_Main = soup.find_all("div",{"class":"item-container"})
for container in Container_Main:
Then within each container, grab the data you want from the item-features and item-promo. I store that data in a temporary dataframe (of 1 row), and then append it to my results dataframe. So after each iteration, the temporary dataframe gets overwritten with new information, but the results dataframe is not overwritten; rows just keep getting added to it.
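The per-container step described above can be sketched in isolation against a tiny hand-written HTML fragment (the fragment below is hypothetical; only the class names mirror Newegg's listing markup):

```python
import bs4
import pandas as pd

# Hypothetical fragment standing in for one Newegg item-container
html = """
<div class="item-container">
  <ul class="item-features">
    <li><strong>Max Resolution:</strong> 7680 x 4320</li>
    <li><strong>Chipset Manufacturer:</strong> NVIDIA</li>
  </ul>
  <p class="item-promo">Free shipping</p>
</div>
"""
container = bs4.BeautifulSoup(html, "html.parser").find("div", {"class": "item-container"})

temp_df = pd.DataFrame(index=[0])  # one-row frame for this container
for feature in container.find("ul", {"class": "item-features"}).find_all("li"):
    # each li reads "Header: value"; the header becomes a column name
    header, _, data = feature.text.partition(":")
    temp_df[header] = data.strip()
temp_df["promo"] = container.find("p", {"class": "item-promo"}).text

print(temp_df.columns.tolist())  # one column per feature header, plus 'promo'
```

With real pages the same loop body runs once per container found by `find_all`, and each one-row frame is appended to the growing results frame.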
Finally, save the dataframe to csv using pandas:
results.to_csv('path/file.csv', index=False)
So, the full code:
import bs4
import requests
import pandas as pd
results = pd.DataFrame()
my_url = 'https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=graphics+card&N=-1&isNodeId=1'
response = requests.get(my_url)
html = response.text
soup = bs4.BeautifulSoup(html, 'html.parser')
Container_Main = soup.find_all("div",{"class":"item-container"})
for container in Container_Main:
    item_features = container.find("ul", {"class": "item-features"})
    # if there are no item-features, move on to the next container
    if item_features is None:
        continue
    temp_df = pd.DataFrame(index=[0])
    features_list = item_features.find_all('li')
    for feature in features_list:
        split_str = feature.text.split(':')
        # skip entries that are not "Header: value" pairs
        if len(split_str) < 2:
            continue
        header = split_str[0]
        data = split_str[1].strip()
        temp_df[header] = data
    # guard against containers without a promo line
    promo = container.find("p", {"class": "item-promo"})
    temp_df['promo'] = promo.text if promo is not None else ''
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    results = pd.concat([results, temp_df], sort=False).reset_index(drop=True)
results.to_csv('path/file.csv', index=False)
Can you tell us what problem you're actually having? FYI, the word is scrape (scraping, scraped), not scrap.
Thanks, this is helpful. I'm trying to pull the titles of the graphics cards as well. I'll post how it goes.
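For the follow-up about pulling the titles: in Newegg's listing markup the product name sits in an `a` tag with class `item-title` inside each item-container (that class name is an assumption from the page at the time; verify it against the live HTML). A minimal sketch against a hand-written fragment:

```python
import bs4

# Hypothetical fragment; the item-title class is assumed from Newegg's markup
html = '<div class="item-container"><a class="item-title" href="#">EVGA GeForce RTX Example</a></div>'
container = bs4.BeautifulSoup(html, "html.parser").find("div", {"class": "item-container"})

# inside the answer's loop this would become: temp_df['title'] = title
title_tag = container.find("a", {"class": "item-title"})
title = title_tag.text if title_tag is not None else ''
print(title)
```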