How to fix scraping a web table to CSV output using Python and bs4
Please help. I want to get two pieces of data from the td elements, "Barcode" and "Nama Produk", but the data I get is very messy. What should I fix?
import csv
import requests
from bs4 import BeautifulSoup

outfile = open("dataaa.csv","w",newline='')
writer = csv.writer(outfile)
page = 0
while page < 3 :
    url = "http://ciumi.com/cspos/barcode-ritel.php?page={:d}".format(page)
    response = requests.get(url)
    tree = BeautifulSoup(response.text, 'html.parser')
    page += 1
    table_tag = tree.select("table")[0]
    tab_data = [[item.text for item in row_data.select("tr")]
                for row_data in table_tag.select("td")]
    for data in tab_data:
        writer.writerow(data)
    print(table_tag)
    print(response, url, ' '.join(data))

import fileinput
seen = set()
for line in fileinput.FileInput('dataaa.csv', inplace=1):
    if line in seen: continue
    seen.add(line)
    print (line)
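What do I need to improve to get a clean result?

You can use pandas to simplify this. Incidentally, pandas.read_html relies on an HTML parser such as lxml or BeautifulSoup under the hood to parse the tables: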
import pandas as pd

frames = []
for page in range(1, 3):
    url = 'http://ciumi.com/cspos/barcode-ritel.php?page=%s' % page
    # read_html returns a list of DataFrames; the first table holds the data
    frames.append(pd.read_html(url)[0])

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
results_df = pd.concat(frames, ignore_index=True)
results_df.columns = ['Barcode', 'Nama Produk']
results_df.to_csv('dataaa.csv', index=False)
Output:
print (results_df)
Barcode Nama Produk
0 8992694242533 ZWITSAL SOAP 80G PACK 4
1 8992694247163 ZWITSAL SOAP 80G MILK&HONEY
2 8992694242502 ZWITSAL SOAP 80G CLASSIC
3 8992694245435 ZWITSAL SKIN GUARD LOT 100ML SPRAY
4 8992694246074 ZWITSAL SHP 600ML C&R
5 8992694242908 ZWITSAL SHP 50ML REBORN
6 8992694020025 ZWITSAL SHP 500ML REF AVKS
7 8992694246333 ZWITSAL SHP 500ML C&R REF
8 8992694246364 ZWITSAL SHP 300ML AVKS
9 8992694246319 ZWITSAL SHP 250ML REF CLEAN&R
10 8992694246357 ZWITSAL SHP 250ML REF AVKS
11 8992694242922 ZWITSAL SHP 200ML REBORN
12 8992694242915 ZWITSAL SHP 100ML CLASSIC
13 8992694246340 ZWITSAL SHP 100ML AVKS
14 8992694242601 ZWITSAL PWD 50G SOFTFLOWER
15 8992694244254 ZWITSAL PWD 50G FRESH
16 8992694242656 ZWITSAL PWD 500G SOFTFLORAL
17 8992694241055 ZWITSAL PWD 500G FRESH F
18 8992694244056 ZWITSAL PWD 300G SOFT FLORAL
19 8992694244513 ZWITSAL PWD 300G MILK&HONEY
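The pages appear to start at 1, so my range loop starts there.

You can also use a requests Session object to gain the efficiency of re-using the connection. If you choose your CSS selectors wisely, all of the filtering can be done at that level, so you only process the elements you actually need. You can also use the lighter-weight csv import rather than the heavier pandas import.

This requires bs4 4.7.1+ because it uses the :has pseudo-selector.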
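The Session, CSS selector and csv approach could be sketched roughly as follows. This is a reconstruction under stated assumptions, not necessarily the answer's original code: the selectors td:has(> center) and td:has(> center) + td are guesses based on the description of the page (barcode cells wrapping their text in a center element), and the helper names extract_rows and scrape, the header row, and the choice of html.parser are hypothetical.

```python
import csv

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://ciumi.com/cspos/barcode-ritel.php?page={}"


def extract_rows(html):
    """Return a list of (barcode, product name) tuples from one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: each barcode cell has a <center> child element ...
    barcodes = [td.text.strip() for td in soup.select("td:has(> center)")]
    # ... and the product name is the td immediately to its right.
    names = [td.text.strip() for td in soup.select("td:has(> center) + td")]
    return list(zip(barcodes, names))


def scrape(pages, out_path="dataaa.csv"):
    """Fetch the given page numbers and write all rows to a CSV file."""
    results = []
    # One Session re-uses the underlying connection across requests.
    with requests.Session() as session:
        for page in pages:
            response = session.get(BASE_URL.format(page))
            response.raise_for_status()
            results.extend(extract_rows(response.text))
    with open(out_path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["Barcode", "Nama Produk"])  # hypothetical header row
        writer.writerows(results)
    return results
```

Calling scrape(range(1, 3)) would request pages 1 and 2 and write the combined rows to dataaa.csv. If the real markup differs from the assumed layout, only the two selectors in extract_rows should need adjusting.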
Quick explanation:
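The first column is selected with soup.select by restricting the match to only those td elements that have a center child element.
The second column is then selected by using the adjacent sibling combinator to get the right-hand table cell (td) next to the left-hand table cell (td) that has the center child element.
The retrieved lists of tags are processed in list comprehensions that extract and strip their .text, then zipped and converted to a list again, and appended to the final list, results, which is later looped over to write the csv.
The CSS selectors are kept minimal to allow for faster matching.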
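The strip, zip and write steps described above can be illustrated without any network access. The sketch below uses two hand-written lists standing in for the cell texts of each column (sample values taken from the output shown earlier) and writes to an in-memory buffer rather than a file; the variable names and the header row are illustrative assumptions.

```python
import csv
import io

# Stand-ins for the raw .text of the cells in each column
barcodes = [" 8992694242533 ", "8992694247163\n"]
names = ["ZWITSAL SOAP 80G PACK 4 ", " ZWITSAL SOAP 80G MILK&HONEY"]

results = []
# Strip each value, then pair the two columns by position
page_rows = list(zip([b.strip() for b in barcodes],
                     [n.strip() for n in names]))
results.extend(page_rows)

# newline="" keeps the csv module's own line endings untouched
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Barcode", "Nama Produk"])
for row in results:
    writer.writerow(row)

csv_text = buffer.getvalue()
```

Note that zip stops at the shorter of its inputs, so if the two selectors ever return a different number of cells, the extra values are dropped silently; comparing the two list lengths before zipping is a cheap safeguard.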