Python: trying to web scrape, with the results output to a CSV file


I'm trying to web scrape with Python and write the results to a CSV file, but when I run the script I get multiple entries for the same product name. Here is my code -

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20card'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

# grabs each product
containers = page_soup.findAll("div", {
    "class": "item-container"
})

filename = "products.csv"
f = open(filename, "w")

headers = "product_name, shipping\n"

f.write(headers)


for container in containers:
    container = page_soup.findAll("div", {
        "class": "item-info"
    })
    print(container[0].div.a.img["title"])

    container = page_soup.findAll("a", {
        "class": "item-title"
    })
    product_name = container[0].text

    container = page_soup.findAll("li", {
        "class": "price-ship"
    })
    shipping = container[0].text.strip()

    print("product_name: " + product_name)
    print("shipping: " + shipping)

    f.write(product_name.replace(",", "|") + "," + shipping + "\n")

f.close()

To collect the different pieces of information that belong to the same item, you can use the zip() function. For writing the CSV file, I recommend the csv module - it handles quoting and delimiters automatically:

from bs4 import BeautifulSoup
import requests
import csv

url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20card'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

with open('out.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csvwriter.writerow(["product_name", "shipping"])
    for product_name, shipping in zip(soup.select('.item-container .item-title'), soup.select('.item-container .price-ship')):
        csvwriter.writerow([product_name.get_text(strip=True), shipping.get_text(strip=True)])
The output in out.csv will be:

product_name,shipping
"EVGA GeForce RTX 2080 Ti XC ULTRA GAMING, 11G-P4-2383-KR, 11GB GDDR6, Dual HDB Fans & RGB LED",Free Shipping
XFX Radeon RX 5700 XT DirectX 12 RX-57XT8MFD6 Video Card,Free Shipping
GIGABYTE GeForce RTX 2060 DirectX 12 GV-N2060GAMINGOC PRO WHITE-6GD Video Card,Free Shipping
"Aorus AD27QD 27"" 144Hz 1440P FreeSync Gaming Monitor + GIGABYTE Radeon RX ...",
ASUS ROG Strix GeForce RTX 2070 DirectX 12 ROG-STRIX-RTX2070-8G-GAMING Video Card,Free Shipping
"EVGA GeForce RTX 2060 SC Ultra GAMING, 06G-P4-2067-KR, 6GB GDDR6, Dual HDB Fans",Free Shipping
PowerColor AMD Radeon RX 5700 XT 8GB GDDR6 AXRX 5700XT 8GBD6-M3DH,Free Shipping
MSI GeForce RTX 2080 DirectX 12 RTX 2080 VENTUS 8G Video Card,Free Shipping
ZOTAC GeForce GTX 1060 DirectX 12 ZT-P10620A-10M Video Card,Free Shipping
ASRock Phantom Gaming X Radeon VII DirectX 12 Radeon VII 16G Video Card,$6.99 Shipping
"Sapphire PULSE Radeon RX 580 8GB GDDR5 PCI-E Dual HDMI / DVI-D / Dual DP OC w/ Backplate (UEFI), 100411P8GOCL",Free Shipping
XFX Radeon RX 590 Fatboy DirectX 12 RX-590P8DFD6 8GB 256-Bit DDR5 PCI Express 3.0 CrossFireX Support Video Card,Free Shipping
Opening this file in LibreOffice shows the fields split into proper columns.
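The doubled quotes in the Aorus row above come from csv.QUOTE_MINIMAL: a field containing the delimiter or the quote character gets wrapped in quotes, and embedded quotes are doubled. A small self-contained sketch of that round trip, using io.StringIO in place of out.csv:

```python
import csv
import io

# In-memory stand-in for out.csv, so the round trip is self-contained.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(["product_name", "shipping"])
# A title containing a comma or a literal " is wrapped in quotes,
# and the embedded quote is doubled - exactly the "" seen above.
writer.writerow(['Aorus AD27QD 27" Gaming Monitor', "Free Shipping"])

print(buf.getvalue())

# Reading the file back recovers the original fields unchanged.
buf.seek(0)
rows = list(csv.reader(buf))
print(rows[1])
```

No escaping logic is needed on your side; csv.reader undoes the quoting on the way back in.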




Your current code is a bit hard to follow, but I'm fairly sure the problem is container[0] - you select the first item of the list every time. Use print() to inspect the values in your variables; that should help you locate the problem. You should probably search inside each container with container.findAll() instead of page_soup.findAll(). With page_soup.findAll() you always get the same elements. Also, don't assign the result back to container, because then you lose access to the container itself. As an aside, I'm not sure Newegg would appreciate you scraping their site like this - check whether you can use their dedicated API instead.
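A minimal sketch of that container-scoped lookup, run against an inline HTML snippet instead of the live page (the class names mirror the question's markup; the product names are made up):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the fetched page.
html = """
<div class="item-container">
  <a class="item-title">Card A</a><li class="price-ship">Free Shipping</li>
</div>
<div class="item-container">
  <a class="item-title">Card B</a><li class="price-ship">$6.99 Shipping</li>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")
rows = []
for container in page_soup.find_all("div", {"class": "item-container"}):
    # Search inside this container, not the whole page, so each
    # iteration yields a different product.
    title = container.find("a", {"class": "item-title"})
    shipping = container.find("li", {"class": "price-ship"})
    rows.append((title.get_text(strip=True), shipping.get_text(strip=True)))

print(rows)
```

Because the find() calls start from container rather than page_soup, each loop iteration produces that container's own title and shipping, instead of repeating the first product on the page.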