How do I export scraped data to Excel horizontally in Python?

python web-scraping


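I'm fairly new to Python. Using this website as an example, I'm trying to scrape restaurant information, but the data is read vertically and I'm not sure how to pivot it so that each restaurant becomes one row. I want the Excel sheet to have these six columns: Name, Street, City, State, Zip, Phone. This is the code I'm using: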

from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.request import urlopen
import time

driver = webdriver.Chrome(executable_path=r"C:\Downloads\chromedriver_win32\chromedriver.exe")


driver.get('https://www.restaurant.com/listing?&&st=KS&p=KS&p=PA&page=1&&searchradius=50&loc=10021')
time.sleep(10)
with urlopen(driver.current_url) as response:
    soup = BeautifulSoup(response, 'html.parser')
    pageList = soup.findAll("div", attrs={"class": {"details"}})
    list_of_inner_text = [x.text for x in pageList]
    text = ', '.join(list_of_inner_text)
    print(text)

Thanks

Edit: Based on feedback, here is what I expect for the first five restaurants on this page:

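Here is one approach. Your mileage may vary on other pages.

This line

details = [re.sub(r'\s{2,}|[,]', '',i) for i in restuarant.select_one('h3 + p').text.strip().split('\n') if i!='']

basically handles generating the output columns (all except the name) by splitting the text of the p tag on "\n" and doing a small amount of string cleaning.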
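To see what that list comprehension does without opening a browser, here is a minimal sketch that runs the same split-and-clean step on a made-up block of text. The sample text, the helper name split_details, and the assumption that each field sits on its own line are illustrative only; the real page's markup may differ.

```python
import re

# Hypothetical text of one listing's <p> element (not copied from the real site).
# Assumes each field sits on its own line, padded with indentation.
sample_text = """
    123 Example St
    Springfield,
    KS
    66000
    (555) 010-0000
"""

def split_details(text):
    """Split a text block on newlines and clean each non-empty piece."""
    return [
        # Remove runs of two or more whitespace characters and any commas
        re.sub(r'\s{2,}|[,]', '', piece)
        for piece in text.strip().split('\n')
        if piece != ''
    ]

print(split_details(sample_text))
# ['123 Example St', 'Springfield', 'KS', '66000', '(555) 010-0000']
```

Single spaces inside a field, as in the street and the phone number, are kept because the pattern only matches two or more whitespace characters in a row. If two fields shared a line, this pattern would glue them together, so the approach depends on the one-field-per-line layout assumed here.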

import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Note: executable_path is the Selenium 3 style; newer Selenium 4 releases expect a Service object instead
driver = webdriver.Chrome(executable_path=r"C:\Users\User\Documents\chromedriver.exe")
driver.get('https://www.restaurant.com/listing?&&st=KS&p=KS&p=PA&page=1&&searchradius=50&loc=10021')
# Wait up to 10 seconds for the restaurant listings to be present in the DOM
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".restaurants")))
# Parse the rendered page source directly; no separate urlopen request is needed
soup = BeautifulSoup(driver.page_source, 'html.parser')
restuarants = soup.select('.restaurants')
results = []

for restuarant in restuarants:
    # Split the <p> that follows the <h3> on newlines, then strip whitespace runs and commas
    details = [re.sub(r'\s{2,}|[,]', '', i) for i in restuarant.select_one('h3 + p').text.strip().split('\n') if i != '']
    # Put the restaurant name first so each row matches the column order below
    details.insert(0, restuarant.select_one('h3 a').text)
    results.append(details)

# One row per restaurant, one column per field
df = pd.DataFrame(results, columns=['Name', 'Address', 'City', 'State', 'Zip', 'Phone'])
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)
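If the values have already been collected as one flat, vertical list, as in the question, the same row-per-restaurant layout can be produced by grouping the list into chunks of six. The sketch below uses only the standard library rather than pandas; the sample values and the helper name to_rows are made up for illustration, and it assumes every restaurant contributes exactly six values in the same order.

```python
import csv
import io

COLUMNS = ['Name', 'Address', 'City', 'State', 'Zip', 'Phone']

# Hypothetical values, one per entry, in the order they might be read off a page.
flat = [
    'Cafe One', '1 First St', 'Topeka', 'KS', '66601', '(555) 010-0001',
    'Cafe Two', '2 Second St', 'Wichita', 'KS', '67201', '(555) 010-0002',
]

def to_rows(values, width):
    """Group a flat list into consecutive rows of `width` items."""
    if len(values) % width != 0:
        raise ValueError('value count is not a multiple of the row width')
    return [values[i:i + width] for i in range(0, len(values), width)]

rows = to_rows(flat, len(COLUMNS))

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(COLUMNS)   # header row
writer.writerows(rows)     # one line per restaurant
csv_text = buffer.getvalue()
print(csv_text)
```

The length check is there because a single missing field (for example, a listing without a phone number) would otherwise shift every later value into the wrong column. For a native .xlsx file instead of CSV, pandas also provides DataFrame.to_excel, which needs an Excel writer engine such as openpyxl to be installed.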

Could you show the first few rows of the desired output? Also, there is no need for urllib.request, since you want to use driver.page_source with BeautifulSoup. What you are currently retrieving contains a lot of unwanted content (I think), so seeing the expected output would help.

Thanks @QHarr, I added an image for clarification.