Python: fetching information from a web page with pandas and bs4 and writing it to an .xls file


python pandas web-scraping


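I am a beginner in Python programming and am practising web scraping with the bs4 module.

I have extracted some fields from a web page, but when I try to write them to an .xls file, the file stays empty except for the header row. Please tell me what I am doing wrong and, if possible, suggest what I should do instead.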

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.get('https://rwbj.com.au/find-an-agent.html')
soup = bs(res.content, 'lxml')

data = soup.find_all("div",{"class":"fluidgrid-cell fluidgrid-cell-2"})

records = []
name =[]
phone =[]
email=[]
title=[]
location=[]
for item in data:
    name = item.find('h3',class_='heading').text.strip()
    phone = item.find('a',class_='text text-link text-small').text.strip()
    email = item.find('a',class_='text text-link text-small')['href']
    title = item.find('div',class_='text text-small').text.strip()
    location = item.find('div',class_='text text-small').text.strip()

    records.append({'Names': name, 'Title': title, 'Email': email, 'Phone': phone, 'Location': location})

df = pd.DataFrame(records,columns=['Names','Title','Phone','Email','Location'])
df=df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\R&W.xls', sheet_name='MyData2', index = False, header=True)

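You can use something like selenium to let the JavaScript render the content. You can then grab the page source and carry on with your script. I have deliberately kept your script as it is and only added new lines that wait for the content.

You can run selenium headless, or switch to using HTMLSession.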

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

d = webdriver.Chrome()
d.get('https://rwbj.com.au/find-an-agent.html')

WebDriverWait(d,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h3")))

soup = bs(d.page_source, 'lxml')
d.quit()
data = soup.find_all("div",{"class":"fluidgrid-cell fluidgrid-cell-2"})

records = []
name =[]
phone =[]
email=[]
title=[]
location=[]
for item in data:
    name = item.find('h3',class_='heading').text.strip()
    phone = item.find('a',class_='text text-link text-small').text.strip()
    email = item.find('a',class_='text text-link text-small')['href']
    title = item.find('div',class_='text text-small').text.strip()
    location = item.find('div',class_='text text-small').text.strip()
    records.append({'Names': name, 'Title': title, 'Email': email, 'Phone': phone, 'Location': location})

df = pd.DataFrame(records,columns=['Names','Title','Phone','Email','Location'])
print(df)

If every item is present for each person, I would consider something like:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd

options = Options()
options.add_argument('--headless')  # the options.headless attribute is deprecated in recent Selenium 4 releases

d = webdriver.Chrome(options = options) 
d.get('https://rwbj.com.au/find-an-agent.html')

WebDriverWait(d,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h3")))

soup = bs(d.page_source, 'lxml')
d.quit()
names = [item.text for item in soup.select('h3')]
titles = [item.text for item in soup.select('h3 ~ div:nth-of-type(1)')]
tels = [item.text for item in soup.select('h3 + a')]
emails = [item['href'] for item in soup.select('h3 ~ a:nth-of-type(2)')]
locations = [item.text for item in soup.select('h3 ~ div:nth-of-type(2)')]
records = list(zip(names, titles, tels, emails, locations))  # was `positions`, which is undefined
df = pd.DataFrame(records,columns=['Names','Title','Phone','Email','Location'])
print(df)
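The sibling selectors above can be tried offline. The following is a minimal sketch, assuming a hypothetical card layout (an `h3` followed by two links and two `div`s); the real page's markup may differ. It also adds a length check, because `zip` silently truncates to the shortest list if any person is missing an item.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup, invented for illustration only.
html = """
<div class="card">
<h3>Alice Example</h3>
<a href="tel:0400000000">0400 000 000</a>
<a href="mailto:alice@example.com">Email</a>
<div>Sales Agent</div>
<div>Sampleville</div>
</div>
<div class="card">
<h3>Bob Example</h3>
<a href="tel:0400000001">0400 000 001</a>
<a href="mailto:bob@example.com">Email</a>
<div>Property Manager</div>
<div>Testtown</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

names = [item.text for item in soup.select("h3")]
titles = [item.text for item in soup.select("h3 ~ div:nth-of-type(1)")]  # first div after the h3
tels = [item.text for item in soup.select("h3 + a")]                     # link directly after the h3
emails = [item["href"] for item in soup.select("h3 ~ a:nth-of-type(2)")]  # second link in the card
locations = [item.text for item in soup.select("h3 ~ div:nth-of-type(2)")]

# zip() stops at the shortest list, so fail loudly if the columns are misaligned.
columns = [names, titles, tels, emails, locations]
if len({len(col) for col in columns}) != 1:
    raise ValueError("selectors returned lists of different lengths")

records = list(zip(*columns))
df = pd.DataFrame(records, columns=["Names", "Title", "Phone", "Email", "Location"])
print(df)
```

If the lengths differ on the real page, the per-card loop from the first script is the safer choice, since it keeps each person's fields together.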

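If you don't want to use selenium, you can make the same POST request that the web page makes. This gives you an xml response, which you can parse with Beautifulsoup to get the desired output.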

We can use the Network tab of the browser's inspect tool to find the request being made, together with the form data for that request.
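Once the form data has been read off the Network tab, it can be written as a dict and checked offline before anything is sent. The sketch below builds the request with `requests.Request(...).prepare()` and decodes the body so it can be compared with what the browser shows. The field names are the ones used in this answer's code and are specific to this site.

```python
from urllib.parse import parse_qsl

import requests

# Form fields as they appear in this answer; specific to this site.
payload = {
    "act": "act_fgxml",
    "15[offset]": 0,
    "15[perpage]": 20,
    "require": 0,
    "fgpid": 15,
    "ajax": 1,
}

# Build the request without sending it, so nothing touches the network.
prepared = requests.Request(
    "POST", "https://www.rwbj.com.au/find-an-agent.html", data=payload
).prepare()

# Decode the url-encoded body back into key/value pairs for comparison
# with the "Form Data" section of the Network tab.
decoded = dict(parse_qsl(prepared.body))
print(prepared.method, prepared.headers["Content-Type"])
print(decoded)
```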

Next, we have to make the same request using python requests and parse the output.

import requests
from bs4 import BeautifulSoup
import pandas as pd
number_of_agents_required=20 # they only have 20 on the site
payload={
'act':'act_fgxml',
'15[offset]':0,
'15[perpage]':number_of_agents_required,
'require':0,
'fgpid':15,
'ajax':1
}
records=[]
r=requests.post('https://www.rwbj.com.au/find-an-agent.html',data=payload)
soup=BeautifulSoup(r.text,'lxml')
for row in soup.find_all('row'):
    name=row.find('name').text
    title=row.position.text.replace('&amp;','&')
    email=row.email.text
    phone=row.phone.text
    location=row.office.text
    records.append([name,title,phone,email,location])  # order must match the columns below
df=pd.DataFrame(records,columns=['Names','Title','Phone','Email','Location'])
df.to_excel('R&W.xls', sheet_name='MyData2', index = False, header=True)
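To see what the loop does without contacting the site, the same parsing can be run on a small sample. This is a sketch under assumptions: the XML below is invented to match the tag names the loop reads (`row`, `name`, `position`, `email`, `phone`, `office`), and the real response may contain more. It uses `html.parser` so that lxml is not required, and dict records so that values cannot end up under the wrong column.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample shaped like the response the loop expects.
sample_xml = """
<rows>
  <row>
    <name>Alice Example</name>
    <position>Sales &amp; Marketing</position>
    <email>alice@example.com</email>
    <phone>0400 000 000</phone>
    <office>Sampleville</office>
  </row>
  <row>
    <name>Bob Example</name>
    <position>Property Manager</position>
    <email>bob@example.com</email>
    <phone>0400 000 001</phone>
    <office>Testtown</office>
  </row>
</rows>
"""

soup = BeautifulSoup(sample_xml, "html.parser")

records = []
for row in soup.find_all("row"):
    records.append({
        # row.name would return the tag's own name ("row"), hence find("name").
        "Names": row.find("name").text,
        "Title": row.find("position").text,
        "Phone": row.find("phone").text,
        "Email": row.find("email").text,
        "Location": row.find("office").text,
    })

df = pd.DataFrame(records, columns=["Names", "Title", "Phone", "Email", "Location"])
print(df)
```

In this sample the parser already decodes `&amp;` to `&`, so no extra `replace` call is needed here; whether the live response needs one depends on how the server escapes its text.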


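Strange, I didn't see this while I was writing my answer. +1 for the better approach.

Yes, it works great. I would be grateful if you could explain how the code above works. I could not find a 'row' tag in the Elements tab, and I don't understand why '&amp;' and '&' are used. I am a beginner in Python programming and would like to learn more about this approach. This is the part of the code I would like you to explain:

for row in soup.find_all('row'):
    name=row.find('name').text
    title=row.position.text.replace('&amp;','&')
    email=row.email.text
    phone=row.phone.text
    location=row.office.text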

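@ag2019 If you print(soup) you can see the xml response. The code means: for each row tag in the xml, get the text inside the name, position, email, phone and office tags and save it to the variables on the left.

OK, thank you very much for the explanation. Here we set number_of_agents_required to 20 because we can see 20 agents on the site. But if a web page had many more agents, so that counting them would take time, what value should we assign to number_of_agents_required?

OK, I will try to learn Selenium now. But can you tell me why my script could not extract the information from the web page?

The web page updates its content through an XHR request. You need to make a request to that URL yourself, or use selenium to give the content time to render.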
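The effect described in the last comment can be reproduced offline. The sketch below uses a hypothetical stand-in for the HTML that a plain `requests.get()` would receive, in which the agent cards are missing because the browser adds them later. The loop never runs, so the DataFrame has column headers but no rows, which matches the empty .xls file from the question.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical static HTML: the container is present, but the agent cards
# are not, because a script would normally fetch them after page load.
static_html = """
<html><body>
  <div id="agents"></div>
  <script src="load-agents.js"></script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")
cards = soup.find_all("div", {"class": "fluidgrid-cell fluidgrid-cell-2"})

records = []
for card in cards:  # never executes: no matching div in the static HTML
    records.append({"Names": card.find("h3").text.strip()})

df = pd.DataFrame(records, columns=["Names", "Title", "Phone", "Email", "Location"])
print(len(cards))
print(df.empty, list(df.columns))
```

Printing `len(data)` right after `find_all` in the original script is a quick way to confirm that this is what is happening.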