Making a GET request in Python
I am trying to make a GET request:
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_page = requests.get('"https://www.dataquest.io"')
soup = BeautifulSoup(html_page, "lxml")
soup.find_all('<\a>')
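For what it's worth, three things in the snippet above keep it from finding any links: the URL string contains literal quote characters, the Response object is passed to BeautifulSoup instead of its `.text`, and `'<\a>'` is not a tag name `find_all()` understands. A minimal corrected sketch, run on an inline document so it works without network access (the inline HTML is illustrative, and `html.parser` stands in for `lxml` to avoid the extra dependency):

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for a fetched page; with requests the equivalent is:
#   html = requests.get('https://www.dataquest.io').text   # .text, no extra quotes
html = '<html><body><a href="/path">Dataquest</a></body></html>'

soup = BeautifulSoup(html, 'html.parser')  # html.parser ships with Python
links = soup.find_all('a')                 # bare tag name, not '<\a>'
print(links[0]['href'])                    # -> /path
```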
However, this snippet only returns an empty list.

This will pull the rows from the table and assign each row to a dictionary, which is appended to a list. You may need to tweak the selectors slightly:
from bs4 import BeautifulSoup
import requests
from pprint import pprint

output_data = []  # A list of dicts containing all of the table data
for i in range(1, 453):  # Loop used to paginate
    # NOTE: as posted, `i` is unused; the page-number query parameter is missing
    data_page = requests.get(f'https://www.dataquest.io?')
    print(data_page)
    soup = BeautifulSoup(data_page.text, "lxml")
    # Find all of the table rows
    elements = soup.select('div.head_table_t')
    try:
        secondary_elements = soup.select('div.list_table_subs')
        elements = elements + secondary_elements
    except:
        pass
    print(len(elements))
    # Iterate through the rows, select each column, and assign it to the
    # dictionary under the matching header
    for element in elements:
        data = {}
        data['Name'] = element.select_one('div.col_1 a').text.strip()
        data['Page URL'] = element.select_one('div.col_1 a')['href']
        output_data.append(data)  # Append the dictionary (row) to the list
        pprint(data)  # Pretty-print the dictionary to see what you're receiving (can be removed)
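The question imports pandas, and the comments below ask how to get the table into a df; since `output_data` is a list of dicts, the conversion is direct. A sketch with illustrative rows (the names and URLs are made up, not scraped):

```python
import pandas as pd

# Illustrative rows in the same shape the loop above produces.
output_data = [
    {'Name': 'Example Co', 'Page URL': '/example-co'},
    {'Name': 'Another Ltd', 'Page URL': '/another-ltd'},
]

df = pd.DataFrame(output_data)   # one column per dict key
print(df.shape)                  # -> (2, 2)
# df.to_csv('companies.csv', index=False)  # optional: persist the table
```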
Comments:

Try soup.findAll('a'). That just returns all of the data; I'm trying to work out how to pull the data in the table into a df.

Thank you so much. How would I do this for the whole website, i.e. pull the table from the next page and so on? The table to be scraped spans 453 pages.

I've modified my original answer; it now includes the multi-page loop. You'll need to clean it up a bit and move the request into a try/except. If this works for you, could you mark it as the answer?

Your original question has been answered, so could you mark this as the solution? As for the remaining question, you need to add the secondary elements to the elements list; I'll update my answer once it's marked as the solution.

I've updated it with the secondary elements added. Depending on how you handle them, you may want to add them to the associated dictionary: for example, 2M Holdings Ltd would otherwise have the other companies listed under it rather than each in its own dictionary.
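The answerer suggests moving the request into a try/except. One way to sketch that, with `fetch_page` as a hypothetical helper name and the timeout and retry counts as assumptions:

```python
import requests

def fetch_page(url, retries=3, timeout=10):
    """Return the page text, or None if every attempt fails.

    Hypothetical helper sketching the try/except suggested above,
    so one bad page doesn't abort the 453-page loop.
    """
    for _ in range(retries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()   # treat HTTP error codes as failures too
            return resp.text
        except requests.RequestException:
            continue                  # retry; after the last attempt, give up
    return None
```

The paginated loop above would then call this helper and `continue` to the next page whenever it returns None.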