Making a GET request in Python
I am trying to make a GET request:
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_page = requests.get('"https://www.dataquest.io"')
soup = BeautifulSoup(html_page, "lxml")
soup.find_all('<\a>')
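For what it's worth, three things in the snippet above keep it from finding any links: the URL string contains literal quote characters, the Response object is passed to BeautifulSoup instead of its `.text`, and `'<\a>'` is not a tag name `find_all()` understands. A minimal corrected sketch, run on an inline document so it works without network access (the inline HTML is illustrative, and `html.parser` stands in for `lxml` to avoid the extra dependency):

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for a fetched page; with requests the equivalent is:
#   html = requests.get('https://www.dataquest.io').text   # .text, no extra quotes
html = '<html><body><a href="/path">Dataquest</a></body></html>'

soup = BeautifulSoup(html, 'html.parser')  # html.parser ships with Python
links = soup.find_all('a')                 # bare tag name, not '<\a>'
print(links[0]['href'])                    # -> /path
```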
However, this snippet only returns an empty list.

This will pull the rows from the table and assign each row to a dictionary, which is appended to a list. You may need to tweak the selectors slightly:
from bs4 import BeautifulSoup
import requests
from pprint import pprint

output_data = []  # A list of dicts containing all of the table data
for i in range(1, 453):  # Loop used to paginate
    # NOTE: as posted, `i` is unused; the page-number query parameter is missing
    data_page = requests.get(f'https://www.dataquest.io?')
    print(data_page)
    soup = BeautifulSoup(data_page.text, "lxml")
    # Find all of the table rows
    elements = soup.select('div.head_table_t')
    try:
        secondary_elements = soup.select('div.list_table_subs')
        elements = elements + secondary_elements
    except:
        pass
    print(len(elements))
    # Iterate through the rows, select each column, and assign it to the
    # dictionary under the matching header
    for element in elements:
        data = {}
        data['Name'] = element.select_one('div.col_1 a').text.strip()
        data['Page URL'] = element.select_one('div.col_1 a')['href']
        output_data.append(data)  # Append the dictionary (row) to the list
        pprint(data)  # Pretty-print the dictionary to see what you're receiving (can be removed)
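The question imports pandas, and the comments below ask how to get the table into a df; since `output_data` is a list of dicts, the conversion is direct. A sketch with illustrative rows (the names and URLs are made up, not scraped):

```python
import pandas as pd

# Illustrative rows in the same shape the loop above produces.
output_data = [
    {'Name': 'Example Co', 'Page URL': '/example-co'},
    {'Name': 'Another Ltd', 'Page URL': '/another-ltd'},
]

df = pd.DataFrame(output_data)   # one column per dict key
print(df.shape)                  # -> (2, 2)
# df.to_csv('companies.csv', index=False)  # optional: persist the table
```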
Comments:

Try soup.findAll('a'). That just returns all of the data; I'm trying to work out how to pull the data in the table into a df.

Thank you so much. How would I do this for the whole website, i.e. pull the table from the next page and so on? The table to be scraped spans 453 pages.

I've modified my original answer; it now includes the multi-page loop. You'll need to clean it up a bit and move the request into a try/except. If this works for you, could you mark it as the answer?

Your original question has been answered, so could you mark this as the solution? As for the remaining question, you need to add the secondary elements to the elements list; I'll update my answer once it's marked as the solution.

I've updated it with the secondary elements added. Depending on how you handle them, you may want to add them to the associated dictionary: for example, 2M Holdings Ltd would otherwise have the other companies listed under it rather than each in its own dictionary.
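The answerer suggests moving the request into a try/except. One way to sketch that, with `fetch_page` as a hypothetical helper name and the timeout and retry counts as assumptions:

```python
import requests

def fetch_page(url, retries=3, timeout=10):
    """Return the page text, or None if every attempt fails.

    Hypothetical helper sketching the try/except suggested above,
    so one bad page doesn't abort the 453-page loop.
    """
    for _ in range(retries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()   # treat HTTP error codes as failures too
            return resp.text
        except requests.RequestException:
            continue                  # retry; after the last attempt, give up
    return None
```

The paginated loop above would then call this helper and `continue` to the next page whenever it returns None.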