Unable to scrape an HTML table from a website using a Python script


I am trying to scrape the "Name" column of the table and save it as a CSV file.

I wrote a Python script, shown below:

from bs4 import BeautifulSoup
import requests
import csv


# Step 1: Sending a HTTP request to a URL
url = "https://myaccount.umn.edu/lookup?SET_INSTITUTION=UMNTC&type=name&CN=University+of+Minnesota&campus=a&role=any"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text


# Step 2: Parse the html content
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify()) # print the parsed data of html


# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}
#Get the table having the class wikitable
gdp_table = soup.find("table")
gdp_table_data = gdp_table.find_all("th")  # contains 2 rows

# Get all the headings of Lists
headings = []
for td in gdp_table_data[0].find_all("td"):
    # remove any newlines and extra spaces from left and right
    headings.append(td.b.text.replace('\n', ' ').strip())

# Get all the 3 tables contained in "gdp_table"
for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
    # Get headers of table i.e., Rank, Country, GDP.
    t_headers = []
    for th in table.find_all("th"):
        # remove any newlines and extra spaces from left and right
        t_headers.append(th.text.replace('\n', ' ').strip())

    # Get all the rows of table
    table_data = []
    for tr in table.tbody.find_all("tr"): # find all tr's from table's tbody
        t_row = {}
        # Each table row is stored in the form of
        # t_row = {'Rank': '', 'Country/Territory': '', 'GDP(US$million)': ''}

        # find all td's(3) in tr and zip it with t_header
        for td, th in zip(tr.find_all("td"), t_headers): 
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)

    # Put the data for the table with his heading.
    data[heading] = table_data
    print("table_data")

But when I run this script, I get no output at all.

Please help me solve this problem.

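It looks like your list gdp_table_data[0].find_all("td") is empty, which explains why you find nothing: the for loop never executes its body. Without more context about your strategy, it is hard to help further.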
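To illustrate why that list comes back empty, here is a minimal sketch on a made-up table fragment. The HTML string is an assumption that only mimics the header names on the page, not its real markup, and "html.parser" is used instead of "lxml" just to avoid an extra dependency. find_all("th") returns the header cells, and a header cell has no td elements inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: a stand-in for the real page, not its actual markup.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Email</th></tr>
  <tr><td>Climb Club (climb)</td><td>climb@umn.edu</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header_cells = soup.find("table").find_all("th")

# These are the column headers, not table rows.
header_texts = [th.get_text(strip=True) for th in header_cells]

# A <th> cell has no <td> descendants, so this list is empty
# and a for loop over it never executes its body.
cells_inside_first_header = header_cells[0].find_all("td")

print(header_texts)
print(len(cells_inside_first_header))
```

Since headings stays empty in the question's script, the later zip(...) over it yields nothing either, so the script finishes without printing.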

By the way, if you are not opposed to using an external library, pandas makes scraping this kind of web page very easy. To give you an idea:

>>> import pandas as pd
>>> url = "https://myaccount.umn.edu/lookup?SET_INSTITUTION=UMNTC&type=name&CN=University+of+Minnesota&campus=a&role=any"
>>> df = pd.read_html(url)[0]
>>> print(df)
                                                  Name              Email  Work Phone  Phone          Dept/College
 0      AIESEC at the University of Minnesota (aiesec)     aiesec@umn.edu         NaN    NaN  Student Organization
 1   Ayn Rand Study Group University of Minnesota (...    aynrand@umn.edu         NaN    NaN                   NaN
 2                               Balance UMD (balance)  balance@d.umn.edu         NaN    NaN  Student Organization
 3   Christians on Campus the University of Minneso...     cocumn@umn.edu         NaN    NaN  Student Organization
 4          Climb Club University of Minnesota (climb)      climb@umn.edu         NaN    NaN  Student Organization
 ..                                                ...                ...         ...    ...                   ...
 74   University of Minnesota Tourism Center (tourism)    tourism@umn.edu         NaN    NaN            Department
 75  University of Minnesota Treasury Accounting (t...   treasury@umn.edu         NaN    NaN            Department
 76  University of Minnesota Twin Cities HOSA (umnh...    umnhosa@umn.edu         NaN    NaN  Student Organization
 77           University of Minnesota U Write (uwrite)                NaN         NaN    NaN            Department
 78        University of Minnesota VoiceMail (cs-vcml)    cs-vcml@umn.edu         NaN    NaN  OIT Network & Design

 [79 rows x 5 columns]

Now, getting only the names is straightforward:

>>> print(df.Name)
0        AIESEC at the University of Minnesota (aiesec)
1     Ayn Rand Study Group University of Minnesota (...
2                                 Balance UMD (balance)
3     Christians on Campus the University of Minneso...
4            Climb Club University of Minnesota (climb)
                            ...
74     University of Minnesota Tourism Center (tourism)
75    University of Minnesota Treasury Accounting (t...
76    University of Minnesota Twin Cities HOSA (umnh...
77             University of Minnesota U Write (uwrite)
78          University of Minnesota VoiceMail (cs-vcml)
Name: Name, Length: 79, dtype: object

To export only that column to a .csv file, use:

>>> df[["Name"]].to_csv("./filename.csv")
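One detail that may matter here: to_csv also writes the DataFrame index as an extra first column unless index=False is passed. A small sketch, assuming a hand-built DataFrame (two rows copied from the output above) in place of the scraped table, and an in-memory buffer in place of a file:

```python
import io

import pandas as pd

# Stand-in for the scraped table: two rows copied from the output above.
df = pd.DataFrame({
    "Name": [
        "Balance UMD (balance)",
        "Climb Club University of Minnesota (climb)",
    ],
    "Email": ["balance@d.umn.edu", "climb@umn.edu"],
})

# Write only the "Name" column, without the numeric index column.
buffer = io.StringIO()
df[["Name"]].to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a path such as "./filename.csv" instead of the buffer writes the same content to disk.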

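Have you tried using a debugger, or simply printing the variables to check whether their values are correct? When I ran your script, the value of gdp_table_data was [Name, Email, Work Phone, Phone, Dept/College]. Is that what you expected?

Yes, I need all the names on that website. Can you help me achieve this?

@SSC Only the names, not the emails and the rest. I want to get only the Name column. How can I do that?

I edited my answer. A simple df.Name.tolist() would work.

When I save it in a .py file and run it, I get nothing. I actually need it as a script.

If you want to print the list from a script, you can use print(df.Name.tolist()).

That is not it. I want these names to be written to another csv file.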
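For the script form asked about in the last comment, one possible sketch uses the BeautifulSoup and csv modules already imported in the question. The helper names extract_column and write_column are hypothetical, and the HTML string is a made-up stand-in for the page so the example runs without network access. It assumes the first table on the page holds the data and that its header cells are th elements.

```python
import csv
import io

from bs4 import BeautifulSoup


def extract_column(html, column_name):
    """Return the text of every cell under column_name in the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    if column_name not in headers:
        return []
    col = headers.index(column_name)
    values = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        # The header row has no <td> cells, so it is skipped here.
        if len(cells) > col:
            values.append(cells[col].get_text(strip=True))
    return values


def write_column(values, column_name, out_file):
    """Write a one-column CSV: a header row, then one value per row."""
    writer = csv.writer(out_file)
    writer.writerow([column_name])
    writer.writerows([value] for value in values)


# Hypothetical HTML: a stand-in for the real page, not its actual markup.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Email</th></tr>
  <tr><td>AIESEC at the University of Minnesota (aiesec)</td><td>aiesec@umn.edu</td></tr>
  <tr><td>Balance UMD (balance)</td><td>balance@d.umn.edu</td></tr>
</table>
"""

names = extract_column(SAMPLE_HTML, "Name")
buffer = io.StringIO()
write_column(names, "Name", buffer)
print(buffer.getvalue())
```

In a real script, the html argument would come from requests.get(url).text as in the question, and out_file from open("names.csv", "w", newline="", encoding="utf-8"), where the file name is only an example.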