Web data (wiki) scraping with Python

python web-scraping

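I am trying to get the lat/lng of some universities from Wikipedia. I have a base url='' and a list of universities, and I take each university's wiki page from its href so that I can read the lat/lng on that page. I get the error "'NoneType' object has no attribute 'text'" and cannot fix it. Where am I going wrong?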

import time
import csv
from bs4 import BeautifulSoup
import re
import requests
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://de.wikipedia.org/wiki/Liste_altsprachlicher_Gymnasien')
html = driver.page_source
base_url = 'https://de.wikipedia.org'
url = 'https://de.wikipedia.org/wiki/Liste_altsprachlicher_Gymnasien'
res = requests.get(url)
soup = BeautifulSoup(res.text)

university = []
while True:
    res = requests.get(url)
    soup = BeautifulSoup(res.text)
    links = soup.find_all('a', href=re.compile('.*\/wiki\/.*'))
    for l in links:
        full_link = base_url + l['href']
        town = l['title']
        res = requests.get(full_link)
        soup = BeautifulSoup(res.text)
        info = soup.find('span', attrs={"title":["Breitengrad","Längengrad"]})
        latlong = info.text
        university.append(dict(town_name=town, lat_long=latlong))
        print(university)

Edit 1: Thanks to @rll, I made this change:

        if info is not None:
            latlong = info.text
            university.append(dict(town_name=town, postal_code=latlong))
            print(university)

Now the code runs, but I only see the latitude and not the longitude.

Sample output:

{'postal_code': '49°\xa072\xa036,73\xa0N', 'town_name': 'Schönborn-Gymnasium Bruchsal'}, {'postal_code': '49°\xa072\xa030,73\xa0N', 'town_name': 'St.Paulusheim'}

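Any idea how to change this so that I also get the longitude, and how to format the output? Sorry, my regex is not good.

Edit 2

After updating the code, I get the longitude as well: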

        info = soup.find('span', attrs={"title":"Breitengrad"})
        info1 = soup.find('span', attrs={"title":"Längengrad"})
        if info is not None:
            latlong = info.text
            longitude = info1.text
            university.append(dict(town_name=town, postal_code=latlong, postal_code1=longitude))
            print(university)

Now my output looks like this:

{'postal_code': '48°\xa045′\xa046,9″\xa0N',
  'postal_code1': '8°\xa014′\xa044,8″\xa0O',
  'town_name': 'Gymnasium Hohenbaden'},

So I need help formatting the lat and long, because I do not know how to convert, for example:

48°\xa045′\xa046,9″\xa0N to 48°45′9″N

Thanks

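Sorry for not answering directly, but I always prefer to use MediaWiki's API. Fortunately, with Python, using the API is even easier.

So, for what it's worth, here is how I would do it using mwclient: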

import re
import mwclient

site = mwclient.Site('de.wikipedia.org')
start_page = site.Pages['Liste_altsprachlicher_Gymnasien']

results = {}
for link in start_page.links():
    page = site.Pages[link['title']]
    text = page.text()

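    try:
        # search() returns None when a page has no coordinates, so .group() raises AttributeError
        pattern = re.compile(r'Breitengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/N')
        breiten = [float(b) for b in pattern.search(text).group(1).split('/')]

        pattern = re.compile(r'Längengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/E')
        langen = [float(b) for b in pattern.search(text).group(1).split('/')]
    except (AttributeError, ValueError):
        # Skip pages without parseable coordinates instead of swallowing every exception
        continue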

    results[link['title']] = breiten, langen

This gives a tuple of lists for every link whose coordinates were found:

>>> results

{'Akademisches Gymnasium (Wien)': ([48.0, 12.0, 5.0], [16.0, 22.0, 34.0]),
 'Akademisches Gymnasium Salzburg': ([47.0, 47.0, 39.9], [13.0, 2.0, 2.9]),
 'Albertus-Magnus-Gymnasium (Friesoythe)': ([53.0, 1.0, 19.13], [7.0, 51.0, 46.44]),
 'Albertus-Magnus-Gymnasium Regensburg': ([49.0, 1.0, 23.95], [12.0, 4.0, 32.88]),
 'Albertus-Magnus-Gymnasium Viersen-Dülken': ([51.0, 14.0, 46.29], [6.0, 19.0, 42.1]),
 ...
}
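To see what the two patterns pick out without calling the API, here is a small offline sketch. The wikitext fragment is made up for illustration (the real infobox text of a page may look different), and the helper name extract is hypothetical. It also shows an explicit None check as an alternative to catching the exception.

```python
import re

# Made-up infobox fragment for illustration only; real page text may differ.
sample_wikitext = """
| Breitengrad = 48/12/05/N
| Längengrad = 16/22/34/E
"""

# Same patterns as in the answer, compiled once instead of on every page.
lat_pattern = re.compile(r'Breitengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/N')
lon_pattern = re.compile(r'Längengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/E')


def extract(pattern, text):
    """Return [degrees, minutes, seconds] as floats, or None if nothing matches."""
    match = pattern.search(text)
    if match is None:
        return None
    return [float(part) for part in match.group(1).split('/')]


lat = extract(lat_pattern, sample_wikitext)
lon = extract(lon_pattern, sample_wikitext)
print(lat, lon)  # [48.0, 12.0, 5.0] [16.0, 22.0, 34.0]
```

A page without coordinates makes extract return None, so the caller can skip it with a plain if check.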

You can format it however you like:

for uni, location in results.items():
    lat, lon = location
    string = """University {} is at {}˚{}'{}"N, {}˚{}'{}"E"""
    print(string.format(uni, *lat+lon))

Or convert the DMS coordinates to decimal degrees:

def dms_to_dec(coord):
    d, m, s = coord
    return d + m/60 + s/(60*60)

decimal = {uni: (dms_to_dec(b), dms_to_dec(l)) for uni, (b, l) in results.items()}

Note that not all of the linked pages are universities; I did not check carefully.
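If you stay with the scraped strings from Edit 2 of the question (for example '48°\xa045′\xa046,9″\xa0N'), the same conversion can be sketched with the standard library alone. This is only a sketch under a few assumptions: the function names are hypothetical, the strings always carry the ′ and ″ marks, the seconds use a German decimal comma, and O stands for "Ost" (east).

```python
import re

# Expects a whitespace-free string such as '48°45′46,9″N'.
# Hemisphere letters: N, S, W, and O (German "Ost") or E for east.
DMS_PATTERN = re.compile(r"(\d+)°(\d+)[′'](\d+(?:[.,]\d+)?)[″\"]([NSOEW])")


def clean_dms(raw):
    """Remove all whitespace, including the non-breaking spaces (\\xa0)."""
    return "".join(raw.split())


def dms_string_to_decimal(raw):
    """Convert a scraped DMS string to signed decimal degrees."""
    match = DMS_PATTERN.fullmatch(clean_dms(raw))
    if match is None:
        raise ValueError("unrecognised coordinate: {!r}".format(raw))
    degrees, minutes, seconds, hemisphere = match.groups()
    # Normalise the decimal comma before calling float().
    value = int(degrees) + int(minutes) / 60 + float(seconds.replace(",", ".")) / 3600
    # South and west are negative by convention.
    return -value if hemisphere in "SW" else value


print(clean_dms('48°\xa045′\xa046,9″\xa0N'))                         # 48°45′46,9″N
print(round(dms_string_to_decimal('48°\xa045′\xa046,9″\xa0N'), 5))  # 48.76303
print(round(dms_string_to_decimal('8°\xa014′\xa044,8″\xa0O'), 5))   # 8.24578
```

Strings that lack the minute or second marks raise ValueError, so you can decide per page whether to skip or log them.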

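BeautifulSoup did not find any span with those titles, so info is None. Try doing a find on any span and printing the result; I would guess you will see why that specific info is None.

I believe they provide an API, which is the recommended approach. Please try this link.

Thanks @rll for the change to the code, but there are still some problems.

You really should use WikiData or DbPedia, where all the work has already been done for you!

Thanks @dstudeba for the link, I will give it a try. I have never fetched data through this API, which is why I used this method. In any case, if you could help me fix the code, that would be very helpful.

Thanks for this. I had never tried the API, and this is very helpful; I was trying to do it the very long way. However, can we convert this lat and lng from degree format to decimal? That would be very useful.

Glad it is potentially helpful. I have edited my answer, but I would say you may need to ask another question.

I have a problem with mwclient: I cannot import it. I will post another question so that you can help. Thanks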