Replacing BeautifulSoup with another (standard) HTML parsing module in this Python script

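I've written a script with BeautifulSoup. It works fine and is very readable, but I'd like to redistribute it some day, and BeautifulSoup is an external dependency I'd like to avoid, especially considering use on Windows.

Here is the code. It fetches every usermap link from a given Google Maps user. The lines marked with #### are the ones that use BeautifulSoup: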

# coding: utf-8

import urllib, re
from BeautifulSoup import BeautifulSoup as bs

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source)  ####
    maptables = soup.findAll(id=re.compile('^map[0-9]+$'))  #################
    for table in maptables:
        for line in table.findAll('a', 'maptitle'):  ################
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-3]
            print shown, mapid, '\t', mapname
            shown += 1

            urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
                               '&msa=0&output=kml', mapname + '.kml')


    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break

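As you can see, only three lines use BeautifulSoup, but I'm not a programmer and had a lot of difficulty trying the other standard HTML and XML parsing tools, probably because I was going about it the wrong way.

EDIT: This question is more about replacing those three lines of this script than about finding a general solution to every possible HTML parsing problem.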

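Thank you very much for your help!

To parse HTML code, I see three solutions:

using simple string searches (.find(), ...), which is fast
using regular expressions (regex)
using HTMLParser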
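As a rough illustration of the third option, here is a minimal sketch of how the three BeautifulSoup lines might be expressed with the standard library's HTML parser. It is written for Python 3, where the module is `html.parser`; in Python 2, which the question's script targets, the module is named `HTMLParser` and the class would need small adjustments. The class name `MapTitleParser`, the sample markup, and the assumption that the matching elements are `<table>` tags are all hypothetical, inferred only from the `findAll` calls in the question.

```python
import re
from html.parser import HTMLParser

# Same id pattern the question passes to soup.findAll(id=...)
MAP_ID = re.compile(r'^map[0-9]+$')


class MapTitleParser(HTMLParser):
    """Collect (href, text) for <a class="maptitle"> links found
    inside <table> elements whose id looks like map<N>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._table_depth = 0   # > 0 while inside a matching table
        self._href = None       # set while inside a maptitle link
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'table':
            # Enter a matching table, or track a table nested in one
            if self._table_depth or MAP_ID.match(attrs.get('id') or ''):
                self._table_depth += 1
        elif tag == 'a' and self._table_depth:
            if 'maptitle' in (attrs.get('class') or '').split():
                self._href = attrs.get('href') or ''
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None
        elif tag == 'table' and self._table_depth:
            self._table_depth -= 1


# Made-up markup, only to exercise the parser
sample = '''
<table id="map1"><tr><td>
<a class="maptitle" href="/maps/ms?msid=123.abc">First map</a>
</td></tr></table>
<table id="other"><tr><td>
<a class="maptitle" href="/maps/ms?msid=123.zzz">Ignored</a>
</td></tr></table>
'''

parser = MapTitleParser()
parser.feed(sample)
links = parser.links
```

Each entry in `links` holds the link target and its text, so the map id could then be pulled out of the `href` with the same regular expression the script already uses.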

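Unfortunately, Python has no useful HTML parsing in its standard library, so the only reasonable way to parse HTML is with a third-party module such as BeautifulSoup. That doesn't mean you must have a separate dependency: these modules are free software, and if you don't want an external dependency you are welcome to bundle them with your code, which makes them no more of a dependency than code you wrote yourself.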

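I tried this code (see below) and it shows a list of links. Since I don't have Beautiful Soup installed and don't want to install it, it is hard for me to check the result against your code. The "pure" Python code without any "soup" is even shorter and more readable. Anyway, here it is. Tell me what you think! Kind regards, Louis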

#coding: utf-8

import urllib, re

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
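    pos = source.find('maptitle')
    while pos != -1:
        # Search only from the current 'maptitle' occurrence onward
        # (assumes the map link and title follow the class attribute).
        chunk = source[pos:]
        idmatch = re.search(re.escape(uid) + r'\.([^"]*)', chunk)
        namematch = re.search('>(.*)</a>', chunk)
        if idmatch and namematch:
            mapid = idmatch.group(1)
            mapname = namematch.group(1).strip()[:-3]
            print shown, mapid, '\t', mapname
            shown += 1
            urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) + '&msa=0&output=kml', mapname + '.kml')
        pos = source.find('maptitle', pos + 1)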

    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break
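The extraction step can also be sketched as a single `re.finditer` pass over the page, which avoids tracking string positions by hand. This is only a sketch under assumptions: it is written for Python 3, the helper name `find_map_links` and the sample markup are hypothetical, and it assumes each map link is an `<a>` tag carrying both `maptitle` and an `href` containing `uid.mapid`. It does not reproduce the `[:-3]` trimming of the title used above.

```python
import re

# Matches one anchor: group 1 is the attribute text, group 2 the link body
ANCHOR = re.compile(r'<a\b([^>]*)>(.*?)</a>', re.S | re.I)


def find_map_links(source, uid):
    """Return (mapid, title) pairs for every anchor whose attributes
    mention 'maptitle' and contain '<uid>.<mapid>' before a quote."""
    id_pattern = re.compile(re.escape(uid) + r'\.([^"]*)')
    results = []
    for match in ANCHOR.finditer(source):
        attrs, text = match.group(1), match.group(2)
        if 'maptitle' not in attrs:
            continue  # not a map title link
        found = id_pattern.search(attrs)
        if found:
            results.append((found.group(1), text.strip()))
    return results


# Made-up markup, only to exercise the helper
sample = (
    '<a href="/maps/ms?msid=200.0004abc" class="maptitle"> My map </a>'
    '<a href="/maps/help">Help</a>'
    '<a class="maptitle" href="/maps/ms?msid=200.0004def">Other map</a>'
)
pairs = find_map_links(sample, '200')
```

Regular expressions like this are fragile against markup changes, so the sketch is best treated as a stopgap for one known page layout.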

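For anyone interested in the code itself (downloading maps from a Google Maps user), I have a separate question about it.

You won't get a really good HTML parser without adding a dependency. BeautifulSoup exists for a reason. Depending on Python code isn't that bad, and users don't need a C compiler. Also, easy_install is easy to get on Windows too.

Maybe I didn't make myself clear: I'm only looking for a way to perform the operations marked in the code without a non-standard module, not for a general-purpose parser to replace the module itself.

@heltonbiker, the logic this code performs requires parsing HTML. Note that Google Maps does have an API for retrieving data.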