Python lxml incorrectly parses the Doctype when looking for links

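I have a BeautifulSoup4 (4.2.1) parser that collects all href attributes from our template files, and until now it has worked perfectly. But with lxml installed, one of our guys now gets:

TypeError: string indices must be integers

I managed to reproduce this on my Linux Mint VM, and the only difference appears to be lxml, so I assume the problem arises when bs4 uses that HTML parser.

The problem function is: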

def collecttemplateurls(templatedir, urlslist):
    """
    Uses BeautifulSoup to extract all the external URLs from the templates dir.

    @return: list of URLs
    """
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(
                        open(path).read(),
                        parse_only=SoupStrainer(target="_blank")
                ):
                    if link["href"].startswith('http://'):
                        urlslist.append(link['href'])

                    elif link["href"].startswith('{{'):
                        for l in re.findall("'(http://(?:.*?))'", link["href"]):
                            urlslist.append(l)

    return urlslist

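So for this guy, the line

if link["href"].startswith('http://'):

raises the TypeError, because BS4 thinks the HTML Doctype is a link.

Can anyone explain what the problem is here? Nobody else can reproduce it.

I don't see how this can happen when SoupStrainer is used like this. I assume it is related to a system setup issue.

I can't see anything special about our Doctype: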

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">

<head>

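SoupStrainer does not filter out the document type. It filters which elements are kept in the document, but the doctype is retained because it is part of the 'container' for the filtered elements. You are looping over everything in the document, so the first item you encounter is the Doctype object, and indexing it with a string fails because a Doctype is a string subclass rather than a tag.

Use .find_all() on the 'strained' document: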

document = BeautifulSoup(open(path).read(), parse_only=SoupStrainer(target="_blank"))
for link in document.find_all(target="_blank"):
    ...  # handle link["href"] as before

Or filter out the Doctype object:

from bs4 import Doctype

for link in BeautifulSoup(
        open(path).read(),
        parse_only=SoupStrainer(target="_blank")
):
    # Skip the doctype node that the parser leaves in the document.
    if isinstance(link, Doctype):
        continue
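
For reference, here is a minimal sketch of how the find_all() variant could be combined with the URL checks from the question. The helper name extract_external_urls, the SAMPLE document, and the example.com URLs are made up for illustration, and the sketch uses the built-in html.parser by default so it runs without lxml. It is one possible arrangement, not the only way to do it.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer


def extract_external_urls(html, parser="html.parser"):
    """Return http:// URLs found on elements that carry target="_blank".

    Hypothetical helper for illustration; not part of the original code.
    """
    soup = BeautifulSoup(
        html, parser, parse_only=SoupStrainer(target="_blank")
    )
    urls = []
    # find_all() yields only Tag objects, so a Doctype that a parser
    # leaves in the strained document is never visited.
    for link in soup.find_all(target="_blank"):
        href = link.get("href", "")  # tolerate tags without an href
        if href.startswith("http://"):
            urls.append(href)
        elif href.startswith("{{"):
            # Template expression: pull out any quoted http:// URLs.
            urls.extend(re.findall(r"'(http://.*?)'", href))
    return urls


# Made-up sample document using the doctype from the question.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
<body>
<a target="_blank" href="http://example.com/a">A</a>
<a href="http://example.com/b">B</a>
<a target="_blank" href="{{ link('http://example.com/c') }}">C</a>
</body>
</html>"""

print(extract_external_urls(SAMPLE))
```

Using link.get("href", "") instead of link["href"] also avoids a KeyError for a target="_blank" element that has no href at all. Passing parser="lxml" should give the same list of URLs if lxml is installed.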

I could not reproduce your problem. I used BeautifulSoup version 3.2.0 and

BeautifulSoup(html, parseOnlyThese=SoupStrainer(target="_blank"))

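@Sudipta It started happening after I installed lxml (3.2.3) on my Windows machine. I'm not sure which HTML parser bs4 was using before, but I have to assume the problem now comes from lxml.

Great, thanks! I went with the find_all approach, because I have never used continue (not sure how it sits with principles and best practices!)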

continue is a perfectly fine statement; it is right up there with return and break :-)