Python: problem parsing a list of HTML pages with lxml and requests


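I have a list of URLs stored in the variable href. When I pass it to the function below, the only recipe_links returned come from the first URL in href. Is there an obvious mistake in my code? I can't see why it doesn't iterate over all 20 URLs stored in href. The results for the first URL in href are retrieved as expected, but I can't get the loop to move on to the next URL.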

def first_page_links(link):
    recipe_links = []
    recipe_html = []

    for x in link: 
        page_request = requests.get(x)
        recipe_html.append(html.fromstring(page_request.text))

        print recipe_html

        for x in recipe_html:
            recipe_links.append(x.xpath('//*[@id="content"]/ul/li/a/@href'))

            return recipe_links

Look at where the return is placed. You probably want to return after all the loops have finished:

def first_page_links(link):
    recipe_links = []
    recipe_html = []

    for x in link: 
        page_request = requests.get(x)
        recipe_html.append(html.fromstring(page_request.text))

        print recipe_html

        for x in recipe_html:
            recipe_links.append(x.xpath('//*[@id="content"]/ul/li/a/@href'))

    return recipe_links
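To see why the position of the return matters, here is a minimal sketch that uses plain strings instead of URLs. The function names returns_inside_loop and returns_after_loop are made up for illustration, and the code is written for Python 3 (the question's code uses the Python 2 print statement).

```python
def returns_inside_loop(items):
    seen = []
    for item in items:
        seen.append(item)
        # This return runs during the first iteration and ends the
        # whole function, so the loop never reaches the second item.
        return seen


def returns_after_loop(items):
    seen = []
    for item in items:
        seen.append(item)
    # Dedented: this runs once, after the loop has visited every item.
    return seen


print(returns_inside_loop(["a", "b", "c"]))  # ['a']
print(returns_after_loop(["a", "b", "c"]))   # ['a', 'b', 'c']
```

The same thing happens in the question's function: the return sits inside the loops, so the function exits while it is still handling the first URL.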

Try dedenting the second loop and the return line, so that no redundant iteration takes place and the final list is returned correctly, like so:

from lxml import html
import requests as rq

def first_page_links(links):

    recipe_links = []
    recipe_html = []

    for link in links:
        r = rq.get(link)
        recipe_html.append(html.fromstring(r.text))

    for rhtml in recipe_html:
        recipe_links.append(rhtml.xpath('//*[@id="content"]/ul/li/a/@href'))

    return recipe_links


Let us know if this works.

EDIT:

Consider the following:

y_list = []
final_list = []
for x in x_list:
    y_list.append(x)
    for y in y_list:
        final_list.append(y)

This is your function, simplified. Suppose x_list contains 3 URLs. The following happens:

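x1 is appended to y_list. At this point y_list holds only x1, so only x1 is processed and appended to final_list. final_list now contains:

[x1]

x2 is appended to y_list. y_list now contains x1 and x2, and both are processed and appended to final_list. final_list now contains:

[x1, x1, x2]

x3 is appended to y_list. y_list now contains x1, x2 and x3, so all three are processed and appended, and final_list ends up as:

[x1, x1, x2, x1, x2, x3]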

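See? :) The second loop, which processes the items of the first list, sits inside the first loop, and the first loop keeps adding to that list. As a result, the second loop re-processes the entire first list on every iteration of the first loop. That makes it a redundant iteration.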
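The walk-through can be reproduced with a short runnable sketch. The strings "x1", "x2" and "x3" stand in for the three URLs, and the function names nested and sequential are hypothetical, chosen only for this illustration (Python 3).

```python
def nested(x_list):
    """Inner loop placed inside the outer loop, as in the question."""
    y_list, final_list = [], []
    for x in x_list:
        y_list.append(x)
        for y in y_list:  # re-walks everything collected so far
            final_list.append(y)
    return final_list


def sequential(x_list):
    """Inner loop dedented so it runs once, after the first loop ends."""
    y_list, final_list = [], []
    for x in x_list:
        y_list.append(x)
    for y in y_list:  # walks the completed list exactly once
        final_list.append(y)
    return final_list


urls = ["x1", "x2", "x3"]
print(nested(urls))      # ['x1', 'x1', 'x2', 'x1', 'x2', 'x3']
print(sequential(urls))  # ['x1', 'x2', 'x3']
```

With n input items the nested version appends 1 + 2 + ... + n entries, which is where the large number of duplicates comes from.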

There are many ways to do what you want, but if you are just appending to lists and need to loop over each of the two lists once, the fix above is all you need.
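As one of those other ways, here is a hedged sketch that drops the intermediate recipe_html list and parses each page as soon as it is fetched. The helper name links_from_page and the constant LINK_XPATH are assumptions introduced here, not part of the original answer. Note one deliberate difference: it uses extend() rather than append(), so the result is a single flat list of hrefs instead of one sub-list per page. Switch back to append() if you want the results grouped by page.

```python
from lxml import html
import requests

# XPath taken from the question; it selects the href of every link in
# the list under the element with id="content".
LINK_XPATH = '//*[@id="content"]/ul/li/a/@href'


def links_from_page(page_text):
    """Return the hrefs matched by LINK_XPATH in one page's HTML source."""
    return html.fromstring(page_text).xpath(LINK_XPATH)


def first_page_links(urls):
    """Fetch every URL once and collect the matching hrefs in order."""
    recipe_links = []
    for url in urls:
        response = requests.get(url)
        # extend() adds the individual hrefs, keeping the list flat.
        recipe_links.extend(links_from_page(response.text))
    return recipe_links
```

Keeping the parsing in its own function also means it can be tried on a saved HTML string without any network access.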

So I just tried this and I got a huge number of duplicate results.

@Barnaby: could you edit the HTML into your post?

This is what recipe_links comes out as:

['recipe1', 'recipe1', 'recipe2', 'recipe1', 'recipe2', 'recipe3', 'recipe4', 'recipe1', 'recipe2', 'recipe3', 'recipe4', 'recipe5' ...]

For some reason I can't see the HTML; it shows up as

[][，][，][，[，][，][，][，][，][，]

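which is why I think I'm getting the above output for recipe_links.

How about posting the URL you are trying to fetch? Also, please check whether my answer helps.

This works exactly the way I wanted! Thanks! Could you explain why my version created redundant iterations?

Thank you very much for the detailed explanation.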