Python: how to yield one parsed item from one link and the other parsed items from other links into the same item list (Scrapy)

python web-scraping scrapy

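I have been iterating over a list of places to get the latitude, longitude and elevation for each one. The problem is that when I get the scraped results back, I can't link them to my current df, because the names I iterated over may have been modified or skipped.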

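I have managed to get the name of what I searched for, but because it is parsed from the outer (search) page rather than from the link the other items come from, it doesn't work properly.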

import scrapy
import pandas as pd
from ..items import latlonglocItem


df = pd.read_csv('wine_df_final.csv')
df = df[pd.notnull(df.real_place)]
real_place = list(set(df.real_place))


class latlonglocSpider(scrapy.Spider):


    name = 'latlonglocs'
    start_urls = []


    for place in real_place:
        baseurl =  place.replace(',', '').replace(' ', '+')
        cleaned_href = f'http://www.google.com/search?q={baseurl}+coordinates+latitude+longitude+distancesto'
        start_urls.append(cleaned_href)



    def parse(self, response):

        items = latlonglocItem()

        items['base_name'] = response.xpath('string(/html/head/title)').get().split(' coordinates')[0]
        for href in response.xpath('//*[@id="ires"]/ol/div/h3/a/@href').getall():
            if href.startswith('/url?q=https://www.distancesto'):
                yield response.follow(href, self.parse_distancesto)
            else:
                pass
        yield items

    def parse_distancesto(self, response):
        items = latlonglocItem()

        try:
            items['appellation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[2]/p/strong)').get()
            items['latitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[1]/td)').get()
            items['longitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[2]/td)').get()
            items['elevation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[10]/td)').get()
            yield items
        except Exception:
            pass

Output:
 appellation      base_name       elevation    latitude    longitude
                  Chalone, USA
 Santa Cruz, USA                  56.81        35           9.23

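What happens now is that I parse the name I searched for, then the spider follows a link and parses the rest of the information. In my data frame, however, the name I searched for ends up completely disconnected from the other items, and even then it is hard to find a match. I would like to pass the information on to the other function so that it yields all the items together.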

This might work. I'll add comments on what I'm doing and on your code, so you get a bit of an understanding of what is going on.

import scrapy
import pandas as pd
from ..items import latlonglocItem


df = pd.read_csv('wine_df_final.csv')
df = df[pd.notnull(df.real_place)]
real_place = list(set(df.real_place))


class latlonglocSpider(scrapy.Spider): # latlonglocSpider is a child class of scrapy.Spider

    name = 'latlonglocs'
    start_urls = []

    for place in real_place:
        baseurl =  place.replace(',', '').replace(' ', '+')
        cleaned_href = f'http://www.google.com/search?q={baseurl}+coordinates+latitude+longitude+distancesto'
        start_urls.append(cleaned_href)

    def __init__(self, *args, **kwargs): # Constructor for our class
        # Since we define our own constructor, we need to call the parent's constructor
        super().__init__(*args, **kwargs)
        self.base_name = None # Here is the base_name we can now use class-wide

    def parse(self, response):

        items = latlonglocItem()

        items['base_name'] = response.xpath('string(/html/head/title)').get().split(' coordinates')[0]
        self.base_name = items['base_name'] # Let's store the base_name on the instance
        for href in response.xpath('//*[@id="ires"]/ol/div/h3/a/@href').getall():
            if href.startswith('/url?q=https://www.distancesto'):
                yield response.follow(href, self.parse_distancesto)
            else:
                pass
        yield items

    def parse_distancesto(self, response):
        items = latlonglocItem()

        try:
            # If for some reason self.base_name is never assigned in parse(),
            # fall back to an empty string. The `or` expression below means:
            # use self.base_name unless it is None or empty, in which case
            # use an empty string.
            base_name = self.base_name or ""
            items['base_name'] = base_name # Attach the stored name to this item

            items['appellation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[2]/p/strong)').get()
            items['latitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[1]/td)').get()
            items['longitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[2]/td)').get()
            items['elevation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[10]/td)').get()
            yield items
        except Exception:
            pass
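A caveat with keeping `base_name` on the spider instance: every search response overwrites the same attribute, so when several requests are in flight, a detail page can pick up the name written by a different search. The toy model below is not Scrapy and is deliberately deterministic (real scheduling is asynchronous). It only illustrates the effect, with a made-up `crawl` helper and a hypothetical sample of places.

```python
def crawl(places, batch_size, use_shared_state):
    """Toy scheduler, not Scrapy: handle `batch_size` search pages, then
    the detail pages they led to. Returns (base_name, requested_place) pairs."""
    shared_base_name = None
    rows = []
    for start in range(0, len(places), batch_size):
        pending = []
        for place in places[start:start + batch_size]:
            shared_base_name = place  # mimics self.base_name = ... in parse()
            pending.append({"requested": place, "carried": place})
        for request in pending:
            # mimics parse_distancesto(): read the shared value or the carried one
            name = shared_base_name if use_shared_state else request["carried"]
            rows.append((name, request["requested"]))
    return rows


def count_mismatches(rows):
    """Count rows whose base_name differs from the place actually requested."""
    return sum(1 for name, requested in rows if name != requested)


places = ["Chalone, USA", "Santa Cruz, USA", "Napa, USA"]  # hypothetical sample

print(count_mismatches(crawl(places, 3, use_shared_state=True)))   # 2
print(count_mismatches(crawl(places, 1, use_shared_state=True)))   # 0
print(count_mismatches(crawl(places, 3, use_shared_state=False)))  # 0
```

In this model the shared attribute only stays consistent when pages are handled one at a time, whereas a value carried with each request matches at any batch size.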

Thanks to Error - Syntactical Remorse. For this to work, concurrent requests had to be set to 1 and base_name had to be put inside the loop.
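Limiting Scrapy to a single request in flight is done through the `CONCURRENT_REQUESTS` setting. A minimal sketch, assuming the project has the usual `settings.py`; the same key could alternatively go in a `custom_settings` dict on the spider class.

```python
# settings.py (fragment)
# Allow only one request in flight at a time, so a value stored on the
# spider is less likely to be overwritten before the follow-up page is parsed.
CONCURRENT_REQUESTS = 1
```

This trades crawl speed for predictability, and the scheduler still decides the order of callbacks, so it is best treated as a workaround rather than a guarantee.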

What is latlonglocItem()?

latlonglocItem() calls an item list. It basically sets up the data frame with the column names to be filled in.

Can you splice in the name when you get the lat, long and elev? Do you have a sample response you could add?

The problem is that when it goes into the link, what I parse isn't necessarily what I started out looking for. That's why I parse the name outside before going in, but then I run into the problem that the items don't line up with the name I was looking for in the beginning.

items['base_name'] doesn't change across all the lat/longs, right? You could put the methods in a class and create a class variable for base_name, or use a global variable. (I'd recommend the class first, then the global.) I can give you some sample code if that would work.

I've never seen a for loop inside a class but outside a method, so I don't know whether that would be a bit odd.

It worked, and it didn't :). What you did was very helpful: now I can get a base_name alongside the other fields, but the base_name doesn't match them at all. I'll keep trying; maybe I need a way to put it in the for loop, I'm not sure. But you really helped me a lot, and now I have more things to try. I can't upvote you because I don't have enough reputation.

@BB No problem. Good luck :)

You can see how I modified your answer. I had to put base_name in the loop and set concurrent requests from 13 to 1 to get it to work. Otherwise the base_name wouldn't match the other fields. Thanks.