Python: scraping pages with different data elements when all span tags share the same class names

python xpath web-scraping scrapy

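The problem I am facing is that different pages have different data elements; the two pages I linked are examples. As you can see, the first link has an Applicant, but the second does not, and the project description contains further links. Both pages use the same span classes ("name" and "value") for every field. So how do I scrape each field into the variable I want? For example: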

applicant = response.xpath('//.......')  # I want the applicant data here

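I want to get the applicant's data here. But I cannot use the div[index] form, because different pages have different data elements, so the same index points to a different field from page to page. The span classes are identical for every field, so I cannot select the data I need by a specific name either.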

My attempt on the second link:

def parse_product(self, response):
   title = response.xpath('//div[@id="detailseite"]//div[@class="details"]/h3/text()').extract()
   area = response.xpath('//div[@class="firstUnderAntragsbeteiligte"]/span[@class="value"]/text()').extract()
   website = response.xpath('//div[@class="details"]//span[@class="value"]/a/@href').extract()
   identifier = response.xpath('//div[@class="projektnummer"]/span[@class="value"]/text()').extract()
   description = response.xpath('//div[@id="projekttext"]/text()').extract()
   programme = response.xpath('//div[@id="projektbeschreibung"]/div[2]/span[@class="value"]/text()').extract()
   institution = response.xpath('//div[@id="projektbeschreibung"]/div[3]/span[@class="value"]/a/text()').extract()
   institution_add = response.xpath('//div[@id="projektbeschreibung"]/div[3]/span[@class="value"]/text()').extract()
   spokeperson = response.xpath('//div[@id="projektbeschreibung"]/div[4]/span[@class="value"]/a/text()').extract()
   spokeperson_add = response.xpath('//div[@id="projektbeschreibung"]/div[4]/span[@class="value"]/text()').extract() 
   scientist = response.xpath('//div[@id="projektbeschreibung"]/div[5]/span[@class="value"]/a/text()').extract()
   programme_contact = response.xpath('//div[@class="dfg_contact"]/span[2]/span/a/text()').extract()

HTML of the first link:

    <div class="details">
       <h3>Symbioses in Macaranga: ontogeny and partner conflicts</h3>
       <div>
          <span class="name">
          Applicant
          </span>
          <span class="value">
             <a class="intern" href="/gepris/person/1174218">Professor Dr. Ulrich  Maschwitz</a>                                                                                        
          </span><!-- value -->
    </div>
    <div class="firstUnderAntragsbeteiligte">
         <span class="name">Subject Area</span>
         <span class="value">
    Zoology                                                                                                                                                         
         </span>
    </div>
    <div>
         <span class="name">Term </span>
         <span class="value">
         from 1999 to 2002
         </span><!-- value -->
    </div>
    <div class="projektnummer">
         <span class="name">Project identifier</span>
         <span class="value">Deutsche Forschungsgemeinschaft (DFG) - Projekt number 5214212</span>
    </div>
    </div>

Assuming you want a JSON output:

features = response.xpath('//div[@class="details"]//'
                          'span[@class="name"]'
                          '//text()').getall()
values = response.xpath(
    '//div[@class="details"]//'
    'span[@class="value" and '
    'not(contains(text(), "\t"))]'
    '/text()'
    '|//div[@class="details"]'
    '//span[@class="value"]'
    '/a/text()').getall()
out = dict(zip(features, values))

The output is a JSON object that uses the class="name" texts as names and the class="value" texts as values.
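
One caveat with this approach, sketched here with made-up lists standing in for the two .getall() results: zip() pairs items purely by position, so the mapping is only right when both lists contain exactly one entry per field. If the values query also returns a whitespace-only text node, or skips a field, every later pair shifts. The lists below are illustrative assumptions, not scraped output.

```python
# Illustrative lists standing in for the two .getall() results (assumed data).
features = ["Applicant", "Subject Area", "Term"]

# Suppose the values query also returned the whitespace-only text node
# that sits in front of the <a> element inside the first value span.
values = ["\n   ", "Professor Dr. Ulrich Maschwitz", "Zoology", "from 1999 to 2002"]

# zip() pairs by position and stops at the shorter list,
# so one stray entry shifts every following value by one field.
shifted = dict(zip(features, values))

# Dropping whitespace-only entries restores the alignment in this case,
# but only as long as every field yields exactly one value.
cleaned = [v.strip() for v in values if v.strip()]
aligned = dict(zip(features, cleaned))

print(shifted)
print(aligned)
```

Because of this, pairing each name with the value found inside the same div is usually the safer choice when fields can be missing.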

Update: for the problem you mentioned in the comments, here is a parse() method:

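def parse(self, response):
    res = {}
    # Each direct child div of the details box holds one name/value pair.
    for detail in response.xpath('//div[@class="details"]/div'):
        items = list()
        for item in detail.xpath('./span[@class="value" and contains(text(), "") or @class="name"]/text()').getall():
            # Skip whitespace-only text nodes.
            if item.strip() != '':
                items.append(item.strip())
        # The first text is the field name, the remaining texts are its values.
        res.update({
            items[0]: items[1:]
        })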

The value of res at the end:

{'Applicant': ['Universität zu Köln', 'Department für Geowissenshaften', 'Geographisches Institut', 'Gottfried Wilhelm Leibniz Universität Hannover', 'Institut für Wirtschafts- und Kulturgeographie'], 'Subject Area': ['Human Geography'], 'Term': ['since 2015'], 'Project identifier': ['Deutsche Forschungsgemeinschaft (DFG) - Projekt number 275355279']}

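Comments:

Please include the relevant HTML in your post instead of just linking to the page. No, not as an image. Also, can you highlight which data you want to get?

Ah, now things are clearer. Yes, the spans have the same classes, but that is fine! As far as I can tell, no span elements are hard to tell apart. All you need to do is iterate over the outer divs and associate the content of each span with class "name" with the content of its sibling span with class "value". Together they form a key-value pair, or mapping, just like a dictionary in Python.

In every div there are two spans: get both from the div, and use the text in the first span to identify what you got in the second span.

But what if some of the class="value" spans have no text of their own and have a child a element instead? You can see this in the first value span of the first HTML snippet I posted.

You can use //text(), which gets the text of all descendants, instead of /text(), which gets only the text nodes directly inside the element.

But apparently not; it leaves the text() of the first span blank. The output looks like this: {'Applicant': '', 'SubjectArea': 'ProfessorDr.UlrichMaschwitz', 'Term': '', 'Projectidentifier': 'Zoology', 'DFGProgramme': 'from1999to2002', 'InternationalConnection': 'DeutscheForschungsgemeinschaft(DFG)-Projektnumber521212', 'Participations': 'ResearchGrants'}

Aha, I see, this needs some extra characters, let me think.

@adrian I just replied to your other comment on the main post.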

            <div class="details">
                <h3>
                GRK 6:&nbsp;
                Spatial Statistics
                </h3>
                <div class="firstUnderAntragsbeteiligte">
                    <span class="name">Subject Area</span>
                    <span class="value">
                    Mathematics
                    </span>
                </div>
                <div>
                    <span class="name">Term </span>
                    <span class="value">
                    from 1997 to 2003
                    </span><!-- value -->
                </div>
                <div>
                    <span class="name">Website</span>
                    <span class="value">
                <a class="extern" href="http://www.mathe.tu-freiberg.de/math/inst/stoch/Gradu/index.html" title="Website" target="_blank">
                    Homepage
                </a>
                    </span>
                </div>
                <div class="projektnummer">
                    <span class="name">Project identifier</span>
                    <span class="value">Deutsche Forschungsgemeinschaft (DFG) - Projekt number 268853</span>
                </div>
            </div>

Title: Symbioses in Macaranga: ontogeny and partner conflicts
Applicant: Professor Dr. Ulrich Maschwitz
Subject Area: Zoology
Term: from 1999 to 2002
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Projekt number 5214212

Title: GRK 6:  Spatial Statistics
Subject Area: Mathematics
Term: from 1997 to 2003
Website: http://www.mathe.tu-freiberg.de/math/inst/stoch/Gradu/index.html
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Projekt number 268853