Python 2.7: how do I use a for loop in Scrapy?
I'm using Scrapy on a project where I extract information from XML documents. This is the part of the XML I want to iterate over with a for loop:
<relatedPersonsList>
    <relatedPersonInfo>...</relatedPersonInfo>
    <relatedPersonInfo>
        <relatedPersonName>
            <firstName>Mark</firstName>
            <middleName>E.</middleName>
            <lastName>Lucas</lastName>
        </relatedPersonName>
        <relatedPersonAddress>
            <street1>1 IMATION WAY</street1>
            <city>OAKDALE</city>
            <stateOrCountry>MN</stateOrCountry>
            <stateOrCountryDescription>MINNESOTA</stateOrCountryDescription>
            <zipCode>55128</zipCode>
        </relatedPersonAddress>
        <relatedPersonRelationshipList>
            <relationship>Executive Officer</relationship>
            <relationship>Director</relationship>
        </relatedPersonRelationshipList>
        <relationshipClarification/>
    </relatedPersonInfo>
    <relatedPersonInfo>...</relatedPersonInfo>
    <relatedPersonInfo>...</relatedPersonInfo>
    <relatedPersonInfo>...</relatedPersonInfo>
    <relatedPersonInfo>...</relatedPersonInfo>
    <relatedPersonInfo>...</relatedPersonInfo>
</relatedPersonsList>
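Outside Scrapy, the same "one record per repeated element" pattern can be checked with the standard library's ElementTree. This is a minimal sketch using a trimmed copy of the sample above (the second person, "Jane Doe", is hypothetical, added only so the loop has something to iterate over; she deliberately has no middle name to show the default):

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<relatedPersonsList>
  <relatedPersonInfo>
    <relatedPersonName>
      <firstName>Mark</firstName>
      <middleName>E.</middleName>
      <lastName>Lucas</lastName>
    </relatedPersonName>
  </relatedPersonInfo>
  <relatedPersonInfo>
    <relatedPersonName>
      <firstName>Jane</firstName>
      <lastName>Doe</lastName>
    </relatedPersonName>
  </relatedPersonInfo>
</relatedPersonsList>"""

root = ET.fromstring(SAMPLE)

people = []
# One dict per <relatedPersonInfo> node, exactly like one item per person.
for person in root.findall('.//relatedPersonInfo'):
    # findtext returns the default when the element is missing,
    # which mirrors the "NA" fallback used for middleName below.
    people.append({
        'firstName': person.findtext('./relatedPersonName/firstName'),
        'middleName': person.findtext('./relatedPersonName/middleName', 'NA'),
        'lastName': person.findtext('./relatedPersonName/lastName'),
    })

print(people)
```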
Here is the code I use in my spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import XmlXPathSelector
from scrapy.http import Request
import urlparse
from formds.items import SecformD

class SecDform(CrawlSpider):
    name = "DFORM"
    allowed_domain = ["http://www..gov"]
    start_urls = [
        ""
    ]
    rules = (
        Rule(
            SgmlLinkExtractor(restrict_xpaths=["/html/body/div/table/tr/td[3]/a[2]"]),
            callback='parse_formd',
            #follow= True no need of follow thing
        ),
        Rule(
            SgmlLinkExtractor(restrict_xpaths=('/html/body/div/center[1]/a[contains(., "[NEXT]")]')),
            follow=True
        ),
    )

    def parse_formd(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="formDiv"]/div/table/tr[3]/td[3]/a/@href').extract()
        for site in sites:
            yield Request(url=urlparse.urljoin(response.url, site), callback=self.parse_xml_document)

    def parse_xml_document(self, response):
        xxs = XmlXPathSelector(response)
        item = SecformD()
        item["stateOrCountryDescription"] = xxs.select('./primaryIssuer/issuerAddress/stateOrCountryDescription/text()').extract()[0]
        item["zipCode"] = xxs.select('./primaryIssuer/issuerAddress/zipCode/text()').extract()[0]
        item["issuerPhoneNumber"] = xxs.select('./primaryIssuer/issuerPhoneNumber/text()').extract()[0]
        for person in xxs.select('./relatedPersonsList//relatedPersonInfo'):
            #item = SecDform()
            item["firstName"] = person.select('./relatedPersonName/firstName/text()').extract()[0]
            item["middleName"] = person.select('./relatedPersonName/middleName/text()')
            if item["middleName"]:
                item["middleName"] = item["middleName"].extract()[0]
            else:
                item["middleName"] = "NA"
        return item
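The root of the problem in the spider above is that a single item is mutated on every pass of the loop and returned once after it, so only the last person's data survives. The same pitfall can be shown with plain dicts, no Scrapy required (the names here are placeholders):

```python
people = ["Mark", "Jane", "Bob"]

# Wrong: one shared dict mutated in a loop. Every append stores a
# reference to the SAME object, so all entries end up identical.
item = {}
shared = []
for name in people:
    item["firstName"] = name
    shared.append(item)

# Right: a new dict per iteration gives independent records.
fresh = []
for name in people:
    item = {}
    item["firstName"] = name
    fresh.append(item)

print([d["firstName"] for d in shared])  # ['Bob', 'Bob', 'Bob']
print([d["firstName"] for d in fresh])   # ['Mark', 'Jane', 'Bob']
```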
I extract the information to a .json file with the following command:

scrapy crawl DFORM -o tes4.json -t json

Try the following:
def parse_xml_document(self, response):
    xxs = XmlXPathSelector(response)
    items = []
    # common field values
    stateOrCountryDescription = xxs.select('./primaryIssuer/issuerAddress/stateOrCountryDescription/text()').extract()[0]
    zipCode = xxs.select('./primaryIssuer/issuerAddress/zipCode/text()').extract()[0]
    issuerPhoneNumber = xxs.select('./primaryIssuer/issuerPhoneNumber/text()').extract()[0]
    for person in xxs.select('./relatedPersonsList//relatedPersonInfo'):
        # instantiate one item per loop iteration
        item = SecformD()
        # save common parameters
        item["stateOrCountryDescription"] = stateOrCountryDescription
        item["zipCode"] = zipCode
        item["issuerPhoneNumber"] = issuerPhoneNumber
        item["firstName"] = person.select('./relatedPersonName/firstName/text()').extract()[0]
        item["middleName"] = person.select('./relatedPersonName/middleName/text()')
        if item["middleName"]:
            item["middleName"] = item["middleName"].extract()[0]
        else:
            item["middleName"] = "NA"
        items.append(item)
    return items
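Instead of collecting everything into a list and returning it, a Scrapy callback can also yield each item inside the loop, since callbacks accept generators. A minimal stand-in for that pattern, using plain dicts in place of `SecformD`:

```python
def parse_people(names):
    # Yielding one record per iteration is equivalent to appending
    # to a list and returning it, but avoids building the list by hand.
    for name in names:
        yield {"firstName": name}

items = list(parse_people(["Mark", "Jane"]))
print(items)
```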
Comments on this answer:

- Try the //relatedPersonInfo XPath. Can you show us the full code of the spider?
- "I'm still only getting the first person's information." How do you know what information you are getting? Are you printing the items' contents somewhere? If so, please share the code you use to do that.
- Hi Paul, thanks for the help, but unfortunately after making all the changes I still get the same result (the first person's information only).
- @Tony, when I run the code I posted in the scrapy shell, I get information for 8 people. Please post your exact spider code somewhere like gist.github.com so we can investigate.
- You were completely right, Paul, sorry. My mistake was with stateOrCountryDescription: I was still using the old item["stateOrCountryDescription"]... Thank you very much.
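The //relatedPersonInfo suggestion in the comments comes down to child vs. descendant axes: ./relatedPersonInfo matches only direct children of the context node, while // (spelled .// in ElementTree) searches all descendants. A quick standalone check, using a hypothetical <root> wrapper element:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root>"
    "<relatedPersonsList>"
    "<relatedPersonInfo/><relatedPersonInfo/>"
    "</relatedPersonsList>"
    "</root>"
)

# './relatedPersonInfo' looks only at direct children of <root>: no matches,
# because the nodes sit one level down inside <relatedPersonsList>.
direct = doc.findall('./relatedPersonInfo')

# './/relatedPersonInfo' searches all descendants, like XPath '//relatedPersonInfo'.
anywhere = doc.findall('.//relatedPersonInfo')

print(len(direct), len(anywhere))  # 0 2
```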