Python: scraping HTML into a CSV file
I want to extract content such as side effects, warnings, and dosage from the site given in the start URL. Below is my code. The CSV file is created, but it contains nothing. The output is:
before for
[] # it is displaying empty list
after for
Here is my code:
from scrapy.selector import Selector
from medicinelist_sample.items import MedicinelistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MedSpider(CrawlSpider):
    name = "med"
    allowed_domains = ["medindia.net"]
    start_urls = ["http://www.medindia.net/doctors/drug_information/home.asp?alpha=z"]
    rules = [Rule(SgmlLinkExtractor(allow=('Zafirlukast.htm',)), callback="parse", follow = True),]
    global Selector

    def parse(self, response):
        hxs = Selector(response)
        fullDesc = hxs.xpath('//div[@class="report-content"]//b/text()')
        final = fullDesc.extract()
        print "before for" # this is just to see if it was printing
        print final
        print "after for" # this is just to see if it was printing
The parse method of your Scrapy spider class should return item(s). With your current code, I don't see any items being returned. For example:
def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)
    sel = Selector(response)
    item = Item()
    item['id'] = sel.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
    item['name'] = sel.xpath('//td[@id="item_name"]/text()').extract()
    item['description'] = sel.xpath('//td[@id="item_description"]/text()').extract()
    return item
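To see why returning items matters for the CSV: an export can only contain what the callback hands back. The sketch below is not Scrapy's exporter. It uses the standard csv module (Python 3) to illustrate that each returned item becomes one row, and that a callback returning nothing leaves a file with no data rows. The function name items_to_csv and the sample item are made up for illustration.

```python
import csv
import io


def items_to_csv(items, fieldnames):
    """Write one CSV row per item (a dict) and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()


# Hypothetical item, shaped like what a callback might return.
sample_items = [{"name": "Zafirlukast", "dosage": "20 mg twice daily"}]

with_items = items_to_csv(sample_items, ["name", "dosage"])
without_items = items_to_csv([], ["name", "dosage"])  # callback returned nothing
```

With no items, only the header line is written, which roughly matches the "file is created but empty" symptom from the question.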
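For more information, see the Scrapy documentation.

Another problem in your code is that you are overriding the CrawlSpider's parse method to implement your callback logic. This must not be done with a CrawlSpider, because it uses the parse method in its own logic. Ashish Nitin Patil already pointed this out implicitly by naming his example function parse_item.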
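A toy model may help show why overriding parse is a problem. This is plain Python 3 and deliberately not Scrapy's real implementation: ToyCrawlSpider, GoodSpider, and BrokenSpider are hypothetical classes that only mimic the idea of a base class whose parse method dispatches to the callbacks named in the rules.

```python
class ToyCrawlSpider:
    """Simplified stand-in for a rule-driven spider (not Scrapy's real code)."""

    # Each rule pairs a URL fragment with the name of a callback method.
    rules = []

    def parse(self, url):
        # The base class uses parse() itself to dispatch to rule callbacks.
        results = []
        for fragment, callback_name in self.rules:
            if fragment in url:
                results.append(getattr(self, callback_name)(url))
        return results


class GoodSpider(ToyCrawlSpider):
    rules = [("Zafirlukast.htm", "parse_item")]

    def parse_item(self, url):
        return {"url": url}


class BrokenSpider(ToyCrawlSpider):
    rules = [("Zafirlukast.htm", "parse_item")]

    def parse(self, url):
        # Overriding parse() replaces the dispatch logic above,
        # so parse_item is never reached.
        return []

    def parse_item(self, url):
        return {"url": url}


page = "http://www.medindia.net/doctors/drug_information/Zafirlukast.htm"
good_result = GoodSpider().parse(page)
broken_result = BrokenSpider().parse(page)
```

In this model GoodSpider returns one item for the page, while BrokenSpider returns nothing even though its parse_item is perfectly fine.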
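The default implementation of CrawlSpider's parse method basically calls the callbacks you specified in the rule definitions, so if you override it, I think your callbacks won't be called at all.

I just did some experiments with the site you are crawling. Since you want to extract data about medicines (such as name, indications, contraindications, etc.) from different pages of this domain: would the following or similar XPath expressions suit your needs? I think your current query only gives you the "headings", while the actual information on this site is in the text nodes that follow those bold headings.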
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

class Medicine(Item):
    name = Field()
    dosage = Field()
    indications = Field()
    contraindications = Field()
    warnings = Field()

class TestmedSpider(CrawlSpider):
    name = 'testmed'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['medindia.net']
    start_urls = ['http://www.medindia.net/doctors/drug_information/']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'Zafirlukast.htm'), callback='parse_item', follow=True),
    )

    # Named parse_item so that CrawlSpider's own parse method is not overridden
    def parse_item(self, response):
        drug_info = Medicine()
        selector = Selector(response)
        # Each query takes the first text node after the matching bold heading
        name = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Generic Name')]//..//following-sibling::text()[1])''')
        dosage = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Dosage')]//..//following-sibling::text()[1])''')
        indications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Why it is prescribed (Indications)')]//..//following-sibling::text()[1])''')
        contraindications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Contraindications')]//..//following-sibling::text()[1])''')
        warnings = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Warnings and Precautions')]//..//following-sibling::text()[1])''')
        drug_info['name'] = name.extract()
        drug_info['dosage'] = dosage.extract()
        drug_info['indications'] = indications.extract()
        drug_info['contraindications'] = contraindications.extract()
        drug_info['warnings'] = warnings.extract()
        return drug_info
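The idea behind those XPath expressions (take the text node that follows a bold heading, not the heading itself) can be tried out without Scrapy. The sketch below uses Python 3's standard xml.etree.ElementTree on a made-up, simplified snippet. The markup is an assumption about the page's general shape, not a copy of the real site, and the colon separator is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: bold headings followed by plain text.
snippet = """
<div class="report-content">
  <b>Generic Name</b> : Zafirlukast<br/>
  <b>Dosage</b> : 20 mg twice daily<br/>
</div>
"""

root = ET.fromstring(snippet.strip())

fields = {}
for bold in root.iter("b"):
    heading = bold.text.strip()
    # .tail is the text node that follows the closing </b> tag.
    value = (bold.tail or "").strip(" :\n")
    fields[heading] = value
```

Here bold.text is the heading (what the original query returned) and bold.tail is the value after it, which is the part you actually want in the CSV.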
This gives you the following:
>scrapy parse --spider=testmed --verbose -d 2 -c parse_item --logfile C:\Python27\Scripts\Test\Test\spiders\test.log http://www.medindia.net/doctors/drug_information/Zafirlukast.htm
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[{'contraindications': [u'Hypersensitivity.'],
  'dosage': [u'Adult- The recommended dose is 20 mg twice daily.'],
  'indications': [u'This medication is an oral leukotriene receptor antagonist (LTRA), prescribed for asthma. \xa0It blocks the action of certain natural substances that cause swelling and tightening of the airways.'],
  'name': [u'\xa0Zafirlukast'],
  'warnings': [u'Caution should be exercised in patients with history of liver disease, mental problems, suicidal thoughts, any allergy, elderly, during pregnancy and breastfeeding.']}]
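Note that each field in this output is a one-element list and some values carry non-breaking spaces (\xa0). If you want plain strings in the CSV, one possible post-processing step is sketched below in Python 3. The helper name clean_item is hypothetical, and the sample data is abbreviated from the output above.

```python
def clean_item(raw_item):
    """Join each list of extracted strings into one trimmed string."""
    cleaned = {}
    for key, values in raw_item.items():
        text = " ".join(values)
        # Replace non-breaking spaces, then collapse runs of whitespace.
        text = " ".join(text.replace("\xa0", " ").split())
        cleaned[key] = text
    return cleaned


raw = {
    "name": ["\xa0Zafirlukast"],
    "contraindications": ["Hypersensitivity."],
    "indications": ["prescribed for asthma. \xa0It blocks the action"],
}
cleaned = clean_item(raw)
```

This kind of cleanup could live at the end of parse_item or in an item pipeline; where to put it is a design choice, not something the answer above prescribes.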
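Yes, I realized later that I needed to return the item. Thanks. But the content I want is not ending up in the CSV file; instead, all the text from the HTML shows up in the CSV. How do I get only the content I need? I'd also like to ask why .xpath() should be used instead of .select(). I am using .xpath(), but it does not give me the right result.

Whether to use .xpath() or .select() depends on the use case. Which URL are you scraping, and what do you want to extract from it?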