Importing a .csv with URLs and processing it (Python)
I'm writing a script that imports a list of URLs and then checks the page source for a few things. I need help with importing the .csv and processing it. If anyone can help, here is part of the code:
from lxml import html
import csv

def main():
    with open('urls.csv', 'r') as csvfile:
        urls = [row[0] for row in csv.reader(csvfile)]
    for url in urls:
        doc = html.parse(url)
        linkziel = 'http://dandydiary.de/de'
        if doc.xpath('//a[@href=$url]', url=linkziel):
            for anchor_node in doc.xpath('//a[@href=$url]', url=linkziel):
                if anchor_node.xpath('./ancestor::div[contains(@class, "sidebar")]'):
                    print 'Sidebar'
                elif anchor_node.xpath('./parent::div[contains(@class, "widget")]'):
                    print 'Sidebar'
                elif anchor_node.xpath('./ancestor::div[contains(@class, "comment")]'):
                    print 'Kommentar'
                elif anchor_node.xpath('./ancestor::div[contains(@id, "comment")]'):
                    print 'Kommentar'
                elif anchor_node.xpath('./ancestor::div[contains(@class, "foot")]'):
                    print "Footer"
                elif anchor_node.xpath('./ancestor::div[contains(@id, "foot")]'):
                    print "Footer"
                elif anchor_node.xpath('./ancestor::div[contains(@class, "post")]'):
                    print "Contextual"
                else:
                    print 'Unidentified Link'
        else:
            print 'Link is Dead'

if __name__ == '__main__':
    main()
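Instead of specifying just one URL, I would like to use a csv that the script runs through. I'm using Python 2.

Python ships with the csv module, which you can use to import the list. Assuming you have a file input.csv with one URL on each line: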
http://de.wikipedia.org
http://spiegel.de
http://www.vickysmodeblog.com/
You can then read it into a list with the csv module and iterate over it:
import csv
from lxml import html

with open('input.csv', 'r') as csvfile:
    urls = [row[0] for row in csv.reader(csvfile)]

for url in urls:
    print url
    doc = html.parse(url)
    linkziel = 'http://dandydiary.de/de'
    if doc.xpath('//a[@href=$url]', url=linkziel):
        for anchor_node in doc.xpath('//a[@href=$url]', url=linkziel):
            if anchor_node.xpath('./ancestor::div[contains(@class, "sidebar")]'):
                print 'Sidebar'
            elif anchor_node.xpath('./parent::div[contains(@class, "widget")]'):
                print 'Sidebar'
            elif anchor_node.xpath('./ancestor::div[contains(@class, "comment")]'):
                print 'Kommentar'
            elif anchor_node.xpath('./ancestor::div[contains(@id, "comment")]'):
                print 'Kommentar'
            elif anchor_node.xpath('./ancestor::div[contains(@class, "foot")]'):
                print "Footer"
            elif anchor_node.xpath('./ancestor::div[contains(@id, "foot")]'):
                print "Footer"
            elif anchor_node.xpath('./ancestor::div[contains(@class, "post")]'):
                print "Contextual"
            else:
                print 'Unidentified Link'
    else:
        print 'Link is Dead'
This outputs:
http://de.wikipedia.org
Link is Dead
http://spiegel.de
Link is Dead
http://www.vickysmodeblog.com/
Contextual
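One thing to watch with `row[0]`: `csv.reader` yields an empty list for a blank line, so a stray empty line in the file would raise an `IndexError`. Below is a minimal sketch of a more forgiving reader. The helper name `read_urls` is made up for illustration, and `io.StringIO` only stands in for the real file so the snippet runs on its own (it is written for Python 3; on Python 2 you would pass the opened file object instead).

```python
import csv
import io

def read_urls(csvfile):
    """Return the first column of every non-empty row as a list of URLs."""
    urls = []
    for row in csv.reader(csvfile):
        # csv.reader yields [] for blank lines, so guard before using row[0]
        if row and row[0].strip():
            urls.append(row[0].strip())
    return urls

# Stand-in for open('input.csv'); note the blank line in the middle
sample = io.StringIO(
    "http://de.wikipedia.org\n"
    "\n"
    "http://spiegel.de\n"
    "http://www.vickysmodeblog.com/\n"
)
urls = read_urls(sample)
print(urls)
```

With a real file you would call `read_urls(csvfile)` inside the `with open(...)` block and loop over the returned list exactly as before.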
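Thanks, works great. One problem remains, though: I have 3 input URLs in the csv and it does run through them, but instead of telling me that 2 URLs don't contain the link and 1 has it in the sidebar (for example), it gives only one line instead of 3.

Then it sounds like an XPath problem. Regarding http://www.vickysmodeblog.com/: it correctly finds the a node, but that node is not inside a div with "sidebar" in its class.

I updated the code preview in the main post, maybe that helps clarify. It seems to read only the last line of the csv rather than all of them.

No, I've checked: print url prints every URL in the csv. Do you see that?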
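One possible cause of "only one line instead of 3", assuming the checks ended up dedented out of the `for` loop: the loop then only rebinds `url`, and the checks run once, for the last URL. The sketch below shows the difference without any network access. `classify` is a hypothetical stand-in for the `html.parse` and XPath checks, not part of lxml.

```python
def classify(url):
    # Hypothetical stand-in for the html.parse + XPath checks
    return 'checked ' + url

urls = [
    'http://de.wikipedia.org',
    'http://spiegel.de',
    'http://www.vickysmodeblog.com/',
]

# Variant 1: the check sits OUTSIDE the loop body, so it runs only once,
# with whatever value the loop variable had on its last iteration.
results_wrong = []
for url in urls:
    current = url
results_wrong.append(classify(current))

# Variant 2: the check sits INSIDE the loop body, so it runs per URL.
results_right = []
for url in urls:
    results_right.append(classify(url))

print(len(results_wrong))   # one result, for the last URL only
print(len(results_right))   # one result per URL
```

If your script behaves like variant 1, check that everything from `doc = html.parse(url)` down to the final `else` is indented one level deeper than the `for` line.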