How do I crawl only specific URLs from a CSV file using Python?
I have a CSV file containing a lot of URLs, all with different domain extensions (.com, .eu, .org, and so on). I only want to crawl the domains with the .nl extension. I'm on Python 2.7 and tried to do this with the line if '.nl' in row:
from selenium import webdriver
import csv

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion']

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'output!.csv'
driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')
keywords = ['@media', 'googleadservices.com/pagead/conversion']

csv_writerheader(csv_output_file)

with open('top1m-edited.csv') as example_file:
    example_reader = csv.reader(example_file)
    for row in example_reader:
        # INITIALIZE DICT
        data = {'Website': row}
        if '.nl' in row:  # MAKING THE DOMAIN DISTINCTION HERE
            try:
                driver.get(row[0])
                html = driver.page_source
                for searchstring in keywords:
                    if searchstring.lower() in html.lower():
                        print (row, searchstring, 'FOUND!')
                        data[searchstring] = 'FOUND!'
                    else:
                        print (row, searchstring, 'not found')
                        data[searchstring] = 'not found'
                csv_writer(data, csv_output_file)
            except:
                pass
Printed output:
C:\Python27\python.exe "C:/Users/Jacob/PycharmProjects/Testing/fooling around 2.py"
Process finished with exit code 0
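So in this state my script does basically nothing, apart from exporting a CSV file with almost no results.
However, when I leave out the if '.nl' in row: line, the script works perfectly well.
What should I change so that the script only imports and scrapes the .nl domain URLs?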
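In the loop
for row in example_reader:
row is a list (one string per CSV column). The test '.nl' in row therefore looks for a list item that is exactly '.nl', not for a URL that contains '.nl', so it never matches. You have a few options. If the CSV file contains only a single column of URLs, you can change: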
if '.nl' in row:
to this:
if '.nl' in row[0]:
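The difference between the two tests can be checked in isolation. This is a minimal sketch; the row value is a made-up example standing in for one line yielded by csv.reader:

```python
# A row from csv.reader is a list of strings, one per column.
# 'http://example.nl' is a hypothetical value, not taken from the real file.
row = ['http://example.nl']

# List membership: is any element of the list exactly equal to '.nl'?
in_list = '.nl' in row

# Substring test: does the first cell contain the text '.nl'?
in_first_cell = '.nl' in row[0]

print(in_list)        # False
print(in_first_cell)  # True
```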
Edit: in addition, anything you assign from row needs to change to row[0], for example data = {'Website': row[0]}
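Note that a plain substring test also matches URLs such as http://example.nl.com, where .nl is not the top-level domain. If that matters, one possible stricter check is to parse the host name and test its ending. The sketch below is only an illustration under assumptions: the helper name is_nl_domain and the sample rows are hypothetical, and the import is written for Python 3 (on Python 2.7 the equivalent is from urlparse import urlparse).

```python
from urllib.parse import urlparse  # Python 2.7: from urlparse import urlparse


def is_nl_domain(url):
    """Return True if the host name of `url` ends with '.nl' (hypothetical helper)."""
    # A bare domain such as 'example.nl' has no scheme, so urlparse would
    # treat it as a path. Prefixing '//' makes it parse as a host name.
    if '://' not in url:
        url = '//' + url
    host = urlparse(url).hostname or ''
    return host.lower().endswith('.nl')


# Hypothetical rows, shaped like the single-column lists csv.reader yields.
rows = [
    ['http://example.nl'],
    ['example.com'],
    ['www.example.nl/page'],
    ['http://example.nl.com'],
]

# Keep only the first cell of rows whose host name ends with '.nl'.
nl_urls = [r[0] for r in rows if r and is_nl_domain(r[0])]
print(nl_urls)  # ['http://example.nl', 'www.example.nl/page']
```

In the script from the question, the condition would then read if is_nl_domain(row[0]): instead of if '.nl' in row[0]:.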