Python: odd column counts returned when scraping with lxml

I am learning Python and trying to build a scraper that collects part data from a supplier's site. My current problem is that I get different column counts from the parsed table rows, even though I know every row has the same number of columns. I must be overlooking something, and after two days of trying different things I am asking for more eyes on my code to help locate my mistake. My lack of Python experience is undoubtedly my biggest obstacle.

First, the data. Rather than paste the HTML I have stored in my database, I will give you a link to the live site that I crawled and stored in the database. The first link is [link missing].

The problem is that the results are mostly correct, but every so often the values land in the wrong columns. I can't seem to find the cause.

Here is an example of a flawed result:
----------------------------------------------------------------------------------
Record: 1 Section:Passenger / Light Truck Make: ACURA SubMake:
Model: CL SubModel: Year: 1997 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:1 Row 6 Col 1 part Air Filter
Rec:1 Row 6 Col 2 2
Rec:1 Row 6 Col 3 part_no 46395
Rec:1 Row 6 Col 4 filter_loc
Rec:1 Row 6 Col 5 engine
Rec:1 Row 6 Col 6 vin_code V6 3.0L 2997cc
Rec:1 Row 6 Col 7 comment Engine Code J30A1
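**Note that the engine value has shifted into the vin_code field.

And proof that it works some of the time:

**Note that the fields line up in this record.

I suspect my parser is not looking for something in the table cells, or I am missing something trivial.

Here is the important part of my code: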
# Per Query
while records:
    # Per Query Loop
    #print str(records)
    for record in records:
        print 'Record Count:'+str(rec_cnt)
        items = ()
        item = {}
        source = record['doc']
        page = html.fromstring(source)
        for rows in page.xpath('//div/table'):
            #records = []
            item = {}
            cntx = 0
            for row in list(rows):
                cnty = 0 # Column Counter
                found_oil = 0 # Found oil filter record flag
                data = {} # Data
                # Data fields
                field_data = {'part':'', 'part_no':'', 'filter_loc':'', 'engine':'', 'vin_code':'', 'comment':'', 'year':''}
                print
                print '----------------------------------------------------------------------------------'
                print 'Record: '+str(record['id']), 'Section:'+str(record['section']), 'Make: '+str(record['make']), 'SubMake: '+str(record['submake'])
                print 'Model: '+str(record['model']), 'SubModel: '+str(record['submodel']), 'Year: '+str(record['year']), 'Engine: '+str(record['engine'])
                print '----------------------------------------------------------------------------------'
                #
                # Rules for extracting data columns
                # 1. First column always has a link to the bullet image
                # 2. Second column is part name
                # 3. Third column always empty
                # 4. Fourth column is part number
                # 5. Fifth column is empty
                # 6. Sixth column is part location
                # 7. Seventh column is always empty
                # 8. Eighth column is engine size
                # 9. Ninth column is vin code
                # 10. Tenth column is Comment
                # 11. Eleventh column does not exist.
                #
                for column in row.xpath('./td[@class="blackmedium"][text()="0xa0"] | ./td[@class="blackmedium"][text()="\n"]/text() | ./td[@class="blackmeduim"]/img[@src]/text() | ./td[@class="blackmedium"][text()=""]/text() | ./td[@class="blackmedium"]/b/text() | ./td[@class="blackmedium"]/a/text() |./td[@class="blackmedium"]/text() | ./td[@class="blackmedium"][text()=" "]/text() | ./td[@class="blackmedium"][text()=" "]/text() | ./td[@class="blackmedium"][text()=None]/text()'):
                    #' | ./td[position()>1]/a/text() | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
                    cnty+=1
                    if ('Oil Filter' == column.strip() or 'Air Filter' == column.strip()) and found_oil == 0:
                        found_oil = 1
                    if found_oil == 1:
                        print 'Rec:'+str(rec_cnt), 'Row '+str(cntx), 'Col '+str(cnty), _fields[cnty], column.strip()
                        #cnty+= 1
                        #print
                    else:
                        print 'Rec: '+str(rec_cnt), 'Col: '+str(cnty)
                    field_data[ str(_fields[cnty]) ] = str(column.strip())
                    #cnty = cnty+1
                # Save data to db dest table
                if found_oil == 1:
                    data['source_id'] = record['id']
                    data['section_id'] = record['section_id']
                    data['section'] = record['section']
                    data['make_id'] = record['make_id']
                    data['make'] = record['make']
                    data['submake_id'] = record['submake_id']
                    data['submake'] = record['submake']
                    data['model_id'] = record['model_id']
                    data['model'] = record['model']
                    data['submodel_id'] = record['submodel_id']
                    data['submodel'] = record['submodel']
                    data['year_id'] = record['year_id']
                    data['year'] = record['year']
                    data['engine_id'] = record['engine_id']
                    data['engine'] = record['engine']
                    data['part'] = field_data['part']
                    data['part_no'] = field_data['part_no']
                    data['filter_loc'] = field_data['filter_loc']
                    data['vin_code'] = field_data['vin_code']
                    data['comment'] = conn.escape_string(field_data['comment'])
                    data['url'] = record['url']
                    save_data(data)
                    print 'Filed Data:'
                    print field_data
                cntx+=1
        rec_cnt+=1
    #End main per query loop
    delay() # delay if wait was passed on cmd line
    records = get_data()
    has_offset = 1
#End Queries
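Thank you for your help and your eyes.

Usually when I run into a problem like this, I do two things:

Put import pdb; pdb.set_trace() just before the line that says cnty+=1.
Use c to let the program continue (until the next breakpoint or set_trace); use n to step to the next line in the program. Use q to raise an exception (which usually aborts the program).

Can you share the details of the scraping process? The intermittent failure may come from how the HTML data is parsed.

The problem seems to be that your XPath expression searches for text nodes. It finds no match for empty cells, so your code "skips" columns. Try iterating over the td elements themselves, and then "look down" from each element into its content. To get started, do the following: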
# just iterate over child elements of the row, which are always td
# use enumerate to easily get a counter for the columns
for col_no, td in enumerate(row, start=1):
    # use the xpath function string() to get the string value for the element
    # this will yield an empty string for empty elements
    print col_no, td.xpath('string()')
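
To make the "skipped column" effect concrete, here is a minimal sketch in Python 3. It is not the asker's code: the row markup is a made-up stand-in for the supplier table, the helper names are hypothetical, and the import falls back to the standard library's ElementTree (which shares the element API used here) when lxml is not installed.

```python
try:
    from lxml import etree as ET  # matches the library used in the question
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback with the same element API

# Hypothetical row (assumption): cells 2, 4 and 5 have no content at all.
ROW = ET.fromstring(
    "<tr><td>Air Filter</td><td/><td>46395</td><td/><td/>"
    "<td>V6 3.0L 2997cc</td><td>Engine Code J30A1</td></tr>"
)


def texts_only(row):
    """Mimic a text-node query: empty cells contribute nothing to the result."""
    return [td.text for td in row if td.text]


def cells_by_position(row):
    """Iterate the td elements themselves so every cell keeps its slot."""
    return ["".join(td.itertext()).strip() for td in row]


shifted = texts_only(ROW)
aligned = cells_by_position(ROW)
print(len(shifted), shifted)
print(len(aligned), aligned)
```

With the text-only approach the engine string ends up at index 2 of a 4-item list, while the element-based approach keeps it at index 5 of a 7-item list, which is the same kind of shift seen in the flawed record above.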
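Note that in some cases the string() XPath function may not be enough, or may be too simplistic, for your needs. In your example you may find cells like 51334 followed by a superscript 53 (see the Oil Filter row). My example would give you "5133453", while you seem to need "51334". I am not sure whether that is intentional or whether you simply did not notice the "missing" part. If you really only want the text inside the hyperlink, use td.findtext('a').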
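
As a small illustration of that difference, the sketch below (Python 3) compares the full string value of a cell with the text of its link. The cell markup is an assumption reconstructed from the description above, not copied from the supplier page, and the stdlib ElementTree is used as a fallback when lxml is unavailable.

```python
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Assumed cell layout: part number inside a link, note marker in a superscript.
CELL = ET.fromstring("<td><a>51334</a><sup>53</sup></td>")

full_text = "".join(CELL.itertext())          # every text node in the cell, concatenated
link_text = CELL.findtext("a")                # only the text of the first <a> child
note_text = CELL.findtext("sup", default="")  # superscript text, or "" if there is none

print(full_text, link_text, note_text)
```

Here findtext() returns None when the child is missing unless a default is given, so a default of "" keeps later string handling simple.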
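
I want to thank everyone who has helped me over the past few days. All of your input led to a working application that I am now using. I want to post the resulting changes to my code so that anyone who looks here can find an answer, or at least some information on how to solve the problem. Below is the rewritten part of my code that fixed the problem I was having: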
#
# get_column_index()
# returns a dict of column names/column number pairs
#
def get_column_index(row):
    index = {}
    col_no = 0
    td = None
    name = ''
    for col_no, td in enumerate(row, start=0):
        mystr = str(td.xpath('string()').encode('ascii', 'replace'))
        name = str.lower(mystr).replace(' ', '_')
        idx = name.replace('.', '')
        index[idx] = col_no
    if int(options.verbose) > 2:
        print 'Field Index:', str(index)
    return index

def run():
    global has_offset
    records = get_data()
    #print 'Records', records
    rec_cnt = 0
    # Per Query
    while records:
        # Per Query Loop
        #print str(records)
        for record in records:
            if int(options.verbose) > 0:
                print 'Record Count:'+str(rec_cnt)
            items = ()
            item = {}
            source = record['doc']
            page = html.fromstring(source)
            col_index = {}
            for rows in page.xpath('//div/table'):
                #records = []
                item = {}
                cntx = 0
                for row in list(rows):
                    data = {} # Data
                    found_oil = 0 #found proper part flag
                    # Data fields
                    field_data = {'part':'', 'part_no':'', 'part_note':'', 'filter_loc':'', 'engine':'', 'vin_code':'', 'comment':'', 'year':''}
                    if int(options.verbose) > 0:
                        print
                        print '----------------------------------------------------------------------------------'
                        print 'Row'+str(cntx), 'Record: '+str(record['id']), 'Section:'+str(record['section']), 'Make: '+str(record['make']), 'SubMake: '+str(record['submake'])
                        print 'Model: '+str(record['model']), 'SubModel: '+str(record['submodel']), 'Year: '+str(record['year']), 'Engine: '+str(record['engine'])
                        print '----------------------------------------------------------------------------------'
                    # get column indexes
                    if cntx == 1:
                        col_index = get_column_index(row)
                    if col_index != None and cntx > 1:
                        found_oil = 0
                        for col_no, td in enumerate(row):
                            if ('part' in col_index) and (col_no == col_index['part']):
                                part = td.xpath('string()').strip()
                                if 'Oil Filter' == part or 'Air Filter' == part or 'Fuel Filter' == part or 'Transmission Filter' == part:
                                    found_oil = 1
                                    field_data['part'] = td.xpath('string()').strip()
                            # Part Number
                            if ('part_no' in col_index) and (col_no == col_index['part_no']):
                                field_data['part_no'] = str(td.xpath('./a/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
                                field_data['part_note'] = str(td.xpath('./sup/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
                            # Filter Location
                            if ('filterloc' in col_index) and (col_no == col_index['filterloc']):
                                field_data['filter_loc'] = td.xpath('string()').strip()
                            # Engine
                            if ('engine' in col_index) and (col_no == col_index['engine']):
                                field_data['engine'] = td.xpath('string()').strip()
                            if ('vin_code' in col_index) and (col_no == col_index['vin_code']):
                                field_data['vin_code'] = td.xpath('string()').strip()
                            if ('comment' in col_index) and (col_no == col_index['comment']):
                                field_data['comment'] = td.xpath('string()').strip()
                            if int(options.verbose) == 0:
                                print ','
                        if int(options.verbose) > 0:
                            print 'Field Data: ', str(field_data)
                        elif int(options.verbose) == 0:
                            print '.'
                        # Save data to db dest table
                        if found_oil == 1:
                            data['source_id'] = record['id']
                            data['section_id'] = record['section_id']
                            data['section'] = record['section']
                            data['make_id'] = record['make_id']
                            data['make'] = record['make']
                            data['submake_id'] = record['submake_id']
                            data['submake'] = record['submake']
                            data['model_id'] = record['model_id']
                            data['model'] = record['model']
                            data['submodel_id'] = record['submodel_id']
                            data['submodel'] = record['submodel']
                            data['year_id'] = record['year_id']
                            data['year'] = record['year']
                            data['engine_id'] = record['engine_id']
                            data['engine'] = field_data['engine'] #record['engine']
                            data['part'] = field_data['part']
                            data['part_no'] = field_data['part_no']
                            data['part_note'] = field_data['part_note']
                            data['filter_loc'] = field_data['filter_loc']
                            data['vin_code'] = field_data['vin_code']
                            data['comment'] = conn.escape_string(field_data['comment'])
                            data['url'] = record['url']
                            save_data(data)
                            found_oil = 0
                            if int(options.verbose) > 2:
                                print 'Data:', str(data)
                    cntx+=1
            rec_cnt+=1
        #End main per query loop
        delay() # delay if wait was passed on cmd line
        records = get_data()
        has_offset = 1
    #End Queries
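
The header-driven lookup used in the code above can be condensed into a short Python 3 sketch. This is only an illustration under assumptions: the header labels and table markup are invented, the helpers cell_text, column_index, and row_to_record are hypothetical names, and the stdlib ElementTree stands in when lxml is not installed.

```python
import re

try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Hypothetical table (assumption): a header row followed by one data row.
TABLE = ET.fromstring(
    "<table>"
    "<tr><td>Part</td><td>Part No.</td><td>Filter Loc.</td>"
    "<td>Engine</td><td>VIN Code</td><td>Comment</td></tr>"
    "<tr><td>Air Filter</td><td><a>46395</a></td><td/>"
    "<td>V6 3.0L 2997cc</td><td/><td>Engine Code J30A1</td></tr>"
    "</table>"
)


def cell_text(td):
    """String value of a cell; an empty cell yields an empty string."""
    return "".join(td.itertext()).strip()


def column_index(header_row):
    """Map normalized header names (e.g. 'part_no') to column positions."""
    index = {}
    for col_no, td in enumerate(header_row):
        name = re.sub(r"[^a-z0-9]+", "_", cell_text(td).lower()).strip("_")
        if name:
            index[name] = col_no
    return index


def row_to_record(row, index):
    """Read each named column by position; missing cells become ''."""
    cells = [cell_text(td) for td in row]
    return {
        name: cells[col_no] if col_no < len(cells) else ""
        for name, col_no in index.items()
    }


rows = list(TABLE)
index = column_index(rows[0])
record = row_to_record(rows[1], index)
print(index)
print(record)
```

Because every value is looked up by the position of its header, an empty cell stays an empty string in its own field instead of pulling the next value into the wrong column.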

Stepping through the code, I can tell you that some columns are not being returned by the for column in row.xpath(...) line shown in my question.

I crawled the site with another application I wrote (my first Python application) and stored the pages in a MySQL database. I then pull the pages from the MySQL database, and I have checked that I get the correct results back from the database. The problem is in the column loop: the lxml query for the columns does not seem to return all of the columns. As far as I can see it is related to certain empty columns, but looking at the HTML I can't find any rhyme or reason for it.

This site won't let me post more code; I was going to post the whole application. But as I said, the problem seems to be that lxml can't find all of the columns. Not sure why...

Thanks, I'll reset and take another look.

After messing around some more, I think the problem may be that some table cells have no content at all, not even a space. How can I use lxml to locate that kind of cell, and how do I test whether the text attribute is absent?

Sorry, I use xml.etree.ElementTree and xml.dom.minidom for parsing, so I have no lxml-specific advice. It sounds like you have narrowed down the intermittent factor and are close to a fix. Good luck.

Thanks. The 53 is superscript text identifying a note, which I don't need. I'm only after the part number.
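
One of the comments above asks how to detect cells that have no content at all. A possible approach, sketched in Python 3 under assumptions (the row markup and the helper name is_blank are invented for illustration), relies on the fact that an element with no text has its text attribute set to None. The stdlib ElementTree is used as a fallback when lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Hypothetical row (assumption): truly empty cell, non-breaking space,
# plain space, image-only cell, and a cell with real text.
ROW = ET.fromstring(
    "<tr><td></td><td>&#160;</td><td> </td><td><img/></td><td>46395</td></tr>"
)


def is_blank(td):
    """True when the cell has no child elements and no visible text."""
    text = "".join(td.itertext())
    # In Python 3, str.strip() also removes non-breaking spaces (U+00A0).
    return len(td) == 0 and text.strip() == ""


flags = [is_blank(td) for td in ROW]
first_cell_has_no_text = ROW[0].text is None
print(flags, first_cell_has_no_text)
```

The image-only cell is deliberately not treated as blank here, since the first column of the table in question is described as holding a bullet image.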