Python: odd column counts returned when scraping with lxml


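I am learning Python and trying to build a scraper that collects parts data from a vendor's site. My problem is that the table rows I parse come back with different column counts, even though I know every row has the same number of columns. It must be something I am overlooking, and after two days of trying different things I am asking for more eyes on my code to help locate my mistake. No doubt my lack of Python experience is my biggest obstacle.

First, the data. Rather than paste the HTML I have stored in my database, I will give you a link to the live site that I crawled and stored. The first link is

The problem is that the results I get are mostly correct. However, every so often the values end up in the wrong columns, and I can't seem to find the cause.

Here is an example of a flawed result: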

----------------------------------------------------------------------------------
Record: 1 Section:Passenger  /  Light Truck Make: ACURA SubMake: 
Model: CL SubModel:  Year: 1997 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:1 Row 6 Col 1 part Air Filter
Rec:1 Row 6 Col 2 2 
Rec:1 Row 6 Col 3 part_no 46395
Rec:1 Row 6 Col 4 filter_loc 
Rec:1 Row 6 Col 5 engine 
Rec:1 Row 6 Col 6 vin_code V6 3.0L 2997cc
Rec:1 Row 6 Col 7 comment Engine Code J30A1
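**Note that the engine value has shifted into the vin_code field.

And to prove that it works some of the time:


**Note that the fields line up in this record.

I suspect my parser is not looking for something inside the table cells, or I am missing something trivial.

Here are the important parts of my code: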

# Per Query
while records:
    # Per Query Loop
    #print str(records)
    for record in records:
        print 'Record Count:'+str(rec_cnt)
        items = ()
        item = {}
        source = record['doc']
        page = html.fromstring(source)

        for rows in page.xpath('//div/table'):
            #records = []
            item = {}
            cntx = 0
            for row in list(rows):
                cnty = 0 # Column Counter
                found_oil = 0 # Found oil filter record flag
                data = {} # Data
                # Data fields
                field_data = {'part':'',   'part_no':'', 'filter_loc':'',  'engine':'',  'vin_code':'',  'comment':'', 'year':''}
                print
                print '----------------------------------------------------------------------------------'
                print 'Record: '+str(record['id']), 'Section:'+str(record['section']),  'Make: '+str(record['make']),   'SubMake: '+str(record['submake'])
                print  'Model: '+str(record['model']),  'SubModel: '+str(record['submodel']),  'Year: '+str(record['year']),  'Engine: '+str(record['engine'])
                print '----------------------------------------------------------------------------------'

                #
                # Rules for extracting data columns
                # 1. First column always has a link to the bullet image
                # 2. Second column is part name
                # 3. Third column always empty
                # 4. Fourth column is  part number
                # 5. Fifth column is empty
                # 6. Sixth column is part location
                # 7. Seventh column is always empty
                # 8. Eighth column is engine size
                # 9. Ninth column is vin code
                # 10. Tenth column is comment
                # 11. Eleventh column does not exist.
                #
                for column in row.xpath('./td[@class="blackmedium"][text()="0xa0"] | ./td[@class="blackmedium"][text()="\n"]/text() | ./td[@class="blackmeduim"]/img[@src]/text()  | ./td[@class="blackmedium"][text()=""]/text() | ./td[@class="blackmedium"]/b/text() | ./td[@class="blackmedium"]/a/text() |./td[@class="blackmedium"]/text() | ./td[@class="blackmedium"][text()=" "]/text() | ./td[@class="blackmedium"][text()="&#160"]/text() | ./td[@class="blackmedium"][text()=None]/text()'): 
                    #' | ./td[position()>1]/a/text() | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
                    cnty+=1
                    if ('Oil Filter' == column.strip() or 'Air Filter' == column.strip()) and found_oil == 0:
                        found_oil = 1

                    if found_oil == 1:
                        print 'Rec:'+str(rec_cnt), 'Row '+str(cntx),  'Col '+str(cnty),  _fields[cnty],  column.strip()
                        #cnty+= 1
                        #print
                    else:
                        print 'Rec: '+str(rec_cnt),  'Col: '+str(cnty)

                    field_data[ str(_fields[cnty]) ] = str(column.strip())
                    #cnty = cnty+1

                # Save data to db dest table
                if found_oil == 1:
                    data['source_id'] = record['id']
                    data['section_id'] = record['section_id']
                    data['section'] = record['section']
                    data['make_id'] = record['make_id']
                    data['make'] = record['make']
                    data['submake_id'] = record['submake_id']
                    data['submake'] = record['submake']
                    data['model_id'] = record['model_id']
                    data['model'] = record['model']
                    data['submodel_id'] = record['submodel_id']
                    data['submodel'] = record['submodel']
                    data['year_id'] = record['year_id']
                    data['year'] = record['year']
                    data['engine_id'] = record['engine_id']
                    data['engine'] = record['engine']
                    data['part'] = field_data['part']
                    data['part_no'] = field_data['part_no']
                    data['filter_loc'] = field_data['filter_loc']
                    data['vin_code'] = field_data['vin_code']
                    data['comment'] = conn.escape_string(field_data['comment'])

                    data['url'] = record['url']
                    save_data(data)
                    print 'Filed Data:'
                    print field_data

                cntx+=1
            rec_cnt+=1
    #End main per query loop 
    delay() # delay if wait was passed on cmd line
    records = get_data()
    has_offset = 1
    #End Queries

Thanks for your help and your eyes.

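When I run into a problem like this, I usually do two things:

  • Break the problem into small pieces. Use Python functions or classes to carry out subsets of the functionality, so that each function can be tested for correctness on its own.
  • Use pdb to inspect the code while it runs and see where it fails. In this case, for example, I would add
    import pdb; pdb.set_trace()
    before the line that says cnty+=1.
  • Then, when the code runs, you get an interactive interpreter in which you can inspect the variables and work out why you are not getting the results you expect.

    A few tips for using pdb:

    Use c to let the program continue (until the next breakpoint or set_trace call), use n to step to the next line of the program, and use q to raise an exception (which usually aborts the program).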
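
To illustrate the first suggestion, the cell extraction can be pulled out of the nested loops into a small function and checked against a hand-written row. This is only a sketch under stated assumptions: `cell_texts` and the sample markup are hypothetical names and data, not part of the original scraper, and it uses the standard library's `xml.etree.ElementTree` so that it runs without lxml installed (lxml elements support the same iteration and `itertext()` calls).

```python
import xml.etree.ElementTree as ET


def cell_texts(row):
    """Return the stripped text of every direct child cell of a row element."""
    # itertext() walks the cell and its nested elements, so an empty cell
    # produces an empty string instead of disappearing from the result.
    return ["".join(cell.itertext()).strip() for cell in row]


# Hypothetical row with one empty cell, used only to exercise the function.
SAMPLE_ROW = ET.fromstring(
    "<tr><td>Air Filter</td><td></td><td><a href='#'>46395</a></td></tr>"
)

print(cell_texts(SAMPLE_ROW))
```

Because the function takes a row and returns a plain list, it can be tested without the database, the crawl, or the print statements that surround the loop in the question.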

    Can you share details of the scraping process? The intermittent failure may stem from how the HTML data is parsed.

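    The problem seems to be that your XPath expression searches for text nodes. An empty cell has no text node to match, which causes the code to "skip" columns. Try iterating over the td elements themselves, and then "look down" from each element into its content. To get started: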

    # just iterate over child elements of the row, which are always td
    # use enumerate to easily get a counter for the columns
    for col_no, td in enumerate(row, start=1):
        # use the xpath function string() to get the string value for the element
        # this will yield an empty string for empty elements
        print col_no, td.xpath('string()')
    

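    Note that in some cases the string() XPath function may be insufficient or too simple for your needs. In your example you will find content like 5133453 (see the Oil Filter row). My example gives you "5133453", whereas you seem to want "51334" (I'm not sure whether that is intentional or whether you hadn't noticed the "missing" part). If you really only want what is shown in the hyperlink, use td.findtext('a').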
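
The difference between collecting text nodes and iterating over cells can be seen on a small hand-made row. The sketch below is an assumption-laden illustration, not the vendor's real markup: the row is invented, the trailing 53 is modelled as a `<sup>` element (which matches the asker's later remark that it is superscript text), and the standard library's `xml.etree.ElementTree` stands in for lxml. Here `row.itertext()` plays the role of a text-node query, and joining `td.itertext()` plays the role of `td.xpath('string()')`.

```python
import xml.etree.ElementTree as ET

# Hypothetical row: cells 2 and 4 are empty, cell 3 holds a link plus a
# superscript note.
row = ET.fromstring(
    "<tr>"
    "<td>Oil Filter</td>"
    "<td></td>"
    "<td><a href='#'>51334</a><sup>53</sup></td>"
    "<td></td>"
    "<td>V6 3.0L</td>"
    "</tr>"
)

# Collecting text nodes only: empty cells contribute nothing, so the
# positions in this list no longer correspond to table columns.
text_nodes = list(row.itertext())

# Iterating over the cells themselves keeps exactly one entry per column,
# with an empty string for each empty cell.
per_cell = ["".join(td.itertext()) for td in row]

# Reading only the link text leaves out the superscript note.
part_no = row[2].findtext("a")
```

With this input, `per_cell` has five entries (one per td) while `text_nodes` has only four, which is the kind of shift seen in the flawed output in the question.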

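    I want to thank everyone who has helped me over the past few days. All of your input led to a working application that I am now using. I am posting the resulting changes to my code so that anyone who lands here can find the answer, or at least some information on how to approach the problem. Below is the rewritten part of my code, which fixes the problem I was having: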

    #
    # get_column_index()
    # returns a dict of column names/column number pairs
    #
    def get_column_index(row): 
        index = {}
        col_no = 0
        td = None
        name = ''
        for col_no,  td in enumerate(row,  start=0):
            mystr = str(td.xpath('string()').encode('ascii',  'replace'))
            name =  str.lower(mystr).replace(' ', '_')
            idx = name.replace('.', '')
            index[idx] =  col_no
    
        if int(options.verbose) > 2:
            print 'Field Index:',  str(index)
    
        return index
    
    
    
    
    def run():
        global has_offset
        records = get_data()
    
        #print 'Records',  records
        rec_cnt = 0
    
        # Per Query
        while records:
            # Per Query Loop
            #print str(records)
            for record in records:
                if int(options.verbose) > 0:
                    print 'Record Count:'+str(rec_cnt)
    
                items = ()
                item = {}
                source = record['doc']
                page = html.fromstring(source)
                col_index = {}
    
                for rows in page.xpath('//div/table'):
                    #records = []
                    item = {}
                    cntx = 0
                    for row in list(rows):
                        data = {} # Data
                        found_oil = 0 #found proper part flag
                        # Data fields
                        field_data = {'part':'',   'part_no':'', 'part_note':'',  'filter_loc':'',  'engine':'',  'vin_code':'',  'comment':'', 'year':''}
    
                        if int(options.verbose) > 0:
                            print
                            print '----------------------------------------------------------------------------------'
                            print 'Row'+str(cntx), 'Record: '+str(record['id']), 'Section:'+str(record['section']),  'Make: '+str(record['make']),   'SubMake: '+str(record['submake'])
                            print  'Model: '+str(record['model']),  'SubModel: '+str(record['submodel']),  'Year: '+str(record['year']),  'Engine: '+str(record['engine'])
                            print '----------------------------------------------------------------------------------'
    
                       # get column indexes
                        if cntx == 1:
                            col_index = get_column_index(row)
    
                        if col_index != None and cntx > 1:
                            found_oil = 0
    
                            for col_no,  td in enumerate(row):
    
                                if ('part' in col_index) and (col_no == col_index['part']):
                                    part = td.xpath('string()').strip()
                                    if 'Oil Filter' == part or 'Air Filter' == part or 'Fuel Filter' == part or 'Transmission Filter' == part:
                                        found_oil = 1
                                        field_data['part'] = td.xpath('string()').strip()
    
                                # Part Number
                                if ('part_no' in col_index) and (col_no == col_index['part_no']):
                                    field_data['part_no'] = str(td.xpath('./a/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
                                    field_data['part_note'] = str(td.xpath('./sup/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
    
                                # Filter Location
                                if ('filterloc' in col_index) and (col_no == col_index['filterloc']):
                                    field_data['filter_loc'] = td.xpath('string()').strip()
    
                                # Engine
                                if ('engine' in col_index) and (col_no == col_index['engine']):
                                    field_data['engine'] = td.xpath('string()').strip()
    
                                if ('vin_code' in col_index) and (col_no == col_index['vin_code']):
                                    field_data['vin_code'] = td.xpath('string()').strip()
    
                                if ('comment' in col_index) and (col_no == col_index['comment']):
                                    field_data['comment'] = td.xpath('string()').strip()
    
                                if int(options.verbose) == 0:
                                    print ',' 
    
    
                            if int(options.verbose) > 0:
                                print 'Field Data: ',  str(field_data)
                            elif int(options.verbose) == 0:
                                print '.'
    
                        # Save data to db dest table
                        if found_oil == 1:
                            data['source_id'] = record['id']
                            data['section_id'] = record['section_id']
                            data['section'] = record['section']
                            data['make_id'] = record['make_id']
                            data['make'] = record['make']
                            data['submake_id'] = record['submake_id']
                            data['submake'] = record['submake']
                            data['model_id'] = record['model_id']
                            data['model'] = record['model']
                            data['submodel_id'] = record['submodel_id']
                            data['submodel'] = record['submodel']
                            data['year_id'] = record['year_id']
                            data['year'] = record['year']
                            data['engine_id'] = record['engine_id']
                            data['engine'] = field_data['engine'] #record['engine']
                            data['part'] = field_data['part']
                            data['part_no'] = field_data['part_no']
                            data['part_note'] = field_data['part_note']
                            data['filter_loc'] = field_data['filter_loc']
                            data['vin_code'] = field_data['vin_code']
                            data['comment'] = conn.escape_string(field_data['comment'])
    
                            data['url'] = record['url']
                            save_data(data)
                            found_oil = 0
    
                            if int(options.verbose) > 2:
                                print 'Data:', str(data)
    
                        cntx+=1
                    rec_cnt+=1
            #End main per query loop 
            delay() # delay if wait was passed on cmd line
            records = get_data()
            has_offset = 1
            #End Queries
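
The header-driven lookup in the rewrite above can also be sketched more compactly by pairing the normalized header names with each row's cell values. This is not the asker's code: `normalize`, `cell_text`, `rows_as_dicts`, and the sample table are hypothetical, the header is assumed to be the first row of the table (in the rewrite it is the row at index 1), and the standard library's `xml.etree.ElementTree` is used in place of lxml so the example runs without extra installs.

```python
import xml.etree.ElementTree as ET


def normalize(name):
    """Lower-case a header, drop periods, and turn spaces into underscores."""
    return name.strip().lower().replace(".", "").replace(" ", "_")


def cell_text(td):
    """String value of a cell, including the text of nested elements."""
    return "".join(td.itertext()).strip()


def rows_as_dicts(table):
    """Yield one dict per data row, keyed by the normalized header names."""
    rows = list(table)
    header = [normalize(cell_text(td)) for td in rows[0]]
    for row in rows[1:]:
        # zip pairs each header with the cell in the same position, so an
        # empty cell becomes an empty value rather than shifting the columns.
        yield dict(zip(header, (cell_text(td) for td in row)))


# Hypothetical table used only to exercise the functions above.
SAMPLE_TABLE = ET.fromstring(
    "<table>"
    "<tr><td>Part</td><td>Part No.</td><td>Engine</td></tr>"
    "<tr><td>Air Filter</td><td><a href='#'>46395</a></td><td></td></tr>"
    "</table>"
)

records = list(rows_as_dicts(SAMPLE_TABLE))
```

Keying each row by header name removes the need for the chain of `col_no == col_index[...]` comparisons, at the cost of assuming that every data row has the same number of cells as the header row.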
    

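    From stepping through the code, I can tell you that some columns are not being returned by this line: for column in row.xpath('./td[@class="blackmedium"][text()="0xa0"] | ./td[@class="blackmedium"][text()="\n"]/text() | ./td[@class="blackmeduim"]/img[@src]/text()  | ./td[@class="blackmedium"][text()=""]/text() | ./td[@class="blackmedium"]/b/text() | ./td[@class="blackmedium"]/a/text() |./td[@class="blackmedium"]/text() | ./td[@class="blackmedium"][text()=" "]/text() | ./td[@class="blackmedium"][text()="&#160"]/text() | ./td[@class="blackmedium"][text()=None]/text()')

    I crawled the site with another application I wrote (my first Python application) and stored the pages in a MySQL database. I then pull the pages back out of the MySQL database, and I have checked that I get the correct results from it. The problem is in the column loop: the lxml query for the columns does not seem to return all of the columns. As far as I can tell it is related to certain empty columns, but looking at the HTML I can't find any rhyme or reason for it.

    This site won't let me post more code. I was going to post the whole application. But as I said, the problem seems to be that lxml does not find all of the columns. Not sure why...

    Thanks, I'll reset and take a look.

    After messing around some more, I think the problem may be that some table cells have no content at all, not even a space. How can I use lxml to locate that kind of cell, and how do I test whether the text attribute is absent?

    Sorry, I use xml.etree.ElementTree and xml.dom.minidom for parsing, so I have no advice on lxml. It sounds like you have narrowed down the intermittent factor and are close to a fix. Good luck.

    Thanks. The 53 is superscript text that identifies a note, which I don't need. I'm only after the part number.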