Python script hanging on a try/except clause

I'm trying to extract data from HTML pages with BeautifulSoup. Some of the HTML is badly formed or missing entirely, and in those cases I need to fall back to regular expressions. I'm using try/except clauses to handle this. Here is my script:

import re
from bs4 import BeautifulSoup

def get_metadata(path):
    path = path.replace('\\', '\\')
    rx_1 = r'Supersedes:?\s*[^\r\n]*[\r\n]+(.*?)[ \r\n]+(?:Service)?\s*Serial Numbers?:?[ \r\n]+.*?[ \n\r]+\*+[\n\r]+\*[\n\r]*([A-Za-z ]+)[ \n\r]\*+[\n\r]+\*+[ \n\r]*\*+[\n\r]+\*+[ \n\r]*(?:\*[ \n\r]*)+(?:Author[:\w\/]+ ([\.\w\/\s�]+))'
    rx_2 = r'(.*)(?:Service)[\r\n ]+Serial Numbers?:?[ \r\n]+.*?[ \n\r]+\*+[\n\r]+\*[\n\r]*([A-Za-z ]+)[ \n\r]\*+[\n\r]+\*+[ \n\r]*\*+[\n\r]+\*+[ \n\r]*(?:\*[ \n\r]*)+(?:Author[:\w\/]+ ([\.\w\/\s�]+))'
    rxs = [rx_1, rx_2]
    data = {}

    try:
        soup = BeautifulSoup(open(path, 'rb'), 'html.parser')
        pre = soup.find('pre')
        if pre:
            pre_string = pre.text
        else:
            pre_string = soup.text

        attachments = []
        data['path'] = path
        try:
            description = soup.find(text=re.compile(r'\s*Title:?')).find_next('td').contents[0]
            description = description.strip().replace('\n', '\\n').replace(',', '\\n')
            data['document_description'] = description
        except Exception as why:
            # soup couldn't find it; fall back to the regexes
            for rx in rxs:
                try:
                    match = re.search(rx, pre_string, re.S|re.M)
                    if match:
                        description = match[1]
                        if not description:
                            continue
                        else:
                            if description == 'Service':
                                continue
                            # before_d / after_d: cleanup patterns presumably defined elsewhere in the script
                            description = re.sub(before_d, '', description)
                            description = re.sub(after_d, '', description)
                            description = re.sub('([\r\n]|\s+)', ' ', description)
                            data['document_description'] = description
                            break
                    else:
                        continue
                except Exception as why:
                    err = {'path': path, 'error': str(why), 'field': 'description'}
#                    record_err(err)
                    data['document_description'] = None
                    continue
            if not data.get('document_description'):
                data['document_description'] = None
        html = soup.prettify('utf-8')

        with open(path, 'wb') as f:
            f.write(html)
            update_log(log)

    except Exception as why:
        print('failed to open soup: ' + str(why))
    return data
Something is going wrong: when a line like the one below cannot be evaluated, instead of raising an exception,

try:
    description = soup.find(text=re.compile(r'\s*Title:?')).find_next('td').contents[0]
the script simply freezes during execution.

I run get_metadata for several different paths until the exception-handling problem appears. Each path is logged as it is attempted. When the script reads a file that triggers the exception, it freezes on that file and I have to interrupt it from the keyboard. Strangely, the error does not seem to be handled by the try clause at all:

 python just-meta.py
SNViewer-HTML\Compliance\CE\SN_CE_Compliance_01.htm
SNViewer-HTML\Compliance\CE\SN_CE_Compliance_02.htm
SNViewer-HTML\Compliance\CE\SN_CE_Compliance_03.htm
SNViewer-HTML\Compliance\CE\SN_CE_Compliance_04.htm
SNViewer-HTML\Compliance\CE\SN_CE_Compliance_05.htm
SNViewer-HTML\Compliance\CE\SN_CE_Compliance_06.htm
SNViewer-HTML\Compliance\CS\SN_CS_Compliance_01A.htm
Traceback (most recent call last):
  File "just-meta.py", line 109, in get_metadata
    description = soup.find(text=re.compile(r'\s*Title:?')).find_next('td').contents[0]
AttributeError: 'NoneType' object has no attribute 'find_next'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "just-meta.py", line 353, in <module>
    migrate()
  File "just-meta.py", line 46, in migrate
    metadata = get_metadata(f)
  File "just-meta.py", line 115, in get_metadata
    match = re.search(rx, pre_string, re.S|re.M)
  File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)
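
The final exception line is cut off above, but the "During handling of the above exception, another exception occurred" marker is itself informative: Python 3 prints it whenever a new exception is raised while an except block is still running. A minimal sketch that reproduces a traceback of the same shape (the calls here are illustrative, not taken from just-meta.py):

import re

try:
    None.find_next('td')       # raises AttributeError, like the soup.find(...) line above
except Exception:
    re.search(r'x', None)      # a second error (TypeError) raised inside the handler itself

Run as-is, this prints the AttributeError traceback, then the chained-exception marker, then the second error, which is the pattern in the log above.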

Can anyone give me some advice on how to solve this?

soup.find(text=re.compile(r'\s*Title:?')) evaluated to None (nothing was found), and None has no find_next attribute. You have to handle that special case.

I thought I had handled it? Isn't that exactly what the try/except block is for?

Catching all exceptions makes debugging considerably harder. You should handle only the exceptions you are actually interested in and let the rest bubble up; the stack trace will usually point you in the right direction.
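
A minimal sketch of both suggestions, assuming the same soup object as in the question; extract_description is a hypothetical helper name:

import re

def extract_description(soup):
    # soup.find() returns None when nothing matches, so guard each
    # step of the chain instead of relying on a blanket except.
    title_node = soup.find(text=re.compile(r'\s*Title:?'))
    if title_node is None:
        return None                      # nothing found; caller can fall back to the regexes
    td = title_node.find_next('td')
    if td is None or not td.contents:
        return None
    description = td.contents[0]
    return description.strip().replace('\n', '\\n').replace(',', '\\n')

With the None cases handled explicitly, the remaining try/except blocks can be narrowed to the exceptions the code genuinely expects (for example except AttributeError: rather than except Exception:), so unrelated errors surface immediately instead of being silently swallowed.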