Python: still having problems with a try-except clause


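I am using the tld Python library with the apply function to extract the first-level domain from proxy request logs. When I run into odd requests that tld does not know how to handle, such as "http:1 CON" or "http:/login.cgi%00", I get an error message like this: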

TldBadUrl: Is not a valid URL http:1 con!
TldBadUrlTraceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356 
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url, 
fail_silently, fix_protocol, search_public, search_private, **kwargs)
    385         fix_protocol=fix_protocol,
    386         search_public=search_public,
--> 387         search_private=search_private
    388     )
    389 

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
    289             return None, None, parsed_url
    290         else:
--> 291             raise TldBadUrl(url=url)
    292 
    293     domain_parts = domain_name.split('.')
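To get around this, it was suggested that I wrap the function in a try-except clause, so that the rows that error out can be found by querying for NaN: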

import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try: 
        return get_fld(x)
    except tld.exceptions.TldBadUrl: 
        return np.nan
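This seems to work for some "requests", such as "http:1 con" and "http:/login.cgi%00", but it fails on other "requests", for which I get a different error message similar to the one above: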

TldDomainNotFound: Domain urnt12.knhc..txt didn't match any existing TLD name!
The dataframe, named "request", contains 240,000 "requests" in total.

My code:

from tld import get_tld
from tld import get_fld
import pandas as pd
import numpy as np
#Read back into a dataframe
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')
#Remove rows where there were null values in the request column 
request = request[pd.notnull(request['request'])]
#Find the urls that contain IP addresses and exclude them from the new dataframe
request = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]
#Reset index
request = request.reset_index(drop=True)

import tld
from tld import get_fld

def try_get_fld(x):
    try: 
        return get_fld(x)
    except tld.exceptions.TldBadUrl: 
        return np.nan

request['flds'] = request['request'].apply(try_get_fld)

#faulty_url_df = request[request['flds'].isna()]
#print(faulty_url_df)
It fails because a different exception is being raised. You are expecting a tld.exceptions.TldBadUrl exception, but you are getting TldDomainNotFound.

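You can either be less specific in the except clause and catch more exceptions with a single except clause, or add another except clause to catch the other type of exception: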

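import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan
    except tld.exceptions.TldDomainNotFound:
        # Second clause handles the exception the original wrapper missed
        print("Domain not found!")
        return np.nan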
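
The other option mentioned above, a single except clause that catches several exception types, can be sketched as follows. This is only an illustration under stated assumptions: safe_call is a hypothetical helper name, and the built-in int stands in for get_fld so that the snippet runs without tld installed.

```python
import math

def safe_call(func, value, exceptions, default=float("nan")):
    """Call func(value); return `default` if one of `exceptions` is raised.

    `safe_call` is a hypothetical helper, not part of tld or pandas.
    """
    try:
        return func(value)
    except exceptions:  # a tuple of exception classes is matched as a group
        return default

# Stand-in demo: int() raises ValueError for "abc" and TypeError for None,
# playing the role of the two different tld exceptions.
results = [safe_call(int, v, (ValueError, TypeError)) for v in ["12", "abc", None]]
print(results)
```

With tld, the corresponding call would presumably pass get_fld as func and the tuple (tld.exceptions.TldBadUrl, tld.exceptions.TldDomainNotFound) as exceptions, wrapped in a lambda for apply.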

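I would first try a generic exception catcher, something like "except Exception as e:", to check whether exceptions other than the expected tld.exceptions.TldBadUrl are being raised.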
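
A diagnostic pass along those lines might look like the sketch below. It is a minimal illustration, not tld-specific code: apply_and_log is a hypothetical helper, and the built-in int again stands in for get_fld. The broad "except Exception" is meant only for finding out which exception types occur; once they are known, the wrapper can be narrowed to those types.

```python
from collections import Counter

def apply_and_log(func, values):
    """Apply func to each value; on failure store None and count the exception type.

    `apply_and_log` is a hypothetical helper used for illustration only.
    """
    results, seen = [], Counter()
    for value in values:
        try:
            results.append(func(value))
        except Exception as e:  # deliberately broad: this is a diagnostic pass
            seen[type(e).__name__] += 1
            results.append(None)
    return results, seen

# Stand-in demo: int() fails in two different ways on this input.
results, seen = apply_and_log(int, ["12", "abc", None, "x"])
print(seen)
```

Run against the "request" column with get_fld in place of int, the counter would list every exception class that the 240,000 rows trigger, which shows whether TldBadUrl and TldDomainNotFound are the only ones to handle.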