Python regex, pandas, and flagging rows


python regex pandas dataframe


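I am trying to flag any record that contains a user-defined "incorrect" character. In this example, record two (2) should be returned as the invalid record, but I seem to be capturing record 1 or record 3 instead, and those should be treated as "correct". Any suggestions as to why these are being flagged rather than the "bad record"?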

import pandas as pd
import numpy as np
import re

data = {'HOME1': ['123 Main St', '567\ Country Road', 'PO Box 900']}
dft = pd.DataFrame(data)

from itertools import chain
chars =[]
acceptable = [x for x in chain(range(48,58),range(32,33), range(65,91), range(97,123))]
for ch in acceptable:
    chars.append(chr(ch))

reg_list = map(re.compile,chars)

for x in dft['HOME1']:
    print(x)
    if any(re.match(x) for re in reg_list):
        conditions = [dft['HOME1'].apply(lambda x: x)!=x, dft['HOME1'].apply(lambda x: x)==x]
        choices = [0,1]
        dft['NonValidHOME1'] = np.select(conditions,choices,default=0)

try:
    print(dft.groupby(['NonValidHOME1'])[['HOME1']].filter(lambda x: len(x) ==1).agg(lambda x: x.tolist()))
except:
    print("no invalid Home1")

Thanks for the comments. They put me on a "better" path, or at least on one that led me to an answer.

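I think you need to remove

reg_list = map(re.compile, chars)

and replace

if any(re.match(x) for re in reg_list):

with

if any(c in x for c in chars):

(assuming x is a string). If you are only checking a string for individual characters from a list, you don't need a regular expression.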
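As a rough illustration of a regex-free check, the sketch below tests each string against a set of allowed characters. The helper name `has_invalid_char` and the `ALLOWED` set are assumptions made for this example, not part of the original code; the set mirrors the digits, space, and ASCII letters listed in the question.

```python
import string

# Assumed allowed set: digits, ASCII letters, and the space character,
# mirroring the code points listed in the question.
ALLOWED = set(string.digits + string.ascii_letters + " ")


def has_invalid_char(text, allowed=ALLOWED):
    """Return True if text contains any character outside allowed."""
    return any(ch not in allowed for ch in text)


records = ["123 Main St", "567\\ Country Road", "PO Box 900"]
flags = [has_invalid_char(r) for r in records]
print(flags)  # [False, True, False]
```

Only the second record contains a backslash, so it is the only one reported as invalid.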
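for x in dft['HOME1']:
    for c in x:
        if c not in chars:
            # c is not an acceptable character, so flag the record containing it
            print(c, x)
            # Note: each assignment overwrites the column, so only the
            # most recently flagged record stays marked as 1.
            conditions = [dft['HOME1'] == x, dft['HOME1'] != x]
            choices = [1, 0]
            dft['NonValidHOME1'] = np.select(conditions, choices, default=0)

# [print(c) for x in dft['HOME1'] for c in x if c not in chars]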
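As an alternative to looping character by character, a regular expression can still be used if it is written as a negated character class and applied to the whole column at once. The following is only a sketch: the column names follow the question, while the pattern and the `invalid_rows` variable are assumptions about what should count as acceptable.

```python
import pandas as pd

data = {"HOME1": ["123 Main St", "567\\ Country Road", "PO Box 900"]}
dft = pd.DataFrame(data)

# Negated character class: matches any character that is NOT a digit,
# an ASCII letter, or a space (assumed to be the acceptable set).
invalid_pattern = r"[^0-9A-Za-z ]"

# str.contains searches anywhere in the string; regex=True is the default.
dft["NonValidHOME1"] = dft["HOME1"].str.contains(invalid_pattern).astype(int)

invalid_rows = dft.loc[dft["NonValidHOME1"] == 1, "HOME1"].tolist()
print(invalid_rows)  # ['567\\ Country Road']
```

Because the flag is computed for every row in one pass, several invalid records would all be marked, rather than only the last one found.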