Check whether multiple keywords are present and create another column using Python

python python-3.x regex pandas dataframe

I have a data frame that looks like this:

import pandas as pd

df = pd.DataFrame({'meds': ['Calcium Acetate','insulin GLARGINE -- LANTUS -  inJECTable','amoxicillin  1 g  + clavulanic acid  200 mg ','digoxin  - TABLET'],
                   'details':['DOSE: 667 mg - TDS with food - Inject','DOSE:   12 unit(s)  -  ON  -  SC (SubCutaneous)','-- AUGMENTIN -  inJECTable','DOSE:   62.5 mcg  -  Every other morning  -  PO'],
                   'extracted':['Calcium Acetate 667 mg Inject','insulin GLARGINE -- LANTUS 12 unit(s) -  SC (SubCutaneous)','amoxicillin  1 g  + clavulanic acid  200 mg -- AUGMENTIN','digoxin  - TABLET 62.5 mcg PO/Tube']})
df['concatenated'] = df['meds'] + " "+ df['details']

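What I would like to do is:

a) check whether all the individual keywords from the `extracted` column are present in the `concatenated` column

b) if they are all present, assign 1 to the `output` column, otherwise 0

c) record the keywords that were not found in the `issue` column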

So I tried the following:
df['clean_extract'] = df.extracted.str.extract(r'([a-zA-Z0-9\s]+)') 
 #the above regex is incorrect. I would like to clean the text (remove all symbols except spaces and retain a clean text)
df['keywords'] = df.clean_extract.str.split(' ') #split them into keywords
def value_present(row):   #check whether each of the keyword is present in `concatenated` column
    if isinstance(row['keywords'], list):
        for keyword in row['keywords']:
            return 1
    else:
        return 0

df['output'] = df[df.apply(value_present, axis=1)][['concatenated', 'keywords']].head()

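If you think cleaning the `concatenated` column would also help, that is fine. I am only interested in checking that all the keywords are present.

Is there an efficient and elegant way to do this on 7-8 million records?

In my expected output, red marks the terms from the `extracted` column that are missing from the `concatenated` column. Those rows are assigned 0, and the missing keywords are stored in the `issue` column.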
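
Let us `zip` the columns `extracted` and `concatenated` and, for each pair, map it to a function `f` that computes the set difference and returns the result accordingly: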
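import numpy as np

def f(x, y):
    # keywords in `extracted` (x) that do not occur in `concatenated` (y)
    s = set(x.split()) - set(y.split())
    # 0 plus the missing keywords if any are missing, otherwise 1 and NaN
    return [0, ', '.join(s)] if s else [1, np.nan]

df[['output', 'issue']] = [f(*s) for s in zip(df['extracted'], df['concatenated'])]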
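
To make the behaviour concrete, here is a minimal sketch that runs the same set-difference idea on two rows taken from the question's data. Two details are assumptions of this sketch rather than part of the answer: the missing keywords are passed through `sorted` so the joined string is deterministic, and the two columns are assigned one at a time so the example does not depend on how a given pandas version handles assignment to several new columns.

```python
import numpy as np
import pandas as pd

# Two rows from the question: one where every keyword is found, one where it is not.
df = pd.DataFrame({
    'extracted': ['Calcium Acetate 667 mg Inject',
                  'digoxin  - TABLET 62.5 mcg PO/Tube'],
    'concatenated': ['Calcium Acetate DOSE: 667 mg - TDS with food - Inject',
                     'digoxin  - TABLET DOSE:   62.5 mcg  -  Every other morning  -  PO'],
})

def f(x, y):
    # Whitespace-separated keywords of x that never appear as keywords of y.
    s = set(x.split()) - set(y.split())
    # sorted() is an addition in this sketch, only to make the output stable.
    return [0, ', '.join(sorted(s))] if s else [1, np.nan]

results = [f(x, y) for x, y in zip(df['extracted'], df['concatenated'])]

# Assign column by column (an assumption of this sketch, see the note above).
df['output'] = [r[0] for r in results]
df['issue'] = [r[1] for r in results]

print(df[['output', 'issue']])
```

The comparison is case-sensitive and works on whole whitespace-separated tokens, so `PO/Tube` is reported as missing even though `PO` occurs in `concatenated`.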


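If there are millions of rows to process and speed matters, you may have to forget about regex. Otherwise, before splitting into words, run `df.extracted.str.replace(r'[^\w\s]+', '', regex=True)` or `re.sub(r'[^\w\s]+', '', x)`.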
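
As a rough sketch of that cleaning step, the snippet below strips punctuation from both strings before taking the set difference. The helper names `clean` and `missing_keywords` are hypothetical and chosen only for this illustration; the pattern is the one quoted in the comment.

```python
import re

# Compile once so the pattern is not rebuilt for every row.
PUNCT = re.compile(r"[^\w\s]+")

def clean(text):
    # Delete every run of characters that is neither a word character nor whitespace.
    return PUNCT.sub("", text)

def missing_keywords(extracted, concatenated):
    # Keywords of the cleaned `extracted` text that are absent from the cleaned `concatenated` text.
    return set(clean(extracted).split()) - set(clean(concatenated).split())

extracted = "digoxin  - TABLET 62.5 mcg PO/Tube"
concatenated = "digoxin  - TABLET DOSE:   62.5 mcg  -  Every other morning  -  PO"

print(clean(extracted))
print(missing_keywords(extracted, concatenated))
```

Because the pattern deletes punctuation instead of replacing it with a space, `PO/Tube` becomes the single token `POTube` and `62.5` becomes `625`. Whether that is acceptable depends on the data; substituting a space instead would split such tokens apart.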
I actually got this error: KeyError: "None of [Index(['output', 'issue'], dtype='object')] are in the [columns]"
@TheGreat I think you are using an older pandas version. You can try `df[['output', 'issue']] = pd.DataFrame([f(*s) for s in zip(df['extracted'], df['concatenated'])])`
Thanks. I am trying to go through it line by line. May I know what this line does? `return [0, ', '.join(s)] if s else [1, np.nan]`
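Sure. `return [0, ', '.join(s)] if s else [1, np.nan]` is just syntactic sugar (a one-liner) for a multi-line `if-else` statement. You can see the same code written as a multi-line `if-else` statement. @TheGreat please check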
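
One possible multi-line expansion of that one-liner, written out here as an illustration, is:

```python
import numpy as np

def f(x, y):
    # Keywords that occur in x but not in y.
    s = set(x.split()) - set(y.split())
    if s:
        # A non-empty set is truthy: something is missing, so flag 0 and list it.
        return [0, ', '.join(s)]
    else:
        # An empty set is falsy: every keyword was found, so flag 1 with no issue.
        return [1, np.nan]
```

The conditional expression `A if cond else B` evaluates to `A` when `cond` is truthy and to `B` otherwise, which is exactly what the `if`/`else` branches do here.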