Python: compare two DataFrame columns of sentence strings and create new values for a third DataFrame

python regex pandas dataframe nlp

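I have two DataFrame columns, A and B. For every row i, the whole of B is contained in A. I am trying to locate B inside A and return 1 for every word within the matched phrase and 0 for every other word of A outside the phrase B, producing a new DataFrame of 0s and 1s.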

    Why would it be competitive, so it's wond...        if the teabaggers hadn't ousted Sen
    Had he refused to attempt something so partisa...   Had he refused to attempt something so partisa...
    "This study would then have to be conducted an...   This study would then have to be conducted and

Expected DataFrame:

['0', '0', '0', '0' , '0', '1', '1', '1', '1', '1', '1'........]

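I have mainly tried two approaches. The first, which I got from Stack Overflow, tests the individual words of column B rather than the whole phrase in column B, so I get a result like this: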

['0', '1', '0', '0' , '0', '1', '1', '1', '1', '1', '1', ........]

Words in B such as "is" or "and" can always appear outside the phrase as well, and they then produce bad results.
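To make the failure concrete, here is a minimal sketch of word-level matching. The two sentences are invented for illustration and are not taken from the data above.

```python
# Hypothetical sentences, invented for illustration only.
a = "it is good and the study is slow and costly"
b = "the study is slow"

# Word-level test: a word of A is flagged if it occurs anywhere in B.
b_words = set(b.split())
word_level = [1 if w in b_words else 0 for w in a.split()]

print(word_level)  # [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]
```

The 1 in the second position comes from the first "is", which sits outside the phrase but still happens to be one of B's words.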

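I also tried a regular expression. It works very well on a single instance, but I cannot apply it to the DataFrame and get good results. It is a hacky job: it either returns endless rows of 1s or runs out of memory.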

import re

results = []
# One alternation of every phrase in B, escaped so each is matched literally
rx = '({})'.format('|'.join(re.escape(el) for el in B))
# Generator to yield replaced sentences; rep_lace is a column of 1's for each word in B
it = (re.sub(rx, rep_lace, sentence) for sentence in A)
# Build list of paired new sentences and old to filter out where not the same
results.append([new_sentence for old_sentence, new_sentence in zip(A, it) if old_sentence != new_sentence])
nw_results = ' '.join([str(elem) for elem in results])
ew_results = nw_results.split(" ")
# Compare by value: `is not` tests object identity, which is unreliable for strings
new_results = ['0' if i != '1' else i for i in ew_results]
labels = [int(e) for e in new_results]
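One possible reason the snippet above blows up is that it builds a single pattern from every phrase in B and runs it against every sentence in A, instead of pairing each row's A with its own B. A minimal per-row sketch, assuming whitespace-separated words and that only the first occurrence of the phrase should be labelled, could look like this. The helper name `label_by_span` is hypothetical.

```python
import re

def label_by_span(sentence, phrase):
    """Return one 0/1 label per whitespace-separated word of `sentence`.

    A word gets 1 when it overlaps the first literal occurrence of `phrase`.
    """
    phrase = phrase.strip()
    tokens = list(re.finditer(r'\S+', sentence))
    if not phrase:
        return [0] * len(tokens)
    # (?<!\w) and (?!\w) keep the phrase from matching inside a longer word.
    pattern = r'(?<!\w)' + re.escape(phrase) + r'(?!\w)'
    match = re.search(pattern, sentence)
    if match is None:
        return [0] * len(tokens)
    # Compare character offsets of each word with the matched span.
    return [
        1 if tok.start() < match.end() and tok.end() > match.start() else 0
        for tok in tokens
    ]

# Hypothetical sentences, invented for illustration only.
print(label_by_span("it is good and the study is slow and costly",
                    "the study is slow"))
# [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
```

Because the match is literal, the spacing inside the phrase has to be identical in both columns, and a truncated value will not match.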

I hope my explanation is clear enough.

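I don't fully understand what you mean about "is" and "and", or why they produce errors. In general, though, if you are trying to build a column C from the values in columns A and B, the best way is to apply a function to each row with a lambda.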

def word_match(col_1, col_2):
    # Gather all words in column B to check column A against
    targets = set(col_2.split())
    # For each word in A, if it's in B then 1, else 0
    output = [1 if x in targets else 0 for x in col_1.split()]
    return output

# Create new column, C, whose value on each row is word_match(A, B) on each row
df['C'] = df.apply(lambda x: word_match(x.A, x.B), axis=1)

Hope this helps.

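What I mean is this: take an example of B such as ['there are many happy people'] and A ['there are many happy people who say they are happy with the government']. You can see that the words 'are' and 'happy' occur in phrase B but also occur in A outside that specific phrase. That causes 1s to be returned outside the desired span of B within A. I will post the method I used here. It does involve regex, though.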
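A word-by-word membership test cannot tell the two occurrences of a word apart, so one alternative is to look for B as a contiguous run of words inside A. The following is only a sketch under stated assumptions: words are split on whitespace, they must match exactly (punctuation attached to a word counts as part of it), and only the first occurrence is labelled. The name `phrase_labels` is hypothetical and the sentences are illustrative.

```python
def phrase_labels(sentence, phrase):
    """Label each word of `sentence` 1 if it lies in the first contiguous
    occurrence of `phrase`, else 0."""
    words = sentence.split()
    target = phrase.split()
    labels = [0] * len(words)
    n = len(target)
    if n == 0:
        return labels
    # Slide a window of len(target) words across the sentence.
    for start in range(len(words) - n + 1):
        if words[start:start + n] == target:
            labels[start:start + n] = [1] * n
            break
    return labels

a = "there are many happy people who say they are happy with the government"
b = "there are many happy people"
print(phrase_labels(a, b))
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

The later "are" and "happy" stay 0 because they are not part of the matched run. On a DataFrame with columns A and B, the same function could be applied row by row in the way the answer above does, for example `df.apply(lambda x: phrase_labels(x.A, x.B), axis=1)`.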