
Python: extract the common items between each pair of rows


I have a DataFrame like this:

df = pd.DataFrame(np.array(['This here is text','My Text was here','This was not ready','nothing common']), columns=['Text'])

                 Text
0    This here is text
1    My Text was here
2    This was not ready
3    nothing common
I want to create a new DataFrame with the following result:

row1 row2    common_text
  0    1        here,text
  0    2        this
  1    2        was  
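That is, a new DataFrame containing all the words common to each pair of rows. If two rows have nothing in common, the pair should be left out, as with pairs 1,3 and 0,3.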


My question is: is there a faster or more Pythonic way to do this than iterating over all the rows twice to extract the common terms and store them together?

If you only want a single loop, use itertools.product, although it may not be any more Pythonic.

import itertools

# new_data_frame = ...
for row1, row2 in itertools.product(range(len(df)), range(len(df))):
    # possibly add
    ...
To get the common words, you can do this:

set(text1.lower().split()) & set(text2.lower().split())
This is quite neat. For performance reasons, I would store each sentence as a set in an intermediate list and combine those sets later:

temp = [set(s.lower().split()) for s in df['Text']]
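Putting the pieces above together, a minimal sketch might look like the following. It assumes the sentences are available as a plain list (`texts` here is a stand-in for `df['Text'].tolist()`), and `common_words` is a hypothetical helper name that does not appear in the original answer. It also swaps `itertools.product` for `itertools.combinations`, so that each unordered pair is visited once and a row is never compared with itself.

```python
from itertools import combinations

# Stand-in for df['Text'].tolist(); the fourth row is taken from the
# frame printed in the question.
texts = ['This here is text', 'My Text was here',
         'This was not ready', 'nothing common']


def common_words(texts):
    """Return [row1, row2, sorted common words] for each pair sharing a word."""
    # Tokenise every sentence once, rather than once per pair.
    word_sets = [set(s.lower().split()) for s in texts]
    rows = []
    for (i, a), (j, b) in combinations(enumerate(word_sets), 2):
        shared = a & b
        if shared:  # pairs with nothing in common are skipped
            rows.append([i, j, sorted(shared)])
    return rows


pairs = common_words(texts)
```

If a DataFrame is wanted, `pd.DataFrame(pairs, columns=['row1', 'row2', 'common_text'])` should build one from `pairs`. Sorting the shared words is a choice made here to keep the output order stable, since sets are unordered.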

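You could try a "bag of words" approach: build a list of all the common words, then use that list to check which rows contain words from it. See the documentation on feature extraction.

When setting this up from the OP's example, you may want to include something about str.casefold.
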
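One possible reading of the "bag of words" suggestion is an inverted index: map each word to the rows that contain it, then emit a pair for every two rows that share a word. The sketch below is an assumption about what was meant, not the commenter's code, and it uses only the standard library rather than any feature extraction package. `texts` stands in for `df['Text'].tolist()` and `common_by_index` is a hypothetical name. It uses `str.casefold`, which is a more aggressive form of lowercasing (for example, `'Straße'.casefold()` gives `'strasse'`).

```python
from collections import defaultdict
from itertools import combinations

# Stand-in for df['Text'].tolist().
texts = ['This here is text', 'My Text was here',
         'This was not ready', 'nothing common']


def common_by_index(texts):
    """Map each (row1, row2) pair to the sorted words the two rows share."""
    # Inverted index: word -> set of rows in which it occurs.
    index = defaultdict(set)
    for row, sentence in enumerate(texts):
        for word in sentence.casefold().split():
            index[word].add(row)

    # A word seen in k rows contributes to every pair among those rows;
    # words seen in a single row produce no pairs at all.
    shared = defaultdict(set)
    for word, rows in index.items():
        for pair in combinations(sorted(rows), 2):
            shared[pair].add(word)

    return {pair: sorted(words) for pair, words in sorted(shared.items())}


result = common_by_index(texts)
```

This avoids comparing rows that share nothing, which may help when most pairs have no words in common. When a few very frequent words appear in nearly every row, it still generates close to all pairs.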
from itertools import combinations

result = []

# Iterate through each pair of rows.
for row_1, row_2 in combinations(df['Text'].index, 2):
    # Find set of lower case words stripped of whitespace for each row in pair.
    s1, s2 = [set(df.loc[row, 'Text'].lower().strip().split()) for row in (row_1, row_2)]
    # Find the common words to the pair of rows.
    common = s1.intersection(s2)
    if common:
        # If there are words in common, append to the results as a comma-separated string (could also append the set or list of words).
        result.append([row_1, row_2, ",".join(common)])

>>> pd.DataFrame(result, columns=['row1', 'row2', 'common_text'])
   row1  row2 common_text
0     0     1   text,here
1     0     2        this
2     1     2         was