Python: find word intersections in DataFrame strings, whole words only

python pandas numpy dataframe

Below is a sample DataFrame. For each Bus description, I want to find all other Bus entries whose description shares at least one word with it.

Bus #                  DESCRIPTION

Bus1                   RICE MILLS MANUFACTURER 
Bus2                   LICORICE CANDY RETAIL
Bus3                   LICORICE CANDY WHOLESALE
Bus4                   RICE RETAIL

For example, the expected output for each row is:

RICE MILLS MANUFACTURER would be "RICE RETAIL"
LICORICE CANDY RETAIL would be "RICE RETAIL" "LICORICE CANDY WHOLESALE"
LICORICE CANDY WHOLESALE would be "LICORICE CANDY RETAIL"
RICE RETAIL would be: "RICE MILLS MANUFACTURER" "LICORICE CANDY RETAIL"

The following code almost does this correctly:

df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][3].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][3].split()[1])]

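The problem is that "RICE" is a substring of "LICORICE", so the output for "RICE MILLS MANUFACTURER" includes "LICORICE CANDY RETAIL". I don't want that.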
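One way to keep the `str.contains` approach while matching whole words only is to wrap each word in `\b` word-boundary anchors. The following is a minimal sketch on the sample data from the question; the helper name `rows_sharing_word` is made up for illustration and is not part of the original code.

```python
import re
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "DESCRIPTION": [
        "RICE MILLS MANUFACTURER",
        "LICORICE CANDY RETAIL",
        "LICORICE CANDY WHOLESALE",
        "RICE RETAIL",
    ]
})

def rows_sharing_word(frame, word):
    # Hypothetical helper: \b anchors the pattern at word boundaries,
    # so "RICE" no longer matches inside "LICORICE".
    pattern = r"\b" + re.escape(word) + r"\b"
    return frame[frame["DESCRIPTION"].str.contains(pattern, regex=True)]

# Plain substring search matches every row, including both LICORICE rows
substring_hits = df[df["DESCRIPTION"].str.contains("RICE")]

# Whole-word search keeps only the rows that contain the word RICE itself
whole_word_hits = rows_sharing_word(df, "RICE")

print(substring_hits.index.tolist())
print(whole_word_hits.index.tolist())
```

`re.escape` is there so that words containing regex metacharacters are treated literally.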

def match_word(ref_row,series):
    """
    --inputs
    ref_row (str): this is the string of reference
    series (pandas.series): this a series containing all other strings you want to cross-check
    --outputs:
    series (pandas.series): this will be a series of booleans
    """
    # convert ref_row into a set of words; split() with no argument splits on any
    # run of whitespace, so leading, trailing, and repeated spaces produce no empty tokens
    ref_row = set(ref_row.split())
    # convert every string in the series into a set of words
    series = series.apply(lambda x: set(x.split()))
    # cross-check each row with the reference row:
    # count the words in the intersection
    series = series.apply(lambda x: len(x & ref_row))
    # a count greater than zero means the row shares at least one whole word with ref_row
    series = series > 0
    return series

You can now call the function above as follows:

df['DESCRIPTION'].apply(lambda x: match_word(x, df['DESCRIPTION']))

Note that this is not the most optimized algorithm, but it is a quick-and-dirty approach. It is O(n^2).
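If the quadratic cost matters, one possible alternative (not part of the answer above) is an inverted index from each word to the rows that contain it, so that only rows actually sharing a word get paired. This is a sketch in plain Python; `shared_word_neighbors` is a hypothetical helper name, and the worst case is still quadratic when one word appears in most rows.

```python
from collections import defaultdict

def shared_word_neighbors(descriptions):
    # Hypothetical helper: map each whole word to the row positions containing it
    index = defaultdict(set)
    for pos, text in enumerate(descriptions):
        for word in set(text.split()):
            index[word].add(pos)

    # For each row, collect every other row that shares at least one of its words
    neighbors = {}
    for pos, text in enumerate(descriptions):
        hits = set()
        for word in set(text.split()):
            hits |= index[word]
        hits.discard(pos)  # a row is not its own neighbor
        neighbors[pos] = sorted(hits)
    return neighbors

descriptions = [
    "RICE MILLS MANUFACTURER",
    "LICORICE CANDY RETAIL",
    "LICORICE CANDY WHOLESALE",
    "RICE RETAIL",
]

result = shared_word_neighbors(descriptions)
print(result)
```

Because whole words are used as dictionary keys, "RICE" and "LICORICE" can never be confused.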

This is still O(n^2), but it is highly vectorized:

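import numpy as np

# get values of DESCRIPTION
s = df.DESCRIPTION.values.astype(str)

# split each string on whitespace and turn the word lists into sets
# (np.char.split is the public name for np.core.defchararray.split)
sets = np.array([set(l) for l in np.char.split(s).tolist()])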

# get upper triangle indices for all combinations of DESCRIPTION
r, c = np.triu_indices(len(sets), 1)

# use set operations to replicate intersection
i = sets[r] - sets[c] < sets[r]

# grab indices where intersections happen
r, c = r[i], c[i]
r, c = np.append(r, c), np.append(c, r)
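The comparison `sets[r] - sets[c] < sets[r]` relies on a set identity: removing the elements of `y` from `x` leaves a proper subset of `x` exactly when the two sets share at least one element. A small check of that identity on words from the sample data (the function name `shares_word` is only for illustration):

```python
a = {"RICE", "RETAIL"}
b = {"LICORICE", "CANDY", "RETAIL"}
c = {"LICORICE", "CANDY", "WHOLESALE"}

def shares_word(x, y):
    # x - y drops the elements common to both sets;
    # `<` on sets tests for a proper subset
    return (x - y) < x

# a and b share "RETAIL", so both tests agree that they intersect
print(shares_word(a, b), bool(a & b))

# a and c share nothing, so the difference equals a and is not a proper subset
print(shares_word(a, c), bool(a & c))
```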

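What output structure do you want? A list? A Series? A DataFrame?

I want the output for RICE RETAIL not to include LICORICE CANDY RETAIL. Other than that, this is the correct output.

I found a correct solution, but this method may not be as fast as yours.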

df.DESCRIPTION.iloc[c].groupby(r).apply(list)

0                                       [RICE RETAIL]
1             [LICORICE CANDY WHOLESALE, RICE RETAIL]
2                             [LICORICE CANDY RETAIL]
3    [RICE MILLS MANUFACTURER, LICORICE CANDY RETAIL]
Name: DESCRIPTION, dtype: object

# build truth matrix, initialized to False
# (the np.bool alias was removed from NumPy; the builtin bool is the dtype to use)
t = np.zeros((s.size, s.size), dtype=bool)

t[r, c] = True

pd.DataFrame(t, df.index, df.index)

       0      1      2      3
0  False  False  False   True
1  False  False   True   True
2  False   True  False  False
3   True   True  False  False
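As a cross-check of the truth matrix, the same result can be obtained from word indicator columns. This is a sketch of a different technique from the one above, assuming the sample data from the question: `Series.str.get_dummies` builds one 0/1 column per whole word, and the product of that matrix with its transpose counts the words each pair of rows has in common.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "DESCRIPTION": [
        "RICE MILLS MANUFACTURER",
        "LICORICE CANDY RETAIL",
        "LICORICE CANDY WHOLESALE",
        "RICE RETAIL",
    ]
})

# One indicator column per whole word, splitting on spaces
indicators = df["DESCRIPTION"].str.strip().str.get_dummies(sep=" ")

# Entry (i, j) counts the words that rows i and j have in common
values = indicators.to_numpy()
counts = values @ values.T

# Every row shares all of its words with itself, so clear the diagonal
np.fill_diagonal(counts, 0)

truth = pd.DataFrame(counts > 0, index=df.index, columns=df.index)
print(truth)
```

The indicator matrix has one column per distinct word, so this trades memory for simplicity when the vocabulary is large.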