Python: find word intersections in DataFrame strings (whole words only)
Below is a sample DataFrame. For each Bus description, I want to find all other Buses whose description has at least one word in common with it.
Bus # DESCRIPTION
Bus1 RICE MILLS MANUFACTURER
Bus2 LICORICE CANDY RETAIL
Bus3 LICORICE CANDY WHOLESALE
Bus4 RICE RETAIL
For example, the expected output for each description would be:
RICE MILLS MANUFACTURER would be "RICE RETAIL"
LICORICE CANDY RETAIL would be "RICE RETAIL" "LICORICE CANDY WHOLESALE"
LICORICE CANDY WHOLESALE would be "LICORICE CANDY RETAIL"
RICE RETAIL would be: "RICE MILLS MANUFACTURER" "LICORICE CANDY RETAIL"
The code below almost does this correctly:
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][3].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][3].split()[1])]
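The problem is that the word "RICE" is contained in "LICORICE", so the output for RICE MILLS MANUFACTURER includes "LICORICE CANDY RETAIL". I don't want that.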
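The substring problem described above can be reproduced, and avoided, with a regular expression that uses word boundaries. This is only a minimal sketch: the helper name `whole_word_matches` is hypothetical, and the DataFrame is rebuilt from the sample table in the question.

```python
import re

import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "Bus #": ["Bus1", "Bus2", "Bus3", "Bus4"],
    "DESCRIPTION": [
        "RICE MILLS MANUFACTURER",
        "LICORICE CANDY RETAIL",
        "LICORICE CANDY WHOLESALE",
        "RICE RETAIL",
    ],
})


def whole_word_matches(word, descriptions):
    """Hypothetical helper: True where `word` occurs as a whole word."""
    # \b is a word boundary, so "RICE" no longer matches inside "LICORICE".
    # re.escape guards against words that contain regex metacharacters.
    pattern = r"\b" + re.escape(word) + r"\b"
    return descriptions.str.contains(pattern, regex=True)


# Plain substring search: every row matches, because "LICORICE" contains "RICE".
substring_hits = df["DESCRIPTION"].str.contains("RICE")

# Whole-word search: only the two descriptions that contain the word RICE.
word_hits = whole_word_matches("RICE", df["DESCRIPTION"])
print(df.loc[word_hits, "DESCRIPTION"].tolist())
```

The same pattern could replace each `str.contains(...)` call in the question, although that still requires one call per word.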
def match_word(ref_row, series):
    """
    Flag the strings in `series` that share at least one whole word with `ref_row`.

    --inputs
    ref_row (str): the reference string
    series (pandas.Series): a series containing all the strings you want to cross-check
    --outputs:
    series (pandas.Series): a series of booleans
    """
    # convert ref_row into a set of words; strip() removes whitespace before and after the string
    ref_row = set(ref_row.strip().split(' '))
    # convert each string in the series into a set of words
    series = series.apply(lambda x: set(x.strip().split(' ')))
    # cross-check each row against the reference row:
    # find the size (number of words) of the intersection
    series = series.apply(lambda x: len(x.intersection(ref_row)))
    # a size greater than zero means the row shares a word with ref_row
    series = series > 0
    return series
Now you can call the function above as follows:
df['DESCRIPTION'].apply(lambda x: match_word(x, df['DESCRIPTION']))
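The call above yields a table of booleans in which every row also matches itself. The sketch below uses the same set-intersection idea to produce the lists of matching descriptions asked for in the question, skipping the self-match. The function name `shared_word_matches` is hypothetical, and the DataFrame is rebuilt from the question's sample table.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "Bus #": ["Bus1", "Bus2", "Bus3", "Bus4"],
    "DESCRIPTION": [
        "RICE MILLS MANUFACTURER",
        "LICORICE CANDY RETAIL",
        "LICORICE CANDY WHOLESALE",
        "RICE RETAIL",
    ],
})


def shared_word_matches(descriptions):
    """Hypothetical helper: for each description, list the other
    descriptions that share at least one whole word with it."""
    # Splitting on whitespace compares whole words only.
    word_sets = [set(text.split()) for text in descriptions]
    matches = []
    for i, words in enumerate(word_sets):
        matches.append([
            descriptions.iloc[j]
            for j, other in enumerate(word_sets)
            if j != i and words & other  # skip self, keep non-empty intersections
        ])
    return pd.Series(matches, index=descriptions.index)


result = shared_word_matches(df["DESCRIPTION"])
print(result)
```

Like the function above, this compares every pair of rows, so it is still quadratic in the number of descriptions.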
Note that this is not the most optimized algorithm, but it is a quick and dirty approach. It is O(n^2).
This is still O(n^2); however, it is highly vectorized.
# get values of DESCRIPTION
s = df.DESCRIPTION.values.astype(str)
# parse strings and turn into sets
sets = np.array([set(l) for l in np.char.split(s).tolist()])
# get upper triangle indices for all combinations of DESCRIPTION
r, c = np.triu_indices(len(sets), 1)
# use set operations to replicate intersection: A - B is a proper subset of A exactly when A and B share a word
i = sets[r] - sets[c] < sets[r]
# grab indices where intersections happen
r, c = r[i], c[i]
r, c = np.append(r, c), np.append(c, r)
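What output structure do you want? A list? A Series? A DataFrame?
I want the output for RICE RETAIL not to include LICORICE CANDY RETAIL. Other than that, this is the correct output.
I found a correct solution, but this method may not be as fast as yours.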
df.DESCRIPTION.iloc[c].groupby(r).apply(list)
0 [RICE RETAIL]
1 [LICORICE CANDY WHOLESALE, RICE RETAIL]
2 [LICORICE CANDY RETAIL]
3 [RICE MILLS MANUFACTURER, LICORICE CANDY RETAIL]
Name: DESCRIPTION, dtype: object
# build truth matrix
t = np.zeros((s.size, s.size), dtype=bool)
t[r, c] = True
pd.DataFrame(t, df.index, df.index)
0 1 2 3
0 False False False True
1 False False True True
2 False True False False
3 True True False False
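Both answers above compare every pair of rows. As an alternative that is not taken from either answer, the sketch below builds an inverted index (word to row positions) and only pairs up rows that appear under the same word. All names (`descriptions`, `word_to_rows`, `matches`) are illustrative, and the data is copied from the question's sample table. The worst case is still quadratic when many rows share a common word.

```python
from collections import defaultdict

# Descriptions copied from the question's sample table, in row order.
descriptions = [
    "RICE MILLS MANUFACTURER",
    "LICORICE CANDY RETAIL",
    "LICORICE CANDY WHOLESALE",
    "RICE RETAIL",
]

# Inverted index: each whole word maps to the set of row positions containing it.
word_to_rows = defaultdict(set)
for row, text in enumerate(descriptions):
    for word in text.split():
        word_to_rows[word].add(row)

# Rows listed under the same word match each other (excluding the row itself).
matches = {row: set() for row in range(len(descriptions))}
for rows in word_to_rows.values():
    for row in rows:
        matches[row] |= rows - {row}

for row, others in matches.items():
    print(descriptions[row], "->", [descriptions[j] for j in sorted(others)])
```

The resulting pairs agree with the truth matrix above: row 0 matches row 3, row 1 matches rows 2 and 3, row 2 matches row 1, and row 3 matches rows 0 and 1.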