Python dataframe: selecting contiguous spans when a condition is met

Suppose I have a list of stop words:

STOP = ['under', 'its', 'agreement', 'financed'] 
For the given dataframe:

lst = ['Kan.-based National', 'Kan.-based National Pizza', 'stock market', 
   'Pittsburg Kan.-based National Pizza', 'the stock market', 'revolving credit',
   'revolving credit agreement', 'its revolving credit agreement', 'under its revolving credit agreement', 
   'financed under its revolving credit agreement']

df = pd.DataFrame(lst)
That is:

0   Kan.-based National
1   Kan.-based National Pizza
2   stock market
3   Pittsburg Kan.-based National Pizza
4   the stock market
5   revolving credit
6   revolving credit agreement
7   its revolving credit agreement
8   under its revolving credit agreement
9   financed under its revolving credit agreement
I want to obtain:

out = ['Pittsburg Kan.-based National Pizza', 'the stock market', 'revolving credit',
       'revolving credit agreement', 'its revolving credit agreement', 'under its revolving credit agreement', 
       'financed under its revolving credit agreement']

df_out = pd.DataFrame(out)
That is:

0   Pittsburg Kan.-based National Pizza
1   the stock market
2   revolving credit
3   revolving credit agreement
4   its revolving credit agreement
5   under its revolving credit agreement
6   financed under its revolving credit agreement
Note: the order of the rows does not matter.

Explanation:

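Since 'Kan.-based National' and 'Kan.-based National Pizza' differ by only one word, 'Pizza', and that word is not in the STOP list, we want to select the longest span, i.e. 'Kan.-based National Pizza'. However, 'Pittsburg Kan.-based National Pizza' and 'Kan.-based National Pizza' also differ by only one word, 'Pittsburg', which is not in the STOP list either, so we want to select the longest span, i.e. 'Pittsburg Kan.-based National Pizza'.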

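We cannot select 'financed under its revolving credit agreement' as the longest span starting from 'revolving credit', because the words appear in the STOP list. Therefore, we do not delete its smaller spans.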

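Alternatively, as a side case, a string may start with (a | an | the) and differ from its common span by only one word. For example, given 'stock market' and 'the stock market', we want to select the longest span, i.e. 'the stock market'.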
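
One possible reading of the rules above, written as a minimal sketch rather than a definitive solution: a phrase is dropped when some other phrase equals it plus exactly one extra word at the start or at the end, and that extra word is not a stop word. The helper name `drop_shorter_spans` is hypothetical, and the sketch assumes whitespace tokenisation and a case-insensitive stop-word check.

```python
STOP = {'under', 'its', 'agreement', 'financed'}

lst = ['Kan.-based National', 'Kan.-based National Pizza', 'stock market',
       'Pittsburg Kan.-based National Pizza', 'the stock market', 'revolving credit',
       'revolving credit agreement', 'its revolving credit agreement',
       'under its revolving credit agreement',
       'financed under its revolving credit agreement']


def drop_shorter_spans(phrases, stop_words):
    """Drop a phrase if a one-word-longer phrase extends it with a non-stop word."""
    tokenized = [p.split() for p in phrases]
    to_drop = set()
    for i, shorter in enumerate(tokenized):
        for longer in tokenized:
            # Only consider phrases that are exactly one word longer.
            if len(longer) != len(shorter) + 1:
                continue
            if longer[1:] == shorter:      # extra word at the start
                extra = longer[0]
            elif longer[:-1] == shorter:   # extra word at the end
                extra = longer[-1]
            else:
                continue
            # A stop word as the extra word protects the shorter span.
            if extra.lower() not in stop_words:
                to_drop.add(i)
                break
    return [p for i, p in enumerate(phrases) if i not in to_drop]


result = drop_shorter_spans(lst, STOP)
```

On the example data this keeps the seven strings listed in `out`. Under this reading the (a | an | the) case needs no special handling, because 'the' is simply an extra word that is not in `STOP`. With the question's dataframe, the same function could be applied to `df[0].tolist()` and the result wrapped in `pd.DataFrame`.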

I tried:

delete_from_best_constituents = []
for u in best_parse_constituents:
    for v in best_parse_constituents:
        if u.lower().startswith('the') or v.lower().startswith('the'):
            u_part = u.lower().split('the')[-1].strip()
            v_part =  v.lower().split('the')[-1].strip()
            cond1 = all([w.lower() not in STOP for w in u_part.split()])
            cond2 = all([w.lower() not in STOP for w in v_part.split()])
            if u_part == v.lower() or v_part == u.lower() and cond1 and cond2:
                if not u.lower().startswith('the'):
                    delete_from_best_constituents.append(u)
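
Two details of the attempt above may explain unexpected results, and the sketch below illustrates them in isolation. First, `and` binds more tightly than `or`, so the unparenthesised condition is not grouped as `(… or …) and cond1 and cond2`. Second, `str.split('the')` and `startswith('the')` match the letters "the" anywhere they occur, including inside longer words. The helper `strip_leading_article` is a hypothetical replacement, not part of the original code.

```python
import re

# 1. Operator precedence: `a or b and c` is parsed as `a or (b and c)`.
a, b, c = True, False, False
ungrouped = a or b and c      # True
grouped = (a or b) and c      # False

# 2. Splitting on the substring 'the' also cuts through other words.
pieces = "other theme".split("the")   # ['o', 'r ', 'me']

# A leading article followed by whitespace, matched case-insensitively.
ARTICLE = re.compile(r'^(?:a|an|the)\s+', flags=re.IGNORECASE)


def strip_leading_article(phrase):
    """Remove one leading 'a', 'an' or 'the' when it is a whole word."""
    return ARTICLE.sub('', phrase, count=1)
```

For example, `strip_leading_article('the stock market')` gives `'stock market'`, while `'theme park'` is left unchanged because "the" is not a separate word there.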

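Suppose XS is a set of strings. Let YS be the set of X in XS such that, for every other X-hat in XS, X is not a substring of X-hat. For example, if XS is {"blue", "blueegg", "hello", "hello world"}, is YS then {"blueegg", "hello world"}? Are you computing YS?

Are you trying to remove all the short substrings? I don't know what you are trying to do. Do you want to keep only the longer superstrings? Whenever one string turns out to be a substring of another, do you delete the short string and keep the longer one? For example, if you find both "hello world" and "hello", do you keep "hello world" and discard "hello"?

You wrote that we cannot select 'financed under its revolving credit agreement' as the longest string starting with 'revolving credit', which makes no sense at all. The string 'financed under its revolving credit agreement' does not start with the string 'revolving credit'.

You wrote: "We cannot do [BLAH], because the words appear in the STOP list." Exactly which "words"? Which words in the STOP list actually cause the problem?

Can you provide a definition of "span"? You keep talking about "this span", "that span" and so on, but you never say what a "span" actually is. Could you add a hyperlink to a definition of "span" to your question?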
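
For comparison, the YS set described in the first comment can be sketched as below. This is only the commenter's plain substring reading, with a hypothetical helper name `maximal_strings`; it ignores the STOP list entirely, so it is not the behaviour the question asks for.

```python
def maximal_strings(strings):
    """Keep each string that is not a substring of any other, different string."""
    return [x for x in strings
            if not any(x != y and x in y for y in strings)]


xs = ["blue", "blueegg", "hello", "hello world"]
ys = maximal_strings(xs)   # ['blueegg', 'hello world']

lst = ['Kan.-based National', 'Kan.-based National Pizza', 'stock market',
       'Pittsburg Kan.-based National Pizza', 'the stock market', 'revolving credit',
       'revolving credit agreement', 'its revolving credit agreement',
       'under its revolving credit agreement',
       'financed under its revolving credit agreement']
maximal = maximal_strings(lst)
```

On the question's data this leaves only three strings ('Pittsburg Kan.-based National Pizza', 'the stock market' and 'financed under its revolving credit agreement'), whereas the desired output has seven. The difference comes from the shorter 'revolving credit …' spans, which the question wants to keep because the words separating them are in the STOP list.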