Python: building abbreviations from the first character of each non-stopword

Given a list of stopwords and a DataFrame with one column holding the full forms, as shown:

stopwords = ['of', 'and', '&', 'com', 'org']
df = pd.DataFrame({'Full form': ['World health organization', 'Intellectual property', 'royal bank of canada']})
df

+---+---------------------------+
|   |         Full form         |
+---+---------------------------+
| 0 | World health organization |
| 1 | Intellectual property     |
| 2 | Royal bank of canada      |
+---+---------------------------+
I am looking for a way to build an abbreviation in an adjacent column, ignoring stopwords (if any).

Expected output:

+---+---------------------------+----------------+
|   |         Full form         |   Abbreviation |
+---+---------------------------+----------------+
| 0 | World health organization |   WHO          |
| 1 | Intellectual property     |   IP           |
| 2 | Royal bank of canada      |   RBC          |
+---+---------------------------+----------------+
This should do it:

import pandas as pd

stopwords = ['of', 'and', '&', 'com', 'org']
df = pd.DataFrame({'Full form': ['World health organization', 'Intellectual property', 'royal bank of canada']})


def abbrev(t, stopwords=stopwords):
    return ''.join(u[0] for u in t.split() if u not in stopwords).upper()


df['Abbreviation'] = df['Full form'].apply(abbrev)

print(df)
Output:

                   Full form Abbreviation
0  World health organization          WHO
1      Intellectual property           IP
2       royal bank of canada          RBC
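
Note that the function above compares words case-sensitively, so a capitalised `Of` would not be filtered out. Below is a minimal sketch of a case-insensitive variant, assuming that is the behaviour you want. The names `STOPWORDS` and `abbrev_ci` and the sample rows are illustrative and not part of the original answer.

```python
import pandas as pd

# Hypothetical names: STOPWORDS and abbrev_ci are not from the original answer.
# A set gives constant-time membership checks.
STOPWORDS = {'of', 'and', '&', 'com', 'org'}


def abbrev_ci(text, stopwords=STOPWORDS):
    """Join the first character of every word whose lowercase form is not a stopword."""
    return ''.join(w[0] for w in text.split() if w.lower() not in stopwords).upper()


# Illustrative data with mixed capitalisation.
df = pd.DataFrame({'Full form': ['World Health Organization',
                                 'Bank Of Canada',
                                 'Research & Development']})
df['Abbreviation'] = df['Full form'].apply(abbrev_ci)
print(df)
```

Lowercasing only for the comparison keeps the original text untouched, and the final `.upper()` still normalises the result.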
Another approach:

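import re

# Match stopwords only as whole whitespace-delimited words, so 'org' is not stripped from 'organization'.
pattern = r'(?<!\S)(?:' + '|'.join(map(re.escape, stopwords)) + r')(?!\S)'
df['Abbreviation'] = (df['Full form'].str.replace(pattern, '', regex=True)
                      .str.split()
                      .apply(lambda words: [w[0].upper() for w in words])
                      .str.join(''))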
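
When stopwords are removed with a regular expression, a pattern without boundaries also matches inside longer words. The sketch below uses only the standard `re` module to illustrate the difference. The variable names `unanchored`, `anchored`, `loose` and `strict` are illustrative assumptions, not part of the original answer.

```python
import re

stopwords = ['of', 'and', '&', 'com', 'org']
text = 'World health organization'

# Unanchored alternation: 'org' is also removed from inside 'organization'.
unanchored = '|'.join(map(re.escape, stopwords))
loose = re.sub(unanchored, '', text)

# Anchored: a stopword matches only when bounded by whitespace or the string edges.
anchored = r'(?<!\S)(?:' + unanchored + r')(?!\S)'
strict = re.sub(anchored, '', text)

print(loose)   # World health anization
print(strict)  # World health organization
```

With the unanchored pattern the first row would abbreviate to `WHA` rather than `WHO`, which is why whole-word matching matters here.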
Here is a regex solution:

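import re

stopwords = ['of', 'and', '&', 'com', 'org']
# Negative lookahead: fail if a stopword starts at this position.
stopwords_re = r"(?!" + r"\b|".join(stopwords) + r"\b)"
# First word character after a word boundary, unless it begins a stopword.
abbv_re = r"\b{}\w".format(stopwords_re)

def abbrv(s):
    return "".join(re.findall(abbv_re, s)).upper()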
Output:

>>> abbrv('royal bank of scotland')
'RBS'
Using it with pandas:

df['Abbreviation'] = df['Full form'].apply(abbrv)

In short, the pattern has two parts:

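  • \b{}\w
    : matches the first word character after each word boundary, i.e. the first letter of each word
  • (?!of\b|and\b|&\b|com\b|org\b)
    : unless a stopword starts at that position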
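
To make the two parts concrete, the sketch below prints the assembled pattern and the raw matches before they are joined and uppercased. The names `lookahead`, `pattern` and `initials` are illustrative assumptions. The construction itself follows the answer above.

```python
import re

stopwords = ['of', 'and', '&', 'com', 'org']

# Build the negative lookahead from the stopword list.
lookahead = r"(?!" + r"\b|".join(stopwords) + r"\b)"
# Word boundary, then the lookahead, then a single word character.
pattern = r"\b{}\w".format(lookahead)

print(pattern)
initials = re.findall(pattern, 'royal bank of scotland')
print(initials)  # ['r', 'b', 's']
```

Because `\w` never matches `&`, that entry in the lookahead has no effect. Symbols are skipped by the pattern regardless of the stopword list.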

That was fast. Thanks @Daniel