Python: build an abbreviation from the first character of each non-stop word
Given a list of stop words and a DataFrame in which one column holds the full form, as shown:
stopwords = ['of', 'and', '&', 'com', 'org']
df = pd.DataFrame({'Full form': ['World health organization', 'Intellectual property', 'royal bank of canada']})
df
+---+---------------------------+
| | Full form |
+---+---------------------------+
| 0 | World health organization |
| 1 | Intellectual property |
| 2 | Royal bank of canada |
+---+---------------------------+
I am looking for a way to build an abbreviation in an adjacent column, ignoring the stop words (if any).
Expected output:
+---+---------------------------+----------------+
| | Full form | Abbreviation |
+---+---------------------------+----------------+
| 0 | World health organization | WHO |
| 1 | Intellectual property | IP |
| 2 | Royal bank of canada | RBC |
+---+---------------------------+----------------+
This should do it:
import pandas as pd
stopwords = ['of', 'and', '&', 'com', 'org']
df = pd.DataFrame({'Full form': ['World health organization', 'Intellectual property', 'royal bank of canada']})
def abbrev(t, stopwords=stopwords):
    # Take the first letter of every word that is not a stop word, then upper-case the result
    return ''.join(u[0] for u in t.split() if u not in stopwords).upper()
df['Abbreviation'] = df['Full form'].apply(abbrev)
print(df)
Output:
Full form Abbreviation
0 World health organization WHO
1 Intellectual property IP
2 royal bank of canada RBC
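Note that the membership test above is case-sensitive, so a capitalised "Of" would not be treated as a stop word. A minimal sketch of a case-insensitive variant follows; the names `abbrev_ci` and `STOPWORDS` are made up for this illustration and are not part of the answer above.

```python
# Hypothetical variant of abbrev(): compares words to the stop list in lower case.
STOPWORDS = frozenset({'of', 'and', '&', 'com', 'org'})

def abbrev_ci(text, stopwords=STOPWORDS):
    """Join the first letter of every word that is not a stop word, upper-cased."""
    words = text.split()
    return ''.join(w[0] for w in words if w.lower() not in stopwords).upper()

print(abbrev_ci('Royal Bank Of Canada'))       # RBC
print(abbrev_ci('World health organization'))  # WHO
```

With pandas it could be applied the same way, for example `df['Full form'].apply(abbrev_ci)`.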
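Another approach:
import re
# Remove stop words only when they are whole whitespace-delimited words,
# so that "org" is not also stripped out of "organization"
pattern = r'(?<!\S)(?:' + '|'.join(map(re.escape, stopwords)) + r')(?!\S)'
df['Abbreviation'] = (df['Full form'].str.replace(pattern, '', regex=True)
                      .str.split()
                      .apply(lambda words: [w[0].upper() for w in words])
                      .str.join(''))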
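A caveat for any regex-based stop word removal of this kind: a pattern without boundaries also matches inside longer words. The sketch below uses only the standard `re` module to compare the two behaviours; the helper `initials` and the pattern names `naive` and `whole` are assumptions introduced for this example.

```python
import re

stopwords = ['of', 'and', '&', 'com', 'org']
alternatives = '|'.join(re.escape(w) for w in stopwords)

# Naive pattern: matches a stop word anywhere, even inside a longer word.
naive = re.compile(alternatives)

# Whole-word pattern: a stop word only counts when it is surrounded by
# whitespace or the start/end of the string.
whole = re.compile(r'(?<!\S)(?:' + alternatives + r')(?!\S)')

def initials(text, pattern):
    """Delete everything the pattern matches, then join the first letters."""
    return ''.join(w[0].upper() for w in pattern.sub('', text).split())

print(initials('World health organization', naive))  # WHA ("org" was cut out of "organization")
print(initials('World health organization', whole))  # WHO
```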
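Here is a regular expression solution:
import re
stopwords = ['of', 'and', '&', 'com', 'org']
# Negative lookahead: reject a position where a whole stop word begins
stopwords_re = r"(?!" + r"\b|".join(stopwords) + r"\b)"
# A word boundary, not followed by a stop word, then the word's first character
abbv_re = r"\b{}\w".format(stopwords_re)
def abbrv(s):
    return "".join(re.findall(abbv_re, s)).upper()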
[out]:
>>> abbrv('royal bank of scotland')
'RBS'
Using it with pandas:
df['Abbreviation'] = df['Full form'].apply(abbrv)
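For a full explanation of the regular expression, see:
In short:
\b{}\w : find every character that directly follows a word boundary
(?!of\b|and\b|&\b) : unless a word from the stop word list starts at that position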
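The two pieces described above can be put together a little more defensively. This is a sketch under a few assumptions rather than the answer's own code: it escapes each stop word with `re.escape` in case the list ever contains regex metacharacters, compiles the pattern once, and adds `re.IGNORECASE` so that "Of" is skipped as well. The names `lookahead`, `first_letter` and `abbrv_ci` are hypothetical.

```python
import re

stopwords = ['of', 'and', '&', 'com', 'org']

# Negative lookahead: fail if a whole stop word starts at the current position.
lookahead = '(?!(?:' + '|'.join(re.escape(w) for w in stopwords) + r')\b)'

# Word boundary, then the lookahead, then the first character of the word.
first_letter = re.compile(r'\b' + lookahead + r'\w', flags=re.IGNORECASE)

def abbrv_ci(s):
    """Collect the first character of every non-stop word and upper-case it."""
    return ''.join(first_letter.findall(s)).upper()

print(abbrv_ci('Royal Bank Of Scotland'))     # RBS
print(abbrv_ci('World health organization'))  # WHO
```

Because the lookahead requires a word boundary after the stop word, "organization" is kept even though it begins with "org".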