Python: extract hyphenated words from cells containing phrases (tags: python, regex, pandas, nlp)


I have a DataFrame containing phrases, and I want to extract only the compound words separated by a hyphen and put them in another DataFrame.

df=pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green','Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp da-derp derp', 'Pok-e-mon'],})
Here is what I have so far:

import pandas as pd

df=pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green','Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp da-derp derp', 'Pok-e-mon'],})


new = df['Phrases'].str.extract("(?P<part1>.*?)-(?P<part2>.*)")
What I want is only the words, like this (note that Pok-e-mon shows up as NaN because it has two hyphens):

>>> new
            part1        part2
0          Yellow        Green
1             Jong          il
2             NaN          NaN
3          methyl       butane
4              da         derp
5             NaN          NaN

You can use this regex:

(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)

(?:               # non-capturing group
    [^-\w]|^      # a character that is neither a hyphen nor a word character, or the start of the string
)
(?P<part1>
    [a-zA-Z]+     # one or more letters
)-(?P<part2>
    [a-zA-Z]+     # one or more letters
)
(?:[^-\w]|$)      # a character that is neither a hyphen nor a word character, or the end of the string
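Your first problem is that nothing prevents .* from consuming spaces. [a-zA-Z] selects only letters, which stops the match from jumping from one word to the next.
For the Pok-e-mon case, you need to check that there is no hyphen directly before or after the match.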
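To make the difference concrete, here is a minimal sketch that runs both patterns over the sample data from the question. The variable names `loose`, `strict`, and `strict_pattern` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green', 'Kim Jong-il was here',
                               'President Barack Obama', 'methyl-butane',
                               'Derp da-derp derp', 'Pok-e-mon']})

# Pattern from the question: `.` also matches spaces and hyphens,
# so the captured groups spill across word boundaries.
loose = df['Phrases'].str.extract(r"(?P<part1>.*?)-(?P<part2>.*)")

# Pattern suggested in the answer: letters only, and no hyphen or word
# character directly before or after the hyphenated pair.
strict_pattern = (r"(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-"
                  r"(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)")
strict = df['Phrases'].str.extract(strict_pattern)

print(loose)
print(strict)
```

With this data, the question's pattern captures 'Trail 1 Yellow' as part1 of the first row and splits 'Pok-e-mon' into 'Pok' and 'e-mon', while the stricter pattern yields only the two-part words and leaves the 'President Barack Obama' and 'Pok-e-mon' rows as NaN.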

Given the specs, I don't see where the NaN, NaN in your first row comes from. Maybe it is a typo in your example? In any case, here is a possible solution:

import re
from itertools import chain

import numpy as np
import pandas as pd

# return every word in the phrase that contains at least one hyphen
def split_phrase(phrase):
    return re.findall(r'(\w+(?:-\w+)+)', phrase)

# get all words with hyphens, flattening the per-row lists
# (a bare sum() over them raises a TypeError, as noted in the comments)
words_with_hyphens = list(chain.from_iterable(df.Phrases.apply(split_phrase)))
# split all words into parts
split_words = [word.split('-') for word in words_with_hyphens]
# keep words with exactly two parts, otherwise use (NaN, NaN)
new_data = [(ws[0], ws[1]) if len(ws) == 2 else (np.nan, np.nan) for ws in split_words]
# create the new DataFrame
pd.DataFrame(new_data, columns=['part1', 'part2'])

#  part1   | part2
#------------------
# 0 Yellow | Green
# 1 Jong   | il
# 2 methyl | butane
# 3 da     | derp
# 4 NaN    | NaN

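The phrase "President Barack Obama" has no hyphen, which is why it is NaN. Also, the sum function cannot be used on the strings in your words_with_hyphens line; it raises a TypeError.

Yes, I am using Python 3.

I tried this and something similar; it does not work with pandas str.extract. If I use it, it returns all NaNs.

@ccsv: does (?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+) work for everything except the Pok-e-mon case? I edited to replace the word boundaries; maybe pandas doesn't like those.

OK, I guess there is no shortcut for capturing everything in one go, so I'll write another regex for Pok-e-mon and similar cases so I can throw them out. I also used the same regex code you currently have.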
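On the last comment's concern about catching everything in one pass: one option, which is not taken from the answers in this thread, is to write the hyphen checks as lookarounds and use pandas' `str.extractall`. The sketch below assumes the same sample data; the pattern and the name `pairs` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green', 'Kim Jong-il was here',
                               'President Barack Obama', 'methyl-butane',
                               'Derp da-derp derp', 'Pok-e-mon']})

# The lookbehind and lookahead assert, without consuming anything, that no
# word character or hyphen touches the pair, so 'Pok-e-mon' is rejected.
pattern = r"(?<![\w-])(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?![\w-])"

# extractall returns one row per match, indexed by (original row, match number).
pairs = df['Phrases'].str.extractall(pattern)

print(pairs)
```

Unlike `str.extract`, `str.extractall` omits phrases with no match instead of filling them with NaN, and it would also return several pairs from a single phrase if one contained more than one two-part word.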