Python: what regex pattern identifies in-text citations of the form '(Author name, year)'?


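I have converted a list of tokenized sentences into a DataFrame. Now I need to filter the rows (sentences) that contain a citation.

Sample DataFrame: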

   sentences
1  This is my house
2  This is clear water(World Health organisation, 2018).
3  This house was built in 2000 
4  According to me (Sundar, 2015)it is good.

Expected output:

   sentences
1  This is clear water(World Health organisation, 2018).
2  According to me (Sundar, 2015)it is good.

I have been trying the code below with different patterns, such as r'[(]\w+,\d{4}[)]' and r'[(\w+\s+,\d{4}]', without success.
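
A minimal sketch of why the attempted patterns do not behave as intended, using only the standard `re` module. The first pattern allows no whitespace, so it cannot match a multi-word author name or the space after the comma. The second, as written, is parsed as a single character class and therefore matches almost any non-empty sentence. The `relaxed` pattern is an illustrative assumption, not part of the original question.

```python
import re

samples = [
    "This is clear water(World Health organisation, 2018).",
    "According to me (Sundar, 2015)it is good.",
]

# First attempted pattern: \w+ cannot span spaces, and no space is allowed after the comma.
attempt = re.compile(r'[(]\w+,\d{4}[)]')

# Second attempted pattern: everything between [ and ] is one character class,
# so it matches any single word character, space, comma, digit, etc.
char_class = re.compile(r'[(\w+\s+,\d{4}]')

# Illustrative alternative (assumption): allow spaces in the name and after the comma.
relaxed = re.compile(r'\([\w\s]+,\s*\d{4}\)')

attempt_hits = [bool(attempt.search(s)) for s in samples]
relaxed_hits = [bool(relaxed.search(s)) for s in samples]
matches_plain_sentence = bool(char_class.search("This is my house"))

print(attempt_hits)            # [False, False]
print(relaxed_hits)            # [True, True]
print(matches_plain_sentence)  # True
```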

You can try:

print(df[df['sentences'].str.contains(r'\d{4}\)', regex = True)])
Or:

print(df[df['sentences'].str.contains(r'\w.+\(\w.+\d{4}\)', regex = True)])

Both produce the same output:

                                               sentences
2  This is clear water(World Health organisation, 2018).
4              According to me (Sundar, 2015)it is good.
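
Note that r'\d{4}\)' matches any four digits followed by a closing parenthesis, so a sentence such as "(in 1999)" would also be kept. The following is a hedged sketch of a stricter filter, assuming a citation always has the shape "(name, year)" with a comma between a non-numeric name and a four-digit year. The fifth row and the `strict` pattern are hypothetical additions for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sentences": [
            "This is my house",
            "This is clear water(World Health organisation, 2018).",
            "This house was built in 2000 ",
            "According to me (Sundar, 2015)it is good.",
            "The well was dug (in 1999).",  # hypothetical row, not in the original data
        ]
    },
    index=[1, 2, 3, 4, 5],
)

# Pattern from the answer: four digits directly before a closing parenthesis.
loose = r'\d{4}\)'

# Stricter pattern (assumption): "(", a name without digits or parentheses,
# a comma, optional whitespace, a four-digit year, ")".
strict = r'\([^()\d]+,\s*\d{4}\)'

loose_rows = df[df['sentences'].str.contains(loose, regex=True)]
strict_rows = df[df['sentences'].str.contains(strict, regex=True)]

print(list(loose_rows.index))   # [2, 4, 5]
print(list(strict_rows.index))  # [2, 4]
```

The stricter pattern will still miss other citation styles, for example "(Sundar et al. 2015)" without a comma, so it should be adapted to the styles that actually occur in the text.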