Extracting only the words from a mixed string in Python


python regex


I am working on a topic-modeling task, and each topic comes out in the following form:

 topic = 0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"

I would like a regex.findall() call that returns a list containing only the words, for example:

['firstword', 'secondword', 'thirdword', 'fourthword', 'fifthword']

I have tried the following regular expressions:

regex.findall(r'\w+', topic)  and 
regex.findall(r'\D\w+', topic)

But neither of them removes the numbers correctly.

Can anyone help me figure out what I am doing wrong?
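The two attempts above can be checked directly to see where the digits come from. This is a minimal sketch, assuming topic is a plain str with the exact content shown in the question and using the standard re module; the variable names are illustrative only. The last pattern assumes the words consist of ASCII letters.

```python
import re

topic = '0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"'

# \w matches letters, digits and underscore, so the digits of each weight come back too
with_w = re.findall(r'\w+', topic)
print(with_w[:3])   # ['0', '2', 'firstword']

# \D consumes exactly one non-digit character and keeps it in the match,
# and the \w+ that follows is still free to match digits
with_d = re.findall(r'\D\w+', topic)
print(with_d[:3])   # ['.2', '"firstword', ' 0']

# Restricting the character class to letters leaves the weights out
letters_only = re.findall(r'[A-Za-z]+', topic)
print(letters_only)
```

The letters-only class is just one possible fix; it would also drop digits or underscores that are part of a real token.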

If topic is a string such as

topic = '0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"'

then the following regular expression returns what you need:

re.findall('"(.*?)"', topic)

It finds every substring enclosed in double quotes (").

Here is one way to do it:

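>>> import re
>>> # note: this sample has no inner double quotes, unlike the string in the question
>>> topic = "0.2*firstword" + "0.2*secondword" + "0.2*thirdword" + "0.2*fourthword" + "0.2*fifthword"
>>> # replace every digit followed by a non-word character with a space, then split on whitespace
>>> re.sub(r'\d\W', ' ', topic).split()
['firstword', 'secondword', 'thirdword', 'fourthword', 'fifthword']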
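For the quoted form shown in the question, a capture-group pattern can be verified on the exact sample string. This is a minimal sketch, assuming the words never contain a double quote themselves; the helper name quoted_words is hypothetical and not part of any answer.

```python
import re

def quoted_words(topic):
    """Return the text found between each pair of double quotes."""
    # [^"]* stops at the next double quote, so on this input it matches
    # the same spans as the lazy .*? pattern
    return re.findall(r'"([^"]*)"', topic)

topic = '0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"'
print(quoted_words(topic))
```

Because findall returns only the capture group when the pattern has exactly one group, the surrounding quotes are left out of the result.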

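You can try this in two ways.

The first and simpler way is to iterate over the string and keep only the letters:

''.join(letter for letter in topic if letter.isalpha())

Otherwise, you can use a regular expression:

re.sub('[^a-zA-Z]+', '', topic)

This expression keeps only lowercase and uppercase letters. Note that both variants return a single string with the letters run together, not a list of separate words.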
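To get a list rather than one long string from this letters-only idea, the non-letter runs can be replaced with a space and the result split. This is a minimal sketch, assuming the quoted topic string from the question; it is a variation on the answer, not the answer's own code.

```python
import re

topic = '0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"'

# Keeping only the letters glues all the words into a single string
joined = ''.join(ch for ch in topic if ch.isalpha())
print(joined)

# Substituting a space instead of '' preserves the word boundaries,
# so split() can return the words as a list
words = re.sub('[^a-zA-Z]+', ' ', topic).split()
print(words)
```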

I ran into this problem myself. My solution was:

    import re

    def extract_tokens_from_topic(raw_topic):
        # str() turns the list into its text representation
        raw_topic_string = str(raw_topic)
        # capture everything between each pair of single quotes
        return re.findall(r"'(.*?)'", raw_topic_string)

where raw_topic comes from

    raw_topic = lda_model.show_topic(topic_no)
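If show_topic returns a list of (word, weight) tuples, which is an assumption here about the lda_model object, the words can also be read off without going through a string. The sketch below uses a hypothetical hand-written sample in place of a real model so that it runs on its own.

```python
import re

# Hypothetical sample standing in for lda_model.show_topic(topic_no);
# assumed shape: a list of (word, weight) tuples
raw_topic = [('firstword', 0.2), ('secondword', 0.2), ('thirdword', 0.2)]

# Regex route: stringify the list, then capture what sits between single quotes
via_regex = re.findall(r"'(.*?)'", str(raw_topic))

# Direct route: unpack each tuple and keep only the word
via_unpacking = [word for word, _weight in raw_topic]

print(via_regex)
print(via_unpacking)
```

The direct route does not depend on how the list happens to be printed, so it keeps working if a word contains a quote character.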

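What is topic? Is it a str?

Yes, topic is of type 'str'.

@SoumyaChakraborty Can you share the actual value of the topic string? Is it '0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"'?

If I type print(topic), it shows: 0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword". The probabilities are not inside the double quotes, only the constituent words are, but I see your point anyway. Thanks.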