Extracting a substring from a variable-length string using regex (Python)

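I have a dataset of texts, from which I extract all sentences that contain the pattern r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'.

Now I would like to reduce all long sentences (more than 200 words) to more readable ones, e.g. by keeping only 30 words before and after my pattern and replacing the trimmed parts with "...".

Is there a clean way to do this?

Edit: the search is run on preprocessed text (lowercased, with stop words, punctuation and some other hand-picked words removed), and the matching sentences are then stored in their original form. I want to trim the original sentence, with its punctuation and stop words intact.

For example: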

import re

list_of_sentences = []
t1 = "This is a complete sentence, containing colors and other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black, sofa, brown. It will be preprocessed"
# preprocess() is my own function (not shown): it lowercases and strips punctuation and stop words
t2 = preprocess(t1)  # ---> "complete sentence containing colors words pink blue yellow tree chair orange green hello world black sofa brown preprocessed"
my_words_markers = "yellow orange".split()
pattern = r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'
match = re.search(pattern, t2, re.I)
if match:
    list_of_sentences.append(t1)  # store the sentence in its original form
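The `preprocess` function is not shown in the question. To make the snippet above runnable, here is a minimal stand-in, assuming a tiny stop-word list picked only so that the example sentence comes out as stated. `STOP_WORDS` and the body of `preprocess` are hypothetical and not the asker's actual code.

```python
import re

# Hypothetical stop-word list, chosen only to reproduce the example output.
STOP_WORDS = {"this", "is", "a", "and", "other", "it", "will", "be"}

def preprocess(text):
    """Lowercase, drop punctuation and remove stop words (illustrative stand-in)."""
    words = re.findall(r"\w+", text.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)

t1 = ("This is a complete sentence, containing colors and other words: "
      "pink, blue, yellow, tree and chair, orange, green, hello, world, "
      "black, sofa, brown. It will be preprocessed")
t2 = preprocess(t1)

my_words_markers = "yellow orange".split()
pattern = r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'
match = re.search(pattern, t2, re.I)
print(match.group())  # yellow tree chair orange
```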

From this list, I want to trim the longest sentences:

# what I want is a trimmed version of t1, with, e.g., 4 words before and after pattern: 
"... other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black ..."

You can extend the regex so that it also matches up to 30 words before and after the pattern:

pattern = r'(?:\w+\W+){,30}\b' + \
          r' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + \
          r'\b(?:\W+\w+){,30}'
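As a quick check of how the extended pattern behaves, the sketch below applies it to the preprocessed sentence from the question with a window of 4 words instead of 30. The variable `n` is introduced here only for illustration, and the quantifier is written as `{0,n}`, which Python's `re` treats the same as `{,n}`.

```python
import re

my_words_markers = "yellow orange".split()
n = 4  # illustrative window size; the answer uses 30

pattern = (r'(?:\w+\W+){0,%d}\b' % n
           + r' (?:\w+ )?(?:\w+ )?'.join(my_words_markers)
           + r'\b(?:\W+\w+){0,%d}' % n)

# Preprocessed sentence (t2) and original sentence (t1) from the question.
t2 = ("complete sentence containing colors words pink blue yellow tree chair "
      "orange green hello world black sofa brown preprocessed")
t1 = ("This is a complete sentence, containing colors and other words: "
      "pink, blue, yellow, tree and chair, orange, green, hello, world, "
      "black, sofa, brown. It will be preprocessed")

match = re.search(pattern, t2, re.I)
print(match.group())
# colors words pink blue yellow tree chair orange green hello world black

# The separators between the markers are literal spaces, so the same
# pattern does not match the original sentence with its commas.
print(re.search(pattern, t1, re.I))  # None
```

Note that the pattern matches the preprocessed text but not the original sentence, because "yellow" is followed by a comma there.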

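Then loop over all sentences and, if the regex matches, use match.start() and match.end() to check whether an ellipsis ("...") has to be inserted: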

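for sentence in sentences:
    match = re.search(pattern, sentence)
    if match:
        text = '{}{}{}'.format('...' if match.start() > 0 else '',
                               match.group(),
                               '...' if match.end() < len(sentence) else '')
        print(text)

Comments:

Thanks, but it does not work properly. I have edited my question.

@Fed Next time, please try to include all of your requirements in the question... Please give some examples of processed and unprocessed sentences. I don't want to update my answer only to get a "thanks, but it doesn't work on my data" a second time.

@Rawing happy now?

Yes and no. I think this is not possible...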
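The edit to the question asks for the original sentence to be trimmed, while the match is found on preprocessed text. One possible way around this, offered only as a sketch and not as the answerer's method, is to skip the regex on the joined string and work on word tokens that remember their character offsets in the original sentence. The names `STOP_WORDS`, `trim_original`, `n_words` and `max_gap` are hypothetical, and the stop-word list is an assumption standing in for the real preprocessing.

```python
import re

# Hypothetical stop-word list; the real preprocessing is not shown in the question.
STOP_WORDS = {"this", "is", "a", "and", "other", "it", "will", "be"}

def trim_original(sentence, markers, n_words=30, max_gap=2):
    """Return `sentence` cut to n_words original words around the markers, or None."""
    tokens = list(re.finditer(r"\w+", sentence))  # every word, with offsets
    # Indices of the tokens that would survive preprocessing.
    kept = [i for i, m in enumerate(tokens)
            if m.group().lower() not in STOP_WORDS]
    words = [tokens[i].group().lower() for i in kept]
    markers = [w.lower() for w in markers]

    for pos, word in enumerate(words):
        if word != markers[0]:
            continue
        last = pos
        for marker in markers[1:]:
            # The next marker may follow after at most max_gap kept words.
            window = words[last + 1:last + 2 + max_gap]
            if marker not in window:
                break
            last += 1 + window.index(marker)
        else:
            # Map back to the original tokens and widen by n_words on each side.
            first_tok = tokens[max(kept[pos] - n_words, 0)]
            last_tok = tokens[min(kept[last] + n_words, len(tokens) - 1)]
            start, end = first_tok.start(), last_tok.end()
            return "{}{}{}".format("... " if start > 0 else "",
                                   sentence[start:end],
                                   " ..." if end < len(sentence) else "")
    return None

t1 = ("This is a complete sentence, containing colors and other words: "
      "pink, blue, yellow, tree and chair, orange, green, hello, world, "
      "black, sofa, brown. It will be preprocessed")
print(trim_original(t1, ["yellow", "orange"], n_words=4))
# ... other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black ...
```

Here the window is counted in words of the original sentence, so stop words count toward `n_words`, while the gap between markers is counted in preprocessed words, mirroring the two optional `(?:\w+ )?` groups in the original pattern.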