Extracting sentences with a Python regular expression


I have a Markdown file that looks like this:

#2016-12-24
| 单词 | 解释 | 例句 |
| --------- | -------- | --------- |
|**accelerator;**| - | - |
|**compass**| - | - |
|**wheels**| - | - |
|**fabulous**| - | - |
|**sweeping**| - | - |
|**prospect**| - | - |
|**pumpkin**| - | - |
|**trolley**| - | - |
|snapped,**| - | - |
|tip| - | - |
|lap| - | - |
|tether.| - | - |
|damp| - | - |
|triumphant| - | - |
|sarcastic| - | - |
|missed out| - | - |
|sidekick| - | - |
|considerable| - | - |
|Willow.| - | - |
|eagle.| - | - |
|considerably.| - | - |
|flat.| - | - |
|feast| - | - |
|scramble| - | - |
|turned up| - | - |
|rounded off| - | - |
|rat| - | - |
|resembled| - | - |
|By the time she had clambered back into the car,| - | - |
|By the time she had clambered back into the car, they were running very late,| - | - |
|wheeled his trolley| - | - |
|barrier,| - | - |
|bounced| - | - |
|in blazes| - | - |
|clutching| - | - |
|sealed| - | - |
|stunned.| - | - |
|‘We’re stuck,| - | - |
|marched off| - | - |
|accelerator| - | - |
|and the prospect of seeing Fred and George’s jealous faces| - | - |
|protest.| - | - |
|in protest.| - | - |
|horizon,| - | - |
|knuckles| - | - |
|metal| - | - |
|thick| - | - |
|reached the end of its tether.| - | - |
|Artefacts| - | - |
|blurted out.| - | - |
|gaped| - | - |
|I will be writing to both your families tonight.| - | - |
|‘Can you believe our luck, though?’| - | - |
|‘Skip the lecture,’| - | - |
|people’ll be talking about that one for years!’| - | - |
|nudged| - | - |
|‘I know I shouldn’t’ve enjoyed that or anything, but –’| - | - |
|dashed| - | - |

I want to extract the sentences, like these:

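  • By the time she had clambered back into the car,
  • By the time she had clambered back into the car, they were running very late,
  • wheeled his trolley
  • ‘We’re stuck,
  • and the prospect of seeing Fred and George’s jealous faces
  • reached the end of its tether.
  • I will be writing to both your families tonight.
  • ‘Can you believe our luck, though?’
  • ‘Skip the lecture,’
  • people’ll be talking about that one for years!’
  • ‘I know I shouldn’t’ve enjoyed that or anything, but –’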

I tried doing this on the website, but my pattern actually matches every time.

Can anyone help me?

    Try this:

    ^\|[^\w\|]*(\w+\s+(?=\w+)[^\|]*)
    

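  • ^\|
    matches only if the line starts with a pipe (|)
  • [^\w\|]*
    consumes any run of characters that are neither word characters (\w: letters, digits, underscore) nor |
  • \w+\s+
    requires a word followed by one or more whitespace characters
  • (?=\w+)
    then checks that another word follows
  • [^\|]*
    if the previous conditions hold, captures everything up to the next pipe |

    For every match, group 1 contains the sentence you want.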
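
As a rough illustration of how this pattern could be applied from Python, here is a minimal sketch. The sample rows are copied from the question's table. The names `ROW_RE`, `SAMPLE`, and `extract_phrases` are made up for this example, and `re.MULTILINE` is an assumption added here so that `^` matches at the start of every row rather than only at the start of the text.

```python
import re

# Pattern from the answer above; re.MULTILINE lets ^ match at each line start.
ROW_RE = re.compile(r"^\|[^\w\|]*(\w+\s+(?=\w+)[^\|]*)", re.MULTILINE)

# A few rows copied from the question's table.
SAMPLE = """\
|**compass**| - | - |
|tip| - | - |
|tether.| - | - |
|wheeled his trolley| - | - |
|‘Skip the lecture,’| - | - |
|I will be writing to both your families tonight.| - | - |
"""


def extract_phrases(text):
    """Return group 1 of every match, i.e. the multi-word first cells."""
    return [m.group(1) for m in ROW_RE.finditer(text)]


if __name__ == "__main__":
    for phrase in extract_phrases(SAMPLE):
        print(phrase)
```

Two details are worth knowing when reading the output. A leading quotation mark is consumed by `[^\w\|]*` before the group starts, so it is not part of group 1. A row such as `|‘We’re stuck,|` is not matched at all, because its first word is followed by an apostrophe rather than whitespace.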


    You could come up with:

    ^\|                     # start of line, followed by |
    (                       # capture the "words"
        (?:[‘\w]+           # a non-capturing group and at least one of \w or ‘
            (?:[^|\w\n\r]+  # followed by NOT one of these
            |               # or
            (?=\|)          # make sure there's a | straight ahead
            )
        ){2,}               # repeat the construct at least 2 times
    )
    \|
    
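    See the demo (mind the modifiers! The pattern needs verbose mode for its comments and whitespace, and multiline mode so that ^ matches at the start of every line).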

    This captures at least two consecutive words; if you need more, put a different number in the {} braces.
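
For readers who want to try the commented pattern from Python, the following is a minimal sketch rather than the answerer's own code. It assumes the two modifiers correspond to `re.VERBOSE` and `re.MULTILINE`. The helper names `build_pattern` and `extract_sentences` and the `min_words` parameter are hypothetical, introduced only to show how the number in the braces can be varied. The sample rows are copied from the question's table.

```python
import re


def build_pattern(min_words=2):
    """Compile the commented pattern; min_words (hypothetical) fills the {n,} bound."""
    return re.compile(r"""
        ^\|                     # start of line, followed by |
        (                       # capture the "words"
            (?:[‘\w]+           # at least one of \w or ‘
                (?:[^|\w\n\r]+  # followed by separators: not |, \w, or a line break
                |               # or
                (?=\|)          # a | straight ahead
                )
            ){%d,}              # repeat the construct at least min_words times
        )
        \|                      # closing | of the first cell
        """ % min_words, re.VERBOSE | re.MULTILINE)


def extract_sentences(text, min_words=2):
    """Return the captured first cell of every matching row."""
    return [m.group(1) for m in build_pattern(min_words).finditer(text)]


# A few rows copied from the question's table.
SAMPLE = """\
|**compass**| - | - |
|tip| - | - |
|tether.| - | - |
|in blazes| - | - |
|‘We’re stuck,| - | - |
|wheeled his trolley| - | - |
|‘Skip the lecture,’| - | - |
"""

if __name__ == "__main__":
    print(extract_sentences(SAMPLE))                # at least two "words"
    print(extract_sentences(SAMPLE, min_words=3))   # at least three "words"
```

With the default of two, short entries such as `in blazes` are returned as well; raising the bound to three drops them. An apostrophe counts as a separator here, so `‘We’re stuck,` is counted as three "words" and still passes the stricter bound.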

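    What is the criterion for deciding what to extract? A minimum number of words? Maybe you need …? Please re-check the requirements.

    @Langitar There is no particular criterion. The Markdown table contains both single words and sentences; I only want to extract the sentences and ignore the single words.

    But why is "in blazes" not matched? It is more than one word.

    @MarcoMei Please update your question with your comment.

    Thanks a lot, Maverick, it works. One more question: based on your answer, can I update it to: (?

    You get the sentence in the first group; see the explanation in the link provided… Try to avoid lookbehind… I can show you how to get the sentence from the first group… let me know.

    @Marco Mei I have added a link to a running example.

    I guess you can only accept one answer :)

    Thanks a lot, that is a great site for coding, I have saved it. I wish I could give you more feedback. :)

    I had been thinking about how to match consecutive words, thanks a lot. In the end I updated it to: (?)?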