Python: grabbing the words before and after a matched term

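I'm taking my first steps with Python, and I have a problem to solve that calls for a regular expression.

I'm parsing several lines of text, and I need to grab the 5 words before and after a certain match. The term to match is always the same, and a line can contain several occurrences of it.

r"(?i)((?:\S+\s+){0,5})<tag>(\w*)</tag>\s*((?:\S+\s+){0,5})"
does not handle this well.

Is there a way to account for these issues and make it work?

Thanks in advance, everyone.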

Here is how you might do it without regular expressions:

#!/usr/bin/env python3



def find_words(s, count, needle):

  # split the string into a list of words
  lst = s.split()

  # index of the first occurrence of the needle (raises ValueError if absent)
  idx = lst.index(needle)

  # start and end of the slice around the needle
  start = idx - count
  end = idx + count

  # print the surrounding words as a list slice
  print(lst[start:end+1])


def find_occurrences_in_list(s, count, needle):
  # split the string into a list of words
  lst = s.split()

  # indexes of every occurrence of the needle
  idxList = [i for i, x in enumerate(lst) if x == needle]

  r = []
  for n in idxList:
    # note: start is not clamped, so a negative value wraps around
    # to the end of the list
    start = n - count
    end = n + count
    # append the surrounding words, joined back into a string
    r.append(" ".join(lst[start:end+1]))

  print(r)

# the string of words
mystring1 = "zero one two three four five match six seven eight nine ten eleven"
# find every "match" with 5 words ahead of and behind it
find_occurrences_in_list(mystring1, 5, "match")

# find every "nation" with 3 words ahead of and behind it
mystring2 = "Four score and seven years ago our fathers brought forth on this continent a new nation conceived in Liberty and dedicated to the proposition"
find_occurrences_in_list(mystring2, 3, "nation")

mystring3 = "zero one two three four five match six seven match eight nine ten eleven"
find_occurrences_in_list(mystring3, 2, "match")


['one two three four five match six seven eight nine ten']
['continent a new nation conceived in Liberty']
['four five match six seven', 'six seven match eight nine']

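Could you provide a few example test strings?

Is regex your only option? Python can split the string into a list, which you can then iterate over.

@Chih Xsujacklin For example: "Hello, my name is Steve. I also have a friend named Steve, but I am the best Steve. Steve rules." If I want to match "Steve" and grab at least 5 words before and after each match, it doesn't work.

@MrMysteryGuest I really don't know. I thought regex would be the simplest way. What exactly do you have in mind?

This is a good approach, but it would be interesting to use a regex to split the string on word boundaries or non-word characters, which would make this script useful in real life. (I suspect the ultimate goal is to extract excerpts around a particular word.)

This is an elegant solution without regex! However, it still leaves me with the same problem. It doesn't work for multiple matches in the same string when the capture range you choose is larger than the number of words available... Example: mystring1 = "0 1 2 3 4 5 match 6 7 match": find_words(mystring1, 8, "match")... Any solution for that?

@CasimiRethiPulePulter Yes, the goal is to extract a predetermined range of words (no more than 5 or 6). Can my regex be used with boundaries?

@MrMysteryGuest Thank you, I'll work from your solution! Just one thing is missing, though. Is there a way to guarantee that, when there aren't enough words "before", it captures as many as exist? Right now, if you use 7 on mystring3 it returns 'eleven', and 8 returns 'ten eleven'. Asking for too many words "after" works fine; it simply stops at the end of the list.

@DJM I did think of that but didn't include it. I'm sure you can work it out ;)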
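The two open points from the comments (too few words "before" the match, and punctuation attached to words) could be handled roughly as follows. This is only a sketch, not the answerer's code: the function name context_windows is made up for illustration, it assumes that words are runs of \w characters (so "don't" would split into two tokens), and it assumes matching should be case-insensitive, as the (?i) flag in the question suggests.

```python
import re


def context_windows(text, needle, count):
    """Return up to `count` words on each side of every occurrence of `needle`."""
    # Tokenize on runs of word characters so "Steve." and "Steve," both
    # compare equal to "Steve" (assumption: punctuation is not wanted).
    words = re.findall(r"\w+", text)
    target = needle.lower()

    windows = []
    for i, word in enumerate(words):
        if word.lower() != target:
            continue
        # Clamp the start at 0; a negative slice start would wrap around
        # to the end of the list instead of stopping at the beginning.
        start = max(0, i - count)
        # A slice end past the end of the list is harmless in Python.
        windows.append(" ".join(words[start:i + count + 1]))
    return windows


mystring3 = "zero one two three four five match six seven match eight nine ten eleven"
for window in context_windows(mystring3, "match", 7):
    print(window)
```

Because each window is sliced from the full word list, windows for neighbouring matches may overlap, which a single non-overlapping regex pass cannot provide.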