How to extract the text around a word in Excel or Python?


I have a text of several thousand lines that looks like this:

ksjd 234first special 34-37xy kjsbn
sde 89second special 22-23xh ewio
647red special 55fg dsk
uuire another special 98
another special 107r
green special 55-59 ewk
blue special 31-39jkl
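I need to extract the one word before "special", together with "special" itself and the number (or number range) to its right. In other words, I want:

converted to a table: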


A quick way is to use a regular expression:

In [1]: import re

In [2]: text = '''234first special 34-37xy                          
   ...: 89second special 22-23xh
   ...: 647red special 55fg
   ...: another special 98
   ...: another special 107r
   ...: green special 55-59
   ...: blue special 31-39jkl'''

In [3]: [re.findall(r'\d*\s*(\S+)\s+(special)\s+(\d+(?:-\d+)?)', line)[0] for line in text.splitlines()]
Out[3]: 
[('first', 'special', '34-37'),
 ('second', 'special', '22-23'),
 ('red', 'special', '55'),
 ('another', 'special', '98'),
 ('another', 'special', '107'),
 ('green', 'special', '55-59'),
 ('blue', 'special', '31-39')]
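
The session above feeds in lines whose leading tokens have already been trimmed. With the untrimmed lines from the question, such as "ksjd 234first special 34-37xy kjsbn", the \S+ group can also pick up the digits glued to the word (giving "234first"). Below is a minimal sketch of one way around that, assuming the word before "special" consists of ASCII letters only. The names PATTERN, SAMPLE, extract_rows, and to_csv are made up for this illustration and are not part of the original answer.

```python
import csv
import io
import re

# Letters-only word, the literal keyword, then a number or number range.
# Assumption: the word before "special" is made of ASCII letters.
PATTERN = re.compile(r"([A-Za-z]+)\s+(special)\s+(\d+(?:-\d+)?)")

# The sample lines from the question, including the surrounding noise.
SAMPLE = """ksjd 234first special 34-37xy kjsbn
sde 89second special 22-23xh ewio
647red special 55fg dsk
uuire another special 98
another special 107r
green special 55-59 ewk
blue special 31-39jkl"""


def extract_rows(text):
    """Return one (word, keyword, number) tuple per matching line."""
    rows = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:  # lines without a match are skipped
            rows.append(match.groups())
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "keyword", "number"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_rows(SAMPLE)))
```

Using search instead of findall(...)[0] also means a line without a match is skipped rather than raising an IndexError.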

In addition to what @RolandSmith wrote, here is a way to use regular expressions in Excel VBA:



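The Index argument of this UDF selects which submatch (1st, 2nd, or 3rd) to return from the match collection, so you can easily split the original string into the three components you need.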
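
For readers working in Python, the role of Index can be sketched as follows. This is an illustration only, not part of the original answer: it reuses the pattern from the VBA code in this answer, and the names _SPECIAL and extract_special are hypothetical.

```python
import re

# Same pattern as the VBA constant sPat, compiled case-insensitively.
_SPECIAL = re.compile(r"([a-z]+)\s+(special)\s+([^a-z]+)", re.IGNORECASE)


def extract_special(s, index):
    """Return submatch `index` (1, 2 or 3) of the first match, or "" if none."""
    match = _SPECIAL.search(s)
    if match is None:
        return ""
    return match.group(index)


if __name__ == "__main__":
    line = "ksjd 234first special 34-37xy kjsbn"
    # One call per component, like three UDF calls with Index = 1, 2, 3.
    print([extract_special(line, i) for i in (1, 2, 3)])
```

Because the third group is "anything that is not a letter", it can carry trailing whitespace (for example on "green special 55-59 ewk"), so calling .strip() on the result may be worthwhile.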

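Since you wrote that you have "thousands of lines", you may prefer to run a macro. A macro processes the data faster, but it is not dynamic. The macro below assumes your raw data is in column A of Sheet2 and puts the results in columns C:E of the same worksheet. You can easily change these parameters: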




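In Excel, you can use a formula to extract the text between two words, as follows:

  • Select a blank cell, type the formula =MID(A1,SEARCH("KTE",A1)+3,SEARCH("feature",A1)-SEARCH("KTE",A1)-4) into it, and press Enter.

  • Drag the fill handle over the range you want to apply this formula to. Only the text string between "KTE" and "feature" is now extracted.

  • Notes:

  • In this formula, A1 is the cell to extract text from.

  • KTE and feature are the words to extract the text between.

  • The number 3 is the character length of KTE, and the number 4 is the character length of KTE plus 1.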
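
The same between-two-words idea can be sketched in Python. This is only an illustration, not part of the original answer: the function name text_between is made up, matching is case-insensitive to mimic SEARCH, and, unlike the formula, the result is trimmed of surrounding whitespace.

```python
import re


def text_between(s, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape makes the two words match literally, not as regex syntax.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, s, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip()


if __name__ == "__main__":
    print(text_between("abc KTE hello world feature xyz", "KTE", "feature"))
```

The non-greedy (.*?) stops at the first occurrence of the end word, which corresponds to SEARCH returning the first position it finds.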


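  • This is perfect. My only problem is that it automatically converts some of the numbers to dates. I tried the suggested approaches [e.g. setting the column to text], but the problem persists.

  • @KitKat Try wrapping MC(0)... in the CStr function, or prepend a single quote.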
    Option Explicit
    
    ' UDF: returns submatch number Index of the first match in S
    ' (1 = word before "special", 2 = "special", 3 = number or number range).
    Function ExtractSpecial(S As String, Index As Long) As String
        Dim RE As Object, MC As Object
        Const sPat As String = "([a-z]+)\s+(special)\s+([^a-z]+)"
    
        Set RE = CreateObject("vbscript.regexp")
        With RE
            .Global = True
            .IgnoreCase = True
            .MultiLine = False
            .Pattern = sPat
            If .Test(S) = True Then
                Set MC = .Execute(S)
                ' SubMatches is zero-based, Index is one-based
                ExtractSpecial = MC(0).SubMatches(Index - 1)
            End If
        End With
    
    End Function
    
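    ' Macro: reads column A of Sheet2 and writes the three parts to columns C:E.
    Sub ExtractSpec()
        Dim RE As Object, MC As Object
        Dim wsSrc As Worksheet, wsRes As Worksheet, rRes As Range
        Dim vSrc As Variant, vRes As Variant
        Dim I As Long
    
        ' Source sheet, result sheet, and top-left result cell; change as needed
        Set wsSrc = Worksheets("sheet2")
        Set wsRes = Worksheets("sheet2")
        Set rRes = wsRes.Cells(1, 3)
    
        ' Read all of column A into an array in one step
        With wsSrc
            vSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
        End With
    
        Set RE = CreateObject("vbscript.regexp")
        With RE
            .Global = True
            .MultiLine = False
            .IgnoreCase = True
            .Pattern = "([a-z]+)\s+(special)\s+([^a-z]+)"
    
            ' One result row per source row; rows without a match stay empty
            ReDim vRes(1 To UBound(vSrc), 1 To 3)
            For I = 1 To UBound(vSrc)
                If .Test(vSrc(I, 1)) = True Then
                    Set MC = .Execute(vSrc(I, 1))
                    vRes(I, 1) = MC(0).SubMatches(0)
                    vRes(I, 2) = MC(0).SubMatches(1)
                    vRes(I, 3) = MC(0).SubMatches(2)
                End If
            Next I
        End With
    
        ' Write the whole result array back in one step
        Set rRes = rRes.Resize(UBound(vRes, 1), UBound(vRes, 2))
        With rRes
            .EntireColumn.Clear
            ' Added: Clear also resets cell formats, so set Text format here to
            ' keep values such as 1-2 from being auto-converted to dates
            .NumberFormat = "@"
            .Value = vRes
            .EntireColumn.AutoFit
        End With
    
    End Sub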