How do I extract the text around a word in Excel or Python?
I have a text of several thousand lines, like this:
ksjd 234first special 34-37xy kjsbn
sde 89second special 22-23xh ewio
647red special 55fg dsk
uuire another special 98
another special 107r
green special 55-59 ewk
blue special 31-39jkl
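I need to extract the one word that comes before "special" and the number (or number range) to its right. In other words, I want the lines above converted into a table.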
A quick way is to use a regular expression:
In [1]: import re
In [2]: text = '''234first special 34-37xy
...: 89second special 22-23xh
...: 647red special 55fg
...: another special 98
...: another special 107r
...: green special 55-59
...: blue special 31-39jkl'''
In [3]: [re.findall(r'\d*\s*(\S+)\s+(special)\s+(\d+(?:-\d+)?)', line)[0] for line in text.splitlines()]
Out[3]:
[('first', 'special', '34-37'),
('second', 'special', '22-23'),
('red', 'special', '55'),
('another', 'special', '98'),
('another', 'special', '107'),
('green', 'special', '55-59'),
('blue', 'special', '31-39')]
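The session above runs on lines that were already trimmed down to the relevant part. For the full lines from the question, which have extra words on either side, a slightly stricter pattern may be easier to work with. The following is only a sketch under two assumptions: the word before "special" consists of ASCII letters only, and the number is a plain integer or an integer range such as 34-37. The names `PATTERN` and `extract_special` are made up for this example.

```python
import re

# Assumption: the wanted word is a run of ASCII letters directly before
# "special", and the value after it is an integer or an integer range.
PATTERN = re.compile(r'([A-Za-z]+)\s+(special)\s+(\d+(?:-\d+)?)')


def extract_special(text):
    """Return one (word, 'special', number) tuple per matching line."""
    rows = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:  # lines without the keyword are skipped
            rows.append(match.groups())
    return rows


sample = """ksjd 234first special 34-37xy kjsbn
sde 89second special 22-23xh ewio
647red special 55fg dsk
uuire another special 98
another special 107r
green special 55-59 ewk
blue special 31-39jkl"""

# Print the result as tab-separated columns, which paste cleanly into Excel.
for word, keyword, number in extract_special(sample):
    print(word, keyword, number, sep='\t')
```

Using `search` instead of `findall(...)[0]` means a line that does not contain the keyword is skipped rather than raising an `IndexError`.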
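In addition to what @RolandSmith wrote, here is a way to use regular expressions in Excel VBA.
The Index argument of this UDF selects the 1st, 2nd, or 3rd submatch of the first match, so you can easily split the original string into the three components you want:
Option Explicit
Function ExtractSpecial(S As String, Index As Long) As String
    Dim RE As Object, MC As Object
    ' word, the literal "special", then everything up to the next letter
    Const sPat As String = "([a-z]+)\s+(special)\s+([^a-z]+)"
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .IgnoreCase = True
        .MultiLine = False
        .Pattern = sPat
        If .Test(S) = True Then
            Set MC = .Execute(S)
            ' SubMatches is zero-based, Index is one-based
            ExtractSpecial = MC(0).SubMatches(Index - 1)
        End If
    End With
End Function
Since you wrote that you have "thousands of lines", you may prefer to run a macro. A macro processes the data faster, but it is not dynamic. The macro below assumes that your original data is in column A of Sheet2 and puts the results in columns C:E of the same worksheet. You can easily change these parameters:
Sub ExtractSpec()
    Dim RE As Object, MC As Object
    Dim wsSrc As Worksheet, wsRes As Worksheet, rRes As Range
    Dim vSrc As Variant, vRes As Variant
    Dim I As Long
    ' Source and result sheets, and the top-left cell of the output
    Set wsSrc = Worksheets("sheet2")
    Set wsRes = Worksheets("sheet2")
    Set rRes = wsRes.Cells(1, 3)
    ' Read column A into an array in one step
    With wsSrc
        vSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
    End With
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = "([a-z]+)\s+(special)\s+([^a-z]+)"
        ReDim vRes(1 To UBound(vSrc), 1 To 3)
        For I = 1 To UBound(vSrc)
            If .Test(vSrc(I, 1)) = True Then
                Set MC = .Execute(vSrc(I, 1))
                vRes(I, 1) = MC(0).SubMatches(0)
                vRes(I, 2) = MC(0).SubMatches(1)
                vRes(I, 3) = MC(0).SubMatches(2)
            End If
        Next I
    End With
    ' Write all results back to the sheet in one step
    Set rRes = rRes.Resize(UBound(vRes, 1), UBound(vRes, 2))
    With rRes
        .EntireColumn.Clear
        .Value = vRes
        .EntireColumn.AutoFit
    End With
End Sub
This is perfect. My only problem is that it automatically converts some of the numbers into dates. I tried the suggested fixes (for example, formatting the column as text), but the problem persists.
@KitKat Try wrapping MC(0)... in the CStr function, or prepend a single quote.
In Excel, you can also use a formula to extract the text between two words.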