Python: for each word in a string column of a pandas DataFrame, find the 5 surrounding words before and after it and insert them as new columns in a new DataFrame


python pandas


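I am doing text analysis and have a problem that I need a solution for.

I am trying to find the surrounding words (5 or more) for each word in a string column of a pandas DataFrame. The dummy DataFrame is shown in the screenshot; it has an ID column and a Text column. I want to create a new DataFrame with four columns (ID, Before, Word, After), as shown in the second attached screenshot (result DataFrame).

For example:

[screenshot: dummy DataFrame]

[screenshot: result DataFrame]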

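Initially I thought about using df.Text.extractall(...) with 3 capturing groups (Before, Word and After). The drawback is that matches cannot overlap: for example, the content consumed by the After group of one match may be exactly what the next match needs as its Word, or at least as part of its Before group.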
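The drawback can be reproduced with the plain re module, which follows the same non-overlapping matching rule as extractall. This is only a sketch: the pattern, the shortened sample sentence and the variable names are assumptions made for illustration, not the pattern the answer originally had in mind.

```python
import re

text = 'The most known of its products is the greatest'

# Hypothetical 3-group pattern: up to 3 words before, the wanted word,
# and up to 3 words after it.
pattern = re.compile(
    r'((?:\w+\W+){0,3})\b(most|products)\b((?:\W+\w+){0,3})'
)

matches = [
    (m.group(1).strip(), m.group(2), m.group(3).strip())
    for m in pattern.finditer(text)
]
for before, word, after in matches:
    print(repr(before), repr(word), repr(after))
```

The After group of the first match consumes "known of its", so the match for "products" starts with an empty Before group instead of "known of its".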

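So I decided to do it another way:

Apply a function to each row, returning a "partial" result for this row.
Collect the results in a list of DataFrames.
Concatenate them.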

Setup

The source DataFrame:

   ID   Text
0  ID1  The Company sells its products worldwide through its wide network of
1  ID2  Provides one of most often used search engines for HTTP sites
2  ID3  The most known of its products is the greatest airliner of the world
3  ID4  Xyz nothing

Note that I added a "no match" row (ID4).

The words to match:

words = ['products', 'most', 'for']

The number of words before / after:

wNo = 3

In your code, change it to whatever number you need.

The solution

The function that finds matches in the current row:

import re
import pandas as pd

def find(row, wanted, wNo):
    wList = re.split(r'\W+', row.Text)      # "original" words, returned in the result
    wListLC = [w.lower() for w in wList]    # lower-cased words, used for matching
    res = []
    for wd in wanted:  # Check each "wanted" word
        for indW in [ i for i, x in enumerate(wListLC) if x == wd ]:
            # For each index of "wd" in "wList"
            wdBef = ''
            if indW > 0:
                # Clamp the start index at 0 so the slice never wraps around
                indBefBeg = indW - wNo if indW >= wNo else 0
                wdBef = ' '.join(wList[indBefBeg : indW])
            indAftBeg = indW + 1
            indAftEnd = indAftBeg + wNo
            wdAft = ' '.join(wList[indAftBeg : indAftEnd])
            res.append([row.ID, wdBef, wd, wdAft])
    return pd.DataFrame(res, columns=['ID', 'Before', 'Word', 'After'])

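The parameters are:

row - the source row
wanted - the list of "wanted" words (in lower case)
wNo - the number of words before and after the wanted word

For each match found, the result contains one row with:

ID - from the current row
Before, Word, After - the respective parts of the current match

Of course, the actual number of words in the Before / After groups can be smaller if the current row does not contain enough words on that side.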
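The clamping of the Before / After groups relies on how Python list slicing behaves. The sketch below uses a made-up word list and hypothetical variable names to show why the end of the slice needs no guard while the start does.

```python
w_list = ['The', 'most', 'known', 'of', 'its']
w_no = 3

# "most" is at index 1, so only one word precedes it.
ind = 1

# A slice end beyond the list length is silently clamped.
after = w_list[ind + 1 : ind + 1 + w_no]

# A negative slice start is counted from the END of the list,
# so the start index has to be clamped at 0 explicitly.
naive_before = w_list[ind - w_no : ind]        # w_list[-2:1]
before = w_list[max(ind - w_no, 0) : ind]      # w_list[0:1]

print(after, naive_before, before)
```

Here the unguarded slice returns an empty list, while the clamped one returns the single available word.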

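Note that this function splits the source row into two lists:

wList - the "original" words, returned later in the result
wListLC - the words converted to lower case, used for matching (remember that the "wanted" list should also be in lower case)

The result is a "partial" DataFrame (for this row; empty if there is no match), which is later concatenated with the other partial results.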
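One detail of splitting on \W+ is worth knowing: when the text starts or ends with a non-word character, re.split returns empty strings at the edges of the list, and these would count as "words" in the Before / After groups. The sample sentence below is made up; re.findall(r'\w+', ...) is shown as one possible alternative, not as part of the original answer.

```python
import re

text = '"Products" are sold worldwide.'

split_words = re.split(r'\W+', text)
found_words = re.findall(r'\w+', text)

print(split_words)
print(found_words)
```

The sample rows in this answer neither start nor end with punctuation, so both approaches give the same word list there.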

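Now, how to use this function. To collect the partial results as a list of DataFrames, run:

tbl = df.apply(find, axis=1, wanted=words, wNo=wNo).tolist()

To generate the final result, run:

pd.concat(tbl, ignore_index=True)

For my source data, the result is: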

    ID               Before      Word                  After
0  ID1    Company sells its  products  worldwide through its
1  ID2      Provides one of      most      often used search
2  ID2  used search engines       for             HTTP sites
3  ID3         known of its  products        is the greatest
4  ID3                  The      most           known of its

Note that the Before / After groups can be empty strings, but only if the word is the first or the last one in the current row.
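One commenter reports a ValueError when df.apply is asked to return DataFrame objects. A variant that avoids df.apply altogether may help in such environments. This is a sketch under assumptions: the helper name context_rows is hypothetical, it uses re.findall instead of re.split, and it builds plain lists first and a single DataFrame at the end.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID1', 'ID2', 'ID3', 'ID4'],
    'Text': [
        'The Company sells its products worldwide through its wide network of',
        'Provides one of most often used search engines for HTTP sites',
        'The most known of its products is the greatest airliner of the world',
        'Xyz nothing',
    ],
})
words = ['products', 'most', 'for']
wNo = 3

def context_rows(row_id, text, wanted, w_no):
    """Yield [ID, Before, Word, After] for each occurrence of a wanted word."""
    w_list = re.findall(r'\w+', text)
    w_list_lc = [w.lower() for w in w_list]
    for wd in wanted:
        for ind in (i for i, x in enumerate(w_list_lc) if x == wd):
            before = ' '.join(w_list[max(ind - w_no, 0):ind])
            after = ' '.join(w_list[ind + 1:ind + 1 + w_no])
            yield [row_id, before, wd, after]

# Iterate over the two columns directly instead of calling df.apply.
rows = [
    r
    for row_id, text in zip(df['ID'], df['Text'])
    for r in context_rows(row_id, text, words, wNo)
]
result = pd.DataFrame(rows, columns=['ID', 'Before', 'Word', 'After'])
print(result)
```

For the sample data this produces the same five rows as the table above, in the same order.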

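How to speed up this solution

Some speed-up can be achieved with the following steps:

Compile the regex in advance:
```
pat = re.compile(r'\W+')
```
and use it in the function to split the text into words.
Drop the additional parameters and use global variables instead.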

So the function can be:

def find2(row):
    wList = pat.split(row.Text)  # pat, words and wNo are global variables here
    wListLC = list(map(lambda x: x.lower(), wList))
    res = []
    for wd in words:  # Check each "wanted" word
        for indW in [ i for i, x in enumerate(wListLC) if x == wd ]:
            # For each index of "wd" in "wList"
            wdBef = ''
            if indW > 0:
                indBefBeg = indW - wNo if indW >= wNo else 0
                wdBef = ' '.join(wList[indBefBeg : indW])
            indAftBeg = indW + 1
            indAftEnd = indAftBeg + wNo
            wdAft = ' '.join(wList[indAftBeg : indAftEnd])
            res.append([row.ID, wdBef, wd, wdAft])
    return pd.DataFrame(res, columns=['ID', 'Before', 'Word', 'After'])

To call it, run:

tbl = df.apply(find2, axis=1).tolist()
pd.concat(tbl, ignore_index=True)

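I measured both versions with %timeit (on the test data): the average execution time dropped from 46 ms to 39 ms (16% shorter).

For larger datasets the difference should be more significant.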
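Outside IPython, where %timeit is not available, the standard timeit module can be used for a similar comparison. The sketch below isolates only the regex part (string pattern versus precompiled pattern); the function names and the repetition count are arbitrary choices, and the timings depend on the machine. Since the re module caches compiled patterns internally, the gain from precompiling is likely to be modest.

```python
import re
import timeit

text = 'The most known of its products is the greatest airliner of the world'
pat = re.compile(r'\W+')

def split_uncompiled():
    # The pattern string is looked up in the re module's cache on every call.
    return re.split(r'\W+', text)

def split_compiled():
    # The precompiled pattern object is used directly.
    return pat.split(text)

n = 20_000
t_plain = timeit.timeit(split_uncompiled, number=n)
t_comp = timeit.timeit(split_compiled, number=n)
print(f'string pattern: {t_plain:.4f} s, compiled pattern: {t_comp:.4f} s')
```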

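Comments:

Please provide the details. Your question should be self-contained; we should not have to follow links to understand your problem.

Please help, I have edited my question.

Thanks for your answer. I tried it, but it shows a string of 3 words before and after one matched word; it does not show the words before and after every word in the list. Please confirm how to run str.extractall separately for each word in the list and then combine the results as you suggested above.

Thanks again @Valdi_bo for the reply. When executing the code res = [find(row, words, wNo) for _, row in df.iterrows()] I get an error: Series object has no attribute df. Please help.

You took this line from an earlier (yesterday's) version. Now the find function is called as tbl = df.apply(find, axis=1, wanted=words, wNo=wNo).tolist(), followed by pd.concat(...).

Thanks for your patience. I tried the code with the DataFrame you created, and when executing tbl = df.apply(find, axis=1, wanted=words, wNo=wNo).tolist() I get the error ValueError: cannot copy sequence with size 4 to array axis with dimension 1. Please help to resolve this error.

I got such an error in Python 2.7, so I suppose you are using some "outdated" version. I use Python 3.7.0 and Pandas 0.24.0.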