Python: comparing multiple strings in a DataFrame column

python string python-3.x pandas

I have the following DataFrame in Python 3.x, with several numeric columns and two columns of strings:

import numpy as np
import pandas as pd

dict1 = {"numericvals": np.repeat(25, 8),
    "numeric": np.repeat(42, 8),
    "first": ["beneficiary, duke", "compose", "herd primary", "stall", "deep", "regular summary classify", "timber", "property"],
    "second": ["abcde", "abcde", "abcde", "abcde", "abcde", "abcde", "abcde", "abcde"]}

df = pd.DataFrame(dict1)

df = df[['numeric', 'numericvals', 'first', 'second']]

print(df)
   numeric  numericvals                     first second
0       42           25         beneficiary, duke  abcde
1       42           25                   compose  abcde
2       42           25              herd primary  abcde
3       42           25                     stall  abcde
4       42           25                      deep  abcde
5       42           25  regular summary classify  abcde
6       42           25                    timber  abcde
7       42           25                  property  abcde

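The column first contains one or more strings. If there is more than one, they are separated by a space or a comma.

My goal is to create columns that record the lengths of the strings in first that are longer or shorter than the string in second. Strings of the same length should be ignored.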
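The separator rule above can be sketched with a small helper. The name split_tokens is hypothetical and not part of the original question; the sketch assumes that any run of commas and/or whitespace separates two strings.

```python
import re

def split_tokens(cell):
    """Split a cell on runs of commas and/or whitespace, dropping empty pieces."""
    return [tok for tok in re.split(r"[,\s]+", cell) if tok]

print(split_tokens("beneficiary, duke"))         # ['beneficiary', 'duke']
print(split_tokens("regular summary classify"))  # ['regular', 'summary', 'classify']
```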

My idea is to create two lists:

longer = []
shorter = []

If a string in first is longer than the string in second, append its length (via len()) to longer. If it is shorter, record its length (via len()) in shorter.
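For a single pair of values, the two-list idea might look like the sketch below. The function name classify_lengths is an assumption introduced for illustration; it handles one row only and leaves both lists empty when nothing qualifies.

```python
def classify_lengths(first, second):
    """Return (longer, shorter) lists of string lengths relative to len(second)."""
    longer, shorter = [], []
    target = len(second)
    # Treat commas like spaces, then split on whitespace
    for token in first.replace(",", " ").split():
        if len(token) > target:
            longer.append(len(token))
        elif len(token) < target:
            shorter.append(len(token))
        # strings of equal length are ignored
    return longer, shorter

print(classify_lengths("herd primary", "abcde"))  # ([7], [4])
print(classify_lengths("stall", "abcde"))         # ([], [])
```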

The result of the analysis should be in DataFrame format.

I don't know how to handle multiple strings in first, especially when there are three of them. How should this comparison be done in pandas?

You can use pandas.DataFrame.apply:

import operator

def transform(row, op):
    # Lengths of every string in first; commas are treated like spaces
    lengths = [len(s) for s in row['first'].replace(',', ' ').split()]
    # Keep the lengths that satisfy op against len(second); [0] if none do
    return [n for n in lengths if op(n, len(row['second']))] or [0]

df['longer']  = df.apply(transform, axis=1, args=(operator.gt,))
df['shorter'] = df.apply(transform, axis=1, args=(operator.lt,))

This works for any number of strings, assuming any whitespace or comma marks the start of a new string.

Here is the output:

   numeric  numericvals                     first second     longer shorter
0       42           25         beneficiary, duke  abcde       [11]     [4]
1       42           25                   compose  abcde        [7]     [0]
2       42           25              herd primary  abcde        [7]     [4]
3       42           25                     stall  abcde        [0]     [0]
4       42           25                      deep  abcde        [0]     [4]
5       42           25  regular summary classify  abcde  [7, 7, 8]     [0]
6       42           25                    timber  abcde        [6]     [0]
7       42           25                  property  abcde        [8]     [0]

I did my best. Hope this helps.
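As a possible variation on the apply approach, the sketch below fills both columns in a single pass over the rows with zip, and leaves an empty list instead of [0] where nothing qualifies. The helper name length_split and the three-row sample frame are assumptions for illustration; this is not the code from the answer above.

```python
import pandas as pd

def length_split(first, second):
    """Return (longer, shorter) string lengths of `first` relative to len(second)."""
    lengths = [len(tok) for tok in first.replace(",", " ").split()]
    target = len(second)
    longer = [n for n in lengths if n > target]
    shorter = [n for n in lengths if n < target]
    return longer, shorter

# Small sample frame (assumed data, taken from three rows of the question)
df = pd.DataFrame({
    "first": ["beneficiary, duke", "stall", "regular summary classify"],
    "second": ["abcde", "abcde", "abcde"],
})

# One pass over both columns; each element is a (longer, shorter) pair
pairs = [length_split(f, s) for f, s in zip(df["first"], df["second"])]
df["longer"] = pd.Series([p[0] for p in pairs], index=df.index)
df["shorter"] = pd.Series([p[1] for p in pairs], index=df.index)
print(df)
```

Keeping empty lists avoids confusing the [0] placeholder with a real length, at the cost of differing from the output format shown in the answer.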