Python: Find the length of the longest string in a DataFrame column (Python, Pandas)


Is there a faster way to find the length of the longest string in a Pandas DataFrame than the example below?

import numpy as np
import pandas as pd

x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 10**7)  # the repeat count must be an integer; recent NumPy rejects the float 1e7
df = pd.DataFrame(x, columns=['col1'])

print(df.col1.map(lambda x: len(x)).max())
# result --> 6

When timed with IPython's %timeit, running df.col1.map(lambda x: len(x)).max() takes about 10 seconds.

DSM's suggestion seems to be about the best you will get without manual micro-optimization:
%timeit -n 100 df.col1.str.len().max()
100 loops, best of 3: 11.7 ms per loop

%timeit -n 100 df.col1.map(lambda x: len(x)).max()
100 loops, best of 3: 16.4 ms per loop

%timeit -n 100 df.col1.map(len).max()
100 loops, best of 3: 10.1 ms per loop

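Note that explicitly using the str.len() method does not seem to be much of an improvement. If you are not familiar with IPython, which is where the very convenient %timeit syntax comes from, I would definitely suggest giving it a try for quick tests like this.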
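For readers without IPython, the same comparison can be sketched with the standard-library timeit module. This is only an illustration: the data size, the candidates dictionary, and the repeat counts are assumptions chosen so the script finishes quickly, and the absolute timings will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Much smaller than the question's data so the sketch runs quickly (assumed size).
values = np.repeat(['ab', 'bcd', 'dfe', 'efghik'], 10_000)
df = pd.DataFrame(values, columns=['col1'])

# The three approaches compared in the answer (dictionary name is hypothetical).
candidates = {
    'str.len': lambda: df.col1.str.len().max(),
    'map(lambda)': lambda: df.col1.map(lambda s: len(s)).max(),
    'map(len)': lambda: df.col1.map(len).max(),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    timings[name] = best * 1000
    print(f'{name:12s} {timings[name]:.3f} ms per call')
```

Taking the minimum over the repeats mirrors the "best of 3" figure that %timeit reported in the output above.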
As a minor addition, you might want to loop over all object columns in a DataFrame:
for c in df:
    if df[c].dtype == 'object':
        print('Max length of column %s: %s\n' %  (c, df[c].map(len).max()))

This prevents bool, int, and similar types from raising errors.
It can be extended to other non-numeric types such as 'string_' and 'unicode_', i.e.
if df[c].dtype in ('object', 'string_', 'unicode_'):
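As an alternative sketch of the same filtering idea, DataFrame.select_dtypes can pick out the object columns up front instead of testing each dtype inside the loop. The sample frame and the max_lengths name are made up for illustration, and the string column is created with dtype=object explicitly so the example does not depend on how a given pandas version infers string dtypes.

```python
import pandas as pd

# Hypothetical sample data: one string column plus int and bool columns.
df = pd.DataFrame({
    'name': pd.Series(['ab', 'bcd', 'efghik'], dtype=object),
    'count': [1, 2, 3],
    'flag': [True, False, True],
})

# Keep only object-dtype columns; the int and bool columns are skipped,
# so len() is never called on a non-string value.
max_lengths = {
    col: int(df[col].map(len).max())
    for col in df.select_dtypes(include='object').columns
}
print(max_lengths)
```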

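Sometimes you need the length of the longest string in bytes. This matters for strings that use special Unicode characters, in which case the byte length is greater than the regular length. It can be very relevant in specific situations, for example for database writes.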
df_col_len = int(df[df_col_name].str.encode(encoding='utf-8').str.len().max())

The line above has the extra str.encode(encoding='utf-8'). The output is wrapped in int() because it is otherwise a numpy object.
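To make the difference concrete, here is a small sketch comparing character length with UTF-8 byte length on a few made-up strings. The sample values and variable names are illustrative only; the non-ASCII characters are written as escapes so the byte counts do not depend on how the source file is encoded.

```python
import pandas as pd

# 'na\u00efve' contains one 2-byte character; the last entry is three
# CJK characters, each of which takes 3 bytes in UTF-8.
s = pd.Series(['abc', 'na\u00efve', '\u65e5\u672c\u8a9e'])

# Longest string measured in characters.
char_len = int(s.str.len().max())

# Longest string measured in UTF-8 bytes.
byte_len = int(s.str.encode('utf-8').str.len().max())

print(char_len, byte_len)
```

Note that the longest entry by characters and the longest entry by bytes need not be the same string, which is why a column sized by character count can still overflow when the database limit is expressed in bytes.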
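Excellent answers, in particular those from Marius and Ricky, which were very helpful.
Given that most of us are optimising for coding time, here is a quick extension of those answers that returns the maximum item length of every column as a Series, sorted by the maximum item length per column: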
mx_dct = {c: df[c].map(lambda x: len(str(x))).max() for c in df.columns}
pd.Series(mx_dct).sort_values(ascending=False)

Or as a one-liner:
pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending=False)

Adapting the original sample, this can be demonstrated as:
import pandas as pd

x = [['ab', 'bcd'], ['dfe', 'efghik']]
df = pd.DataFrame(x, columns=['col1','col2'])

print(pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending=False))

Output:
col2    6
col1    3
dtype: int64

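You can save some time by simply using map(len); the lambda just wastes time here. I'd guess about 25% or so.
I came to the same conclusion when DSM commented about map(len): roughly a 40% reduction compared with the map(lambda x: len(x)) approach.
One thing worth pointing out is that the str.len method is NaN-safe. With map(len), if the DataFrame contains a null value represented by NaN, you will get the following error: object of type 'float' has no len(). A-B-B's answer above converts to str to accommodate this.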
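The NaN behaviour described in that last comment can be reproduced with a short sketch. The sample Series and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value represented by NaN.
s = pd.Series(['ab', np.nan, 'efghik'], dtype=object)

# .str.len() propagates NaN, and max() skips NaN by default.
safe_max = int(s.str.len().max())

# map(len) calls len() on the float NaN and raises TypeError.
try:
    s.map(len).max()
    map_len_failed = False
except TypeError:
    map_len_failed = True

# Converting to str first avoids the error, but NaN is then
# measured as the text 'nan', which has length 3.
str_max = int(s.map(lambda v: len(str(v))).max())

print(safe_max, map_len_failed, str_max)
```

The str() conversion is convenient, but in a column whose real strings are all shorter than three characters the 'nan' text would determine the reported maximum, so str.len() may be the safer choice when missing values are expected.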