Python: given a column containing string data, create a dataframe with the ASCII equivalent of each character in the string


I'm trying to convert a list of strings to ASCII and put each character in its own column of a dataframe. I have 30 million such strings, and the code I'm running hits memory problems.

For example:
strings=['a','asd',1234,'ewq']

to get the following dataframe:

     0      1      2     3
0   97    0.0    0.0   0.0
1   97  115.0  100.0   0.0
2   49   50.0   51.0  52.0
3  101  119.0  113.0   0.0
What I tried:

pd.DataFrame([[ord(chr) for chr in list(str(rec))] for rec in strings]).fillna(0)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in __init__
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 404, in to_arrays
    dtype=dtype)
  File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 434, in _list_to_arrays
    content = list(lib.to_object_array(data).T)
  File "pandas/_libs/lib.pyx", line 2269, in pandas._libs.lib.to_object_array
MemoryError
Not sure if it's relevant, but strings is actually a column of these values in another dataframe.


Also, the longest string is almost 255 characters long. I know 30m x 1000 is a big number. Is there any way around this?
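To put the 30M x 255 figure in perspective: one uint8 cell per character is about 7.65 GB, while a float64 frame (which the list-of-lists constructor tends to produce once NaN padding appears) would be roughly eight times that. A sketch of a preallocated uint8 array approach on the sample data, converting the non-string items with str() first as the target output implies:

```python
import numpy as np
import pandas as pd

strings = ['a', 'asd', 1234, 'ewq']  # sample data from the question

# One uint8 cell per character: 30M rows x 255 cols is ~7.65 GB,
# versus ~61 GB if the frame ends up as float64.
as_str = [str(s) for s in strings]
max_len = max(len(s) for s in as_str)

# Preallocate zeros, then fill each row with the character codes;
# unused trailing cells stay 0, matching the desired padding.
arr = np.zeros((len(as_str), max_len), dtype=np.uint8)
for i, s in enumerate(as_str):
    arr[i, :len(s)] = [ord(c) for c in s]

df = pd.DataFrame(arr)
print(df)
```

Because the array is allocated once at its final size and dtype, no intermediate object-dtype or float copy is ever created.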

Have you tried explicitly setting the dtype to uint8, then processing the data in chunks? From your sample code, I'd guess you are implicitly using float32, which needs 4 times the memory.

For example, if you write it to a csv file and the strings fit into memory, you could try the following code:

def prepare_list(string, n, default):
    size= len(string)
    res= [ord(char) for char in string[:n]]
    if size < n:
        res+= [default] * (n - size)
    return res

chunk_size= 10000 # number of strings to be processed per step
max_len= 4        # maximum number of columns (=characters per string)
column_names= [str(i+1) for i in range(max_len)] # column names used for the columns
with open('output.csv', 'wt') as fp:
    while string_list:
        df= pd.DataFrame([prepare_list(s, max_len, 0) for s in string_list[:chunk_size]], dtype='uint8', columns=column_names)
        df.to_csv(fp, header=fp.tell() == 0, index=False)
        string_list= string_list[chunk_size:]
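For completeness, here is a self-contained run of the same chunked writer on the sample data, plus a chunked read-back via pd.read_csv(chunksize=...) so the uint8 dtype is kept on the way back in (the chunk_size of 2 and the file name output.csv are only for the demo):

```python
import pandas as pd

def prepare_list(string, n, default):
    # Truncate to n characters and pad with `default` up to length n
    size = len(string)
    res = [ord(char) for char in string[:n]]
    if size < n:
        res += [default] * (n - size)
    return res

string_list = ['a', 'asd', '1234', 'ewq']
chunk_size = 2
max_len = 4
column_names = [str(i + 1) for i in range(max_len)]

with open('output.csv', 'wt') as fp:
    while string_list:
        df = pd.DataFrame(
            [prepare_list(s, max_len, 0) for s in string_list[:chunk_size]],
            dtype='uint8', columns=column_names)
        # Write the header only for the first chunk (file position 0)
        df.to_csv(fp, header=fp.tell() == 0, index=False)
        string_list = string_list[chunk_size:]

# Read back lazily, chunk by chunk, keeping the small dtype throughout
for chunk in pd.read_csv('output.csv', dtype='uint8', chunksize=chunk_size):
    print(chunk.shape)
```

This way neither the write nor the read ever holds more than one chunk of the numeric frame in memory.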

This uses pandas' compressed (sparse) data type, but I only know how to apply it to the whole dataframe after it is built. Note: I'm assuming everything is a string, not a mix of integers and strings.

import pandas as pd
import numpy as np
strings = ['a','asd','1234','ewq']
stringsSeries = pd.Series(strings)

# Find padding length
maxlen = max(len(s) for s in strings)

# Use 8 bit integer with pandas sparse data type, compressing zeros
dt = pd.SparseDtype(np.int8, 0)

# Create the sparse dataframe from a pandas Series for each integer ord value, padded with zeros
# NOTE: This compresses the dataframe after creation. I couldn't find the right way to compress
# each series as the dataframe is built

sdf = stringsSeries.apply(lambda s: pd.Series((ord(c) for c in s.ljust(maxlen,chr(0))))).astype(dt)
print(f"Memory used: {sdf.info()}")

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 4 columns):
# 0    4 non-null Sparse[int8, 0]
# 1    4 non-null Sparse[int8, 0]
# 2    4 non-null Sparse[int8, 0]
# 3    4 non-null Sparse[int8, 0]
# dtypes: Sparse[int8, 0](4)
# memory usage: 135.0 bytes
# Memory used: None

# The original uncompressed size
df = stringsSeries.apply(lambda s: pd.Series((ord(c) for c in s.ljust(maxlen,chr(0)))))
print(f"Memory used: {df.info()}")

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 4 columns):
# 0    4 non-null int64
# 1    4 non-null int64
# 2    4 non-null int64
# 3    4 non-null int64
# dtypes: int64(4)
# memory usage: 208.0 bytes
# Memory used: None
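A quick way to see what the sparse representation actually saves is to compare per-column byte counts with memory_usage, rather than reading it off info() (a sketch; stringsSeries here is just pd.Series of the sample list):

```python
import pandas as pd
import numpy as np

strings = ['a', 'asd', '1234', 'ewq']
stringsSeries = pd.Series(strings)
maxlen = max(len(s) for s in strings)
dt = pd.SparseDtype(np.int8, 0)

# Dense int64 frame, NUL-padded to equal length
dense = stringsSeries.apply(
    lambda s: pd.Series([ord(c) for c in s.ljust(maxlen, chr(0))]))
# Sparse columns only store the non-zero values (plus their indices),
# so the padding zeros cost nothing.
sparse = dense.astype(dt)

print(dense.memory_usage(index=False).sum())
print(sparse.memory_usage(index=False).sum())
```

On real data with strings far shorter than the 255-character maximum, most cells are padding zeros, which is exactly where the sparse fill value pays off.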

30m is a big list; have you considered chunking it and saving the chunks to txt files? As in @JoTbe's answer, uint8 is a better dtype choice than a float dtype.
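A sketch of that chunk-and-save idea, assuming the strings sit in a column of another dataframe (the column name col and the file name pattern are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical source frame; 'col' is a made-up column name
source = pd.DataFrame({'col': ['a', 'asd', 1234, 'ewq']})
n_chunks = 2  # for 30M rows, pick something far larger

# Pad every string to the global maximum length with NUL (ord 0)
max_len = int(source['col'].astype(str).str.len().max())

for i, chunk in enumerate(np.array_split(source['col'], n_chunks)):
    rows = [[ord(c) for c in str(v).ljust(max_len, chr(0))] for v in chunk]
    # uint8 end to end: no float intermediate, no NaN padding needed
    np.savetxt(f'codes_{i}.txt', np.array(rows, dtype=np.uint8), fmt='%d')
```

Each chunk is converted and written independently, so peak memory is bounded by the chunk size rather than the full 30M rows.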