Python: convert the string 2.90K to 2900, or the string 5.2M to 5200000, in a DataFrame

python pandas dataframe

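I need some help manipulating data inside a DataFrame. Any help is welcome.

I have OHLCV data in CSV format, and I have loaded the file into a DataFrame.

How do I convert the volume column from 2.90K to 2900, or from 5.2M to 5200000? The column can contain K for thousands and M for millions.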

import pandas as pd

file_path = '/home/fatjoe/UCHM.csv'
df = pd.read_csv(file_path, parse_dates=[0], index_col=0)
df.columns = [
"closing_price", 
"opening_price", 
"high_price", 
"low_price",
"volume",
"change"]

df['opening_price'] = df['closing_price']
df['opening_price'] = df['opening_price'].shift(-1)
df = df.replace('-', 0)
df = df[:-1]
print(df.head())

Console:
 Date
 2016-09-23          0
 2016-09-22      9.60K
 2016-09-21     54.20K
 2016-09-20    115.30K
 2016-09-19     18.90K
 2016-09-16    176.10K
 2016-09-15     31.60K
 2016-09-14     10.00K
 2016-09-13      3.20K

Suppose you have the following DataFrame:

In [30]: df
Out[30]:
         Date      Val
0  2016-09-23      100
1  2016-09-22    9.60M
2  2016-09-21   54.20K
3  2016-09-20  115.30K
4  2016-09-19   18.90K
5  2016-09-16  176.10K
6  2016-09-15   31.60K
7  2016-09-14   10.00K
8  2016-09-13    3.20M

You can do it like this:

In [31]: df.Val = (df.Val.replace(r'[KM]+$', '', regex=True).astype(float) * \
   ....:           df.Val.str.extract(r'[\d\.]+([KM]+)', expand=False)
   ....:             .fillna(1)
   ....:             .replace(['K','M'], [10**3, 10**6]).astype(int))

In [32]: df
Out[32]:
         Date        Val
0  2016-09-23      100.0
1  2016-09-22  9600000.0
2  2016-09-21    54200.0
3  2016-09-20   115300.0
4  2016-09-19    18900.0
5  2016-09-16   176100.0
6  2016-09-15    31600.0
7  2016-09-14    10000.0
8  2016-09-13  3200000.0

Explanation:

In [36]: df.Val.replace(r'[KM]+$', '', regex=True).astype(float)
Out[36]:
0    100.0
1      9.6
2     54.2
3    115.3
4     18.9
5    176.1
6     31.6
7     10.0
8      3.2
Name: Val, dtype: float64

In [37]: df.Val.str.extract(r'[\d\.]+([KM]+)', expand=False)
Out[37]:
0    NaN
1      M
2      K
3      K
4      K
5      K
6      K
7      K
8      M
Name: Val, dtype: object

In [38]: df.Val.str.extract(r'[\d\.]+([KM]+)', expand=False).fillna(1)
Out[38]:
0    1
1    M
2    K
3    K
4    K
5    K
6    K
7    K
8    M
Name: Val, dtype: object

In [39]: df.Val.str.extract(r'[\d\.]+([KM]+)', expand=False).fillna(1).replace(['K','M'], [10**3, 10**6]).astype(int)
Out[39]:
0          1
1    1000000
2       1000
3       1000
4       1000
5       1000
6       1000
7       1000
8    1000000
Name: Val, dtype: int32
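The same two steps (split off the suffix, then multiply by its factor) can also be written with an explicit lookup table. This is only a minimal sketch: the helper name `expand_suffix` and the `MULTIPLIERS` table are made up for illustration and are not part of the answer above.

```python
import pandas as pd

# Assumed suffix table (hypothetical name); extend it if your data uses other suffixes.
MULTIPLIERS = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9}

def expand_suffix(values: pd.Series) -> pd.Series:
    """Convert strings such as '2.90K' or '5.2M' to floats."""
    # Split each string into its numeric part and an optional K/M/B suffix.
    parts = values.astype(str).str.strip().str.extract(r"^([\d.]+)([KMB]?)$")
    number = parts[0].astype(float)
    # Look up the multiplier for each suffix; an empty suffix maps to 1.
    factor = parts[1].map(MULTIPLIERS)
    return number * factor

vals = pd.Series(["100", "9.60M", "54.20K", "3.20M"])
result = expand_suffix(vals)
print(result.tolist())
```

Strings that do not match the pattern come out as NaN instead of raising, which may or may not be what you want for your data.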

DataFrame.replace with pd.eval

I like MaxU's answer. You can shorten it considerably with pd.eval:

df['Val'].replace({'K': '*1e3', 'M': '*1e6'}, regex=True).map(pd.eval).astype(int)

0        100
1    9600000
2      54200
3     115300
4      18900
5     176100
6      31600
7      10000
8    3200000
Name: Val, dtype: int64

A slight modification also makes this case-insensitive:

repl_dict = {'[kK]': '*1e3', '[mM]': '*1e6', '[bB]': '*1e9', }
df['Val'].replace(repl_dict, regex=True).map(pd.eval)

0        100.0
1    9600000.0
2      54200.0
3     115300.0
4      18900.0
5     176100.0
6      31600.0
7      10000.0
8    3200000.0
Name: Val, dtype: float64

Explanation

Assuming "Val" is a column of strings, the replace operation yields

df['Val'].replace({'K': '*1e3', 'M': '*1e6'}, regex=True)

0           100
1      9.60*1e6
2     54.20*1e3
3    115.30*1e3
4     18.90*1e3
5    176.10*1e3
6     31.60*1e3
7     10.00*1e3
8      3.20*1e6
Name: Val, dtype: object

Each of these is an arithmetic expression that pd.eval can evaluate:

_ .map(pd.eval)

0        100.0
1    9600000.0
2      54200.0
3     115300.0
4      18900.0
5     176100.0
6      31600.0
7      10000.0
8    3200000.0
Name: Val, dtype: float64
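If the column comes from an untrusted file, you may prefer not to evaluate its strings as expressions at all. The sketch below is an alternative technique, not the answer's method: a plain parsing function (the name `parse_abbreviated` and the `SUFFIXES` table are assumptions) applied with `Series.map`.

```python
import pandas as pd

# Assumed suffix table, keyed in lower case so 'k' and 'K' behave the same.
SUFFIXES = {"k": 1e3, "m": 1e6, "b": 1e9}

def parse_abbreviated(text: str) -> float:
    """Parse '54.20K' into 54200.0 without evaluating the string as code."""
    text = text.strip()
    factor = SUFFIXES.get(text[-1:].lower())
    if factor is None:
        # No recognised suffix: treat the whole string as a plain number.
        return float(text)
    return float(text[:-1]) * factor

vals = pd.Series(["100", "9.60M", "54.20k", "3.20B"])
parsed = vals.map(parse_abbreviated)
print(parsed.tolist())
```

A string that is not a number, such as a lone "-", raises ValueError here, so placeholders need to be replaced beforehand.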

To generalize cs95's answer further, I would do this:

df['Val'].replace({'K': '*1e3', 'M': '*1e6', '-':'-1'}, regex=True).map(pd.eval).astype(int)

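This is because, for some values, pd.eval would otherwise have to multiply a bare "-" by another number, which raises an error (could not convert string to float: '-').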
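One thing to watch with a pattern like '-' under regex=True is that it matches every hyphen, not just a cell that consists of a lone "-". The sketch below, on made-up sample values, compares that pattern with an anchored one, `^-$`, which only touches placeholder cells; mapping the placeholder to '0' is an assumption about what a lone "-" should mean.

```python
import pandas as pd

# Made-up sample: a placeholder, a negative value and an ordinary value.
vals = pd.Series(["-", "-5.2K", "3.1M"])

# Unanchored: every '-' is rewritten, including the minus sign of '-5.2K'.
unanchored = vals.replace({"K": "*1e3", "M": "*1e6", "-": "-1"}, regex=True)

# Anchored: only a cell that is exactly '-' is rewritten (here to '0').
anchored = vals.replace({"K": "*1e3", "M": "*1e6", "^-$": "0"}, regex=True)

print(unanchored.tolist())
print(anchored.tolist())
```

If the column can never hold negative numbers, the two variants differ only in the value chosen for the placeholder.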

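@JosephMNjuguna, you're welcome! I've added a step-by-step explanation to my answer, please take a look.

@MaxU, now I know how to use regular expressions in pandas. I had been at it for days.

For billions as well:

df.Val = (df.Val.replace(r'[KMB]+$', '', regex=True).astype(float) * df.Val.str.extract(r'[\d\.]+([KMB]+)', expand=False).fillna(1).replace(['K','M','B'], [10**3, 10**6, 10**9]).astype(int))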

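If I want to add one more feature, replacing '-' with 0, is

replace({'[kK]': '*1e3', '[mM]': '*1e6', '[bB]': '*1e9', '-': '0'}, regex=True)

correct? Just wanted to confirm.

For DataFrame objects (multiple columns), use .apply(pd.eval) or .applymap(pd.eval); the .map method only works on a Series, as in this example.

df['Val'].replace(repl_dict, regex=True).apply(pd.eval)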

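@DavidDarby df['Val'].replace(repl_dict, regex=True) returns a Series; did you mean df.replace(repl_dict, regex=True)? If so, your comment is correct: apply lets you generalize to multiple columns.

Yes, I should have used a DataFrame example like yours.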
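For reference, one way to apply the multi-column idea from these comments is to run replace on the whole DataFrame and then evaluate every cell individually. This is a sketch under assumptions: the two-column frame and its column names are invented, and the per-cell evaluation via `col.map(pd.eval)` is one possible reading of the comments, not code taken from them.

```python
import pandas as pd

repl_dict = {"[kK]": "*1e3", "[mM]": "*1e6", "[bB]": "*1e9"}

# Hypothetical two-column frame; the column names are made up for illustration.
df = pd.DataFrame({
    "volume": ["9.60K", "54.20K", "100"],
    "turnover": ["1.5M", "2B", "750k"],
})

# Rewrite the suffixes in every column, then evaluate each cell on its own.
expressions = df.replace(repl_dict, regex=True)
converted = expressions.apply(lambda col: col.map(pd.eval)).astype(float)
print(converted)
```

Evaluating cell by cell keeps each call to pd.eval on a single string expression, as in the single-column answer.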
