Python vs. R: Conciseness of converted code
I converted this R code:
# Raw data
data <- data.frame(
metalname=c('Al','Cd','Cr','Co','Cu','Au','Fe','Pb','Mo','Ni','Pt','Au','Ta','Ti','W','Zn'),
radius=c(0.1431,0.1490,0.1249,0.1253,0.1278,0.1442,0.1241,0.1750,0.1363,0.1246,0.1387,0.1445,0.1430,0.1445,0.1371,0.1332),
crystal=c('FCC','HCP','BCC','HCP','FCC','FCC','BCC','FCC','BCC','FCC','FCC','FCC','BCC','HCP','BCC','HCP'))
# Calc lattice parameters (nm)
data <- rbind(
transform(subset(data, crystal=='BCC'), N=2, latticea=4*radius/sqrt(3), latticec=0),
transform(subset(data, crystal=='FCC'), N=4, latticea=2*radius*sqrt(2), latticec=0),
transform(subset(data, crystal=='HCP'), N=6, latticea=2*radius, latticec=4*radius*sqrt(2/3))
)
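The code works, but is there a way to make the Python version more concise? Ideally I could do the multi-column calculations in a single step, without needing temporary variables, as in the R code. Is that possible?
This is a use case for the new query/eval functionality in pandas 0.13: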
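from math import sqrt
databcc = data.query('crystal == "BCC"')
sqrt3 = sqrt(3)
# In current pandas, local variables need the @ prefix and eval returns a new frame
databcc = databcc.eval('latticea = 4 * radius / @sqrt3')
# ...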
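The snippet above covers only the BCC rows. A minimal sketch of the full three-way calculation with query/eval might look like the following, assuming a current pandas release in which local variables are referenced with an @ prefix and eval returns a new frame rather than modifying the frame in place. The names sqrt2, sqrt3, sqrt2_3, parts and result are illustrative and do not come from the original post.

```python
from math import sqrt

import pandas as pd

data = pd.DataFrame({
    "metalname": ["Al", "Cd", "Cr", "Co", "Cu", "Au", "Fe", "Pb",
                  "Mo", "Ni", "Pt", "Au", "Ta", "Ti", "W", "Zn"],
    "radius": [0.1431, 0.1490, 0.1249, 0.1253, 0.1278, 0.1442, 0.1241, 0.1750,
               0.1363, 0.1246, 0.1387, 0.1445, 0.1430, 0.1445, 0.1371, 0.1332],
    "crystal": ["FCC", "HCP", "BCC", "HCP", "FCC", "FCC", "BCC", "FCC",
                "BCC", "FCC", "FCC", "FCC", "BCC", "HCP", "BCC", "HCP"],
})

# Constants are computed outside the expression strings and
# referenced inside them with the @ prefix.
sqrt2 = sqrt(2)
sqrt3 = sqrt(3)
sqrt2_3 = sqrt(2 / 3)

# One filtered and extended frame per crystal structure (lattice parameters in nm).
parts = [
    data.query('crystal == "BCC"')
        .eval("latticea = 4 * radius / @sqrt3")
        .assign(N=2, latticec=0.0),
    data.query('crystal == "FCC"')
        .eval("latticea = 2 * radius * @sqrt2")
        .assign(N=4, latticec=0.0),
    data.query('crystal == "HCP"')
        .eval("latticea = 2 * radius")
        .eval("latticec = 4 * radius * @sqrt2_3")
        .assign(N=6),
]

# Stack the three pieces and restore the original row order.
result = pd.concat(parts).sort_index()
print(result)
```

The final sort_index() is only there to put the rows back in their original order after the three pieces are concatenated.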
Currently (pandas 0.13), functions cannot be called inside the expression string, so you have to define a local variable and use that.
This will be quite fast, since it is all vectorized:
In [65]: data.join(
    concat([
        DataFrame(dict(N=2, latticea=4*data.loc[data.crystal=='BCC','radius']/np.sqrt(3))),
        DataFrame(dict(N=4, latticea=2*data.loc[data.crystal=='FCC','radius']*np.sqrt(2))),
        DataFrame(dict(N=6, latticea=2*data.loc[data.crystal=='HCP','radius'],
                       latticec=4*data.loc[data.crystal=='HCP','radius']*np.sqrt(2/3.0)))
    ]))
Out[65]:
   crystal metalname  radius  N  latticea  latticec
0      FCC        Al  0.1431  4  0.404748       NaN
1      HCP        Cd  0.1490  6  0.298000  0.486632
2      BCC        Cr  0.1249  2  0.288444       NaN
3      HCP        Co  0.1253  6  0.250600  0.409228
4      FCC        Cu  0.1278  4  0.361473       NaN
5      FCC        Au  0.1442  4  0.407859       NaN
6      BCC        Fe  0.1241  2  0.286597       NaN
7      FCC        Pb  0.1750  4  0.494975       NaN
8      BCC        Mo  0.1363  2  0.314771       NaN
9      FCC        Ni  0.1246  4  0.352422       NaN
10     FCC        Pt  0.1387  4  0.392303       NaN
11     FCC        Au  0.1445  4  0.408708       NaN
12     BCC        Ta  0.1430  2  0.330244       NaN
13     HCP        Ti  0.1445  6  0.289000  0.471935
14     BCC         W  0.1371  2  0.316619       NaN
15     HCP        Zn  0.1332  6  0.266400  0.435029
[16 rows x 6 columns]
Does this equivalent of the code in the original question look any better than the original R masterpiece?
import math
import pandas as pd
from pandas import DataFrame
# Raw data
data = DataFrame({
    'metalname': ['Al','Cd','Cr','Co','Cu','Au','Fe','Pb','Mo','Ni','Pt','Au','Ta','Ti','W','Zn'],
    'radius': [0.1431,0.1490,0.1249,0.1253,0.1278,0.1442,0.1241,0.1750,0.1363,0.1246,0.1387,0.1445,0.1430,0.1445,0.1371,0.1332],
    'crystal': ['FCC','HCP','BCC','HCP','FCC','FCC','BCC','FCC','BCC','FCC','FCC','FCC','BCC','HCP','BCC','HCP']
})
def calc_lattic_params(x):
    # Return N and the lattice parameters (nm) for one row, based on its crystal structure
    N = None
    l = None
    lc = None
    if x['crystal'] == 'BCC':
        N = 2
        l = 4 * x['radius'] / math.sqrt(3)
    elif x['crystal'] == 'FCC':
        N = 4
        l = 2 * x['radius'] * math.sqrt(2)
    elif x['crystal'] == 'HCP':
        N = 6
        l = 2 * x['radius']
        lc = 4 * x['radius'] * math.sqrt(2.0/3.0)
    return pd.Series({'N': N, 'latticea': l, 'latticec': lc})
# Apply row by row and attach the three new columns to the original frame
data = pd.concat([data, data.apply(calc_lattic_params, axis=1)], axis=1)
Here is an Incanter (Lisp-based) version for comparison, in case anyone is interested:
(use '(incanter core stats charts))
; Raw data
(def data (dataset [:metalname :radius :crystal] [
["Al" 0.1431 "FCC"]
["Cd" 0.1490 "HCP"]
["Cr" 0.1249 "BCC"]
["Co" 0.1253 "HCP"]
["Cu" 0.1278 "FCC"]
["Au" 0.1442 "FCC"]
["Fe" 0.1241 "BCC"]
["Pb" 0.1750 "FCC"]
["Mo" 0.1363 "BCC"]
["Ni" 0.1246 "FCC"]
["Pt" 0.1387 "FCC"]
["Au" 0.1445 "FCC"]
["Ta" 0.1430 "BCC"]
["Ti" 0.1445 "HCP"]
["W" 0.1371 "BCC"]
["Zn" 0.1332 "HCP"]
]))
; Calc lattice parameters (nm)
(conj-rows
(add-derived-column :latticec [] (fn [] 0)
(add-derived-column :latticea [:radius] (fn [r] (/ (* 4 r) (sqrt 3)))
(add-derived-column :n [] (fn [] 2)
($where {:crystal "BCC"} data))))
(add-derived-column :latticec [] (fn [] 0)
(add-derived-column :latticea [:radius] (fn [r] (* 2 r (sqrt 2)))
(add-derived-column :n [] (fn [] 4)
($where {:crystal "FCC"} data))))
(add-derived-column :latticec [:radius] (fn [r] (* 4 r (sqrt (/ 2 3))))
(add-derived-column :latticea [:radius] (fn [r] (* 2 r))
(add-derived-column :n [] (fn [] 6)
($where {:crystal "HCP"} data)))))
Since the question is a comparison of code conciseness (Python vs R), here is an R solution using data.table:
library(data.table)
dt <- data.table(data, key="crystal")
data_transformed_dt <- rbind(dt["BCC", .(metalname, radius, crystal, N=2, latticea=4*radius/sqrt(3), latticec=0)],
dt['FCC', .(metalname, radius, crystal, N=4, latticea=2*radius*sqrt(2), latticec=0)],
dt['HCP', .(metalname, radius, crystal, N=6, latticea=2*radius, latticec=4*radius*sqrt(2/3))])
We can then update the original data while joining, as follows:
# v1.9.5+, for new feature "on = ", See github project page
require(data.table)
key = data.table(crystal = c("BCC", "FCC", "HCP"),
latticea = c(4/sqrt(3), 2*sqrt(2), 2),
latticec=c(0,0,4*sqrt(2/3)),
N = c(2,4,6))
setDT(data)[key , c("latticea", "latticec", "N") :=
.(radius * latticea, radius * latticec, N),
on = "crystal"]
# metalname radius crystal latticea latticec N
# 1: Al 0.1431 FCC 0.4047479 0.0000000 4
# 2: Cd 0.1490 HCP 0.2980000 0.4866320 6
# 3: Cr 0.1249 BCC 0.2884442 0.0000000 2
# 4: Co 0.1253 HCP 0.2506000 0.4092281 6
# 5: Cu 0.1278 FCC 0.3614730 0.0000000 4
# 6: Au 0.1442 FCC 0.4078592 0.0000000 4
# 7: Fe 0.1241 BCC 0.2865967 0.0000000 2
# 8: Pb 0.1750 FCC 0.4949747 0.0000000 4
# 9: Mo 0.1363 BCC 0.3147714 0.0000000 2
# 10: Ni 0.1246 FCC 0.3524220 0.0000000 4
# 11: Pt 0.1387 FCC 0.3923028 0.0000000 4
# 12: Au 0.1445 FCC 0.4087077 0.0000000 4
# 13: Ta 0.1430 BCC 0.3302444 0.0000000 2
# 14: Ti 0.1445 HCP 0.2890000 0.4719350 6
# 15: W 0.1371 BCC 0.3166189 0.0000000 2
# 16: Zn 0.1332 HCP 0.2664000 0.4350294 6
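This should be very memory efficient and fast, since we update by reference (and do not materialise the entire join). on = "crystal" performs the join on that column: it finds the matching row indices in data corresponding to each row in key, and updates or creates the necessary columns on those matching rows.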
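For comparison, a rough pandas analogue of this lookup-table join might look like the sketch below. It assumes the same per-crystal multipliers as the key table in the R code. Unlike the data.table version, merge builds a new frame rather than updating by reference, so this mirrors the logic and not the memory behaviour. The name result is illustrative.

```python
import math

import pandas as pd

data = pd.DataFrame({
    "metalname": ["Al", "Cd", "Cr", "Co", "Cu", "Au", "Fe", "Pb",
                  "Mo", "Ni", "Pt", "Au", "Ta", "Ti", "W", "Zn"],
    "radius": [0.1431, 0.1490, 0.1249, 0.1253, 0.1278, 0.1442, 0.1241, 0.1750,
               0.1363, 0.1246, 0.1387, 0.1445, 0.1430, 0.1445, 0.1371, 0.1332],
    "crystal": ["FCC", "HCP", "BCC", "HCP", "FCC", "FCC", "BCC", "FCC",
                "BCC", "FCC", "FCC", "FCC", "BCC", "HCP", "BCC", "HCP"],
})

# Per-crystal multipliers of the radius, mirroring the `key` table in the R code.
key = pd.DataFrame({
    "crystal": ["BCC", "FCC", "HCP"],
    "latticea": [4 / math.sqrt(3), 2 * math.sqrt(2), 2.0],
    "latticec": [0.0, 0.0, 4 * math.sqrt(2 / 3)],
    "N": [2, 4, 6],
})

# A left merge looks up the multipliers for every row of `data`,
# keeping the rows in their original order.
result = data.merge(key, on="crystal", how="left")

# Scale the looked-up multipliers by each row's radius to get values in nm.
cols = ["latticea", "latticec"]
result[cols] = result[cols].mul(result["radius"], axis=0)
print(result)
```

Keeping the formulas in a small table makes it easy to add another crystal structure without touching the calculation itself.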
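Note that the original order of data is also preserved in the result.
Python syntax cannot accommodate this, so we have to escape into a separate expression-string language? If the premise of pandas is to port R functionality to a cleaner language platform, then Python seems to be the wrong choice.
So, let me understand you: you think that typing two extra single quotes is enough to make you rethink the motivation for pandas?
You trivialize it as just typing two quote characters, but those quotes actually escape out of Python into an entirely separate language, which is a poor solution. If Python syntax is not sufficient for the needs of the product, then yes, it should definitely use a different platform.
This is almost exactly (a subset of) what I have in my original post. Do you have any different suggestions? It is much longer and much less readable than the R version, and it still uses temporary variables.
Where exactly does it use a temporary variable? It is returning a new frame.
Your code only performs one of the steps. In the full operation, I split the data frame into three, add three new columns using different formulas, and then append the pieces back into a final data frame. In that case, databcc would be a temporary intermediate variable.
Wow! Your new code is concise, uses no intermediate temporary variables, and does not need a separate expression language embedded in strings. That is perfect. It is exactly equivalent to the R version. Thanks! By the way, I made one correction and a minor change.
In Python 2.7, math.sqrt(2/3) will be 0; you need to specify 2.0/3.0 to get floating-point division.
I am using Python 3, but I understand. You are absolutely right about Python 2.x.
This is the best Python solution so far. A prettier one would still be better, though.
Thanks @Arun for the improvement, and thanks for making data.table such a great package! It was data.table that made me decide to learn R. (By the way: although I was using v1.9.5, I had to reinstall data.table to get the code to run.)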
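The integer-division pitfall raised in the comments can be illustrated in Python 3, where the // operator reproduces what / did for two integers in Python 2. This is only a small sketch; the variable names are illustrative, and the radius is the Cd value from the data above.

```python
import math

# In Python 2, 2/3 between two ints floors to 0.
# Python 3's // operator behaves the same way.
floored = 2 // 3
true_ratio = 2.0 / 3.0

radius = 0.1490  # Cd, in nm

# With the floored ratio the HCP c parameter collapses to zero.
wrong_c = 4 * radius * math.sqrt(floored)
# With floating-point division it matches the R result.
right_c = 4 * radius * math.sqrt(true_ratio)

print(floored, wrong_c, round(right_c, 6))
```

In Python 2, writing 2.0/3.0 or adding from __future__ import division at the top of the module avoids the problem.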