Pandas: apply a function to a column and return a new column
I have a function findFullPath that takes a file string and uses a list of full paths to find the full path of that file. For example:
>>> i = 4000
>>> serisuid = candidates.iloc[i].seriesuid
>>> fullPath = findFullPath(serisuid,fullPaths)
>>> print(serisuid)
>>> print(fullPath)
1.3.6.1.4.1.14519.5.2.1.6279.6001.100684836163890911914061745866
/home/msmith/luna16/subset1/1.3.6.1.4.1.14519.5.2.1.6279.6001.100684836163890911914061745866.raw
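I am trying to apply this function to the whole column candidates["seriesuid"] and return a new column holding the full paths, using the line below, but so far it has not worked: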
>>> candidates["seriesuidFullPaths"] = candidates[["seriesuid"]].apply(findFullPath,args=(fullPaths,),axis=1)
[Edit]
Sorry for being ambiguous. My function is:
def findFullPath(seriesuid,fullPaths):
    fullPath = [s.replace(".mhd",".raw") for s in fullPaths if serisuid in s][0]
    return fullPath
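It works fine in the one-off example I gave at the top, but when I apply it to the Series it produces incorrect full file paths. In addition, I get a copy warning: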
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
However, I thought I was editing the actual DataFrame, so I am a bit confused.
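Judging only from the code as posted, one possible cause of the incorrect paths: the parameter is named seriesuid but the list comprehension tests serisuid, so under apply the function would fall back to the global serisuid left over from the interactive session and return the same path for every row. Separately, candidates[["seriesuid"]] with axis=1 passes a one-element row Series to the function rather than a string. A minimal sketch with both points addressed; the shortened UIDs and paths are made-up stand-ins, and returning None for a missing match is an illustrative choice:

```python
import pandas as pd

def findFullPath(seriesuid, fullPaths):
    # Use the parameter itself, not a global left over from the session.
    matches = [s.replace(".mhd", ".raw") for s in fullPaths if seriesuid in s]
    # None marks "no match" instead of raising IndexError (illustrative choice).
    return matches[0] if matches else None

# Hypothetical stand-in data, shortened for readability.
fullPaths = ["/data/subset1/1.2.3.raw", "/data/subset4/4.5.6.raw"]
candidates = pd.DataFrame({"seriesuid": ["4.5.6", "1.2.3", "7.8.9"]})

# Series.apply passes one string per call; extra arguments go through args.
candidates["seriesuidFullPaths"] = candidates["seriesuid"].apply(
    findFullPath, args=(fullPaths,)
)
print(candidates)
```

Selecting the column with single brackets gives a Series, so no axis argument is needed.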
[Example]
>>> candidates.head()
seriesuid coordX \
0 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 -56.08
1 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 53.21
2 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 103.66
3 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 -33.66
4 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 -32.25
coordY coordZ class
0 -67.85 -311.92 0
1 -244.41 -245.17 0
2 -121.80 -286.62 0
3 -72.75 -308.41 0
4 -85.36 -362.51 0
I just updated fullPaths so that it contains only the .raw files:
>>> fullPaths = [path for path in fullPaths if ".raw" in path]
>>> fullPaths[:5]
['/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.142154819868944114554521645782.raw',
'/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.211071908915618528829547301883.raw',
'/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.390009458146468860187238398197.raw',
'/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.463214953282361219537913355115.raw',
'/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.252358625003143649770119512644.raw']
I want to replace each seriesuid in candidates with the path of the corresponding .raw file. I hope this clears it up.
This warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
usually appears when you apply a function to a slice of a DataFrame.
One way to get rid of it is:
candidates = df.loc[<Your condition>].copy()
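As a minimal sketch of that pattern: the df below and its class == 0 condition are hypothetical stand-ins for wherever candidates was sliced from. After the explicit .copy(), assigning a new column affects only candidates and should not trigger the warning.

```python
import pandas as pd

# Hypothetical parent frame; the real one is not shown in the question.
df = pd.DataFrame({"seriesuid": ["1.2.3", "4.5.6", "7.8.9"], "class": [0, 1, 0]})

# An explicit copy makes candidates an independent DataFrame,
# not a view or slice that pandas still associates with df.
candidates = df.loc[df["class"] == 0].copy()

# Adding a column now modifies only candidates.
candidates["flag"] = True

print(candidates)
print(df.columns.tolist())
```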
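You can try creating a new DataFrame p from the list fullPath, extracting the seriesuid into a new column seriesuid, and then merging p with the DataFrame candidates on the column seriesuid.
I changed the first and last items in the list fullPath for testing: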
print(candidates)
seriesuid coordX coordY coordZ \
0 9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... -56.08 -67.85 -311.92
1 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... 53.21 -244.41 -245.17
2 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... 103.66 -121.80 -286.62
3 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... -33.66 -72.75 -308.41
4 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... -32.25 -85.36 -362.51
class
0 0
1 0
2 0
3 0
4 0
If the directories in the list fullPath have different depths, you can use:
fullPath = ['/home/msmith/luna16/subset4/9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.raw',
'/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.211071908915618528829547301883.raw',
'/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.390009458146468860187238398197.raw',
'/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.463214953282361219537913355115.raw',
'/home/msmith/luna16/subset4/9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.raw']
import pandas as pd
p = pd.DataFrame(fullPath, columns=['paths'])
# strip the .raw extension (literal match, not a regex)
p["paths"] = p["paths"].str.replace(".raw", "", regex=False)
# split once from the right on / and put the last piece into column seriesuid
p[['tmp','seriesuid']] = p['paths'].str.rsplit('/', expand=True, n=1)
# drop the temporary column tmp
p = p.drop(['tmp'], axis=1)
print(p)
paths \
0 /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....
1 /home/msmith/luna16/subset4/1.3.6.1.4.1.14519....
2 /msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1...
3 /home/msmith/luna16/subset4/1.3.6.1.4.1.14519....
4 /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....
seriesuid
0 9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...
1 1.3.6.1.4.1.14519.5.2.1.6279.6001.211071908915...
2 1.3.6.1.4.1.14519.5.2.1.6279.6001.390009458146...
3 1.3.6.1.4.1.14519.5.2.1.6279.6001.463214953282...
4 9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...
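If a merge is more than is needed, a dict lookup is another option (a swapped-in alternative, not the approach shown above): key each path by its file name without the extension, then use Series.map. A sketch with made-up short UIDs; os.path.basename ignores directory depth, so paths of different lengths are handled the same way.

```python
import os
import pandas as pd

# Hypothetical stand-in data with two different directory depths.
fullPaths = [
    "/home/msmith/luna16/subset4/1.2.3.raw",
    "/msmith/luna16/subset1/4.5.6.raw",
]
candidates = pd.DataFrame({"seriesuid": ["4.5.6", "1.2.3", "4.5.6"]})

# Map "file name without extension" -> full path.
# If a name occurs more than once in the list, the last path wins.
lookup = {os.path.splitext(os.path.basename(path))[0]: path for path in fullPaths}

# Series.map looks up each seriesuid; unmatched values become NaN.
candidates["seriesuidFullPaths"] = candidates["seriesuid"].map(lookup)
print(candidates)
```

Unlike the substring test in findFullPath, this matches on the exact file name.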
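Shouldn't simply candidates["seriesuidfullpath"] = candidates["seriesuid"].apply(findFullPath) work?
"So far it has not worked": how exactly does it fail?
I have updated the question, @AmiTavory. When applied to the df it fails by returning incorrect file paths, and it raises a copy warning.
You can replace the custom function with a pandas function, e.g. candidates["seriesuidfullpath"] = candidates["seriesuid"].str.replace(".mhd", ".raw"). By the way, what is fullPaths? A list?
@jezrael Yes, it is a list of the full paths of all the files I am interested in, which are scattered across several folders. The DataFrame currently contains only the seriesuid, which is a substring of the full path name and is unique across the two kinds of file name (.raw and .mhd). I am only interested in the .raw ones. So I need to iterate over the seriesuid values and find the matching full path in the list.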
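When every path has the same directory depth, the seriesuid can be taken by position after splitting on /, and p can then be merged with candidates:
fullPath = ['/home/msmith/luna16/subset4/9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.raw',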
'/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.211071908915618528829547301883.raw',
'/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.390009458146468860187238398197.raw',
'/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.463214953282361219537913355115.raw',
'/home/msmith/luna16/subset4/9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.raw']
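p = pd.DataFrame(fullPath, columns=['paths'])
# strip the .raw extension (literal match, not a regex)
p["paths"] = p["paths"].str.replace(".raw", "", regex=False)
# the seriesuid is the sixth piece when splitting on / (index 5)
p['seriesuid'] = p['paths'].str.split('/').str[5]
print(p)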
paths \
0 /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....
1 /home/msmith/luna16/subset4/1.3.6.1.4.1.14519....
2 /home/msmith/luna16/subset4/1.3.6.1.4.1.14519....
3 /home/msmith/luna16/subset4/1.3.6.1.4.1.14519....
4 /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....
seriesuid
0 9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...
1 1.3.6.1.4.1.14519.5.2.1.6279.6001.211071908915...
2 1.3.6.1.4.1.14519.5.2.1.6279.6001.390009458146...
3 1.3.6.1.4.1.14519.5.2.1.6279.6001.463214953282...
4 9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...
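A note on the str.replace call that strips .raw: when the pattern is treated as a regular expression, the dot matches any character, and whether that is the default depends on the pandas version. Passing regex explicitly avoids the ambiguity. A small sketch on a made-up path that happens to contain "raw" in a directory name:

```python
import pandas as pd

# Hypothetical path chosen so that the two modes give different results.
paths = pd.Series(["/data/xraw/1.2.3.raw"])

# Literal match: only the ".raw" extension is removed.
literal = paths.str.replace(".raw", "", regex=False)

# Regex match: "." matches any character, so "xraw" is removed as well.
as_regex = paths.str.replace(".raw", "", regex=True)

print(literal[0])
print(as_regex[0])
```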
print(pd.merge(candidates, p, on=['seriesuid']))
seriesuid coordX coordY coordZ \
0 9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... -56.08 -67.85 -311.92
1 9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222... -56.08 -67.85 -311.92
class paths
0 0 /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....
1 0 /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....
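Note that pd.merge defaults to an inner join, so candidates without a matching path are dropped, and a seriesuid that appears twice in the path list produces two rows, as in the output above. If one row per candidate is wanted, something like the following sketch may help; the tiny frames are made-up stand-ins, and keeping the first duplicate is an arbitrary choice:

```python
import pandas as pd

# Hypothetical stand-ins for candidates and the path table p.
candidates = pd.DataFrame(
    {"seriesuid": ["a.1", "b.2", "c.3"], "coordX": [-56.08, 53.21, 103.66]}
)
p = pd.DataFrame(
    {"paths": ["/x/a.1", "/y/b.2", "/z/a.1"], "seriesuid": ["a.1", "b.2", "a.1"]}
)

# Keep one path per seriesuid so the merge cannot multiply candidate rows.
p_unique = p.drop_duplicates(subset="seriesuid")

# how="left" keeps every candidate; validate raises if p_unique still has duplicates.
merged = candidates.merge(p_unique, on="seriesuid", how="left", validate="many_to_one")
print(merged)
```

Candidates with no path end up with NaN in the paths column rather than disappearing.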