Pandas: apply a function to a column and return a new column

I have a function findFullPath that takes a file string and looks up that file's full path in a list of full paths. For example:

    >>> i = 4000
    >>> serisuid = candidates.iloc[i].seriesuid
    >>> fullPath = findFullPath(serisuid,fullPaths)
    >>> print(serisuid)
    >>> print(fullPath)

    1.3.6.1.4.1.14519.5.2.1.6279.6001.100684836163890911914061745866
    /home/msmith/luna16/subset1/1.3.6.1.4.1.14519.5.2.1.6279.6001.100684836163890911914061745866.raw

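I am trying to apply this function to the whole column candidates["seriesuid"] and return a new column containing the full paths, using the line below, but so far it has not worked: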

>>> candidates["seriesuidFullPaths"] = candidates[["seriesuid"]].apply(findFullPath,args=(fullPaths,),axis=1)

[Edit]

Sorry for being ambiguous. My function is

def findFullPath(seriesuid,fullPaths):
    fullPath = [s.replace(".mhd",".raw") for s in fullPaths if serisuid in s][0]
    return fullPath

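It works fine in the case-by-case code I gave at the top, but when I apply it to the Series it produces incorrect full file paths. In addition, I get a copy error: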

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

However, I thought I was editing the actual DataFrame, so I am a bit confused.

[Example]

>>> candidates.head()

                                                          seriesuid  coordX  \
0  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860  -56.08   
1  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860   53.21   
2  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860  103.66   
3  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860  -33.66   
4  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860  -32.25   

   coordY  coordZ  class  
0  -67.85 -311.92      0  
1 -244.41 -245.17      0  
2 -121.80 -286.62      0  
3  -72.75 -308.41      0  
4  -85.36 -362.51      0

I just updated fullPaths so that it contains only the .raw files:

>>> fullPaths = [path for path in fullPaths if ".raw" in path]
>>> fullPaths[:5] 
['/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.142154819868944114554521645782.raw',
     '/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.211071908915618528829547301883.raw',
     '/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.390009458146468860187238398197.raw',
     '/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.463214953282361219537913355115.raw',
     '/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.252358625003143649770119512644.raw']

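I want to replace each seriesuid in candidates with the relevant .raw file path. Hopefully that clears it up.

This message (pandas reports it as a SettingWithCopyWarning):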

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

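usually appears when you apply a function to, or assign into, a slice of a DataFrame.

One way to get rid of it is to take an explicit copy when you create candidates: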

candidates = df.loc[<Your condition>].copy()
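As a minimal sketch of that fix: the parent frame df, its contents, and the filter on class are all assumptions, since the question does not show how candidates was created. The point is only that a frame built with .copy() is independent of its parent, so adding a column to it is unambiguous.

```python
import pandas as pd

# Hypothetical parent frame; the question does not show the real one.
df = pd.DataFrame({
    "seriesuid": ["uid-1", "uid-2", "uid-3"],
    "class": [0, 1, 0],
})

# .copy() makes candidates an independent DataFrame rather than
# something pandas may treat as a slice of df.
candidates = df.loc[df["class"] == 0].copy()

# Adding a column now affects candidates only; df is left untouched.
candidates["seriesuidFullPaths"] = "placeholder"

print(candidates)
print("seriesuidFullPaths" in df.columns)
```

Whether the warning appears at all without the copy depends on the pandas version and on how the slice was taken, so the sketch shows only the copy itself.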

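You can try creating a new DataFrame from the list fullPath, extracting the seriesuid into a new column seriesuid, and then merging it with the DataFrame candidates on the column seriesuid.

I changed the first and last items in the list fullPath for testing: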

print(candidates)
                                           seriesuid  coordX  coordY  coordZ  \
0  9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  -56.08  -67.85 -311.92   
1  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...   53.21 -244.41 -245.17   
2  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  103.66 -121.80 -286.62   
3  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  -33.66  -72.75 -308.41   
4  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  -32.25  -85.36 -362.51   

   class  
0      0  
1      0  
2      0  
3      0  
4      0

If the directories in the list fullPath have different lengths, you can use:
import pandas as pd

fullPath = ['/home/msmith/luna16/subset4/9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.raw',
     '/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.211071908915618528829547301883.raw',
     '/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.390009458146468860187238398197.raw',
     '/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.463214953282361219537913355115.raw',
     '/home/msmith/luna16/subset4/9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.raw']

p = pd.DataFrame(fullPath, columns=['paths'])
# remove the literal text ".raw" (regex=False so "." is not treated as a wildcard)
p["paths"] = p["paths"].str.replace(".raw", "", regex=False)
# split once from the right on "/" and put the last piece in column seriesuid
p[['tmp','seriesuid']] = p['paths'].str.rsplit('/', expand=True, n=1)
# drop the helper column tmp
p = p.drop(['tmp'], axis=1)
print(p)
                                               paths  \
0  /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....   
1  /home/msmith/luna16/subset4/1.3.6.1.4.1.14519....   
2  /msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1...   
3  /home/msmith/luna16/subset4/1.3.6.1.4.1.14519....   
4  /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....   

                                           seriesuid  
0  9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  
1  1.3.6.1.4.1.14519.5.2.1.6279.6001.211071908915...  
2  1.3.6.1.4.1.14519.5.2.1.6279.6001.390009458146...  
3  1.3.6.1.4.1.14519.5.2.1.6279.6001.463214953282...  
4  9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  

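Shouldn't just using candidates["seriesuidFullPaths"] = candidates["seriesuid"].apply(findFullPath) work? "So far it has not worked": how exactly does it fail?

I have updated the question, @AmiTavory. When applied to the df it fails by returning incorrect file paths, and it produces the copy error.

You can replace the custom function with a pandas function, e.g. candidates["seriesuidFullPaths"] = candidates["seriesuid"].str.replace(".mhd", ".raw"). By the way, what is the list fullPaths?

@jezrael Yes, it lists all the full paths of the files I am interested in, which are scattered across several folders. The DataFrame currently contains only the seriesuid, which is a substring of the full path name and is unique for the two types of file name (.raw and .mhd). I am only interested in the .raw files, so I need to iterate over the seriesuid values and find the matching full path in the list.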
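One detail worth checking in the function as posted: its parameter is spelled seriesuid, but the list comprehension tests serisuid, which is the name of the variable left over from the case-by-case example. If that mismatch is unintended, every call would look up the same leftover value, which might explain the incorrect paths. Separately, selecting with double brackets and axis=1 passes a whole row to the function rather than a single string. The sketch below assumes the mismatch is a typo. The name find_full_path, the short ids, and the paths are hypothetical stand-ins, and returning None when nothing matches is an illustrative choice, not something the question specifies.

```python
import pandas as pd

def find_full_path(seriesuid, full_paths):
    # Use the parameter itself; the posted body refers to `serisuid` instead.
    matches = [path for path in full_paths if seriesuid in path]
    # Illustrative choice: return None instead of raising IndexError on no match.
    return matches[0] if matches else None

# Hypothetical stand-ins for the real list and DataFrame.
full_paths = ["/data/subset0/uid-1.raw", "/data/subset1/uid-2.raw"]
candidates = pd.DataFrame({"seriesuid": ["uid-1", "uid-2", "uid-1"]})

# Single brackets select a Series, so each call receives one string;
# extra positional arguments are passed through `args`.
candidates["seriesuidFullPaths"] = candidates["seriesuid"].apply(
    find_full_path, args=(full_paths,)
)

print(candidates)
```

This still scans the whole list once per row, so for a large candidates frame a lookup table or a merge is likely to be faster.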

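If every path has the same directory depth, the seriesuid can instead be taken by position after splitting on /, and p can then be merged with candidates on seriesuid:

import pandas as pd

fullPath = ['/home/msmith/luna16/subset4/9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.raw',
     '/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.211071908915618528829547301883.raw',
     '/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.390009458146468860187238398197.raw',
     '/home/msmith/luna16/subset4/1.3.6.1.4.1.14519.5.2.1.6279.6001.463214953282361219537913355115.raw',
     '/home/msmith/luna16/subset4/9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.raw']

p = pd.DataFrame(fullPath, columns=['paths'])
# remove the literal text ".raw" (regex=False so "." is not treated as a wildcard)
p["paths"] = p["paths"].str.replace(".raw", "", regex=False)
# split on "/" and take the piece at index 5, which is the file name at this depth
p['seriesuid'] = p['paths'].str.split('/').str[5]
print(p)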
                                               paths  \
0  /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....   
1  /home/msmith/luna16/subset4/1.3.6.1.4.1.14519....   
2  /home/msmith/luna16/subset4/1.3.6.1.4.1.14519....   
3  /home/msmith/luna16/subset4/1.3.6.1.4.1.14519....   
4  /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....   

                                           seriesuid  
0  9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  
1  1.3.6.1.4.1.14519.5.2.1.6279.6001.211071908915...  
2  1.3.6.1.4.1.14519.5.2.1.6279.6001.390009458146...  
3  1.3.6.1.4.1.14519.5.2.1.6279.6001.463214953282...  
4  9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...

print(pd.merge(candidates, p, on=['seriesuid']))
                                           seriesuid  coordX  coordY  coordZ  \
0  9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  -56.08  -67.85 -311.92   
1  9.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  -56.08  -67.85 -311.92   

   class                                              paths  
0      0  /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....  
1      0  /home/msmith/luna16/subset4/9.3.6.1.4.1.14519....
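The merge above is an inner join, so candidates with no matching path are dropped, and because .raw was removed before the split, the paths column no longer carries the extension. If the goal is to keep every candidate row together with the full .raw path, a dictionary lookup with Series.map is one alternative. This is a sketch rather than the method used above; the shortened ids, the paths, and the name path_by_uid are hypothetical.

```python
import os
import pandas as pd

# Hypothetical stand-ins with shortened ids and differing directory depths.
full_paths = [
    "/home/user/data/subset0/1.2.3.100.raw",
    "/user/data/subset1/1.2.3.200.raw",
]
candidates = pd.DataFrame({
    "seriesuid": ["1.2.3.100", "1.2.3.100", "1.2.3.999"],
    "coordX": [-56.08, 53.21, 103.66],
})

# Key each path by its file name without the final extension.
# basename ignores directory depth; splitext splits only at the last dot,
# so the dots inside the id are preserved.
# If two paths share a file name, the later one wins.
path_by_uid = {
    os.path.splitext(os.path.basename(path))[0]: path
    for path in full_paths
}

# map keeps every row; ids with no matching path become NaN.
candidates["seriesuidFullPaths"] = candidates["seriesuid"].map(path_by_uid)

print(candidates)
```

An equivalent result can be had with pd.merge(..., how="left") on a frame that keeps the untouched path in its own column; the dictionary version simply avoids building the intermediate frame.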
