Pandas: "list index out of range" error when extracting items with np.where

pandas numpy

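I want to use np.where to extract one item from two columns. The DataFrame looks like this (100,000+ rows in total):

Added note: "eNBID" is not always the third part of "ID"; the data is very dirty.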

ID                eNBID
460-00-2354-9     2354
4600023549        2354
46001368511       6789
4600332783112     32783

The result I want is:

       ID         eNBID     CI
460-00-2354-9     2354       9
4600023549        2354       9
46001368511       6789       11
4600332783112     32783      112

My code is:

import re
import numpy as np

df['Ci'] = np.where(df['ID'].astype(str).str.contains(r'-', na=False, regex=True),
                    df['ID'].apply(lambda x: re.split('-', str(x))[-1]),
                    df.apply(lambda x: re.findall(r"([\w]{5})" + r"([\w]{%d})" % (len(str(x.eNBID))) + r"(\w*)", str(x.ID))[0][1], axis=1))

The error is:

IndexError: ('list index out of range', 'occurred at index 0')
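
One plausible reading of this error, assuming the data matches the sample above: np.where does not short-circuit. Both candidate arrays are built for every row before the condition is applied, so the re.findall branch also runs on hyphenated IDs, finds nothing there, and [0] fails. The sketch below keeps np.where but makes the regex branch tolerate empty results. The helper name trailing_group is hypothetical, and taking group index 2 (the trailing group) rather than index 1 is an assumption about which part is meant to be CI.

```python
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["460-00-2354-9", "4600023549", "46001368511", "4600332783112"],
    "eNBID": [2354, 2354, 6789, 32783],
})

def trailing_group(row):
    """Hypothetical helper: regex branch that tolerates rows with no match."""
    pattern = r"(\w{5})(\w{%d})(\w*)" % len(str(row["eNBID"]))
    matches = re.findall(pattern, str(row["ID"]))
    # Hyphenated IDs have no run of 5 + len(eNBID) word characters, so
    # matches is empty there; return None instead of indexing into it.
    return matches[0][2] if matches else None

has_hyphen = df["ID"].astype(str).str.contains("-", na=False, regex=False)
after_hyphen = df["ID"].astype(str).str.split("-").str[-1]
by_regex = df.apply(trailing_group, axis=1)  # still computed for every row

df["CI"] = np.where(has_hyphen, after_hyphen, by_regex)
ci_values = df["CI"].tolist()
```

Because both branches now cover every row and have the same length, np.where can choose between them without raising.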

Here is my new code:

cond = df['ID'].astype(str).str.contains('-', na=False, regex=True)
df['CI'] = np.where(cond, df['ID'].apply(lambda x: re.split('-', str(x))[-1]),
                    df[~cond].apply(lambda x: re.findall(r'([\w]{5})' + r'([\w]{%d})' % (len(str(x.eNBID))) + r'(\w*)', str(x.ID))[0][1], axis=1))

If a solution in R is acceptable, here is one:
data$CI = sapply(1:nrow(data),function(x){
  gsub(paste0(".*",data$eNBID[x],"-?"),"",data$ID[x])
})

             ID eNBID  CI
1 460-00-2354-9  2354   9
2    4600023549  2354   9
3   46001368511 36851   1
4 4600332783112 32783 112

We remove every character up to and including eNBID, plus an optional - character.

Data:
data = read.table(textConnection(" 
460-00-2354-9     2354
                                 4600023549        2354
                                 46001368511       36851
                                 4600332783112     32783"),stringsAsFactors=FALSE)
names(data)=c("ID","eNBID")

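Try this:
# Drop the hyphens so every ID has the same layout
df['s'] = df['ID'].replace('-', '', regex=True)
# Skip the 5-character prefix and the eNBID-length field; the rest is Ci
df['Ci'] = df.apply(lambda x: x['s'][(5 + len(str(x.eNBID))):], axis=1)
df.drop('s', axis=1, inplace=True)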

Output:
     ID            eNBID    Ci
0   460-00-2354-9   2354    9
1   4600023549      2354    9
2   46001368511     6789    11
3   4600332783112   32783   112
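
The same length-based slicing can be sketched without the temporary column or a row-wise apply. This is only a variation on the answer above, not a different method. The constant PREFIX_LEN is a name made up for illustration, and its value of 5 is the answer's assumption that five characters always precede the eNBID-sized field.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["460-00-2354-9", "4600023549", "46001368511", "4600332783112"],
    "eNBID": [2354, 2354, 6789, 32783],
})

PREFIX_LEN = 5  # assumption taken from the answer: five characters precede eNBID

# Remove hyphens so both ID formats share one layout.
digits = df["ID"].astype(str).str.replace("-", "", regex=False)
# Number of leading characters to skip in each row.
skip = PREFIX_LEN + df["eNBID"].astype(str).str.len()

# Slice each ID past the prefix and the eNBID-sized field; what is left is Ci.
df["Ci"] = [s[n:] for s, n in zip(digits, skip)]
ci_values = df["Ci"].tolist()
```

Only the length of eNBID is used, never its value, so the row where 6789 does not occur in the ID is still handled.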

              ID  eNBID   CI
0  460-00-2354-9   2354    9
1     4600023549   2354    9
2    46001368511  36851    1
3  4600332783112  32783  112

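With re and np.where, your logic was almost there:
import re
import numpy as np

# Hyphenated rows: digits after "<eNBID>-"; other rows: digits right after eNBID
df['CI'] = np.where(df['ID'].str.contains('-'),
                    df.apply(lambda x: re.findall(rf'(?<={x.eNBID}\-)(\d+)', x['ID']), axis=1),
                    df.apply(lambda x: re.findall(rf'(?<={x.eNBID})(\d+)', x['ID']), axis=1))

# findall returns a list per row; join it into a single string
df['CI'] = df['CI'].str.join('')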
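
Since the two branches differ only by the hyphen, they could probably be folded into one pattern with an optional hyphen, which removes the need for np.where. The sketch below is an illustration rather than the answerer's code. The helper name digits_after_enbid is hypothetical, and re.escape is added on the assumption that eNBID might contain characters with a special meaning in a regex.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "ID": ["460-00-2354-9", "4600023549", "46001368511", "4600332783112"],
    "eNBID": [2354, 2354, 36851, 32783],
})

def digits_after_enbid(id_value, enbid):
    """Hypothetical helper: digits following eNBID, with an optional hyphen between."""
    match = re.search(re.escape(str(enbid)) + r"-?(\d+)", str(id_value))
    # Empty string when eNBID is not found, like joining an empty findall result.
    return match.group(1) if match else ""

df["CI"] = [digits_after_enbid(i, e) for i, e in zip(df["ID"], df["eNBID"])]
ci_values = df["CI"].tolist()

# With the question's updated data, 6789 never occurs in the ID, so nothing is extracted.
missing = digits_after_enbid("46001368511", 6789)
```

Like the answer above, this depends on eNBID actually appearing inside ID. For rows where it does not, only a length-based rule can produce a value.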

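@Erfan I am assuming CI is the digits found in ID after eNBID (at least that is consistent with the example provided).

Sorry, this is my fault, I did not make it clear: "eNBID" is not always the third part of "ID". The data is very dirty, and only the length of "eNBID" can be used.

Thank you very much. Because the data is very dirty, "eNBID" is not always the third part of "ID", so your code only works under specific conditions.

Can you update your data to represent all scenarios? We can only work with the data you provide.

In my solution it does not matter where "eNBID" is. I am splitting ID on eNBID and taking the part that follows.

Sorry, this is my fault. I can figure out your code, and I have just updated my data demo.

What is the logic in the third row? I cannot find it. 6789 is not part of the ID at all.

Dirty data has no logic; only the length of "eNBID" can be used.

Great. If this helped you, don't forget :)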