Python: remove information duplicated in a column from a file

python pandas


I need to clean up a DataFrame by removing duplicated information. For example:

    name                                       strength
770 Vitamin B12 Tab 500mcg                     500 mcg
771 Vitamin B12 Tab 5mcg                       5 mcg
772 Vitamin B12 Tablets 250mcg                 250 mcg
773 Vitamin B12-folic Acid                     None
774 Vitamin B6 & B12 With Folic Acid           None
775 Vitamin Deficiency Injectable System - B12 None
776 Vitamine 110 Liq                           None
777 Vitamine B-12 Tab 100mcg                   100 mcg
778 Vitamine B12 25 Mcg - Tablet               25 mcg
779 Vitamine B12 250mcg                        250 mcg

From the first column, name, I need to remove the information that already appears in strength, i.e.:

    name                                       strength
770 Vitamin B12 Tab                            500 mcg
771 Vitamin B12 Tab                            5 mcg
772 Vitamin B12 Tablets                        250 mcg
773 Vitamin B12-folic Acid                     None
774 Vitamin B6 & B12 With Folic Acid           None
775 Vitamin Deficiency Injectable System - B12 None
776 Vitamine 110 Liq                           None
777 Vitamine B-12 Tab                          100 mcg
778 Vitamine B12 - Tablet                      25 mcg
779 Vitamine B12                               250 mcg

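Note that the strength as written in name may not correspond exactly to the strength in the strength column, down to the whitespace (500mcg vs. 500 mcg).

My naive solution is to loop over all possible strength combinations and, if there is a match in the name column, replace it with an empty string: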

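new_df = []

# Iterate over the values of the name column (iterating over df itself
# would yield the column labels, not the rows).
for i in df['name']:
    for j in df['strength'].dropna().drop_duplicates().tolist():
        # Compare each whitespace-separated token of the name with the strength.
        # Note: a strength that contains a space, such as '500 mcg', can never
        # equal a single token, so only space-free variants can match here.
        for k in i.split():
            if j == k:
                new_df.append((i, i.replace(j, '')))

print(new_df)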

It does work, but I have a lot of data, and this is the least Pythonic and least efficient way to do it.

Any suggestions?

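I probably wouldn't go through all possible strength combinations. Since the items in the two columns seem to contain roughly the same characters, a fuzzy search of the name column using the strength column may be enough.

You can search case-insensitively for each item with and without whitespace, and you will probably catch most of them.

A case-insensitive search can be done with regular expressions in Python: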

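import re

# Case-insensitive, with the whitespace removed from the pattern
if re.search('5 mcg'.replace(" ", ""), 'Vitamin B12 Tab 5mcg', re.IGNORECASE):
    print('matched without whitespace')  # this condition is True
# Case-insensitive, with the whitespace kept in the pattern
elif re.search('25 mcg', 'Vitamine B12 25 Mcg - Tablet', re.IGNORECASE):
    print('matched with whitespace')  # this condition is also True on its own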

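Of course, replace the literal text with variables.

Edit: there is probably a more efficient way to do this with regular expressions, so if someone is more proficient with regex, I would be happy to learn it.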
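To make "replace the literal text with variables" concrete, here is a minimal sketch that runs the same case-insensitive search once per row, trying the strength both as written and with its whitespace removed. The helper name find_strength_in_name and the sample rows are assumptions made for illustration; they are not part of the original answer.

```python
import re

def find_strength_in_name(name, strength):
    """Return the text in `name` that matches `strength`, or None.

    Tries the strength as written ("25 mcg") and with its whitespace
    removed ("25mcg"), ignoring case in both attempts.
    """
    if strength is None:
        return None
    for candidate in (strength, strength.replace(" ", "")):
        # re.escape treats the strength as literal text, not as a pattern.
        match = re.search(re.escape(candidate), name, re.IGNORECASE)
        if match:
            return match.group(0)
    return None

# Hypothetical sample rows modelled on the question's table.
rows = [
    ("Vitamin B12 Tab 5mcg", "5 mcg"),
    ("Vitamine B12 25 Mcg - Tablet", "25 mcg"),
    ("Vitamine 110 Liq", None),
]
found = [find_strength_in_name(name, strength) for name, strength in rows]
print(found)
```

Each name is compared only with the strength of its own row, so the work grows linearly with the number of rows rather than with the number of distinct strengths.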


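new_df = []
# First keep only the rows whose strength is not None.
df = df[df['strength'].notna()].copy()
# Split each name into a list of whitespace-separated tokens.
df['name'] = df['name'].str.split()
for i in df['name']:
    for j in df['strength']:
        # Drop the token if it is exactly equal to a strength value.
        if j in i:
            i.remove(j)
    new_df.append(' '.join(i))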


This may be a better approach. First, we reduce your data and drop one for loop, which brings the complexity of the code down from O(n^3) to O(n^2).

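Assumption: the strength pattern is always "number + space (optional) + mcg". If needed, there are ways to generalize it.

You can use regex and df.apply.

First, define the pattern you want to use. Then apply it to the name column, as shown in the code below: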

import re
import pandas as pd

# Create a DataFrame for testing
df = pd.DataFrame({
    "name": ["Vitamin B12 500 MCG tab",
             "Vitamin Deficiency Injectable System - B12",
             "Vitamin Deficiency Injectable System - B12 25 mcg"],
    "strength": ["500 mcg", None, "25 mcg"],
})

# Compile the pattern we are looking for: digits, an optional space, then "mcg"
p = re.compile(r'\d+\s?mcg', re.IGNORECASE)

# Replace the name column with the cleaned values
df["name"] = df["name"].apply(lambda x: p.sub('', x))
print(df)

More information about df.apply and about using regular expressions with Python is available in the pandas and Python documentation.
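The "number + optional space + mcg" assumption can be relaxed a little. The sketch below is one possible generalization, not the answer's own code: it accepts a small, assumed list of units (mcg, mg, ml, iu, g) and optional decimals, and it uses the vectorized Series.str.replace instead of df.apply. The unit list and the sample names are illustrative assumptions.

```python
import re
import pandas as pd

# Assumed unit list, for illustration only; extend it to fit the real data.
UNITS = ["mcg", "mg", "ml", "iu", "g"]
# number (optionally decimal) + optional space + unit, matched as a whole token
PATTERN = r"\b\d+(?:\.\d+)?\s?(?:" + "|".join(UNITS) + r")\b"

df = pd.DataFrame({
    "name": ["Vitamin B12 500 MCG tab", "Vitamin C Tab 1.5g", "Vitamine 110 Liq"],
})

cleaned = (
    df["name"]
    .str.replace(PATTERN, "", flags=re.IGNORECASE, regex=True)
    .str.replace(r"\s+", " ", regex=True)  # collapse the gap left behind
    .str.strip()
)
print(cleaned.tolist())
```

The leading word boundary keeps the digits in "B12" from being treated as the start of a strength, and a bare number such as "110" is left alone because no unit follows it.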


Using the re package to remove the unwanted redundant strings, together with the apply function over the rows of the DataFrame, should do the job.

In the code below you can see a possible solution:

import pandas as pd
import re

def remove_redundant_data(row):
    # Rows without a strength have nothing to strip from the name.
    if pd.isna(row["strength"]):
        return row["name"]
    # Let the space inside the strength ("500 mcg") be optional in the name ("500mcg").
    pattern = r"\s?".join(re.escape(part) for part in row["strength"].split(" "))
    # Also consume one trailing space, ignore case, and trim the result.
    return re.sub(pattern + r"\s?", "", row["name"], flags=re.IGNORECASE).strip()

df = pd.DataFrame({
    "name": ["Vitamin B12 Tab 500mcg", "Vitamin B12 Tab 5mcg", "Vitamin B12 Tablets 250mcg",
             "Vitamin B12-folic Acid", "Vitamin B6 & B12 With Folic Acid",
             "Vitamin Deficiency Injectable System - B12", "Vitamine 110 Liq",
             "Vitamine B-12 Tab 100mcg", "Vitamine B12 25 Mcg - Tablet", "Vitamine B12 250mcg"],
    "strength": ["500 mcg", "5 mcg", "250 mcg", None, None, None, None,
                 "100 mcg", "25 mcg", "250 mcg"],
})

df["name"] = df.apply(remove_redundant_data, axis=1)

The output DataFrame is then:

>>> df
                                         name strength
0                             Vitamin B12 Tab  500 mcg
1                             Vitamin B12 Tab    5 mcg
2                         Vitamin B12 Tablets  250 mcg
3                      Vitamin B12-folic Acid     None
4            Vitamin B6 & B12 With Folic Acid     None
5  Vitamin Deficiency Injectable System - B12     None
6                            Vitamine 110 Liq     None
7                           Vitamine B-12 Tab  100 mcg
8                       Vitamine B12 - Tablet   25 mcg
9                                Vitamine B12  250 mcg

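This way, you use the strength column to find the redundant string in the name column and remove it, taking into account that the redundant string may be written without a space.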
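One edge case may be worth checking on real data: a pattern built from a short strength such as "5 mcg" can also match the tail of a longer number such as "25mcg". The sketch below is a variation under that assumption, not the answer's own code; it adds a negative lookbehind for digits and collapses any leftover double spaces. The function name strip_strength and the sample data are hypothetical.

```python
import re
import pandas as pd

def strip_strength(name, strength):
    """Remove `strength` from `name`, tolerating a missing space and case."""
    if pd.isna(strength):
        return name
    core = r"\s?".join(re.escape(part) for part in strength.split())
    # (?<!\d) stops "5 mcg" from matching the tail of "25mcg";
    # \b stops the unit from matching the start of a longer word.
    pattern = r"(?<!\d)" + core + r"\b"
    cleaned = re.sub(pattern, "", name, flags=re.IGNORECASE)
    # Collapse the whitespace left where the strength used to be.
    return re.sub(r"\s+", " ", cleaned).strip()

# Hypothetical rows; the first pairs a name with a deliberately mismatched strength.
df = pd.DataFrame({
    "name": ["Vitamin B12 Tab 25mcg", "Vitamine B12 25 Mcg - Tablet", "Vitamine 110 Liq"],
    "strength": ["5 mcg", "25 mcg", None],
})
df["name"] = [strip_strength(n, s) for n, s in zip(df["name"], df["strength"])]
print(df["name"].tolist())
```

Iterating over the two columns with zip avoids building a row object for every record, which tends to be lighter than df.apply with axis=1 on large frames.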
