Python: assigning a label, all values are False


python pandas


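I am having trouble assigning a label depending on whether a condition is met. Specifically, I want to assign False (or 0) to rows that contain at least one of these words

my_list=["maths", "science", "geography", "statistics"]

in one of these fields:

path | Subject | Notes

and to look for these websites in the web column:

webs=["www.stanford.edu", "www.ucl.ac.uk", "www.sorbonne-universite.fr"]

To do this, I am using the following code: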

  def part_is_in(x, values):
        output = False
        for val in values:
            if val in str(x):
                return True
                break                
        return output


  def assign_value(filename):
    my_list=["maths", "", "science", "geography", "statistics"]
  

    filename['Label'] = filename[['path','subject','notes']].apply(part_is_in, values= my_list)
    filename['Low_Subject']=filename['Subject']
    filename['Low_Notes']=filename['Notes']
    lower_cols = [col for col in filename if col not in ['Subject','Notes']]
    filename[lower_cols]= filename[lower_cols].apply(lambda x: x.astype(str).str.lower(),axis=1)
    webs=["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]

# NEW COLUMN # this is still inside the function but I cannot add an indent within this post

filename['Label'] = pd.Series(index = filename.index, dtype='object')

for index, row in filename.iterrows():
        value = row['web']

        if any(x in str(value) for x in webs):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False

for index, row in filename.iterrows():
        value = row['Subject']

        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False

for index, row in filename.iterrows():
        value = row['Notes']

        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
            
for index, row in filename.iterrows():
        value = row['path']

        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
            
return(filename)

My dataset is

web                        path         Subject                Notes
www.stanford.edu        /maths/           NA                    NA
www.ucla.com           /history/        History of Egypt        NA
www.kcl.ac.uk         /datascience/     Data Science            50 students
...

The expected output is:

web                        path         Subject                Notes           Label
www.stanford.edu        /maths/           NA                    NA               1    # contains the web and maths
www.ucla.com           /history/        History of Egypt        NA               0    
www.kcl.ac.uk         /datascience/     Data Science            50 students      1    # contains the word science
...

With my code, every value comes out as False. Can you spot the problem?

The final values in Label are of type bool. If int is needed, use df.Label = df.Label.astype(int).


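The test_words function does the following:

- Fills all NaNs, which are of type float, with '', which is of type str
- Converts all words to lowercase
- Replaces every / with ' '
- Splits on ' ' to create a list
- Combines all the lists into a single set
- Uses set methods to determine whether the row contains a word from my_list

{'datascience'}.intersection({'science'}) returns an empty set, because there is no intersection.
{'data', 'science'}.intersection({'science'}) returns {'science'}, because the sets intersect on that word.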
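The word-splitting and intersection steps can be tried on plain strings before involving pandas. The following is a minimal sketch; the helper name words is hypothetical and only mirrors the lowercase, replace and split steps described above.

```python
my_list = {"maths", "science", "geography", "statistics"}


def words(text: str) -> set:
    """Hypothetical helper: lowercase, turn '/' into spaces, split on ' ', drop empty strings."""
    return {w for w in text.lower().replace("/", " ").split(" ") if w}


path_words = words("/datascience/")    # a single token, so no whole-word match
subject_words = words("Data Science")  # two tokens, one of them is in my_list

print(path_words & my_list)     # set()
print(subject_words & my_list)  # {'science'}
```

The & operator is equivalent to calling intersection on two sets. Because whole tokens are compared, "datascience" does not count as containing "science" under this approach.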






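lambda x: any(x in y for y in webs) works as follows:

- For each value in webs, it checks whether the web entry is contained in that value; for example, 'www.stanford.edu' in 'https://www.stanford.edu' is True
- It evaluates to True if any of those checks is True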
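The direction of the substring test matters here, because the entries in webs carry a scheme prefix while the web column does not. A small sketch, assuming the webs list from the question's code; the function names in_any_web and reversed_check are hypothetical.

```python
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]


def in_any_web(x: str) -> bool:
    """True if the bare host x is a substring of at least one listed URL."""
    return any(x in y for y in webs)


def reversed_check(x: str) -> bool:
    """The opposite direction: is a full listed URL a substring of the bare host?"""
    return any(y in x for y in webs)


print(in_any_web("www.stanford.edu"))      # True
print(in_any_web("www.ucla.com"))          # False
print(reversed_check("www.stanford.edu"))  # False, the 'https://' prefix is not in the host
```

The reversed direction is what the loops in the question use (x in str(value) for x in webs), so with these prefixed URLs that check cannot succeed for the sample data.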


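import numpy as np
import pandas as pd

# test data and dataframe
data = {'web': ['www.stanford.edu', 'www.ucla.com', 'www.kcl.ac.uk'],
        'path': ['/maths/', '/history/', '/datascience/'],
        'Subject': [np.nan, 'History of Egypt', 'Data Science'],
        'Notes': [np.nan, np.nan, '50 students']}
df = pd.DataFrame(data)

# given my_list
my_list = ["maths", "science", "geography", "statistics"]
my_list = set(map(str.lower, my_list))  # convert to a set and make sure the words are lowercase

# given webs; all values should be lowercase
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]

# function to test for word content
def test_words(v: pd.Series) -> bool:
    v = v.fillna('').str.lower().str.replace('/', ' ').str.split(' ')  # replace NaN, lowercase, convert to lists
    s_set = {st for row in v for st in row if st}  # combine all the values in the lists into one set
    return True if s_set.intersection(my_list) else False  # True if the sets share at least one word

# test the conditions on the word columns and on the web column
df['Label'] = df[['path', 'Subject', 'Notes']].apply(test_words, axis=1) | df.web.apply(lambda x: any(x in y for y in webs))

# display(df)
                web           path           Subject        Notes  Label
0  www.stanford.edu        /maths/               NaN          NaN   True
1      www.ucla.com      /history/  History of Egypt          NaN  False
2     www.kcl.ac.uk  /datascience/      Data Science  50 students   True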
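The expected output in the question shows 1 and 0 rather than True and False. A minimal sketch of the astype(int) conversion, using a stand-in Series with the same three boolean values as the Label column:

```python
import pandas as pd

# Stand-in for the boolean Label column of the three sample rows
label = pd.Series([True, False, True], name="Label")

# bool -> int maps True to 1 and False to 0
label_int = label.astype(int)

print(label_int.tolist())  # [1, 0, 1]
```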

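Notes on the original code

Using iterrows multiple times is not a good idea. For a large dataset it will be very time consuming and error prone.

It was easier to write a new function than to interpret the different code blocks for each column.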
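As one possible way to avoid the repeated iterrows loops, the per-column checks can be computed separately and combined with a logical OR, so that a later column never overwrites a match found in an earlier one. This is a sketch under assumptions, not the method used above: it swaps in substring matching via Series.str.contains, and the names pattern, word_cols, word_hit and web_hit are made up for illustration.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "web": ["www.stanford.edu", "www.ucla.com", "www.kcl.ac.uk"],
    "path": ["/maths/", "/history/", "/datascience/"],
    "Subject": [np.nan, "History of Egypt", "Data Science"],
    "Notes": [np.nan, np.nan, "50 students"],
})
my_list = ["maths", "science", "geography", "statistics"]
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]

# One regex alternation of the escaped words: "maths|science|geography|statistics"
pattern = "|".join(map(re.escape, my_list))

# Case-insensitive substring match in each text column; NaN counts as no match
word_cols = ["path", "Subject", "Notes"]
word_hit = df[word_cols].apply(
    lambda col: col.str.contains(pattern, case=False, na=False)
).any(axis=1)

# Same web check as in the answer: is the host part of any listed URL?
web_hit = df["web"].apply(lambda x: any(x in y for y in webs))

# OR the two conditions together and store the result as 1/0
df["Label"] = (word_hit | web_hit).astype(int)

print(df["Label"].tolist())  # [1, 0, 1]
```

Because this matches substrings, "/datascience/" counts as a hit for "science" here, which differs from the whole-word set intersection in test_words. For the sample data both approaches give the same labels.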


Thanks @Trenton. I noticed that later.