Python 3.x: categorizing a 'string' column based on the words it contains, using predefined categories from another column
I have a pandas column with email domains, like this:
Sno  Domain_IDs
1    herowire.com
2    xyzenerergy.com
3    financial.com
4    oo-loans.com
5    okwire.com
6    cleaneneregy.com
7    pop-advisors.com
and so on.
I have the following categories in a separate DataFrame:
Sno  category
1    contains wire
2    contains energy
3    contains loans
4    contains advisors
I want to create a DataFrame that categorizes the data as follows:
Sno  Domain_IDs        category
1    herowire.com      contains wire
2    xyzenerergy.com   contains energy
3    financial.com     others
4    oo-loans.com      contains loans
5    okwire.com        contains wire
6    cleaneneregy.com  contains energy
7    pop-advisors.com  contains advisors
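I tried using a lambda function, and a standard loop with if-else statements, together with a
"emailAddress.str.contains('wire')"
contains clause, but I get the following error:
AttributeError: 'str' object has no attribute 'str'
Somehow I cannot parse the text of a single row in the DataFrame. Please help.
Find the pattern in the domain, extract it, and create the category.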
lst = ["wire", "energy", "loans", "advisors"]

def fun(a):
    # Return the first keyword found in the domain, or "others" if none match
    for i in lst:
        if i in a:
            return i
    return "others"

df["category"] = df.Domain_IDs.apply(fun)
df
   Sno  Domain_IDs        category
0  1    herowire.com      wire
1  2    xyzenenergy.com   energy
2  3    financial.com     others
3  4    oo-loans.com      loans
4  5    okwire.com        wire
5  6    cleanenergy.com   energy
6  7    pop-advisors.com  advisors
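On the AttributeError from the question: inside apply, the lambda receives each cell as a plain Python str, and the .str accessor exists only on a pandas Series, not on individual strings. The following is a minimal sketch of the two forms that do work, using a small hypothetical frame that only mirrors the question's column name:

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question
df = pd.DataFrame({"Domain_IDs": ["herowire.com", "financial.com", "okwire.com"]})

# Element-wise: each x is a plain str, so use the `in` operator
elementwise = df["Domain_IDs"].apply(lambda x: "wire" in x)

# Vectorized: .str is called on the Series itself, not on each string
vectorized = df["Domain_IDs"].str.contains("wire", regex=False)

# Calling .str on a single string reproduces the error from the question
try:
    "herowire.com".str.contains("wire")
except AttributeError as exc:
    error_message = str(exc)
```

Either form yields a boolean Series that can drive the categorization; the vectorized one avoids a Python-level loop.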
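# Build an alternation pattern from the last word of each category, e.g. '(wire|energy|loans|advisors)'
pat = '(' + '|'.join(cat['Sno category'].str.split().str[-1]) + ')'
# expand=False makes extract return a Series, which fits a single column
df['category'] = ('contains ' + df['Domain_IDs'].str.extract(pat, expand=False)).fillna('other')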
   Sno  Domain_IDs        category
0  1    herowire.com      contains wire
1  2    xyzenenergy.com   contains energy
2  3    financial.com     other
3  4    oo-loans.com      contains loans
4  5    okwire.com        contains wire
5  6    cleaneneregy.com  other
6  7    pop-advisors.com  contains advisors
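For reference, here is a self-contained sketch of the same extract approach. It assumes the categories frame has a single column named "category" (a guess at the layout, since the original frame is not shown) and escapes each keyword with re.escape in case one ever contains regex metacharacters. The sample rows are illustrative.

```python
import re
import pandas as pd

# Assumed layout: one "category" column holding phrases such as "contains wire"
cat = pd.DataFrame({"category": ["contains wire", "contains energy",
                                 "contains loans", "contains advisors"]})
df = pd.DataFrame({"Domain_IDs": ["herowire.com", "financial.com", "oo-loans.com",
                                  "cleaneneregy.com", "wire-loans.com"]})

# Keep the last word of each phrase as the keyword to search for
keywords = cat["category"].str.split().str[-1]
pat = "(" + "|".join(re.escape(k) for k in keywords) + ")"

# extract keeps the leftmost keyword found in each domain; non-matches become NaN
found = df["Domain_IDs"].str.extract(pat, expand=False)
df["category"] = ("contains " + found).fillna("other")
```

Because the regex search stops at the leftmost match, a domain such as wire-loans.com is labeled with whichever keyword appears first in the string, and a misspelled domain such as cleaneneregy.com falls through to "other".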
This solution allows a domain to match multiple categories:
import pandas as pd

categories = pd.DataFrame({"category": ["wire", "energy", "loans", "advisors"]})
domains = pd.DataFrame({"Sno": list(range(1, 10)),
                        "Domain_IDs": [
                            "herowire.com",
                            "xyzenergy.com",
                            "financial.com",
                            "oo-loans.com",
                            "okwire.com",
                            "cleanenergy.com",
                            "pop-advisors.com",
                            "energy-advisors.com",
                            "wire-loans.com"]})
# A shared constant key makes the merge pair every category with every domain (a cross join)
categories["common"] = 0
domains["common"] = 0
possibilities = pd.merge(categories, domains, how="outer")
# Flag the pairs where the category word occurs in the domain
possibilities["satisfied"] = possibilities.apply(lambda row: row["category"] in row["Domain_IDs"], axis=1)
Then filter to keep only the rows where the category matches:
possibilities[possibilities["satisfied"]]
which gives:
    category  common  Domain_IDs           Sno  satisfied
0   wire      0       herowire.com         1    True
4   wire      0       okwire.com           5    True
8   wire      0       wire-loans.com       9    True
10  energy    0       xyzenergy.com        2    True
14  energy    0       cleanenergy.com      6    True
16  energy    0       energy-advisors.com  8    True
21  loans     0       oo-loans.com         4    True
26  loans     0       wire-loans.com       9    True
33  advisors  0       pop-advisors.com     7    True
34  advisors  0       energy-advisors.com  8    True
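The filtered table above has one row per match and leaves out domains with no match at all. One possible way to fold it back into one row per domain is sketched below. It is an illustration, not part of the original answer, and it assumes pandas 1.2 or later for how="cross", which replaces the constant "common" key. The three sample rows are illustrative.

```python
import pandas as pd

categories = pd.DataFrame({"category": ["wire", "energy", "loans", "advisors"]})
domains = pd.DataFrame({"Sno": [1, 2, 3],
                        "Domain_IDs": ["herowire.com", "financial.com", "wire-loans.com"]})

# Pair every domain with every category (needs pandas >= 1.2)
possibilities = domains.merge(categories, how="cross")
possibilities["satisfied"] = [c in d for c, d in
                              zip(possibilities["category"], possibilities["Domain_IDs"])]

# Join all matching categories of each domain into one comma-separated string
matched = (possibilities[possibilities["satisfied"]]
           .groupby("Sno")["category"]
           .agg(", ".join))

# Domains without any match are absent from `matched` and get "others"
result = domains.copy()
result["category"] = result["Sno"].map(matched).fillna("others")
```

Here wire-loans.com ends up with both of its categories in one cell, while financial.com is labeled "others".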
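What about wire-loans.com or energy-advisors.com? Do they get only one category? If so, which one?
I think whichever word appears first works for me. It can be either category; there is no such distinction. Still, I would love to see how I could distinguish between them in cases like these.
Thanks @SourceSimian
Thanks for the answer. I have one more question, if you are willing to answer it. How could we use NLP to create the categories themselves? For example, one category not mentioned above is "services"; how could that category be created automatically based on how often it is seen in the domains? Any ideas?
@KshitijYadav, that is a completely different question. You will get better answers if you post a new question for that.
Hi @Vaishali, could you help me with this question:
So this method cannot handle multiple categories.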