Python 3.x: Categorizing a 'string' column based on the words of text in it, using predefined categories from another column (python-3.x, pandas, dataframe, data-science, categories)

I have a pandas column with email domains, something like this:

Sno  Domain_IDs
1   herowire.com
2   xyzenerergy.com
3   financial.com
4   oo-loans.com
5   okwire.com
6   cleaneneregy.com
7   pop-advisors.com
and so on.

I have the following categories in a separate dataframe:

Sno category
1   contains wire
2   contains energy
3   contains loans
4   contains advisors
I want to create a dataframe that categorizes the data as follows:

Sno Domain_IDS         category
1   herowire.com       contains wire
2   xyzenerergy.com    contains energy
3   financial.com      others
4   oo-loans.com       contains loans
5   okwire.com         contains wire
6   cleaneneregy.com   contains energy
7   pop-advisors.com   contains advisors
I tried using a lambda function, and also a standard loop with "if-else" statements, together with a

"emailAddress.str.contains('wire')"
contains clause, but I get the following error:

AttributeError: 'str' object has no attribute 'str'

Somehow I am unable to parse the text of a single row in the dataframe. Please help.
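
The error quoted above is consistent with calling `.str` on a single Python string. Inside `apply` or a plain loop, each element of the column is an ordinary `str`, which has no `.str` accessor; that accessor exists only on a pandas Series. A minimal sketch of the difference, assuming a small stand-in frame named `df` (the sample rows are illustrative, not the asker's full data):

```python
import pandas as pd

# Stand-in data for illustration only.
df = pd.DataFrame({"Domain_IDs": ["herowire.com", "financial.com", "oo-loans.com"]})

# Vectorized: .str is an accessor on the whole Series, so this works.
has_wire = df["Domain_IDs"].str.contains("wire", regex=False)

# Element-wise: each x is a plain str, so use the `in` operator instead of x.str.
has_wire_apply = df["Domain_IDs"].apply(lambda x: "wire" in x)

# Calling .str on a single string raises the AttributeError from the question.
try:
    "herowire.com".str.contains("wire")
except AttributeError as exc:
    error_message = str(exc)
```

Both `has_wire` and `has_wire_apply` hold the same boolean mask; only the way the check is expressed differs.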

Find the pattern in the domain, extract it, and create the category

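lst = ["wire", "energy", "loans", "advisors"]

def fun(a):
    # Return the first keyword from lst that occurs in the domain string.
    for i in lst:
        if i in a:
            return i
    # No keyword matched.
    return "others"

# apply passes each domain (a plain str) to fun, so no lambda wrapper is needed.
df["category"] = df.Domain_IDs.apply(fun)
df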

  Sno        Domain_IDs category
0   1      herowire.com     wire
1   2   xyzenenergy.com   energy
2   3     financial.com   others
3   4      oo-loans.com    loans
4   5        okwire.com     wire
5   6   cleanenergy.com   energy
6   7  pop-advisors.com advisors
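# Build a regex such as (wire|energy|loans|advisors) from the last word of each category; use cat['category'] if your frame has a separate category column.
pat = '(' + '|'.join(cat['Sno category'].str.split().str[-1]) + ')'
# expand=False returns a Series; domains with no match become NaN and are filled with 'other'.
df['category'] = ('contains ' + df['Domain_IDs'].str.extract(pat, expand=False)).fillna('other')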

   Sno  Domain_IDs          category
0   1   herowire.com        contains wire
1   2   xyzenenergy.com     contains energy
2   3   financial.com       other
3   4   oo-loans.com        contains loans
4   5   okwire.com          contains wire
5   6   cleaneneregy.com    other
6   7   pop-advisors.com    contains advisors
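
A regex-based extraction keeps only the first keyword found in each domain. If a domain may contain several keywords, one possible variant is `Series.str.findall`, which returns every match per row. The sketch below assumes a stand-in keyword list and frame (`keywords`, `df`, and the column names `first_category` and `all_categories` are illustrative, not from the thread):

```python
import re
import pandas as pd

# Stand-in keywords and data for illustration only.
keywords = ["wire", "energy", "loans", "advisors"]
df = pd.DataFrame({"Domain_IDs": ["herowire.com", "financial.com", "wire-loans.com"]})

# Escape each keyword so it is matched literally, then join into an alternation.
pat = "|".join(re.escape(k) for k in keywords)

# findall returns a list of all keyword matches for each domain, in order of appearance.
matches = df["Domain_IDs"].str.findall(pat)

# The first list element is the keyword that appears earliest in the domain.
df["first_category"] = matches.str[0].fillna("other")

# Join all matches into one string; an empty result means no keyword matched.
df["all_categories"] = matches.str.join(", ").replace("", "other")
```

Here `first_category` answers "which word appears first", while `all_categories` keeps every keyword found in the domain.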

This solution allows multiple categories per domain:

import pandas as pd

categories = pd.DataFrame({"category": ["wire", "energy", "loans", "advisors"]})
domains = pd.DataFrame({"Sno": list(range(1, 10)),
                        "Domain_IDs": [
                            "herowire.com",
                            "xyzenergy.com",
                            "financial.com",
                            "oo-loans.com",
                            "okwire.com",
                            "cleanenergy.com",
                            "pop-advisors.com",
                            "energy-advisors.com",
                            "wire-loans.com"]})    
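# Add a constant key to both frames so that merging on it yields every (category, domain) pair.
categories["common"] = 0
domains["common"] = 0

possibilities = pd.merge(categories, domains, how="outer")
# Flag the pairs in which the category keyword occurs in the domain.
possibilities["satisfied"] = possibilities.apply(lambda row: row["category"] in row["Domain_IDs"], axis=1)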
Then filter to keep only the categories that are satisfied:

possibilities[possibilities["satisfied"]]
which gives:

    category  common           Domain_IDs  Sno satisfied
0       wire       0         herowire.com    1      True
4       wire       0           okwire.com    5      True
8       wire       0       wire-loans.com    9      True
10    energy       0        xyzenergy.com    2      True
14    energy       0      cleanenergy.com    6      True
16    energy       0  energy-advisors.com    8      True
21     loans       0         oo-loans.com    4      True
26     loans       0       wire-loans.com    9      True
33  advisors       0     pop-advisors.com    7      True
34  advisors       0  energy-advisors.com    8      True
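
A table with one row per (category, domain) pair can be collapsed to one row per domain, with unmatched domains labelled explicitly. The following is a rough sketch under the same constant-key merge idea; the names `key`, `pairs`, `per_domain`, and `result`, and the shortened sample data, are assumptions made for illustration:

```python
import pandas as pd

# Stand-in data for illustration only.
categories = pd.DataFrame({"category": ["wire", "energy", "loans", "advisors"]})
domains = pd.DataFrame({"Domain_IDs": ["herowire.com", "financial.com", "wire-loans.com"]})

# Merge on a constant key to get every (category, domain) combination.
pairs = categories.assign(key=0).merge(domains.assign(key=0), on="key")

# Keep only the pairs where the keyword occurs in the domain.
mask = [c in d for c, d in zip(pairs["category"], pairs["Domain_IDs"])]
pairs = pairs[mask]

# Collect the matching categories of each domain into a list.
per_domain = pairs.groupby("Domain_IDs")["category"].agg(list).reset_index()

# Left-merge back so that domains without any match are kept, then label them.
result = domains.merge(per_domain, on="Domain_IDs", how="left")
result["category"] = result["category"].apply(
    lambda v: v if isinstance(v, list) else ["others"]
)
```

Each domain then appears exactly once, with a list holding all of its categories, or `["others"]` when none matched.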

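What about wire-loans.com or energy-advisors.com? Do they get only one category? If so, which one?

I think whichever word appears first works for me. It can be either category; there is no such distinction. That said, I would love to see how I could differentiate between them. Thanks @SourceSimian

Thanks for the answer. If you are willing to answer, I have one more question. How can we use NLP here to create the categories ourselves? For example, one category not mentioned above is "services"; how could that category be created automatically based on the number of times it is seen in the domains? Any ideas?

@KshitijYadav, that is a completely different question. You will get better answers if you post a new question for that

Hi @Vaishali, can you help me with this question:

So this approach cannot handle multiple categories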