Python: duplicate and identify certain rows in a DataFrame with regex

python regex pandas

I have not found a solution to my problem.

I want to use regex to identify and duplicate certain rows in my DataFrame.

For example, my df:

   var1  
0  House A and B 
1  2 garage + garden 
2  fridges

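The result I want in var2 (while also keeping my var1):

I don't know how to do this. I think regex is a good idea, but I can't get it to work. I tried str.contains, but it did not work well.

Thanks for your help.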

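Regular expressions may not be the best tool for this task, but you can write an expression to split the rows apart. How to code the rest, or how to detect plural words (you might need an NLP library), is a different story: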

([A-Za-z]+?)\s([A-Z])(?=\s+and|$)|([0-9]+)?\s+([A-Za-z]*?)(?=\s+\+|$)
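To see what the expression actually captures, here is a minimal sketch that runs it over the three sample strings with `re.finditer`. The helper name `match_groups` is hypothetical and only for illustration; the pattern is copied unchanged from the answer.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(
    r'([A-Za-z]+?)\s([A-Z])(?=\s+and|$)|([0-9]+)?\s+([A-Za-z]*?)(?=\s+\+|$)'
)

def match_groups(text):
    """Hypothetical helper: return the capture groups of every match in text."""
    return [m.groups() for m in PATTERN.finditer(text)]

# Print the groups for each sample so the behaviour can be inspected.
for sample in ['House A and B', '2 garage + garden', 'fridges']:
    print(sample, '->', match_groups(sample))
```

Both alternatives require at least one whitespace character, so a single word such as "fridges" produces no match at all. That fits the caveat above that plurals need separate handling.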

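If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of an online regex tester. If you like, you can also see there how it matches some sample inputs.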

If these three cases are exhaustive, you can use my solution, which combines regex matching and splitting.

import re
import pandas as pd

#the hard part
def my_scan(t):
    #Split
    #only '+' and the whole word 'and' are considered
    #(\b keeps words such as 'Stand' from being split)
    cond = re.findall(r'(.+)(\band\b|\+)(.+)', t)
    if len(cond):
        t = [_.strip() for _ in cond[0]]
    else:
        t = [t]

    #Process
    #Case 1 'House': and
    if 'and' in t:
        t.remove('and')
        #add 'House' to the second element
        t[1] = re.split(' ', t[0])[0]+' '+t[1]

    #Case 2 'Garage + Garden': + with numeral
    elif '+' in t:
        t.remove('+')
        x = []
        ##check for numerals in front
        for _ in t:
            if (re.match(r'^\d+', _)):
                temp = _[(re.match(r'^\d+', _)).end()+1:] #'garage'
                #append by the number of numeral times
                for i in range(int(re.match(r'^\d+', _)[0])):
                    x.append(temp+' '+str(i+1))
            else:
                x.append(_)
        t = x

    #Case 3 'Fridges': single word that ends with an s
    else:
        if (re.match(r'^[A-Za-z]+s$', t[0])):
            t = t[0][:-1]
            t = [t+' 1', t+' 2']

        else:
            t[0] = t[0]+' 1'

    return t

#the easier part
def get_df(t):
    output1 = []
    output2 = []
    for _ in t:
        dummy = my_scan(_)
        for i in range(len(dummy)):
            output1.append(_)
            output2.append(dummy[i])


    df = pd.DataFrame({'var1':output1,'var2':output2})
    return df


#test it
data = {'var1':['House A and B','2 Garage + Garden', 'Fridges']}
df = get_df(data['var1'])
print(df)

#bonus test
data1 = {'var1':['House AZ and PL','5 Garage + 3 Garden', 'Fridge']}
df = get_df(data1['var1'])
print(df)

Printed df output for the original data,

data = {'var1': ['House A and B', '2 Garage + Garden', 'Fridges']}

                var1      var2
0      House A and B   House A
1      House A and B   House B
2  2 Garage + Garden  Garage 1
3  2 Garage + Garden  Garage 2
4  2 Garage + Garden    Garden
5            Fridges  Fridge 1
6            Fridges  Fridge 2

Printed df output for the additional test data,

data1 = {'var1': ['House AZ and PL', '5 Garage + 3 Garden', 'Fridge']}

                   var1      var2
0       House AZ and PL  House AZ
1       House AZ and PL  House PL
2   5 Garage + 3 Garden  Garage 1
3   5 Garage + 3 Garden  Garage 2
4   5 Garage + 3 Garden  Garage 3
5   5 Garage + 3 Garden  Garage 4
6   5 Garage + 3 Garden  Garage 5
7   5 Garage + 3 Garden  Garden 1
8   5 Garage + 3 Garden  Garden 2
9   5 Garage + 3 Garden  Garden 3
10               Fridge  Fridge 1
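
As an alternative to building the two output lists by hand, the row duplication can be left to pandas: map each `var1` value to a list of labels, then call `DataFrame.explode`. The sketch below is not the answer's code. `expand_label` is a hypothetical, compact restatement of the same three cases, and it assumes, like the answer, that a single word ending in "s" means two items.

```python
import re
import pandas as pd

def expand_label(text):
    """Hypothetical helper: turn one var1 value into a list of var2 labels."""
    # Case 1: "House A and B" -> the shared prefix paired with each suffix
    m = re.fullmatch(r'(\w+) (\w+) and (\w+)', text)
    if m:
        prefix, first, second = m.groups()
        return [f'{prefix} {first}', f'{prefix} {second}']
    # Case 2: parts joined by '+', each with an optional leading count
    if '+' in text:
        labels = []
        for part in text.split('+'):
            part = part.strip()
            counted = re.fullmatch(r'(\d+)\s+(.+)', part)
            if counted:
                n, name = int(counted.group(1)), counted.group(2)
                labels.extend(f'{name} {i}' for i in range(1, n + 1))
            else:
                labels.append(part)
        return labels
    # Case 3: a single word ending in 's' is assumed to mean two items
    if re.fullmatch(r'[A-Za-z]+s', text):
        return [f'{text[:-1]} 1', f'{text[:-1]} 2']
    return [f'{text} 1']

df = pd.DataFrame({'var1': ['House A and B', '2 Garage + Garden', 'Fridges']})

# Each row gets a list in var2; explode() then emits one row per list element.
result = (
    df.assign(var2=df['var1'].map(expand_label))
      .explode('var2')
      .reset_index(drop=True)
)
print(result)
```

`explode` repeats the `var1` value for every element of the list, so `var1` is kept alongside `var2` as the question asks. The plural rule is naive: a singular word ending in "s" would also be doubled.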

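Sorry, I don't understand. Is the expected output what is shown in var2, or do you want a combination of var1 and var2 in a new column?

@powerPixie The expected output is what is shown in var2, in the second column, while also keeping var1.

@Emma, in my case, if the word is plural, it means two items.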