Tags: python, pandas, dataframe, replace, split

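Assuming you have read df1 and df2 into DataFrames, the first step is to create two lists: one for Name_Extension (the keys) and one for Company_Type (the values), as shown below:

keys = list(df2['Name_Extension'])
keys = [key.strip().lower() for key in keys]
print(keys)
>>> ['co llc', 'pvt ltd', 'corp', 'co ltd', 'inc', 'co']
values = list(df2['Company_Type'])
values = [value.strip().lower() for value in values]
print(values)
>>> ['company llc', 'private limited', 'corporation', 'company limited', 'incorporated', 'company']
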
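The next step is to map each value in Cleansed_Input to Core_Input and Type_input. We can use the pandas apply method on the Cleansed_Input column.

To get Core_Input:

def get_core_input(data):
    # Preprocess: strip the text and keep a lowercase copy for matching
    text = str(data).strip()
    lowered = text.lower()
    # Check whether the data ends with any of the keys
    for key in keys:
        if lowered.endswith(key):
            # Slice off only the trailing key; split(key) would also cut at earlier occurrences
            return text[:-len(key)].strip()
    return None

df1['Core_Input'] = df1['Cleansed_Input'].apply(get_core_input)
print(df1)
>>>
          Original_Input      Cleansed_Input   Core_Input  Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None         NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None         NaN
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None         NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc  AMG Mdicale         NaN
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC         None         NaN
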
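To get Type_input:

def get_type_input(data):
    # Preprocess
    data = str(data).strip().lower()
    # Check whether the data ends with any of the keys
    for key, value in zip(keys, values):
        if data.endswith(key):
            return value.strip()  # return the value that corresponds to the matching key
    return None

df1['Type_input'] = df1['Cleansed_Input'].apply(get_type_input)
print(df1)
>>>
          Original_Input      Cleansed_Input   Core_Input    Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None          None
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None          None
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None          None
3     A.M.G Médicale Inc     AMG Mdicale Inc  AMG Mdicale  incorporated
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC         None          None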

This is an easy-to-follow solution, but I believe it is not the most efficient way to solve the problem. I hope it covers your use case.
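
One weakness of plain endswith matching is that a short key such as 'co' also matches the tail of an ordinary word. The sketch below is one possible tightening, not part of the answer itself: it tries longer keys first and only accepts a key that forms a whole trailing word. The EXTENSIONS dictionary and the split_extension name are hypothetical and simply mirror the keys and values lists above.

```python
# Hypothetical lookup table mirroring the keys/values lists built earlier.
EXTENSIONS = {
    "co llc": "company llc",
    "pvt ltd": "private limited",
    "corp": "corporation",
    "co ltd": "company limited",
    "inc": "incorporated",
    "co": "company",
}


def split_extension(name, extensions=EXTENSIONS):
    """Return (core, company_type), or (None, None) when nothing matches."""
    text = str(name).strip()
    lowered = text.lower()
    # Try longer keys first so 'co llc' is preferred over a shorter key.
    for key in sorted(extensions, key=len, reverse=True):
        # Accept the key only as a whole trailing word (or words).
        if lowered == key or lowered.endswith(" " + key):
            core = text[: len(text) - len(key)].strip()
            return core, extensions[key]
    return None, None


print(split_extension("AMG Mdicale Inc"))
print(split_extension("TEXACO"))
```

With pandas, a function like this could be applied to the Cleansed_Input column with apply, and the resulting tuples unpacked into the Core_Input and Type_input columns.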

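SA is not defined in df2, but it has been split. Is that expected?
Yes. It is a type. There are 24 rows in df2; the abbreviation SA will also be in df2.
Hey, thanks for your answer. But I still don't get the desired output. It just copies Cleansed_Input into a new column named lower_input. It doesn't split.
Did you preprocess Cleansed_Input to lowercase? Can you share your code?
@DhanalakshmiV I don't see the code you use to find the corresponding extension for each row in your df. Could you add that too? Also, please check my edit, which might help.
Thank you very much, this works for numbers, punctuation and spaces. And I have not stored any co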

df1:

Original_Input           Cleansed_Input        Core_Input    Type_input
TECHNOLOGIES S.A         TECHNOLOGIES SA
A & J INDUSTRIES, LLC    A J INDUSTRIES LLC
A&S DENTAL SERVICES      AS DENTAL SERVICES
A.M.G Médicale Inc       AMG Mdicale Inc
AAREN SCIENTIFIC         AAREN SCIENTIFIC

df2:

Name_Extension     Company_Type      Priority
co llc             Company LLC       2
Pvt ltd            Private Limited   8
Corp               Corporation       4
CO Ltd             Company Limited   3
inc                Incorporated      5
CO                 Company           1

Desired output:

Original_Input           Cleansed_Input        Core_Input        Type_input
TECHNOLOGIES S.A         TECHNOLOGIES SA       TECHNOLOGIES      SA
A & J INDUSTRIES, LLC    A J INDUSTRIES LLC    A J INDUSTRIES    LLC
A&S DENTAL SERVICES      AS DENTAL SERVICES
A.M.G Médicale Inc       AMG Mdicale Inc       AMG Mdicale       Incorporated
AAREN SCIENTIFIC         AAREN SCIENTIFIC

Here is my code:

import pyodbc
import pandas as pd
import string
from string import digits
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import String
from io import StringIO
from itertools import chain
import re

#Connecting SQL with Python

server = '172.16.15.9'
database = 'Database Demo'
username = '**'
password = '******'


engine = create_engine('mssql+pyodbc://**:******@' + server + '/' + database + '?driver=SQL+server')

#Reading SQL table and grouping by columns
data=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
#df1=pd.read_sql('Select * from company_Extension',engine)
#print(df1)
#gp = df.groupby(["CustomerName", "Quantity"]).size() 
#print(gp)

#1.Removing non-printable and non-ASCII characters
data['Cleansed_Input'] = data['Original_Input'].apply(
    lambda x: ''.join(i for i in x if 32 <= ord(i) <= 126))

#2.Removing punctuations
data['Cleansed_Input'] = data['Cleansed_Input'].apply(
    lambda x: x.translate(str.maketrans('', '', string.punctuation)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))

#3.Removing numbers in a table.
data['Cleansed_Input'] = data['Cleansed_Input'].apply(
    lambda x: x.translate(str.maketrans('', '', string.digits)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i in x if i not in string.digits]))

#4.Removing trailing and leading spaces
# (df is not defined until the next statement, so read from data here)
data['Cleansed_Input'] = data['Cleansed_Input'].apply(lambda x: x.strip())

df=pd.DataFrame(data)
#data1=pd.DataFrame(df1)


df2 = pd.DataFrame({ 
"Name_Extension": ["llc",
                   "Pvt ltd",
                   "Corp",
                   "CO Ltd",
                   "inc", 
                   "CO",
                   "SA"],
"Company_Type": ["Company LLC",
                 "Private Limited",
                 "Corporation",
                 "Company Limited",
                 "Incorporated",
                 "Company",
                 "Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})

data.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)
df = pd.DataFrame({
    "Original_Input": ["TECHNOLOGIES S.A", 
                       "A & J INDUSTRIES, LLC", 
                       "A&S DENTAL SERVICES", 
                       "A.M.G Médicale Inc", 
                       "AAREN SCIENTIFIC"],
    "Cleansed_Input": ["TECHNOLOGIES SA", 
                       "A J INDUSTRIES LLC", 
                       "AS DENTAL SERVICES", 
                       "AMG Mdicale Inc", 
                       "AAREN SCIENTIFIC"]
})

df_2 = pd.DataFrame({ 
    "Name_Extension": ["llc",
                       "Pvt ltd",
                       "Corp",
                       "CO Ltd",
                       "inc", 
                       "CO",
                       "SA"],
    "Company_Type": ["Company LLC",
                     "Private Limited",
                     "Corporation",
                     "Company Limited",
                     "Incorporated",
                     "Company",
                     "Anonymous Company"],
    "Priority": [2, 8, 4, 3, 5, 1, 9]
})

# Preprocessing text
df["lower_input"] = df["Cleansed_Input"].str.lower()
df_2["lower_extension"] = df_2["Name_Extension"].str.lower()

# Getting the lowest priority matching the end of the string
extensions_list = [ (priority, extension.lower_extension.values[0]) 
                    for priority, extension in df_2.groupby("Priority") ]
df["extension_priority"] = df["lower_input"] \
    .apply(lambda p: next(( priority 
                            for priority, extension in extensions_list 
                            if p.endswith(extension)), None))

# Merging both dataframes based on priority. This step can be ignored if you only need
# one column from the df_2. In that case, just give the column you require instead of 
# `priority` in the previous step.
df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")

# Removing the matched extensions from the `Cleansed_Input` string
df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
df["Core_Input"] = df.apply(
    lambda p: p["Cleansed_Input"] 
              if p["aux"] == 0 
              else p["Cleansed_Input"][:p["aux"]].strip(), 
    axis=1
)

# Selecting required columns
df[[ "Original_Input", "Core_Input", "Company_Type", "Name_Extension" ]]
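
The priority rule used in the code above (among all extensions that match the end of the string, keep the one with the lowest Priority number) can also be sketched without pandas. This is only an illustration under assumptions: EXTENSION_ROWS and match_by_priority are hypothetical names, and the rows are copied from the df_2 definition above.

```python
# Hypothetical rows copied from df_2: (Name_Extension, Company_Type, Priority).
EXTENSION_ROWS = [
    ("llc", "Company LLC", 2),
    ("Pvt ltd", "Private Limited", 8),
    ("Corp", "Corporation", 4),
    ("CO Ltd", "Company Limited", 3),
    ("inc", "Incorporated", 5),
    ("CO", "Company", 1),
    ("SA", "Anonymous Company", 9),
]


def match_by_priority(cleansed, rows=EXTENSION_ROWS):
    """Return (core, company_type, extension) for the best-priority suffix match."""
    lowered = cleansed.lower()
    # Collect every extension that matches the end of the string.
    candidates = [row for row in rows if lowered.endswith(row[0].lower())]
    if not candidates:
        # No match: leave the input unchanged, as the merge-based code does.
        return cleansed, None, None
    # The lowest Priority number wins.
    extension, company_type, _priority = min(candidates, key=lambda row: row[2])
    core = cleansed[: len(cleansed) - len(extension)].strip()
    return core, company_type, extension


print(match_by_priority("A J INDUSTRIES LLC"))
print(match_by_priority("AAREN SCIENTIFIC"))
```

Unmatched names are returned unchanged with None for the other two fields, which mirrors the `aux == 0` branch above.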
df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))
data['Cleansed_Input'] = (
    data["Original_Input"]
    .str.replace(r"[^\w ]+", "", regex=True)  # removes characters that are neither word characters nor spaces
    .str.replace(r" +", " ", regex=True)      # removes duplicated spaces
    .str.strip()                              # removes spaces before or after the string
)
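
For checking the cleaning rules on a handful of strings before running them on the whole column, the same three steps can be sketched with the standard re module. Note the assumption: this version keeps ASCII letters and spaces only, so it also drops digits and accented characters, which is what the Cleansed_Input values in the question appear to reflect. The cleanse name is hypothetical.

```python
import re


def cleanse(name):
    """Keep ASCII letters and spaces only, then normalise whitespace."""
    # Drop punctuation, digits and non-ASCII characters.
    text = re.sub(r"[^A-Za-z ]+", "", str(name))
    # Collapse runs of spaces into a single space.
    text = re.sub(r" +", " ", text)
    # Trim leading and trailing spaces.
    return text.strip()


for raw in ["TECHNOLOGIES S.A", "A & J INDUSTRIES, LLC", "A.M.G Médicale Inc"]:
    print(repr(raw), "->", repr(cleanse(raw)))
```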
SELECT t.Original_Name,
       t.Cleansed_Input,
       t.Name_Extension,
       t.Company_Type,
       t.Priority
FROM (
    SELECT df.Original_Name,
           df.Cleansed_Input,
           df_2.Name_Extension,
           df_2.Company_Type,
           df_2.Priority,
           ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
    FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
                 ('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
                 ('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
         LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
                           ('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
                           ('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
            ON  lower(df.Cleansed_Input) like ( '%' || lower(df_2.Name_Extension) )
) t
WHERE rn = 1
import re
from itertools import chain

ext = df2['Name_Extension'].str.strip().str.split(r'\s+')

ext = list(chain.from_iterable(i for i in ext))

df['Type_Input'] = df['Cleansed_Input'].str.findall('|'.join(ext),flags=re.IGNORECASE).str[0]

s = df['Cleansed_Input'].str.replace('|'.join(ext),'',regex=True,case=False).str.strip()

df.loc[df['Type_Input'].isnull()==False,'Core_Input'] = s
          Original_Input      Cleansed_Input type_input      core_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA        NaN             NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC        LLC  A J INDUSTRIES
2    A&S DENTAL SERVICES  AS DENTAL SERVICES        NaN             NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc        Inc     AMG Mdicale
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC        NaN             NaN
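
The pattern in the snippet above matches an extension token anywhere in the string, so a token that happens to occur inside a company name would be removed too. A possible refinement, shown here only as a sketch with the standard re module, anchors the match to whole words at the end of the string. NAME_EXTENSIONS and split_trailing_extension are hypothetical names; the list mirrors the Name_Extension values from the question.

```python
import re

# Hypothetical extension list mirroring the Name_Extension column.
NAME_EXTENSIONS = ["co llc", "Pvt ltd", "Corp", "CO Ltd", "inc", "CO"]

# Split multi-word extensions into single tokens, as the snippet above does.
tokens = sorted(
    {tok.lower() for ext in NAME_EXTENSIONS for tok in ext.split()},
    key=len,
    reverse=True,
)
token_alt = "|".join(re.escape(tok) for tok in tokens)

# One or more extension tokens, as whole words, at the very end of the string.
SUFFIX_RE = re.compile(
    rf"\b((?:{token_alt})(?:\s+(?:{token_alt}))*)\s*$", re.IGNORECASE
)


def split_trailing_extension(name):
    """Return (core, extension), or (None, None) when no trailing token matches."""
    match = SUFFIX_RE.search(name)
    if match is None:
        return None, None
    return name[: match.start()].strip(), match.group(1)


print(split_trailing_extension("A J INDUSTRIES LLC"))
print(split_trailing_extension("AAREN SCIENTIFIC"))
```

On the five sample names this gives the same matches as the table above; the difference would only show up for names that contain a token in the middle.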