Assuming you have read df1 and df2 into DataFrames, the first step is to create two lists, one for the name extensions (keys) and one for the company types (values), as shown below:
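keys = list(df2['Name_Extension'])
keys = [key.strip().lower() for key in keys]
print(keys)
>>> ['co llc', 'pvt ltd', 'corp', 'co ltd', 'inc', 'co']
values = list(df2['Company_Type'])
values = [value.strip().lower() for value in values]
print(values)
>>> ['company llc', 'private limited', 'corporation', 'company limited', 'incorporated', 'company']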
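As a variation on the two parallel lists, the extensions and types can be kept together in one dictionary, so a key can never drift out of step with its value. This is only a minimal sketch: the `rows` list is a hypothetical stand-in for the two `df2` columns, and `suffix_to_type` is an illustrative name that does not appear in the original answer.

```python
# Hypothetical stand-in for df2[['Name_Extension', 'Company_Type']].
rows = [
    ("co llc", "Company LLC"),
    ("Pvt ltd", "Private Limited"),
    ("Corp", "Corporation"),
    ("CO Ltd", "Company Limited"),
    ("inc", "Incorporated"),
    ("CO", "Company"),
]

# Normalise both sides the same way as the keys/values lists above.
suffix_to_type = {ext.strip().lower(): ctype.strip().lower() for ext, ctype in rows}

print(suffix_to_type["pvt ltd"])   # private limited
print(list(suffix_to_type))        # insertion order is preserved
```

With the real frame, the same dictionary could be built from the two lists with `dict(zip(keys, values))`.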
The next step is to map each value in Cleansed_Input to Core_Input and Type_Input. We can use the pandas apply method on the Cleansed_Input column.
To get Core_Input:
def get_core_input(data):
    # Preprocess
    data = str(data).strip().lower()
    # Check whether data ends with any of the keys
    for key in keys:
        if data.endswith(key):
            return data[:-len(key)].strip()  # drop the trailing key and return the rest
    return None
df1['Core_Input'] = df1['Cleansed_Input'].apply(get_core_input)
print(df1)
>>>
   Original_Input         Cleansed_Input      Core_Input   Type_input
0  TECHNOLOGIES S.A       TECHNOLOGIES SA     None         NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC  None         NaN
2  A&S DENTAL SERVICES    AS DENTAL SERVICES  None         NaN
3  A.M.G Médicale Inc     AMG Mdicale Inc     AMG Mdicale  NaN
4  AAREN SCIENTIFIC       AAREN SCIENTIFIC    None         NaN
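One caveat with a plain `endswith` test is that it also fires when the key is only the tail of a longer word (a name such as "TEXACO" ends with `co`). A possible refinement, sketched below under the assumption that an extension is always separated from the name by a space, is to require that separator and to try longer keys first. `strip_extension` and the sample `keys` list are illustrative and not part of the original answer.

```python
keys = ["co llc", "pvt ltd", "corp", "co ltd", "inc", "co"]

def strip_extension(name, keys=keys):
    """Return the name without a trailing extension, or None if none matches."""
    text = str(name).strip()
    lowered = text.lower()
    # Longest keys first, so 'co llc' is tried before 'co'.
    for key in sorted(keys, key=len, reverse=True):
        # Match only when the key is the whole string or follows a space.
        if lowered == key or lowered.endswith(" " + key):
            return text[: len(text) - len(key)].strip()
    return None

print(strip_extension("AMG Mdicale Inc"))   # AMG Mdicale
print(strip_extension("TEXACO"))            # None
print(strip_extension("ACME CO LLC"))       # ACME
```

Unlike the function above, this version keeps the original casing of the name, which may or may not be what the output column needs.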
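To get Type_input:
def get_type_input(data):
    # Preprocess
    data = str(data).strip().lower()
    # Check whether data ends with any of the keys
    for idx in range(len(keys)):
        if data.endswith(keys[idx]):
            return values[idx].strip()  # return the value for the matching key
    return None
df1['Type_input'] = df1['Cleansed_Input'].apply(get_type_input)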
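print(df1)
>>>
   Original_Input         Cleansed_Input      Core_Input   Type_input
0  TECHNOLOGIES S.A       TECHNOLOGIES SA     None         None
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC  None         None
2  A&S DENTAL SERVICES    AS DENTAL SERVICES  None         None
3  A.M.G Médicale Inc     AMG Mdicale Inc     AMG Mdicale  incorporated
4  AAREN SCIENTIFIC       AAREN SCIENTIFIC    None         None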
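This is an easy-to-follow solution, though I doubt it is the most efficient way to solve the problem. I hope it covers your use case.
SA is not defined in df2, but it has been split. Is that expected?
Yes. It is a type. There are 24 rows in df2. The abbreviation SA will also be in df2.
Hey, thanks for your answer. But I still don't get the desired output. It just copies Cleansed_Input into a new column named lower_input. It doesn't split.
Did you preprocess Cleansed_Input to lowercase? Can you share your code?
@DhanalakshmiV I don't see the code you use to find the matching extension for each row of your df. Can you add that as well? Also, please take a look at my edit, which might help.
Thanks a lot, this works for digits, punctuation and spaces. And I have not stored any co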
Original_Input         Cleansed_Input      Core_Input  Type_input
TECHNOLOGIES S.A       TECHNOLOGIES SA
A & J INDUSTRIES, LLC  A J INDUSTRIES LLC
A&S DENTAL SERVICES    AS DENTAL SERVICES
A.M.G Médicale Inc     AMG Mdicale Inc
AAREN SCIENTIFIC       AAREN SCIENTIFIC
Name_Extension  Company_Type     Priority
co llc          Company LLC      2
Pvt ltd         Private Limited  8
Corp            Corporation      4
CO Ltd          Company Limited  3
inc             Incorporated     5
CO              Company          1
Original_Input         Cleansed_Input      Core_Input      Type_input
TECHNOLOGIES S.A       TECHNOLOGIES SA     TECHNOLOGIES    SA
A & J INDUSTRIES, LLC  A J INDUSTRIES LLC  A J INDUSTRIES  LLC
A&S DENTAL SERVICES    AS DENTAL SERVICES
A.M.G Médicale Inc     AMG Mdicale Inc     AMG Mdicale     Incorporated
AAREN SCIENTIFIC       AAREN SCIENTIFIC
Here is my code:
import pyodbc
import pandas as pd
import string
from string import digits
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import String
from io import StringIO
from itertools import chain
import re
#Connecting SQL with Python
server = '172.16.15.9'
database = 'Database Demo'
username = '**'
password = '******'
engine = create_engine('mssql+pyodbc://**:******@'+server+'/'+database+'?driver=SQL+server')
#Reading SQL table and grouping by columns
data=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
#df1=pd.read_sql('Select * from company_Extension',engine)
#print(df1)
#gp = df.groupby(["CustomerName", "Quantity"]).size()
#print(gp)
#1.Removing non-printable and non-ASCII characters
data['Cleansed_Input'] = data['Original_Input'].apply(lambda x:''.join(['' if ord(i) < 32 or ord(i) > 126 else i for i in x]))
#2.Removing punctuations
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda x:''.join([x.translate(str.maketrans('', '', string.punctuation))]))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))
#3.Removing numbers in a table.
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda x:x.translate(str.maketrans('', '', string.digits)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i in x if i not in string.digits]))
#4.Removing trailing and leading spaces
data['Cleansed_Input']=data['Cleansed_Input'].apply(lambda x: x.strip())
df=pd.DataFrame(data)
#data1=pd.DataFrame(df1)
df2 = pd.DataFrame({
"Name_Extension": ["llc",
"Pvt ltd",
"Corp",
"CO Ltd",
"inc",
"CO",
"SA"],
"Company_Type": ["Company LLC",
"Private Limited",
"Corporation",
"Company Limited",
"Incorporated",
"Company",
"Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})
data.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)
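The four cleansing steps in the code above could also be wrapped in a single function, which makes them easier to test on individual strings before touching the SQL table. A minimal sketch in plain Python follows; `cleanse` is an illustrative name, and collapsing repeated spaces is an extra step that the code above does not perform.

```python
import string

# Characters to delete: ASCII punctuation and digits.
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def cleanse(text):
    """Keep printable ASCII, drop punctuation and digits, tidy the spaces."""
    printable = "".join(ch for ch in text if 32 <= ord(ch) <= 126)
    stripped = printable.translate(_DROP)
    # Collapsing repeated spaces is an extra step, not in the code above.
    return " ".join(stripped.split())

print(cleanse("A & J INDUSTRIES, LLC"))   # A J INDUSTRIES LLC
print(cleanse("A.M.G Médicale Inc"))      # AMG Mdicale Inc
```

On the DataFrame it would be applied as `data['Original_Input'].apply(cleanse)`.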
df = pd.DataFrame({
"Original_Input": ["TECHNOLOGIES S.A",
"A & J INDUSTRIES, LLC",
"A&S DENTAL SERVICES",
"A.M.G Médicale Inc",
"AAREN SCIENTIFIC"],
"Cleansed_Input": ["TECHNOLOGIES SA",
"A J INDUSTRIES LLC",
"AS DENTAL SERVICES",
"AMG Mdicale Inc",
"AAREN SCIENTIFIC"]
})
df_2 = pd.DataFrame({
"Name_Extension": ["llc",
"Pvt ltd",
"Corp",
"CO Ltd",
"inc",
"CO",
"SA"],
"Company_Type": ["Company LLC",
"Private Limited",
"Corporation",
"Company Limited",
"Incorporated",
"Company",
"Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})
# Preprocessing text
df["lower_input"] = df["Cleansed_Input"].str.lower()
df_2["lower_extension"] = df_2["Name_Extension"].str.lower()
# Getting the lowest priority matching the end of the string
extensions_list = [ (priority, extension.lower_extension.values[0])
for priority, extension in df_2.groupby("Priority") ]
df["extension_priority"] = df["lower_input"] \
.apply(lambda p: next(( priority
for priority, extension in extensions_list
if p.endswith(extension)), None))
# Merging both dataframes based on priority. This step can be ignored if you only need
# one column from the df_2. In that case, just give the column you require instead of
# `priority` in the previous step.
df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")
# Removing the matched extensions from the `Cleansed_Input` string
df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
df["Core_Input"] = df.apply(
lambda p: p["Cleansed_Input"]
if p["aux"] == 0
else p["Cleansed_Input"][:p["aux"]].strip(),
axis=1
)
# Selecting required columns
df[[ "Original_Input", "Core_Input", "Company_Type", "Name_Extension" ]]
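The core idea of the code above (try the extensions in priority order and keep the first one that ends the string) can be checked on plain strings before running it on a DataFrame. The sketch below is only an illustration of that idea under the same sample data; `EXTENSIONS` and `best_match` are hypothetical names.

```python
# (priority, extension, company type); a lower number wins, as in df_2.
EXTENSIONS = [
    (2, "llc", "Company LLC"),
    (8, "pvt ltd", "Private Limited"),
    (4, "corp", "Corporation"),
    (3, "co ltd", "Company Limited"),
    (5, "inc", "Incorporated"),
    (1, "co", "Company"),
    (9, "sa", "Anonymous Company"),
]

def best_match(cleansed):
    """Return (core, extension, company_type); the last two are None if nothing matches."""
    lowered = cleansed.lower()
    # sorted() orders the tuples by priority, and next() stops at the first hit.
    hit = next((row for row in sorted(EXTENSIONS) if lowered.endswith(row[1])), None)
    if hit is None:
        return cleansed, None, None
    _, ext, company_type = hit
    return cleansed[: len(cleansed) - len(ext)].strip(), ext, company_type

print(best_match("A J INDUSTRIES LLC"))  # ('A J INDUSTRIES', 'llc', 'Company LLC')
print(best_match("AAREN SCIENTIFIC"))    # ('AAREN SCIENTIFIC', None, None)
```

As in the DataFrame version, a name without a matching extension keeps its full cleansed value as the core.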
df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))
data['Cleansed_Input'] = (
    data["Original_Input"]
    .str.replace(r"[^\w ]+", "", regex=True)  # removes characters that are not letters, digits, underscores or spaces
    .str.replace(r" +", " ", regex=True)      # collapses repeated spaces
    .str.strip()                              # removes spaces before or after the string
)
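One detail worth checking with the `[^\w ]+` pattern: in Python 3, `\w` is Unicode-aware for `str` patterns, so accented letters such as `é` survive unless the `re.ASCII` flag is set. The sketch below uses plain `re` on a single string to show the difference; whether accents should be kept is an assumption to settle against the expected `Cleansed_Input` values.

```python
import re

name = "A.M.G Médicale Inc"

# Default: \w matches Unicode word characters, so 'é' is kept.
unicode_clean = re.sub(r"[^\w ]+", "", name)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so 'é' is removed.
ascii_clean = re.sub(r"[^\w ]+", "", name, flags=re.ASCII)

print(unicode_clean)  # AMG Médicale Inc
print(ascii_clean)    # AMG Mdicale Inc
```

In pandas, `Series.str.replace` takes a `flags` argument alongside `regex=True`, so the same flag can be passed there.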
SELECT t.Original_Name,
t.Cleansed_Input,
t.Name_Extension,
t.Company_Type,
t.Priority
FROM (
SELECT df.Original_Name,
df.Cleansed_Input,
df_2.Name_Extension,
df_2.Company_Type,
df_2.Priority,
ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
ON lower(df.Cleansed_Input) like ( '%' || lower(df_2.Name_Extension) )
) t
WHERE rn = 1
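To try the same ranking idea from Python without a database server, the query can be adapted to an in-memory SQLite database. This is a sketch under assumptions: it uses real tables instead of the inline `VALUES` lists, the table names `names` and `extensions` are made up for the example, and it needs an SQLite build with window functions (3.25 or later). Priority is stored as an integer here, since ordering string priorities would put `'10'` before `'2'`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (Original_Name TEXT, Cleansed_Input TEXT);
    CREATE TABLE extensions (Name_Extension TEXT, Company_Type TEXT, Priority INTEGER);
""")
conn.executemany("INSERT INTO names VALUES (?, ?)", [
    ("TECHNOLOGIES S.A", "TECHNOLOGIES SA"),
    ("A & J INDUSTRIES, LLC", "A J INDUSTRIES LLC"),
    ("A&S DENTAL SERVICES", "AS DENTAL SERVICES"),
    ("A.M.G Médicale Inc", "AMG Mdicale Inc"),
    ("AAREN SCIENTIFIC", "AAREN SCIENTIFIC"),
])
conn.executemany("INSERT INTO extensions VALUES (?, ?, ?)", [
    ("llc", "Company LLC", 2), ("Pvt ltd", "Private Limited", 8),
    ("Corp", "Corporation", 4), ("CO Ltd", "Company Limited", 3),
    ("inc", "Incorporated", 5), ("CO", "Company", 1),
    ("SA", "Anonymous Company", 9),
])

query = """
    SELECT Original_Name, Name_Extension, Company_Type
    FROM (
        SELECT n.Original_Name, e.Name_Extension, e.Company_Type,
               ROW_NUMBER() OVER (PARTITION BY n.Original_Name
                                  ORDER BY e.Priority) AS rn
        FROM names AS n
        LEFT JOIN extensions AS e
          ON lower(n.Cleansed_Input) LIKE '%' || lower(e.Name_Extension)
    )
    WHERE rn = 1
"""
# Map each original name to its best (lowest priority number) extension.
best = {name: (ext, ctype) for name, ext, ctype in conn.execute(query)}
print(best["A & J INDUSTRIES, LLC"])  # ('llc', 'Company LLC')
print(best["AAREN SCIENTIFIC"])       # (None, None)
```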
import re
from itertools import chain
ext = df2['Name_Extension'].str.strip().str.split(r'\s+')
ext = list(chain.from_iterable(i for i in ext))
df['Type_Input'] = df['Cleansed_Input'].str.findall('|'.join(ext),flags=re.IGNORECASE).str[0]
s = df['Cleansed_Input'].str.replace('|'.join(ext),'',regex=True,case=False).str.strip()
df.loc[df['Type_Input'].notnull(),'Core_Input'] = s
   Original_Input         Cleansed_Input      type_input  core_input
0  TECHNOLOGIES S.A       TECHNOLOGIES SA     NaN         NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC  LLC         A J INDUSTRIES
2  A&S DENTAL SERVICES    AS DENTAL SERVICES  NaN         NaN
3  A.M.G Médicale Inc     AMG Mdicale Inc     Inc         AMG Mdicale
4  AAREN SCIENTIFIC       AAREN SCIENTIFIC    NaN         NaN
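Because the pattern above is not anchored, it can match an extension anywhere in the name (for example the `INC` in "INCA FOODS"). A possible tightening, sketched here with plain `re` and a hypothetical `split_name` helper, is to escape the extensions and require a whole-word match at the end of the string. The extension list is sample data only.

```python
import re

extensions = ["llc", "pvt ltd", "corp", "co ltd", "inc", "co"]

# Longest alternatives first; \b and $ restrict the match to a trailing whole word.
alternatives = "|".join(re.escape(e) for e in sorted(extensions, key=len, reverse=True))
suffix_re = re.compile(r"\b(" + alternatives + r")\s*$", flags=re.IGNORECASE)

def split_name(cleansed):
    """Return (core, extension); extension is None when no suffix matches."""
    match = suffix_re.search(cleansed)
    if match is None:
        return cleansed, None
    return cleansed[: match.start()].strip(), match.group(1)

print(split_name("A J INDUSTRIES LLC"))  # ('A J INDUSTRIES', 'LLC')
print(split_name("INCA FOODS"))          # ('INCA FOODS', None)
```

The same compiled pattern could be passed to `Series.str.extract` to fill a type column in one call.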