Python: creating columns from list items of uneven length?
I have a list of addresses that I want to put into a DataFrame, where each row is a new address and the columns are the parts of the address (title, street, city). However, because of how the list is structured, some addresses are longer than others. For example:
address = ['123 Some Street, City','45 Another Place, PO Box 123, City']
I have a DataFrame with the following columns:
Index Court Address Zipcode Phone
0 Court 1 123 Court Dr, Springfield 12345 11111
1 Court 2 45 Court Pl, PO Box 45, Pawnee 54321 11111
2 Court 3 1725 Slough Ave, Scranton 18503 11111
3 Court 4 101 Court Ter, Unit 321, Eagleton 54322 11111
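I want to split the Address column into at most three columns, depending on how many comma separators the address contains, and fill in NaN wherever a value is missing.
For example, I would like the data to look like this: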
Index Court Address Address2 City Zip Phone
0 Court 1 123 Court Dr NaN Springfield ... ...
1 Court 2 45 Court Pl PO Box 45 Pawnee ... ...
2 Court 3 1725 Slough Ave NaN Scranton ... ...
3 Court 4 101 Court Ter Unit 321 Eagleton ... ...
I have tried a lot of different solutions from StackOverflow, but none of them worked. The closest I have come is this:
df2 = pd.concat([df, df['Address'].str.split(', ', expand=True)], axis=1)
But this returns a DataFrame with the same structure and the following three columns appended at the end:
... 0 1 2
... 123 Court Dr Springfield None
... 45 Court Pl PO Box 45 Pawnee
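This is close, but as you can see, for the shorter entries the city lines up with the second address line of the longer entries.
Ideally, column 2 should hold the city for every row, and column 1 should alternate between None and the second address line (where applicable).
I hope this makes sense; it is a hard problem to put into words. Thanks!
Addresses, especially ones produced by manual entry, can be tricky. However, if your addresses only come in these two formats, this will work. Note: if there is an additional format you have to account for, this will print the offending address.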
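def split_address(df):
    # Fills address_1, address_2 and city on df, row by row.
    for index, row in df.iterrows():
        full_address = row['Address']
        # Split once and strip the space that follows each comma.
        split = [part.strip() for part in full_address.split(',')]
        if full_address.count(',') == 2:  # street, second line, city
            df.loc[index, 'address_1'] = split[0]
            df.loc[index, 'address_2'] = split[1]
            df.loc[index, 'city'] = split[2]
        elif full_address.count(',') == 1:  # street, city
            df.loc[index, 'address_1'] = split[0]
            df.loc[index, 'city'] = split[1]
        else:
            print("address does not fit known formats {0}".format(full_address))
    return df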
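The comma-counting logic can be tried on plain strings before touching the DataFrame. The following is only a minimal sketch: the helper name `split_one` is hypothetical (it is not part of the answer above), and it assumes every address has either one or two commas.

```python
def split_one(address):
    """Return (address_1, address_2, city); address_2 is None when absent."""
    # Strip each piece so the space after a comma is not kept.
    pieces = [piece.strip() for piece in address.split(',')]
    commas = address.count(',')
    if commas == 2:      # street, second line, city
        return pieces[0], pieces[1], pieces[2]
    if commas == 1:      # street, city
        return pieces[0], None, pieces[1]
    # Any other shape is reported instead of being guessed at.
    raise ValueError("address does not fit known formats: {0}".format(address))


short = split_one('123 Court Dr, Springfield')
long_ = split_one('45 Court Pl, PO Box 45, Pawnee')
print(short)
print(long_)
```

Raising an error instead of printing is a design choice of this sketch, so that unexpected formats cannot slip through silently.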
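Basically, two things should help you here: the string.count() function, which tells you how many commas a string contains, and string.split(), which you have already found and which splits the input into a list. You can index into that list to assign each piece to the correct column.
You can do the following: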
Index Court Address Address2 City Zip Phone
0 Court 1 123 Court Dr NaN Springfield ... ...
1 Court 2 45 Court Pl PO Box 45 Pawnee ... ...
2 Court 3 1725 Slough Ave NaN Scranton ... ...
3 Court 4 101 Court Ter Unit 321 Eagleton ... ...
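# Strip whitespace so the values do not keep the space that follows each comma.
df['Address1'] = df['Address'].str.split(',').str[0].str.strip()
# expand=False returns a Series; addresses with a single comma do not match and become NaN.
df['Address2'] = df['Address'].str.extract(r',\s*(.*),', expand=False)
df['City'] = df['Address'].str.split(',').str[-1].str.strip()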
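The three assignments can also be folded into a single `str.extract` call with named groups, which pads the missing middle part with NaN on its own. This is a sketch under assumptions rather than a definitive solution: the pattern and the sample rows are illustrative, and it assumes the city is always the text after the last comma. Addresses with no comma at all do not match and come back as NaN in every column.

```python
import pandas as pd

df = pd.DataFrame({
    'Court': ['Court 1', 'Court 2', 'Court 3', 'Court 4'],
    'Address': [
        '123 Court Dr, Springfield',
        '45 Court Pl, PO Box 45, Pawnee',
        '1725 Slough Ave, Scranton',
        '101 Court Ter, Unit 321, Eagleton',
    ],
})

# Address1: everything before the first comma.
# Address2: optional middle part (NaN when the group does not participate).
# City: everything after the last comma.
pattern = r'^(?P<Address1>[^,]+)(?:,\s*(?P<Address2>.+))?,\s*(?P<City>[^,]+)$'

extracted = df['Address'].str.extract(pattern)
result = df.drop(columns='Address').join(extracted)
print(result)
```

Because the groups are named, `str.extract` returns a DataFrame whose columns are already called `Address1`, `Address2` and `City`, so no renaming step is needed.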
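You could consider writing functions with the usaddress package. It was very helpful to me when I needed to split addresses into their parts: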
import pandas as pd
import usaddress
df = pd.DataFrame(['123 Main St. Suite 100 Chicago, IL', '123 Main St. PO Box 100 Chicago, IL'], columns=['Address'])
Then create the functions that split the data:
def Address1(x):
    # Street line: number, street name and street type, if all three were tagged.
    try:
        data = usaddress.tag(x)  # (tagged components, address type)
        if 'AddressNumber' in data[0] and 'StreetName' in data[0] and 'StreetNamePostType' in data[0]:
            return data[0]['AddressNumber'] + ' ' + data[0]['StreetName'] + ' ' + data[0]['StreetNamePostType']
    except Exception:
        pass
    return None
def Address2(x):
    # Second line: either an occupancy (e.g. a suite) or a PO box.
    try:
        data = usaddress.tag(x)
        if 'OccupancyType' in data[0] and 'OccupancyIdentifier' in data[0]:
            return data[0]['OccupancyType'] + ' ' + data[0]['OccupancyIdentifier']
        elif 'USPSBoxType' in data[0] and 'USPSBoxID' in data[0]:
            return data[0]['USPSBoxType'] + ' ' + data[0]['USPSBoxID']
    except Exception:
        pass
    return None
def PlaceName(x):
    # City, if one was tagged.
    try:
        data = usaddress.tag(x)
        if 'PlaceName' in data[0]:
            return data[0]['PlaceName']
    except Exception:
        pass
    return None
df['Address1'] = df.apply(lambda x: Address1(x['Address']), axis=1)
df['Address2'] = df.apply(lambda x: Address2(x['Address']), axis=1)
df['City'] = df.apply(lambda x: PlaceName(x['Address']), axis=1)
Output:
Thank you all for the replies!