Python: creating columns from list items of uneven length?


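I have a list of addresses that I want to put into a dataframe, where each row is a new address and the columns are the parts of the address (title, street, city).

However, the way the list is structured, some addresses are longer than others. For example:

address = ['123 Some Street, City','45 Another Place, PO Box 123, City']
I have a dataframe with the following columns: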

Index     Court       Address                              Zipcode   Phone                           
0         Court 1     123 Court Dr, Springfield            12345     11111
1         Court 2     45 Court Pl, PO Box 45, Pawnee       54321     11111
2         Court 3     1725 Slough Ave, Scranton            18503     11111
3         Court 4     101 Court Ter, Unit 321, Eagleton    54322     11111
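I want to split the Address column into at most three columns, depending on how many comma separators the address contains, and fill in NaN where a value is missing.

For example, I would like the data to look like this: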

Index     Court       Address          Address2     City           Zip  Phone                                          
0         Court 1     123 Court Dr     NaN          Springfield    ...   ...           
1         Court 2     45 Court Pl      PO Box 45    Pawnee         ...   ...
2         Court 3     1725 Slough Ave  NaN          Scranton       ...   ...
3         Court 4     101 Court Ter    Unit 321     Eagleton       ...   ...
I have tried a lot of different solutions from StackOverflow, but none of them worked. The closest I have gotten is this code:

df2 = pd.concat([df, df['Address'].str.split(', ', expand=True)], axis=1)
But this returns a dataframe with the same structure and the following three columns appended at the end:

...  0              1             2
... 123 Court Dr   Springfield   None
... 45 Court Pl    PO Box 45     Pawnee
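This is close, but as you can see, for the shorter entries the city lines up with the second address line of the longer entries.

Ideally, column 2 should hold a city for every row, and column 1 should alternate between None and the second address line (where applicable).


I hope this makes sense; it is a hard problem to put into words. Thanks!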

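Addresses, especially ones produced by manual entry, can be tricky. But if your addresses only ever follow these two formats, this will work:

Note: if there is an additional format you need to account for, this will print the offending address.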

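def split_address(df):
    # Create the output columns first so each row can be filled in below.
    for col in ('address_1', 'address_2', 'city'):
        df[col] = None
    for index, row in df.iterrows():
        full_address = row['Address']  # one address string, not the whole column
        split = [part.strip() for part in full_address.split(',')]
        if full_address.count(',') == 2:
            # Two commas: street, second address line, city
            df.loc[index, 'address_1'] = split[0]
            df.loc[index, 'address_2'] = split[1]
            df.loc[index, 'city'] = split[2]
        elif full_address.count(',') == 1:
            # One comma: street, city
            df.loc[index, 'address_1'] = split[0]
            df.loc[index, 'city'] = split[1]
        else:
            print("address does not fit known formats {0}".format(full_address))
    # Assigning to `row` would only change a copy, so write through df.loc instead.
    return df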

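Basically, two things should help you: the string.count() function, which tells you how many commas are in the string, and the string.split() function, which you have already found and which splits the input into a list. You can reference the individual parts of this list to assign the pieces to the correct columns.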
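As a small illustration of those two calls on plain strings, here is a minimal sketch. The helper name split_parts is hypothetical, and the sample addresses are taken from the question; it assumes an address has either one or two commas.

```python
def split_parts(full_address):
    """Return (address_1, address_2, city) for a comma-separated address."""
    # str.split gives the pieces; strip removes the space after each comma.
    parts = [part.strip() for part in full_address.split(',')]
    # str.count tells us which of the two known formats we are looking at.
    n_commas = full_address.count(',')
    if n_commas == 2:
        return parts[0], parts[1], parts[2]
    if n_commas == 1:
        # No second address line: leave the middle slot empty.
        return parts[0], None, parts[1]
    raise ValueError("address does not fit known formats: " + full_address)


print(split_parts('123 Court Dr, Springfield'))
# ('123 Court Dr', None, 'Springfield')
print(split_parts('45 Court Pl, PO Box 45, Pawnee'))
# ('45 Court Pl', 'PO Box 45', 'Pawnee')
```

Raising an error instead of printing is a design choice made only for this sketch, so that an unexpected format cannot pass unnoticed.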

You can do the following:

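# First piece is always the street line; strip stray whitespace.
df['Address1'] = df['Address'].str.split(',').str[0].str.strip()
# Text between the first and last comma; NaN when there is only one comma.
df['Address2'] = df['Address'].str.extract(r',\s*(.*),', expand=False)
# Last piece is always the city.
df['City'] = df['Address'].str.split(',').str[-1].str.strip()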
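To check this approach against the sample data from the question, a self-contained sketch might look like the following. The dataframe is rebuilt by hand from the question's table (only the Court and Address columns), and whitespace around each piece is stripped because the split is on a bare comma.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    'Court': ['Court 1', 'Court 2', 'Court 3', 'Court 4'],
    'Address': [
        '123 Court Dr, Springfield',
        '45 Court Pl, PO Box 45, Pawnee',
        '1725 Slough Ave, Scranton',
        '101 Court Ter, Unit 321, Eagleton',
    ],
})

parts = df['Address'].str.split(',')
# The first piece is the street line and the last piece is the city.
df['Address1'] = parts.str[0].str.strip()
df['City'] = parts.str[-1].str.strip()
# The middle piece only exists when there are two commas; otherwise NaN.
df['Address2'] = df['Address'].str.extract(r',\s*(.*),', expand=False)

print(df[['Court', 'Address1', 'Address2', 'City']])
```

Because the city is always taken from the last piece, it stays aligned no matter how many address lines come before it.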

You might consider writing functions with the usaddress package. It was very helpful to me when I needed to break addresses into parts:

import pandas as pd
import usaddress

df = pd.DataFrame(['123 Main St. Suite 100 Chicago, IL', '123 Main St. PO Box 100 Chicago, IL'], columns=['Address'])
Then create the functions that split the data:

def Address1(x):
    # Street line: house number + street name + street type.
    try:
        data = usaddress.tag(x)
        if 'AddressNumber' in data[0].keys() and 'StreetName' in data[0].keys() and 'StreetNamePostType' in data[0].keys():
            return data[0]['AddressNumber'] + ' ' + data[0]['StreetName'] + ' ' + data[0]['StreetNamePostType']
    except Exception:
        # Tagging failed for this string; fall through and return None.
        pass

def Address2(x):
    # Second address line: either an occupancy (e.g. a suite) or a PO box.
    try:
        data = usaddress.tag(x)
        if 'OccupancyType' in data[0].keys() and 'OccupancyIdentifier' in data[0].keys():
            return data[0]['OccupancyType'] + ' ' + data[0]['OccupancyIdentifier']
        elif 'USPSBoxType' in data[0].keys() and 'USPSBoxID' in data[0].keys():
            return data[0]['USPSBoxType'] + ' ' + data[0]['USPSBoxID']
    except Exception:
        pass

def PlaceName(x):
    # City name, when the tagger found one.
    try:
        data = usaddress.tag(x)
        if 'PlaceName' in data[0].keys():
            return data[0]['PlaceName']
    except Exception:
        pass

# Apply each function to the Address column, one address string at a time.
df['Address1'] = df['Address'].apply(Address1)
df['Address2'] = df['Address'].apply(Address2)
df['City'] = df['Address'].apply(PlaceName)
Output:


Thanks, everyone, for the replies!