How to split a string in Python by spaces and a list of words


Suppose I have the following strings:

"USD Notional Amount: USD 50,000,000.00"
"USD Fixed Rate Payer Currency Amount: USD 10,000,000"
"USD Fixed Rate Payer Payment Dates: Annually"
"KRW Fixed Rate Payer Payment Dates: Annually"
Simply using the split function:

import pandas as pd

df = pd.DataFrame(["USD Notional Amount: USD 50,000,000.00",
                   "USD Fixed Rate Payer Currency Amount: USD 10,000,000",
                   "USD Fixed Rate Payer Payment Dates: Annually",
                   "KRW Fixed Rate Payer Payment Dates: Annually"])

df[0].apply(lambda x: x.split())
[Output]

0    [USD, Notional, Amount:, USD, 50,000,000.00]                 
1    [USD, Fixed, Rate, Payer, Currency, Amount:, USD, 10,000,000]
2    [USD, Fixed, Rate, Payer, Payment, Dates:, Annually]         
3    [KRW, Fixed, Rate, Payer, Payment, Dates:, Annually]    
I have a list of compound words:

words_list = ["Notional Amount:","Fixed Rate Payer Currency Amount:","Fixed Rate Payer Payment Dates:"]
What I want is to split each string into an array of strings, like this:

["USD","Notional Amount:","USD", "50,000,000.00"]
["USD","Fixed Rate Payer Currency Amount:","USD","10,000,000"]
["USD","Fixed Rate Payer Payment Dates:","Annually"]
["KRW","Fixed Rate Payer Payment Dates:","Annually"]

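When I split these strings, I want to keep certain words together, so the split is not always on a space. Does anyone know how to do this kind of string splitting in Python? Any ideas?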

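I don't think there is a generic way to do this, because your splits may vary too much. I would suggest first spending some time standardizing your input (for example, putting it in a spreadsheet with the same number of columns in every row), which will really simplify the rest of your process. That said, here is one way to do it with your data: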

st = """USD Notional Amount: USD 50,000,000.00
USD Fixed Rate Payer Currency Amount: USD 10,000,000
USD Fixed Rate Payer Payment Dates: Annually
KRW Fixed Rate Payer Payment Dates: Annually"""

def split_stuff(st):
    res = []
    for line in st.split("\n"):  # one record per line
        # Split on the first space only, to peel off the leading currency (USD, KRW)
        currency, rest = line.split(" ", 1)
        res.append([currency] + deal_with_rest(rest))
    return res

def deal_with_rest(rest):
    """Deal with everything after the (first) currency."""
    # Split off the last token: the amount or the type (here, 'Annually')
    compound, amt_type = rest.rsplit(" ", 1)
    if compound.strip().endswith("USD"):
        # A second currency precedes the amount, so split once more
        return compound.rsplit(" ", 1) + [amt_type]
    # Otherwise, just return the compound and the trailing value
    return [compound, amt_type]

for e in split_stuff(st):
    print(e)
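This returns the lists below, but it only works for these specific strings. If you have more elements, or different currencies (for example, I only hard-coded "USD" in deal_with_rest()), you will need to change a few things:

['USD', 'Notional Amount:', 'USD', '50,000,000.00']
['USD', 'Fixed Rate Payer Currency Amount:', 'USD', '10,000,000']
['USD', 'Fixed Rate Payer Payment Dates:', 'Annually']
['KRW', 'Fixed Rate Payer Payment Dates:', 'Annually']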
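One way to lift the hard-coded "USD" restriction might be to treat any three-letter uppercase token as a currency. The sketch below is only an illustration under that assumption; the names CURRENCY and split_line are hypothetical and not part of the answer above, and it still assumes the value after the label is a single token.

```python
import re

# Assumption: a currency is any three-letter uppercase code (USD, KRW, EUR, ...)
CURRENCY = re.compile(r"[A-Z]{3}")

def split_line(line):
    """Split 'CUR <label>: [CUR] <value>' into its parts."""
    # Leading currency, then everything else
    currency, rest = line.split(" ", 1)
    # The last token is the amount or the type (e.g. 'Annually')
    label, value = rest.rsplit(" ", 1)
    # Check whether the label ends with a second currency code
    head, _, tail = label.rpartition(" ")
    if head and CURRENCY.fullmatch(tail):
        return [currency, head, tail, value]
    return [currency, label, value]

for line in ["USD Notional Amount: USD 50,000,000.00",
             "KRW Fixed Rate Payer Payment Dates: Annually"]:
    print(split_line(line))
```

A value containing a space (say, a two-word payment frequency) would still be split incorrectly, so this remains specific to data shaped like the examples.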


As Xhattam said, there is probably no generic way to do what you want.

However, assuming you know which strings contain spaces that you do not want to split on, you can do the following (based on your example):
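A minimal sketch of that idea, assuming the phrase to protect is "Notional Amount:". The placeholder token is a hypothetical choice and must not occur anywhere in the real data:

```python
# Assumption: the phrase to keep together is known in advance
compound = "Notional Amount:"
placeholder = "__COMPOUND__"  # hypothetical token, assumed absent from the data

text = "USD Notional Amount: USD 50,000,000.00"

# Hide the compound behind a space-free placeholder, split on whitespace,
# then put the original phrase back in place of the placeholder
my_list = [compound if token == placeholder else token
           for token in text.replace(compound, placeholder).split()]
```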

Now you should be able to print my_list and get the following result:

print(my_list)
['USD', 'Notional Amount:', 'USD', '50,000,000.00']

This is a specific example that you can easily adapt to other strings.

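This generator should do the job, but the ":" is removed from the output, and what it returns are tuples. All of these artifacts can be changed to fit your format :)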

[Output]

0    [USD, Notional Amount:, USD, 50,000,000.00]                  
1    [USD, Fixed, Rate, Payer, Currency, Amount:, USD, 10,000,000]
2    [USD, Fixed Rate Payer Payment Dates:, Annually]             
3    [KRW, Fixed Rate Payer Payment Dates:, Annually]    

With help from 阿肯尼斯, I coded it like this. But is there a better solution?
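One possible alternative, sketched here rather than offered as a definitive answer, is a single regular expression that tries each known compound first and falls back to any run of non-space characters. It reuses words_list from the question; the names pattern and split_keep_compounds are made up for this illustration.

```python
import re

words_list = [
    "Notional Amount:",
    "Fixed Rate Payer Currency Amount:",
    "Fixed Rate Payer Payment Dates:",
]

# Try longer phrases first so a long compound is not shadowed by a shorter
# one it happens to contain; fall back to a plain non-space token (\S+)
alternatives = sorted(words_list, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(w) for w in alternatives) + r"|\S+")

def split_keep_compounds(text):
    """Split on whitespace, but keep every phrase from words_list whole."""
    return pattern.findall(text)

print(split_keep_compounds("USD Fixed Rate Payer Currency Amount: USD 10,000,000"))
```

Because the function takes a single string, it could be passed straight to df[0].apply(split_keep_compounds). Unlike the placeholder approach, it handles several different compounds in one string and needs no sentinel word that might clash with the data.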

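Can you describe the pattern? Can you do so precisely enough that even someone who does not know what the words mean could follow it? Then you should also be able to write it in code. By the way: this looks like homework, and showing your effort and a specific problem is a must for homework questions!

Sorry, I have added some code. What I mean is that when I split using the space delimiter, I want to keep some words together by using wordlist[].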
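import re

def string_to_accounting(string):
    """Yield one tuple per line; the ':' separator is dropped."""
    for line in string.split("\n"):
        a, b = line.split(":")  # a: currency + label, b: everything after the colon
        if re.search("[A-Z]{3} ", b):  # this could be stricter if needed
            # b holds a second currency: (currency, label, currency, amount)
            yield a[:3], a[4:], b[1:4], b[5:]
        else:
            # no second currency: (currency, label, value)
            yield a[:3], a[4:], b[1:]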
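
# wordlist is assumed to be defined beforehand as the list of compound words to keep
def split_emptynword(string_array):
    # Default: plain whitespace split, used when no compound word matches
    my_list = string_array.split()
    for element in wordlist:
        if element in string_array:
            # Hide the compound behind a placeholder, split, then restore it
            my_list = string_array.replace(element, 'Change').split()
            my_list = [element if x == 'Change' else x for x in my_list]
            break
    return my_list

df[0].apply(split_emptynword)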
0    [USD, Notional Amount:, USD, 50,000,000.00]                  
1    [USD, Fixed, Rate, Payer, Currency, Amount:, USD, 10,000,000]
2    [USD, Fixed Rate Payer Payment Dates:, Annually]             
3    [KRW, Fixed Rate Payer Payment Dates:, Annually]