Python: How to split a string by spaces and a list of words


Suppose I have the following strings:

"USD Notional Amount: USD 50,000,000.00"
"USD Fixed Rate Payer Currency Amount: USD 10,000,000"
"USD Fixed Rate Payer Payment Dates: Annually"
"KRW Fixed Rate Payer Payment Dates: Annually"

Simply using the split function:

import pandas as pd

df = pd.DataFrame(["USD Notional Amount: USD 50,000,000.00",
                   "USD Fixed Rate Payer Currency Amount: USD 10,000,000",
                   "USD Fixed Rate Payer Payment Dates: Annually",
                   "KRW Fixed Rate Payer Payment Dates: Annually"])

df[0].apply(lambda x: x.split())

[Output]

0    [USD, Notional, Amount:, USD, 50,000,000.00]                 
1    [USD, Fixed, Rate, Payer, Currency, Amount:, USD, 10,000,000]
2    [USD, Fixed, Rate, Payer, Payment, Dates:, Annually]         
3    [KRW, Fixed, Rate, Payer, Payment, Dates:, Annually]

I want to use a list of compound words:

words_list = ["Notional Amount:","Fixed Rate Payer Currency Amount:","Fixed Rate Payer Payment Dates:"]

What I want is to split each string into an array of strings, like this:

["USD","Notional Amount:","USD", "50,000,000.00"]
["USD","Fixed Rate Payer Currency Amount:","USD","10,000,000"]
["USD","Fixed Rate Payer Payment Dates:","Annually"]
["KRW","Fixed Rate Payer Payment Dates:","Annually"]

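When I split these strings, I want to keep certain words together, because the split should not always happen on a space. Does anyone know how to do this kind of string splitting in Python? Any ideas?

I don't think there is a general way to do this. Your splits may vary too much, so I would suggest first spending some time normalizing the input (for example, putting it in a spreadsheet with the same number of columns in every row), which will really simplify the rest of your process. That said, here is one way to do it with your data: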

st = """USD Notional Amount: USD 50,000,000.00
USD Fixed Rate Payer Currency Amount: USD 10,000,000
USD Fixed Rate Payer Payment Dates: Annually
KRW Fixed Rate Payer Payment Dates: Annually"""

def split_stuff(st):
    res = []
    lines = st.split("\n")  # one record per line (split on newline)
    for line in lines:
        # split on the first space only, to extract the leading currency (USD, KRW)
        currency, rest = line.split(" ", 1)
        res.append([currency] + deal_with_rest(rest))
    return res

def deal_with_rest(rest):
    """Deals with anything after the (first) currency."""
    # split off the last token: the amount value or type (here, 'Annually')
    compound, amt_type = rest.rsplit(" ", 1)
    if compound.strip().endswith("USD"):
        # a currency appears again, so split on it one more time
        return compound.rsplit(" ", 1) + [amt_type]  # compound, currency, amount
    return [compound, amt_type]  # otherwise, just the compound and the value

for e in split_stuff(st):
    print(e)

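For the sample data this returns the four lists asked for in the question, but it only works for these specific strings. If you have more elements, or different currencies (for example, I hard-coded only "USD" in deal_with_rest()), you will need to change a few things.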
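One way to lift the hard-coded "USD" is sketched below. It assumes that a currency is always a three-letter uppercase code; that pattern and the name CURRENCY are assumptions for illustration, not part of the answer above.

```python
import re

# Assumption: any three uppercase letters count as a currency code (USD, KRW, ...).
CURRENCY = re.compile(r"[A-Z]{3}")

def deal_with_rest(rest):
    """Split everything after the leading currency into its parts."""
    compound, value = rest.rsplit(" ", 1)        # last token is the value
    head, _, tail = compound.rpartition(" ")     # candidate second currency
    if head and CURRENCY.fullmatch(tail):
        return [head, tail, value]               # compound, currency, amount
    return [compound, value]                     # compound, value

def split_line(line):
    currency, rest = line.split(" ", 1)          # leading currency code
    return [currency] + deal_with_rest(rest)
```

A label whose last word happens to be three uppercase letters would be mistaken for a currency, so a fixed set of known codes may be safer on real data.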

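As Xhattam said, there is probably no general way to do what you want.

However, assuming you know which strings contain spaces that you do not want to split on, you can do the following (using your example):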
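A minimal sketch of that idea, assuming a placeholder substitution: the known phrase is swapped for a single token that contains no spaces, the string is split, and the phrase is swapped back. The names text, compound, and placeholder are hypothetical; only my_list comes from the answer.

```python
text = "USD Notional Amount: USD 50,000,000.00"
compound = "Notional Amount:"      # the phrase that must stay in one piece
placeholder = "__COMPOUND__"       # assumed never to occur in the real data

# Hide the spaces inside the compound, split on whitespace, then restore it.
tokens = text.replace(compound, placeholder).split()
my_list = [compound if tok == placeholder else tok for tok in tokens]
```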

You should now be able to print my_list and get the following result:

print(my_list)
['USD', 'Notional Amount:', 'USD', '50,000,000.00']

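This is a specific example that you can easily adapt to other strings.

This generator (string_to_accounting, listed further down) should do the job, but the ":" is dropped from the output and tuples are returned. All of these details can be changed to match your format :)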

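With 阿肯尼斯's help, I coded it like this (split_emptynword, listed at the end). But is there a better solution?

Can you describe the pattern? Can you do so precisely enough that even someone who does not know what the words mean would understand it? Then you should also be able to write it in code. By the way: this looks like homework, and showing your effort and a specific problem is mandatory for homework questions!

Sorry, I have added some code. What I mean is that when I split using the space delimiter, I want to keep certain words together by using wordlist[].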

import re

def string_to_accounting(string):
    for line in string.split("\n"):
        a, b = line.split(":", 1)  # a: currency and label, b: value part
        if re.search("[A-Z]{3} ", b):  # this could be stricter if needed
            # currency, label, second currency, amount
            yield a[:3], a[4:], b[1:4], b[5:]
        else:
            # currency, label, value
            yield a[:3], a[4:], b[1:]
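If lists with the ":" kept are wanted, as in the question, the generator could be adapted roughly as follows. This is only a sketch: the name string_to_fields is hypothetical, and it assumes every line contains ": " and starts with a currency code followed by a space.

```python
import re

def string_to_fields(text):
    """Yield one list per line, keeping the ':' on the label."""
    for line in text.split("\n"):
        head, value = line.split(": ", 1)          # label part, value part
        currency, label = head.split(" ", 1)       # leading currency, label
        fields = [currency, label + ":"]
        match = re.match(r"([A-Z]{3}) (.+)", value)
        if match:
            fields.extend(match.groups())          # second currency, amount
        else:
            fields.append(value)                   # plain value, e.g. 'Annually'
        yield fields
```

Calling list(string_to_fields(st)) on the multi-line st from the earlier answer collects the per-line lists.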

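def split_emptynword(string_array):
    # wordlist is assumed to be defined beforehand, like words_list in the question
    my_list = string_array.split()  # default: plain whitespace split
    for element in wordlist:
        if element in string_array:
            # swap the compound for a one-word placeholder, split, then swap it back
            my_list = string_array.replace(element, 'Change').split()
            my_list = [element if x == 'Change' else x for x in my_list]
            break
    return my_list

df[0].apply(split_emptynword)

[Output]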

0    [USD, Notional Amount:, USD, 50,000,000.00]                  
1    [USD, Fixed, Rate, Payer, Currency, Amount:, USD, 10,000,000]
2    [USD, Fixed Rate Payer Payment Dates:, Annually]             
3    [KRW, Fixed Rate Payer Payment Dates:, Annually]
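As for a better solution: split_emptynword handles only the first compound it finds and relies on the word 'Change' never appearing in the data. One alternative, sketched here under the assumption that the full list of compounds is known in advance, is a single regular expression that tries each compound first and falls back to any run of non-space characters. The names build_tokenizer, TOKEN, and split_keep_compounds are hypothetical.

```python
import re

words_list = [
    "Notional Amount:",
    "Fixed Rate Payer Currency Amount:",
    "Fixed Rate Payer Payment Dates:",
]

def build_tokenizer(compounds):
    """Compile a pattern matching any compound, else one whitespace-free token."""
    # Longest first, so a longer phrase wins over a shorter one it starts with.
    ordered = sorted(compounds, key=len, reverse=True)
    alternatives = [re.escape(c) for c in ordered]
    alternatives.append(r"\S+")  # fallback: ordinary space-separated word
    return re.compile("|".join(alternatives))

TOKEN = build_tokenizer(words_list)

def split_keep_compounds(text):
    # findall returns the full matches because the pattern has no groups
    return TOKEN.findall(text)
```

With the DataFrame from the question, df[0].apply(split_keep_compounds) should apply the same split to every row.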