How to quickly process a million strings to remove quotes and join them together


string python-3.x


I am trying to remove extra quotes from a large set of strings (a list of strings), so that for each original string such as

"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""

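I want to strip the leading and trailing quote of the line and collapse the doubled quotes around each string value, so the result looks like this: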

"str_value1","str_value2","str_value3",1,"str_value4"

and then join all the strings in the list with newlines.

I tried the following code:

for line in str_lines[1:]:
    strip_start_end_quotes = line[1:-1]
    splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
    str_lines[str_lines.index(line)] = splited_line_rem_quotes

for_pandas_new_headers_str = '\n'.join(str_lines)

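However, it is really slow (it ran for a very long time) when the list contains over 1 million lines. What is the most time-efficient way to do this?

I also tried to parallelize the work with multiprocessing: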

import multiprocessing


def preprocess_data_str_line(data_str_lines):
    """
    Strip the outer quotes and collapse doubled quotes in every line.

    :param data_str_lines: list of raw string lines
    :return: the same list, with each line cleaned
    """
    for line in data_str_lines:
        strip_start_end_quotes = line[1:-1]
        splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
        data_str_lines[data_str_lines.index(line)] = splited_line_rem_quotes

    return data_str_lines


def multi_process_prepcocess_data_str(data_str_lines):
    """
    Split the lines into blocks and start one process per block.

    :param data_str_lines: list of raw string lines
    :return: None (the results of the processes are not collected)
    """
    # how_many_core() and slice_list() are my own helpers (not shown here)
    # if cpu load < 25% and 4GB of ram free use 3 cores
    # if cpu load < 70% and 4GB of ram free use 2 cores
    cores_to_use = how_many_core()

    data_str_blocks = slice_list(data_str_lines, cores_to_use)

    for block in data_str_blocks:
        # spawn processes for each data string block assigned to every cpu core
        p = multiprocessing.Process(target=preprocess_data_str_line, args=(block,))
        p.start()


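But I don't know how to combine the results back into a list so that I can join the strings in the list with newlines.

So, ideally, I am thinking of using multiprocessing plus a fast function for preprocessing each line, to speed up the whole process.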

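I imagine a large amount of the processing time is spent in data_str_lines.index(line): to find the line for the Nth element, it has to look through the first N-1 elements to find the index of the original line (so instead of looping 1 million times, you are looping ~500 billion times). Instead, keep track of the current index and update the list as you go, e.g.: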

for idx, line in enumerate(data_str_lines):
    # Do whatever you need to do with `line`... to create a `new_line`
    # ...
    # Update line to be the new line
    data_str_lines[idx] = new_line

for_pandas = '\n'.join(data_str_lines)
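To make the loop above concrete, here is a minimal sketch that fills in the per-line step with the slice-and-replace logic from the question. The names `clean_line` and `clean_lines_in_place`, the `start` parameter, and the sample rows are illustrative assumptions and not part of the original post; `start=1` mirrors the question's `str_lines[1:]`, which skips the first row.

```python
from itertools import islice


def clean_line(line):
    """Drop the outer quotes, then collapse each doubled quote ("") to one."""
    return line[1:-1].replace('""', '"')


def clean_lines_in_place(lines, start=0):
    """Rewrite lines[start:] in place without calling list.index()."""
    # islice avoids copying the list; start=start keeps idx aligned with `lines`.
    for idx, line in enumerate(islice(lines, start, None), start=start):
        lines[idx] = clean_line(line)
    return lines


raw = [
    "header",  # hypothetical first row, left untouched
    '"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""',
    '"""a"",2,""b"""',
]
clean_lines_in_place(raw, start=1)
for_pandas = "\n".join(raw)
print(for_pandas)
```

Each line is visited exactly once, so the work grows linearly with the number of lines. If a new list is acceptable, `[clean_line(line) for line in lines]` should do the same job in a single expression.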

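If I only want to iterate over a sublist of data_str_lines, e.g. data_str_lines[1:], I find that idx for the first string in the sublist is 0 rather than 1 as in the original list, so I have to use idx+1. Is there a direct way to get its original index?

@daiyue use enumerate(data_str_lines[1:], start=1). Slicing creates a copy, though, so if you really want to avoid the memory overhead you can use: enumerate(itertools.islice(data_str_lines, 1, None), start=1)

Do you know how to multiprocess the task and concatenate the results from each process?

Yes... but that's a completely different question. Given the setup cost of getting it working, you won't actually get a performance gain; at best it won't be much slower :)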
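For readers who still want to try the multiprocessing route, one possible way to get the per-process results back in order is `multiprocessing.Pool.map`, which returns a list in the same order as its input. The sketch below is an assumption about how this could be wired up, not the approach from the thread; `clean_line`, `clean_lines_parallel`, and the default `workers` and `chunksize` values are hypothetical. Every line has to be pickled to a worker and back, so this may well be slower than the single-process loop.

```python
import multiprocessing


def clean_line(line):
    """Drop the outer quotes, then collapse each doubled quote ("") to one."""
    return line[1:-1].replace('""', '"')


def clean_lines_parallel(lines, workers=2, chunksize=10_000):
    """Clean every line in a worker pool; results keep the input order."""
    # chunksize batches lines per task to reduce inter-process overhead.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(clean_line, lines, chunksize=chunksize)
```

Call `clean_lines_parallel(raw_lines)` from under an `if __name__ == "__main__":` guard (needed on platforms that start workers by spawning a fresh interpreter), then pass the returned list to `'\n'.join(...)` as before. Timing both versions on real data is the only reliable way to tell whether the pool helps.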