How can I restructure a dataset and change its format using Python?


I have a dataset, and I need to rewrite its contents in a new layout.

My dataset looks like this (stored in a file named train1.txt):

2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468

I need to convert it to the following style (and store it in a new file named train.txt):

I am using Python 2.7.13 on Ubuntu 14.04 LTS. I would appreciate any help.

Thank you very much.

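I suggest using a regular expression. It may be overkill here, but regular expressions are extremely powerful and worth knowing in the long run.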

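import re


def return_no_commas(string):
    # \d+ matches one or more consecutive digits; \d* would also
    # match the empty string and produce blank output lines
    regex = r'\d+'
    matches = re.findall(regex, string)
    for match in matches:
        print(match)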


numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""

return_no_commas(numbers)

Let me explain what each part does.

import re

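simply imports the regular expression module. The pattern to use is

regex = r'\d+'

The leading "r" makes this a raw string literal, so the backslash reaches the regex engine unchanged. "\d" matches any single digit, and "+" means it must occur one or more times. (With "*" instead of "+", the pattern would also match the empty string, and findall would return empty matches between the numbers.) We then print every match.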
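The difference between the "*" and "+" quantifiers is easiest to see on a short input. A minimal sketch, using a made-up sample string rather than the real data:

```python
import re

# Made-up sample, not taken from train1.txt
sample = "12, 345, 6"

# `*` allows zero digits, so findall also reports empty matches
# at positions where no digit is present.
star_matches = re.findall(r'\d*', sample)

# `+` requires at least one digit, so only the numbers come back.
plus_matches = re.findall(r'\d+', sample)

print(star_matches)
print(plus_matches)  # ['12', '345', '6']
```

The first list contains empty strings mixed in with the numbers, which is why printing the matches of `\d*` produces blank lines.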

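I stored your numbers in a string named `numbers`, but you could just as easily read them from a file and process them the same way.

The matches are printed one number per line.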
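Reading from a file and writing the result to a new file could look roughly like the sketch below. The helper name `numbers_to_lines` is hypothetical, and the code assumes the file contains nothing but the numbers and their separators; it should run on both Python 2.7 and Python 3.

```python
import re


def numbers_to_lines(src_path, dst_path):
    """Hypothetical helper: write every run of digits found in
    src_path to dst_path, one number per line."""
    with open(src_path, "r") as src:
        text = src.read()

    # One or more consecutive digits; commas and whitespace are skipped
    numbers = re.findall(r'\d+', text)

    with open(dst_path, "w") as dst:
        dst.write("\n".join(numbers) + "\n")

    # Return the count so the caller can sanity-check the result
    return len(numbers)
```

With the files from the question, the call would be `numbers_to_lines("train1.txt", "train.txt")`.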

It sounds like your original data is comma-separated, but you want it separated by newline characters (\n) instead. That is easy to do.

def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read-from
    wfilename -- name of file to write-to
    """
    assert rfilename != wfilename
    # open the input file in read-mode and read it into a string
    with open(rfilename, "r") as rfile:
        rstryng = rfile.read()

    lyst = rstryng.split(",")
    # EXAMPLE:
    #     rstryng == "1,2,3,4"
    #     lyst    == ["1", "2", "3", "4"]

    # remove leading and trailing whitespace
    lyst = [s.strip() for s in lyst]

    wstryng = "\n".join(lyst)
    # open the output file in write-mode; write() takes a single string
    with open(wfilename, "w") as wfile:
        wfile.write(wstryng)


convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`

Since others have already posted answers, I will add one that uses numpy. If you can use numpy, it is very simple:

import numpy as np

data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')

If you want a list instead of a numpy array:

data.tolist()

[2342728,
 2414939,
 2397722,
 2386848,
 2398737,
 2367906,
 2384003,
 2399896,
 ....
]
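The numpy answer only loads the data. To also write it back out with one number per line, something like the following sketch might work. The helper name `rewrite_one_per_line` is hypothetical, and `autostrip=True` is added here as a precaution against spaces after the commas; it is not part of the original answer.

```python
import numpy as np


def rewrite_one_per_line(src_path, dst_path):
    """Hypothetical helper: load comma-separated integers from
    src_path and save them to dst_path, one per line."""
    data = np.genfromtxt(src_path, dtype=int, delimiter=',', autostrip=True)

    # A file holding a single number yields a 0-d array; make it 1-D
    data = np.atleast_1d(data)

    # fmt='%d' writes plain integers instead of the default float notation;
    # a 1-D array is written as one value per line
    np.savetxt(dst_path, data, fmt='%d')
    return data
```

For the question's files this would be called as `rewrite_one_per_line('train1.txt', 'train.txt')`.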

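Hi, welcome to Stack Overflow. I think your task can be done quite easily, but please post the code you have written so far.

Not that easy, but if you are not picky about the language.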
