Split a string every N characters, iterating over the starting character, and parse to new files - python
I asked about a more general approach to this problem in a previous post, but I'm stuck trying to parse the results into individual files. I want to iterate over a long string, starting at position 1 (Python 0), and print every 100 characters. Then I want to shift by one character, start at position 2 (Python 1), and repeat the process until I reach the last 100 characters. I want to parse each "100"-character block into a new file. Here's what I'm working with so far:
seq = 7524 # I get this number from a raw_input
read_num=100
for raw_reads in range(100):
    def nlength_parts(seq,read_num):
        return map(''.join,zip(*[seq[i:] for i in range(read_num)]))
    f = open('read' + str(raw_reads), 'w')
    f.write("read" '\n')
    f.write(nlength_parts(seq,read_num))
    f.close
The error I keep getting is:
f.write(nlength_parts(seq,read_num))
TypeError: expected a character buffer object
Any help with this would be greatly appreciated.
After some help I made a few changes, but it still doesn't work:
seq = 7524 # I get this number from a raw_input
read_num=100
def nlength_parts(seq,read_num):
    return map(''.join,zip(*[seq[i:] for i in range(read_num)]))
for raw_reads in range(100): # Should be gene length - 100
    f = open('read' + str(raw_reads), 'w')
    f.write("read" + str(raw_reads))
    f.write(nlength_parts)
    f.close
I may have left out some important variables and definitions to keep my post short, but that caused confusion. I've pasted my full code below:
#! /usr/bin/env python
import sys,os
import random
import string
raw = raw_input("Text file: " )
with open(raw) as f:
    joined = "".join(line.strip() for line in f)
f = open(raw + '.txt', 'w')
f.write(joined)
f.closed
seq = str(joined)
read_num = 100
def nlength_parts(seq,read_num):
    return map(''.join,zip(*[seq[i:] for i in range(read_num)]))
for raw_reads in range(100): # ideally I want range to be len(seq)-100
    f = open('read' + str(raw_reads), 'w')
    f.write("read" + str(raw_reads))
    f.write('\n')
    f.write(str(nlength_parts))
    f.close
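A few things:
You define the variables seq and read_num, and then use the same names as parameters in the function. What you should do is give the parameters in the function definition different names, then pass those two variables to the function when you call it.
Your function slices seq, but seq is an integer in your code. Is seq the processed output of the file you mentioned in your comment? If so, is it much larger in your actual code?
def nlength_parts(myStr, length, paddingChar=" "):
    # Pad strings shorter than the window so at least one chunk is produced
    if len(myStr) < length:
        myStr += paddingChar * (length - len(myStr))
    sequences = []
    # Slide the window forward one character at a time
    for i in range(0, len(myStr) - length + 1):
        sequences.append(myStr[i:i+length])
    return sequences
foo = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
nlengthfoo = nlength_parts(foo, 10)
# Write each chunk to its own file: read1, read2, ...
for x in range(0, len(nlengthfoo)):
    with open("read" + str(x+1), "w") as f:
        f.write(nlengthfoo[x])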
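The TypeError in the question comes from passing something other than a string to write(): in Python 2, map() returns a list, and write() only accepts a string. The following is a minimal sketch of one way to combine the windowing and the file writing, not a definitive fix. The names sliding_windows, write_reads, and demo_dir are hypothetical, and the temporary directory is only there so the demo does not clutter the working directory.

```python
import os
import tempfile


def sliding_windows(text, size):
    """Return every length-`size` substring of `text`, moving one character at a time."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def write_reads(text, size, out_dir):
    """Write each window to its own file (read1, read2, ...) and return the paths."""
    paths = []
    for number, window in enumerate(sliding_windows(text, size), start=1):
        path = os.path.join(out_dir, "read" + str(number))
        with open(path, "w") as handle:
            # write() needs a string, so each window is written on its own
            # rather than passing the whole list of windows at once
            handle.write("read" + str(number) + "\n" + window + "\n")
        paths.append(path)
    return paths


demo_dir = tempfile.mkdtemp()  # hypothetical output location for this demo
demo_paths = write_reads("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 10, demo_dir)
print(len(demo_paths))  # 26 - 10 + 1 = 17 files
```

Using with open(...) also closes each file automatically, which the bare f.close in the question (without parentheses) never does.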
Edit: Sorry, I changed my code in response to your comment.
Edit in response to clarifying comments:
Basically, you want a rolling window over the string. Say long_string = "012345678901234567890123456789…", with a total length of 100:
In [18]: long_string
Out[18]: '0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789'
In [19]: window = 10
In [20]: for i in range(len(long_string) - window +1):
   .....:     chunk = long_string[i:i+window]
   .....:     print(chunk)
   .....:     with open('chunk_' + str(i+1) + '.txt','w') as f:
   .....:         f.write(chunk)
   .....:
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
Finally,
In [21]: ls
chunk_10.txt chunk_20.txt chunk_30.txt chunk_40.txt chunk_50.txt chunk_60.txt chunk_70.txt chunk_80.txt chunk_90.txt
chunk_11.txt chunk_21.txt chunk_31.txt chunk_41.txt chunk_51.txt chunk_61.txt chunk_71.txt chunk_81.txt chunk_91.txt
chunk_12.txt chunk_22.txt chunk_32.txt chunk_42.txt chunk_52.txt chunk_62.txt chunk_72.txt chunk_82.txt chunk_9.txt
chunk_13.txt chunk_23.txt chunk_33.txt chunk_43.txt chunk_53.txt chunk_63.txt chunk_73.txt chunk_83.txt
chunk_14.txt chunk_24.txt chunk_34.txt chunk_44.txt chunk_54.txt chunk_64.txt chunk_74.txt chunk_84.txt
chunk_15.txt chunk_25.txt chunk_35.txt chunk_45.txt chunk_55.txt chunk_65.txt chunk_75.txt chunk_85.txt
chunk_16.txt chunk_26.txt chunk_36.txt chunk_46.txt chunk_56.txt chunk_66.txt chunk_76.txt chunk_86.txt
chunk_17.txt chunk_27.txt chunk_37.txt chunk_47.txt chunk_57.txt chunk_67.txt chunk_77.txt chunk_87.txt
chunk_18.txt chunk_28.txt chunk_38.txt chunk_48.txt chunk_58.txt chunk_68.txt chunk_78.txt chunk_88.txt
chunk_19.txt chunk_29.txt chunk_39.txt chunk_49.txt chunk_59.txt chunk_69.txt chunk_79.txt chunk_89.txt
chunk_1.txt chunk_2.txt chunk_3.txt chunk_4.txt chunk_5.txt chunk_6.txt chunk_7.txt chunk_8.txt
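The listing above sorts chunk_10.txt before chunk_1.txt because file names are compared as text. As a hedged sketch, the same rolling window can be wrapped in a function that zero-pads the chunk number so the files sort in numeric order. The function name write_windows, the padding scheme, and the temporary output directory are assumptions made for illustration.

```python
import os
import tempfile


def write_windows(text, window, out_dir):
    """Write each sliding window of `text` to a zero-padded chunk file."""
    count = len(text) - window + 1
    if count <= 0:
        return []  # text is shorter than the window, nothing to write
    width = len(str(count))  # digits needed for the largest chunk number
    names = []
    for i in range(count):
        name = "chunk_" + str(i + 1).zfill(width) + ".txt"
        with open(os.path.join(out_dir, name), "w") as f:
            f.write(text[i:i + window])
        names.append(name)
    return names


long_string = "0123456789" * 10
out_dir = tempfile.mkdtemp()  # hypothetical output location for this demo
names = write_windows(long_string, 10, out_dir)
print(names[0], names[-1])  # chunk_01.txt chunk_91.txt
print(sorted(os.listdir(out_dir)) == names)  # True
```

A string of length 100 with a window of 10 yields 100 - 10 + 1 = 91 files, which matches the listing.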
Original response
I would just treat the string as a file. That avoids any slicing hassle and is very simple, since the file API lets you easily "read" in chunks.
In [1]: import io
In [2]: long_string = 'a'*100 + 'b'*100 + 'c'*100 + 'e'*88
In [3]: print(long_string)
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccceeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
In [4]: string_io = io.StringIO(long_string)
In [5]: chunk = string_io.read(100)
In [6]: chunk_no = 1
In [7]: while chunk:
   ...:     print(chunk)
   ...:     with open('chunk_' + str(chunk_no) + '.txt','w') as f:
   ...:         f.write(chunk)
   ...:     chunk = string_io.read(100)
   ...:     chunk_no += 1
   ...:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
Note that I'm using the IPython terminal, so you can use terminal commands within the interpreter session:
In [8]: ls chunk*
chunk_1.txt chunk_2.txt chunk_3.txt chunk_4.txt
In [9]: cat chunk_1.txt
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
In [10]: cat chunk_2.txt
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
In [11]: cat chunk_3.txt
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
In [12]: cat chunk_4.txt
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
In [13]:
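For comparison, here is a small sketch showing that the io.StringIO loop from the session above produces the same non-overlapping chunks as plain slicing with a step. The helper names fixed_chunks and stringio_chunks are hypothetical. Note that these are consecutive blocks, not the one-character rolling window asked about in the question.

```python
import io


def fixed_chunks(text, size):
    """Split `text` into consecutive pieces of `size` characters; the last may be shorter."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def stringio_chunks(text, size):
    """Same result, reading from an in-memory file until read() returns ''."""
    buffer = io.StringIO(text)
    chunks = []
    chunk = buffer.read(size)
    while chunk:
        chunks.append(chunk)
        chunk = buffer.read(size)
    return chunks


long_string = "a" * 100 + "b" * 100 + "c" * 100 + "e" * 88
chunks = fixed_chunks(long_string, 100)
print([len(c) for c in chunks])  # [100, 100, 100, 88]
print(chunks == stringio_chunks(long_string, 100))  # True
```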
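Why are you defining your function inside the for loop? Either that makes no sense, or I haven't had enough coffee today.
I admit that may not have been the best decision. I'm trying to parse each nlength_parts solution into a new file, but couldn't find the best way to do it.
What I mean is that it looks like gibberish. What are you trying to accomplish by defining it in the loop, since it will do exactly the same thing as taking it out of the loop? Also, you seem to be confusing your global variables with your parameters...
Yes, by the time you reference it the second time, raw_reads is out of scope.
My code is a mess... I moved the for loop, but I still can't get it to print the output of my nlength_parts... here's the code: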