Apache Spark: where does the whitespace in this RDD come from?
The goal is to transform the integers residing in a file:
1 2 3
4 5 6
7 8 9
into three arrays so that math operations can be performed on them.
Expected:
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Actual:
[[u'1', u' ', u'2', u' ', u'3'], [u'4', u' ', u'5', u' ', u'6'], [u'7', u' ', u'8', u' ', u'9']]
Code:
txt = sc.textFile("integers.txt")
print txt.collect()
#[u'1 2 3', u'4 5 6', u'7 8 9']
pairs = txt.map(lambda x: x.split(' '))
print pairs.collect()
#[[u'1', u'2', u'3'], [u'4', u'5', u'6'], [u'7', u'8', u'9']]
pairs = txt.map(lambda x: [s for s in x])
print pairs.collect()
#[[u'1', u' ', u'2', u' ', u'3'], [u'4', u' ', u'5', u' ', u'6'], [u'7', u' ', u'8', u' ', u'9']]
The problem seems to be that the numbers are unicode strings, not ints. You can fix this by casting them to int (see the example below).
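A quick plain-Python illustration of the failure mode (no Spark required; this snippet is not from the original answer): the tokens produced by a split are strings, so `+` concatenates them instead of adding.

```python
# Tokens from a split are unicode strings, so '+' concatenates;
# casting to int restores numeric addition.
tokens = u'1 2 3'.split(' ')
assert tokens[0] + tokens[1] == u'12'   # string concatenation, not 3
nums = [int(s) for s in tokens]
assert nums[0] + nums[1] == 3           # integer addition after the cast
```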
pairs = txt.map(lambda x: x.split(' '))
# This returns every space-separated token; it is roughly equivalent to the
# following function (the lambda is applied once per line, so newlines from
# the file are already handled):
def AFunc(aString):
    returnArray = []
    tempString = ""
    for char in aString:
        if char == ' ':
            if tempString != "":
                returnArray.append(tempString)
                tempString = ""
        else:
            tempString += char
    if tempString != "":          # don't drop the final token
        returnArray.append(tempString)
    return returnArray
pairs = txt.map(lambda x: [s for s in x])
# This returns every character in the string, whitespace included; it is
# roughly equivalent to:
def BFunc(aString):
    returnArray = []
    for char in aString:
        returnArray.append(char)
    return returnArray
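The difference between the two lambdas can be seen without Spark at all (a small sketch using a plain string):

```python
s = '1 2 3'
# split(' ') keeps only the tokens between the spaces
assert s.split(' ') == ['1', '2', '3']
# iterating over the string yields every character, spaces included
assert [c for c in s] == ['1', ' ', '2', ' ', '3']
```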
>>> pairs = txt.map(lambda x: x.split(' '))
>>> print pairs.collect()
[[u'1', u'2', u'3'], [u'4', u'5', u'6'], [u'7', u'8', u'9']]
>>> pairs2 = pairs.map(lambda x: [int(s) for s in x])
>>> print pairs2.collect()
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>>
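As a side note (my own suggestion, not part of the answer above): calling `split()` with no argument splits on any run of whitespace and drops empty tokens, which makes the parsing tolerant of double spaces or tabs. A plain-Python sketch with hypothetical messy input:

```python
def parse_line(line):
    # split() with no argument handles runs of spaces and tabs
    return [int(s) for s in line.split()]

rows = ['1 2 3', '4  5 6', '7\t8 9']   # hypothetical messy input
matrix = [parse_line(line) for line in rows]
assert matrix == [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

In Spark this corresponds to `txt.map(lambda x: [int(s) for s in x.split()])`.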