How do I set column headers with PySpark SQL? (tags: python, apache-spark, pyspark, apache-spark-sql)

Just a quick question, folks. With pandas, we can create a DataFrame and set a header like this:

import pandas as pd
df = pd.read_csv('/file/path', sep='|', names = ['A','B'])
With PySpark:

text_file = sc.textFile('path/file')
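On the other hand, although I have read the documentation, I could not find how to set the header and the delimiter, or how to name each column of the dataset the way pandas does. Any idea how to name each column with PySpark SQL?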

Update:

Following @cafeed's suggestion, I tried the following:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df_2 = sqlContext.read.format('com.databricks.spark.csv').options(header='false', delimiter='|').load('path')
df_2
However, I got an exception:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-31-ad726583541b> in <module>()
      2 sqlContext = SQLContext(sc)
      3 
----> 4 df_2 = sqlContext.read.format('com.databricks.spark.csv').options(header='false', delimiter='|').load('/Users/user/GitHub/PySpark-Notes/ml-100k/u.user')
      5 df_2

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/sql/readwriter.pyc in load(self, path, format, schema, **options)
    119         self.options(**options)
    120         if path is not None:
--> 121             return self._df(self._jreader.load(path))
    122         else:
    123             return self._df(self._jreader.load())

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     34     def deco(*a, **kw):
     35         try:
---> 36             return f(*a, **kw)
     37         except py4j.protocol.Py4JJavaError as e:
     38             s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o67.load.
: java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:67)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:87)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:60)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:60)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:60)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:60)
    at scala.util.Try.orElse(Try.scala:82)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:60)
    ... 14 more
Thanks in advance, everyone.

Read the text file and set the separator with the delimiter option:

df = sqlContext.read \
   .format('com.databricks.spark.csv') \
   .options(header='false', delimiter='|') \
   .load(path)
You can set the schema, and with it the column names, using the schema method:

sqlContext.read.schema(schema)
where schema is a StructType:

from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("A", StringType(), True), StructField("B", StringType(), True)])
Or rename the columns afterwards by calling toDF, which takes the new names as separate arguments:

df.toDF('A', 'B')

tl;dr: use the packages option when starting pyspark:
--packages com.databricks:spark-csv_2.10:1.4.0
@TristanReid Thanks for your help!
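
When pyspark is started from a notebook or a plain Python script instead of the command line, the same option can usually be passed through the PYSPARK_SUBMIT_ARGS environment variable. A minimal sketch, assuming the variable is set before the SparkContext is created; the constant name is made up for illustration, and the package coordinate is the one quoted in the comment above:

```python
import os

# Maven coordinate from the comment above; the Scala suffix (_2.10) has to
# match the Scala version your Spark build was compiled against.
SPARK_CSV_PACKAGE = "com.databricks:spark-csv_2.10:1.4.0"

# Assumption: this runs before any SparkContext exists in the process.
# The trailing "pyspark-shell" token is what PySpark's launcher expects
# at the end of the argument string.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {} pyspark-shell".format(
    SPARK_CSV_PACKAGE
)
```

After this, creating the SparkContext and SQLContext as in the question should make the com.databricks.spark.csv source resolvable, which is what the ClassNotFoundException was complaining about.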