PySpark dans iPython notebook soulève Py4JJavaError lors de l'utilisation de count() et()
Je suis en utilisant PySpark(v. 2.1.0) dans iPython notebook (python v. 3.6) plus de virtualenv dans mon Mac(Sierra 10.12.3 Bêta).
1.J'ai lancé iPython notebook par le tournage de ce Terminal -
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /Applications/spark-2.1.0-bin-hadoop2.7/bin/pyspark
2.Chargé mon fichier Étincelle Contexte et assuré sa chargé-
>>>lines = sc.textFile("/Users/PanchusMac/Dropbox/Learn_py/Virtual_Env/pyspark/README.md")
>>>for i in lines.collect():
print(i)
Et il a bien fonctionné et d'imprimer le résultat sur ma console comme indiqué:
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Spark Streaming for stream processing.
<http://spark.apache.org/>
## Online Documentation
You can find the latest Spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html).
This README file only contains basic setup instructions.
Aussi vérifié le sc -
>>>print(sc)
<pyspark.context.SparkContext object at 0x101ce4cc0>
-
Maintenant, quand j'essaie d'exécuter
lines.count()
oulines.first()
fonctions au cours de mon CA, je me suis retrouvé avec d'erreur suivant
Py4JJavaError Traceback (most recent call last) <ipython-input-33-44aeefde846d> in <module>() ----> 1 lines.count() /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in count(self) 1039 3 1040 """ -> 1041 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 1042 1043 def stats(self): /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in sum(self) 1030 6.0 1031 """ -> 1032 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add) 1033 1034 def count(self): /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in fold(self, zeroValue, op) 904 # zeroValue provided to each partition is unique from the one provided 905 # to the final reduce call --> 906 vals = self.mapPartitions(func).collect() 907 return reduce(op, vals, zeroValue) 908 /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in collect(self) 807 """ 808 with SCCallSiteSync(self.context) as css: --> 809 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) 810 return list(_load_from_socket(port, self._jrdd_deserializer)) 811 /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". --> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError( Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 22, localhost, executor driver): org.apache.spark.SparkException: Error from python worker: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 183, in _run_module_as_main mod_name, mod_spec, code = _get_module_details(mod_name, _Error) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 109, in _get_module_details __import__(pkg_name) File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 646, in _load_unlocked File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py", line 44, in <module> File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 646, in _load_unlocked File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 36, in <module> File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 646, in _load_unlocked File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/java_gateway.py", line 25, in <module> File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/platform.py", line 886, in <module> "system node release version machine processor") File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in namedtuple TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module' PYTHONPATH was: /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/jars/spark-core_2.11-2.1.0.jar:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:934) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Error from python worker: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 183, in _run_module_as_main mod_name, mod_spec, code = _get_module_details(mod_name, _Error) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 109, in _get_module_details __import__(pkg_name) File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 646, in _load_unlocked File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py", line 44, in <module> File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 646, in _load_unlocked File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 36, in <module> File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 646, in _load_unlocked File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/java_gateway.py", line 25, in <module> File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/platform.py", line 886, in <module> "system node release version machine processor") File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in namedtuple TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module' PYTHONPATH was: /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/jars/spark-core_2.11-2.1.0.jar:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more
Quelqu'un pourrait-il m'expliquer où est le problème ?
Remarque: Lorsque j'ai effectué les mêmes opérations dans mon Mac Terminal, ils ont travaillé exactement comme prévu.
OriginalL'auteur Panchu | 2017-01-24
Vous devez vous connecter pour publier un commentaire.
Pyspark 2.1.0 n'est pas compatible avec python 3.6, voir https://issues.apache.org/jira/browse/SPARK-19019.
Vous devez utiliser des précédentes version de python ou vous pouvez essayer de construction master ou 2.1 de la branche à partir de github et il devrait fonctionner.
OriginalL'auteur Mariusz
Ouais j'ai eu le même problème il y a longtemps dans Pyspark dans Anaconda j'ai essayé plusieurs façons de remédier à cette enfin j'ai trouvé sur mon propre à l'installation de Java pour anaconda séparément par la suite il n'y a pas de Py4jerror.
https://anaconda.org/cyclus/java-jdk
Cette réponse devra être considéré comme la bonne solution, de l'omi.
OriginalL'auteur Raja Rajan
Si vous utilisez Anaconda, essayez d'installer java jdk pour Anaconda:
OriginalL'auteur Tung Nguyen