
Running Spark Applications

resources to pools, run the applications in those pools, and enable applications running in pools to be preempted. See Dynamic Resource Pools.

If you are using Spark Streaming, see the recommendation in Spark Streaming and Dynamic Allocation on page 13.

Table 5: Dynamic Allocation Properties

spark.dynamicAllocation.executorIdleTimeout
    The length of time an executor must be idle before it is removed.
    Default: 60 s.

spark.dynamicAllocation.enabled
    Whether dynamic allocation is enabled.
    Default: true.

spark.dynamicAllocation.initialExecutors
    The initial number of executors for a Spark application when dynamic allocation is enabled.
    Default: 1.

spark.dynamicAllocation.minExecutors
    The lower bound for the number of executors.
    Default: 0.

spark.dynamicAllocation.maxExecutors
    The upper bound for the number of executors.
    Default: Integer.MAX_VALUE.

spark.dynamicAllocation.schedulerBacklogTimeout
    The length of time pending tasks must be backlogged before new executors are requested.
    Default: 1 s.
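As a sketch of how these properties are typically supplied, they can be passed as `--conf` flags to spark-submit; the master, application class, JAR name, and property values below are placeholders, not values from this guide:

```shell
# Illustrative only: com.example.MyApp, myapp.jar, and the numeric values
# are placeholders. The property names match Table 5 above.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=120s \
  --class com.example.MyApp myapp.jar
```

The same properties can instead be set once in the application's Spark configuration rather than on every submission.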

Optimizing YARN Mode in Unmanaged CDH Deployments

In CDH deployments not managed by Cloudera Manager, Spark copies the Spark assembly JAR file to HDFS each time you run spark-submit. You can avoid this copying by doing one of the following:

• Set spark.yarn.jar to the local path to the assembly JAR: local:/usr/lib/spark/lib/spark-assembly.jar.

• Upload the JAR and configure the JAR location:

  1. Manually upload the Spark assembly JAR file to HDFS:

     $ hdfs dfs -mkdir -p /user/spark/share/lib
     $ hdfs dfs -put SPARK_HOME/assembly/lib/spark-assembly_*.jar \
         /user/spark/share/lib/spark-assembly.jar

     You must manually upload the JAR each time you upgrade Spark to a new minor CDH release.

  2. Set spark.yarn.jar to the HDFS path:

     spark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly.jar
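One place this setting is commonly made persistent (an assumption here, not stated by this guide) is the spark-defaults.conf file, so every spark-submit picks it up without a per-run flag; the namenode host and port are the placeholders from step 2:

```shell
# Sketch: append the setting to spark-defaults.conf. The conf directory path
# and namenode address are illustrative placeholders.
echo 'spark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly.jar' \
  >> /etc/spark/conf/spark-defaults.conf
```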

Using PySpark

Apache Spark provides APIs in non-JVM languages such as Python. Many data scientists use Python because it has a rich variety of numerical libraries with a statistical, machine-learning, or optimization focus.

40 | Spark Guide
