Developing Spark Applications<br />
Spark still looks for the data using the now-nonexistent column name and returns NULLs when it cannot locate the<br />
column values. To avoid behavior differences between Spark and Impala or Hive when modifying Parquet tables, avoid<br />
renaming columns, or use Impala, Hive, or a CREATE TABLE AS SELECT statement to produce a new table and new<br />
set of Parquet files containing embedded column names that match the new layout.<br />
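As a sketch of the CREATE TABLE AS SELECT approach described above, run in Impala or Hive (the table and column names here are hypothetical):<br />

```sql
-- Hypothetical example: rebuild a Parquet table after a column rename,
-- so the new Parquet files embed column names matching the new layout.
CREATE TABLE sales_v2
STORED AS PARQUET
AS SELECT id, amount AS total_amount FROM sales;
```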
For an example of writing Parquet files to Amazon S3, see Reading and Writing Data Sources From and To Amazon S3<br />
on page 22.<br />
Building Spark Applications<br />
You can use Apache Maven to build Spark applications developed using Java and Scala.<br />
For the Maven properties of CDH 5 components, see Using the CDH 5 Maven Repository. For the Maven properties of<br />
Kafka, see Maven Artifacts for Kafka.<br />
Building Applications<br />
Follow these best practices when building Spark Scala and Java applications:<br />
• Compile against the same version of Spark that you are running.<br />
• Build a single assembly JAR ("Uber" JAR) that includes all dependencies. In Maven, add the Maven assembly plug-in<br />
to build a JAR containing all dependencies:<br />
<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
This plug-in manages the merge procedure for all available JAR files during the build. Exclude Spark, Hadoop, and<br />
Kafka (CDH 5.5 and higher) classes from the assembly JAR, because they are already available on the cluster and<br />
contained in the runtime classpath. In Maven, specify Spark, Hadoop, and Kafka dependencies with scope provided.<br />
For example:<br />
<br />
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.0-cdh5.5.0</version>
  <scope>provided</scope>
</dependency>
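With provided scopes in place, a typical build-and-submit sequence might look like the following; the application class and JAR names are hypothetical:<br />

```shell
# Build the assembly JAR; the assembly plug-in attaches the
# jar-with-dependencies artifact during the package phase.
mvn package

# Submit to the cluster; Spark, Hadoop, and Kafka classes are
# resolved from the cluster's runtime classpath, not the JAR.
spark-submit --class com.example.MyApp \
  target/myapp-1.0-jar-with-dependencies.jar
```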
Building Reusable Modules<br />
Using existing Scala and Java classes inside the Spark shell requires an effective deployment procedure and dependency<br />
management. For simple and reliable reuse of Scala and Java classes and complete third-party libraries, you can use<br />
a module, which is a self-contained artifact created by Maven. This module can be shared by multiple users. This topic<br />
shows how to use Maven to create a module containing all dependencies.<br />
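As a sketch, a module POM might look like the following; the coordinates and the bundled third-party library are illustrative, and the assembly plug-in configuration shown earlier would be added to produce the self-contained artifact:<br />

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>mylibrary</artifactId>
  <version>1.0</version>
  <dependencies>
    <!-- Third-party library bundled into the module JAR;
         the version chosen here is hypothetical. -->
    <dependency>
      <groupId>joda-time</groupId>
      <artifactId>joda-time</artifactId>
      <version>2.9</version>
    </dependency>
  </dependencies>
</project>
```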
28 | Spark Guide