Recently, while fixing bugs in Spark 2.1, I made quite a few changes to the Spark source that needed to be compiled and tested. Building the whole Spark project takes about half an hour even when things go fast, so I generally rebuild and repackage only the subproject I changed.
The Spark docs already describe how to build a single subproject with mvn:
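The gist is mvn's `-pl` (projects list) option, which selects a module by its artifact id. A minimal sketch for the core module, assuming the Spark 2.1 / Scala 2.11 artifact name:

```
# build and install only the spark-core module, skipping tests
./build/mvn -pl :spark-core_2.11 clean install -DskipTests
```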
Building a subproject with mvn on its own does save a lot of time. But when you are changing code constantly, even a per-module mvn build is slow. The docs mention that developers are encouraged to use sbt for faster iterative builds, so I went digging through the documentation again:
And there it was, a section titled "Running Build Targets For Individual Projects":
```
$ # sbt
$ build/sbt package

$ # Maven
$ build/mvn package -DskipTests -pl assembly
```
What a trap. I haven't built Spark with sbt much, but I have used sbt itself, and `build/sbt package` plainly builds the entire project; that's not building a subproject at all.
I combed through every build-related page in the official docs, to no avail.
In the end, I studied Spark's sbt build definition, namely the `project/SparkBuild.scala` file, and found a way to build subprojects with sbt.
Here is how to rebuild and repackage spark-core. We use sbt's interactive (REPL) mode; the flow looks roughly like this:
```
➜  spark git:(branch-2.1.0) ✗ ./build/sbt -Pyarn -Phadoop-2.6 -Phive
...
[info] Set current project to spark-parent (in build file:/Users/stan/Projects/spark/)
> project core
[info] Set current project to spark-core (in build file:/Users/stan/Projects/spark/)
> package
[info] Updating {file:/Users/stan/Projects/spark/}tags...
[info] Resolving jline#jline;2.12.1 ...
...
[info] Packaging /Users/stan/Projects/spark/core/target/scala-2.11/spark-core_2.11-2.1.0.jar ...
[info] Done packaging.
[success] Total time: 213 s, completed 2017-2-15 16:58:15
```
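If you would rather not drop into the interactive shell, sbt also accepts the project switch and the task on the command line. A minimal sketch with the same profiles (both forms are standard sbt; I haven't timed them against the REPL flow):

```
# non-interactive: select the core project, then package it
./build/sbt -Pyarn -Phadoop-2.6 -Phive "project core" package

# or use sbt's <project>/<task> syntax
./build/sbt -Pyarn -Phadoop-2.6 -Phive core/package
```

The REPL flow still wins for repeated builds, since it avoids paying sbt's JVM startup cost each time.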
Finally, replace the old `spark-core_2.11-2.1.0.jar` with this one under the `jars` directory (in a deployed distribution) or `assembly/target/scala-2.11/jars` (in the source tree), and you're done.
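A minimal sketch of that copy step, assuming the source-tree paths from the transcript above:

```
# overwrite the core jar in the assembly directory with the freshly built one
cp core/target/scala-2.11/spark-core_2.11-2.1.0.jar \
   assembly/target/scala-2.11/jars/
```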
The key to picking a subproject is the `project` command. But how do you know which subprojects are defined? For that, look at the BuildCommons definition in `project/SparkBuild.scala`:
```scala
object BuildCommons {

  private val buildLocation = file(".").getAbsoluteFile.getParentFile

  val sqlProjects@Seq(catalyst, sql, hive, hiveThriftServer, sqlKafka010) = Seq(
    "catalyst", "sql", "hive", "hive-thriftserver", "sql-kafka-0-10"
  ).map(ProjectRef(buildLocation, _))

  val streamingProjects@Seq(
    streaming, streamingFlumeSink, streamingFlume, streamingKafka, streamingKafka010
  ) = Seq(
    "streaming", "streaming-flume-sink", "streaming-flume",
    "streaming-kafka-0-8", "streaming-kafka-0-10"
  ).map(ProjectRef(buildLocation, _))

  val allProjects@Seq(
    core, graphx, mllib, mllibLocal, repl, networkCommon, networkShuffle, launcher, unsafe, tags, sketch, _*
  ) = Seq(
    "core", "graphx", "mllib", "mllib-local", "repl", "network-common", "network-shuffle", "launcher", "unsafe",
    "tags", "sketch"
  ).map(ProjectRef(buildLocation, _)) ++ sqlProjects ++ streamingProjects

  val optionallyEnabledProjects@Seq(mesos, yarn, java8Tests, sparkGangliaLgpl,
    streamingKinesisAsl, dockerIntegrationTests) =
    Seq("mesos", "yarn", "java8-tests", "ganglia-lgpl", "streaming-kinesis-asl",
      "docker-integration-tests").map(ProjectRef(buildLocation, _))

  val assemblyProjects@Seq(networkYarn, streamingFlumeAssembly, streamingKafkaAssembly,
    streamingKafka010Assembly, streamingKinesisAslAssembly) =
    Seq("network-yarn", "streaming-flume-assembly", "streaming-kafka-0-8-assembly",
      "streaming-kafka-0-10-assembly", "streaming-kinesis-asl-assembly")
      .map(ProjectRef(buildLocation, _))

  val copyJarsProjects@Seq(assembly, examples) = Seq("assembly", "examples")
    .map(ProjectRef(buildLocation, _))

  val tools = ProjectRef(buildLocation, "tools")
  // Root project.
  val spark = ProjectRef(buildLocation, "spark")

  val sparkHome = buildLocation

  val testTempDir = s"$sparkHome/target/tmp"

  val javacJVMVersion = settingKey[String]("source and target JVM version for javac")
  val scalacJVMVersion = settingKey[String]("source and target JVM version for scalac")
}
```
Take this snippet as an example:
```scala
val sqlProjects@Seq(catalyst, sql, hive, hiveThriftServer, sqlKafka010) = Seq(
  "catalyst", "sql", "hive", "hive-thriftserver", "sql-kafka-0-10"
).map(ProjectRef(buildLocation, _))
```
This defines the SQL-related subprojects: catalyst, sql, hive, hiveThriftServer, and sqlKafka010. (The `name@pattern` syntax binds the whole Seq to sqlProjects while also naming each element; the strings passed to `ProjectRef`, such as "catalyst" and "hive-thriftserver", are the project ids that sbt addresses the modules by.)
If we want to build the catalyst project, we just enter sbt and run `project catalyst` to select it; the compile, package, and other commands issued afterwards all apply to that project.
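You can also skip reading SparkBuild.scala and ask sbt directly: its built-in `projects` command lists every defined project id. Output abbreviated and illustrative:

```
> projects
[info] In file:/Users/stan/Projects/spark/
[info]     assembly
[info]     catalyst
[info]     core
[info]     ...
```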
With that, building Spark is finally painless.
There are a few other useful build tips in the docs referenced above; they're worth a look.
Source: http://www.cnblogs.com/jasondan/p/6402731.html