1. Spark WordCount explained

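Packaging the program as a jar

Right-click the project name, click Export, and choose JAR file under Java. Click Next, choose the output directory and enter a file name, click Next twice, then click Finish to export the jar.

Copy the jar to a directory on the system and run:

    ./spark-submit --class com.dt.spark.WordCount_Cluster --master spark://worker1:7077 ./wordcount.jar

The command can also be saved in a .sh file, so that running the script submits the job.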

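II. Developing Spark Local and Cluster programs with IDEA

(1) Setting up the development environment

1. Install Java and Scala locally.

Spark 1.6 requires Scala 2.10.x; 2.10.4 is recommended, and Java 1.8 is the preferred Java version. Install both beforehand and add them to the environment variables.

2. Download IDEA Community Edition, choosing the Windows build, then install and configure it.

After installation, start IDEA and go through the initial configuration; the defaults are fine. Click OK, choose a UI theme, and click Next to reach the plugin selection page. Nothing needs to change there, so click Next again, select the Scala language plugin, and click Install. This step is essential, because the Spark programs here are written in Scala. When the installation finishes, click Start to launch IDEA.

3. Create a Scala project.

Click Create New Project, enter "Wordcount" as the project name, and choose where to save it under Project location.
Next set the Project SDK, i.e. the Java installation directory: click New on the right, choose JDK, and select the Java installation path.
Then choose the Scala SDK: click Create on the right; a 2.10.x Scala version is listed by default, so click OK, then click Finish.

4. Add the Spark jar dependency.

Click File -> Project Structure to configure the project's libraries. The key step is adding the Spark jar: select Libraries, click the plus sign on the right, choose Java, and select spark-1.6.0-bin-hadoop2.6\lib\spark-assembly-1.6.0-hadoop2.6.0.jar from the Spark 1.6.0 distribution. Click OK, wait a moment, then click OK again (the library applies to the WordCount module), click Apply, and click OK. This step matters: without it, Spark code cannot be written in the project.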

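(2) Writing the code

1. Create the package for the Spark program under src.

Right-click src, choose New -> Package, and enter com.dt.spark as the package name.

2. Create the Scala entry class.

Right-click the package name and choose New -> Scala Class. In the dialog, fill in Name, set Kind to Object, and click OK.

3. Write the local-mode code.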

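```
package com.dt.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]): Unit = {
    // Step 1: create the SparkConf holding the runtime configuration, for example
    // the URL of the cluster master; with "local" the program runs on this machine.
    val conf = new SparkConf()
    conf.setAppName("WordCount") // placeholder name; the original value is not shown
    conf.setMaster("local")

    // Step 2: create the SparkContext, the single entry point to all Spark
    // functionality. It initializes the core components of the application,
    // including DAGScheduler, TaskScheduler and SchedulerBackend.
    val sc = new SparkContext(conf)

    // Step 3: create an RDD from the data source (HDFS, HBase, local FS) through
    // the SparkContext. The RDD splits the data into partitions, and the data in
    // each partition is handled by one Task.
    val lines: RDD[String] = sc.textFile("G://datarguru spark//tool//spark-1.4.0-bin-hadoop2.6//README.md", 1) // read a local file into one partition

    // Step 4: apply transformations (higher-order functions such as map and
    // filter) to the initial RDD to carry out the computation.
    val words = lines.flatMap { line => line.split(" ") } // split each line into words and flatten the results into one collection
    val pairs = words.map { word => (word, 1) } // count each word occurrence as 1
    val wordCounts = pairs.reduceByKey(_ + _) // add up the values of identical keys

    wordCounts.foreach(wordNumberPair => println(wordNumberPair._1 + ":" + wordNumberPair._2))

    sc.stop()
  }
}
```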

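Right-click in the code area and choose Run "WordCount" to run the program. In a production environment, jobs are submitted by automated shell scripts rather than from the IDE.

Note: if val sc = new SparkContext(conf) reports an error and the program produces no output, change the module's Scala SDK to 2.10. To do so: File -> Project Structure -> Dependencies -> delete the Scala 2.11.x module -> click "+" at the top left -> Scala -> select Scala 2.10.4 -> Apply.

4. Write the cluster-mode code.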

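```
package com.dt.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object WordCount_Cluster {
  def main(args: Array[String]): Unit = {
    // Step 1: create the SparkConf. The master URL is not hard-coded here
    // (setting it to "local" would run the program locally); it is passed
    // to spark-submit with --master.
    val conf = new SparkConf()
    conf.setAppName("WordCount_Cluster") // placeholder name; the original value is not shown
    //conf.setMaster("spark://master:7077")

    // Step 2: create the SparkContext, which initializes the core components,
    // including DAGScheduler, TaskScheduler and SchedulerBackend.
    val sc = new SparkContext(conf)

    // Step 3: create an RDD from the data source (HDFS, HBase, local FS).
    val lines: RDD[String] = sc.textFile("...") // the input path is not shown in the original; supply your own

    // Step 4: apply transformations (higher-order functions such as map and
    // filter) to the initial RDD to carry out the computation.
    val words = lines.flatMap { line => line.split(" ") } // split each line into words and flatten the results into one collection
    val pairs = words.map { word => (word, 1) } // count each word occurrence as 1

    // Add up the values of identical keys, then rank by count: swap to
    // (count, word), sort in descending order, and swap back to (word, count).
    val wordCounts = pairs.reduceByKey(_ + _)
      .map(pair => (pair._2, pair._1))
      .sortByKey(false)
      .map(pair => (pair._2, pair._1))

    // collect() brings the results to the driver so they print in its console.
    wordCounts.collect().foreach(wordNumberPair => println(wordNumberPair._1 + ":" + wordNumberPair._2))

    sc.stop()
  }
}
```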
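The ranking idiom in the cluster code (swap each (word, count) pair to (count, word), sort by key in descending order, then swap back) may be easier to follow outside Spark. The following is a minimal plain-Java sketch of that idea, not Spark code: the class name SortByCountSketch and the method sortByCountDesc are hypothetical, and an ordinary list stands in for the RDD.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration (no Spark dependency); all names here are hypothetical.
final class SortByCountSketch {

    /** Swaps (word, count) to (count, word), sorts by key descending, and swaps back. */
    static List<Map.Entry<String, Integer>> sortByCountDesc(List<Map.Entry<String, Integer>> counts) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts) {
            swapped.add(new SimpleEntry<Integer, String>(e.getValue(), e.getKey()));
        }
        // Descending order on the count, playing the role of sortByKey(false).
        swapped.sort(Map.Entry.<Integer, String>comparingByKey().reversed());

        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        for (Map.Entry<Integer, String> e : swapped) {
            result.add(new SimpleEntry<String, Integer>(e.getValue(), e.getKey()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> counts = new ArrayList<>();
        counts.add(new SimpleEntry<String, Integer>("is", 1));
        counts.add(new SimpleEntry<String, Integer>("Hello", 4));
        counts.add(new SimpleEntry<String, Integer>("Spark", 2));
        // Prints Hello:4, Spark:2, is:1 on separate lines.
        for (Map.Entry<String, Integer> e : sortByCountDesc(counts)) {
            System.out.println(e.getKey() + ":" + e.getValue());
        }
    }
}
```

The swap is needed because sorting by key orders on the first element of the pair; without the second swap the results would stay in (count, word) form.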

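Packaging the program as a jar

Click File -> Project Structure, select Artifacts in the dialog, click "+" on the right, and choose JAR -> From modules with dependencies. In the next dialog set the main class and click OK. Then change the Name (the generated name is not a clean one) and the output location, remove the Scala and Spark jars from the artifact (the cluster already provides them), and click OK. Finally choose Build -> Artifacts in the menu bar and click Build in the popup; packaging starts automatically.

Run the WordCount program on Spark: copy the jar to a directory on the Linux system and run spark-submit, for example with the command given earlier:

```
./spark-submit --class com.dt.spark.WordCount_Cluster --master spark://worker1:7077 ./wordcount.jar
```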

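Notes:

Why not submit a Spark program to the cluster directly from the IDE?

1. The development machine has limited memory and cores. By default the driver of a Spark program runs on the machine that submits it, so submitting from IDEA would require a very powerful IDEA machine.

2. The driver directs the workers and communicates with them frequently. If the development environment and the Spark cluster are not on the same network, tasks can be lost, jobs run slowly, and other avoidable problems appear.

3. It is not secure.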

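III. The Java version of WordCount

Installing the JDK and configuring environment variables
System variables -> create a new variable JAVA_HOME.
Set its value to the JDK installation directory (E:\Java\jdk1.7.0 on the author's machine).
System variables -> find the Path variable -> Edit.
Append %JAVA_HOME%\bin;%JAVA_HOME%\jre\bin; to the value. (Check whether the existing Path value ends with a semicolon; if it does not, type ; first and then the text above.)
System variables -> create a new variable CLASSPATH with the value .;%JAVA_HOME%\lib;%JAVA_HOME%\lib\tools.jar (note the leading dot).
Installing and configuring Maven
Unzip apache-maven-3.1.1-bin.zip and move the extracted apache-maven-3.1.1 folder to D:\Java, creating the Java folder if it does not exist.
Create a system variable MAVEN_HOME with the value D:\Java\apache-maven-3.1.1, then edit the system variable Path and append ;%MAVEN_HOME%\bin.
In the Maven directory, edit conf/settings.xml and set the localRepository element to D:/repository to change where Maven stores downloaded jars.
Configuring Java and Maven in Eclipse
Click Window -> Preferences -> Java -> Installed JREs -> Add -> Standard VM, click Next, select the JDK installation path, and click Finish.
Click Window -> Preferences -> Maven -> Installations -> Add, select the Maven installation path in the dialog, and click Finish. Then select the newly added Maven entry in the list.
Click Window -> Preferences -> Maven -> User Settings; under User Settings on the right, click Browse and select settings.xml in Maven's conf directory. (The main purpose is to change where Maven stores downloaded dependencies.)

(2) Creating the Maven project

Creating the Maven project
Click File -> New -> Other -> Maven Project, click Next, select maven-archetype-quickstart, click Next, enter com.dt.spark as the group id and SparkApps as the artifact id, and click Finish.
Changing the JDK and the pom file
A newly created Maven project defaults to JDK 1.5, which needs to be changed to the JDK installed earlier (1.8). Right-click the project and choose Build Path -> Configure Build Path. In the dialog click Libraries, select JRE System Library, click Edit, choose Workspace default JRE, click Finish, and then click OK. Replace the contents of the pom file with the following, then wait for Eclipse to download the Maven dependencies and build the project. After the build an error marker remains; right-click that error row, choose Quick Fix, and click Finish in the dialog.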

```
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.dt.spark</groupId>
  <artifactId>SparkApps</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>SparkApps</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-graphx_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/java</sourceDirectory>
    <testSourceDirectory>src/main/test</testSourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>1.3.1</version>
        <executions>
          <execution>
            <goals>
              <goal>exec</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <executable>java</executable>
          <includeProjectDependencies>false</includeProjectDependencies>
          <classpathScope>compile</classpathScope>
          <mainClass>com.dt.spark.SparkApps.WordCount</mainClass>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
```

Creating the package and the Java code
Right-click the package com.dt.spark.SparkApps, choose New -> Package, enter com.dt.spark.SparkApps.cores as the name, and click Finish.
Right-click the package com.dt.spark.SparkApps.cores, choose New -> Class, enter WordCount as the name, and click Finish. Then write the following code in WordCount.

(3) Local version

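```
package com.dt.spark.SparkApps.cores;

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // The app name is a placeholder; the original value is not shown.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local");
        // Underneath, JavaSparkContext wraps the Scala SparkContext.
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("G://datarguru spark//tool//spark-1.4.0-bin-hadoop2.6//README.md");

        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" "));
            }
        });

        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Add up the values of identical keys (reducing at both the local and the reducer level).
        JavaPairRDD<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        wordsCount.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            public void call(Tuple2<String, Integer> pair) throws Exception {
                System.out.println(pair._1 + " : " + pair._2);
            }
        });

        sc.close();
    }
}
```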

Right-click in the code area and choose Run As -> Java Application to run the program and check the output.

(4) Cluster version

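```
package com.dt.spark.SparkApps.cores;

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

// The class name is assumed; the original declaration is not shown.
public class WordCountCluster {
    public static void main(String[] args) {
        // No setMaster here: the master URL is passed to spark-submit.
        // The app name is a placeholder; the original value is not shown.
        SparkConf conf = new SparkConf().setAppName("WordCountCluster");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("/library/wordcount/input/Data");

        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" "));
            }
        });

        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        JavaPairRDD<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        wordsCount.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            public void call(Tuple2<String, Integer> pair) throws Exception {
                System.out.println(pair._1 + " : " + pair._2);
            }
        });

        sc.close();
    }
}
```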

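IV. How WordCount runs, in depth

1. WordCount from the data-flow perspective

That is, when Spark counts words, how does the data actually flow? See the figure for the following pipeline:

    ….map(word => (word, 1)).reduceByKey(_ + _).saveAsTextFile(outputPathwordcount)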
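To make the flow concrete without a Spark installation, the same three steps (split lines into words, turn each word into a (word, 1) pair, add up the values per key) can be mimicked on an in-memory list. This is only a plain-Java analogy built on java.util.stream, not the Spark API; the class name WordCountFlowSketch and the method countWords are hypothetical, and the saveAsTextFile step is left out.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy of the WordCount data flow; all names here are hypothetical.
final class WordCountFlowSketch {

    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // flatMap: split every line and flatten the results into one stream of words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // map: each word becomes the pair (word, 1)
                .map(word -> new SimpleEntry<String, Integer>(word, 1))
                // reduceByKey(_ + _): add up the values of identical keys
                .collect(Collectors.toMap(
                        pair -> pair.getKey(),
                        pair -> pair.getValue(),
                        Integer::sum,
                        TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello Spark Hello Scala", "Hello Hadoop");
        // Prints {Hadoop=1, Hello=3, Scala=1, Spark=1}
        System.out.println(countWords(lines));
    }
}
```

A TreeMap is used only so that the printed order is deterministic; Spark itself gives no ordering guarantee after reduceByKey.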

A simple experiment

(1) Write the following code in IntelliJ IDEA:

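```
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("WordCount") // placeholder name; the original value is not shown
    conf.setMaster("local")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("D://tmp//helloSpark.txt", 1)
    val words = lines.flatMap { line => line.split(" ") }
    val pairs = words.map { word => (word, 1) }
    val wordCounts = pairs.reduceByKey(_ + _)

    wordCounts.foreach(wordNumberPair => println(wordNumberPair._1 + " : " + wordNumberPair._2))

    sc.stop()
  }
}
```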

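(2) Create the file helloSpark.txt in the tmp folder on drive D, with content that includes the following lines:

```
Hello Hadoop
Spark is awesome
```

Running the program prints one line per distinct word, for example:

```
Flink : 1
is : 1
awesome : 1
Scala : 1
```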

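Spark has three main characteristics:

Distributed. Both the data and the computation are distributed. The default splitting policy is that a split is as large as a block. That is not entirely accurate, though, because a record may straddle two blocks when splits are cut, so a split does not strictly equal the block size: with an HDFS block size of 128 MB, a split may be a few bytes larger or smaller. In general a split is not exactly the size of a block.
A split is not necessarily smaller than a block either: if its last record straddles two blocks, that record is placed in the earlier split.
Memory-based (partly disk-based).
Iterative.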
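The point about records that straddle a block boundary can be illustrated with a small simulation. The sketch below is modelled loosely on how line-oriented Hadoop input is read (a split that does not start the file skips its first line, and every split keeps reading until it has finished the record that crosses its end); it is a simplification, not Hadoop code. It counts characters rather than bytes, uses a toy block size of 16, and the names SplitBoundarySketch, nextLineStart and readSplits are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified simulation of line records versus fixed-size blocks; all names are hypothetical.
final class SplitBoundarySketch {

    /** Returns the index just after the next '\n' at or after from, or the text length. */
    static int nextLineStart(String text, int from) {
        int newline = text.indexOf('\n', from);
        return newline < 0 ? text.length() : newline + 1;
    }

    /** Cuts the text into blockSize-sized splits and returns the records each split reads. */
    static List<List<String>> readSplits(String text, int blockSize) {
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < text.length(); start += blockSize) {
            int end = Math.min(start + blockSize, text.length());
            // A split that does not begin the file skips its first line:
            // the previous split has already read it.
            int pos = (start == 0) ? 0 : nextLineStart(text, start);
            List<String> records = new ArrayList<>();
            // Keep reading while a record starts at or before the split's end,
            // even if the record itself runs past that end.
            while (pos <= end && pos < text.length()) {
                int next = nextLineStart(text, pos);
                String line = text.substring(pos, next);
                records.add(line.endsWith("\n") ? line.substring(0, line.length() - 1) : line);
                pos = next;
            }
            splits.add(records);
        }
        return splits;
    }

    public static void main(String[] args) {
        String text = "Hello Spark\nHello Hadoop\nSpark is awesome\n";
        // Prints [[Hello Spark, Hello Hadoop], [Spark is awesome], []]
        System.out.println(readSplits(text, 16));
    }
}
```

In this run the first split reads 25 characters although the block holds 16, because "Hello Hadoop" begins inside the first block and ends in the second; the last split reads nothing because its only line was already consumed by the split before it.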

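The textFile source in SparkContext.scala:

```
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString)
  }
```

As the source shows, a map operation is applied after hadoopFile.

HadoopRDD reads the distributed file from HDFS, and the data lives in the cluster as splits.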

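The map source in RDD.scala:

    /**
     * Return a new RDD by applying a function to all elements of this RDD.
     */
    def map[U: ClassTag](f: T => U): RDD[U] = withScope {
      val cleanF = sc.clean(f)
      new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
    }

Each line is read as a (key, value) pair. The key, the line's position in the file, is of no interest here; only the value is. The anonymous function pair => pair._2.toString takes that tuple and returns its second element.

This step produces a MapPartitionsRDD, which takes the partitions produced by the HadoopRDD and drops the key of every line.

Note: a single operation may produce one RDD or several. sc.textFile, for example, produces two: a HadoopRDD and a MapPartitionsRDD.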
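The two stages inside textFile can be imitated in plain Java: first produce (offset, line) records, as a line-oriented reader would, then keep only the value of each record. This is an illustrative sketch rather than Spark or Hadoop code. The names TextFileSketch, readWithOffsets and valuesOnly are hypothetical, and offsets are counted in characters for simplicity.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java analogy of hadoopFile(...).map(pair => pair._2.toString); names are hypothetical.
final class TextFileSketch {

    /** Stage 1: one (offset, line) record per line, the shape of data the HadoopRDD holds. */
    static List<Map.Entry<Long, String>> readWithOffsets(String text) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.add(new SimpleEntry<Long, String>(Long.valueOf(offset), line));
            offset += line.length() + 1; // +1 for the newline character
        }
        return records;
    }

    /** Stage 2: drop the key and keep the line, the job of the MapPartitionsRDD. */
    static List<String> valuesOnly(List<Map.Entry<Long, String>> records) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Long, String> record : records) {
            lines.add(record.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> records = readWithOffsets("Hello Spark\nHello Hadoop");
        System.out.println(records);             // prints [0=Hello Spark, 12=Hello Hadoop]
        System.out.println(valuesOnly(records)); // prints [Hello Spark, Hello Hadoop]
    }
}
```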
 
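Next step:

```
val words = lines.flatMap { line => line.split(" ") }
```

Each line in each partition is split into words, and the results are merged into one large collection of word instances.
What flatMap does is split the content of every line in every partition of the RDD into words.
There are 4 partitions here; after splitting, each partition holds individual words.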

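The flatMap source (in RDD.scala):

```
  /**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }
```

As the source shows, flatMap produces another MapPartitionsRDD, whose partitions now hold the split words.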
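The difference between map and flatMap on partitioned data is easy to see on a toy example. The sketch below models an RDD's partitions as a list of lists and applies the split function both ways; it is a plain-Java illustration with hypothetical names (FlatMapSketch, mapPartitions, flatMapPartitions), not Spark code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model: an "RDD" is a list of partitions, each partition a list of elements.
final class FlatMapSketch {

    /** map: one output element per line, so each partition holds one word array per line. */
    static List<List<String[]>> mapPartitions(List<List<String>> partitions) {
        List<List<String[]>> out = new ArrayList<>();
        for (List<String> partition : partitions) {
            List<String[]> arrays = new ArrayList<>();
            for (String line : partition) {
                arrays.add(line.split(" "));
            }
            out.add(arrays);
        }
        return out;
    }

    /** flatMap: the per-line results are flattened, so each partition holds plain words. */
    static List<List<String>> flatMapPartitions(List<List<String>> partitions) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> partition : partitions) {
            List<String> words = new ArrayList<>();
            for (String line : partition) {
                words.addAll(Arrays.asList(line.split(" ")));
            }
            out.add(words);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("Hello Spark", "Hello Scala"),
                Arrays.asList("Hello Hadoop"));
        // Prints [[Hello, Spark, Hello, Scala], [Hello, Hadoop]]
        System.out.println(flatMapPartitions(partitions));
    }
}
```

In both cases the number of partitions stays the same; only the shape of the elements inside each partition changes.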
 
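Next step:

```
val pairs = words.map { word => (word, 1) }
```

Each word instance is turned into a pair: word => (word, 1).
The map operation counts each split word as 1.
According to the source, this map produces yet another MapPartitionsRDD, whose elements now have the form ("Hello", 1), ("Spark", 1), and so on.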
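In the map source, the function is applied through iter.map on the iterator of each partition, which means elements are transformed one at a time as they are pulled rather than by building a whole intermediate collection first. A rough plain-Java analogy of that iterator wrapping follows. It is a sketch under that reading of the source, not Spark code, and PipelinedMapSketch, map and toPairs are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java analogy of applying iter.map(f) to one partition; names are hypothetical.
final class PipelinedMapSketch {

    /** Wraps an iterator so that each element is transformed only when it is pulled. */
    static <T, U> Iterator<U> map(final Iterator<T> source, final Function<T, U> f) {
        return new Iterator<U>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public U next() {
                return f.apply(source.next());
            }
        };
    }

    /** Turns the words of one partition into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> toPairs(List<String> partition) {
        Iterator<Map.Entry<String, Integer>> pairs =
                PipelinedMapSketch.<String, Map.Entry<String, Integer>>map(
                        partition.iterator(),
                        word -> new SimpleEntry<String, Integer>(word, 1));
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        while (pairs.hasNext()) {
            out.add(pairs.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [Hello=1, Spark=1]
        System.out.println(toPairs(Arrays.asList("Hello", "Spark")));
    }
}
```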
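Next step:

reduceByKey performs the global word count, adding up the values of identical keys. The reduce happens at both the local and the reducer level, so after the map a count is first carried out locally: the local-level reduce.

The local reduce before the shuffle is mainly responsible for the partial, per-partition count, and it writes the results into different files according to the partitioning policy.

The next stage is the reducer side. Suppose it has a parallelism of 3: after its local reduce, each partition divides its data into three groups. The simplest way to do so is to take the key's hash code modulo the number of partitions.

Everything up to this point belongs to stage 1.

Within a stage the computation iterates entirely in memory, with no disk reads or writes between operations, which is why it is fast.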
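The two-level reduce described above (a local reduce per map-side partition, then routing by hash code modulo the number of reducers, then a final merge per reducer) can be sketched in plain Java. This is a toy in-memory model, not Spark's shuffle implementation: nothing is written to files, and the names ShuffleSketch, localReduce, partitionFor and shuffleAndReduce are hypothetical. Math.floorMod is used so that negative hash codes still map to a valid reducer index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of local reduce + hash partitioning + final reduce; names are hypothetical.
final class ShuffleSketch {

    /** Local reduce: partial counts within one map-side partition. */
    static Map<String, Integer> localReduce(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Chooses the reducer for a key: hash code modulo the number of reducers. */
    static int partitionFor(String key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    /** Routes every partial count to its reducer and merges the counts there. */
    static List<Map<String, Integer>> shuffleAndReduce(List<List<String>> mapPartitions, int numReducers) {
        List<Map<String, Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<String, Integer>());
        }
        for (List<String> partition : mapPartitions) {
            for (Map.Entry<String, Integer> partial : localReduce(partition).entrySet()) {
                int target = partitionFor(partial.getKey(), numReducers);
                reducers.get(target).merge(partial.getKey(), partial.getValue(), Integer::sum);
            }
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<List<String>> mapPartitions = Arrays.asList(
                Arrays.asList("Hello", "Spark", "Hello", "Scala"),
                Arrays.asList("Hello", "Hadoop"),
                Arrays.asList("Hello", "Flink"),
                Arrays.asList("Spark", "is", "awesome"));
        List<Map<String, Integer>> reducers = shuffleAndReduce(mapPartitions, 3);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + ": " + reducers.get(i));
        }
    }
}
```

Because the same key always hashes to the same reducer, every partial count for a word ends up in one place, which is what allows the final per-key sum to be correct.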

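The reduceByKey source (in PairRDDFunctions.scala):

```
  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
   */
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }
```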

Source: http://www.bubuko.com/infodetail-1972269.html
