Start spark-shell:
- [[email protected] ~]# spark-shell
- Setting default log level to "WARN".
- To adjust logging level use sc.setLogLevel(newLevel).
- Welcome to
-       ____              __
-      / __/__  ___ _____/ /__
-     _\ \/ _ \/ _ `/ __/  '_/
-    /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
-       /_/
- Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
- Type in expressions to have them evaluated.
- Type :help for more information.
- Spark context available as sc (master = yarn-client, App id = application_1554951897984_0111).
- SQL context available as sqlContext.
- scala> sc
- res0: org.apache.spark.SparkContext = [email protected]
- scala> sqlContext
- res1: org.apache.spark.sql.SQLContext = [email protected]
The shell has already created the two contexts, sc and sqlContext:
- Spark context available as sc (master = yarn-client, App id = application_1554951897984_0111).
- SQL context available as sqlContext.
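Outside the shell, these two objects have to be created by hand. A minimal sketch for a standalone Spark 1.6 application follows; the object name `ContextSketch` and the `local[*]` master are assumptions for illustration and do not come from the session above, which ran on yarn-client.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical standalone entry point; spark-shell performs this setup itself.
object ContextSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" is an assumption for illustration only.
    val conf = new SparkConf().setAppName("ContextSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)       // plays the role of the shell's `sc`
    val sqlContext = new SQLContext(sc)   // plays the role of the shell's `sqlContext`

    println(sc.master)
    sc.stop()
  }
}
```

Do not run this inside spark-shell itself, since the shell already owns a SparkContext.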
Create people07041119.JSON locally, with one complete JSON object per line (the Spark 1.6 JSON reader expects this layout):
- {"name":"zhangsan","job number":"101","age":33,"gender":"male","deptno":1,"sal":18000}
- {"name":"lisi","job number":"102","age":30,"gender":"male","deptno":2,"sal":20000}
- {"name":"wangwu","job number":"103","age":35,"gender":"female","deptno":3,"sal":50000}
- {"name":"zhaoliu","job number":"104","age":31,"gender":"male","deptno":1,"sal":28000}
- {"name":"tianqi","job number":"105","age":36,"gender":"female","deptno":3,"sal":90000}
Create dept.JSON locally, again with one JSON object per line:
- {"name":"development","deptno":1}
- {"name":"personnel","deptno":2}
- {"name":"testing","deptno":3}
Upload the local files to HDFS:
bash-4.2$ hadoop dfs -put /home/**/data/people07041119.JSON /user/**
bash-4.2$ hadoop dfs -put /home/**/data/dept.JSON /user/**
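The upload can also be checked from inside spark-shell. A small sketch using the Hadoop `FileSystem` API is shown here; `/user/**` is the elided directory from the commands above and has to be replaced with the real one before the check can return true.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Run inside spark-shell, where `sc` already exists.
val fs = FileSystem.get(sc.hadoopConfiguration)

// "/user/**" is the elided directory from the text; substitute the real path.
val uploaded = Seq("/user/**/people07041119.JSON", "/user/**/dept.JSON")
uploaded.foreach { p =>
  println(s"$p exists: ${fs.exists(new Path(p))}")
}
```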
Run the following Scala statements to load the files:
- scala> val people=sqlContext.jsonFile("/user/**/people07041119.JSON")
- warning: there were 1 deprecation warning(s); re-run with -deprecation for details
- people: org.apache.spark.sql.DataFrame = [age: bigint, deptno: bigint, gender: string, job number: string, name: string, sal: bigint]
- scala> val dept=sqlContext.jsonFile("/user/**/dept.JSON")
- warning: there were 1 deprecation warning(s); re-run with -deprecation for details
- dept: org.apache.spark.sql.DataFrame = [deptno: bigint, name: string]
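The deprecation warning comes from `jsonFile`, which was deprecated in Spark 1.4 in favour of `sqlContext.read.json`. The sketch below uses the newer reader; to keep it independent of HDFS it inlines two of the sample records as an RDD of strings, and the name `peopleSample` is made up for this example.

```scala
// Run inside spark-shell (Spark 1.6), where `sc` and `sqlContext` already exist.
// Two of the sample records, inlined so the sketch does not depend on HDFS.
val lines = sc.parallelize(Seq(
  """{"name":"zhangsan","job number":"101","age":33,"gender":"male","deptno":1,"sal":18000}""",
  """{"name":"lisi","job number":"102","age":30,"gender":"male","deptno":2,"sal":20000}"""
))

// Non-deprecated reader API. read.json also accepts a path string,
// e.g. sqlContext.read.json("/user/**/dept.JSON") once the elided directory is filled in.
val peopleSample = sqlContext.read.json(lines)
peopleSample.printSchema()
```

As with `jsonFile`, each input line must hold one complete JSON object.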
Run the following to view the contents:
- scala> people.show
- +---+------+------+----------+--------+-----+
- |age|deptno|gender|job number|    name|  sal|
- +---+------+------+----------+--------+-----+
- | 33|     1|  male|       101|zhangsan|18000|
- | 30|     2|  male|       102|    lisi|20000|
- | 35|     3|female|       103|  wangwu|50000|
- | 31|     1|  male|       104| zhaoliu|28000|
- | 36|     3|female|       105|  tianqi|90000|
- +---+------+------+----------+--------+-----+
Show the first three records:
- scala> people.show(3)
- +---+------+------+----------+--------+-----+
- |age|deptno|gender|job number|    name|  sal|
- +---+------+------+----------+--------+-----+
- | 33|     1|  male|       101|zhangsan|18000|
- | 30|     2|  male|       102|    lisi|20000|
- | 35|     3|female|       103|  wangwu|50000|
- +---+------+------+----------+--------+-----+
- only showing top 3 rows
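`show(3)` only prints to the console. When the first rows are needed as values, `limit` and `take` are the usual alternatives. The sketch below rebuilds the sample data inline; `peopleRows`, `firstThree` and `rows` are names invented for this example.

```scala
// Run inside spark-shell (Spark 1.6), where `sc` and `sqlContext` already exist.
val peopleRows = sqlContext.read.json(sc.parallelize(Seq(
  """{"name":"zhangsan","job number":"101","age":33,"gender":"male","deptno":1,"sal":18000}""",
  """{"name":"lisi","job number":"102","age":30,"gender":"male","deptno":2,"sal":20000}""",
  """{"name":"wangwu","job number":"103","age":35,"gender":"female","deptno":3,"sal":50000}""",
  """{"name":"zhaoliu","job number":"104","age":31,"gender":"male","deptno":1,"sal":28000}""",
  """{"name":"tianqi","job number":"105","age":36,"gender":"female","deptno":3,"sal":90000}"""
)))

// limit(3) returns another DataFrame and stays lazy until an action runs.
val firstThree = peopleRows.limit(3)

// take(3) is an action: it brings three Row objects back to the driver.
val rows = peopleRows.take(3)
rows.foreach(println)
```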
View the column names:
- scala> people.columns
- res5: Array[String] = Array(age, deptno, gender, job number, name, sal)
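`columns` returns names only. For names together with types, `dtypes` and `printSchema` can be used, as in this sketch. It loads a single inlined sample record, and `oneRecord` and `typeOf` are names made up for the example.

```scala
// Run inside spark-shell (Spark 1.6), where `sc` and `sqlContext` already exist.
val oneRecord = sqlContext.read.json(sc.parallelize(Seq(
  """{"name":"zhangsan","job number":"101","age":33,"gender":"male","deptno":1,"sal":18000}"""
)))

// dtypes pairs every column name with the string form of its data type.
val typeOf = oneRecord.dtypes.toMap
typeOf.foreach { case (name, tpe) => println(s"$name: $tpe") }

// printSchema prints the same information as a tree, including nullability.
oneRecord.printSchema()

// A column whose name contains a space can be looked up by its plain name.
oneRecord.select(oneRecord("job number"), oneRecord("name")).show()
```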
Add a filter condition:
- scala> people.filter("gender='male'").count
- res6: Long = 3
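The same filter can be written as a Column expression or as a SQL query against a temporary table. The sketch below runs both on the five sample records inlined as strings; the DataFrame name `staff` and the table name `staff_sample` are assumptions for illustration.

```scala
// Run inside spark-shell (Spark 1.6), where `sc` and `sqlContext` already exist.
val staff = sqlContext.read.json(sc.parallelize(Seq(
  """{"name":"zhangsan","job number":"101","age":33,"gender":"male","deptno":1,"sal":18000}""",
  """{"name":"lisi","job number":"102","age":30,"gender":"male","deptno":2,"sal":20000}""",
  """{"name":"wangwu","job number":"103","age":35,"gender":"female","deptno":3,"sal":50000}""",
  """{"name":"zhaoliu","job number":"104","age":31,"gender":"male","deptno":1,"sal":28000}""",
  """{"name":"tianqi","job number":"105","age":36,"gender":"female","deptno":3,"sal":90000}"""
)))

// Column-expression form of filter("gender='male'").
val malesByColumn = staff.filter(staff("gender") === "male").count

// SQL form: register the DataFrame under a (hypothetical) table name first.
staff.registerTempTable("staff_sample")
val malesBySql = sqlContext
  .sql("SELECT COUNT(*) AS n FROM staff_sample WHERE gender = 'male'")
  .first()
  .getLong(0)

println(s"column filter: $malesByColumn, sql: $malesBySql")
```

Both forms count the same three male records as the string filter above.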
Source: http://www.bubuko.com/infodetail-3112908.html