1 Integrating the Python Stack with the Spark Big Data Platform
Download the Linux version of Anaconda3
Anaconda3-5.3.1-Linux-x86_64.sh
Install Anaconda3
bash Anaconda3-5.3.1-Linux-x86_64.sh -b
Configure the PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON environment variables
- export SCALA_HOME=/usr/local/install/scala-2.11.8
- export JAVA_HOME=/usr/lib/java/jdk1.8.0_45
- export HADOOP_HOME=/usr/local/install/hadoop-2.7.3
- export SPARK_HOME=/usr/local/install/spark-2.3.0-bin-hadoop2.7
- export FLINK_HOME=/usr/local/install/flink-1.6.1
- export ANACONDA_PATH=/root/anaconda3
- export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/bin/ipython
- export PYSPARK_PYTHON=$ANACONDA_PATH/bin/python
- export JRE_HOME=${JAVA_HOME}/jre
- export CLASS_PATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
- export PATH=:${JAVA_HOME}/bin:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${SPARK_HOME}/bin:$PATH
- export PATH=/root/anaconda3/bin:$PATH
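With this many exports it is easy to miss one and get a confusing failure when `pyspark` starts. The sketch below is a quick sanity check before launching; the `REQUIRED` list and the `missing_env` helper are illustrative names, not part of Spark or Anaconda.

```python
import os

# Variables the export list above is expected to define (illustrative subset).
REQUIRED = [
    "JAVA_HOME", "SPARK_HOME", "HADOOP_HOME",
    "PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON",
]

def missing_env(env):
    """Return the names from REQUIRED that are absent or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    absent = missing_env(os.environ)
    if absent:
        print("Missing environment variables:", ", ".join(absent))
    else:
        print("Environment looks complete.")
```

Run it in the same shell where the profile was sourced; an empty result means the driver and worker Python paths are at least set, though not necessarily valid.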
Start Spark
Launch Jupyter Notebook
Older versions
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --allow-root" pyspark
Newer versions
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --allow-root" pyspark
Remote access to Jupyter
- vi ~/.jupyter/jupyter_notebook_config.py
- c.NotebookApp.ip = '*' # IPs allowed to reach this server; '*' means any IP
- c.NotebookApp.open_browser = False # do not open a local browser on startup
- c.NotebookApp.port = 12035 # port to listen on; pick any free port
- c.NotebookApp.enable_mathjax = True # enable MathJax
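Binding to `'*'` exposes the notebook to the whole network, so a password should also be set via `c.NotebookApp.password`. Older notebook releases generated that value with `notebook.auth.passwd`, which produces a salted `algorithm:salt:hexdigest` string; the sketch below reproduces that legacy sha1 format with the standard library only (an assumption about the older scheme — on current releases, prefer running `jupyter notebook password` instead).

```python
import hashlib
import secrets

def notebook_passwd(passphrase, algorithm="sha1", salt_len=12):
    """Hash a passphrase in the legacy 'algorithm:salt:hexdigest' form
    used by older notebook.auth.passwd: the hex salt is appended to the
    UTF-8 passphrase before hashing."""
    salt = secrets.token_hex(salt_len // 2)  # salt_len hex characters
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, h.hexdigest()))
```

The resulting string is what goes into `c.NotebookApp.password = u'sha1:...'` in `jupyter_notebook_config.py`.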
The Jupyter Notebook development interface
Debugging a Spark program
- lines = sc.textFile("/LICENSE")  # each element is one line of the file
- pairs = lines.map(lambda s: (s, 1))  # map each line to a (line, 1) pair
- counts = pairs.reduceByKey(lambda a, b: a + b)  # sum counts per distinct line
- counts.count()  # 243
- counts.first()  # ('Apache License', 1)
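The session above counts how often each distinct line appears in /LICENSE. To see what `reduceByKey` is doing without a cluster, the same pipeline can be mimicked in plain Python; `reduce_by_key` here is a hypothetical local stand-in, not part of PySpark.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: fold together the values
    of every (key, value) pair that shares a key."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return list(acc.items())

# Toy stand-in for sc.textFile("/LICENSE") contents.
lines = ["Apache License", "Version 2.0", "Apache License"]
pairs = [(s, 1) for s in lines]                    # like lines.map(lambda s: (s, 1))
counts = reduce_by_key(pairs, lambda a, b: a + b)  # like pairs.reduceByKey(...)
```

On the toy input, `counts` maps "Apache License" to 2 and "Version 2.0" to 1, mirroring the per-line tally Spark computes in parallel.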
Launching in Standalone mode
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --allow-root" MASTER=spark://SparkMaster:7077 pyspark
2 Summary
By integrating the Python stack with the Spark big data platform, we gain access to Python's most complete ecosystem for computation and visualization.
Source: https://juejin.im/post/5c124aa66fb9a049bd422cc8