Accessing and analyzing big data stored on HDP using BASE SAS and SAS/ACCESS Interface to Hadoop
SAS/ACCESS Interface to Hadoop provides native access from within SAS to data sets stored in Hadoop. With SAS/ACCESS to Hadoop:
BASE SAS and SAS/ACCESS Interface to Hadoop were validated and tested against an HDP cluster running on an IBM POWER8 processor-based server. The main objectives of this exercise include:
The high-level components of the test environment include:
BASE SAS 和 SAS/ACCESS Interface to Hadoop
Hortonworks Data Platform
Figure 1 depicts the deployment and high-level architecture used to validate the SAS software with HDP running on Power. The SAS software, consisting of BASE SAS and SAS/ACCESS Interface to Hadoop, was installed and configured on a virtual machine [or logical partition (LPAR), in Power Systems terminology] running the IBM AIX 7.2 operating system on an IBM POWER8 processor-based server. A single-node HDP cluster was installed and configured on a virtual machine running RHEL 7.2 on another IBM POWER8 processor-based server.
SAS/ACCESS Interface to Hadoop was configured to connect to HDP and access the data stored on it. Note that the installation and configuration of BASE SAS and SAS/ACCESS Interface to Hadoop are independent of the number of nodes in the HDP cluster.
This section describes how to install and configure an HDP cluster and the SAS software: BASE SAS and SAS/ACCESS Interface to Hadoop.
The overall steps to install and configure the HDP cluster are as follows:
Download the MovieLens and driver test data, copy the data to HDFS, and then create the Hive tables.
- # su - hive
- # hadoop fs -mkdir -p /user/hive/dataset/drivers
- # hadoop fs -copyFromLocal /home/np/u0014213/Data/truck_event_text_partition.csv /user/hive/dataset/drivers
- # hadoop fs -copyFromLocal /home/np/u0014213/Data/drivers.csv /user/hive/dataset/drivers
- # hadoop fs -ls /user/hive/dataset/drivers
- Found 2 items
- -rw-r--r-- 3 hive hdfs 2043 2017-05-21 06:30 /user/hive/dataset/drivers/drivers.csv
- -rw-r--r-- 3 hive hdfs 2272077 2017-05-21 06:30 /user/hive/dataset/drivers/truck_event_text_partition.csv
- # su - hive
- # hive
- hive> create database trucks;
- hive> use trucks;
- hive> create table drivers
- (driverId int,
- name string,
- ssn bigint,
- location string,
- certified string,
- wageplan string)
- ROW FORMAT DELIMITED
- FIELDS TERMINATED BY ','
- STORED AS TEXTFILE
- TBLPROPERTIES("skip.header.line.count"="1");
- hive> create table truck_events
- (driverId int,
- truckId int,
- eventTime string,
- eventType string,
- longitude double,
- latitude double,
- eventKey string,
- correlationId bigint,
- driverName string,
- routeId int,
- routeName string)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
- STORED AS TEXTFILE
- TBLPROPERTIES("skip.header.line.count"="1");
- hive> show tables;
- OK
- drivers
- truck_events
- hive> LOAD DATA INPATH '/user/hive/dataset/drivers/truck_event_text_partition.csv' overwrite into table truck_events;
- hive> LOAD DATA INPATH '/user/hive/dataset/drivers/drivers.csv' overwrite into table drivers;
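The effect of the `TBLPROPERTIES("skip.header.line.count"="1")` setting above can be illustrated in Python: when reading the delimited text, the header line must be skipped before the remaining rows are parsed against the table schema. The sample values below are hypothetical stand-ins for rows of drivers.csv.

```python
import csv
import io

# Hypothetical sample in the drivers.csv layout; the first line is the
# header that skip.header.line.count="1" tells Hive to ignore.
SAMPLE = """driverId,name,ssn,location,certified,wageplan
10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
11,Jamie Engesser,262112338,366-4125 Ac St.,N,miles
"""

def read_drivers(text):
    """Parse driver rows, skipping the header line as Hive would."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [
        {"driverId": int(r[0]), "name": r[1], "ssn": int(r[2]),
         "location": r[3], "certified": r[4], "wageplan": r[5]}
        for r in reader
    ]

rows = read_drivers(SAMPLE)
print(len(rows), rows[0]["name"])
```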
Perform the following steps to install BASE SAS and SAS/ACCESS Interface to Hadoop on AIX 7.2.
After installing BASE SAS and SAS/ACCESS Interface to Hadoop, configure SAS/ACCESS Interface to Hadoop to connect to the HDP cluster by following the instructions in the SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
After completing the installation and configuration instructions above, use BASE SAS to access and analyze the data stored on HDP. Because this article is about validating SAS/ACCESS Interface to Hadoop with HDP, we exercised some of the key interfaces for accessing data and performed basic analysis. You can explore further by building on the examples presented in the following sections.
From BASE SAS, use the PROC HADOOP procedure with HDFS statements to access data stored in HDFS. The following sample SAS code shows how to connect to Hadoop, write data to HDFS, and access data stored on HDFS. The SAS code can be run from the SAS command-line interface (CLI) on the SAS client system.
- # cat proc-hadoop-test.sas
- proc hadoop username='hdfs' password='xxxxxxxx' verbose;
- hdfs mkdir='/user/hdfs/sasdata';
- hdfs delete='/user/hdfs/test1';
- hdfs copytolocal='/user/hdfs/test/fromHDP.txt'
- out='/home/np/SASCode/fromHDP.txt';
- hdfs copyfromlocal='/home/np/SASCode/fromSAS.txt'
- out='/user/hdfs/sasdata/fromSAS.txt';
- run;
- # /SASHome/SASFoundation/9.4/sas proc-hadoop-test.sas
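Each HDFS statement in proc-hadoop-test.sas corresponds to a `hadoop fs` subcommand. The Python sketch below builds the equivalent command lines without executing them, so the mapping stays testable without a cluster (mapping `delete` to `-rm` is an approximation on our part; PROC HADOOP's delete also removes directories):

```python
# Map each PROC HADOOP HDFS statement onto the equivalent `hadoop fs`
# command line. Only the argv lists are built here; nothing is executed.

def hdfs_cmd(action, *paths):
    """Return the `hadoop fs` argv for one HDFS statement."""
    subcommands = {
        "mkdir": "-mkdir",
        "delete": "-rm",          # assumption: -rm (-rm -r for directories)
        "copytolocal": "-copyToLocal",
        "copyfromlocal": "-copyFromLocal",
    }
    return ["hadoop", "fs", subcommands[action], *paths]

# The four statements from proc-hadoop-test.sas:
cmds = [
    hdfs_cmd("mkdir", "/user/hdfs/sasdata"),
    hdfs_cmd("delete", "/user/hdfs/test1"),
    hdfs_cmd("copytolocal", "/user/hdfs/test/fromHDP.txt",
             "/home/np/SASCode/fromHDP.txt"),
    hdfs_cmd("copyfromlocal", "/home/np/SASCode/fromSAS.txt",
             "/user/hdfs/sasdata/fromSAS.txt"),
]
for c in cmds:
    print(" ".join(c))
```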
After running the code, check the log file for the execution status.
- # cat proc-hadoop-test.log
- 1 The SAS System 00:26
- Saturday, April 22, 2017
- NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.
- NOTE: SAS (r) Proprietary Software 9.4 (TS1M4)
- Licensed to IBM CORP - FOR DEMO PURPOSES ONLY, Site 70219019.
- NOTE: This session is executing on the AIX 7.2 (AIX 64) platform.
- NOTE: Additional host information:
- IBM AIX AIX 64 2 7 00F9C48F4C00
- You are running SAS 9. Some SAS 8 files will be automatically converted by the V9 engine; others are incompatible. Please see
- http://support.sas.com/rnd/migration/planning/platform/64bit.html
- PROC MIGRATE will preserve current SAS file attributes and is
- recommended for converting all your SAS libraries from any
- SAS 8 release to SAS 9. For details and examples, please see
- http://support.sas.com/rnd/migration/index.html
- This message is contained in the SAS news file, and is presented upon initialization. Edit the file "news" in the "misc/base" directory to display site-specific news and information in the program log.
- The command line option "-nonews" will prevent this display.
- NOTE: SAS initialization used:
- real time 0.03 seconds
- cpu time 0.00 seconds
- 1 proc hadoop username='hdfs' password=XXXXXXXXXX verbose;
- 2 hdfs mkdir='/user/hdfs/sasdata';
- 3
- 4 hdfs delete='/user/hdfs/test1';
- 5
- 6 hdfs copytolocal='/user/hdfs/test/fromHDP.txt'
- 7 out='/home/np/SASCode/fromHDP.txt';
- 8
- 9 hdfs copyfromlocal='/home/np/SASCode/fromSAS.txt'
- 10 out='/user/hdfs/sasdata/fromSAS.txt';
- 11 run;
- NOTE: PROCEDURE HADOOP used (Total process time):
- real time 4.49 seconds
- cpu time 0.05 seconds
- NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
- NOTE: The SAS System used:
- real time 4.53 seconds
- cpu time 0.05 seconds
Check the contents of the local directory and of HDFS to confirm that the SAS code ran successfully.
- # hadoop fs -ls /user/hdfs/sasdata/
- Found 1 items
- -rw-r--r-- 3 hdfs hdfs 30 2017-04-22 01:28 /user/hdfs/sasdata/fromSAS.txt
- [hdfs@hdpnode1 ~]$ hadoop fs -cat /user/hdfs/sasdata/fromSAS.txt
- Hello HDP user! Good morning!
- # ls -ltr
- total 136
- -rw-r--r-- 1 root system 333 Apr 22 00:25 proc-hadoop-test.sas
- -rw-r--r-- 1 root system 23 Apr 22 00:26 fromHDP.txt
- -rw-r--r-- 1 root system 2042 Apr 22 00:26 proc-hadoop-test.log
- l48fvp038_pub[/home/np/SASCode] > cat fromHDP.txt
- Hello SAS user! Howdy?
You can run MapReduce programs from BASE SAS against data stored on HDP; the MapReduce program is invoked from BASE SAS using the PROC HADOOP procedure. For example, if you have a large data set that needs some preprocessing before the actual analysis in SAS, you can write a MapReduce program to perform that preprocessing on the HDP cluster.
The following sample SAS code copies a text file to HDFS and runs the WordCount MapReduce program.
- $ cat word-count-mr-ex.sas
- proc hadoop username='hdfs' password='xxxxxxx' verbose;
- /* Copy the text file from the client machine to HDFS */
- hdfs copyfromlocal='/home/np/SASCode/wordcount-input.txt'
- out='/user/hdfs/sasdata/wordcount-input.txt';
- /* MAPREDUCE statement to execute WordCount */
- mapreduce input='/user/hdfs/sasdata/wordcount-input.txt'
- output='/user/hdfs/sasdata/wcout'
- jar='/hadoop/lib/hadoop-mapreduce-examples-2.7.3.2.5.0.0-1245.jar'
- outputkey='org.apache.hadoop.io.Text'
- outputvalue='org.apache.hadoop.io.IntWritable'
- reduce='org.apache.hadoop.examples.WordCount$IntSumReducer'
- combine='org.apache.hadoop.examples.WordCount$IntSumReducer'
- map='org.apache.hadoop.examples.WordCount$TokenizerMapper';
- run;
- # /SASHome/SASFoundation/9.4/sas word-count-mr-ex.sas
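The WordCount job invoked above can be sketched in pure Python to show what the mapper, combiner, and reducer classes do: TokenizerMapper emits a (word, 1) pair per whitespace-separated token, and IntSumReducer sums the counts per word. This is illustrative only; the real job runs from the hadoop-mapreduce-examples JAR on the cluster.

```python
from collections import Counter

def tokenizer_map(line):
    """Mapper: emit (word, 1) for each whitespace-separated token."""
    return [(word, 1) for word in line.split()]

def int_sum_reduce(pairs):
    """Reducer/combiner: sum the counts for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Two lines of the sample input file shown later in this section:
lines = ["This PROC HADOOP example submits a MapReduce program",
         "to a Hadoop server."]
pairs = [kv for line in lines for kv in tokenizer_map(line)]
counts = int_sum_reduce(pairs)
print(counts["a"])  # -> 2
```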
Run the sample code and check the SAS log file to make sure the code ran successfully. Also check the contents of the HDFS directories to verify the input file contents and the output of the WordCount MapReduce program.
- # hadoop fs -ls /user/hdfs/sasdata/
- drwxr-xr-x - hdfs hdfs 0 2017-04-22 01:47 /user/hdfs/sasdata/wcout
- -rw-r--r-- 3 hdfs hdfs 268 2017-04-22 01:46 /user/hdfs/sasdata/wordcount-input.txt
- # hadoop fs -cat /user/hdfs/sasdata/wordcount-input.txt
- This PROC HADOOP example submits a MapReduce program to a Hadoop server. The example uses the Hadoop MapReduce
- application WordCount that reads a text input file, breaks each line into words, counts the words, and then writes
- the word counts to the output text file.
- # hadoop fs -ls /user/hdfs/sasdata/wcout
- Found 2 items
- -rw-r--r-- 3 hdfs hdfs 0 2017-04-22 01:47 /user/hdfs/sasdata/wcout/_SUCCESS
- -rw-r--r-- 3 hdfs hdfs 270 2017-04-22 01:47 /user/hdfs/sasdata/wcout/part-r-00000
- # hadoop fs -cat /user/hdfs/sasdata/wcout/part-r-00000
- HADOOP 1
- Hadoop 2
- MapReduce 2
- PROC 1
- The 1
- This 1
- WordCount 1
- a 3
- and 1
- application 1
- breaks 1
- counts 2
- each 1
- example 2
- file, 1
- file. 1
- input 1
- into 1
- line 1
- output 1
- program 1
- reads 1
- server. 1
- submits 1
- text 2
- that 1
- the 4
- then 1
- to 2
- uses 1
- word 1
- words, 2
- writes 1
Use the LIBNAME statement to access Hive tables as SAS data sets, and use the PROC SQL procedure to run Hive queries against the HDP cluster through JDBC.
See the following sample SAS code, which uses LIBNAME and PROC SQL from BASE SAS to access Hive tables, run queries, and perform analysis.
- $ cat sas-hdp-hive-access.sas
- /* LIBNAME statement to connect to HiveServer2 on the HDP cluster */
- libname myhdp hadoop server='hdpnode1.dal-ebis.ihost.com' schema='default' user=hive;
- /* List the tables */
- proc datasets lib=myhdp details;
- run;
- /* Get the schema for all tables */
- proc contents data=myhdp._all_;
- run;
- /* Run queries and analysis using the PROC SQL procedure */
- proc sql;
- select count(*) from MYHDP.TRUCK_EVENTS;
- quit;
- /* Run the PROC FREQ statistical procedure against HDFS data that is not available as a SAS data set */
- proc freq data=MYHDP.TRUCK_EVENTS;
- tables eventtype;
- run;
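To make explicit what PROC FREQ computes for `tables eventtype`, here is a minimal Python sketch producing the same columns (frequency, percent, cumulative frequency, cumulative percent) over a small made-up sample:

```python
from collections import Counter

def freq_table(values):
    """One row per distinct value, with percent and running totals."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, cum = [], 0
    for value, n in sorted(counts.items()):
        cum += n
        rows.append({"value": value, "frequency": n,
                     "percent": round(100 * n / total, 2),
                     "cum_frequency": cum,
                     "cum_percent": round(100 * cum / total, 2)})
    return rows

# Made-up event sample, much smaller than the real truck_events column:
events = ["Normal"] * 8 + ["Overspeed"] + ["Lane Departure"]
for row in freq_table(events):
    print(row)
```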
Check the SAS log file for the list of tables. The following excerpt from the log file lists the tables from the default Hive schema.
- 3 proc datasets lib=myhdp details;
- Libref MYHDP
- Engine HADOOP
- Physical Name jdbc:hive2://hdpnode1.dal-ebis.ihost.com:10000/default
- Schema/Owner default
- # Name Type or Indexes Vars Label
- 1 DRIVERS DATA . 6
- 2 EXPORT_TABLE DATA . 2
- 3 EXTENSION DATA . 3
- 4 HOSTS DATA . 16
- 5 HOSTS1 DATA . 16
- 6 HOSTS2 DATA . 16
- 7 HOSTS3 DATA . 16
- 8 TEST2 DATA . 2
- 9 TRUCK_EVENTS DATA . 11
Check the output of the statistical procedures in the listing file (the file with the .lst extension). For example, the following excerpt from the listing file shows all the table names and the details of the truck_events table.
- The SAS System 02:19 Saturday, April 22, 2017 1
- The CONTENTS Procedure
- Libref MYHDP
- Engine HADOOP
- Physical Name jdbc:hive2://hdpnode1.dal-ebis.ihost.com:10000/default
- Schema/Owner default
- # Name Type
- 1 DRIVERS DATA
- 2 EXPORT_TABLE DATA
- 3 EXTENSION DATA
- 4 HOSTS DATA
- 5 HOSTS1 DATA
- 6 HOSTS2 DATA
- 7 HOSTS3 DATA
- 8 TEST2 DATA
- 9 TRUCK_EVENTS DATA
- The CONTENTS Procedure
- Data Set Name MYHDP.TRUCK_EVENTS Observations .
- Member Type DATA Variables 11
- Engine HADOOP Indexes 0
- Created . Observation Length 0
- Last Modified . Deleted Observations 0
- Protection Compressed NO
- Data Set Type Sorted NO
- Label
- Data Representation Default
- Encoding Default
- Alphabetic List of Variables and Attributes
- # Variable Type Len Format Informat Label
- 8 correlationid Num 8 20. 20. correlationid
- 1 driverid Num 8 11. 11. driverid
- 9 drivername Char 32767 $32767. $32767. drivername
- 7 eventkey Char 32767 $32767. $32767. eventkey
- 3 eventtime Char 32767 $32767. $32767. eventtime
- 4 eventtype Char 32767 $32767. $32767. eventtype
- 6 latitude Num 8 latitude
- 5 longitude Num 8 longitude
- 10 routeid Num 8 11. 11. routeid
- 11 routename Char 32767 $32767. $32767. routename
- 2 truckid Num 8 11. 11. truckid
The listing file also shows the output of the PROC FREQ statistical procedure on the eventtype column of the truck_events table. See the following excerpt from the listing file.
- The FREQ Procedure
- eventtype
- Cumulative Cumulative
- eventtype Frequency Percent Frequency Percent
- --------------------------------------------------------------------------
- Lane Departure 11 0.06 11 0.06
- Normal 17041 99.80 17052 99.87
- Overspeed 9 0.05 17061 99.92
- Unsafe following distance 7 0.04 17068 99.96
- Unsafe tail distance 7 0.04 17075 100.00
Accessing Hive tables using explicit pass-through mode
Use the PROC SQL procedure with the CONNECT TO HADOOP and EXECUTE statements to run Hive queries in explicit pass-through mode. This bypasses JDBC and is therefore faster. This interface is helpful for reading data from Hive/HDFS as a stream.
The following sample code demonstrates this interface.
- # cat sas-hdp-hive-access-explicit-passthru.sas
- options dbidirectexec;
- proc sql;
- /* Connect to Hadoop/HiveServer2 */
- connect to hadoop (server='xxxxx.xxxxxxxx.com' user=hive subprotocol=hive2);
- /* Execute a HiveQL query to create an external Hive table */
- execute ( CREATE EXTERNAL TABLE movie_ratings (
- userid INT,
- movieid INT,
- rating INT,
- tstamp STRING
- ) ROW FORMAT DELIMITED
- FIELDS TERMINATED BY '#'
- STORED AS TEXTFILE
- LOCATION '/user/hive/sasdata/movie_ratings') by hadoop;
- disconnect from hadoop;
- quit;
- proc hadoop username='hive' password='xxxxxx' verbose;
- /* Copy data from the client to the HDFS location for the movie_ratings table */
- hdfs copyfromlocal='/home/np/SASCode/ratings.txt'
- out='/user/hive/sasdata/movie_ratings';
- run;
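The movie_ratings table above expects fields terminated by '#', while the raw MovieLens ratings file uses '::' separators. The article does not show how ratings.txt was prepared; a conversion step along these lines is one plausible way (an assumption, not taken from the source):

```python
# Assumed preprocessing: turn a MovieLens record of the form
# 'userid::movieid::rating::tstamp' into the '#'-delimited layout
# that the movie_ratings external table declares.

def to_hash_delimited(line):
    """Convert one '::'-separated ratings record to '#'-separated fields."""
    userid, movieid, rating, tstamp = line.strip().split("::")
    return "#".join([userid, movieid, rating, tstamp])

raw = "1::1193::5::978300760"
print(to_hash_delimited(raw))  # -> 1#1193#5#978300760
```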
Log in to Hive on the HDP cluster and verify that the movie_ratings table has been created and contains the data.
- # su - hive
- # hive
- hive> show tables;
- OK
- movie_ratings
- hive> desc movie_ratings;
- OK
- userid int
- movieid int
- rating int
- tstamp string
- Time taken: 0.465 seconds, Fetched: 4 row(s)
- hive> select count(*) from movie_ratings;
- Query ID = hive_20170422042813_e9e49803-144a-48e9-b0f6-f5cd8595d254
- Total jobs = 1
- Launching Job 1 out of 1
- Status: Running (Executing on YARN cluster with App id application_1492505822201_0034)
- Map 1: -/- Reducer 2: 0/1
- Map 1: 0/2 Reducer 2: 0/1
- Map 1: 0/2 Reducer 2: 0/1
- Map 1: 0(+1)/2 Reducer 2: 0/1
- Map 1: 0(+2)/2 Reducer 2: 0/1
- Map 1: 0(+2)/2 Reducer 2: 0/1
- Map 1: 2/2 Reducer 2: 0(+1)/1
- Map 1: 2/2 Reducer 2: 1/1
- OK
- 1000209
- Time taken: 14.931 seconds, Fetched: 1 row(s)
- F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
Source: http://www.ibm.com/developerworks/cn/linux/l-sas-hdp/index.html