Setting up a cluster in Spark on YARN mode requires installing and configuring both Hadoop and Spark and making sure they work together. The detailed steps, with code examples, are as follows:
1. System Preparation

- Operating system: CentOS or Ubuntu is recommended.
- Java environment: install JDK 1.8 or later:

```bash
sudo apt update
sudo apt install openjdk-8-jdk
```

- Scala environment (optional, depending on your needs):

```bash
sudo apt install scala
```
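A quick version check confirms both tools are on the PATH before moving on (the exact version strings printed vary by distribution):

```bash
# Verify the installations; JDK 8 reports itself as version 1.8.x
java -version
scala -version   # only if Scala was installed
```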
2. Install Hadoop

- Download and extract Hadoop:

```bash
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xzvf hadoop-3.3.4.tar.gz -C /opt/
mv /opt/hadoop-3.3.4 /opt/hadoop
```
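It is also convenient (though the steps below use absolute paths, so it is not strictly required) to export HADOOP_HOME and put the Hadoop binaries on the PATH. A minimal sketch, assuming a bash login shell:

```bash
# Append Hadoop environment variables to the current user's shell profile
cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
source ~/.bashrc
```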
- Configure Hadoop:
  - hadoop-env.sh: set the Java environment variable.

```bash
vim /opt/hadoop/etc/hadoop/hadoop-env.sh
# add this line to the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```
  - core-site.xml: configure the HDFS default filesystem.

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```
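The value above assumes the hostname master resolves on every node. If there is no DNS, a mapping in /etc/hosts on each machine does the job; the IP addresses below are placeholders to replace with your own:

```
# /etc/hosts on every node (example addresses)
192.168.1.10  master
192.168.1.11  slave1
192.168.1.12  slave2
```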
  - hdfs-site.xml: configure HDFS replication (storage locations can be pinned down as well; see the sketch after this block).

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```
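If you want to control where NameNode metadata and DataNode blocks are stored rather than relying on the defaults under hadoop.tmp.dir, the standard properties are dfs.namenode.name.dir and dfs.datanode.data.dir; the paths below are illustrative assumptions, not mandated by this guide:

```xml
<!-- Optional: explicit storage locations (paths are illustrative) -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/opt/hadoop/data/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/opt/hadoop/data/datanode</value>
</property>
```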
  - yarn-site.xml: configure YARN. Disabling the physical- and virtual-memory checks keeps YARN from killing Spark containers that briefly exceed their estimated memory.

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>
```
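On machines with limited RAM it can also help to state explicitly how much memory YARN may hand out to containers; the figures below are placeholders to adapt to your hardware:

```xml
<!-- Optional: cap container memory (values are illustrative) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
```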
  - workers: add the worker-node hostnames (in Hadoop 3.x this file is named workers; slaves is the Hadoop 2.x name).

```bash
vim /opt/hadoop/etc/hadoop/workers
# file contents:
slave1
slave2
```
- Start Hadoop:

```bash
/opt/hadoop/bin/hdfs namenode -format
/opt/hadoop/sbin/start-dfs.sh
/opt/hadoop/sbin/start-yarn.sh
```
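Before moving on to Spark, it is worth confirming that HDFS and YARN actually came up; jps and dfsadmin are the usual quick checks (the daemon list below is what a healthy master typically shows):

```bash
# On the master: expect NameNode, SecondaryNameNode and ResourceManager
jps
# Report live DataNodes and cluster capacity
/opt/hadoop/bin/hdfs dfsadmin -report
```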
3. Install Spark

- Download and extract Spark:

```bash
wget https://downloads.apache.org/spark/spark-3.2.4/spark-3.2.4-bin-hadoop3.2.tgz
tar -xzvf spark-3.2.4-bin-hadoop3.2.tgz -C /opt/
mv /opt/spark-3.2.4-bin-hadoop3.2 /opt/spark
```
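As with Hadoop, adding Spark to the PATH avoids typing absolute paths later; a minimal sketch, again assuming a bash shell:

```bash
cat >> ~/.bashrc <<'EOF'
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
EOF
source ~/.bashrc
```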
- Configure Spark:
  - spark-env.sh: set environment variables.

```bash
cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh
vim /opt/spark/conf/spark-env.sh
# add these lines:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
```
  - spark-defaults.conf: enable event logging.

```bash
cp /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf
vim /opt/spark/conf/spark-defaults.conf
# add these lines:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master:9000/spark/eventLogs
spark.yarn.historyServer.address master:18080
```
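The event-log directory must exist in HDFS before the first application writes to it, and the history server has to be running for the address above to serve anything:

```bash
# Create the event-log directory in HDFS
/opt/hadoop/bin/hdfs dfs -mkdir -p /spark/eventLogs
# Start the Spark history server (web UI on port 18080 by default)
/opt/spark/sbin/start-history-server.sh
```

Note that the history server reads spark.history.fs.logDirectory, so for its UI to show these logs that property should also be set to hdfs://master:9000/spark/eventLogs in spark-defaults.conf.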
  - workers: add the worker-node hostnames (Spark 3.x names this file workers; slaves is the Spark 2.x name). YARN mode itself does not read this file, since YARN launches the executors, but it is used by Spark's standalone launch scripts and keeping it consistent does no harm.

```bash
cp /opt/spark/conf/workers.template /opt/spark/conf/workers
vim /opt/spark/conf/workers
# file contents:
slave1
slave2
```
- Distribute the configuration: copy the Spark configuration from the master to every worker node (this assumes Spark has already been unpacked to /opt/spark on each worker).

```bash
scp -r /opt/spark/conf root@slave1:/opt/spark/
scp -r /opt/spark/conf root@slave2:/opt/spark/
```
- Start Spark: on the master node, launch a Spark shell against YARN.

```bash
spark-shell --master yarn --deploy-mode client
```
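Once the shell is up, a one-line job confirms that executors are really being scheduled by YARN (spark and sc are the SparkSession and SparkContext the shell creates automatically):

```scala
// Run inside spark-shell: a trivial distributed job
spark.range(1000000).count()   // should return 1000000
sc.master                      // should print "yarn"
```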
- Test the cluster: submit an example job.

```bash
spark-submit --master yarn --deploy-mode cluster /opt/spark/examples/src/main/python/pi.py 1000
```
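In cluster deploy mode the driver runs inside a YARN container, so its output lands in the container logs rather than your terminal; yarn logs retrieves it (the application ID below is a placeholder for the one spark-submit prints):

```bash
# <application_id> is hypothetical -- substitute the ID printed by spark-submit,
# e.g. application_1690000000000_0001
yarn logs -applicationId <application_id> | grep -i "pi is roughly"
```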
- Verify the cluster: open the YARN web UI (http://master:8088) to check how the application ran. In Hadoop 3.x the HDFS NameNode UI is available at http://master:9870 if you also want to confirm HDFS health.