Oozie: Workflows for Distributed Tasks

In today's big data world, Spark and Hadoop frameworks keep multiplying, and the parade of fancy compute frameworks and distributed jobs can be dizzying. Do you have this problem: you have plenty of distributed jobs that need to run at a fixed time every day, but a home-grown scheduler is neither stable nor able to send reliable notifications?

If you want to brush up on the basics of Oozie first, see the introductory material here.

Then what you are looking for is Oozie.

Oozie is an open-source framework for scheduling distributed tasks. It supports many kinds of distributed jobs, such as MapReduce, Spark, Sqoop, Pig, and even shell scripts. You can schedule them in various ways and compose them into workflows, and the nodes of a workflow can run serially or in parallel.

Once you have defined a series of tasks, you can start the workflow and set up a coordinator to run it on a schedule.
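For reference, a minimal coordinator definition might look like the sketch below; it triggers a workflow once a day (the coordinator name, dates, and app-path are placeholders, not values from this article):

<coordinator-app name="daily-wf-coord" frequency="${coord:days(1)}"
                 start="2016-11-01T01:00Z" end="2017-11-01T01:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS directory containing the workflow.xml; placeholder path -->
            <app-path>hdfs://localhost:8020/user/oozie/workflows/my-wf</app-path>
        </workflow>
    </action>
</coordinator-app>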

With all of that in place, there is still one important piece: email notification. Whether a job succeeds or fails, an email can be sent, so when the success message arrives each evening you can go to sleep with peace of mind.

So this article shows you how to use email in Oozie.

Email Action

In Oozie every step of a workflow is modeled as an action, and email is one of those actions.

The email action sends a message from within Oozie. An email action must specify the recipient address, a subject, and a body. The recipient parameter accepts multiple addresses separated by commas.

The email action runs synchronously: the action only completes, and the next action only starts, once the mail has been sent.

All parameters of the email action may use EL expressions.

Syntax

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <email xmlns="uri:oozie:email-action:0.2">
            <to>[COMMA-SEPARATED-TO-ADDRESSES]</to>
            <cc>[COMMA-SEPARATED-CC-ADDRESSES]</cc> <!-- cc is optional -->
            <subject>[SUBJECT]</subject>
            <body>[BODY]</body>
            <content_type>[CONTENT-TYPE]</content_type> <!-- content_type is optional -->
            <attachment>[COMMA-SEPARATED-HDFS-FILE-PATHS]</attachment> <!-- attachment is optional -->
        </email>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

The to and cc elements specify who receives the mail. Multiple addresses can be given, separated by commas. to is required; cc is optional.

subject and body specify the mail's title and content. email-action:0.2 supports a text/html body; the default is plain text ("text/plain").

attachment attaches an HDFS file to the mail; multiple attachments can be given, separated by commas. A path declared without a full scheme is still treated as a file in HDFS. Local files cannot be attached.
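Put together, an email action on the 0.2 schema with attachments might look like this sketch (the addresses and report paths are placeholders):

<email xmlns="uri:oozie:email-action:0.2">
    <to>ops@example.com,bi-team@example.com</to>
    <subject>Daily report for ${wf:id()}</subject>
    <body>The daily report is attached.</body>
    <attachment>/user/reports/daily.csv,/user/reports/errors.log</attachment>
</email>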

Configuration

The email action requires SMTP server settings in oozie-site.xml. The values to configure are:

oozie.email.smtp.host

The address of the SMTP server; defaults to localhost.

oozie.email.smtp.port

The port of the SMTP server; defaults to 25.

oozie.email.from.address

The from address used when sending mail; defaults to oozie@localhost.

oozie.email.smtp.auth

Whether authentication is enabled; disabled by default.

oozie.email.smtp.username

If authentication is enabled, the login user name; empty by default.

oozie.email.smtp.password

If authentication is enabled, that user's password; empty by default.
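Concretely, the SMTP block in oozie-site.xml might look like this (the host, addresses, and credentials below are placeholders):

<property>
    <name>oozie.email.smtp.host</name>
    <value>smtp.example.com</value>
</property>
<property>
    <name>oozie.email.smtp.port</name>
    <value>25</value>
</property>
<property>
    <name>oozie.email.from.address</name>
    <value>oozie@example.com</value>
</property>
<property>
    <name>oozie.email.smtp.auth</name>
    <value>true</value>
</property>
<property>
    <name>oozie.email.smtp.username</name>
    <value>oozie</value>
</property>
<property>
    <name>oozie.email.smtp.password</name>
    <value>secret</value>
</property>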

P.S. On Linux you can locate the file under the current directory with find . -name oozie-site.xml. In our CDH distribution it lives at ./etc/oozie/conf.dist/oozie-site.xml.

Example

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="an-email">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>bob@initech.com,the.other.bob@initech.com</to>
            <cc>will@initech.com</cc>
            <subject>Email notifications for ${wf:id()}</subject>
            <body>The wf ${wf:id()} successfully completed.</body>
        </email>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app>

In the example above, the mail is sent to bob and the.other.bob, with a CC to will, and the subject and body both include the workflow's id.

Appendix

To give a fuller picture of Oozie, the important Oozie configuration files are reproduced below.

oozie-site.xml

<?xml version="1.0"?>
<configuration>
    <!-- oozie-default.xml holds the default configuration -->
    <property>
        <name>oozie.service.ProxyUserService.proxyuser.hue.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>oozie.service.ProxyUserService.proxyuser.hue.groups</name>
        <value>*</value>
    </property>
</configuration>

oozie-default.xml (excerpt)

<?xml version="1.0"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with this
  work for additional information regarding copyright ownership. The ASF
  licenses this file to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance with the License.
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<configuration>

    <!-- ************************** VERY IMPORTANT  ************************** -->
    <!-- This file is in the Oozie configuration directory only for reference. -->
    <!-- It is not loaded by Oozie, Oozie uses its own private copy.            -->
    <!-- ************************** VERY IMPORTANT  ************************** -->

    <property>
        <name>oozie.output.compression.codec</name>
        <value>gz</value>
        <description>
            The name of the compression codec to use. The codec class must
            implement the interface org.apache.oozie.compression.CompressionCodec.
            If oozie.compression.codecs is not specified, the gz codec
            implementation is used by default.
        </description>
    </property>

    <property>
        <name>oozie.action.mapreduce.uber.jar.enable</name>
        <value>false</value>
        <description>
            If true, enables the oozie.mapreduce.uber.jar workflow configuration
            property, which is used to specify an uber jar in HDFS. If false,
            workflows which specify the oozie.mapreduce.uber.jar configuration
            property will fail.
        </description>
    </property>

    <property>
        <name>oozie.processing.timezone</name>
        <value>UTC</value>
        <description>
            Oozie server timezone. Valid values are UTC and GMT(+/-)####. If this
            value is changed, note that GMT(+/-)#### timezones do not observe
            DST changes.
        </description>
    </property>

    <!-- Base Oozie URL: <SCHEME>://<HOST>:<PORT>/<CONTEXT> -->

    <property>
        <name>oozie.base.url</name>
        <value>http://localhost:8080/oozie</value>
        <description>
             Base Oozie URL.
        </description>
    </property>

    <!-- Services -->

    <property>
        <name>oozie.system.id</name>
        <value>oozie-${user.name}</value>
        <description>
            The Oozie system ID.
        </description>
    </property>

    <property>
        <name>oozie.systemmode</name>
        <value>NORMAL</value>
        <description>
            System mode for Oozie at startup.
        </description>
    </property>

    <property>
        <name>oozie.delete.runtime.dir.on.shutdown</name>
        <value>true</value>
        <description>
            If the runtime directory should be kept after Oozie shuts down.
        </description>
    </property>

    <property>
        <name>oozie.services</name>
        <value>
            org.apache.oozie.service.SchedulerService,
            org.apache.oozie.service.InstrumentationService,
            org.apache.oozie.service.MemoryLocksService,
            org.apache.oozie.service.UUIDService,
            org.apache.oozie.service.ELService,
            org.apache.oozie.service.AuthorizationService,
            org.apache.oozie.service.UserGroupInformationService,
            org.apache.oozie.service.HadoopAccessorService,
            ...
        </value>
        <description>
            All services to be created and managed by the Oozie Services
            singleton. Class names must be separated by commas.
        </description>
    </property>

    <property>
        <name>oozie.service.JPAService.jdbc.password</name>
        <value>oozie</value>
        <description>
            DB user password.
            IMPORTANT: if the StoreServicePasswordService is active, it will reset
            this value with the value given in the console.
        </description>
    </property>

    <property>
        <name>oozie.service.JPAService.pool.max.active.conn</name>
        <value>10</value>
        <description>
             Max number of connections.
        </description>
    </property>

    <!-- SchemaService -->

    <property>
        <name>oozie.service.SchemaService.wf.schemas</name>
        <value>
            oozie-workflow-0.1.xsd,oozie-workflow-0.2.xsd,oozie-workflow-0.2.5.xsd,
            oozie-workflow-0.3.xsd,oozie-workflow-0.4.xsd,
            oozie-workflow-0.4.5.xsd,oozie-workflow-0.5.xsd,
            shell-action-0.1.xsd,shell-action-0.2.xsd,shell-action-0.3.xsd,
            email-action-0.1.xsd,email-action-0.2.xsd,
            hive-action-0.2.xsd,hive-action-0.3.xsd,hive-action-0.4.xsd,
            hive-action-0.5.xsd,hive-action-0.6.xsd,
            sqoop-action-0.2.xsd,sqoop-action-0.3.xsd,sqoop-action-0.4.xsd,
            ssh-action-0.1.xsd,ssh-action-0.2.xsd,
            distcp-action-0.1.xsd,distcp-action-0.2.xsd,
            oozie-sla-0.1.xsd,oozie-sla-0.2.xsd,
            hive2-action-0.1.xsd,hive2-action-0.2.xsd,
            spark-action-0.1.xsd,spark-action-0.2.xsd
        </value>
        <description>
            List of schemas for workflows (separated by commas).
        </description>
    </property>

    <property>
        <name>oozie.service.SchemaService.wf.ext.schemas</name>
        <value> </value>
        <description>
            List of additional schemas for workflows (separated by commas).
        </description>
    </property>

    <property>
        <name>oozie.service.SchemaService.coord.schemas</name>
        <value>
            oozie-coordinator-0.1.xsd,oozie-coordinator-0.2.xsd,
            oozie-coordinator-0.3.xsd,oozie-coordinator-0.4.xsd,
            oozie-sla-0.1.xsd,oozie-sla-0.2.xsd
        </value>
        <description>
            List of schemas for coordinators (separated by commas).
        </description>
    </property>

    <!-- ActionService -->

    <property>
        <name>oozie.service.ActionService.executor.classes</name>
        <value>
            org.apache.oozie.action.decision.DecisionActionExecutor,
            org.apache.oozie.action.hadoop.JavaActionExecutor,
            org.apache.oozie.action.hadoop.FsActionExecutor,
            org.apache.oozie.action.hadoop.MapReduceActionExecutor,
            org.apache.oozie.action.hadoop.PigActionExecutor,
            org.apache.oozie.action.hadoop.HiveActionExecutor,
            org.apache.oozie.action.hadoop.ShellActionExecutor,
            org.apache.oozie.action.hadoop.SqoopActionExecutor,
            org.apache.oozie.action.hadoop.DistcpActionExecutor,
            org.apache.oozie.action.hadoop.Hive2ActionExecutor,
            org.apache.oozie.action.ssh.SshActionExecutor,
            org.apache.oozie.action.oozie.SubWorkflowActionExecutor,
            org.apache.oozie.action.email.EmailActionExecutor,
            org.apache.oozie.action.hadoop.SparkActionExecutor
        </value>
        <description>
            List of ActionExecutors classes (separated by commas).
            Only action types with associated executors can be used in workflows.
        </description>
    </property>

    <property>
        <name>oozie.service.ActionService.executor.ext.classes</name>
        <value> </value>
        <description>
            List of ActionExecutors extension classes (separated by commas). Only
            action types with associated executors can be used in workflows. This
            property is a convenience property to add extensions to the built-in
            executors without having to include all the built-in ones.
        </description>
    </property>

    <!-- ActionCheckerService -->

    <property>
        <name>oozie.service.ActionCheckerService.action.check.interval</name>
        <value>60</value>
        <description>
            The frequency at which the ActionCheckService will run.
        </description>
    </property>

    <property>
        <name>oozie.service.SparkConfigurationService.spark.configurations</name>
        <value>*=spark-conf</value>
        <description>
            Comma separated AUTHORITY=SPARK_CONF_DIR, where AUTHORITY is the
            HOST:PORT of the ResourceManager of a YARN cluster. The wildcard '*'
            configuration is used when there is no exact match for an authority.
            The SPARK_CONF_DIR contains the relevant spark-defaults.conf
            properties file. If the path is relative, it is looked up within the
            Oozie configuration directory; the path can also be absolute. This is
            only used when the Spark master is set to either "yarn-client" or
            "yarn-cluster".
        </description>
    </property>

    <property>
        <name>oozie.service.SparkConfigurationService.spark.configurations.ignore.spark.yarn.jar</name>
        <value>true</value>
        <description>
            If true, Oozie will ignore the "spark.yarn.jar" property from any
            Spark configurations specified in
            oozie.service.SparkConfigurationService.spark.configurations. If
            false, Oozie will not ignore it. It is recommended to leave this as
            true because it can interfere with the jars in the Spark sharelib.
        </description>
    </property>

    <property>
        <name>oozie.email.attachment.enabled</name>
        <value>true</value>
        <description>
            This value determines whether to support email attachment of a file
            on HDFS. Set it false if there is any security concern.
        </description>
    </property>

    <property>
        <name>oozie.actions.default.name-node</name>
        <value> </value>
        <description>
            The default value to use for the &lt;name-node&gt; element in
            applicable action types. This value will be used when neither the
            action itself nor the global section specifies a &lt;name-node&gt;.
            As expected, it should be of the form "hdfs://HOST:PORT".
        </description>
    </property>

    <property>
        <name>oozie.actions.default.job-tracker</name>
        <value> </value>
        <description>
            The default value to use for the &lt;job-tracker&gt; element in
            applicable action types. This value will be used when neither the
            action itself nor the global section specifies a &lt;job-tracker&gt;.
            As expected, it should be of the form "HOST:PORT".
        </description>
    </property>

</configuration>

The previous part roughly walked through the Email action configuration; this part continues with the Shell action.

Shell Action

A shell action runs a shell command or script; the workflow waits until the shell has completely finished before moving on to the next node. To run a shell you must configure job-tracker and name-node, and set exec to the command to execute.

The shell task can take its configuration from a file referenced by job-xml, or be configured inline inside the shell action; inline settings override those in job-xml.

EL expressions work in shell actions as well.

Note that the mapred.job.tracker and fs.default.name properties must not be set directly inside a shell action.

As with MapReduce jobs, files and archives can be staged so the shell can use them. See the [WorkflowFunctionalSpec#FilesArchives][Adding Files and Archives for the Job] section for more on this.

The output of the shell can be consumed by later workflow nodes, which can use it to set key configuration values. For the shell's output to be available to the rest of the workflow, it must satisfy:

  • the output is in standard Java properties format
  • the output does not exceed 2KB

Syntax

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.3">
    ...
    <action name="[NODE-NAME]">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               <delete path="[PATH]"/>
               ...
               <mkdir path="[PATH]"/>
               ...
            </prepare>
            <job-xml>[SHELL SETTINGS FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <exec>[SHELL-COMMAND]</exec>
            <argument>[ARG-VALUE]</argument>
                ...
            <argument>[ARG-VALUE]</argument>
            <env-var>[VAR1=VALUE1]</env-var>
               ...
            <env-var>[VARN=VALUEN]</env-var>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
            <capture-output/>
        </shell>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
  • The prepare element is typically used to create, or more often delete, a set of directories before the action runs. Note that paths must be of the hdfs://host:port form.
  • The job-xml element names a configuration file for the shell task. From schema 0.2 on, multiple job-xml files may be specified.
  • The configuration element holds the shell task's inline configuration.
  • The exec element is required: it gives the path of the shell command or script to execute. Zero or more argument elements may follow it.
  • argument elements are passed as arguments to the shell script.
  • env-var elements set environment variables passed to the shell script. Each env-var must contain a single key=value pair. To reference something like $PATH, write PATH=$PATH:mypath; do not use the ${} syntax, because that would be interpreted as an Oozie EL expression.
  • A shell action is also given a Hadoop configuration, which the shell application can read directly.
  • The capture-output element tells Oozie to capture the command's standard output. The output must be in Java properties format and under 2KB; later nodes in the workflow can then read it.

All of these elements support EL expressions.

Examples

How to run a shell script or a Perl script:

<workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'>
    <start to='shell1'/>
    <action name='shell1'>
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                  <name>mapred.job.queue.name</name>
                  <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${EXEC}</exec>
            <argument>A</argument>
            <argument>B</argument>
            <file>${EXEC}#${EXEC}</file> <!--Copy the executable to compute node's current working directory -->
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end'/>
</workflow-app>

The parameters used to submit the Oozie job:

oozie.wf.application.path=hdfs://localhost:8020/user/kamrul/workflows/script
#Execute is expected to be in the Workflow directory.
#Shell Script to run
EXEC=script.sh
#CPP executable. Executable should be binary compatible to the compute node OS.
#EXEC=hello
#Perl script
#EXEC=script.pl
jobTracker=localhost:8021
nameNode=hdfs://localhost:8020
queueName=default

How to run a Java program and ship its jar:

<workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'>
    <start to='shell1'/>
    <action name='shell1'>
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                  <name>mapred.job.queue.name</name>
                  <value>${queueName}</value>
                </property>
            </configuration>
            <exec>java</exec>
            <argument>-classpath</argument>
            <argument>./${EXEC}:$CLASSPATH</argument>
            <argument>Hello</argument>
            <file>${EXEC}#${EXEC}</file> <!--Copy the jar to compute node current working directory -->
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end'/>
</workflow-app>

The submission parameters:

oozie.wf.application.path=hdfs://localhost:8020/user/kamrul/workflows/script
#Hello.jar file is expected to be in the Workflow directory.
EXEC=Hello.jar
jobTracker=localhost:8021
nameNode=hdfs://localhost:8020
queueName=default

Shell action logging

The shell action's stdout and stderr are redirected to the console of Oozie's MapReduce launcher task.

You can also view its execution log through the Oozie web console.

Limitations of the shell action

Although the shell action can run almost any script or command, there are some limitations:

  • Interactive commands are not supported
  • Commands cannot be run as a different user
  • The user must strictly manage any jars that need to be uploaded; Oozie ships them to the distributed cache
  • Although Oozie executes the shell command on a Hadoop compute node, some tools installed by default elsewhere may be missing there, so you need to know which commands are actually available on the compute nodes

Hands-on notes

A shell script can emit data in Java properties format which, combined with EL expressions, other actions can consume. That makes it useful as a workflow initialization step or a lightweight configuration service.

For example, in the script:

#!/bin/sh
a=1
b=2
echo "a=$a"
echo "b=$b"

Other nodes can then read these values through EL expressions.
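For instance, if the script above runs in a shell action named init with <capture-output/> enabled, a later node could read the captured values via the wf:actionData EL function (the node and property names here are illustrative, not from the original text):

<action name="init">
    <shell xmlns="uri:oozie:shell-action:0.1">
        ...
        <exec>init.sh</exec>
        <file>init.sh#init.sh</file>
        <capture-output/>
    </shell>
    <ok to="use-values"/>
    <error to="fail"/>
</action>

<!-- in a later action's configuration -->
<property>
    <name>my.prop.a</name>
    <value>${wf:actionData('init')['a']}</value>
</property>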

The Sqoop action is probably the most commonly used one in Oozie, because a lot of BI analysis is built on top of operational databases: data is imported from MySQL or Oracle into HDFS, then ETL'd with MapReduce or Spark to produce reports.

So this part's Sqoop action is simply about running a Sqoop job.

As usual, the action waits until Sqoop has finished successfully before the next action runs. To run a sqoop action, you must provide job-tracker, name-node, and either a command or arg elements.

A sqoop action can also create or delete HDFS directories before the job starts.

The sqoop action's configuration can come from a file referenced by job-xml, or be set directly in the configuration element.

Syntax

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               <delete path="[PATH]"/>
               ...
               <mkdir path="[PATH]"/>
               ...
            </prepare>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <command>[SQOOP-COMMAND]</command>
            <arg>[SQOOP-ARGUMENT]</arg>
            ...
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </sqoop>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
  • The prepare element creates or deletes the given HDFS directories.
  • job-xml can supply a configuration file for the sqoop action.
  • configuration sets the sqoop job's configuration inline.

Sqoop command

The sqoop command line is built from either a command element or a series of arg elements.

When the command element is used, Oozie splits its content on whitespace to form the arguments; as a consequence, you cannot use command when the invocation contains a --query string!

When arg elements are used, each arg is passed as exactly one argument.

All of the argument parts can use EL expressions.

Examples

A command-based example:

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirsthivejob">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <prepare>
                <delete path="${jobOutput}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <command>import  --connect jdbc:hsqldb:file:db.hsqldb --table TT --target-dir hdfs://localhost:8020/user/tucu/foo -m 1</command>
        </sqoop>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app>

An arg-based example:

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirsthivejob">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <prepare>
                <delete path="${jobOutput}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:hsqldb:file:db.hsqldb</arg>
            <arg>--table</arg>
            <arg>TT</arg>
            <arg>--target-dir</arg>
            <arg>hdfs://localhost:8020/user/tucu/foo</arg>
            <arg>-m</arg>
            <arg>1</arg>
        </sqoop>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app>

Common problems

A frequent situation: a sqoop command works when run directly, but fails under Oozie. Troubleshoot along these lines:

  • 1. Check whether the libraries in Oozie match those of your Sqoop install; comparing sqoop/lib with oozie/lib/xxx/sqoop is enough
  • 2. If the job is launched with arg elements, the problem is very likely the table alias in the --query together with the --split-by parameter, because standalone Sqoop can infer which table a column belongs to, while under Oozie it cannot

For example:

sqoop import .... --query "select a.*,b.* from t1 a left join t2 b on a.id=b.id..." --split-by id ...

Here Oozie cannot tell which table id belongs to; you have to qualify it with the table alias:

...
<arg>--split-by</arg>
<arg>a.id</arg>
...

Spark is currently the most widely used distributed computing framework, and Oozie can run Spark jobs as part of its schedules. Part of my day-to-day work is maintaining the daily offline Spark jobs on top of Oozie; a well-designed workflow with sensible parameters is essential for Spark to run stably.

Spark Action

This action runs a Spark job; the user needs to specify job-tracker and name-node. First, the syntax:

Syntax

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.3">
    ...
    <action name="[NODE-NAME]">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               <delete path="[PATH]"/>
               ...
               <mkdir path="[PATH]"/>
               ...
            </prepare>
            <job-xml>[SPARK SETTINGS FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <master>[SPARK MASTER URL]</master>
            <mode>[SPARK MODE]</mode>
            <name>[SPARK JOB NAME]</name>
            <class>[SPARK MAIN CLASS]</class>
            <jar>[SPARK DEPENDENCIES JAR / PYTHON FILE]</jar>
            <spark-opts>[SPARK-OPTIONS]</spark-opts>
            <arg>[ARG-VALUE]</arg>
                ...
            <arg>[ARG-VALUE]</arg>
            ...
        </spark>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

The prepare element

It can delete files or create directories before the job runs, for example:

<delete path="hdfs://xxxx/a"/>
<mkdir path="hdfs://xxxx"/>

Generally an offline Spark job ends up producing some data, which may be written to a database or straight to HDFS; in the HDFS case you have to deal with clearing the output directory. For instance, in a test environment where you rerun the Spark job over and over, you need to delete the old output directory and create a fresh one on every run.

job-xml

Spark job parameters can also be placed in the file referenced by job-xml.

configuration

Parameters configured here are passed on to the Spark job.

master

The cluster manager Spark connects to. It can be a Spark standalone cluster (spark://host:port), Mesos (mesos://host:port), YARN (yarn), or local mode (local).

mode

A Spark job can be viewed as a driver plus workers, where the driver is the coordinating program. The driver can run either on the machine that submits the job or inside the cluster.

This parameter chooses between the two: client runs the driver on the local submitting machine, cluster runs it inside the cluster.

name

The name of the Spark application.

class

The main class of the Spark application.

jar

The jar containing the Spark application.

spark-opts

Options passed to the driver, e.g. --conf key=value, or whatever is configured in oozie-site.xml under oozie.service.SparkConfigurationService.spark.configurations.

arg

Arguments passed to the Spark application itself.

Examples

The example from the official documentation:

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirstsparkjob">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <prepare>
                <delete path="${jobOutput}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <master>local[*]</master>
            <mode>client</mode>
            <name>Spark Example</name>
            <class>org.apache.spark.examples.mllib.JavaALS</class>
            <jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
            <spark-opts>--executor-memory 20G --num-executors 50</spark-opts>
            <arg>inputpath=hdfs://localhost/input/file.txt</arg>
            <arg>value=2</arg>
        </spark>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app>

An example from my own work:

<action name="NODE1">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn</master>
        <mode>cluster</mode>
        <name>NODE1</name>
        <class>com.test.NODE1_Task</class>
        <jar>/lib/dw.jar</jar>
        <spark-opts>--executor-memory 1G --num-executors 6 --executor-cores 1 --conf spark.storage.memoryFraction=0.8</spark-opts>
        <arg>arg1</arg>
        <arg>arg2</arg>
        <arg>arg3</arg>
    </spark>
</action>
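To launch a workflow like this, a job.properties along the lines below, submitted through the oozie CLI, would be typical (the host names, paths, and application directory are placeholders, not the actual values from my job):

# job.properties (placeholder values)
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
# make the Spark sharelib jars available to the action
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/oozie/workflows/dw

oozie job -oozie http://localhost:11000/oozie -config job.properties -run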

Logging

The spark action's logs are redirected to the stdout/stderr of Oozie's MapReduce launcher job.

Through the Oozie web console, you can also see the Spark logs.

Spark on YARN

To run Spark on YARN, follow these steps:

  • load the spark-assembly jar in the spark action
  • set master to yarn-client or yarn-cluster

To make sure a Spark job shows up in the Spark history server, the following three parameters must be set, either via --conf or via oozie.service.SparkConfigurationService.spark.configurations in oozie-site.xml, as shown in the sketch after this list:

  • spark.yarn.historyServer.address=http://spark-host:18088
  • spark.eventLog.dir=hdfs://NN:8020/user/spark/applicationHistory
  • spark.eventLog.enabled=true
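For example, passed through spark-opts (the host and path values are the placeholders from the list above):

<spark-opts>--conf spark.yarn.historyServer.address=http://spark-host:18088 --conf spark.eventLog.dir=hdfs://NN:8020/user/spark/applicationHistory --conf spark.eventLog.enabled=true</spark-opts>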

The spark action schema:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:spark="uri:oozie:spark-action:0.1" elementFormDefault="qualified"
           targetNamespace="uri:oozie:spark-action:0.1">
    <xs:element name="spark" type="spark:ACTION"/>
    <xs:complexType name="ACTION">
        <xs:sequence>
            <xs:element name="job-tracker" type="xs:string" minOccurs="1" maxOccurs="1"/>
            <xs:element name="name-node" type="xs:string" minOccurs="1" maxOccurs="1"/>
            <xs:element name="prepare" type="spark:PREPARE" minOccurs="0" maxOccurs="1"/>
            <xs:element name="job-xml" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="configuration" type="spark:CONFIGURATION" minOccurs="0" maxOccurs="1"/>
            <xs:element name="master" type="xs:string" minOccurs="1" maxOccurs="1"/>
            <xs:element name="mode" type="xs:string" minOccurs="0" maxOccurs="1"/>
            <xs:element name="name" type="xs:string" minOccurs="1" maxOccurs="1"/>
            <xs:element name="class" type="xs:string" minOccurs="0" maxOccurs="1"/>
            <xs:element name="jar" type="xs:string" minOccurs="1" maxOccurs="1"/>
            <xs:element name="spark-opts" type="xs:string" minOccurs="0" maxOccurs="1"/>
            <xs:element name="arg" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="CONFIGURATION">
        <xs:sequence>
            <xs:element name="property" minOccurs="1" maxOccurs="unbounded">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="name" minOccurs="1" maxOccurs="1" type="xs:string"/>
                        <xs:element name="value" minOccurs="1" maxOccurs="1" type="xs:string"/>
                        <xs:element name="description" minOccurs="0" maxOccurs="1" type="xs:string"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="PREPARE">
        <xs:sequence>
            <xs:element name="delete" type="spark:DELETE" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="mkdir" type="spark:MKDIR" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="DELETE">
        <xs:attribute name="path" type="xs:string" use="required"/>
    </xs:complexType>
    <xs:complexType name="MKDIR">
        <xs:attribute name="path" type="xs:string" use="required"/>
    </xs:complexType>
</xs:schema>
