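A SaltStack master often has many minions connected to it. The program below reports which minions executed a job successfully, which failed, and which did not return.
How the script works:
1. The script first prints the job ID, the command name, and other details of the job. It then prints the execution process and results, which match what you see when running the salt command by hand. Finally it prints every minion that failed or did not return, and runs the job on those minions once more. Note that I did not examine the individual failure scenarios, so the detection of failed jobs is crude. A minion that is not connected is also counted as "not returned" and is re-run, although the job should not be re-run on a minion that is not connected.
2. The program first forks a child process to run the salt command. Once the salt command has finished, the program re-runs the job on the minions that failed or did not return.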
3. The script:

    import os
    import re
    import signal
    import sys
    import time

    import salt.utils.event

    # print() is always called with one argument, so it behaves the same on Python 2 and 3.


    def single_handler(target):
        # Replace the current (child) process with a salt run limited to one minion.
        try:
            os.execl('/usr/bin/salt', 'salt', target, 'state.sls', 'os')
        except OSError:
            print('exec failed for ' + target)
            os._exit(1)  # a child must never fall back into the parent's loop


    def rerun(targets):
        # Fork one child per minion and run the job there again.
        for item in targets:
            sys.stdout.flush()  # do not let the child inherit buffered output
            try:
                pid = os.fork()
                if pid == 0:
                    single_handler(item)
                print('fork ok ' + item)
            except OSError:
                print('fork failed for ' + item)


    def handler(signum, frame):
        # Runs on SIGCHLD, i.e. once the initial salt command has exited.
        print('We are in signal handler')
        print('Job Not Ret: ' + str(record.get(jid, [])))
        print('Job Failed: ' + str(failedrecord.get(jid, [])))
        print('all done...')
        rerun(failedrecord.get(jid, []))
        rerun(record.get(jid, []))
        sys.stdout.flush()
        os._exit(0)


    # Send everything written to stdout/stderr, including salt's own output, to /tmp/record.
    fd = open('/tmp/record', 'w+')
    os.dup2(fd.fileno(), sys.stdout.fileno())
    os.dup2(fd.fileno(), sys.stderr.fileno())

    record = {}        # jid -> minions that have not returned yet
    failedrecord = {}  # jid -> minions that returned with success == False
    jid = 0
    signal.signal(signal.SIGCHLD, handler)

    # One process executes the command after a short sleep,
    # so that the other process is already listening for events.
    try:
        pid = os.fork()
        if pid == 0:
            time.sleep(2)
            try:
                os.execl('/usr/bin/salt', 'salt', '*', 'state.sls', 'os')
            except OSError:
                print('exec error!')
                os._exit(1)
    except OSError:
        print('first fork error!')
        os._exit(1)

    event = salt.utils.event.MasterEvent('/var/run/salt/master')
    flag = False  # becomes True once the "new" event of our job has been seen
    reg = re.compile('salt/job/([0-9]+)/new')
    reg1 = reg
    # Listening this way also makes it possible to filter events by function name.
    for eachevent in event.iter_events(tag='salt/job', full=True):
        eachevent = dict(eachevent)
        data = dict(eachevent['data'])
        result = reg.findall(eachevent['tag'])
        if not flag and result:
            flag = True
            jid = result[0]
            print('job_id: ' + jid)
            print('Command: ' + data['fun'] + ' ' + str(data['arg']))
            print('RunAs: ' + data['user'])
            print('exec_time: ' + data['_stamp'])
            print('host_list: ' + str(data['minions']))
            sys.stdout.flush()
            record[jid] = data['minions']
            failedrecord[jid] = []
            # Minion IDs are not always IP addresses, so accept anything after /ret/.
            reg1 = re.compile('salt/job/' + jid + '/ret/(.+)$')
        else:
            result = reg1.findall(eachevent['tag'])
            if result:
                if result[0] in record[jid]:
                    record[jid].remove(result[0])
                # A return without a "success" field is treated as a failure.
                if not data.get('success', False):
                    failedrecord[jid].append(result[0])

    # Earlier version, kept for reference: re-run sequentially with os.system
    # instead of fork/exec.
    # for item in failedrecord[jid] + record[jid]:
    #     os.system('salt ' + str(item) + ' state.sls os')
    # os._exit(0)
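The bookkeeping in the event loop can be tried out without a running master. The following is a minimal sketch, not the script itself: `track_job` is a hypothetical helper, and `sample` is hand-written data shaped like the `tag`/`data` pairs the script reads from the event bus (the job ID and addresses are borrowed from this post, and the failure of 172.18.1.212 is made up for illustration).

```python
import re

# Tag of the event that announces a new job and lists the targeted minions.
NEW_TAG = re.compile(r"^salt/job/(\d+)/new$")


def track_job(events):
    """Return (jid, not_returned, failed) for the first job seen in events.

    events is an iterable of dicts with 'tag' and 'data' keys.
    """
    jid = None
    pending = []  # minions that have not returned yet
    failed = []   # minions that returned with success == False
    for event in events:
        tag, data = event["tag"], event["data"]
        if jid is None:
            match = NEW_TAG.match(tag)
            if match:
                jid = match.group(1)
                pending = list(data["minions"])
            continue
        prefix = "salt/job/%s/ret/" % jid
        if not tag.startswith(prefix):
            continue  # the event belongs to another job
        minion = tag[len(prefix):]
        if minion in pending:
            pending.remove(minion)
        if not data.get("success", False):
            failed.append(minion)
    return jid, pending, failed


# Hand-written sample events (assumption: real events carry more fields).
sample = [
    {"tag": "salt/job/20151208025319005896/new",
     "data": {"minions": ["172.18.1.211", "172.18.1.212", "172.18.1.214"]}},
    {"tag": "salt/job/20151208025319005896/ret/172.18.1.211",
     "data": {"success": True}},
    {"tag": "salt/job/20151208025319005896/ret/172.18.1.212",
     "data": {"success": False}},
]
print(track_job(sample))
```

With this sample, 172.18.1.214 stays in the not-returned list and 172.18.1.212 ends up in the failed list. Taking the minion ID from the tag suffix rather than from an IP-only pattern means hostnames work as well.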
Execution result:
- job_id: 20151208025319005896
- Command: state.sls ['os']
- RunAs: root
- exec_time: 2015-12-08T02:53:19.006284
- host_list: ['172.18.1.212', '172.18.1.214', '172.18.1.213', '172.18.1.211']
- 172.18.1.213:
- ----------
- ID: configfilecopy
- Function: file.managed
- Name: /root/node3
- Result: True
- Comment: File /root/node3 is in the correct state
- Started: 02:53:19.314015
- Duration: 13.033 ms
- Changes:
- ----------
- ID: commonfile
- Function: file.managed
- Name: /root/commonfile
- Result: True
- Comment: File /root/commonfile is in the correct state
- Started: 02:53:19.327173
- Duration: 1.993 ms
- Changes:
- Summary
- ------------
- Succeeded: 2
- Failed: 0
- ------------
- Total states run: 2
- 172.18.1.212:
- ----------
- ID: configfilecopy
- Function: file.managed
- Name: /root/node2
- Result: True
- Comment: File /root/node2 is in the correct state
- Started: 02:53:19.337325
- Duration: 8.327 ms
- Changes:
- ----------
- ID: commonfile
- Function: file.managed
- Name: /root/commonfile
- Result: True
- Comment: File /root/commonfile is in the correct state
- Started: 02:53:19.345787
- Duration: 1.996 ms
- Changes:
- Summary
- ------------
- Succeeded: 2
- Failed: 0
- ------------
- Total states run: 2
- 172.18.1.211:
- ----------
- ID: configfilecopy
- Function: file.managed
- Name: /root/node1
- Result: True
- Comment: File /root/node1 is in the correct state
- Started: 02:53:19.345017
- Duration: 12.741 ms
- Changes:
- ----------
- ID: commonfile
- Function: file.managed
- Name: /root/commonfile
- Result: True
- Comment: File /root/commonfile is in the correct state
- Started: 02:53:19.357873
- Duration: 1.948 ms
- Changes:
- Summary
- ------------
- Succeeded: 2
- Failed: 0
- ------------
- Total states run: 2
- 172.18.1.214:
- Minion did not return. [Not connected]
- We are in signal handler
- Job Not Ret: ['172.18.1.214']
- Job Failed: []
- all done...
- fork ok 172.18.1.214
- 172.18.1.214:
- Minion did not return. [Not connected]
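The run above retries 172.18.1.214 even though it is reported as not connected, which is the limitation mentioned in the description. One way to avoid that is to filter the retry list against the minions known to be connected. The sketch below is only an illustration: `split_retry_targets` is a hypothetical helper, and the `connected` list is assumed to come from somewhere else, for example from the output of `salt-run manage.up`.

```python
def split_retry_targets(not_returned, failed, connected):
    """Split unfinished minions into those worth retrying and those to skip.

    not_returned, failed: lists of minion IDs, as collected by the script.
    connected: minion IDs currently known to be connected (assumed input).
    """
    connected = set(connected)
    retry, skipped = [], []
    for minion in list(failed) + list(not_returned):
        if minion in retry or minion in skipped:
            continue  # queue every minion at most once
        if minion in connected:
            retry.append(minion)
        else:
            skipped.append(minion)
    return retry, skipped


retry, skipped = split_retry_targets(
    not_returned=["172.18.1.214"],
    failed=[],
    connected=["172.18.1.211", "172.18.1.212", "172.18.1.213"],
)
print("retry:", retry)
print("skipped (not connected):", skipped)
```

Here the retry list is empty and 172.18.1.214 is only reported, so no second salt run would be started for it.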
Source: http://www.bubuko.com/infodetail-2656918.html