High Availability in Workload Automation By Jared Dahl
When designing a software system, most of the work focuses on the functional areas of the product, such as the user interface, system storage, and processes running in the background. The possibility of long-term downtime (due to hardware, operating system, or software failure) often gets neglected. IT management has been aware of the cost of unplanned downtime for many years: IBM estimated that in 1996, $4.54 billion in productivity and revenues were lost due to systems being unavailable.[1]
High availability is a key way to prevent such downtime. The concept is a design and implementation approach that ensures a software system stays available to its users even when there are hardware or software problems.
Figure: The Risk of Relying on Backups (processes ordered from high risk to low risk)
HIGH RISK
• Credit card processing & other financial processes
• Inventory pulls
• Payroll processing
• ERP system
• Database housekeeping
• Daily reports
• Data warehouse refresh
• Backups from staff desktop PCs & laptops
LOW RISK

Imperfect Backups and the Redundancy Solution
For years, the standard approach was to back up a system at some accepted interval and, if the system failed, to restore the backup. While this approach works well for non-critical systems, it can spell disaster for systems that are integral to an organization’s operations. For example, if Kay in accounting discovers her PC is fried after a power surge, restoring a backup during the next business day is probably fine. But that schedule would not be “fine” for Amazon.com if their credit card processing server failed at 6:00 p.m. the week before Christmas. Amazon could lose millions of dollars in 30 minutes of downtime.
To prevent this problem, we turn to the engineering technique known as redundancy. Put simply, redundancy means designing a system with backup components ready to replace the main component should it fail. Safety-critical systems are frequently designed with multiple layers of redundancy: aircraft have fly-by-wire and hydraulic controls, large ships have twin propellers and motors, and parachutes always have a backup. They, like all redundant systems, are designed to eliminate single points of failure. In the case of software systems, this generally means installing and running the same software on at least two different computers and sharing the user’s setup and data between them. That way, if either computer fails, the other can be configured to continue operating with minimal human intervention.

[1] IBM Global Services, Improving systems availability, IBM Global Services, 1998

Fast. Easy. Automate.
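The benefit of a second server can be estimated with simple arithmetic. The sketch below is illustrative only: the 99% availability figure is a made-up example, the function names are hypothetical, and the calculation assumes the two computers fail independently and that switching over takes no time, which real systems only approximate.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant servers, assuming independent failures."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one server")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 0.99 is an assumed, illustrative availability for one server.
    for copies in (1, 2):
        availability = combined_availability(0.99, copies)
        hours = expected_downtime_hours(availability)
        print(f"{copies} server(s): {availability:.4%} available, "
              f"about {hours:.1f} hours down per year")
```

Under these assumptions, adding one redundant copy turns a 1% chance of being unavailable into a 0.01% chance, which is why removing the single point of failure matters more than improving any one machine.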
High Availability in Workload Automation
For workload automation software, offering a high availability solution is very important. Having a single point of control for the scheduling and setup of your enterprise workload is a distinct advantage. It allows users to schedule, prioritize, and coordinate work across a multitude of different computer architectures, operating systems, and platforms. They also can submit work to multiple target agents, schedule in a platform-agnostic way, and view the results of the work in a standardized fashion.
However, this architecture has a drawback. By centralizing control into a single enterprise scheduler, IT runs the risk of creating a single point of failure for all their business processes. If the scheduling server is not operational, jobs will not be submitted to agents, users cannot edit setup or view history in the user interface, and there is no way to control the agents. For this reason, workload automation vendors are starting to include various high availability features in their products. Most of these features are very complex to set up, which limits a user’s ability to properly test them before purchasing.
High Availability with Skybot Scheduler™
Skybot Scheduler’s high availability feature (version 3.0 and higher) introduces the ability to create a redundant Skybot Scheduler server on another computer. This system, known as the standby server, mirrors the setup and history information of a single master server and, with a simple command, can step into the role of the master server. Recovery after a system failure is significantly easier with this setup.
Consider first what recovery looks like without a standby server. Suppose a system administrator backs up the scheduling server daily, and it fails overnight due to a faulty disk. When the failure is discovered, the admin must:
• Replace the disk and reinstall the OS and scheduler software.
• Recover the backup of the enterprise scheduler and apply it to the new hardware.
• Correct any software licensing issues.
• Direct all the agents to connect to the new server.
• Go through the schedule to figure out which jobs haven’t run since the last backup.
The last step is by far the most complex and error-prone, since the administrator has to manually check the job forecast, the job history, and the systems that run the schedule to see what work has been done. It also requires an understanding of the business side of the schedule. Re-running an inventory job or point-of-sale process that already completed successfully could cause confusion across other departments. It may be equally devastating if the jobs don’t run at all.
Contrast this with the recovery steps for the master and standby server in Skybot Scheduler:
• The standby server detects the absence of the master server and notifies the admin.
• The admin activates the standby server by remotely running a single command.
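The first step above, detecting that the master has gone quiet, is commonly built on a heartbeat timeout. The source does not describe how Skybot Scheduler implements detection, so the following is a generic sketch rather than the product's mechanism; the class name, the timeout value, and the notification callback are all hypothetical. Activation is deliberately left to the administrator, as in the steps above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HeartbeatMonitor:
    """Hypothetical standby-side watcher for a master server's heartbeat."""
    timeout_seconds: float
    notify: Callable[[str], None]               # e.g. send an email or page
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    master_down: bool = False
    last_seen: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        # Treat the moment monitoring starts as the first heartbeat.
        self.last_seen = self.clock()

    def record_heartbeat(self) -> None:
        """Call whenever a heartbeat message arrives from the master."""
        self.last_seen = self.clock()
        self.master_down = False

    def check(self) -> bool:
        """Return True if the master is considered down; notify only once."""
        silent_for = self.clock() - self.last_seen
        if silent_for > self.timeout_seconds and not self.master_down:
            self.master_down = True
            self.notify(
                f"Master silent for {silent_for:.0f}s; "
                "standby is ready to be activated."
            )
        return self.master_down


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=0.2, notify=print)
    time.sleep(0.3)   # simulate a master that has stopped responding
    monitor.check()   # prints the notification once
```

The timeout is a trade-off: a short value detects failures quickly but risks false alarms during brief network interruptions, which is one reason to keep a human in the loop for the activation step.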
Figure: Failed Backup Recovery Procedure
PROBLEM: Faulty disk

MANUAL
• Replace the disk.
• Reinstall OS and scheduling software.
• Restore most recent backup of the scheduler’s database.
• Does the software require a new license? YES: Get new license from vendor. NO: Go to the next step.
• Connect all agents to new server.
• Determine which jobs still need to run.
• Check systems, job history, & forecast for missed jobs. Do you have any? YES: Check forecast for available resources, then run necessary missed jobs. NO: Go to the next step.
• Run new backup at earliest available time.

AUTOMATIC
• Standby server notifies admin that master server is down.
• Activate standby server.
  • The standby server calculates & manages missed jobs
  • Agents point to standby
  • Agents report the status of jobs in progress
  • Scheduling & submitting of jobs continues from exact point of failure
The admin’s work is complete, but Skybot Scheduler on the standby server will be busy:
• All missed jobs are automatically calculated and managed according to rules provided by the admin (which can include simply placing them in a “bucket” for the admin to look at later).
• Scheduling and submitting of jobs picks up where it left off on the master.
• All agents automatically switch to the standby server and report the status of jobs that were in progress during the master failure.
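The missed-job handling described above can be pictured as comparing the forecast against the replicated history for the outage window, then applying a per-job rule. This is a minimal sketch under stated assumptions, not Skybot Scheduler's actual logic or API: only the "bucket" (hold) rule comes from the text, while the `RUN_NOW` and `SKIP` rules and every name in the code are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Dict, Iterable, List, Set, Tuple


class MissedJobRule(Enum):
    RUN_NOW = "run_now"  # assumed rule: submit the job immediately
    SKIP = "skip"        # assumed rule: ignore the missed run
    HOLD = "hold"        # the "bucket" from the text: leave for admin review


@dataclass(frozen=True)
class ScheduledRun:
    job: str
    due: datetime


def triage_missed_runs(
    forecast: Iterable[ScheduledRun],
    completed: Set[Tuple[str, datetime]],
    failed_at: datetime,
    recovered_at: datetime,
    rules: Dict[str, MissedJobRule],
    default_rule: MissedJobRule = MissedJobRule.HOLD,
) -> Dict[MissedJobRule, List[ScheduledRun]]:
    """Group runs that were due during the outage by the rule that applies."""
    outcome: Dict[MissedJobRule, List[ScheduledRun]] = {
        rule: [] for rule in MissedJobRule
    }
    for run in sorted(forecast, key=lambda r: r.due):
        due_during_outage = failed_at <= run.due < recovered_at
        already_done = (run.job, run.due) in completed
        if due_during_outage and not already_done:
            outcome[rules.get(run.job, default_rule)].append(run)
    return outcome


if __name__ == "__main__":
    day = lambda hour: datetime(2024, 1, 1, hour)
    result = triage_missed_runs(
        forecast=[ScheduledRun("payroll", day(2)), ScheduledRun("reports", day(3))],
        completed=set(),
        failed_at=day(0),
        recovered_at=day(6),
        rules={"payroll": MissedJobRule.RUN_NOW},
    )
    for rule, runs in result.items():
        print(rule.value, [run.job for run in runs])
```

Defaulting to the hold rule is the conservative choice here: a job with no explicit rule is never re-run automatically, which avoids the duplicate inventory or point-of-sale runs that the manual procedure risks.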
Since Skybot Scheduler uses streaming database replication under the covers, the product setup and history on the standby should be very close, if not identical, to the master’s at the moment of failure.
Conclusion
Skybot Scheduler keeps your mission-critical processes protected from long-term, unscheduled downtime with a replicating database and a simple recovery procedure. Beyond system failure and data protection, workload automation solutions like Skybot Scheduler minimize the risks that arise from less obvious sources, such as process chains, manual input, or undocumented dependencies. Finally, workload automation provides greater reliability and control over your entire network of processes, job streams, and tasks across multiple servers and platforms. Visit www.skybotsoftware.com for more information or call 1.877.506.4786 for a live demonstration of Skybot Scheduler’s high availability feature.
Visit www.skybotsoftware.com or call 1.877.506.4786 for more information or a FREE 30-day trial.
© Skybot Software. All trademarks and registered trademarks are the property of their respective owners. MSS0912
1.952.746.4786 info@skybotsoftware.com www.skybotsoftware.com