Artificial intelligence could help data centers run far more efficiently
A novel system developed by MIT researchers automatically “learns” how to schedule data-processing operations across thousands of servers, a task traditionally reserved for imprecise, human-designed algorithms. Doing so could help today’s energy-hungry data centers run far more efficiently.
Data centers can contain tens of thousands of servers, which constantly run data-processing tasks from developers and users. Cluster scheduling algorithms allocate the incoming tasks across the servers in real time, so that all available computing resources are used efficiently and jobs finish quickly.
Traditionally, however, humans fine-tune those scheduling algorithms based on some basic guidelines (“policies”) and various tradeoffs. For example, they can code an algorithm to complete certain jobs quickly, or to divide resources equally between jobs. But workloads, meaning groups of combined tasks, come in all sizes. It is therefore virtually impossible for humans to optimize their scheduling algorithms for a particular workload, and as a result the algorithms often fall short of their true efficiency potential.
The MIT researchers instead offloaded all of the manual coding to machines. In a paper at SIGCOMM, they describe a system that leverages “reinforcement learning” (RL), a trial-and-error machine-learning technique, to tailor scheduling decisions to specific workloads in specific server clusters. To do so, they built novel RL techniques that can train on complex workloads. In training, the system tries many possible ways to allocate incoming workloads across the servers, eventually finding the right tradeoff between using compute resources and achieving fast processing speeds. No human intervention is needed beyond a simple instruction such as “minimize job completion time.”
Compared to the best handwritten scheduling algorithms, the researchers’ system completes jobs about 20 to 30 percent faster, and twice as fast during high-traffic times. Mostly, however, the system learns how to pack workloads efficiently so that little capacity is wasted. The results indicate that the system could enable data centers to handle the same workload at higher speeds, using fewer resources.
“If you have a way of doing trial and error using machines, they can try different ways of scheduling jobs and automatically figure out which strategy is better than others,” says Hongzi Mao, a PhD student in the Department of Electrical Engineering and Computer Science (EECS). “That can improve the system performance automatically. And any slight improvement in utilization, even 1 percent, can save millions of dollars and a lot of energy in data centers.”
“There is no one-size-fits-all approach to making scheduling decisions,” adds co-author Mohammad Alizadeh, an EECS professor and a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL). “In existing systems, these are hard-coded parameters that you have to decide up front. Our system instead learns to tune its scheduling policy characteristics, depending on the data center and workload.”
Joining Mao and Alizadeh on the paper are postdocs Malte Schwarzkopf and Shaileshh Bojja Venkatakrishnan, and graduate research assistant Zili Meng, all of CSAIL.
RL for scheduling
Typically, data-processing jobs come into data centers represented as graphs of “nodes” and “edges.” Each node represents some computational task that needs to be done, where the larger the node, the more computing power required. The edges connecting the nodes link dependent tasks together. Scheduling algorithms assign nodes to servers based on various policies.
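As a rough illustration of the job structure described above, here is a minimal Python sketch. The `Job` class, its field names, and the example task names are hypothetical and are not taken from the researchers' system; the sketch only shows how nodes carry a compute cost and how edges determine which tasks may run.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """A job as a directed acyclic graph of tasks (hypothetical structure)."""
    work: dict[str, float]  # node -> compute cost ("size" of the node)
    parents: dict[str, set[str]] = field(default_factory=dict)  # node -> prerequisites

    def add_edge(self, parent: str, child: str) -> None:
        # An edge means `child` cannot start until `parent` has finished.
        self.parents.setdefault(child, set()).add(parent)

    def runnable(self, done: set[str]) -> list[str]:
        """Unfinished nodes whose parents have all finished, in sorted order."""
        return sorted(
            n for n in self.work
            if n not in done and self.parents.get(n, set()) <= done
        )


# Made-up example job with four tasks.
job = Job(work={"load": 4.0, "filter": 2.0, "join": 6.0, "report": 1.0})
job.add_edge("load", "filter")
job.add_edge("load", "join")
job.add_edge("filter", "report")
job.add_edge("join", "report")

print(job.runnable(set()))       # ['load']
print(job.runnable({"load"}))    # ['filter', 'join']
```

A scheduler only ever chooses among the nodes returned by `runnable`, which is what makes the order of earlier decisions matter for later ones.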
Traditional RL systems are not accustomed to processing such dynamic graphs. These systems use a software “agent” that makes decisions and receives a feedback signal as a reward. Essentially, the agent tries to maximize its rewards for any given action in order to learn ideal behavior in a particular context. Such systems can, for example, help robots learn to pick up objects by interacting with the environment, but that involves processing video or images, which arrive as a simpler, fixed grid of pixels.
To build their RL-based scheduler, called Decima, the researchers had to develop a model that can process graph-structured jobs and scale to a large number of jobs and servers. Their system’s “agent” is a scheduling algorithm that leverages a graph neural network, a type of model commonly used to process graph-structured data. To come up with a graph neural network suitable for scheduling, they implemented a custom component that aggregates information along the paths of the graph, for example to quickly estimate how much computation is needed to complete a given portion of the graph. This is important for job scheduling because “child” (lower) nodes cannot begin executing until their “parent” (upper) nodes are completed, so anticipating future work along different paths in the graph is central to making good scheduling decisions.
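To make the idea of aggregating information along paths concrete, the sketch below computes two hand-written summaries for each node of a made-up job: the longest chain of work downstream of it, and the total work downstream of it. This is a fixed, non-learned stand-in and not Decima's graph neural network, which learns its own summaries; the job, the node names, and the function names are all assumptions made for illustration.

```python
from functools import lru_cache

# Hypothetical job: node -> compute cost, and node -> child nodes.
work = {"load": 4.0, "filter": 2.0, "join": 6.0, "report": 1.0}
children = {
    "load": ["filter", "join"],
    "filter": ["report"],
    "join": ["report"],
    "report": [],
}


@lru_cache(maxsize=None)
def critical_path(node: str) -> float:
    """Longest chain of work from this node to the end of the job."""
    return work[node] + max((critical_path(c) for c in children[node]), default=0.0)


def descendants(node: str) -> set[str]:
    """All nodes reachable from `node`, excluding `node` itself."""
    seen: set[str] = set()
    stack = list(children[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(children[n])
    return seen


def remaining_work(node: str) -> float:
    """Work in this node plus everything downstream, each node counted once."""
    return work[node] + sum(work[d] for d in descendants(node))


print(critical_path("load"), remaining_work("load"))      # 11.0 13.0
print(critical_path("filter"), remaining_work("filter"))  # 3.0 3.0
```

In this example the "join" branch dominates the critical path, so a scheduler that can see these summaries has a reason to start "join" before "filter".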
To train their RL system, the researchers simulated many different graph sequences that mimic the workloads coming into data centers. The agent then makes decisions about how to assign each node along the graph to each server. For each decision, a component computes a reward based on how well the agent did at a specific objective, such as minimizing the average time taken to process a single job. The agent keeps going, improving its decisions, until it earns the highest reward possible.
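A reward of the kind described above can be sketched in a few lines of Python. The toy simulator below is an assumption made for illustration: it treats jobs as independent, assumes they all arrive at time zero, and uses identical servers, none of which comes from the article. It only shows how a scheduling order maps to an average completion time and hence to a reward.

```python
import heapq


def average_completion_time(durations, order, num_servers):
    """Run jobs in `order` on identical servers; each job takes the next free server."""
    free_at = [0.0] * num_servers      # time at which each server becomes free
    heapq.heapify(free_at)
    total = 0.0
    for job in order:
        start = heapq.heappop(free_at)  # earliest available server
        finish = start + durations[job]
        total += finish                 # all jobs arrive at time 0 (assumption)
        heapq.heappush(free_at, finish)
    return total / len(order)


def reward(durations, order, num_servers):
    """Higher is better, so the reward is the negated average completion time."""
    return -average_completion_time(durations, order, num_servers)


# Made-up job durations.
durations = {"a": 8.0, "b": 1.0, "c": 3.0, "d": 2.0}
shortest_first = sorted(durations, key=durations.get)
longest_first = sorted(durations, key=durations.get, reverse=True)

print(reward(durations, shortest_first, num_servers=2))  # -4.25
print(reward(durations, longest_first, num_servers=2))   # -5.5
```

An agent trained against such a signal would be nudged toward orderings like the first one, without being told the "shortest first" rule explicitly.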
Baselining workloads
One concern, however, is that some workload sequences are more difficult to process than others because they involve larger tasks or more complicated structures. Those will always take longer to process than simpler ones, and therefore their reward signal will always be lower. That does not necessarily mean the system performed poorly: it could make good time on a challenging workload and still be slower than on an easier one. That variability in difficulty makes it hard for the model to determine which actions are good.
To address this, the researchers adapted a technique called “baselining” to this setting. The technique takes averages over scenarios with a large number of variables and uses those averages as a baseline against which to compare future results. During training, they computed a baseline for every input sequence. They let the scheduler train on each workload sequence multiple times, and the system then took the average performance across all of the decisions made for the same input workload. That average is the baseline against which the model can compare its future decisions to determine whether they are good or bad. They refer to this new technique as “input-dependent baselining.”
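The effect of a per-input baseline can be shown with a small Python sketch. The function name, the input labels, and the reward numbers below are invented for illustration; the sketch only demonstrates the general idea of comparing each training run against the average of runs that saw the same input, rather than against one global average.

```python
from collections import defaultdict
from statistics import mean


def input_dependent_advantages(rollouts):
    """rollouts: list of (input_id, reward) pairs from repeated training runs.

    Returns one value per rollout: its reward minus the mean reward of all
    rollouts that saw the same input sequence.
    """
    by_input = defaultdict(list)
    for input_id, r in rollouts:
        by_input[input_id].append(r)
    baseline = {k: mean(v) for k, v in by_input.items()}
    return [r - baseline[input_id] for input_id, r in rollouts]


# Made-up rewards: two runs on an easy workload, two on a hard one.
rollouts = [
    ("easy", -10.0), ("easy", -12.0),
    ("hard", -100.0), ("hard", -90.0),
]

advantages = input_dependent_advantages(rollouts)
print(advantages)          # [1.0, -1.0, -5.0, 5.0]

# For contrast: a single baseline shared by all inputs.
global_baseline = mean(r for _, r in rollouts)
global_advantages = [r - global_baseline for _, r in rollouts]
print(global_advantages)   # [43.0, 41.0, -47.0, -37.0]
```

With the shared baseline, every run on the hard workload looks bad and every run on the easy one looks good, regardless of the decisions made. With the per-input baseline, the better of the two hard runs is correctly marked as an improvement.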
That innovation, the researchers say, applies to many different computer systems. “This is a general way to do reinforcement learning in environments where there is this input process that affects the environment, and you want every training event to consider one sample of that input process,” he said. “Almost all computer systems deal with environments where things are constantly changing.”
Aditya Akella, a professor of computer science at the University of Wisconsin at Madison, whose group has designed several high-performance schedulers, found that the MIT system could help further improve their own policies. “Decima can go a step further and find opportunities for [scheduling] optimization that are not realized through manual design/tuning processes,” Akella says. “The schedulers we designed achieved significant improvements over the techniques used in production in terms of application performance and cluster efficiency, but there was still a gap with the ideal improvements we could achieve. [...] which is very surprising.”
Currently, their model is trained on simulations that attempt to re-create incoming online traffic in real time. Next, the researchers hope to train the model on real-time traffic, which could potentially crash servers. So they are currently developing a “safety net” that will stop their system when it is about to cause a crash. “We think of it as training wheels,” says Alizadeh. “We want this system to train continuously, but it has certain training wheels so that, if it goes too far, we can make sure it does not fall over.”