Prometheus: Online Estimation of Optimal Memory Demands for Workers in In-memory Distributed Computation
Xu, Guoyao; Xu, Cheng-Zhong
2017
Conference date: 2017
Conference location: Santa Clara, CA
Abstract: Modern in-memory distributed computation frameworks such as Spark use memory to cache intermediate data across multi-stage tasks in pre-allocated worker processes in order to speed up execution. They rely on a cluster resource manager such as Yarn or Mesos to reserve a specific amount of CPU and memory for workers ahead of task scheduling. Since a worker lives for the entire application and runs multiple batches of DAG tasks from multiple stages, its memory demands change over time [3]. Resource managers such as Yarn handle the non-trivial problem of determining the right amount of memory to provision for workers by requiring users to make explicit reservations before execution. Since the underlying execution frameworks, workloads and complex codebases are opaque to users, users tend to over-estimate or under-estimate workers' demands, leading to over-provisioning or under-provisioning of memory. We observed that, for each stage of an application, there is a performance inflection point with respect to memory reservation. Beyond that point, performance changes little even when more memory is provisioned [1]. The inflection point is the minimum memory required to achieve the expected near-optimal performance. We call these capacities optimal demands. They are the cut lines that separate over-provisioning from under-provisioning. To relieve users of this burden and to provide guarantees on both maximum cluster memory utilization and optimal application performance, we present Prometheus, a system for online estimation of the optimal memory demand of workers for each future stage, without user involvement. Exploring optimal demands is essentially a search problem that correlates memory reservation with performance. Most existing search methods [2] need multiple profiling runs or historical execution statistics, which makes them inapplicable to online configuration of newly submitted or non-recurring jobs.
The optimal demands of recurring applications also change over time as input datasets, algorithmic parameters or source code vary. Rebuilding a search model for every setting is too expensive to be feasible. Prometheus adopts a two-step approach to tackle the problem: 1) For newly submitted or non-recurring jobs, we profile the job's runtime memory footprint and apply histogram frequency analysis to it, using only one pilot run under over-provisioned memory. This yields a highly accurate (over 80% accuracy) initial estimate of the optimal demands per stage for each worker. By analyzing the frequency of past memory usage per sampling time, we efficiently estimate the probability of base demands and distinguish them from unnecessarily excessive usage. Allocating the base demands tends to achieve near-optimal performance and thus approaches the optimal demands. 2) The histogram frequency analysis algorithm has an intrinsic self-decay property. For subsequent recurring submissions, Prometheus exploits this property to perform an efficient recursive search. It refines the estimate stepwise and rapidly reaches the optimal demands within a few recurring executions. We demonstrate that this recursive search achieves up to 3-4 times lower search overhead and 2-4 times higher accuracy than alternative solutions such as random search. We validate the design by implementing Prometheus atop Spark and Yarn. The experimental results show that it achieves an ultimate accuracy of more than 92%. By deploying Prometheus and reserving memory according to the optimal demands, one could improve cluster memory utilization by about 40%. It simultaneously reduces individual application execution time by over 35% compared to state-of-the-art approaches.
Overall, the knowledge of optimal memory demands provided by Prometheus enables cluster managers to effectively avoid over-provisioning or under-provisioning of memory resources, and to achieve optimal application performance and maximum resource efficiency.
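Illustrative note: the abstract does not spell out the histogram frequency analysis, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that memory usage samples from one pilot run of a stage are binned into a histogram, and that the base demand is taken as the smallest bin edge covering a chosen fraction of the samples. The function name estimate_base_demand and the parameters bin_width_mb and coverage are hypothetical.

```python
from collections import Counter


def estimate_base_demand(samples_mb, bin_width_mb=256, coverage=0.9):
    """Estimate a base memory demand (in MB) from pilot-run usage samples.

    Assumption (not from the paper): the base demand is the upper edge of
    the lowest histogram bin at which the cumulative sample frequency
    reaches `coverage`. Rare, higher samples are treated as excess usage.
    """
    if not samples_mb:
        raise ValueError("need at least one memory sample")
    if bin_width_mb <= 0 or not 0 < coverage <= 1:
        raise ValueError("bin_width_mb must be > 0 and coverage in (0, 1]")

    # Histogram: bin index -> number of samples that fall into that bin.
    counts = Counter(int(s // bin_width_mb) for s in samples_mb)
    total = len(samples_mb)

    # Walk the bins from low to high memory and stop once enough of the
    # observed samples are covered.
    cumulative = 0
    for bin_index in sorted(counts):
        cumulative += counts[bin_index]
        if cumulative / total >= coverage:
            return (bin_index + 1) * bin_width_mb

    # Only reachable through rounding; fall back to the highest bin edge.
    return (max(counts) + 1) * bin_width_mb


if __name__ == "__main__":
    # Hypothetical samples: mostly around 900 MB, with one brief spike.
    pilot_samples = [900] * 8 + [1000, 3900]
    print(estimate_base_demand(pilot_samples, coverage=0.85))
```

With these made-up samples the spike near 3900 MB is ignored and the estimate is 1024 MB, whereas setting coverage to 1.0 returns 4096 MB, which would correspond to reserving for the peak. In a recurring job the same function could be re-applied to samples from the next run to refine the estimate, but how Prometheus actually performs that refinement is not described here.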
Language: English
Content type: Conference paper
Source URL: http://ir.siat.ac.cn:8080/handle/172644/12676
Collection: Shenzhen Institutes of Advanced Technology_Digital Institute
Author affiliation: 2017
Recommended citation
GB/T 7714
Xu, Guoyao, Xu, Cheng-Zhong. Prometheus: Online Estimation of Optimal Memory Demands for Workers in In-memory Distributed Computation[C]. In:. Santa Clara, CA. 2017.