Reducing faulty jobs by job submission verifier in grid engine

Misha Ahmadian, Eric Rees, Yong Chen, Yu Zhuang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Grid Engine is a Distributed Resource Manager (DRM) that manages the resources of distributed systems (such as Grid, HPC, or Cloud systems) and executes jobs that have requested to occupy or consume those resources. Grid Engine applies scheduling policies to allocate resources to jobs while attempting to maintain optimal utilization of all machines in the distributed system. However, due to the complexity of Grid Engine’s job submission commands and its complicated resource management policies, the number of faulty job submissions in data centers increases with the number of jobs submitted. To combat this increase, Grid Engine allows administrators to design and implement Job Submission Verifiers (JSVs) that verify jobs before they enter Grid Engine. In this paper, we discuss a Job Submission Verifier that was designed and implemented for Univa Grid Engine (UGE), a commercial version of Grid Engine, and thoroughly evaluated at the High Performance Computing Center of Texas Tech University. Our newly developed JSV communicates with UGE components to verify whether a submitted job should be accepted as is, modified and then accepted, or rejected due to improper resource requests. It substantially reduced the number of faulty jobs submitted to UGE. For instance, from September 2018 to February 2019 it corrected 28.6% of job submissions and rejected 0.3% of all jobs, which might otherwise have led to long or infinite waiting times in the job queue.
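
The accept, modify, or reject decision described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation and it does not use the Grid Engine JSV interface: the `JobRequest` type, the `verify` function, the queue names, and the limits are all hypothetical, chosen only to show the three possible outcomes.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Hypothetical site policy, invented for illustration only.
MAX_SLOTS = 128
DEFAULT_RUNTIME_S = 3600
KNOWN_QUEUES = {"short", "long"}


@dataclass(frozen=True)
class JobRequest:
    """A simplified stand-in for the resource requests of a submitted job."""
    queue: str
    slots: int
    runtime_s: Optional[int] = None  # None means the user gave no runtime limit


def verify(job: JobRequest) -> Tuple[str, JobRequest, str]:
    """Return (decision, possibly modified job, message).

    decision is one of "ACCEPT", "CORRECT", or "REJECT".
    """
    # Requests that could never be scheduled are rejected outright.
    if job.queue not in KNOWN_QUEUES:
        return "REJECT", job, f"unknown queue {job.queue!r}"
    if job.slots < 1 or job.slots > MAX_SLOTS:
        return "REJECT", job, f"slots must be between 1 and {MAX_SLOTS}"

    # A missing but fixable request is corrected, then accepted.
    if job.runtime_s is None:
        fixed = replace(job, runtime_s=DEFAULT_RUNTIME_S)
        return "CORRECT", fixed, "added default runtime limit"

    # Everything else passes through unchanged.
    return "ACCEPT", job, "ok"


if __name__ == "__main__":
    for job in (
        JobRequest("short", 4, 600),
        JobRequest("long", 16),
        JobRequest("gpu", 1, 60),
    ):
        decision, result, message = verify(job)
        print(decision, result, message)
```

In this sketch, rejection is reserved for requests that no correction could make schedulable, while a fixable omission is repaired rather than refused. Which checks a real verifier performs depends entirely on site policy.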

Original language: English
Title of host publication: Proceedings of the Practice and Experience in Advanced Research Computing
Subtitle of host publication: Rise of the Machines (Learning), PEARC 2019
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450372275
State: Published - Jul 28 2019
Event: 2019 Conference on Practice and Experience in Advanced Research Computing: Rise of the Machines (Learning), PEARC 2019 - Chicago, United States
Duration: Jul 28 2019 to Aug 1 2019

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2019 Conference on Practice and Experience in Advanced Research Computing: Rise of the Machines (Learning), PEARC 2019
Country: United States
City: Chicago
Period: 07/28/19 to 08/1/19

Keywords

  • Faulty Jobs
  • Grid Engine
  • Job Submission Verifier
