This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).

Many studies have shown that human mistakes are an important source of system failures. Further, repairing mistakes is often time consuming, leading to high unavailability. In this project, we will explore a novel approach to dealing with human mistakes called operator-proof systems management. In an operator-proof system, an omnipresent management infrastructure will enable the system to defend itself against operator mistakes. The infrastructure will constantly monitor operator actions and the system state to decide when and how the system should defend itself. Possible defensive measures include blocking operator actions that could lead to a mistake and/or limiting operator access to prevent mistakes from spreading throughout the system. Blocks are later lifted if the system can test the correctness of the operator actions.

To explore our ideas, we will design and implement two very different prototype operator-proof systems: an Internet service and an enterprise system. We will explore the design space and evaluate the overall approach by running a large set of experiments, where volunteer operators of different levels of experience are asked to perform a variety of tasks on the prototype systems.

Broader impacts. Our research will provide a concrete step toward the realization of a model where large computer systems can be operated at lower cost by less skilled individuals. Our investigation will also expose a large number of students (acting as volunteer operators) to system management issues and our proposed solutions.
Effective start/end date: 9/1/09 → 8/31/12
- National Science Foundation (NSF)