  1. Fault Management
  2. DOCTOR-11

Extend maintenance workflow

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:

      Description

      The current maintenance workflow looks as follows:
      1. VIM receives maintenance notification
      2. VIM notifies affected Consumers
      3. The Consumer takes action if necessary (or an automatic action is issued based on pre-defined policies)

      This workflow does not address the following problems:
      I. The Administrator does not know when the maintenance actions can start, because there is no information about when every Consumer has finished its actions.
      II. A Consumer cannot delay or reject the maintenance notification when it cannot carry out the necessary actions (e.g. migration fails; the Consumer's APP is already overloaded and cannot absorb the effect of migrating another VM away; the Consumer's APP has already lost 1+1 redundancy and cannot migrate the only working controller VM for maintenance).
      III. A Consumer may fail to state that it is ready for maintenance within a given time period, for various reasons (e.g. the Consumer missed the maintenance notification, the Consumer crashed or is not present, the migration action takes too long).

      I propose the following extension to this workflow:
      1. VIM receives maintenance notification
      2. VIM notifies the affected Consumers and puts the node in 'going-to-maintenance' state so that no new workload is scheduled on that node
      3. Every Consumer takes its actions (or automatic actions are taken) successfully
      4. Every Consumer notifies the VIM that, from its point of view, maintenance actions can be taken on the node
      5. VIM puts the node in 'maintenance' state so that the Administrator can see that the node is now ready for maintenance actions.
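
The happy path above can be sketched as a small state machine. This is only an illustration under stated assumptions, not an existing VIM interface: the `Node` class, its method names, and the use of 'enabled' as the initial state are hypothetical.

```python
from enum import Enum


class NodeState(Enum):
    ENABLED = "enabled"
    GOING_TO_MAINTENANCE = "going-to-maintenance"
    MAINTENANCE = "maintenance"


class Node:
    """Hypothetical model of a node as tracked by the VIM."""

    def __init__(self, name, consumers):
        self.name = name
        self.consumers = set(consumers)  # Consumers with workload on this node
        self.confirmed = set()           # Consumers that have confirmed so far
        self.state = NodeState.ENABLED

    def schedulable(self):
        # New workload is only placed on nodes in the 'enabled' state.
        return self.state is NodeState.ENABLED

    def start_maintenance(self):
        # Steps 1-2: the notification arrives and scheduling on the node stops.
        self.confirmed.clear()
        if self.consumers:
            self.state = NodeState.GOING_TO_MAINTENANCE
        else:
            # Assumption: with no affected Consumers the node is ready at once.
            self.state = NodeState.MAINTENANCE
        return sorted(self.consumers)    # the Consumers that must be notified

    def confirm(self, consumer):
        # Step 4: a Consumer reports that maintenance may proceed.
        if self.state is not NodeState.GOING_TO_MAINTENANCE:
            raise ValueError("node is not waiting for confirmations")
        if consumer not in self.consumers:
            raise ValueError(f"unknown consumer: {consumer}")
        self.confirmed.add(consumer)
        # Step 5: once every Consumer has confirmed, the node is ready.
        if self.confirmed == self.consumers:
            self.state = NodeState.MAINTENANCE
        return self.state
```

In this sketch the node only reaches 'maintenance' after the last affected Consumer has confirmed, which is the information the Administrator is missing in the current workflow.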

      Besides the happy case above, the following error cases need to be covered:
      I. One or more Consumers fail to execute the action (or the automatic recovery action fails). In this case those Consumers do not confirm but reject the maintenance notification, so VIM does not put the node in 'maintenance' state but puts it back in 'enabled' state and notifies the Administrator about the problem.
      II. One or more Consumers miss the notification (or crash, or are too slow) and never confirm the maintenance of the node. To avoid keeping the node in 'going-to-maintenance' state forever, VIM should implement a timer (possibly with a timeout value configurable by the Administrator). If this maintenance timer expires before every Consumer has confirmed, the node goes back to 'enabled' state and the Administrator is notified about the problem.
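
The two error cases can be sketched as one decision function that the VIM would evaluate while a node is in 'going-to-maintenance' state. This is a hedged illustration: the function name `evaluate`, the "confirm"/"reject" response values, and the message texts are assumptions made for the example, not part of any existing API.

```python
def evaluate(consumers, responses, elapsed_s, timeout_s):
    """Decide the next state of a node that is in 'going-to-maintenance'.

    consumers: names of the affected Consumers.
    responses: maps a Consumer name to "confirm" or "reject";
               a missing entry means the Consumer has not answered yet.
    Returns (node_state, admin_message); admin_message is None when the
    Administrator does not need to be notified about a problem.
    """
    # Error case I: any rejection sends the node back to 'enabled'.
    rejected = sorted(c for c in consumers if responses.get(c) == "reject")
    if rejected:
        return "enabled", "maintenance rejected by: " + ", ".join(rejected)

    # Happy case: every Consumer has confirmed.
    pending = sorted(c for c in consumers if responses.get(c) != "confirm")
    if not pending:
        return "maintenance", None

    # Error case II: the maintenance timer expired with answers missing.
    if elapsed_s >= timeout_s:
        return "enabled", "maintenance timed out waiting for: " + ", ".join(pending)

    # Otherwise keep waiting for the remaining Consumers.
    return "going-to-maintenance", None
```

For example, `evaluate(["a", "b"], {"a": "confirm"}, elapsed_s=10, timeout_s=300)` keeps the node in 'going-to-maintenance', while the same call with `elapsed_s=300` returns the node to 'enabled' and names Consumer "b" in the message for the Administrator.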

            People

            Assignee:
            bertys Bertrand Souville
            Reporter:
            gibi Balazs Gibizer