We added an alert in March this year. It went through multiple reviews and several meetings, where people from all sorts of backgrounds offered their suggestions.
Even before we started to code, many aspects were discussed, and the most amusing part was that whatever we discussed and documented in one meeting would change in the next. The challenge was not writing the monitoring tool, but the lack of basic understanding and the sheer number of people making suggestions, which defeated the purpose and kept us beating around the bush.
We used Apache Oozie to schedule the data pipeline, but rather than the time-based scheduling that Oozie supports well, each run considered the last recorded status and continued from there. So even after a delay of two to three hours, any running process would check where it last stopped and pick up from that point, while the subsequent processes would have no further work to do. Ideally, any job that eventually completed successfully within the hour was fine, even if we had two or three failures along the way.
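The resume-from-last-status approach can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: the checkpoint file name, the offset-based state, and the function names are all assumptions for the example.

```python
import json
import os

STATE_FILE = "pipeline_state.json"  # hypothetical checkpoint file


def load_last_offset():
    """Read the last successfully processed offset, defaulting to 0."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["last_offset"]
    return 0


def save_offset(offset):
    """Persist the offset so the next run resumes from this point."""
    with open(STATE_FILE, "w") as f:
        json.dump({"last_offset": offset}, f)


def run_pipeline(records):
    """Process only records after the last checkpoint.

    A delayed run simply catches up from where it stopped, and any
    later run finds nothing left to do. Returns the number of records
    processed by this run.
    """
    start = load_last_offset()
    pending = records[start:]
    for i, _record in enumerate(pending, start=start + 1):
        # ... process the record here ...
        save_offset(i)
    return len(pending)
```

Run twice against the same input, the second invocation processes nothing, which is why a few failed or delayed runs did not matter as long as one run eventually succeeded.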
The teams involved were Marketing, Agile, SRE, and DevOps; there was less representation from Engineering, as the focus was on the alert alone. DevOps wanted to write the alert taking input from all the teams involved. This is where it took longer: many people were not aware of the internals but kept suggesting "the best" approach, and every time we met we got new suggestions.
Finally, after multiple trials, we got the right code in place and the alert was ready. It took almost three months to get it right. The pipeline rarely failed, and it had been so stable that people often forgot the alert existed; when things did fail, everybody focused on the impact. It's the same story everywhere.
We rarely got the alert, and we fine-tuned it to the point where the pipeline would also auto-fix itself after a small glitch or a change pushed to production.
We got an alert on Nov 1 saying one job had been missed, yet all the data was processed and nothing was delayed; the last success, according to my alert, was one hour old. There had been NTP alerts for some time, so I checked all the logs, but nothing was missing. Then it hit us: DST had just ended, and all our systems were on Pacific time.
But Oozie uses GMT internally, while my alert reads the system clock when computing the delay. When the clocks fell back, an extra hour appeared in which nothing had run, so the alert reported a one-hour delay. The interesting part was that Oozie also uses the system time to show status when queried. As we improve the granularity of our checks, we run into exactly these situations; once we realized it was a false alert, we were relieved.
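The class of bug here can be reproduced with a few lines of timezone-aware Python. This is a sketch of the failure mode, not the actual alert code: across the fall-back boundary, an age computed from naive Pacific wall-clock readings differs from the true UTC elapsed time by exactly one hour, so an alert mixing the two clocks reports a phantom delay (or even a last success in the future). The specific timestamps are assumptions chosen to straddle the 2020-11-01 transition.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

pacific = ZoneInfo("America/Los_Angeles")

# Last successful run: 08:30 UTC on 2020-11-01, i.e. 01:30 PDT,
# just before DST ends at 09:00 UTC (2:00 PDT -> 1:00 PST).
last_success_utc = datetime(2020, 11, 1, 8, 30, tzinfo=timezone.utc)

# The alert fires 30 minutes later, at 09:00 UTC (01:00 PST).
now_utc = datetime(2020, 11, 1, 9, 0, tzinfo=timezone.utc)

# Correct, timezone-aware age: 30 minutes.
real_age = now_utc - last_success_utc

# Buggy age: both instants read as naive local wall-clock values
# (what a Pacific-time system clock displays), then subtracted.
last_local = last_success_utc.astimezone(pacific).replace(tzinfo=None)  # 01:30
now_local = now_utc.astimezone(pacific).replace(tzinfo=None)            # 01:00
naive_age = now_local - last_local  # minus 30 minutes!
```

The two results disagree by exactly one hour, which is the hour the alert "gained" when the clocks changed.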
At one organization, DevOps actually created a fake job to cover the daylight-saving switch, skipping or adding an hour so the alerts produced no noise.
Not many, but some organizations still keep system time set to their local time zone, which is a problem if you work globally. It is better to keep system time in GMT, which does not change with daylight saving; that will save a lot of time spent chasing small glitches like this one.
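Even when you cannot change the host's clock, the alert itself can be made DST-proof by doing all arithmetic in UTC. A minimal sketch, assuming the last-success timestamp is stored as an aware UTC datetime; the function names and the one-hour threshold are hypothetical.

```python
from datetime import datetime, timezone


def job_age_seconds(last_success_utc: datetime) -> float:
    """Age of the last successful run, computed entirely in UTC.

    Because both instants are timezone-aware UTC values, the result is
    immune to DST transitions and to the host's local time zone.
    """
    now = datetime.now(timezone.utc)
    return (now - last_success_utc).total_seconds()


def should_alert(last_success_utc: datetime, threshold_s: int = 3600) -> bool:
    """Fire when the last success is older than the threshold (1 hour here)."""
    return job_age_seconds(last_success_utc) > threshold_s
```

The same code gives the same answer whether the box is set to PST, PDT, or GMT, which is the whole point.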