ER: add a duration threshold for memory and CPU alerts
Is your feature request related to a problem? Please describe.
The CPU alert in its current form is not useful and is consequently never used. On a TM1 server, CPU can spike to (or close to) 100% multiple times per day for all sorts of reasons that are within normal operating bounds, e.g. loading data with parallel threads, or any operation involving MTQ such as query execution or feeder processing. Configuring this alert would therefore lead to multiple false positive reports. What is required to make the alert useful is a duration threshold in seconds. For example, setting CPU to 99 and duration to 120 would mean the alert is only triggered if average CPU exceeds 99% over a time window of 120 seconds or longer.
Similarly, the Memory alert (while already useful) would be much more useful if it also had a duration threshold parameter in seconds, as this would vastly reduce false positive alerts caused by memory spiking for short durations. For example, on some systems memory can spike above 90% for a short time due to processes external to TM1, such as data loads via tm1py where Python uses a lot of memory, or backups where large files are temporarily cached in memory while being written to disk.
Describe the solution you'd like
Add a configurable duration parameter (in seconds) to BOTH the CPU and Memory alert types.
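To illustrate the requested behaviour, here is a minimal Python sketch of a duration-based threshold check. It is not taken from the product: the class name `DurationThresholdAlert`, its parameters, and the sampling approach are all hypothetical. It assumes usage is sampled periodically and that "average over the window" means the mean of the samples covering the last `duration_seconds`.

```python
from collections import deque


class DurationThresholdAlert:
    """Hypothetical alert that fires only on sustained high usage."""

    def __init__(self, threshold_pct, duration_seconds):
        self.threshold_pct = threshold_pct
        self.duration_seconds = duration_seconds
        # (timestamp_seconds, usage_pct) pairs, oldest first
        self.samples = deque()

    def add_sample(self, timestamp, usage_pct):
        """Record a sample; return True if the alert should fire."""
        self.samples.append((timestamp, usage_pct))
        window_start = timestamp - self.duration_seconds

        # Drop samples that fall before the window, but keep the newest
        # sample at or before window_start so we can tell that the
        # window is fully covered by history.
        while len(self.samples) > 1 and self.samples[1][0] <= window_start:
            self.samples.popleft()

        # Not enough history yet to cover the full duration.
        if self.samples[0][0] > window_start:
            return False

        average = sum(pct for _, pct in self.samples) / len(self.samples)
        return average > self.threshold_pct


if __name__ == "__main__":
    # CPU threshold 99%, duration 120 seconds, one sample every 10 seconds.
    alert = DurationThresholdAlert(threshold_pct=99, duration_seconds=120)
    for t in range(0, 200, 10):
        if alert.add_sample(t, 100.0):
            print(f"alert fired at t={t}s")
            break
```

With this logic a 30-second spike to 100% never fires the alert, because the 120-second average stays well below the threshold. The same class would serve for Memory with a different `threshold_pct`.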
Additional context
This would improve monitoring quality by eliminating false positive alerts. The problem with false positives is that they create background noise and increase the risk that legitimate alerts will be ignored or not actioned.