Hello,
Last weekend we had a major incident with our CPS database.
The root cause was a full “db log”. Unfortunately for us, the monitoring via Solman was not set up properly, and as a consequence no alert was generated.
Since the incident, the Solman monitoring has been reviewed and some metrics have been improved.
But we still have a problem. In Solman, the metrics have been designed to ping / measure / check the availability of the database, the central instance, the J2EE status, etc. What we would like to monitor is the real capacity of CPS to execute jobs, to be sure it is alive.
Currently, to be sure that CPS is up and processing jobs, a new job, “sent mail”, has been created. This job runs 4x/day and sends an SMS with the subject “I’m alive”. We tested it: if the DB log is full, the job cannot be processed, so no message is delivered.
But we would prefer to receive a message only when a problem occurs, meaning when the job sending the “I’m alive” message cannot run.
How can we catch a “missing” job, that is, a job that has not run as scheduled 2 times in a row, in order to raise an alert in Solman?
Or if you have a better idea for managing the “I’m alive and working” check, you are welcome to share it.
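To make the requirement concrete, here is a minimal sketch in Python of the check we have in mind. It is only an illustration, not something that exists in CPS or Solman: the function names are hypothetical, the 10-minute grace period is an assumption, and how the last successful run time of the job would be read from CPS is not shown.

```python
from datetime import datetime, timedelta

# The "I'm alive" job runs 4x/day, so one run is expected every 6 hours.
INTERVAL = timedelta(hours=6)
# Assumed tolerance for a run that starts or finishes a little late.
GRACE = timedelta(minutes=10)
# Raise an alert once 2 scheduled runs have been missed.
MISSED_RUNS_BEFORE_ALERT = 2


def missed_runs(last_success, now, interval=INTERVAL, grace=GRACE):
    """Count the scheduled runs that should have completed since the
    last successful run, but did not."""
    elapsed = now - last_success - grace
    if elapsed < timedelta(0):
        return 0
    # timedelta // timedelta gives the number of whole intervals as an int.
    return elapsed // interval


def should_alert(last_success, now, threshold=MISSED_RUNS_BEFORE_ALERT):
    """True when at least `threshold` scheduled runs have been missed."""
    return missed_runs(last_success, now) >= threshold


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 0, 0)  # example value only
    for hour, minute in [(5, 0), (6, 30), (12, 30)]:
        now = datetime(2024, 1, 1, hour, minute)
        print(now.time(), missed_runs(last, now), should_alert(last, now))
```

The idea is that the check itself would have to run outside CPS (for example from the monitoring side), so that it still works when CPS cannot process any job.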
Thanks,
Delphine