I have Spark streaming application which basically get a trigger message from Kafka which kick starts the batch processing which could potentially take up to 2 hours.
There were incidents where some of the jobs were hanging indefinitely and didn’t got completed with in the usual time and currently there is no way we could figure out the jobs status without checking the Spark UI manually. I want to have a way where the currently running spark jobs are hanging or not. So basically if it’s hanging for more than 30 minutes I want to notify the users so they can take an action. What all options do I have?
I see I can use metrics from driver and executors. If I were to choose the most important one, it would be the last received batch records. When StreamingMetrics.streaming.lastReceivedBatch_records == 0 it probably means that Spark streaming job has been stopped or failed.
But in my scenario i will receive only 1 streaming trigger event and then it will kick start the processing which may take up to 2 hours so I won’t be able to rely on the records received.
Is there a better way? TIA