Some jobs never end if stared by the job scheduler

Stefan_Badenhorst · December 6, 2024, 11:07am

Hi, I’m having a strange issue with running jobs on timescaledb 2.14.2 on PostgreSQL 14. I have a function that calls refresh materialized view. Calling the function directly finished in about 1 second.
I created a job to execute this on a schedule.
Calling directly run_job(1071) finished in about 1 second.
When the job is automatically started using the schedule, it never finishes. In pg_stat_activity, I can see the CALL to the function in the query column. There is no wait state reported and the state is active. The owner of the job, the function and view are all the same. Any idea why this could happen? I turned on debug logging and added a timeout in the job configuration of 10min. After 10 min I just see that the job failed because it timed out. No other information.

jonatasdp · December 6, 2024, 2:20pm

Hi @Stefan_Badenhorst ,

Do you see any other process competing for the same resource?

Do you have multiple refreshes? Other aggregates? Any compression in the continuous aggregates?

Have you tried to restart the background workers?

My first recommendation would be try to understand what could be locking the job.

Are you trying to change some metadata of the timescaledb job internally or some recursive rescheduling?

If you can share your basic refresh policies and some minimal POC that we can try, we can certainly find what’s going on.

Stefan_Badenhorst · December 9, 2024, 9:27am

Do you see any other process competing for the same resource?

Not sure how to check for that.

Do you have multiple refreshes? Other aggregates? Any compression in the continuous aggregates?

We do have multiple refresh jobs.

Have you tried to restart the background workers?

We have not tried restarting the background workers, many of the other jobs are working fine and we have recreated the problematic jobs.

My first recommendation would be try to understand what could be locking the job.

I ran some queries to look for locks but I think I need to spend more time on this to be sure.
The strange thing is, why would there be no locks if the job is manually run?

Are you trying to change some metadata of the timescaledb job internally or some recursive rescheduling?

No, just plainly scheduling a job.

There are a some warnings about max_workers_processes in the logs and we will be addressing those today to see if it makes a difference.

jonatasdp · December 9, 2024, 3:15pm

I assume other jobs could run in parallel and if the schedule overlaps is the problem, it will fail only via scheduler. If you run manually, rarely you’ll overlap the background ones.

No, just plainly scheduling a job.

In this case do you mean your background job as a add_job call inside it?

Stefan_Badenhorst · December 9, 2024, 4:58pm

No, the job function is only running the refresh of the view, nothing else.
There are only 2 jobs running at any point in time, since the max_worker_processes, max_parallel_workers, timescaledb.max_background_workers are blocking more of them from running.
So if there are locks preventing this job from completing, there is only one other job that it could be locking with.
There are many jobs that could be running at the same time, but it’s usually not the same jobs. Some of our jobs run every few seconds, while this job is once an hour, if I remember.

jonatasdp · December 12, 2024, 1:18pm

@Stefan_Badenhorst seems not the appropriate behavior. Do you see any messages in the server logs? maybe running out of memory?

If you can isolate your scenario, I’d recommend you bring it as a timescaledb issue on github. In this way our team can take a look and provide a fix.