HPE Ezmeral Software platform

Spark job output is not available immediately on Filesystem for other job to read

Akkism
Occasional Contributor


There is a time delay between when the Hadoop job completes and when its output becomes available for other jobs to read. The first job completes successfully and its output feeds a second job, but sometimes the second job gets a file-not-found exception, even though after about a minute I can see the file in the same location.

This only happens about one in a hundred jobs, and it's not consistent. Can someone please let me know what's going wrong and why the output file is not available immediately after the job completes?

I've checked the server resource usage; it's under-utilized and everything looks good, but I don't know what is causing the issue. Can someone please help me with this?

Mr_Techie
Trusted Contributor
Solution

Re: Spark job output is not available immediately on Filesystem for other job to read

@Akkism 

Good day! 

This can occur due to a delay in how Hadoop synchronizes files after a Spark job completes. Even though the first job finishes successfully, the output might not yet be fully written and visible on the filesystem, especially in a distributed environment.

The file may already exist, but HDFS might not have fully replicated the data across the nodes, or the metadata might not yet have been updated, leading to a temporary visibility delay.

Spark jobs use the Hadoop output commit protocol, and in some scenarios speculative execution can cause multiple task attempts to write output to the same location. If some attempts finish slightly before others, the job may appear complete before the files are fully visible on the filesystem.

You can try disabling speculative execution by setting the following configuration in your Spark job:

"spark.speculation = false"  (this ensures that no duplicate task attempts are run, reducing the chance of inconsistent output.)
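As a rough sketch, the setting above could be applied when building the session in a PySpark job. This is just one way to pass the configuration (it can equally go on the spark-submit command line or in spark-defaults.conf), and the app name here is a placeholder:

```python
# Hedged sketch: disabling speculative execution in a PySpark job.
# "my_pipeline" is a placeholder app name, not from the original post.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.speculation", "false")
spark = (
    SparkSession.builder
    .appName("my_pipeline")
    .config(conf=conf)
    .getOrCreate()
)
```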

- Introduce a small delay or retry mechanism in your second job to wait until the output becomes available.

"Thread.sleep(60000)" // wait for 60 seconds before retrying (Scala/Java; alternatively, you can implement a loop that checks for the file's existence before proceeding.)
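A polling loop is usually more robust than a fixed sleep. Below is a minimal sketch in Python; it checks a local path with os.path.exists, and for an HDFS path you would swap that check for something like running `hadoop fs -test -e <path>` via subprocess (the function name and defaults are my own, not from the original post):

```python
import os
import time

def wait_for_output(path, timeout=120, interval=5):
    """Poll until `path` exists or `timeout` seconds elapse.

    Returns True as soon as the path is visible, False on timeout.
    For HDFS, replace os.path.exists with an existence check such as
    `hadoop fs -test -e <path>` run via subprocess.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return False
```

The second job would call this before reading the first job's output, and fail fast with a clear error if the file never appears within the timeout.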

- In some distributed environments, network issues can cause latency between nodes when replicating files. Even if your system resources are under-utilized, intermittent network hiccups could cause short-lived delays in file visibility.

 

- Some file systems might cache a file's metadata or contents, which can delay the propagation of file changes, especially across different nodes. Make sure that the job is not using any caching mechanism that delays visibility.

I hope this gives you some insight into resolving your issue. Let me know!