HPE Ezmeral Software platform

Harshkohli
HPE Pro

Expected performance of Spark when reading CSV/JSON and writing to MaprDB JSON

It looks like when we use Spark (Scala or Python) to load a CSV with around 9 million records into MapR-DB JSON, the load takes around 50 seconds.

Any pointers on how to make this faster, or is this the expected performance? How can we increase parallelism?

We tried varying the number of executors, but it didn't help.

Note: assume we have the latest MapR Core patches and the Spark RPM running on a 6.1 cluster.
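For reference, the load is essentially of this shape (a minimal sketch only: the input path, table path, column handling, and the connector's saveToMapRDB call are illustrative and depend on the MapR-DB OJAI connector version in use):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id
// MapR-DB OJAI connector implicits; the package name assumes the 6.1-era connector
import com.mapr.db.spark.sql._

object CsvToMaprDbJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-maprdb-json")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/input.csv")                 // ~9 million rows
      // MapR-DB JSON needs an _id field; fabricated here for illustration
      .withColumn("_id", monotonically_increasing_id().cast("string"))
      // Control write parallelism explicitly rather than via executor count alone
      .repartition(64)

    // saveToMapRDB parameters vary by connector version; createTable assumes
    // the target table does not already exist
    df.saveToMapRDB("/tables/target_table", createTable = true)

    spark.stop()
  }
}
```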

 

I work for HPE
tdunning
HPE Pro

Re: Expected performance of Spark when reading CSV/JSON and writing to MaprDB JSON

That is definitely absurdly slow.

I can't say anything definitive without more diagnostics about how the program is actually running, but this should be going much faster even from a single thread (I would have expected 1000 times that, even in the worst kind of non-parallel execution).

To solve the problem, you need to break the job apart into pieces to find out what is going so pathologically wrong.

So ... how long does it take to tally a sum of all 9 million rows? What happens when you break the file into 10, 20, 50 or 100 pieces and try the same thing?
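For example, a rough way to time just the read-and-aggregate half at several partition counts, run from spark-shell where `spark` is predefined (the path is a placeholder, and count() stands in for the trivial aggregation; a sum over a numeric column works the same way):

```scala
// Time a full pass over the CSV, independent of the database write
val df = spark.read.option("header", "true").csv("/path/to/input.csv")

for (parts <- Seq(10, 20, 50, 100)) {
  val start = System.nanoTime()
  val n = df.repartition(parts).count()   // trivial aggregation forcing a full scan
  val secs = (System.nanoTime() - start) / 1e9
  println(f"$parts%4d partitions: $n rows in $secs%.1f s")
}
```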

And how long does it take to write to a database from synthetic data? Consider writing consecutive integers starting from disjoint, possibly random points. If your program doing the writing is heavily parallelised, how fast can it write to the database? How about to files? And /dev/null?
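A sketch of that synthetic-write experiment (the table path is a placeholder; saveToMapRDB assumes the OJAI connector for Spark, and the final write can be swapped for a Parquet write or a foreachPartition no-op to get the file and /dev/null comparison points):

```scala
import org.apache.spark.sql.functions.{col, concat_ws, lit}
import com.mapr.db.spark.sql._   // assumed OJAI connector implicits

// Consecutive integers starting from a random, disjoint offset, in 64 partitions
val offset    = math.abs(scala.util.Random.nextLong()) % 1000000000L
val synthetic = spark.range(offset, offset + 9000000L, 1, 64)
  .withColumn("_id", col("id").cast("string"))
  .withColumn("payload", concat_ws("-", lit("row"), col("id")))

// Time the database write; replace the saveToMapRDB line with .write.parquet(...)
// or a foreachPartition no-op to compare against files and "/dev/null"
val start = System.nanoTime()
synthetic.saveToMapRDB("/tables/write_test", createTable = true)
println(f"9M synthetic rows written in ${(System.nanoTime() - start) / 1e9}%.1f s")
```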

How well distributed is your database? Presumably since you haven't been able to write to it at any decent speed, it is still quite small and thus limited to being on just a few machines. Try using the bulk loader to fill it up to a decent size in at least 10 containers and check your speed again.
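If Spark is the only loading path you have handy, one way to do that kind of bulk fill is the connector's bulk-insert mode; treat the bulkInsert flag below as an assumption to verify against your connector version (the standalone MapR bulk-load tooling is another route):

```scala
import org.apache.spark.sql.functions.col
import com.mapr.db.spark.sql._   // assumed OJAI connector implicits

// Seed the table with enough synthetic rows that it splits across many tablets.
// bulkInsert is an assumed connector option and typically requires the table to
// be in bulk-load mode; the row count and table path are placeholders.
val seed = spark.range(0L, 100000000L, 1, 128)
  .withColumn("_id", col("id").cast("string"))

seed.saveToMapRDB("/tables/target_table", createTable = true, bulkInsert = true)
```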

Your speed is so slow that I almost wonder if it is being limited by the Spark scheduler itself. How fast does a simple multi-threaded Java program running on a single machine perform the load? If Spark is much slower than that, you have some pathology in how your Spark program is running. Can you see some sort of execution trace? Is it perhaps true that you are running a single unit of execution per input line?
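For that single-machine baseline, something along these lines (sketched in Scala rather than Java for consistency with the rest of the thread; the table path, thread count, and row count are placeholders, and the OJAI calls assume the MapR 6.1 OJAI client with the "ojai:mapr:" driver):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import org.ojai.store.DriverManager

// Multi-threaded, single-machine write baseline that bypasses Spark entirely.
object ThreadedWriteBaseline {
  def main(args: Array[String]): Unit = {
    val threads   = 16
    val rowsTotal = 9000000L
    val conn  = DriverManager.getConnection("ojai:mapr:")
    val store = conn.getStore("/tables/write_test")   // shared store for brevity;
                                                      // per-thread stores may be safer
    val pool  = Executors.newFixedThreadPool(threads)
    val start = System.nanoTime()

    for (t <- 0 until threads) {
      pool.submit(new Runnable {
        override def run(): Unit = {
          var i = t.toLong
          while (i < rowsTotal) {                     // each thread writes a disjoint stripe
            val doc = conn.newDocument()
              .set("_id", i.toString)
              .set("payload", s"row-$i")
            store.insertOrReplace(doc)
            i += threads
          }
        }
      })
    }

    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.HOURS)
    println(f"$rowsTotal rows in ${(System.nanoTime() - start) / 1e9}%.1f s")
    store.close()
    conn.close()
  }
}
```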

Anyway, you get the idea. Break the program down and replace the pieces with alternatives or with complete fakes to find out what is slowing things down. The problem is, almost by definition, not where you are looking, so you must force yourself to bisect the problem space so that you can focus on aspects you have not considered. Since you have probably already thought about this problem at length, it will be hard to convince yourself that the problem is actually what it is. Thus, simple experiments are necessary.

Keep in mind the risk that the problem will disappear in any dissected version of your program but reappear when you put it back together. This will imply that the problem is one of interaction ... that will also succumb to a similar bisection, but require a bisection of more subtlety.

It is also possible that the problem is in multiple places. As you dissect, if you find a serious problem, do not assume it is the only one. Test the other half of the problem space to assure yourself honestly that the problem you found is the only one.

 

 

I work for HPE