06-24-2021 07:15 PM
Expected performance of Spark when reading CSV/JSON and writing to MaprDB JSON
Using Spark (Scala/Python), loading a CSV of around 9 million records into MapR-DB JSON takes around 50 seconds.
Any pointers on how to improve this, or is this the expected performance? How can we increase parallelism?
We tried varying the number of executors, but it didn't help.
Note: assume we are running the latest MapR core patches and Spark RPM on a 6.1 cluster.
06-24-2021 10:34 PM
Re: Expected performance of Spark when reading CSV/JSON and writing to MaprDB JSON
That is absurdly slow.
I can't say much without more diagnostics about how the program is actually running, but this should go much faster even from a single thread; I would have expected roughly 1000 times that rate even in the worst kind of non-parallel execution.
To solve the problem, you need to break it into pieces to find what is going so pathologically wrong.
So ... how long does it take to tally a sum of all 9 million rows? What happens when you break the file into 10, 20, 50 or 100 pieces and try the same thing?
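As a concrete starting point, here is a minimal sketch of that first experiment in plain Python (stdlib only; no Spark or MapR APIs, and the toy in-memory CSV stands in for your real 9-million-row file):

```python
import csv
import io
import time

def time_sum(rows):
    """Sum the second column of an iterable of CSV rows; return (total, seconds)."""
    start = time.perf_counter()
    total = 0
    for row in rows:
        total += int(row[1])
    return total, time.perf_counter() - start

# Build a toy (id, value) file in memory; replace with open("yourfile.csv").
buf = io.StringIO()
writer = csv.writer(buf)
for i in range(100_000):
    writer.writerow([i, i % 10])
buf.seek(0)

total, elapsed = time_sum(csv.reader(buf))
print(total, elapsed)
```

If a single-threaded tally like this finishes in a few seconds for your real file, the read side is not the bottleneck, and you can move on to splitting the file and timing the pieces.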
And how long does it take to write to a database from synthetic data? Consider writing consecutive integers starting from disjoint, possibly random points. If your program doing the writing is heavily parallelised, how fast can it write to the database? How about to files? And /dev/null?
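The synthetic-writer experiment can be sketched the same way: each worker writes a disjoint range of consecutive integers to a sink. This is illustrative stdlib Python, not a MapR API; swap the `open()` on `os.devnull` for a real table writer (or a file) to see where throughput falls off:

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def write_range(sink_path, start, count):
    """Write `count` consecutive integers starting at `start`; return elapsed seconds."""
    t0 = time.perf_counter()
    with open(sink_path, "w") as sink:
        for i in range(start, start + count):
            sink.write(f"{i}\n")
    return time.perf_counter() - t0

def parallel_write(sink_path, workers=4, rows_per_worker=50_000):
    # Each worker gets its own disjoint integer range.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(write_range, sink_path, w * rows_per_worker, rows_per_worker)
            for w in range(workers)
        ]
        return [f.result() for f in futures]

times = parallel_write(os.devnull)
print(len(times), max(times))
```

Comparing the per-worker times against `/dev/null`, a local file, and the database tells you how much of your 50 seconds is the sink versus everything upstream of it.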
How well distributed is your database? Presumably, since you haven't been able to write to it at any decent speed, it is still quite small and thus limited to just a few machines. Try using the bulk loader to fill it to a decent size across at least 10 containers and check your speed again.
Your speed is so slow that I almost wonder if it is being limited by the Spark scheduler itself. How fast does a simple multi-threaded Java program running on a single machine perform the load? If Spark is much slower than that, you have some pathology in how your Spark program is running. Can you see some sort of execution trace? Is it perhaps true that you are running a single unit of execution per input line?
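The one-unit-of-execution-per-line pathology is easy to demonstrate in miniature. This hedged sketch (plain Python threads standing in for Spark tasks; numbers and batch size are illustrative) submits one task per record versus one task per batch and checks that the results agree while the per-task overhead differs enormously:

```python
from concurrent.futures import ThreadPoolExecutor

def process(record):
    """Stand-in for per-record work, e.g. parsing and writing one row."""
    return record * 2

records = list(range(20_000))

def run_per_record():
    # One scheduled task per record: per-task overhead is paid 20,000 times.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(process, records))

def run_batched(batch=1_000):
    # One scheduled task per batch: the same work, overhead paid 20 times.
    chunks = [records[i:i + batch] for i in range(0, len(records), batch)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(lambda c: sum(process(r) for r in c), chunks))

per_record_total = run_per_record()
batched_total = run_batched()
print(per_record_total, batched_total)
```

If your Spark job is effectively in the first regime, batching records per task (or repartitioning to a sane number of partitions) is the fix to try first.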
Anyway, you get the idea. Break the program down and replace the pieces with alternatives or with complete fakes to find out what is slowing things down. The problem is, almost by definition, not where you are looking, so you must force yourself to bisect the problem space so that you can focus on aspects you have not considered. Since you have probably already thought about this problem at length, it will be hard to convince yourself that the problem is actually what it is. Thus, simple experiments are necessary.
Keep in mind the risk that the problem will disappear in any dissected version of your program but reappear when you put it back together. This will imply that the problem is one of interaction ... that will also succumb to a similar bisection, but require a bisection of more subtlety.
It is also possible that the problem is in multiple places. As you dissect, if you find a serious problem, do not assume it is the only one. Test the other half of the problem space to assure yourself honestly that the problem you found is the only one.