spark.read.load returns a dataframe bigger than the actual CSV file.
05-16-2022 11:59 PM - last edited on 05-17-2022 10:15 PM by support_s
Hi,
I am testing Ezmeral 7.0.0 with EEP 8.1.0 (Spark 3.2.0).
When I use the spark.read.load function to load a CSV file (around 5 MB in size), the DataFrame record count does not match the CSV record count.
The CSV file has 51000 records.
The DataFrame record count (the result of df.count()) is 51953.
(An executor might be loading some records twice.)
With CSV files smaller than 4 MB, the issue does not reproduce.
If you have a solution, please share it.
My sample PySpark code is below:
# -*- coding: utf-8 -*-
import sys
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # Create (or reuse) the Spark session.
    spark = SparkSession.builder.getOrCreate()

    # Load the ~5 MB CSV from the MapR filesystem.
    mapr_path = 'maprfs:/rawdata/test.csv'
    df = spark.read.load(mapr_path, format="csv", header=True)

    # Expected 51000, but prints 51953 on EEP 8.1.0.
    print(df.count())
    sys.exit(0)
Thanks in advance.
Additional:
I ran the same program in an OSS environment (Hadoop 3.2 and Spark 3.2.0).
The issue does not reproduce there; the result is correct. I suspect EEP 8.1.0 has a problem.
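For reference, a minimal cross-check I used, assuming the same maprfs:/rawdata/test.csv path as above and no multiline quoted fields: count the raw lines with the RDD API and compare against df.count().

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = 'maprfs:/rawdata/test.csv'

# Raw line count via the RDD API; subtract 1 for the header row.
# (Only valid if no field spans multiple lines.)
raw_records = spark.sparkContext.textFile(path).count() - 1

df = spark.read.load(path, format="csv", header=True)

print("raw records:", raw_records)  # expected: 51000
print("df.count():", df.count())    # observed: 51953 on EEP 8.1.0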
05-17-2022 01:00 AM
Query: spark.read.load return dataframe bigger than actual csv file. [https://hpe.to/6608zMuXQ]
System recommended content:
1. HPE Ezmeral Data Fabric 7.0 Documentation | SparkSQL and DataFrames
Please click on the "Thumbs Up/Kudo" icon to give a "Kudo".
Thank you for being a valuable HPE community member.
07-07-2022 07:22 PM
Re: spark.read.load returns a dataframe bigger than the actual CSV file.
The issue has already been addressed. Please install the following RPM: mapr-spark-3.2.0.1.202204272354-1.noarch, which is uploaded to SFTP [https://sftp.mapr.com/] under the path /ecosystem/rpm/spark/mep-8.1.0. The username to log in is "maprpatches".
Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0.1-eep-810</version>
</dependency>
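After installing the patched RPM, a quick sanity check is to re-run the original count, assuming the same test file as in the original post (the exact version string Spark reports for the patch build is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Confirm the patched build is in effect (exact string is an assumption).
print(spark.version)

df = spark.read.load('maprfs:/rawdata/test.csv', format="csv", header=True)
print(df.count())  # should now print 51000, matching the file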
- Tags:
- Spark