<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: spark.read.load return dataframe bigger than actual csv file. in HPE Ezmeral Software platform</title>
    <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/spark-read-load-return-dataframe-bigger-than-actual-csv-file/m-p/7170105#M234</link>
    <description>&lt;P&gt;The issue has already been addressed. Please install the following RPM&amp;nbsp;mapr-spark-3.2.0.1.202204272354-1.noarch which is&amp;nbsp; uploaded to SFTP [&lt;A href="https://sftp.mapr.com/" target="_blank"&gt;https://sftp.mapr.com/&lt;/A&gt;] under path - /ecosystem/rpm/spark/mep-8.1.0. The username to login is "maprpatches".&lt;/P&gt;&lt;P&gt;Maven dependency :&lt;/P&gt;&lt;P&gt;&amp;lt;dependency&amp;gt;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;lt;groupId&amp;gt;org.apache.spark&amp;lt;/groupId&amp;gt;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;lt;artifactId&amp;gt;spark-core_2.12&amp;lt;/artifactId&amp;gt;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;lt;version&amp;gt;3.2.0.1-eep-810&amp;lt;/version&amp;gt;&lt;BR /&gt;&amp;lt;/dependency&amp;gt;&lt;/P&gt;</description>
    <pubDate>Fri, 08 Jul 2022 02:22:28 GMT</pubDate>
    <dc:creator>Vinayak_Meghraj</dc:creator>
    <dc:date>2022-07-08T02:22:28Z</dc:date>
    <item>
      <title>spark.read.load return dataframe bigger than actual csv file.</title>
      <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/spark-read-load-return-dataframe-bigger-than-actual-csv-file/m-p/7166482#M228</link>
      <description>&lt;P&gt;Hi ,&lt;BR /&gt;&lt;BR /&gt;I am testing Ezmeral 7.0.0 with EEP 8.1.0 (Spark 3.2.0).&lt;BR /&gt;When I use the spark.read.load function to load a csv file (around 5MB in size) , the dataframe record count does not match the csv record count.&lt;BR /&gt;The CSV file record count is 51000.&lt;BR /&gt;The dataframe record count (result of df.count()) is 51953.&lt;BR /&gt;(The executor might load the same record twice.)&lt;BR /&gt;&lt;BR /&gt;A CSV file smaller than 4MB&amp;nbsp; does not reproduce this issue.&lt;BR /&gt;If you have any solution , please share it with me.&lt;BR /&gt;&lt;BR /&gt;My sample pyspark code is below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# -*- coding: utf-8 -*-
import sys
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # spark session
    spark = SparkSession.builder.getOrCreate()

    mapr_path = 'maprfs:/rawdata/test.csv'
    df = spark.read.load(mapr_path, format="csv", header=True)
    print(df.count())

    sys.exit(0)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Thanks in advance.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Additional:&lt;BR /&gt;&lt;/STRONG&gt;I tried running the same program in an OSS environment&amp;nbsp; (hadoop 3.2 and spark 3.2.0).&lt;BR /&gt;It does not reproduce this issue ; the result is correct. I suspect EEP 8.1.0 has a problem.&lt;/P&gt;</description>
      <pubDate>Wed, 18 May 2022 05:15:41 GMT</pubDate>
      <guid>https://community.hpe.com/t5/hpe-ezmeral-software-platform/spark-read-load-return-dataframe-bigger-than-actual-csv-file/m-p/7166482#M228</guid>
      <dc:creator>smarte_basis</dc:creator>
      <dc:date>2022-05-18T05:15:41Z</dc:date>
    </item>
    <item>
      <title>Query: spark.read.load return dataframe bigger than actual csv file.</title>
      <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/spark-read-load-return-dataframe-bigger-than-actual-csv-file/m-p/7166487#M229</link>
      <description>&lt;P style="margin: 0;"&gt;&lt;STRONG&gt;System recommended content:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="margin: 0;"&gt;1. &lt;A href="https://hpe.to/6604zMuXq" target="_blank" rel="noopener"&gt;HPE Ezmeral Data Fabric 7.0 Documentation |  SparkSQL and DataFrames&lt;/A&gt;&lt;/P&gt;
&lt;P style="margin: 0;"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="margin: 0;"&gt;Please click on "Thumbs Up/Kudo" icon to give a "Kudo".&lt;/P&gt;
&lt;P style="margin: 0;"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="margin: 0;"&gt;Thank you for being a HPE valuable community member.&lt;/P&gt;</description>
      <pubDate>Tue, 17 May 2022 08:00:03 GMT</pubDate>
      <guid>https://community.hpe.com/t5/hpe-ezmeral-software-platform/spark-read-load-return-dataframe-bigger-than-actual-csv-file/m-p/7166487#M229</guid>
      <dc:creator>support_s</dc:creator>
      <dc:date>2022-05-17T08:00:03Z</dc:date>
    </item>
    <item>
      <title>Re: spark.read.load return dataframe bigger than actual csv file.</title>
      <link>https://community.hpe.com/t5/hpe-ezmeral-software-platform/spark-read-load-return-dataframe-bigger-than-actual-csv-file/m-p/7170105#M234</link>
      <description>&lt;P&gt;The issue has already been addressed. Please install the following RPM&amp;nbsp;mapr-spark-3.2.0.1.202204272354-1.noarch which is&amp;nbsp; uploaded to SFTP [&lt;A href="https://sftp.mapr.com/" target="_blank"&gt;https://sftp.mapr.com/&lt;/A&gt;] under path - /ecosystem/rpm/spark/mep-8.1.0. The username to login is "maprpatches".&lt;/P&gt;&lt;P&gt;Maven dependency :&lt;/P&gt;&lt;P&gt;&amp;lt;dependency&amp;gt;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;lt;groupId&amp;gt;org.apache.spark&amp;lt;/groupId&amp;gt;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;lt;artifactId&amp;gt;spark-core_2.12&amp;lt;/artifactId&amp;gt;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;lt;version&amp;gt;3.2.0.1-eep-810&amp;lt;/version&amp;gt;&lt;BR /&gt;&amp;lt;/dependency&amp;gt;&lt;/P&gt;</description>
      <pubDate>Fri, 08 Jul 2022 02:22:28 GMT</pubDate>
      <guid>https://community.hpe.com/t5/hpe-ezmeral-software-platform/spark-read-load-return-dataframe-bigger-than-actual-csv-file/m-p/7170105#M234</guid>
      <dc:creator>Vinayak_Meghraj</dc:creator>
      <dc:date>2022-07-08T02:22:28Z</dc:date>
    </item>
  </channel>
</rss>

