HPE Ezmeral Software platform

spark.read.load returns a DataFrame with more records than the actual CSV file

 
smarte_basis
Occasional Collector

Hi,

I am testing Ezmeral 7.0.0 with EEP 8.1.0 (Spark 3.2.0).
When I use spark.read.load to load a CSV file (around 5 MB), the number of records in the DataFrame does not match the number of records in the CSV file.
The CSV file has 51000 records.
The DataFrame has 51953 records (result of df.count()).
(An executor might be loading the same records twice.)

With a CSV file smaller than 4 MB, the issue does not reproduce.
If you have a solution, please share it.

My sample PySpark code is below:

 

# -*- coding: utf-8 -*-
import sys
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # spark session
    spark = SparkSession.builder.getOrCreate()

    mapr_path = 'maprfs:/rawdata/test.csv'
    df = spark.read.load(mapr_path, format="csv", header=True)
    print(df.count())

    sys.exit(0)

 


Thanks in advance.

Additional:
I ran the same program in an OSS environment (Hadoop 3.2 and Spark 3.2.0).
The issue does not reproduce there and the result is correct, so I suspect a problem in EEP 8.1.0.
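
One way to confirm the expected record count independently of Spark is to count the records with Python's standard csv module, which treats a quoted field containing a newline as part of a single record. This is only a minimal sketch: the helper name count_csv_records is hypothetical, and it assumes the file can be opened from a local path (for example a local copy or an NFS mount of the maprfs location).

```python
import csv


def count_csv_records(path, header=True, encoding="utf-8"):
    """Count data records in a CSV file without using Spark.

    The csv module parses quoted fields, so a quoted value that
    spans several physical lines still counts as one record.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        if header:
            next(reader, None)  # skip the header row, if there is one
        # csv.reader yields an empty list for a blank line; skip those.
        return sum(1 for row in reader if row)
```

Calling count_csv_records on a local copy of test.csv and comparing the result with df.count() shows whether the extra rows come from the file itself or from the read path. Comparing df.count() with df.distinct().count() in the Spark job is a further check on whether records are being loaded twice.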

2 REPLIES
support_s
System Recommended

Query: spark.read.load return dataframe bigger than actual csv file.

System recommended content:

1. HPE Ezmeral Data Fabric 7.0 Documentation | SparkSQL and DataFrames

 


Vinayak_Meghraj
Occasional Contributor

Re: spark.read.load return dataframe bigger than actual csv file.

The issue has already been addressed. Please install the following RPM: mapr-spark-3.2.0.1.202204272354-1.noarch, which has been uploaded to the SFTP server [https://sftp.mapr.com/] under the path /ecosystem/rpm/spark/mep-8.1.0. The login username is "maprpatches".

Maven dependency:

<dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_2.12</artifactId>
       <version>3.2.0.1-eep-810</version>
</dependency>