
Spark is the ultimate toolkit for a data engineer

Sometime between 2010 and 2015, the job of data engineer was born. As the availability of data grew, the need to build and maintain data infrastructure also grew. Someone needed to develop, test, and maintain data architectures so data scientists could do their jobs. And that newly created job fell to the data engineer.

So, who or what helps data engineers do their jobs? According to HPE's Don Wake, the answer is Spark. His recent CIO.com article, Spark: A Data Engineer's Best Friend, explains why Spark is the ultimate toolkit for a data engineer. He provides two reasons for awarding Spark such high praise:

  • Spark simplifies the work environment by providing a platform to organize and execute complex data pipelines.
  • Spark consists of powerful tools for storing, retrieving, and transforming data.

Although Wake admits Spark doesn't do everything (and says data engineers love lots of other tools), it does one important thing:

"It provides a unified environment that accepts data in many different forms and allows all the tools to work together on the same data, passing a data set from one step to the next. Doing this well means you can create data pipelines at scale."

What can you do with Spark?

Wake says a data engineer can do the following with Spark:

  • Connect to different data sources in different locations.

These include cloud sources such as Amazon S3, databases, Hadoop file systems, data streams, web services, and flat files.
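As a rough illustration, here is a minimal PySpark sketch of reading from a few of those source types in one session. The paths, bucket name, and connection details are all placeholders, and the S3 and JDBC reads assume the appropriate connectors are on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-source-demo").getOrCreate()

# Flat file on a Hadoop file system (path is illustrative)
orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

# Object storage such as Amazon S3 (hypothetical bucket; needs the hadoop-aws connector)
events = spark.read.json("s3a://example-bucket/events/")

# Relational database over JDBC (URL, table, and credentials are placeholders)
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db-host:5432/shop")
             .option("dbtable", "customers")
             .option("user", "reader")
             .option("password", "secret")
             .load())
```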

  • Convert different data types into a standard format.

The Spark data processing API accepts many different types of input data. Spark then uses Resilient Distributed Datasets (RDDs) and DataFrames for simplified, yet advanced data processing.
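For example, a raw CSV file whose columns all arrive as strings can be cast into a typed DataFrame that downstream steps can rely on. This is a minimal sketch; the file path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("normalize-demo").getOrCreate()

# Raw CSV: every column is read as a string by default
raw = spark.read.csv("/data/raw_readings.csv", header=True)

# Cast into a standard, typed shape for the rest of the pipeline
readings = raw.select(
    F.col("sensor_id").cast("int"),
    F.to_timestamp("read_at", "yyyy-MM-dd HH:mm:ss").alias("read_at"),
    F.col("value").cast("double"),
)

# The same data is also reachable through the lower-level RDD interface
sample = readings.rdd.take(5)
```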

  • Write programs that access, transform, and store the data.

Many common programming languages have APIs that integrate Spark code directly, and Spark offers many powerful functions for performing complex ETL-style data cleaning and transformation. Spark also includes a high-level API that allows users to seamlessly write queries in SQL.
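A small ETL-style sketch of that idea, with hypothetical input and output paths and column names, accessing, transforming, and storing the data in one pass, and querying the cleaned result through the SQL API:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Access: read the source data (path is illustrative)
orders = spark.read.parquet("/data/orders/")

# Transform: ETL-style cleaning and a derived column
cleaned = (orders
           .dropDuplicates(["order_id"])
           .na.fill({"discount": 0.0})
           .withColumn("total", F.col("quantity") * F.col("unit_price")))

# The high-level SQL API over the same data
cleaned.createOrReplaceTempView("orders")
summary = spark.sql(
    "SELECT customer_id, SUM(total) AS lifetime_value "
    "FROM orders GROUP BY customer_id"
)

# Store: write the result for downstream consumers
summary.write.mode("overwrite").parquet("/data/customer_ltv/")
```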

  • Integrate with almost every important tool.

These include tools for data wrangling, data profiling, data discovery, and data graphing.

Why is Spark so special?

To better understand why Spark is unique, Wake compares it to the Hadoop infrastructure.

  • Spark is designed to be modular

Spark is essentially a modular toolkit, initially designed to work with Hadoop via the YARN cluster manager interface. Yet Spark can also be used outside of Hadoop, allowing storage and compute to scale independently. Regardless of an organization's favorite storage and compute infrastructure, Spark empowers users to interface with it.
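One way to see that modularity in PySpark: the same application code can target YARN, a standalone Spark cluster, Kubernetes, or a single laptop just by changing the master URL. The host names below are placeholders:

```python
from pyspark.sql import SparkSession

# The application logic stays the same; only the cluster manager changes.
spark = (SparkSession.builder
         .appName("portable-job")
         # .master("yarn")                           # inside a Hadoop/YARN cluster
         # .master("spark://controller:7077")        # Spark standalone cluster
         # .master("k8s://https://api-server:6443")  # Kubernetes
         .master("local[*]")                         # single machine, for development
         .getOrCreate())
```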

  • Spark accepts data of any size and form 

Compared to Hadoop, Spark is more of an application toolkit: it isn't concerned with one storage type but rather lets you use any storage type, and it provides a broader, tailor-made environment. Spark takes raw materials, turns them into reusable forms, and delivers them to analytic workloads. Additionally, Spark can work in either batch or interactive fashion. As a result, Spark has become the go-to platform for most data applications and is especially suited to solving a data engineer's problems.
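To make the batch-versus-streaming point concrete, here is a sketch in which the same high-level API handles both a bounded batch read and an unbounded stream. The paths and the socket source are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch: read a bounded dataset once (path is illustrative)
clicks = spark.read.json("/data/clicks/2024-01-01/")
clicks.groupBy("page").count().show()

# Streaming: the same API shape over an unbounded source
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())
counts = lines.groupBy("value").count()
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```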

  • Spark supports multiple approaches and users 

While Spark can access raw forms of data and interact with Hadoop file systems, it isn't limited to a single paradigm for achieving these aims. Instead, it is built from the ground up to provide multiple processing approaches -- all while using the same underlying data format.
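For instance, the same hypothetical dataset can be processed three different ways -- through the DataFrame API, through SQL, or through low-level RDD operations -- without changing how it is stored:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-paradigm").getOrCreate()

trips = spark.read.parquet("/data/trips/")  # hypothetical dataset

# Approach 1: the DataFrame API
by_city = trips.groupBy("city").agg(F.avg("fare").alias("avg_fare"))

# Approach 2: SQL over the very same data
trips.createOrReplaceTempView("trips")
by_city_sql = spark.sql("SELECT city, AVG(fare) AS avg_fare FROM trips GROUP BY city")

# Approach 3: low-level RDD functions on the same rows
fare_pairs = trips.rdd.map(lambda row: (row["city"], row["fare"]))
```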

Designed for the data engineer

Spark was designed with data engineers in mind, offering them lots of elasticity and flexibility. Because Spark utilizes an API-driven approach, engineers can access a variety of tools with Spark as the analytics engine. This modularity lets data engineers use open-source tools and avoids vendor lock-in. 

And because Spark now works with Kubernetes and containerization, engineers can spin up and spin down Spark clusters, managing them as Kubernetes pods instead of physical, standalone, or bare-metal clusters. A Spark cluster deployed on top of a Kubernetes cluster leverages the hardware abstraction layer managed by Kubernetes, which avoids the complex and often time-consuming work of IT administration and cluster management. 
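As a rough sketch of what that looks like in application code -- the API server address, container image, and namespace below are all placeholders, and a real deployment needs images and service accounts set up in advance:

```python
from pyspark.sql import SparkSession

# Point a Spark application at a Kubernetes cluster instead of
# a standalone or bare-metal cluster (all values are placeholders).
spark = (SparkSession.builder
         .appName("spark-on-k8s")
         .master("k8s://https://k8s-api.example.com:6443")
         .config("spark.kubernetes.container.image", "example/spark:latest")
         .config("spark.kubernetes.namespace", "data-eng")
         .config("spark.executor.instances", "4")
         .getOrCreate())
```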

Everybody's best friend

Spark was created to solve data engineering problems with an emphasis on analytics/machine learning -- AND to be accessible and helpful to people further down the data pipeline. By offering scalable compute with scalable toolsets, Spark helps engineers empower others to leverage data to the fullest. According to Wake, that makes Spark not only a data engineer's best friend -- but everybody's best friend.

I've only highlighted a portion of Wake's original article; I encourage you to read the full article, Spark: A Data Engineer's Best Friend. To read more from Don Wake about Spark, check out his blog: Ready to become a superhero? Build an ML model with Spark on HPE Ezmeral now.

Alison Golan

Hewlett Packard Enterprise

twitter.com/HPE_Ezmeral
linkedin.com/showcase/hpe-ezmeral
hpe.com/software

 

About the Author


Alison Golan is a writer/editor for HPE's social marketing team. For 30+ years, she's been writing about technology – from hardware and software to networking and streaming. She started her tech career as a public relations specialist, arranging media coverage with CBS, CNN, CNBC, The New York Times, The Wall Street Journal, Business Week, and Fortune. Today, she enjoys transforming technical jargon into compelling stories.