Spark is the ultimate toolkit for a data engineer
Sometime between 2010 and 2015, the job of data engineer was born. As the availability of data grew, the need to build and maintain data infrastructure also grew. Someone needed to develop, test, and maintain data architectures so data scientists could do their jobs. And that newly created job fell to the data engineer.
So, who or what helps data engineers do their jobs? According to HPE's Don Wake, the answer is Spark. His recent CIO.com article, Spark: A Data Engineer's Best Friend, explains why Spark is the ultimate toolkit for a data engineer. He provides two reasons for awarding Spark such high praise:
- Spark simplifies the work environment by providing a platform to organize and execute complex data pipelines.
- Spark consists of powerful tools for storing, retrieving, and transforming data.
Although Wake admits it doesn't do everything (and says data engineers love lots of other important tools), Spark does one important thing:
"It provides a unified environment that accepts data in many different forms and allows all the tools to work together on the same data, passing a data set from one step to the next. Doing this well means you can create data pipelines at scale."
What can you do with Spark?
Wake says a data engineer can do the following with Spark:
- Connect to different data sources in different locations.
These include cloud sources such as Amazon S3, databases, Hadoop file systems, data streams, web services, and flat files.
- Convert different data types into a standard format.
The Spark data processing API accepts many different input data types. Spark then uses Resilient Distributed Datasets (RDDs) and DataFrames for simplified yet advanced data processing.
- Write programs that access, transform, and store the data.
Many common programming languages have APIs to integrate Spark code directly, and Spark offers many powerful functions for performing complex ETL-style data cleaning and transformation functions. Spark also includes a high-level API that allows users to seamlessly write queries in SQL.
- Integrate with almost every important tool.
These include tools for data wrangling, data profiling, data discovery, and data graphing.
Why is Spark so special?
To better understand why Spark is unique, Wake compares it to the Hadoop infrastructure.
- Spark is designed to be modular
Spark is essentially a modular toolkit initially designed to work with Hadoop via the YARN cluster manager interface. Yet Spark is also valuable outside of Hadoop, allowing storage and compute to scale independently. Whatever an organization's preferred storage and compute infrastructure, Spark empowers users to interface with it.
- Spark accepts data of any size and form
Compared to Hadoop, Spark is more of an application toolkit: it isn't tied to one storage type but instead lets you use any storage type, and it provides a broader, tailor-made environment. Spark takes raw materials, turns them into reusable forms, and delivers them to analytic workloads. Additionally, Spark can work in a batch fashion or an interactive fashion. As a result, Spark has become the go-to platform for most data applications and is especially suited to solving a data engineer's problems.
- Spark supports multiple approaches and users
While Spark can access raw forms of data and interact with Hadoop file systems, it isn't limited to a single paradigm for achieving these aims. Instead, it is built from the ground up to support multiple processing approaches, all while using the same underlying data format.
Designed for the data engineer
Spark was designed with data engineers in mind, offering them lots of elasticity and flexibility. Because Spark utilizes an API-driven approach, engineers can access a variety of tools with Spark as the analytics engine. This modularity lets data engineers use open-source tools and avoids vendor lock-in.
And because Spark now works with Kubernetes and containerization, engineers can spin up and spin down Spark clusters, managing them as Kubernetes pods instead of physical, standalone, or bare-metal clusters. A Spark cluster deployed on top of a Kubernetes cluster leverages the hardware abstraction layer managed by Kubernetes, which avoids the complex and often time-consuming work of IT administration and cluster management.
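As a configuration-only sketch of what that looks like, a Spark application can target a Kubernetes cluster purely through session settings; the API server URL, container image, and executor count below are placeholder assumptions, not values from the article:

```python
from pyspark.sql import SparkSession

# Configuration sketch: pointing the session at a Kubernetes API server makes
# Kubernetes the cluster manager, so executors run as pods rather than on
# standalone or bare-metal nodes. All values are placeholders.
builder = (
    SparkSession.builder
    .master("k8s://https://kubernetes-api.example.com:443")              # hypothetical API server
    .config("spark.kubernetes.container.image", "my-registry/spark:3.4")  # hypothetical image
    .config("spark.executor.instances", "4")
    .appName("spark-on-k8s-sketch")
)
# builder.getOrCreate() would then launch executors as pods; it is omitted
# here because it requires a live Kubernetes control plane.
```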
Everybodyโs best friend
Spark was created to solve data engineering problems with an emphasis on analytics and machine learning, while also being accessible and helpful to people further down the data pipeline. By offering scalable compute with scalable toolsets, Spark helps engineers empower others to leverage data to the fullest. According to Wake, that makes Spark not only a data engineer's best friend, but everybody's best friend.
I've only highlighted a portion of Wake's original article; I encourage you to read the full article, Spark: A Data Engineer's Best Friend. To read more from Don Wake about Spark, check out his blog: Ready to become a superhero? Build an ML model with Spark on HPE Ezmeral now.
Alison Golan
Hewlett Packard Enterprise
twitter.com/HPE_Ezmeral
linkedin.com/showcase/hpe-ezmeral
hpe.com/software
Alison Golan is a writer/editor for HPE's social marketing team. For 30+ years, she's been writing about technology, from hardware and software to networking and streaming. She started her tech career as a public relations specialist, arranging media coverage with CBS, CNN, CNBC, The New York Times, The Wall Street Journal, Business Week, and Fortune. Today, she enjoys transforming technical jargon into compelling stories.