What is Spark?
Apache Spark is an open-source, distributed processing system that uses in-memory caching and optimized query execution to answer queries quickly.
All about Spark
Apache Spark currently supports multiple programming languages, including Java, Scala, R, and Python. The language is usually chosen for how efficiently it expresses functional solutions to tasks: Python is generally slower than Scala, while Java is more verbose and historically lacked a Read-Evaluate-Print Loop (REPL), so most developers prefer Scala.
Interactive analysis – MapReduce was built for batch processing, whereas Apache Spark processes data fast enough to answer exploratory queries interactively, without resorting to sampling.
Event detection – Spark's streaming functionality lets organizations monitor unusual behavior in real time to protect their systems. Health and security organizations and financial institutions use such triggers to detect potential risks.
Machine Learning – Apache Spark ships with a scalable machine learning library named MLlib, which executes advanced analytics on iterative problems. Critical analytics jobs such as sentiment analysis, customer segmentation, and predictive analysis make Spark an intelligent technology; a small sketch follows this list.
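As a taste of MLlib, here is a minimal sketch that clusters a toy dataset with KMeans, an iterative algorithm of the kind MLlib is built for. The app name, the local master, and the four hand-made vectors are illustrative assumptions, not production code:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical standalone app; in spark-shell a SparkSession `spark` already exists.
val spark = SparkSession.builder.appName("MLlibSketch").master("local[*]").getOrCreate()

// Toy dataset of feature vectors; MLlib's KMeans reads the "features" column.
val data = spark.createDataFrame(Seq(
  (1, Vectors.dense(0.0, 0.0)),
  (2, Vectors.dense(1.0, 1.0)),
  (3, Vectors.dense(9.0, 8.0)),
  (4, Vectors.dense(8.0, 9.0))
)).toDF("id", "features")

// KMeans iterates over the same data repeatedly, which is exactly
// the access pattern that benefits from Spark's in-memory processing.
val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)
```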
The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory computation.
This means Spark can store intermediate state in memory as objects across jobs, and those objects are shareable between jobs. Sharing data in memory is 10 to 100 times faster than sharing it over the network or via disk.
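A minimal sketch of that idea, assuming a SparkContext named `sc` (as provided by spark-shell): the RDD is persisted in memory once, and both subsequent jobs read it from there instead of recomputing it.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes a SparkContext `sc` (available automatically in spark-shell).
val numbers = sc.parallelize(1 to 1000000)

// Mark the RDD to be kept in memory so later jobs reuse it
// instead of recomputing it from scratch.
val squares = numbers.map(n => n.toLong * n).persist(StorageLevel.MEMORY_ONLY)

// Two separate jobs: the first computes and caches the partitions,
// the second reads them straight from memory.
println(squares.count())
println(squares.sum())
```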
Resilient Distributed Datasets (RDDs) are the main abstraction in Spark.
An RDD is a partitioned collection of objects spread across a cluster, and it can be persisted in memory or on disk.
Once created, RDDs are immutable. The name describes their key properties:
Resilient – tolerant to faults: using the RDD lineage graph, Spark can recompute partitions damaged or lost due to node failures.
Dataset – a set of partitioned data with primitive or composite values, for example records or tuples.
Distributed – the data resides on multiple nodes in a cluster.
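These properties are easy to see from the spark-shell. The following sketch assumes a SparkContext `sc`, and the tiny inline dataset is just for illustration:

```scala
// Assumes a SparkContext `sc`. `numSlices` controls the number of partitions.
val records = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

// Distributed: the data is split into partitions spread across the cluster.
println(records.getNumPartitions) // 4

// Resilient: toDebugString prints the lineage graph Spark would use
// to recompute a lost partition.
println(records.map { case (k, v) => (k, v * 10) }.toDebugString)
```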
flatMap, map, and reduceByKey are examples of transformations on RDDs.
count, collect, reduce, take, and first are a few actions in Spark.
foreach(func) and saveAsTextFile(path) are also examples of actions.
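Putting transformations and actions together, here is a sketch of the classic word count. The input path `input.txt` and output directory `counts` are hypothetical, and a SparkContext `sc` is assumed:

```scala
// Assumes a SparkContext `sc` and a hypothetical input file "input.txt".
val lines = sc.textFile("input.txt")

// Transformations: each call just builds a new RDD in the lineage graph.
val counts = lines
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Actions: these trigger actual execution.
println(counts.count())          // total number of distinct words
counts.take(5).foreach(println)  // first five (word, count) pairs
counts.saveAsTextFile("counts")  // hypothetical output dir, one file per partition
```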
What is Lazy Evaluation?
When we call a transformation on an RDD, the operation is not executed immediately. Instead, Spark internally records metadata noting that the operation has been requested, and the computation runs only when an action is invoked. This is called lazy evaluation.
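A small sketch of lazy evaluation, assuming a SparkContext `sc`: the `println` inside the map runs only once the final action executes.

```scala
// Assumes a SparkContext `sc`.
val nums = sc.parallelize(1 to 10)

// A transformation: Spark only records that this map was requested.
// Nothing is printed here, because nothing has executed yet.
val doubled = nums.map { n =>
  println(s"mapping $n")
  n * 2
}

// The action triggers the recorded computation; only now does the
// "mapping ..." output appear (on the workers, or locally in local mode).
println(doubled.reduce(_ + _)) // 110
```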
DataFrames can be created from a wide array of sources, such as existing RDDs, external databases, Hive tables, or structured data files.
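A minimal sketch of a few of those sources; the file path `people.json` and the commented-out Hive table name are hypothetical, and a SparkSession is assumed:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session; import its implicits to get toDF on RDDs.
val spark = SparkSession.builder.appName("DataFrameSources").master("local[*]").getOrCreate()
import spark.implicits._

// From an existing RDD of tuples.
val fromRdd = spark.sparkContext
  .parallelize(Seq(("Alice", 34), ("Bob", 29)))
  .toDF("name", "age")

// From a structured data file (hypothetical path).
val fromJson = spark.read.json("people.json")

// From a Hive table (requires enableHiveSupport() on the builder):
// val fromHive = spark.table("my_hive_table")

fromRdd.show()
```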
Thanks for reading!