What is Spark?
Apache Spark is an open-source, distributed processing system that uses in-memory caching and optimized query execution to answer queries quickly.
All about Spark
Apache Spark currently supports multiple programming languages, including Java, Scala, R, and Python. The language is usually chosen for how efficiently it expresses functional solutions to tasks: Python is generally slower than Scala, while Java is more verbose and historically lacked a Read-Evaluate-Print Loop (REPL), so most developers prefer Scala.
Interactive analysis – MapReduce was built for batch processing, whereas Apache Spark processes data fast enough to answer exploratory queries interactively, without resorting to sampling.
Event detection – Spark's streaming functionality lets organizations monitor unusual behavior in real time to protect their systems. Health and security organizations and financial institutions use such triggers to detect potential risks.
Machine Learning – Apache Spark ships with a scalable machine learning library named MLlib, which executes advanced analytics on iterative problems. Critical analytics jobs such as sentiment analysis, customer segmentation, and predictive analysis make Spark an intelligent technology; a small sketch follows this list.
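As a taste of MLlib, here is a minimal sketch that clusters a toy dataset with KMeans, an iterative algorithm of the kind MLlib is built for. The app name, the local master, and the four hand-made vectors are illustrative assumptions, not production code:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical standalone app; in spark-shell a SparkSession `spark` already exists.
val spark = SparkSession.builder.appName("MLlibSketch").master("local[*]").getOrCreate()

// Toy dataset of feature vectors; MLlib's KMeans reads the "features" column.
val data = spark.createDataFrame(Seq(
  (1, Vectors.dense(0.0, 0.0)),
  (2, Vectors.dense(1.0, 1.0)),
  (3, Vectors.dense(9.0, 8.0)),
  (4, Vectors.dense(8.0, 9.0))
)).toDF("id", "features")

// KMeans iterates over the same data repeatedly, which is exactly
// the access pattern that benefits from Spark's in-memory processing.
val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)
```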
The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory computation.
This means Spark can store intermediate state in memory as objects across jobs, and those objects are shareable between jobs. Sharing data in memory is 10 to 100 times faster than sharing it over the network or via disk.
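A minimal sketch of that idea, assuming a SparkContext named `sc` (as provided by spark-shell): the RDD is persisted in memory once, and both subsequent jobs read it from there instead of recomputing it.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes a SparkContext `sc` (available automatically in spark-shell).
val numbers = sc.parallelize(1 to 1000000)

// Mark the RDD to be kept in memory so later jobs reuse it
// instead of recomputing it from scratch.
val squares = numbers.map(n => n.toLong * n).persist(StorageLevel.MEMORY_ONLY)

// Two separate jobs: the first computes and caches the partitions,
// the second reads them straight from memory.
println(squares.count())
println(squares.sum())
```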
Resilient Distributed Datasets (RDDs) are the main abstraction in Spark.
An RDD is a partitioned collection of objects spread across a cluster, and it can be persisted in memory or on disk.
Once created, RDDs are immutable. The name describes their key properties:
Resilient – tolerant to faults: using the RDD lineage graph, Spark can recompute partitions damaged or lost due to node failures.
Dataset – a set of partitioned data with primitive or composite values, for example records or tuples.
Distributed – the data resides on multiple nodes in a cluster.
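These properties are easy to see from the spark-shell. The following sketch assumes a SparkContext `sc`, and the tiny inline dataset is just for illustration:

```scala
// Assumes a SparkContext `sc`. `numSlices` controls the number of partitions.
val records = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

// Distributed: the data is split into partitions spread across the cluster.
println(records.getNumPartitions) // 4

// Resilient: toDebugString prints the lineage graph Spark would use
// to recompute a lost partition.
println(records.map { case (k, v) => (k, v * 10) }.toDebugString)
```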
flatMap, map, and reduceByKey are examples of transformations on RDDs.
count, collect, reduce, take, and first are a few actions in Spark.
foreach(func) and saveAsTextFile(path) are also examples of actions.
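Putting transformations and actions together, here is a sketch of the classic word count. The input path `input.txt` and output directory `counts` are hypothetical, and a SparkContext `sc` is assumed:

```scala
// Assumes a SparkContext `sc` and a hypothetical input file "input.txt".
val lines = sc.textFile("input.txt")

// Transformations: each call just builds a new RDD in the lineage graph.
val counts = lines
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Actions: these trigger actual execution.
println(counts.count())          // total number of distinct words
counts.take(5).foreach(println)  // first five (word, count) pairs
counts.saveAsTextFile("counts")  // hypothetical output dir, one file per partition
```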
What is Lazy Evaluation?
When we call a transformation on an RDD, the operation is not executed immediately. Instead, Spark internally records metadata noting that the operation has been requested, and the computation runs only when an action is invoked. This is called lazy evaluation.
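A small sketch of lazy evaluation, assuming a SparkContext `sc`: the `println` inside the map runs only once the final action executes.

```scala
// Assumes a SparkContext `sc`.
val nums = sc.parallelize(1 to 10)

// A transformation: Spark only records that this map was requested.
// Nothing is printed here, because nothing has executed yet.
val doubled = nums.map { n =>
  println(s"mapping $n")
  n * 2
}

// The action triggers the recorded computation; only now does the
// "mapping ..." output appear (on the workers, or locally in local mode).
println(doubled.reduce(_ + _)) // 110
```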
DataFrames can be created from a wide array of sources, such as existing RDDs, external databases, Hive tables, or structured data files.
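A minimal sketch of a few of those sources; the file path `people.json` and the commented-out Hive table name are hypothetical, and a SparkSession is assumed:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session; import its implicits to get toDF on RDDs.
val spark = SparkSession.builder.appName("DataFrameSources").master("local[*]").getOrCreate()
import spark.implicits._

// From an existing RDD of tuples.
val fromRdd = spark.sparkContext
  .parallelize(Seq(("Alice", 34), ("Bob", 29)))
  .toDF("name", "age")

// From a structured data file (hypothetical path).
val fromJson = spark.read.json("people.json")

// From a Hive table (requires enableHiveSupport() on the builder):
// val fromHive = spark.table("my_hive_table")

fromRdd.show()
```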
Thanks for reading!