Introduction to Apache Spark

Author: neptune | 27th-Jan-2023 | Views: 311
#Python #Apache Spark

What is Spark?

Apache Spark is an open-source, distributed processing system that utilizes in-memory caching and optimized query execution for faster queries.

All about Spark

Apache Spark supports multiple programming languages, including Java, Scala, R, and Python. The language is usually chosen based on how efficiently it solves the task at hand, and most developers prefer Scala.

Python is generally slower than Scala, while Java is more verbose and historically lacked a Read-Evaluate-Print Loop (REPL).

Applications of Spark

  • Interactive analysis – MapReduce supports only batch processing, whereas Apache Spark processes data quickly enough to answer exploratory queries interactively, without sampling.

  • Event detection – the streaming functionality of Spark lets organizations monitor for unusual behavior and protect their systems in real time. Health and security organizations and financial institutions use such triggers to detect potential risks.

  • Machine Learning – Apache Spark ships with a scalable machine learning library, MLlib, which performs advanced analytics on iterative problems. Critical analytics jobs such as sentiment analysis, customer segmentation, and predictive analysis are a natural fit for Spark.

What are RDDs?

The key idea in Spark is the Resilient Distributed Dataset (RDD), which supports in-memory computation.

This means Spark can keep data in memory as objects across jobs and share those objects between jobs. Sharing data in memory is 10 to 100 times faster than sharing it over the network or via disk.

  • Resilient Distributed Datasets (RDDs) are the main abstraction in Spark.

  • An RDD is a partitioned collection of objects spread across a cluster that can be persisted in memory or on disk.

  • Once created, RDDs are immutable.
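The "partitioned collection" idea can be sketched in plain Python, with no cluster or pyspark installation required. The helper below is purely illustrative: it mimics how `sc.parallelize(data, 3)` would split a local list into 3 partitions that a cluster would place on different nodes.

```python
# Plain-Python sketch of the "partitioned collection" idea behind an RDD.
data = list(range(10))
num_partitions = 3

def partition(seq, n):
    """Split seq into n roughly equal chunks, like sc.parallelize(seq, n)."""
    size, rem = divmod(len(seq), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        chunks.append(seq[start:end])
        start = end
    return chunks

parts = partition(data, num_partitions)
print(parts)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

In real Spark each chunk would live on a different node, and transformations would run on all chunks in parallel.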

Features of RDDs

  • Resilient – fault-tolerant via the RDD lineage graph, so Spark can recompute damaged or missing partitions after node failures.

  • Dataset – a set of partitioned data, such as primitive values, records, or tuples.

  • Distributed – the data resides on multiple nodes in a cluster.

RDD Operations

  • flatMap, map, reduceByKey, and filter are examples of transformations on RDDs; each returns a new RDD.

  • count, collect, reduce, take, and first are a few of the actions in Spark, which return a result to the driver.

  • foreach(func) and saveAsTextFile(path) are also examples of actions.
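The transformation-then-action pattern can be mimicked on a single machine in plain Python (no pyspark needed). In PySpark the same word count would read `rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect()`; the sketch below shows what those steps do to the data.

```python
from functools import reduce
from itertools import groupby

# Single-machine sketch of map -> reduceByKey -> collect semantics.
words = ["spark", "rdd", "spark", "action", "rdd", "spark"]

pairs = [(w, 1) for w in words]        # map: each word becomes (word, 1)
pairs.sort(key=lambda kv: kv[0])       # shuffle: bring equal keys together
counts = [
    (key, reduce(lambda a, b: a + b, (v for _, v in grp)))
    for key, grp in groupby(pairs, key=lambda kv: kv[0])
]                                      # reduceByKey: sum the counts per key

print(counts)  # [('action', 1), ('rdd', 2), ('spark', 3)]
```

The sort stands in for Spark's shuffle, which moves records with the same key onto the same partition before the per-key reduce runs.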

What is Lazy Evaluation?

When we call a transformation on an RDD, the operation is not executed immediately. Instead, Spark internally records metadata noting that the operation has been requested; nothing runs until an action is called. This is known as lazy evaluation.
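Python generators give a rough single-machine analogy for this behavior: building the pipeline only records what to do, and no work happens until the results are demanded (Spark's equivalent of calling an action).

```python
# Generator analogy for lazy evaluation: defining the pipeline does no
# work; consuming it (the "action") triggers the computation.
executed = []

def square(x):
    executed.append(x)          # log when work actually happens
    return x * x

pipeline = (square(x) for x in range(4))  # "transformation": no work yet
assert executed == []                     # nothing has executed so far

result = list(pipeline)                   # "action": forces evaluation
print(result)    # [0, 1, 4, 9]
print(executed)  # [0, 1, 2, 3]
```

Deferring execution this way lets Spark see the whole chain of transformations at once and optimize it before running anything.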

DataFrame in Spark

DataFrames can be created from a wide array of sources like existing RDDs, external databases, tables in Hive, or structured data files.
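As a rough sketch of what "structured" means here, a DataFrame is named columns laid over rows. The plain-Python model below is illustrative only; in PySpark the real object would come from `spark.createDataFrame(rows, columns)`, and the row values shown are made up.

```python
# Plain-Python sketch of the row/column structure a DataFrame imposes.
rows = [("Alice", 34), ("Bob", 45)]   # e.g. parsed from a file or an RDD
columns = ["name", "age"]

df = [dict(zip(columns, row)) for row in rows]
print(df[0])  # {'name': 'Alice', 'age': 34}
```

Because every row shares the same named, typed columns, Spark can optimize DataFrame queries far more aggressively than operations on raw RDDs.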

Thanks for Reading !!!


