Introduction to Apache Spark

Author: neptune | 27th-Jan-2023
#Python

What is Spark?

Apache Spark is an open-source, distributed processing system that uses in-memory caching and optimized query execution to speed up queries on large datasets.

All about Spark

Apache Spark currently supports multiple programming languages, including Java, Scala, R, and Python. The language is usually chosen for how efficiently it expresses functional solutions to the task at hand, and many developers prefer Scala.

Python is generally slower than Scala, while Java is more verbose and Spark does not ship an interactive shell (REPL, Read-Evaluate-Print Loop) for it.




Applications of Spark

  • Interactive analysis – MapReduce is built for batch processing, whereas Apache Spark processes data quickly enough to run exploratory queries on the full dataset without sampling.

  • Event detection – Spark's streaming functionality lets organizations monitor unusual behavior in order to protect their systems. Health and security organizations and financial institutions use triggers to detect potential risks.

  • Machine Learning – Apache Spark comes with a scalable machine learning library named MLlib, which runs advanced analytics on iterative problems. Typical analytics jobs include sentiment analysis, customer segmentation, and predictive analysis.

What are RDDs?

The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory computation.

This means that Spark keeps intermediate state in memory as an object that persists across jobs and can be shared between them. Sharing data in memory is 10 to 100 times faster than sharing it over the network or through disk.

  • Resilient Distributed Datasets (RDDs) are the main abstraction in Spark.

  • An RDD is a partitioned collection of objects spread across a cluster, and it can be persisted in memory or on disk.

  • Once created, RDDs are immutable.


Features of RDDs

  • Resilient – fault-tolerant thanks to the RDD lineage graph, so partitions that are damaged or lost through node failures can be recomputed.

  • Distributed – the data resides on multiple nodes in a cluster.

  • Dataset – a set of partitioned data holding primitive values or compound values, for example records or tuples.
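
The lineage idea behind these features can be illustrated with a toy model in plain Python. This is not Spark's API: `ToyRDD`, `from_partitions`, `compute_partition`, and `lose_partition` are hypothetical names invented for this sketch, and the cluster is replaced by lists inside one process. The sketch shows a dataset split into partitions, a derived dataset that remembers its parent and the function applied to it, and a lost cached partition being rebuilt from that record.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, num_partitions, parent=None, func=None, source=None):
        self.num_partitions = num_partitions
        self.parent = parent    # lineage: the dataset this one was derived from
        self.func = func        # lineage: function applied to each parent element
        self.source = source    # raw partitions, only set on the root dataset
        self.cache = {}         # partition index -> computed list

    @classmethod
    def from_partitions(cls, partitions):
        return cls(len(partitions), source=[list(p) for p in partitions])

    def map(self, func):
        # The existing dataset is left unchanged; a new one records
        # how it is derived from its parent.
        return ToyRDD(self.num_partitions, parent=self, func=func)

    def compute_partition(self, index):
        if index in self.cache:
            return self.cache[index]
        if self.parent is None:
            data = list(self.source[index])
        else:
            data = [self.func(x) for x in self.parent.compute_partition(index)]
        self.cache[index] = data
        return data

    def lose_partition(self, index):
        # Simulate a node failure that drops one cached partition.
        self.cache.pop(index, None)

    def collect(self):
        result = []
        for i in range(self.num_partitions):
            result.extend(self.compute_partition(i))
        return result


numbers = ToyRDD.from_partitions([[1, 2], [3, 4], [5, 6]])
squares = numbers.map(lambda x: x * x)

first_run = squares.collect()             # computes and caches all partitions
squares.lose_partition(1)                 # partition 1 is "lost"
recovered = squares.compute_partition(1)  # rebuilt from the parent via lineage
```

Only the lost partition is recomputed here; the other two stay in the toy cache, which is the property that makes lineage-based recovery cheap.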

RDD Operations

  • flatMap, map, and reduceByKey are transformations: each one defines a new RDD from an existing one.

  • count, collect, reduce, take, and first are a few of the actions in Spark: each one returns a result to the driver program.

  • foreach(func) and saveAsTextFile(path) are also examples of actions.
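
To show what the operations listed above compute, here is a word count imitated in plain Python. It does not use Spark: `flat_map` and `reduce_by_key` are hypothetical helper functions that work on ordinary lists, and the sample lines are made up. Plain lists are evaluated eagerly, so this sketch only mirrors the results of each step, not how Spark schedules them.

```python
def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine the values that share a key using func."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return list(combined.items())


lines = ["spark is fast", "spark is distributed"]  # made-up sample data

# These three steps mirror the flatMap, map, and reduceByKey operations.
words = flat_map(lambda line: line.split(), lines)
pairs = [(word, 1) for word in words]
counts = reduce_by_key(lambda a, b: a + b, pairs)

# These two steps mirror the count and collect actions.
total_distinct = len(counts)
as_dict = dict(counts)
```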

What is Lazy Evaluation?

When we call a transformation on an RDD, the operation is not executed immediately. Instead, Spark internally records metadata noting that the operation has been requested, and the work is done only when an action is called. This is called lazy evaluation.
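
A minimal sketch of this behavior in plain Python, assuming a toy `LazyDataset` class invented for illustration (it is not part of Spark): transformations only append to a list of recorded steps, and the `collect` action is what finally runs them.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only on an action."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # recorded metadata: (kind, function) pairs

    def map(self, func):
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps actually executed.
        result = list(self._data)
        for kind, func in self._steps:
            if kind == "map":
                result = [func(x) for x in result]
            else:
                result = [x for x in result if func(x)]
        return result


calls = []  # records every time the mapping function really runs


def double(x):
    calls.append(x)
    return x * 2


dataset = LazyDataset([1, 2, 3, 4])
planned = dataset.map(double).filter(lambda x: x > 4)

calls_before_action = len(calls)  # nothing has executed yet
result = planned.collect()        # the action triggers the work
calls_after_action = len(calls)
```

Because `double` logs each call, comparing `calls_before_action` with `calls_after_action` shows that the mapping function runs only once `collect` is invoked.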

DataFrame in Spark

A DataFrame is a distributed collection of data organized into named columns. DataFrames can be created from a wide array of sources, such as existing RDDs, external databases, tables in Hive, or structured data files.


Thanks for Reading !!!



