PySpark Milestone Black Friday Sales Data | Fresco Play Hackerrank

Author: neptune | 05th-Nov-2023
#Data Science #Hackerrank

Welcome to the Spark Challenge. You are provided with the Black Friday sales data, and as a big data developer you need to analyse it and fetch the required data.


• We have provided the template in challenge.py. In the challenge file, the input path and output path are already set for each function. Output path : ~project/challenge/output/.

• Output column names should be the same as given in the sample output.

• Do not modify the output files manually; a modified output file won't be taken for validation.

• Your output file should be stored as a CSV file.


Spark Shell : This part is only for those who choose the Spark shell instead of the given template. Ignore this part if you are not using the Spark shell to solve this challenge.

• You can also solve this challenge with a Spark shell, but we recommend using the template given in challenge.py.

For pyspark users : Open a terminal (Right click -> New Terminal), type pyspark in the terminal and press Enter. After the pyspark shell opens, you can import the functions and perform the operations using pyspark.


Note:

1. In the template, we have defined functions and paths for the input and output files using parameters. You can leave the function signatures and parameters as they are and focus on the given operations.

Output directories for the output files :

    1. result_1 : project/challenge/output/result_1

    2. result_2 : project/challenge/output/result_2

    3. result_3 : project/challenge/output/result_3


Input File: -

• The input file contains the Black Friday sales details.

• You are given Black_Friday_Sales.csv in project/challenge/inputfile/.

****  read_data function :  ****

Complete the following operations inside the read_data function :

Note : Output files should be a single partition CSV file with header.


Task 0:-

# The read_data function is required by all of the tasks below.


def read_data(spark, input_file):

    '''
    spark_session : spark
    input_file    : path to the input CSV file
    '''

    # Read the CSV with a header row and let Spark infer the column types
    df = spark.read.csv(input_file, header=True, inferSchema=True)

    return df
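For intuition, `header=True` treats the first row as column names and `inferSchema=True` guesses column types from the data (for example, turning numeric strings into integers). Here is a rough stdlib-Python sketch of that parsing step; the column names match the real file, but the two sample rows are invented:

```python
import csv
import io

# Invented two-row sample in the shape of Black_Friday_Sales.csv
sample = "User_ID,Age,Purchase\n1000001,0-17,8370\n1000002,55+,15200\n"

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # header=True: the first row becomes the column names

def infer(value):
    # inferSchema=True (roughly): cast to int where the text is numeric
    return int(value) if value.isdigit() else value

rows = [dict(zip(header, map(infer, row))) for row in reader]
print(rows[0])  # {'User_ID': 1000001, 'Age': '0-17', 'Purchase': 8370}
```

Spark does this at scale and also infers richer types (doubles, dates), but the idea is the same.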



PROBLEM STATEMENT 

Task 1 : -

# Complete the following operations inside the result_1 function:


****  result_1 function :  ****

#The following are the parameters : -

#      ◦ input_file : input_df

1. Use the input_df to complete the task.

2. Find the highest purchase amount for each of the age groups given.

3. The name of the highest-purchase column should be "Maximum_Purchase".

4. Columns to be fetched : Age, Maximum_Purchase

5. Return the final dataframe.


def result_1(input_df):

    '''

    for input file: input_df

    '''

    print("-------------------")

    print("Starting result_1")

    print("-------------------")




    # Import the aggregate functions; F.max avoids shadowing Python's built-in max
    from pyspark.sql import functions as F

    df = input_df.groupBy("Age").agg(F.max("Purchase").alias("Maximum_Purchase"))

    return df   # return the final dataframe
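The per-group maximum that `groupBy("Age").agg(max(...))` computes can be sketched in plain Python for intuition. The `(Age, Purchase)` rows below are invented stand-ins for the real dataset:

```python
# Invented (Age, Purchase) rows standing in for the real data
rows = [("0-17", 8370), ("0-17", 15200), ("26-35", 19215), ("26-35", 1422)]

maximum_purchase = {}
for age, purchase in rows:
    # Keep the largest Purchase seen so far for each Age group
    if age not in maximum_purchase or purchase > maximum_purchase[age]:
        maximum_purchase[age] = purchase

print(maximum_purchase)  # {'0-17': 15200, '26-35': 19215}
```

Spark distributes this same reduction across partitions and then merges the partial maxima.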



****  result_2 function :  ****

# The following are the parameters : -

# ◦ input_file : input_df

1. Use input_df to complete the task

2. Find the total sum of the purchase made by each City Category

3. The column name for the sum of the purchase amount should be "Total_sum".

4. Columns to be fetched : City_Category, Total_sum

5. Return the final dataframe.

def result_2(input_df):


    '''

    for input file: input_df

    '''

    print("-------------------------")

    print("Starting result_2")

    print("-------------------------")




    # Import the aggregate functions; F.sum avoids shadowing Python's built-in sum
    from pyspark.sql import functions as F

    df = input_df.groupBy("City_Category").agg(F.sum("Purchase").alias("Total_sum"))

    return df     # return the final dataframe
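The per-group sum works just like the per-group maximum, except the values accumulate instead of being compared. A plain-Python sketch with invented `(City_Category, Purchase)` rows:

```python
# Invented (City_Category, Purchase) rows standing in for the real data
rows = [("A", 8370), ("B", 15200), ("A", 1422), ("C", 19215)]

total_sum = {}
for city, purchase in rows:
    # Accumulate Purchase per City_Category, as sum("Purchase") does
    total_sum[city] = total_sum.get(city, 0) + purchase

print(total_sum)  # {'A': 9792, 'B': 15200, 'C': 19215}
```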


**** result_3 function :  ****

Complete the following operations inside the result_3 function :

# The following are the parameters : -

#◦ input_file : input_df   

1. Use the input_df to complete the task.

2. Fetch the records where Purchase is greater than 1000 and Marital_Status is "Single".

3. Columns to be fetched : User_ID, Product_ID, Marital_Status, Purchase.

4. Return the final dataframe.


def result_3(input_df):


    '''

    for input file: input_df

    '''

    print("-------------------------")

    print("Starting result_3")

    print("-------------------------")


     

    # col comes from pyspark.sql.functions
    from pyspark.sql.functions import col

    # Filter records where Purchase > 1000 and Marital_Status is "Single"
    filtered_df = input_df.filter((col("Purchase") > 1000) & (col("Marital_Status") == "Single"))

    # Select the specified columns
    selected_columns_df = filtered_df.select("User_ID", "Product_ID", "Marital_Status", "Purchase")

    # Remove duplicate records
    final_df = selected_columns_df.dropDuplicates()

    return final_df   # return the final dataframe
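The filter-and-select pipeline above boils down to a row predicate plus a column projection. A plain-Python sketch with invented rows (the User_ID and Product_ID values are made up for illustration):

```python
# Invented rows: (User_ID, Product_ID, Marital_Status, Purchase)
rows = [
    (1000001, "P00069042", "Single", 8370),
    (1000002, "P00248942", "Married", 15200),
    (1000003, "P00087842", "Single", 950),
]

# Purchase > 1000 AND Marital_Status == "Single", as in the filter() call
result = [r for r in rows if r[3] > 1000 and r[2] == "Single"]
print(result)  # [(1000001, 'P00069042', 'Single', 8370)]
```

Only the first row survives: the second fails the Marital_Status check and the third fails the Purchase threshold.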



****  load_data function :  ****

Write code to store the outputs in the respective locations.


def load_data(data, outputpath):

#-------------------------------------------------------------------------
# 1. Write code to store the outputs in the respective locations.
#
# Note :
# • Output files should be a single-partition CSV file with header.
# • The load_data function is important for all the tasks.
#-------------------------------------------------------------------------

    if data.count() != 0:

        print("Loading the data", outputpath)

        # Save the DataFrame as a single-partition CSV file with a header row
        data.coalesce(1).write.csv(outputpath, header=True)

    else:

        print("Empty dataframe, hence cannot save the data", outputpath)



NOTE : 


Output Path : ~project/challenge/output/

Input Path : ~project/challenge/inputfile/

How to run a sample test case :

• Open a terminal and navigate to ~/project/challenge, then run the command : spark-submit sampletest.py

Note :

• The sample test case does not represent the main test. The actual test case will run only after clicking on the SUBMIT button.

• Click on SUBMIT to validate your solution.




