Big Data Computing in the Cloud

Question 1

Question 1a

Explain in details the significance of Apache Spark framework. Next, discuss in
details FIVE (5) limitations of using Apache Spark framework in big data processing.

Question 1b

Explain in details the Spark job execution process and how the Directed Acyclic Graph (DAG) works in Spark.

Question 2

Question 2a

Explain the key logic for both PySpark built-in programs in Figures Q2(a)(1) and Figure Q2(a)(2).

Question 2b

Appraise in details the concept of PySpark Resilient Distributed Datasets (RDD) and PySpark DataFrames. Use a table to highlight the main differences between RDD and DataFrames.

Question 3

In your local machine’s Spark setup, design and develop a PySpark program using PySpark RDD APIs to perform the following tasks. Show your full PySpark program and provide screenshots and results for all key steps where applicable.

Data sources used in this question are: (i) Alice’sAdventuresInWonderland.txt , and (ii) TheAdventuresOfSherlockHolmes.txt. Note that these data files can be downloaded from ICT337 Canvas webpage.

Question 3a
Read both text files and store their content using Spark RDDs. Show the total number of records

Question 3b

Perform the following tasks and show the results in each step:

Write a function to remove all characters that are not alpha-numeric and spaces. Change all characters to lower letter and remove any leading or trailing spaces.

Extract individual word token from the sentences.

Remove white spaces between sentences and find the word occurrence counts in terms of (word, count).

Question 3c

Setup Python Natural Language Toolkit ( https://www.nltk.org/ ) library in your
program and download Stop Words. Show the list of English Stop Words and its total count. Based on the results in Q3(b), remove all Stop Words from the resultant RDDs and then randomly sample 0.5 percent of the RDD words. Show the results and their total word count. Finally, find the TOP TEN (10) most frequent words.

Question 3d

Compute an average word occurrence frequency.

Question 3e

Find the common words between the two text files. Show the TOP THIRTY (30) most common words.

Question 4

In your local machine’s Spark setup, develop a PySpark program using PySpark RDD APIs to perform the following tasks. Show your full PySpark program and provide screenshots and results for all key steps where applicable.

Data sources used in this question are: (i) mov_rating.dat, (ii) mov_item.dat, (iii)
mov_genre.dat, (iv) mov_user.dat, and (v) mov_occupation.dat. Note that these data files can be downloaded from ICT337 Canvas webpage.

Question 4a

Construct program and perform the following tasks and show the results in each step:

Read the “mov_rating.dat” and “mov_item.dat” files and store the content using Spark RDDs.
Find the Top FIVE (5) users that review the most movies. Show the user ID
and total number of occurrences.
Find the Top TEN (10) most reviewed movies. Show the movie ID, movie
name, and total number of occurrences. Note that you may use Spark Broadcast variable to store the mapping of movie ID to movie name. 

Question 4b

Based on the RDDs of “mov_rating.dat” and “mov_item.dat” in Question 4 (a), create new RDDs of (movie ID, ((user ID, rating), genre_list)) using Spark join operation.

There are Nineteen different genre categories. For each category, find the TOP
THREE (3) most reviewed movies, sorted by average review ratings (i.e., highest to lowest ratings). Show the genre name, movie ID, movie name, and average rating.

Save the TOP THREE (3) movie results using RDD file saving mechanism and show the content. Note that the genre name can be referenced from “mov_genre.dat” and the mapping can be stored using Spark Broadcast variable.

Question 4c

Read the “mov_user.dat” to obtain user’s occupation and create new RDDs of (movie ID, ((user ID, rating), occupation)).

There are TWENTY-ONE categories of occupation. For each category, find the TOP THIRTY (30) most reviewed movies, sorted by average review ratings (i.e., highest to lowest ratings). Show the total movie counts for the occupation category, as well as occupation type, movie ID, movie name, and average rating.

Save the top thirty movie results using RDD file saving mechanism and show the content. Note that the occupation name can be referenced from “mov_occupation.dat”

Question 4d

We would like to build a simple movie recommendation engine with the available movie data.

To accomplish this, perform the following tasks and show the results in each step (i.e., sample RDD content and its total count):

Create movie rating RDD with key-value pairs of: (user ID, (movie ID, rating))

Perform an RDD self-join operation so as to find all combination of movie pairs
rated by a given user. The resultant RDD should have the structure of: (user ID,
((movie #1, rating #1), (movie #2, rating #2))).

Filter out movie pair duplication using condition of movie #1 < movie #2. This
should greatly reduce the RDD size.

Question 4e

Based on the RDD results from Q4(d), perform the following tasks and show the results in each step (i.e., sample RDD content and its total count):

Organize the RDD into key-value pairs of: ((movie #1, movie #2), (rating #1,
rating #2)). Then, collect all movie ratings for each movie pair, whereby the
resultant RDD structure should be in terms of ((movie #1, movie #2), ((rating
1, rating 2), (rating 3, rating 4), …)).

For each movie pairs, compute the Cosine Similarity

(https://en.wikipedia.org/wiki/Cosine_similarity) for the collection of movie
rating pairs. This is the key algorithm to measure the degree of similarity
between two movies based on ratings.

The cosine similarity value/score should be ranging from -1 to +1, where -1 refers to movies that are opposite in nature and +1 refers to movies that are highly similar. Figure 3 shows the definition of Cosine Similarity

On top of the cosine similarity score, your function should also count the total number of rating pairs (i.e., your function should have output information of: ((movie #1, movie #2), (score, numberOfRatingPairs))).

Note that you may use the RDD cache operation to store the final results.

Question 4f

Find the TOP TEN (10) movies that are similar to movie ID = 50 (i.e., Star Wars).
Constraint your movie similarity search using a threshold of Cosine Similarity score set to 0.97 and a threshold of the number of movie rating pairs (i.e., numberOfRatingPairs) set to 50. This allows us to only consider “good” quality search results.

GRAB 30% OFF ON YOUR ASSIGNMENTS NOW