DAT 560M: Big Data and Cloud Computing

Fall 2023, Mini B

Homework #4

INSTRUCTIONS

This is an individual assignment. You may not discuss your approach to solving these questions with anyone other than the instructor or TA.

Please include only your Student ID on the submission.

The only allowed material is:

a. Class notes

b. Content posted on Canvas

c. Use ONLY the code we practiced in class. Anything beyond that will not receive any points!

You are not permitted to use other online resources.

The physical submission is due by the next lab.

There will be TA office hours. See the schedule on Canvas.

ASSIGNMENT

In this assignment, we are going to practice Spark on a file named loans.csv, which is located in our database. In case you don’t have the file, you can get it from the dataset folder on the server:

http://server-ip/dataset/loans.csv

This dataset has information about loans distributed to poor and financially excluded people around the world by a company called Kiva. The dataset has a small number of columns, and we would like to analyze it with PySpark. Please answer each question and provide a screenshot.

Part 1- Initialize Spark (5 pts)

1- Start the PySpark engine and load the file. This homework is a little complex, so it is better to assign more resources. When creating your engine, you can assign all available CPU cores on your machine to Spark so that it runs faster. To do that, simply use local[*] instead of local (see the following screenshot). If it crashes or doesn’t work properly, you are welcome to go back to the normal initialization process. (2 pts)

2- Get to know the dataset and do a preliminary examination (for example, column types, summary, …). (2 pts)

3- The dataset has two identifiers for the loan recipient’s country, country and country_code, and one is enough. Please drop country_code. (1 pt)

Part 2- Data analysis (50 pts)

4- Find the three sectors that were awarded the most loans, considering only loans with an amount larger than 1000. (5 pts)

5- For the top sector you found in Q4, list the 6 most used activities. (5 pts)

6- Find the number of loans given per year. For that, use the year from posted_time. You may add a new column called “year”. (5 pts)

7- Using SQL syntax, list the number of loans per sector in decreasing order, restricted to the 3 top countries in terms of the number of loans received. (10 pts)

8- Find the top 20 countries in terms of the total loan amount they have received where the use of the loan includes the word “stock”. You may use SQL. (5 pts)

9- Do a wordcount on the “use” column. For that, convert all text to lower case. If you can, it’s great to remove stopwords first and then do the wordcount. It’s OK if you don’t know how to do so. (10 pts)

10- Group the loans into 5 categories. If the loan amount is greater than or equal to 50000, call it “super large”. If it is less than that but greater than or equal to 10000, call it “large”. If it is less than that but greater than or equal to 5000, call it “medium”. If it is less than that but greater than or equal to 1000, call it “small”. Otherwise, call it “tiny”. Then, find the number of loans given in each category per gender. For gender, only consider “male” or “female”. (10 pts)
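
The thresholds in Q10 can be read as a chain of checks from the largest bound down. The following is only a minimal sketch of that rule in plain Python, not the PySpark code practiced in class. The function names and the sample rows are hypothetical, and it assumes that "only consider male or female" means keeping rows whose gender value is exactly one of those two strings.

```python
from collections import Counter


def loan_category(loan_amount):
    """Map a loan amount to one of the five categories described in Q10."""
    if loan_amount >= 50000:
        return "super large"
    if loan_amount >= 10000:
        return "large"
    if loan_amount >= 5000:
        return "medium"
    if loan_amount >= 1000:
        return "small"
    return "tiny"


def count_by_category_and_gender(rows):
    """Count loans per (category, gender) pair.

    rows is an iterable of (loan_amount, gender) tuples. Rows whose gender
    is not exactly "male" or "female" are skipped (an assumption about how
    the filter in Q10 should be read).
    """
    counts = Counter()
    for loan_amount, gender in rows:
        if gender in ("male", "female"):
            counts[(loan_category(loan_amount), gender)] += 1
    return counts


# Hypothetical sample rows, made up for illustration only.
sample_rows = [
    (300, "female"),
    (1000, "male"),
    (7500, "female"),
    (60000, "male"),
    (450, "female, female"),  # skipped: more than one gender value
]
result = count_by_category_and_gender(sample_rows)
```

Checking from the largest threshold down means each branch only needs a lower bound, which mirrors the "if less but larger or equal to" wording of the question.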

Part 3- Feature engineering (10 pts)

11- Let’s find how many people are involved in each loan application. To find out, look at the gender column. You can see that sometimes it holds one value, and sometimes more than one. Count the values for each loan and add the count to the dataframe as a new column. (10 pts)
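
One way to think about the count in Q11 is as the number of entries in the gender field. The sketch below shows that idea in plain Python rather than in PySpark. It assumes the multiple values are separated by commas, which the assignment text does not state, and the function name is hypothetical.

```python
def borrower_count(gender_field, separator=","):
    """Return the number of people listed in a gender field.

    Assumes multiple borrowers are written as a separator-delimited string
    such as "female, male" (the comma separator is an assumption). Missing
    or empty fields count as 0.
    """
    if gender_field is None:
        return 0
    # Split on the separator and ignore empty pieces left by stray commas.
    parts = [part.strip() for part in gender_field.split(separator)]
    return sum(1 for part in parts if part)


# Hypothetical values, made up for illustration only.
single = borrower_count("female")
group = borrower_count("female, female, male")
missing = borrower_count(None)
```

If the file uses a different separator, only the `separator` argument needs to change.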

Part 4- Machine learning (35 pts)

12- Now let’s focus only on the Retail, Agriculture, and Food sectors and remove the rest of the rows. (10 pts)

13- We would like to predict loan_amount based on sector, country, term_in_months, year, and the new attribute you added in Q11. Keep these columns and drop the rest. (5 pts)

14- Prepare your data to do a prediction task. We are interested in predicting the loan
