Assignment Task
Project – Analyzing New York City 311 Calls using Hive
1 Introduction
In this assignment, you will download a dataset of New York City 311 calls and analyze it using Hive. The dataset contains information about the types of complaints and service requests received by the city’s 311 call center. You will use Hive to load the data into a table and perform various analyses to gain insights into the data.
2 Downloading the Dataset
The dataset can be downloaded from the following link: NYC 311 Calls. Alternatively, you can download it directly with the wget command posted on Blackboard. Note that this is not the latest data, but it is fine for this project.
3 Loading the Data into HDFS
To load the data into HDFS, follow these steps:
1. Open a terminal and navigate to the Hadoop installation directory.
2. Use the following command to create a directory in HDFS to store the dataset:
hadoop fs -mkdir /user//311_calls
3. Use the following command to copy the downloaded CSV file into the HDFS directory:
hadoop fs -copyFromLocal /user//311_calls/
4 Creating a Hive Table
To create a Hive table for the 311 calls dataset, follow these steps:
1. Open a terminal and launch the Hive shell using the following command:
hive
2. Use the following command to create a new database:
CREATE DATABASE 311_calls;
3. Use the following command to create a new table in the database:
USE 311_calls;
CREATE TABLE calls_YourName (
  unique_key STRING,
  created_date TIMESTAMP,
  closed_date TIMESTAMP,
  agency STRING,
  agency_name STRING,
  complaint_type STRING,
  descriptor STRING,
  location_type STRING,
  incident_zip INT,
  incident_address STRING,
  street_name STRING,
  address_type STRING,
  city STRING,
  borough STRING,
  latitude FLOAT,
  longitude FLOAT,
  location STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
4. Use the following command to load the data from the CSV file into the table:
LOAD DATA INPATH '/user//311_calls/'
INTO TABLE calls_YourName;
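Before moving on to the analysis, it may be worth checking that the load worked as expected. The queries below are a minimal sketch of such a check, assuming the database and table names used above. If the CSV file contains a header row, that row will appear as a data row. If created_date comes back NULL for every row, the date strings in the file may not match the yyyy-MM-dd HH:mm:ss format that Hive expects for TIMESTAMP columns in text files.

```sql
USE 311_calls;

-- Confirm that the column names and types match the CREATE TABLE statement.
DESCRIBE calls_YourName;

-- Count the rows that were loaded from the CSV file.
SELECT COUNT(*) AS total_rows
FROM calls_YourName;

-- Inspect a few rows; unexpected NULLs usually point to a parsing problem.
SELECT unique_key, created_date, complaint_type, borough, incident_zip
FROM calls_YourName
LIMIT 5;
```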
5 Analyzing the Data using Hive
Now that the data is loaded into a Hive table, you can begin to analyze it using various Hive queries. Here are some potential insights that could be gained from the dataset. You are welcome to come up with any other insights you think might be useful.
Top complaints: The dataset can provide information on the most common complaints and service requests made to the 311 call center. This information can be used to prioritize resources and improve city services.
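As one possible starting point, the query below counts calls per complaint type and lists the ten most frequent. It is a sketch that assumes the table created above; the limit of 10 is an arbitrary choice.

```sql
-- Ten most common complaint types, most frequent first.
SELECT complaint_type, COUNT(*) AS num_calls
FROM calls_YourName
GROUP BY complaint_type
ORDER BY num_calls DESC
LIMIT 10;
```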
Geographic distribution: The dataset can be used to analyze the geographic distribution of complaints and requests, allowing the city to identify areas with higher levels of need.
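A simple way to look at geographic distribution is to group by borough or by ZIP code. The queries below are a sketch under the assumption that the borough and incident_zip columns are populated for most rows; rows with missing values are excluded.

```sql
-- Number of calls per borough.
SELECT borough, COUNT(*) AS num_calls
FROM calls_YourName
WHERE borough IS NOT NULL
GROUP BY borough
ORDER BY num_calls DESC;

-- Twenty ZIP codes with the most calls.
SELECT incident_zip, COUNT(*) AS num_calls
FROM calls_YourName
WHERE incident_zip IS NOT NULL
GROUP BY incident_zip
ORDER BY num_calls DESC
LIMIT 20;
```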
Time of day and day of the week: The dataset can also be used to analyze the frequency of complaints and requests based on the time of day and day of the week. This can help the city allocate resources and staff more effectively.
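The following sketch groups calls by hour of day and by day of week using Hive's built-in date functions. It assumes that created_date was parsed into valid timestamps and that your Hive version provides date_format (available from Hive 1.2).

```sql
-- Calls per hour of the day (0-23).
SELECT hour(created_date) AS hour_of_day, COUNT(*) AS num_calls
FROM calls_YourName
WHERE created_date IS NOT NULL
GROUP BY hour(created_date)
ORDER BY hour_of_day;

-- Calls per day of the week; 'EEEE' formats the date as the full day name.
SELECT date_format(created_date, 'EEEE') AS day_of_week, COUNT(*) AS num_calls
FROM calls_YourName
WHERE created_date IS NOT NULL
GROUP BY date_format(created_date, 'EEEE')
ORDER BY num_calls DESC;
```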
Response times: The dataset records when each request was created and when it was closed (created_date and closed_date), so the time taken to close different types of complaints and requests can be computed. This can help the city identify areas where response times need to be improved and make changes to better serve its residents.
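One way to approximate response time is the difference between closed_date and created_date. The query below is a sketch that treats time-to-close as a proxy for response time, which is an assumption rather than something the dataset states. It skips requests that are still open or whose closed_date precedes created_date.

```sql
-- Average hours from creation to closure, per complaint type.
SELECT complaint_type,
       COUNT(*) AS closed_calls,
       ROUND(AVG((unix_timestamp(closed_date) - unix_timestamp(created_date)) / 3600.0), 1)
         AS avg_hours_to_close
FROM calls_YourName
WHERE created_date IS NOT NULL
  AND closed_date IS NOT NULL
  AND closed_date >= created_date
GROUP BY complaint_type
ORDER BY avg_hours_to_close DESC
LIMIT 10;
```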
Trends over time: By analyzing the dataset over time, the city can identify trends in the types of complaints and requests being made, which can be used to inform policy decisions and resource allocation.
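A monthly call count is a simple way to start looking at trends. The sketch below assumes created_date was parsed into valid timestamps; adding complaint_type to the SELECT and GROUP BY clauses would break the trend down by complaint type.

```sql
-- Number of calls per calendar month, in chronological order.
SELECT year(created_date) AS call_year,
       month(created_date) AS call_month,
       COUNT(*) AS num_calls
FROM calls_YourName
WHERE created_date IS NOT NULL
GROUP BY year(created_date), month(created_date)
ORDER BY call_year, call_month;
```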
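Demographic analysis: The table defined above has no demographic columns such as age, gender, or race, so this analysis would require combining the dataset with external demographic data, for example by ZIP code or borough. Doing so could help the city identify patterns in service requests that may be associated with disparities in access to services.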