The Day that Had the Largest Number of Installed Drives Question
Description
Answer these questions using Spark code. Submit your code (in a py file) and the answers to the questions (in a text file). The answers should use the full dataset, not the small dataset. Start with the code shown below. (Hint: for any tasks that say max/largest, don’t use sortByKey, because that’s much slower than a better option.)
- Which day had the largest number of installed drives, and what was this number?
- How many distinct drives (by model+serial) are installed (i.e., that exist in the data) in each year?
- What’s the max drive capacity per year?
Full dataset: change the file path to: file:///ssd/data/backblaze.csv (146 million rows) – my solution took 17min
Run spark like this: spark-submit backblaze-spark.py –master=local[5]
Or to hide log messages: spark-submit backblaze-spark.py –master=local[5] 2> /dev/null
Starting code with some examples that you can remove:
from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Backblaze").getOrCreate() schema = "day DATE, serial STRING, model STRING, capacity LONG, failure INTEGER" d = spark.read.schema(schema).load("file:///home/jeckroth/cinf201/spark/assignment/small-backblaze.csv", format="csv", sep=",", header="true") d = d.rdd # print first 10 rows print(d.take(10)) ## How many failures occurred each year? # make key (year) & value (failure 0/1) d2 = d.map(lambda row: (row.day.year, row.failure)) # add up failures per year failureCounts = d2.reduceByKey(lambda cnt, rowcnt: cnt + rowcnt) print(failureCounts.collect()) ## Which model (not serial number) has the most failures overall? # grab model & failure from data, model is the key d3 = d.map(lambda row: (row.model, row.failure)) # count failures for that model; result so far: [(modelX, 55), (modelY, 2100)] d3 = d3.reduceByKey(lambda cnt, rowcnt: cnt + rowcnt) # flip keys and values; result so far: [(55, modelX), (2100, modelY)] d3 = d3.map(lambda pair: (pair[1], pair[0])) # sort by value (second in the pair) d3 = d3.sortByKey(ascending=False) ### NOT EFFICIENT TECHNIQUE print(d3.collect())
Have a similar assignment? "Place an order for your assignment and have exceptional work written by our team of experts, guaranteeing you A results."