# Spark 2 Workbook Answers



```python
# 4️⃣ Action – trigger the computation and collect the count
unique_word_count = distinct_words.count()
```

```scala
val result = df
  .groupBy($"department")
  .agg(count("*").as("emp_cnt"), avg($"salary").as("avg_salary"))
  .filter($"emp_cnt" > 5)
```
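To sanity-check what that aggregation produces, here is the same count/average/filter logic in plain Python over a small hypothetical employee list (the `rows` data, department names, and salaries are all invented for illustration, not from the workbook):

```python
from collections import defaultdict

# hypothetical rows standing in for df: "eng" has 6 people, "hr" only 2
rows = [{"department": "eng", "salary": s} for s in (90, 95, 100, 105, 110, 100)]
rows += [{"department": "hr", "salary": s} for s in (80, 120)]

stats = defaultdict(lambda: [0, 0.0])  # department -> [emp_cnt, salary_sum]
for r in rows:
    s = stats[r["department"]]
    s[0] += 1
    s[1] += r["salary"]

# keep departments with more than 5 employees, like .filter($"emp_cnt" > 5)
result = {d: (cnt, total / cnt) for d, (cnt, total) in stats.items() if cnt > 5}
# result == {"eng": (6, 100.0)} – "hr" is filtered out
```

The shape of `result` mirrors the Spark output: one row per surviving department with its `emp_cnt` and `avg_salary`.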

**Bulk HTTP calls** – wrap the requests in a generator and run it with `mapPartitions`, so each partition reuses a single HTTP session:

1. **Ingestion** – `spark.read.json` or `textFile`.
2. **Parsing** – `withColumn` + `from_unixtime`, `regexp_extract`.
3. **Cleaning** – filter out malformed rows, `na.drop`.
4. **Enrichment** – join with a static lookup table (broadcast).
5. **Aggregation** – `groupBy(date, status).agg(count("*").as("cnt"))`.
6. **Output** – write to Parquet partitioned by `date` **or** stream to console for debugging.
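The six steps can be rehearsed end-to-end in plain Python on a few invented log records, with a dict lookup playing the role of the broadcast join and a `Counter` playing the role of `groupBy(...).agg(count(...))` (every record value below is made up for illustration):

```python
from collections import Counter
from datetime import datetime, timezone

# 1. Ingestion – hypothetical raw rows: (unix_ts, status, region_id)
raw = [(1700000000, 200, 1), (1700000050, 404, 2),
       (1700086400, 200, 1), (1700000100, 200, 99)]

# 2. Parsing – unix timestamp -> date string (what from_unixtime would do)
parsed = [(datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat(), st, rid)
          for ts, st, rid in raw]

# the static lookup that a broadcast join would ship to every executor
regions = {1: "us", 2: "eu"}

# 3. Cleaning – drop rows whose region_id is unknown (the malformed 99)
cleaned = [row for row in parsed if row[2] in regions]

# 4. Enrichment – attach the region name from the lookup
enriched = [(d, st, regions[rid]) for d, st, rid in cleaned]

# 5. Aggregation – count per (date, status)
counts = Counter((d, st) for d, st, _ in enriched)
```

Step 6 (output) is omitted here; in Spark it would be a `write.partitionBy("date").parquet(...)` on the aggregated frame.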

```python
words = lines.flatMap(lambda line: line.split())
# optional cleaning
cleaned = words.map(lambda w: w.lower().strip('.,!?"\''))
distinct_words = cleaned.distinct()
count = distinct_words.count()
```
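Because these transformations are ordinary Python functions, the cleaning chain can be checked locally on a plain list before running it on an RDD (the sample sentences are invented):

```python
# invented sample input standing in for the RDD of lines
sample = ["The quick fox.", "the QUICK fox!"]

words = [w for line in sample for w in line.split()]   # flatMap
cleaned = [w.lower().strip('.,!?"\'') for w in words]  # map
distinct_words = set(cleaned)                          # distinct
# distinct_words == {"the", "quick", "fox"}; the count is 3
```

Lower-casing and stripping punctuation before `distinct` is what collapses `The`, `the`, and `the.` into one word.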

---

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
lines = sc.textFile("hdfs:///data/myfile.txt")
```

```python
sc = SparkContext(appName="DistinctWordCount")
```

```python
import requests

def fetch_batch(it):
    # one Session per partition, reused for every URL in the iterator
    session = requests.Session()
    for url in it:
        yield session.get(url).text
    session.close()
```
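The point of pairing this with `mapPartitions` (rather than `map`) is that the function runs once per partition, so the expensive `Session` setup happens once per partition instead of once per URL. That behaviour can be simulated without a cluster or network access (the partition data and the `setup_calls` counter below are purely illustrative):

```python
setup_calls = 0

def process_partition(records):
    """Same shape as fetch_batch: one-time setup, then per-record work."""
    global setup_calls
    setup_calls += 1              # stands in for requests.Session()
    for r in records:
        yield r.upper()           # stands in for session.get(url).text

# simulate an RDD with 3 partitions of 2 records each
partitions = [["a", "b"], ["c", "d"], ["e", "f"]]
results = [x for part in partitions for x in process_partition(part)]
# setup ran 3 times (once per partition), yet all 6 records were processed
```

On a real RDD of URLs the call would be `urls_rdd.mapPartitions(fetch_batch)`.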
