Starting with Spark

I was checking my personal email when a question popped up: “What will you learn this summer?” Well, the answer was a few lines down. Scalable Machine Learning is a course offered by UC Berkeley and sponsored by Databricks on the edX platform since June 29.

Spark was stalking me: on Twitter, in the news, on blogs, everywhere! That drove me to take this course and explore what Spark is and what I can get from it.

In the first lessons I already learned the difference between Spark and one of its ancestors, Hadoop. For those who don’t know Hadoop, it is a framework for MapReduce (read What Is Apache Hadoop?). Spark moved the frontier of such frameworks with its Resilient Distributed Datasets (RDDs).

RDDs allow programmers to use the memory of large clusters in a fault-tolerant way. Thanks to this, iterative algorithms and interactive data mining tools improve their execution time by tens to hundreds of times while producing the same result.
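To make that concrete, here is a minimal PySpark sketch (not taken from the course) of the idea: an RDD is cached in memory once and then reused across the iterations of a loop. The file name "numbers.txt" and the toy update rule are just illustrative assumptions; it only assumes a local Spark installation.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cache-sketch")

# Build an RDD of numbers and ask Spark to keep it in memory
# after the first time it is computed.
numbers = sc.textFile("numbers.txt").map(float).cache()

# A toy iterative loop: every pass reuses the cached RDD instead of
# re-reading and re-parsing the file from disk.
estimate = 0.0
for _ in range(10):
    mean_residual = numbers.map(lambda x: x - estimate).mean()
    estimate += 0.5 * mean_residual

print(estimate)
sc.stop()
```

Without the `.cache()` call, each pass of the loop would go back to disk, which is exactly the overhead the next image illustrates.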

In the next image you can see how Spark outperforms a framework like Hadoop in execution time. As I already said, this is due to RDDs, which allow intermediate results to be kept in memory and cut the overhead produced by data replication, disk I/O, and serialization.

Comparison between Spark and Hadoop
Comparison between Hadoop MR and Spark in iterative ML algorithms. Taken from CS190.1x Week 2a slides.

Spark can do much more than improve computation time for iterative ML algorithms. This course is in its final weeks, but I hope you can enjoy it as much as I am doing now. You can click on the course logo to go to the class website at edX and learn the basics to become a future Spark developer.

Scalable Machine Learning Course Logo
Scalable Machine Learning