* Apache Spark workshop


A workshop on Apache Spark covering the following topics:
 - Overview of Apache Spark
 - Spark execution model
 - Programming in Spark
 - Spark SQL
 - Spark Streaming

Level: Intermediate

Register now: entry is free for a limited period.

Apache Spark started as a project at the AMPLab at UC Berkeley. Today it is one of the most active open source projects. In this workshop, we will explore the key components of Apache Spark.

Apache Spark builds a Directed Acyclic Graph (DAG) of operations and splits it into stages of tasks, parallelizing the work within a stage as much as possible before data has to be shuffled. We will discuss Spark's execution model, along with the in-memory caching and persistence of data that give it its performance advantage.
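As a rough illustration of how a chain of operations is cut into stages at shuffle boundaries, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the helper `split_into_stages` and the `WIDE_OPS` set are hypothetical names introduced for this example, and the operation names are used only as labels. It also simplifies by placing a shuffling operation entirely at the start of the next stage.

```python
# Toy model only: this is not Spark code and calls no Spark API.
# Operations that need a shuffle (records with the same key must be brought
# together) are treated as stage boundaries; everything else is pipelined.
WIDE_OPS = {"reduceByKey", "groupByKey", "join", "sortByKey"}


def split_into_stages(ops):
    """Group a linear chain of operation names into stages.

    A new stage starts at every operation that needs a shuffle.
    """
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


pipeline = ["textFile", "flatMap", "map", "reduceByKey", "filter", "sortByKey", "map"]
stages = split_into_stages(pipeline)
for i, stage in enumerate(stages):
    print(f"stage {i}: {' -> '.join(stage)}")
```

In this sketch the seven operations fall into three stages. Operations inside one stage need no data movement between them, so they can run back to back on each partition.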

We will discuss in detail how to program with Spark and use its API. Resilient Distributed Datasets (RDDs) are collections of data partitioned across the cluster that can be operated on in parallel. A large set of operators is available, each of which is either a transformation, which defines a new RDD, or an action, which computes and returns a result.
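The difference between transformations and actions can be sketched with a small stand-in written in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `ToyRDD` is a hypothetical class that only borrows the method names `map`, `filter`, `collect` and `count`, and it runs on local lists rather than a cluster.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions  # list of lists, one per "partition"
        self.steps = tuple(steps)     # recorded transformations, not yet applied

    # Transformations: return a new ToyRDD and do no work.
    def map(self, fn):
        return ToyRDD(self.partitions, self.steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.partitions, self.steps + (("filter", pred),))

    def _run_partition(self, part):
        # Apply every recorded step, in order, to a single partition.
        for kind, fn in self.steps:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    # Actions: trigger the computation, partition by partition.
    def collect(self):
        return [x for part in self.partitions for x in self._run_partition(part)]

    def count(self):
        return sum(len(self._run_partition(part)) for part in self.partitions)


calls = []


def square(x):
    calls.append(x)  # record every invocation to make laziness visible
    return x * x


rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens = rdd.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet
result = evens.collect()          # the action runs map and filter
print(calls_before_action, result)
```

Here `square` is not called until `collect()` is invoked, which mirrors how transformations are only evaluated once an action asks for a result.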

Spark SQL is becoming a preferred choice for exploratory analytics on data. DataFrames are available as a programming abstraction whose API provides a convenient way of performing complex operations. Spark SQL is backed by the powerful Catalyst optimizer and can run with or without Hive; it will be discussed in detail as part of this workshop.

Streaming is one of the hottest areas in big data, with wide scope for near-real-time application development. Spark Streaming integrates easily with the other Spark modules and provides a powerful way of doing micro-batch analytics. We will take a look at some examples of real-time streaming.
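The micro-batch idea can be shown without any Spark dependency: incoming events are grouped into small fixed-length time intervals, and each group is processed as an ordinary batch. The sketch below is a plain-Python toy, not the Spark Streaming API; the function `micro_batches`, the sample events and the one-second interval are all assumptions made for illustration.

```python
from collections import Counter


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive batches of `interval` seconds.

    Assumes events arrive ordered by timestamp. Empty intervals are skipped.
    """
    batch, batch_id = [], None
    for ts, value in events:
        current = int(ts // interval)
        if batch_id is not None and current != batch_id:
            yield batch_id, batch
            batch = []
        batch_id = current
        batch.append(value)
    if batch:
        yield batch_id, batch


# Hypothetical sample stream: (arrival time in seconds, word).
events = [(0.2, "spark"), (0.9, "sql"), (1.1, "spark"), (1.5, "spark"), (3.4, "streaming")]

# Run a small batch job (a word count) on every micro-batch.
counts = [(bid, dict(Counter(values))) for bid, values in micro_batches(events, interval=1.0)]
for bid, c in counts:
    print(bid, c)
```

Each batch is handled independently, so the same batch-style logic is reused on every interval; the batch length sets the trade-off between latency and per-batch overhead.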