Spark stages

Shuffling is the process of redistributing data across partitions (also known as repartitioning). It may or may not move data across JVM processes or over the wire (between executors on separate machines).
A shuffle is also how data is transferred from one stage to the next.
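
As a rough illustration of redistributing data across partitions, here is a toy model in plain Python. It does not use Spark: the function name shuffle_by_key, the integer keys, and the simple hash-modulo rule are all assumptions made for this sketch, and Spark's real shuffle involves much more (serialization, shuffle files, network fetches).

```python
def shuffle_by_key(partitions, num_output_partitions):
    """Toy shuffle: send each (key, value) record to partition hash(key) % n.

    Returns the new partitions and the number of records that changed
    partition index (a stand-in for data that would have to move).
    """
    output = [[] for _ in range(num_output_partitions)]
    moved = 0
    for source_index, partition in enumerate(partitions):
        for key, value in partition:
            target_index = hash(key) % num_output_partitions
            output[target_index].append((key, value))
            if target_index != source_index:
                moved += 1
    return output, moved


# Two input partitions, each holding records for both keys.
input_partitions = [
    [(1, "a"), (2, "b")],
    [(1, "c"), (2, "d")],
]
output_partitions, moved = shuffle_by_key(input_partitions, 2)
```

After the call, all records with the same key sit in the same output partition, and some records stayed where they were while others moved. This mirrors the point that a shuffle may or may not move a given piece of data.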

When you invoke an action on an RDD, a “job” is created. Jobs are work submitted to Spark.
Jobs are divided into “stages” at shuffle boundaries.
Each stage is further divided into tasks, one per partition of the RDD, so tasks are the smallest units of work in Spark.
Wide transformations (those that require a shuffle, such as reduceByKey or groupByKey) result in stage boundaries.
Stages are therefore determined by the transformations in the lineage: consecutive narrow transformations are grouped (pipelined) together into a single stage.
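
The job, stage, and task breakdown described above can be sketched with a small plain-Python model. This is not Spark's scheduler: the set WIDE_TRANSFORMATIONS, the function split_into_stages, and the partition counts are hypothetical and chosen only for illustration. The sketch also simplifies by treating every listed wide transformation as a shuffle, whereas in Spark some of them (a join on co-partitioned RDDs, for example) can avoid one.

```python
# Assumed, non-exhaustive set of transformations that need a shuffle.
WIDE_TRANSFORMATIONS = {"groupByKey", "reduceByKey", "join", "repartition", "distinct"}


def split_into_stages(transformations):
    """Group a chain of transformation names into stages.

    Narrow transformations are appended to the current stage (pipelined);
    a wide transformation closes the current stage and starts a new one.
    """
    stages = []
    current = []
    for name in transformations:
        if name in WIDE_TRANSFORMATIONS and current:
            stages.append(current)
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages


lineage = ["flatMap", "map", "reduceByKey", "filter", "repartition", "map"]
stages = split_into_stages(lineage)

# One task per partition in each stage (partition counts are made up).
partitions_per_stage = [4, 4, 2]
total_tasks = sum(partitions_per_stage)
```

Here the six transformations fall into three stages, with the two wide transformations marking the boundaries, and the job runs one task per partition in each stage.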
