Apache Spark
Installation
Download a prebuilt package from https://spark.apache.org/downloads.html and extract it
Usage
- REPL: pyspark (Python), spark-shell (Scala)
- Job submission:
spark-submit code.py (or a packaged code.jar)
- Run a bundled example, e.g. the sketch after this list:
run-example SparkPi
spark-submit examples/src/main/python/pi.py
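A minimal sketch of a submittable PySpark job; the file name app.py and the app name are hypothetical. Run it with spark-submit app.py:
    # app.py -- minimal PySpark job (hypothetical example)
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("MinimalApp").getOrCreate()
        # Build a small DataFrame and trigger an action
        df = spark.range(1000)
        print(df.count())
        spark.stop()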
Initialize
Create a SparkSession to use the SQL/DataFrame interface: https://spark.apache.org/docs/latest/sql-programming-guide.html
The older RDD interface uses a SparkContext: https://spark.apache.org/docs/latest/rdd-programming-guide.html
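A minimal initialization sketch in Python (app and master settings are illustrative); the SparkContext for the RDD interface is reachable through the session:
    from pyspark.sql import SparkSession

    # SparkSession is the entry point to the SQL/DataFrame interface
    spark = SparkSession.builder \
        .appName("Example") \
        .master("local[*]") \
        .getOrCreate()

    # The older RDD interface uses the session's SparkContext
    sc = spark.sparkContext
    rdd = sc.parallelize([1, 2, 3])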
Read data
Get a Dataset/DataFrame from files:
- spark.read.load(path): uses the default format (Parquet) unless format(...) is specified
- spark.read.text(path)
- spark.read.json(path)
Or run SQL on files directly, e.g. SELECT * FROM parquet.`path/to/file`
See more at https://spark.apache.org/docs/latest/sql-data-sources.html
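A sketch of these readers in Python; all file paths are hypothetical:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # reuse or create a session

    # load() uses the default format (Parquet) unless overridden with format()
    df_parquet = spark.read.load("data/users.parquet")   # hypothetical path
    df_csv = spark.read.format("csv").option("header", "true").load("data/users.csv")

    df_text = spark.read.text("data/log.txt")    # one string column named "value"
    df_json = spark.read.json("data/people.json")

    # SQL directly on a file, without registering a view first
    df = spark.sql("SELECT * FROM parquet.`data/users.parquet`")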
Store data
Save a Dataset/DataFrame to files:
- df.write.save(path)
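A write sketch (output paths are hypothetical); save() defaults to Parquet, and format()/mode() override that:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)  # small DataFrame to write out

    df.write.save("out/numbers.parquet")  # Parquet by default; hypothetical path
    df.write.format("json").mode("overwrite").save("out/numbers_json")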
Dataset
Supports functional-style programming: transformations turn a Dataset into a new Dataset.
Actions extract values from a Dataset:
- count()
- first()
- collect()
Supports caching in memory: cache()
Execute raw SQL queries: spark.sql(query)
Built-in SQL function reference: https://spark.apache.org/docs/latest/api/sql/
Register a Dataset as a temporary view in SQL: df.createOrReplaceTempView(name)
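A sketch tying transformations, actions, caching, and SQL together; the column and view names are illustrative:
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100).withColumn("square", F.col("id") * F.col("id"))

    # Transformations return a new Dataset/DataFrame; nothing executes yet
    evens = df.filter(F.col("id") % 2 == 0)

    # Actions trigger execution and return values to the driver
    print(evens.count())      # number of rows
    print(evens.first())      # first Row
    rows = evens.collect()    # all rows as a local list

    # Keep the computed result in memory for reuse across actions
    evens.cache()

    # Register as a temporary view (hypothetical name) and query it with SQL
    evens.createOrReplaceTempView("evens")
    spark.sql("SELECT id, square FROM evens WHERE id > 10").show()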