Spark

Apache Spark is an open-source cluster computing framework.

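  • Open source

  • High performance

  • Easy to use: supports R, Scala, Java, and other languages

  • General-purpose: supports batch processing, stream processing, machine learning, and other workloads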

Spark ecosystem

Apache Spark architecture

Basic concepts

Spark APIs

RDD

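Partitionable: an RDD is split into partitions, which is what makes distributed computation possible.

In the example from the original figure, 4 tasks run at the same time, one task per partition.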
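
The idea of one task per partition can be sketched without Spark at all. The block below is a toy model in plain Python, not Spark's API: `split_into_partitions`, `sum_partition`, and `parallel_sum` are hypothetical names chosen for illustration, and a thread pool stands in for the cluster. It splits a list into 4 partitions, runs one task on each at the same time, and then combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_partition(partition):
    """The 'task': works on a single partition, independent of the others."""
    return sum(partition)


def parallel_sum(data, num_partitions=4):
    partitions = split_into_partitions(data, num_partitions)
    # One worker per partition, so all tasks can run at the same time.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(sum_partition, partitions))
    # Combine the per-partition results into the final answer.
    return partitions, partial_sums, sum(partial_sums)


partitions, partial_sums, total = parallel_sum(list(range(1, 101)))
print(len(partitions), partial_sums, total)
# prints: 4 [325, 950, 1575, 2200] 5050
```

In real Spark the partitions live on different machines and the scheduler assigns tasks to executors; the sketch only shows why independent partitions allow the work to proceed in parallel.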

RDD: Operations

RDD operation type 1: Transformations, which create a new RDD from an existing RDD.

http://spark.apache.org/docs/2.4.5/rdd-programming-guide.html#transformations

RDD operation type 2: Actions, which trigger the creation of a job and start the computation.
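
The difference between the two operation types can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not Spark's implementation: `ToyRDD` is a hypothetical class whose `map` and `filter` only record what should happen (transformations), while `collect` and `count` actually run the recorded steps (actions).

```python
class ToyRDD:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source      # the original data
        self._ops = tuple(ops)     # pending transformations, in order

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, fn):
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._ops + (("filter", pred),))

    # Actions: run the recorded transformations and return a result.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())


calls = []


def square(x):
    calls.append(x)  # record every time the function actually runs
    return x * x


numbers = ToyRDD([1, 2, 3, 4, 5])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has been computed yet
result = evens_squared.collect()   # the action triggers the computation
print(calls_before_action, result)
# prints: 0 [4, 16]
```

Note that `numbers` itself is never modified: each transformation returns a new object, which mirrors the "generate a new RDD from an existing one" behavior described above.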

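The DAG (directed acyclic graph) is important: transformations only add nodes to it, and an action makes Spark turn the DAG into a job and run it.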
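
To make the DAG idea concrete, here is a small plain-Python sketch, not Spark code. The `Node` class, the `execution_order` helper, and the dataset names (`lines`, `words`, and so on) are all hypothetical. Each dataset records which datasets it was derived from, and the helper walks that graph to find an order in which every parent is computed before its children.

```python
class Node:
    """One dataset in the toy DAG: a name plus the datasets it is derived from."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = tuple(parents)


def execution_order(target):
    """Return dataset names so that every parent comes before its child."""
    order, seen = [], set()

    def visit(node):
        if node.name in seen:
            return
        seen.add(node.name)
        for parent in node.parents:
            visit(parent)
        order.append(node.name)  # appended only after all of its parents

    visit(target)
    return order


# A word-count style lineage, plus a second input that joins in at the end.
lines = Node("lines")
words = Node("words", [lines])
pairs = Node("pairs", [words])
counts = Node("counts", [pairs])
stopwords = Node("stopwords")
filtered = Node("filtered", [counts, stopwords])

order = execution_order(filtered)
print(order)
# prints: ['lines', 'words', 'pairs', 'counts', 'stopwords', 'filtered']
```

Because the graph has no cycles, such an order always exists. Spark's scheduler does considerably more with the DAG (for example, grouping steps into stages), which this sketch does not attempt to model.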

Spark SQL DataFrames

Advanced topics

Structured Streaming(SS)

Example:

./bin/run-example org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount localhost 9999
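
The bundled example reads lines of text from a socket (here `localhost` port `9999`) and keeps a running count of the words it has seen. The plain-Python sketch below imitates only that behavior and is not Spark code: `running_word_count` and the sample `batches` are hypothetical, with each inner list standing in for the lines that arrive in one batch.

```python
from collections import Counter


def running_word_count(batches):
    """Yield the full word-count table after each batch of input lines."""
    totals = Counter()
    for lines in batches:
        for line in lines:
            totals.update(line.split())
        # Emit every count seen so far, not just this batch's changes.
        yield dict(totals)


batches = [
    ["apache spark"],
    ["apache hadoop", "spark spark"],
]
snapshots = list(running_word_count(batches))
for snapshot in snapshots:
    print(snapshot)
# prints:
# {'apache': 1, 'spark': 1}
# {'apache': 2, 'spark': 3, 'hadoop': 1}
```

The key point the sketch tries to capture is that state (`totals`) is carried across batches, so each output reflects all input received so far rather than only the latest batch.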

ML Pipelines

Introduction to Alibaba Cloud EMR

Spark SQL

Spark SQL is Apache Spark's module for working with structured data.

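EXPLAIN <SQL statement>: shows how the statement will be executed (the physical plan).

EXPLAIN EXTENDED <SQL statement>: shows more detail (the parsed, analyzed, and optimized logical plans in addition to the physical plan).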

Spark for ETL & Data Science

Delta Lake
