Spark Summit East 2015 NYC Report

Apache Spark Summit East 2015 opened smoothly, professionally run and well organized. The event opened Wednesday at 9am with keynotes from Databricks in the Grand Ballroom, and a number of great talks on Spark continued with everyone assembled in the one room.

 

Presentations Available

 Some interesting facts:

  • 2014 – 150 contributors to Spark
  • 2015 – 500 contributors to Spark
  • The code base doubled in that time, and Spark now has over 500 active production deployments.
  • Apache Spark is the most active big data project, more active than Storm or MapReduce
  • Apache Spark is the most active Apache project
  • On-disk sort record for 100TB (Daytona GraySort benchmark) – Spark set the 2014 record in 23 minutes using only 206 machines on the public cloud, beating Yahoo's previous Hadoop record.

Added in 2014

  • Added Spark SQL
  • Java 8 syntax
  • Python streaming
  • GraphX
  • Random forest
  • Streaming MLlib

New directions for 2015

  • Data Science (high-level interfaces similar to single-machine tools)
      • like R and Python tools, but clustered
  • Platform Interfaces
      • plug in data sources and algorithms
      • ex: Cassandra, AWS, Hadoop
  • DataFrames, in Spark 1.3, optimized automatically via Spark SQL (sketched in the code below)
      • API similar to data frames in R and Pandas
      • compiled to Java bytecode (faster than the plain Scala API)
  • Machine Learning Pipelines
      • SciKit-Learn-like API
      • featurization, evaluation, parameter search
      • tokenizer, hashingTF, model evaluation
  • Spark interface in R
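
A rough sketch of what the DataFrame API and ML pipelines look like in code (Spark 1.3-era APIs; the input file, column names, and pipeline stages are my own illustration, not from the keynote):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

    // DataFrame: R/Pandas-style operations, optimized by Spark SQL
    val users = sqlContext.jsonFile("users.json")  // hypothetical input file
    users.filter(users("age") > 21).groupBy("lang").count().show()

    // ML pipeline: featurization + model as one unit, scikit-learn style
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)  // training: a DataFrame with "text" and "label" columns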

 

MongoDB with Spark

 

External Data Sources

returns DataFrames usable in Spark apps or SQL

pushes logic down into the sources

ex: Cassandra, HBase, Hive, Parquet, PostgreSQL, JSON, MySQL, Elasticsearch

SELECT * FROM mysql_users u JOIN hive_logs h WHERE u.lang = 'en'


 

Notes:

 

In Spark 1.3 you can join between data sources (query federation), minimizing the amount of work done to get data out of the data sources, as sketched below.
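
A hedged sketch of what that federation looks like from the Spark 1.3 shell (the connection string, table names, and the ON clause are invented for illustration):

    // Expose a MySQL table through the JDBC data source
    val users = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:mysql://dbhost/app?user=u&password=p",  // placeholder URL
      "dbtable" -> "users"))
    users.registerTempTable("mysql_users")

    // hive_logs is assumed to already be registered via the Hive metastore;
    // Spark pushes what it can into each source and joins the rest itself
    val joined = sqlContext.sql(
      """SELECT * FROM mysql_users u JOIN hive_logs h
         ON u.id = h.user_id WHERE u.lang = 'en'""")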

JDBC datasource

community 3rd party packages

 

bin/spark-shell --packages databricks/spark-csv:0.2

ex: spark-avro, spark-csv, sparkling-water, …
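
For example, with the spark-csv package loaded as above, reading a CSV into a DataFrame looks roughly like this (the file path and option values are placeholders):

    // spark-csv registers itself as a data source; Spark 1.3-style load()
    val cars = sqlContext.load("com.databricks.spark.csv",
      Map("path" -> "cars.csv", "header" -> "true"))
    cars.printSchema()
    cars.select("year", "model").show()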

Spark Stack Diagram (top to bottom):

  • Scala, Java, Python, R
  • DataFrames, ML Pipelines
  • Spark SQL, Spark Streaming, MLlib, GraphX
  • Spark Core
  • Data Sources

Goal: a unified engine across data sources, workloads and environments

Harnessing the Power of Spark with Databricks Cloud

accelerating Spark adoption

certifying applications and distributions; certification is free and has been successful

75 applications

(Alpine, Elasticsearch, Typesafe, …)

11 distributions

(Hortonworks, Pivotal, IBM, Oracle, MapR, SAP, DataStax, BlueData, Stratio, Transwarp, Guavus)

online courses

Intro to Big Data with Spark

Scalable Machine Learning

46,000 registered

July 2014 – Databricks Cloud → 3,500 registered users

Nov 2014 – launched limited availability

100 companies using Databricks Cloud

Big data projects are hard:

  • set up and maintain a cluster (6-9 months)
  • data preparation (ingestion, ETL) (months)
  • productize (weeks to months)
  • exploration / KPIs
  • reports / dashboards (weeks)
  • insights: statistics, machine learning, graphing, iterative analysis
  • productionize those learnings

Databricks Cloud: zero management, real-time, unified platform, accelerating from months to days

Open Platform

Hosted on AWS

Spark, Spark Cluster Manager

Workspace (notebooks, dashboards, jobs)

seconds to build/destroy/scale/share clusters

notebooks – interactive visualization

one api, one engine for all workloads (batch, streaming, interactive, ml, graph computations, real-time)

One set of tools

publish a dashboard from a notebook to production with one click

data sources (s3, kafka, kinesis, redshift, hdfs, cassandra, mysql)

external packages (jars, libraries)

download your code and run it in any Spark distribution

ODBC driver to BI tools like Tableau and Qlik

Example

MyFitnessPal / UnderArmour

36 million recipes

14.5 billion logged foods

5 million food items

80 million registered users

using Spark and Databricks for 1 year

Spark project for suggested serving sizes, search, food data cleaning

Will use Spark for Ad-targeting/recommendation systems, deep-dive into customer understanding, large-scale ETL

Automatic Labs

IoT

Flood buoys, red light cameras, cars.

Rob Ferguson – connected car

a device that wirelessly tethers the car to your phone: drive smarter and safer, with notifications and car analysis

Connect to other things.

Lots of data in the car

Organized app data (PostgreSQL)

noisy time series readings from the car

some in amazon redshift

terabytes

Redshift, where good data goes to die

Developers didn't know the tools, were afraid of losing data, and it was production-only

unloaded to CSV files on S3

pg_dump to S3

then pulled into spark

deduped data

they analyzed fuel efficiency

collaborative data democracy

engineering expense is the biggest cost, not data storage

will eventually open to 3rd parties for spark databricks cloud

CLOSEST PARTNERS

analytics BI tool

ZoomData

BI tool integrated with Databricks Cloud

Spark – Pivot, Sort, Calculations, Filters, Joins

@zoomdata

zoomdata.cloud.databricks.com

showing 1 billion rows

almost instant performance for query changes with 1 billion rows

a 3-second attention span for people

scalable, secure, LDAP, make it available

ZoomData connectors (Impala, Solr, Mongo, Oracle, SQL Server, …)

real-time / live data

zoomdata.com/databricks

Uncharted (Oculus Info Inc.)

PanTera Tool

built on spark/databricks cloud

plot millions/billions of records on maps

generating views on demand

web map zoom/pan to blocks

can see one side of the street

can include social database

All Spark Summit registrants will get access to Databricks Cloud next week!

 

Palantir (founded by ex-PayPal people)

searching for illegal trading

huge diverse data

trader oversight

improve the risk model

find outlier behavior

compare self vs cohort

have to determine clustering of people from the data

generate alerts

improve interface to display them

data integration, analytics, decisions

Land Data

logs, jdbc, streams

get parquet, avro files into S3 or HDFS

data versioning

differentials / snapshot / append / update

Spark Transform, Spark SQL

Pandas Dataframe, Python Script

SchemaRDD

Goldman Sachs

Data Analytics Platform

Matt Glickman

embracing Spark

scalable data analytics platform for the enterprise

they saw it at Strata + Hadoop World NYC, Oct 2014

intuitive bindings to Scala, Java, Python, R

relational, functional, iterative APIs into a lazy-evaluation data pipeline

storage agnostic

lambda closures

power of distributed computing

using scala

Elasticity in 3 dimensions

data storage

compute

users

the power of Spark is the API abstractions (RDD, DataFrame)

Spark is becoming the lingua franca of big data analytics

contribute to open source

Step 1 is DataLake

How to consume and curate

Spark RDD DataFrames

DAG of all datasets

store curated data back to Data Lake

spin up CPU-segregated clusters on demand

Embed the Spark driver in JVM applications like a Scala library (see the sketch below)

use your existing JVM IDE

SparkContext

addJar

get classes from the classpath
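
My reading of that idea as code (the master URL and jar path are placeholders; this is a sketch of the pattern, not Goldman's actual setup):

    import org.apache.spark.{SparkConf, SparkContext}

    // The driver is just a library inside your JVM application
    val conf = new SparkConf()
      .setMaster("spark://cluster-master:7077")  // placeholder cluster URL
      .setAppName("embedded-driver")
    val sc = new SparkContext(conf)

    // Ship the application's classes (from the classpath) to the executors
    sc.addJar("/path/to/application.jar")

    // From here, lambdas written in the IDE run on the shared cluster
    val total = sc.parallelize(1 to 1000).map(_ * 2).sum()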

share spark clusters

Dynamically deploy code to cluster at run-time with lambda closures

enables real-time debugging of code on a distributed cluster

can add breakpoints in Spark lambdas

can see the data set

Provision machines to run spark

library synchronization

run on same cluster as HDFS?

Cloud data services vs internal

ETL is a big problem

Moving data is hard

scalable storage

data from external vendors

reconstructing vendor databases in data lakes

Cloud Data Exchange

ETL once: vendor data loaded by the vendors themselves

scalable compute near database

vendors can provision their data

enterprises load their data securely

this would be key for managed cloud data service

run Spark and SQL on this database

use Spark Client API as the new JDBC/ODBC

use Spark APIs

risk of not contributing

accelerated move of enterprise data to the public cloud

Peter Wang, Continuum Analytics

Python and Spark

NumPy, SciPy

PySpark

Anaconda Cluster

Blaze

Python has a more than 15-year history in scientific computing

high performance for clustering and linear algebra

pandas

the Python for Data Analysis book

IPython

GPU, distributed linear algebra, plotting, deep learning

streamparse → Storm

DeliRoll architecture – 1 billion rows a second with Python, beating Redshift

AdRoll

./bin/pyspark

help(pyspark)

SQL, Streaming, MLlib

Difficulties

Package Management

outside the normal Java build toolchain

rapid iteration issues

devops vs data scientist

versioning, deployments

pip install, conda install

operations

Anaconda Cluster (free)

manages Python, Java, R, JS packages across clusters

EC2, DigitalOcean, on-prem

isolates, cross platforms

manage runtime state across the cluster

python science distribution plus clustering

2 commands to create cluster and submit a job

profile file

destroy cluster with 1 command

Conda is the package manager for Anaconda Python

(language agnostic: directories, interpreter, Spark, Node.js, Java)

sandboxed with containers or virtualization; a single directory; Linux, Windows, Mac; versioned

conda environment/sandbox

 

 

3rd platform

 

Spark SQL

HiveContext

DataFrames replace SchemaRDD

 

Built-in data sources:

  • JSON
  • JDBC
  • Parquet
  • Hive
  • MySQL
  • HDFS
  • S3
  • PostgreSQL

 

External data sources:

  • Avro
  • CSV
  • HBase

 

DataFrame API

  • aggregation, filter
  • RDDs: the Spark API's expressive lambdas
  • DataFrames: table operations, groupBy
  • faster than RDDs
  • sqlContext.load("a", "json")
  • UDF support (see the sketch below)
  • partitioned Parquet handled automatically
  • working on data skipping via min and max summaries
  • pushing predicates down to JDBC; happens late, across functions
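
A small sketch of the load/UDF notes above (Spark 1.3 style; the file, column, and function names are invented):

    import org.apache.spark.sql.functions.udf

    // Spark 1.3-style load with an explicit source name
    val events = sqlContext.load("events.json", "json")

    // UDF support: wrap a plain Scala function for use on columns
    val toUpper = udf((s: String) => s.toUpperCase)
    events.select(toUpper(events("city"))).show()

    // Filters and aggregations run through the Spark SQL optimizer
    events.filter(events("country") === "US").groupBy("city").count().show()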

 

 

Machine learning pipeline

 

JDBC server uses the Thrift protocol and Hive metastore

the data source keeps its own metadata for Parquet

 

Cassandra and Spark

Spark, Kafka, Cassandra, Akka

Helena Edelson (helenaedelson on SlideShare)

Fault tolerance

Batch and stream

 

Cassandra

Bigtable

Amazon Dynamo paper

 

On spark

gossip, consensus, Paxos

 

CQL

DataStax

C*

Spark Cassandra Connector

Scala and Java APIs

Time series

 

data-locality aware

saveToCassandra

keyspace, table
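
A minimal sketch of that connector usage (the keyspace, table, and columns are placeholders):

    import com.datastax.spark.connector._

    // Write an RDD to Cassandra; the connector maps tuple fields to columns
    val temps = sc.parallelize(Seq(("2015-03-18", 72.5), ("2015-03-19", 68.1)))
    temps.saveToCassandra("weather", "daily_temps", SomeColumns("day", "temp"))

    // Read back as CassandraRows, with data-locality-aware partitioning
    val rows = sc.cassandraTable("weather", "daily_temps")
    rows.take(10).foreach(println)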

 

TwitterUtils

 

StreamingContext

CassandraRow

tuples

UDTs

time series

primary key has year, month, day, hour

with CLUSTERING ORDER BY

TimeUUID

 

data model mirrors the query

 

Akka supervisor strategy

actors

actor supervisor hierarchy

Reactive Streams

Kafka stream → saveToCassandra

 

KillrWeather on GitHub (Spark, Cassandra, Kafka)
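
The Kafka → Spark Streaming → Cassandra flow that KillrWeather demonstrates, roughly (the ZooKeeper host, topic, keyspace, table, and line format below are placeholders, not KillrWeather's actual schema):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._

    val ssc = new StreamingContext(sc, Seconds(5))

    // Consume raw readings from Kafka (receiver-based stream)
    val stream = KafkaUtils.createStream(
      ssc, "zkhost:2181", "weather-group", Map("raw_weather" -> 1))

    // Parse "station,temp" lines and persist each micro-batch to Cassandra
    stream.map { case (_, line) =>
      val parts = line.split(",")
      (parts(0), parts(1).toDouble)
    }.saveToCassandra("weather", "raw_data", SomeColumns("station", "temp"))

    ssc.start()
    ssc.awaitTermination()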

GraphX, Facebook, Spark

PageRank algorithm

 

databricks/reference-apps on GitHub

 

Graphx

 

Ratings

Users

Products

 

Collaborative filtering

 

Bipartite graph

MLlib (ALS sketch below)
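
For the collaborative-filtering piece, MLlib's ALS is the usual tool; a toy sketch over the users/products bipartite graph (the ratings are made up):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // (user, product, rating) triples form a bipartite ratings graph
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 4.0),
      Rating(2, 10, 5.0),
      Rating(1, 20, 1.0)))

    // Factorize: rank-10 latent features, 5 iterations
    val model = ALS.train(ratings, 10, 5)

    // Predict how user 2 would rate product 20
    val score = model.predict(2, 20)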

How do you store a graph?

Complex pipeline

Community detection

Hyperlinks

PageRank

Tables and graphs

 

graph processing in table-oriented Spark

GraphX API

 

Property graph

Vertex property

 

Creating a graph in Scala:

 

Vertices

Edges

RDDs to create a graph

triplets: edges with vertex properties

sendMsg along edges

PageRank built in (see the example below)
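
Putting those notes together as a small Scala example (the vertex names and edge data are invented):

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // Build a property graph from vertex and edge RDDs
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph(vertices, edges)

    // Triplets: edges joined with their vertex properties
    graph.triplets.collect().foreach(t => println(s"${t.srcAttr} -> ${t.dstAttr}"))

    // Built-in PageRank (tolerance-based variant)
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach(println)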
