Spark Conference Updates 2017


Presentations and news from around the world for Spark 2.x, with many coming from the recent Spark Summit talks.

There are a huge number of great presentations on what’s changing in Apache Spark 2.x. Here is a chance to catch up on them in PowerPoint and video format.

Simplifying Big Data Applications With Apache Spark 2.0 (YouTube)

The Next AMPLab: RISELab (YouTube)

Spark Performance (YouTube)

Automatic Checkpointing in Spark (YouTube)

Effective Spark with Alluxio (In-Memory) (YouTube)


Making the Switch: Predictive Maintenance on Railway Switches (YouTube)

Data-Aware Spark (Video)

Vegas, the Missing MatPlotLib for Spark (Video)

A Deep Dive into the Catalyst Optimizer (Video)

Lambda Architecture with Spark in the IoT (Video)

OrderedRDD: A Distributed Time Series Analysis Framework for Spark (Video)

A Deep Dive into the Catalyst Optimizer-Hands on Lab (Video)

SparkLint: a Tool for Monitoring, Identifying and Tuning Inefficient Spark Jobs Across Your Cluster (Video) (GitHub Spark Metrics) (GitHub Spark Lint)

Performance Characterization of Apache Spark on Scale-up Servers (Video)

Origin-Destination Matrix Using Mobile Network Data with Spark (Video)

How We Built an Event-Time Merge of Two Kafka-Streams with Spark Streaming


Sparkling Water 2.0: The Next Generation of Machine Learning on Apache Spark (Video)

Boosting Spark Performance on Many-Core Machines (Video)

Spark and Object Stores - What You Need to Know (Video)

Better Together: Fast Data with Apache Spark and Apache Ignite (Video)

Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! (Video)

HBase Conference Slides

The recent HBaseCon produced a number of excellent presentations that are available for viewing. I have ranked some of the best and most useful for you to look at.

HBaseCon 2016 Slides

  1. Phoenix Use Case
  2. Apache HBase Just the Basics
  3. Solving Multitenancy and G1GC in Apache HBase
  4. Rolling out Apache HBase for Mobile Offerings at Visa 
  5. Apache Spark on Apache HBase Current and Future
  6. Breaking the Sound Barrier with Persistent Memory
  7. Apache HBase Accelerated In-Memory Flush and Compaction
  8. Improvements to Apache HBase and Its Applications in Alibaba Search
  9. OffHeaping the Apache HBase Read Path

Strata Hadoop World is Here!!!

Strata Hadoop World Training is today (Monday, September 26, 2016), and regular tutorials and conferences start tomorrow.

If you can’t attend, many of the keynotes will be streaming live below:


If you are coming, stop by and say hi, and download the event app.


Upcoming Machine Learning Event



The Open Tour – NYC – July 19/20, 2016

Two packed days of demos, keynotes and training classes on this very cool open source machine learning framework, with over 15 speakers. I have used it against HDP, HDP Spark 1.6 and a standalone Spark 1.6 cluster, and it performed very well. Download H2O for Hadoop or Sparkling Water for Spark here. The product includes an awesome UI / data scientist notebook for rapid development of models. I will be attending and will report on the interesting talks. H2O is a very interesting open source machine learning/deep learning framework and UI that works on top of Hadoop, Spark or stand-alone. One unique feature is its ability to generate a POJO from a model that can then be used in regular Java programs or in a Hive UDF.

For more information, see the presentation my friend, Dr. Fogelson, did in Princeton on using H2O for Predicting Repeat Shoppers.

Contact me for a 20% discount.


H2O supports all the machine learning algorithms you would expect, such as GBM, Decision Trees, K-Means, Deep Learning, Naïve Bayes and more. H2O is very mature and has been in production for years. H2O is certified on the Hortonworks HDP platform.

This tutorial is pretty awesome as you can build a POJO and then use it as a Hive UDF.

Check out the project.  They also have awesome tutorials to get you started.

Take a look at this very cool Visual Introduction to Machine Learning.

MQTT, IoT, Scala/Java Tools, Reactive Programming, NiFi



Parallel HTTP Client


REST Client at Scale




IoT NEST Protocol




Data Science Manifesto


Spark Scala Tutorial


Set up structor for 6 nodes




The Best New Apache Spark Presentations















Related Githubs



Upcoming Meetups

Deep dive Avro and Parquet – Read Avro/Write Parquet using Kafka and Spark

Tuesday, Apr 5, 2016, 7:00 PM

NJ Big Data and Hadoop Meetup
3525 Quakerbridge Rd #1400 IBIS Office Plaza, Suite 1400 Hamilton Township, NJ

49 Hadoop Attending

Agenda
1) Avro and Parquet – When and Why to use which format?
2) Data modeling – Avro and Parquet schema
3) Workshop – Read Avro input from Kafka – Transform data in Spark – Write data frame to Parquet – Read back from Parquet

Speakers
Timothy Spann – Sr. Solutions Architect, airisDATA
Srinivas Daruna – Data Engineer, airisDATA



Scala + Spark SQL Workshop

Thursday, Mar 10, 2016, 7:00 PM

NJ Big Data and Hadoop Meetup
3525 Quakerbridge Rd #1400 IBIS Office Plaza, Suite 1400 Hamilton Township, NJ

42 Hadoop Attending

Agenda
1) Scala and Spark – Why functional paradigm? – Fn prog fundamentals – A Prog feature and hands-on (e.g. functions, collections, pattern matching, implicits – speaker choice) – Tie it back to Spark
2) Spark SQL – data frames and data sets – logical and physical plan – hands-on workshop

Speakers
Rajiv Singla – Data Engineer, airisDATA
Kristin…


NJ Data Science – Apache Spark

Princeton, NJ
360 Data Scientists

Large Scale Data Analysis to improve Business Profitability
• Framework – Apache Spark, Hadoop
• Machine learning – SparkML, H2O, R
• Graph Processing – GraphX, Titan, Neo4J…



Workshop – How to Build a Recommendation Engine using Spark 1.6 and HDP

Thursday, Mar 17, 2016, 7:00 PM

Princeton University – Lewis Library Rm 122
Washington Road and Ivy Lane, Princeton, NJ 08544

53 Data Scientists Attending

Agenda
a) Hands-on – Build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames, and sqlContext, uses SparkSQL for working with DataFrames, and explores the graphical abilities of Zeppelin.
b) Follow along – Build a Recommendation Engine – This will show how to build a predictive analytics (MLlib) …
