Spark Conference Updates 2017


Presentations and news from around the world for Spark 2.x, many of them from the recent Spark Summit talks.

There are a huge number of great presentations on what’s changing in Apache Spark 2.x. Here is a chance to catch up on many of them in PowerPoint and video format.

Simplifying Big Data Applications With Apache Spark 2.0 (YouTube)

The Next AMPLab: RISELab (YouTube)

Spark Performance (YouTube)

Automatic Checkpointing in Spark (YouTube)

Effective Spark with Alluxio (In-Memory) (YouTube)


Making the Switch: Predictive Maintenance on Railway Switches (YouTube)

Data-Aware Spark (Video)

Vegas, the Missing MatPlotLib for Spark (Video)

A Deep Dive into the Catalyst Optimizer (Video)

Lambda Architecture with Spark in the IoT (Video)

OrderedRDD: A Distributed Time Series Analysis Framework for Spark (Video)

A Deep Dive into the Catalyst Optimizer-Hands on Lab (Video)

SparkLint: a Tool for Monitoring, Identifying and Tuning Inefficient Spark Jobs Across Your Cluster (Video) (GitHub Spark Metrics) (GitHub Spark Lint)

Performance Characterization of Apache Spark on Scale-up Servers (Video)

Origin-Destination Matrix Using Mobile Network Data with Spark (Video)

How We Built an Event-Time Merge of Two Kafka-Streams with Spark Streaming


Sparkling Water 2.0: The Next Generation of Machine Learning on Apache Spark (Video)

Boosting Spark Performance on Many-Core Machines (Video)

Spark and Object Stores: What You Need to Know (Video)

Better Together: Fast Data with Apache Spark and Apache Ignite (Video)

Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! (Video)

HBase Conference Slides

The recent HBaseCon has produced a number of excellent presentations that are available for viewing.  I have ranked some of the best and most useful for you to look at.

HBaseCon 2016 Slides

  1. Phoenix Use Case
  2. Apache HBase Just the Basics
  3. Solving Multitenancy and G1GC in Apache HBase
  4. Rolling out Apache HBase for Mobile Offerings at Visa 
  5. Apache Spark on Apache HBase Current and Future
  6. Breaking the Sound Barrier with Persistent Memory
  7. Apache HBase Accelerated In-Memory Flush and Compaction
  8. Improvements to Apache HBase and Its Applications in Alibaba Search
  9. OffHeaping the Apache HBase Read Path

Strata Hadoop World is Here!!!

Strata Hadoop World Training is today (Monday, September 26, 2016), and the regular tutorials and conference sessions start tomorrow.

If you can’t attend, many of the keynotes will be streaming live below:


If you are coming, stop by and say hi, and download the event app.

Conference Mobile App
Download the event app at

Upcoming Machine Learning Event



The Open Tour – NYC – July 19/20, 2016

Two packed days of demos, keynotes and training classes on H2O, with over 15 speakers. H2O is a very interesting open source machine learning/deep learning framework and UI that works on top of Hadoop, Spark or stand-alone. I have used it against HDP, HDP Spark 1.6 and a standalone Spark 1.6 cluster, and it performed very well. Download H2O for Hadoop or Sparkling Water for Spark here. The product includes an awesome UI / data scientist notebook for rapid development of models. One unique feature is its ability to generate a POJO from a model that can then be used in regular Java programs or in a Hive UDF. I will be attending and will report on the interesting talks.

For more information, see the presentation my friend, Dr. Fogelson, did in Princeton on using H2O for Predicting Repeat Shoppers.

Contact me for a 20% discount.


H2O supports all the machine learning algorithms you would expect, such as GBM, Decision Trees, K-Means, Deep Learning, Naïve Bayes and more. H2O is very mature and has been in production for years. It is also certified on the Hortonworks HDP platform.

This tutorial is pretty awesome as you can build a POJO and then use it as a Hive UDF.

Check out the project.  They also have awesome tutorials to get you started.

Take a look at this very cool Visual Introduction to Machine Learning.

MQTT, IoT, Scala/Java Tools, Reactive Programming, NiFi



Parallel HTTP Client


REST Client at Scale




IoT NEST Protocol




Data Science Manifesto


Spark Scala Tutorial


Set up Structor for 6 nodes




Machine Learning, ODPi, Deduping with Scala, OCR

ODPi for Hadoop Standards: The ODPi and the ASF aim to consolidate Hadoop and all its versions. There are too many custom distributions with various versions of the 20 or so tools that make up Apache Big Data. Being able to move seamlessly between HDP, CDH, IBM, Pivotal and MapR would be awesome. For now, HDP, Pivotal and IBM are part of the ODPi.

Structured Data: Connecting a modern relational database and Hadoop is always an architectural challenge that requires decisions. EnterpriseDB (PostgreSQL) has an interesting article on that. It lets you read HDFS/Hive tables from EDB with SQL. (GitHub)

Semistructured Data: Using Apache NiFi with Tesseract for OCR: HP and Google have been fine-tuning Tesseract for a while to handle OCR. Using dataflow technology from the NSA, you can automate OCR tasks on Mac. Pretty cool. On my machine, I needed to install a few things first:

Tesseract-OCR FAQ

Searching Through PDFs with Tesseract with Apache Solr

Atlas + Ranger for Tag Based Policies in Hadoop: These new but polished Apache projects manage everything around security policies in the Hadoop ecosystem. There is also a cool example with Apache Solr.

Anyone who hasn’t tried Pig yet might want to check out this cool tutorial: Using Pig for NY Exchange Data. Pig will work on Tez and Spark, so it’s a tool Data Analysts should embrace.

It’s hard to think of Modern Big Data Applications without thinking of Scala.   A number of interesting resources have come out after Scala Days NYC.

Java 8 is still in the race for developing Modern Data applications, with a number of projects around Spring and Cloud Foundry. These include Spring Cloud Stream, which lets you connect microservices with Kafka or RabbitMQ, and you can run this on Apache YARN. Also see this article.

For those of you lucky enough to have a Community Account at Databricks cloud, you can check out the new features of Spark 2.0 on display in that platform before release.

Fuzzy matching is an interesting topic for me. I’ve seen a few interesting videos and GitHub repos on that:

Am I the only person trying to remove duplicates from data? CSV data? People?
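To make the deduping problem concrete, here is a minimal sketch of fuzzy duplicate removal on CSV data. It is not taken from any of the resources mentioned here. It uses Python's standard-library `difflib` as a stand-in for a real matching library, and the sample records, the `dedupe` helper and the 0.85 threshold are all made up for illustration.

```python
import csv
import io
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(value.lower().split())


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dedupe(rows, key, threshold=0.85):
    """Keep the first row of each group of near-duplicate `key` values.

    Each row is compared against every row already kept, so this is
    O(n^2) and only suitable for small inputs.
    """
    kept = []
    for row in rows:
        if not any(similarity(row[key], k[key]) >= threshold for k in kept):
            kept.append(row)
    return kept


# Hypothetical sample data: three spellings of one person plus one other person.
SAMPLE_CSV = """name,city
John Smith,Princeton
john  smith,Princeton
Jon Smith,Princeton
Mary Jones,New York
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
unique_rows = dedupe(rows, key="name")
for row in unique_rows:
    print(row["name"], "-", row["city"])
```

On this sample, the three spellings of John Smith collapse to the first one and Mary Jones is kept. The pairwise comparison will not scale to large datasets; at cluster scale you would typically block or bucket candidate records first and only compare within a bucket.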

I have also been looking for some good resources on NLP (Natural Language Processing). There are some interesting text problems I am looking at.

The Best New Apache Spark Presentations















Related Githubs