NJ Data Science Meetup – How to Build Data Analytics Applications with Spark and Hortonworks

Workshop – How to build data Analytics app & Reco Engine using Spark + Horton

Thursday, Mar 17, 2016, 7:00 PM

Princeton University – Lewis Library Rm 122
Washington Road and Ivy Lane, Princeton, NJ 08544

16 Data Scientists Attending

Agenda
a) Hands-on – Build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames, and sqlContext, uses Spark SQL for working with DataFrames, and explores the graphical abilities of Zeppelin.
b) Follow along – Build a Recommendation Engine – This will show how to build a whole web app with predictive…


Scala Days 2016 NYC

The Scala Days 2016 schedule has been announced!


Beyond Shuffling: Scaling Apache Spark
by Holden Karau @holdenkarau

Scala: The Unpredicted Lingua Franca for Data Science
by Andy Petrella @noootsab and Dean Wampler @deanwampler

Build a Recommender System in Apache Spark and Integrate It Using Akka
by Willem Meints @willem_meints

Implementing Microservices with Scala and Akka
by Vaughn Vernon @VaughnVernon

Microservices based off Akka cluster at iHeartRadio
by Kailuo Wang @kailuowang

Building a High-Performance Database with Scala, Akka & Spark
by Evan Chan @evanfchan

Large scale graph analysis using Scala and Akka
by Ben Fonarov @chuwiey

Distributed Real-Time Stream Processing: Why and How
by Petr Zapletal @petr_zapletal

Deep Learning and NLP with Spark
by Andy Petrella @noootsab


Fans of Scala, Spark, Big Data, Machine Learning, real-time computing, stream processing, functional programming and reactive programming all have great talks to choose from. There are tons of great speakers, including the developer of Spark Notebook, top people in Scala and a good representation from leading industry users.


Spark Summit East 2016 Schedule Released

Head on over to Spark Summit East 2016.   The schedule of talks is now up.

The highlights include a talk on Spark 2.0, giving us a peek into the future.

The developer tracks look pretty amazing to me:

There are also great enterprise track talks:


Scala Days New York 2016

Scala Days

May 9th-13th, 2016 New York

Scala Days will be in NYC on May 9th through May 11th, 2016. It will be conveniently located in Midtown at the AMA Executive Conference Center. If you are a Scala developer, a Spark developer, a Java developer looking to learn, or a Big Data enthusiast, this will be the place to be in May. More than a few presentations from the last few Scala Days have entered my rotation of must-reads.

I will be providing in-depth coverage of the event. This, along with Spark Summit, will be on my must-do list for 2016.

The two days immediately following the event will offer some awesome training opportunities for Scala, Spark, the SMACK Stack, Microservices, Reactive programming and Akka. The price goes up on January 20th and then again on March 16th, so put in those requests ASAP.

The program will be posted in February, but it’s guaranteed to be interesting with topics on Scala, Akka and Spark pretty much assured.   Typesafe is involved, so you know there will be good quality content.

Check out some videos from the 2015 Scala Days in San Francisco.

Bold Radius gave an awesome talk on Akka at the 2015 event. Be on hand so you can ask the experts and speakers questions and get more in-depth knowledge on these advanced topics.

Looking at previous agendas, like the one in Amsterdam, will give you a good preview. Topics like “Scala – The Real Spark of Data Science” give you an idea that this will be a very worthwhile conference.

I hope to see you there. I’ll post more details when they come in.

Check with your local meetup for a discount code.   If you are in New Jersey, see me at the NJ Data Science / Spark Meetup and I’ll get you a code.

If you are curious about Scala, check out Typesafe Activator and SBT to get up and running quickly with the full development tool kit.




Free Hadoop, Spark, Big Data Training

Free Hadoop Training List


Spark Fundamentals I


Spark Fundamentals II


Hadoop Fundamentals I

Big Data Fundamentals

Hadoop Developer Day Event

Introduction to Pig



Accessing Hadoop Data Using Hive


Using HBase for Real-time Access to your Big Data – Version 2


Introduction to Scala






















MapReduce and YARN




Pivotal HDB





Cloudera VM Download


Cask VM Download




Tools for Troubleshooting, Installation and Setup of Apache Spark Big Data Environments

Before you start, validate that you have connectivity and no firewall issues. Conn Check is an awesome tool for that.
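For a quick manual probe before reaching for a dedicated tool, something like the following sketch may help. It is not Conn Check and does not use its configuration format; it simply assumes `nc` (netcat) is installed and that you are checking a Spark standalone master. `MASTER_HOST` is a hypothetical variable, and the port list should be adjusted to your cluster.

```shell
#!/bin/sh
# Probe TCP ports with nc (netcat). A stand-in for a quick manual check,
# not a replacement for Conn Check. MASTER_HOST is a hypothetical variable.

port_open() {
    # -z: connect without sending data; -w 2: give up after 2 seconds.
    nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

check_ports() {
    host=$1
    shift
    for port in "$@"; do
        if port_open "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port NOT reachable"
        fi
    done
}

# 7077 and 8080 are the default standalone master and master web UI ports;
# 4040 is the default UI port of a running application.
check_ports "${MASTER_HOST:-localhost}" 7077 8080 4040
```

Note that a missing `nc` binary also shows up as "NOT reachable", so confirm the tool is installed before blaming the firewall.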

You may need to set up a number of servers at once; check out Sup.

First, get version 1.8 of the JDK. Apache Spark works best with Scala, Java and Python. Get the version of Scala you need: Scala 2.10 is the standard version and is the one used for the precompiled downloads. You can use Scala 2.11, but you will need to build the package yourself. You will need Apache Maven if you want to build Spark yourself, so it is a good idea to have it installed. Install Python 2.6 for PySpark. Also download SBT for Scala.
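Once the installs are done, a small script can confirm that the tools are actually on the PATH. This is a minimal sketch: the tool names (`java`, `scala`, `python`, `mvn`, `sbt`) are the usual command names for the software listed above and are an assumption about your environment, and the script only checks presence, not versions.

```shell
#!/bin/sh
# Report which of the tools named in this section are on the PATH.
# The command names are assumptions; adjust them for your environment.

have_tool() {
    # command -v is the POSIX way to ask whether a command can be found.
    command -v "$1" >/dev/null 2>&1
}

report_tools() {
    missing=0
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # The exit status is the number of missing tools (0 means all present).
    return "$missing"
}

report_tools java scala python mvn sbt || echo "install the missing tools before continuing"
```

To check versions by hand afterwards, keep in mind that `java -version` writes its output to standard error rather than standard output.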

Once everything is installed, a very cool tool to work with Apache Spark is the new Apache Zeppelin. It is great for data exploration and data science experiments, so give it a try.

An example SBT build file for a Spark job

name := "Postgresql Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.postgresql" % "postgresql" % "9.4-1204-jdbc42"
libraryDependencies += "org.mongodb" % "mongo-java-driver" % "3.1.0"
libraryDependencies += "com.stratio.datasource" % "spark-mongodb_2.10" % "0.10.0"

An example of running a Spark Scala Job

sudo /deploy/spark-1.5.1-bin-hadoop2.6/bin/spark-submit --packages com.stratio:spark-mongodb-core:0.8.7 --master spark:// --class "PGApp" --driver-memory 1G --driver-class-path /deploy/postgresql-9.4-1204.jdbc42.jar target/scala-2.10/postgresql-project_2.10-1.0.jar
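One detail worth spelling out is argument order: spark-submit reads its own options up to the application jar, and anything after the jar is handed to the application as arguments, so a flag such as `--driver-memory` belongs before the jar. The sketch below prints the command as a dry run instead of executing it. The paths, package coordinates and class name are copied from the example above; `SPARK_MASTER_URL` and the `spark://MASTER_HOST:7077` placeholder are assumptions, since the master URL is not given in full here.

```shell
#!/bin/sh
# Dry run: print a spark-submit command with all options before the jar.
# SPARK_MASTER_URL is a hypothetical variable; set it to your own master URL.
# The jar path is what `sbt package` produces for the build file above.

SPARK_HOME="${SPARK_HOME:-/deploy/spark-1.5.1-bin-hadoop2.6}"
APP_JAR="target/scala-2.10/postgresql-project_2.10-1.0.jar"

submit_command() {
    # Everything before the jar is read by spark-submit;
    # anything after the jar would be passed to the application itself.
    echo "$SPARK_HOME/bin/spark-submit" \
        --packages com.stratio:spark-mongodb-core:0.8.7 \
        --master "$1" \
        --class PGApp \
        --driver-memory 1G \
        --driver-class-path /deploy/postgresql-9.4-1204.jdbc42.jar \
        "$APP_JAR"
}

# Placeholder master URL; 7077 is the default standalone master port.
submit_command "${SPARK_MASTER_URL:-spark://MASTER_HOST:7077}"
```

Printing first makes it easy to review the final command line before running it for real.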

Items to add to your Spark toolbox:


Machine Learning