Machine Learning, ODPi, Deduping with Scala, OCR

ODPi for Hadoop Standards:   The ODPi and the ASF are working to consolidate Hadoop and all its versions.   There are too many custom distributions with varying versions of the 20 or so tools that make up the Apache Big Data ecosystem.   Being able to move seamlessly between HDP, CDH, IBM, Pivotal and MapR would be awesome.  For now HDP, Pivotal and IBM are part of the ODPi.

Structured Data:  Connecting a modern relational database to Hadoop is always an architectural challenge that requires decisions. EnterpriseDB (PostgreSQL) has an interesting article on that.   It lets you read HDFS/Hive tables from EDB with SQL.  (GitHub)

Semistructured Data:  Using Apache NiFi with Tesseract for OCR:   HP and Google have been fine-tuning Tesseract for a while to handle OCR.   Using dataflow technology that originated at the NSA, you can automate OCR tasks on a Mac.   Pretty cool.  On my machine, I needed to install a few things first:
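As a rough sketch of that setup, assuming Homebrew on macOS (the file names in the OCR step below are hypothetical examples, not part of any NiFi flow):

```shell
# Install Tesseract, plus Ghostscript for PDF-to-image conversion.
# Package names assume Homebrew on macOS.
brew install tesseract
brew install ghostscript

# Convert a PDF page to a 300-DPI TIFF, then OCR it with Tesseract.
# input.pdf, page.tiff and out are made-up file names.
gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 -sOutputFile=page.tiff input.pdf
tesseract page.tiff out
```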

Tesseract-OCR FAQ

Searching Through PDFs with Tesseract with Apache SOLR

Atlas + Ranger for Tag-Based Policies in Hadoop:  Use these new but polished Apache projects to manage everything around security policies in the Hadoop ecosystem.   Add to that a cool example with Apache SOLR.

Anyone who hasn’t tried Pig yet might want to check out this cool tutorial:  Using PIG for NY Exchange Data.   Pig will work on Tez and Spark, so it’s a tool Data Analysts should embrace.

It’s hard to think of Modern Big Data Applications without thinking of Scala.   A number of interesting resources have come out after Scala Days NYC.

Java 8 is still in the race for developing modern data applications, with a number of projects around Spring and Cloud Foundry, including Spring Cloud Stream, which lets you connect microservices with Kafka or RabbitMQ and can run on Apache YARN.  Also see this article.

For those of you lucky enough to have a Community account on Databricks Cloud, you can check out the new features of Spark 2.0 on display in that platform before release.

An interesting topic for me is fuzzy matching. I’ve seen a few interesting videos and GitHub repos on that:

Am I the only person trying to remove duplicates from data?   CSV Data?   People?    
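For the simplest case, deduping can be sketched in plain Scala using Levenshtein edit distance. This is a minimal illustration, not a production matcher; the threshold of 2 edits and the keep-first strategy are arbitrary assumptions:

```scala
object Dedup {
  // Classic dynamic-programming Levenshtein edit distance.
  def levenshtein(a: String, b: String): Int = {
    val dist = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      dist(i)(j) = math.min(
        math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1),
        dist(i - 1)(j - 1) + cost)
    }
    dist(a.length)(b.length)
  }

  // Keep a record only if no already-kept record is within maxDist edits
  // (case-insensitive). The default threshold of 2 is just an illustration.
  def dedupe(records: Seq[String], maxDist: Int = 2): Seq[String] =
    records.foldLeft(Vector.empty[String]) { (kept, r) =>
      if (kept.exists(k => levenshtein(k.toLowerCase, r.toLowerCase) <= maxDist)) kept
      else kept :+ r
    }
}
```

The same pairwise-distance idea scales out on Spark, though real fuzzy matching on people data usually needs blocking keys and phonetic codes on top of raw edit distance.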

I have also been looking for some good resources on NLP (Natural Language Processing).   There are some interesting text problems I am looking at.

Spark Summit East 2016 Schedule Released

Head on over to Spark Summit East 2016.   The schedule of talks is now up.

The highlights include a talk on Spark 2.0, giving us a peek into the future.

The developer tracks look pretty amazing to me:

There are also great enterprise track talks:


Scala Days New York 2016

Scala Days

May 9th-13th, 2016 New York

Scala Days will be in NYC May 9th through May 11th, 2016, conveniently located in midtown at the AMA Executive Conference Center.   If you are a Scala developer, a Spark developer, a Java developer looking to learn, or a Big Data enthusiast, this will be the place to be in May.  More than a few presentations from the last few Scala Days have entered my rotation of must-reads.

I will be providing in-depth coverage of the event.   This, along with Spark Summit, will be on my must-do list for 2016.

The two days immediately following the event will offer some awesome training opportunities for Scala, Spark, the SMACK stack, microservices, reactive programming and Akka.    The price goes up January 20th and again on March 16th, so put in those requests ASAP.

The program will be posted in February, but it’s guaranteed to be interesting, with topics on Scala, Akka and Spark pretty much assured.   Typesafe is involved, so you know there will be good-quality content.

Check out some videos from the 2015 Scala Days in San Francisco.

Bold Radius gave an awesome talk on Akka at the 2015 event.   Be on hand so you can ask the experts and speakers questions and get more in-depth knowledge on these advanced topics.

Looking at previous agendas, like the one from Amsterdam, is a good preview.   Topics like “Scala – The Real Spark of Data Science” give you an idea that this will be a very worthwhile conference.

I hope to see you there; I’ll post some more details when they come in.

Check with your local meetup for a discount code.   If you are in New Jersey, see me at the NJ Data Science / Spark Meetup and I’ll get you a code.

If you are curious about Scala, check out Typesafe Activator and SBT to get up and running quickly with the full development toolkit.




Tools for Troubleshooting, Installation and Setup of Apache Spark Big Data Environments

Validate that you have connectivity and no firewall issues when you are starting.   Conn Check is an awesome tool for that.

If you need to set up a number of servers at once, check out Sup.

First, get version 1.8 of the JDK.  Apache Spark works best with Scala, Java and Python.  Get the version of Scala you need:   Scala 2.10 is the standard version and is used for the precompiled downloads.   You can use Scala 2.11, but you will need to build the package yourself, which requires Apache Maven (a good idea to have anyway).   Install Python 2.6+ for PySpark.  Also download SBT for Scala.

Once everything is installed, a very cool tool to work with Apache Spark is the new Apache Zeppelin.   It’s very cool for data exploration and data science experiments; give it a try.

An Example SBT for building a Spark Job

name := "Postgresql Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.postgresql" % "postgresql" % "9.4-1204-jdbc42"
libraryDependencies += "org.mongodb" % "mongo-java-driver" % "3.1.0"
libraryDependencies += "com.stratio.datasource" % "spark-mongodb_2.10" % "0.10.0"

An example of running a Spark Scala job (note that spark-submit options such as --driver-memory must come before the application JAR, or they are passed to the application as arguments):

sudo /deploy/spark-1.5.1-bin-hadoop2.6/bin/spark-submit \
  --packages com.stratio:spark-mongodb-core:0.8.7 \
  --master spark:// \
  --class "PGApp" \
  --driver-class-path /deploy/postgresql-9.4-1204.jdbc42.jar \
  --driver-memory 1G \
  target/scala-2.10/postgresql-project_2.10-1.0.jar
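For context, here is a hypothetical sketch of what a "PGApp" class like the one submitted above might look like, using the Spark 1.5 SQLContext JDBC data source and the Postgres driver from the SBT file. The host, database and table names are made-up examples:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PGApp {
  // Build the JDBC URL from host/database; kept as a separate
  // function so it is easy to test in isolation.
  def jdbcUrl(host: String, db: String): String =
    s"jdbc:postgresql://$host:5432/$db"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PGApp")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read a Postgres table through the Spark SQL JDBC data source.
    // "localhost", "mydb" and "public.customers" are hypothetical.
    val df = sqlContext.read.format("jdbc").options(Map(
      "url"     -> jdbcUrl("localhost", "mydb"),
      "dbtable" -> "public.customers",
      "driver"  -> "org.postgresql.Driver"
    )).load()

    df.show()
    sc.stop()
  }
}
```

From there the DataFrame can be joined, filtered or written back out like any other Spark SQL source.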

Items to add to your Spark toolbox:


Machine Learning



Apache Spark Recent Links

Apache Blur