Strata Hadoop World is Here!!!

Strata Hadoop World training is today (Monday, September 26, 2016), and the regular tutorials and conference sessions start tomorrow.

If you can’t attend, many of the keynotes will be streamed live at:

http://strataconf.com/live

 

If you are coming, stop by and say hi, and download the event app.

Conference Mobile App
Download the event app at http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/content/mobile-app.

Upcoming Machine Learning Event

H2O.ai – The Open Tour – NYC – July 19/20, 2016

Two packed days of demos, keynotes, and training classes on this very cool open source machine learning framework, with over 15 speakers. H2O is an open source machine learning/deep learning framework and UI that works on top of Hadoop, on Spark, or stand-alone. I have used it against HDP, Spark 1.6 on HDP, and a standalone Spark 1.6 cluster, and it performed very well. Download H2O for Hadoop or Sparkling Water for Spark here. The product includes an awesome UI/data scientist notebook for rapid development of models. One unique feature is its ability to generate a POJO from a model that can then be used in regular Java programs or in a Hive UDF. I will be attending and will report on the interesting talks.
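If you want to kick the tires on Sparkling Water from Scala, the bootstrap is tiny. Here is a minimal sketch, assuming the sparkling-water-core artifact for Spark 1.6 is on the classpath (the app name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.h2o.H2OContext

// Minimal Sparkling Water bootstrap on a Spark 1.6-era cluster.
object SparklingWaterDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sparkling-water-demo"))

    // Starts an H2O cloud inside the Spark executors.
    val h2oContext = H2OContext.getOrCreate(sc)

    // Prints cluster details, including where the Flow notebook UI is listening.
    println(h2oContext)

    sc.stop()
  }
}
```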

For more information, see the presentation my friend Dr. Fogelson gave in Princeton on using H2O for Predicting Repeat Shoppers.

Contact me for a 20% discount.


H2O supports all the machine learning algorithms you would expect, such as GBM, Decision Trees, K-Means, Deep Learning, Naïve Bayes, and more. H2O is very mature, has been in production for years, and is certified on the Hortonworks HDP platform.

This tutorial is pretty awesome: it shows how to build a POJO from a model and then use it as a Hive UDF.
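As a concrete illustration, here is a minimal sketch of wrapping an H2O-generated POJO in a Hive UDF. Assumptions: the exported POJO class is named GbmModel, the model is binomial, the feature columns are age and income, and both h2o-genmodel and the Hive exec jar are on the classpath:

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

// GbmModel is the hypothetical class name of the POJO exported from H2O.
class ScoreCustomerUdf extends UDF {
  private val model = new EasyPredictModelWrapper(new GbmModel())

  // Hive calls this per row, e.g.: SELECT score_customer(age, income) FROM customers;
  def evaluate(age: Double, income: Double): Double = {
    val row = new RowData()
    row.put("age", Double.box(age))
    row.put("income", Double.box(income))
    // Probability of the positive class from the binomial model.
    model.predictBinomial(row).classProbabilities(1)
  }
}
```

Register the jar in Hive with ADD JAR and CREATE TEMPORARY FUNCTION, and you can score rows in plain SQL.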

Check out the project.  They also have awesome tutorials to get you started.

Take a look at this very cool Visual Introduction to Machine Learning.

MQTT, IoT, Scala/Java Tools, Reactive Programming, NiFi

MQTT

https://github.com/eBay/Spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala

https://github.com/richards-tech/RTMQTT
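The first link above is Spark’s MQTT word count example from the streaming module; the core of it is roughly this sketch (Spark 1.6’s external spark-streaming-mqtt module; the broker URL and topic are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils

// Counts words arriving on an MQTT topic in two-second batches.
object MqttWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MQTTWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "sensors/words")
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```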

 

Parallel HTTP Client/REST Client at Scale

http://www.parallec.io/

 

Testing

http://gatling.io/#/download
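Gatling simulations are plain Scala. A minimal smoke-test sketch in Gatling 2.x syntax (the base URL, request name, and user count are placeholders):

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// Fires 10 concurrent users at the home page and reports response statistics.
class SmokeSimulation extends Simulation {
  val httpConf = http.baseURL("http://localhost:8080")

  val scn = scenario("Smoke test")
    .exec(http("home").get("/"))
    .pause(1.second)

  setUp(scn.inject(atOnceUsers(10))).protocols(httpConf)
}
```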

 

IoT Nest Protocol

https://github.com/openthread/openthread

 

MQTT for NiFi

https://github.com/richards-tech/RTNiFiStreamProcessors

https://www.baldengineer.com/mqtt-tutorial.html


 

Data Science Manifesto

http://www.datasciencemanifesto.org/

https://github.com/gm-spacagna/lanzarote-awesomeness

 

Spark Scala Tutorial

https://github.com/tspannhw/spark-scala-tutorial/tree/master/tutorial#code/src/main/scala/sparkworkshop/Intro1-script.scala

 

Set up structor for 6 nodes

https://github.com/hortonworks/structor

 

Spark with ORC

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_spark-guide/content/ch_orc-spark.html

https://github.com/DhruvKumar/spark-workshop
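The HDP documentation above boils down to a few lines of the Spark 1.6 API. A minimal sketch, assuming a HiveContext and placeholder paths and column names:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Reads an ORC file into a DataFrame, filters it, and writes ORC back out.
object OrcDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("orc-demo"))
    val sqlContext = new HiveContext(sc)

    val people = sqlContext.read.format("orc").load("/data/people.orc")
    people.filter(people("age") > 21)
      .write.format("orc").save("/data/adults.orc")

    sc.stop()
  }
}
```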

 

https://github.com/abajwa-hw/hdp-datascience-demo

 

Machine Learning, ODPi, Deduping with Scala, OCR

ODPi for Hadoop Standards: The ODPi is working with the ASF to consolidate Hadoop and all of its versions. There are too many custom distributions with varying versions of the 20 or so tools that make up the Apache big data stack. Being able to move seamlessly between HDP, CDH, IBM, Pivotal, and MapR would be awesome. For now, HDP, Pivotal, and IBM are part of the ODPi.

Structured Data: Connecting a modern relational database to Hadoop is always an architectural challenge that requires decisions. EnterpriseDB (PostgreSQL) has an interesting article on that. It lets you read HDFS/Hive tables from EDB with SQL. (GitHub)

Semistructured Data: Using Apache NiFi with Tesseract for OCR: HP and Google have been fine-tuning Tesseract for a while to handle OCR. Using dataflow technology originally built at the NSA, you can automate OCR tasks on a Mac. Pretty cool. On my machine, I needed to install a few things first.
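The OCR call itself is small. As an illustration (my own sketch, not the exact processor from the post), here is what a custom NiFi processor or script would do via Tess4J, the Java wrapper for Tesseract; the tessdata path and image file name are placeholders:

```scala
import java.io.File
import net.sourceforge.tess4j.Tesseract

// Runs Tesseract OCR over an image and prints the extracted text.
object OcrDemo {
  def main(args: Array[String]): Unit = {
    val tesseract = new Tesseract()
    // Point at the installed language data (Homebrew-style path; adjust to your install).
    tesseract.setDatapath("/usr/local/share/tessdata")
    val text = tesseract.doOCR(new File("scanned-invoice.png"))
    println(text)
  }
}
```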

Tesseract-OCR FAQ

Searching Through PDFs with Tesseract and Apache SOLR

Atlas + Ranger for Tag-Based Policies in Hadoop: Use these new but polished Apache projects for managing everything around security policies in the Hadoop ecosystem. Add to that a cool example with Apache SOLR.

Anyone who hasn’t tried Pig yet might want to check out this cool tutorial: Using Pig for NY Exchange Data. Pig runs on Tez and Spark, so it’s a tool data analysts should embrace.

It’s hard to think of modern big data applications without thinking of Scala. A number of interesting resources came out of Scala Days NYC.

Java 8 is still in the race for developing modern data applications, with a number of projects around Spring and Cloud Foundry, including Spring Cloud Stream, which lets you connect microservices with Kafka or RabbitMQ and can run on Apache YARN. Also see this article.

For those of you lucky enough to have a community account on Databricks Cloud, you can check out the new features of Spark 2.0 on that platform before release.

An interesting topic for me is fuzzy matching; I’ve seen a few interesting videos and GitHub projects on it.

Am I the only person trying to remove duplicates from data? CSV data? People?
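As a toy illustration of the problem (my own sketch, not from any of those talks), here is a tiny Levenshtein-based dedup over a list of names; the edit-distance threshold is an arbitrary assumption:

```scala
// Toy fuzzy dedup: drop records whose name is within a small edit
// distance of one already kept.
object FuzzyDedup {

  // Classic dynamic-programming Levenshtein edit distance.
  def levenshtein(a: String, b: String): Int = {
    val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      dp(i)(j) = math.min(
        math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1),
        dp(i - 1)(j - 1) + cost)
    }
    dp(a.length)(b.length)
  }

  def dedupe(names: Seq[String], maxDist: Int = 2): Seq[String] =
    names.foldLeft(Vector.empty[String]) { (kept, name) =>
      if (kept.exists(k => levenshtein(k, name) <= maxDist)) kept
      else kept :+ name
    }

  def main(args: Array[String]): Unit =
    // "John Smith" collapses into "Jon Smith" (edit distance 1).
    println(dedupe(Seq("Jon Smith", "John Smith", "Jane Doe")))
}
```

Real-world dedup needs blocking, per-field comparators, and a survivorship rule, but the core comparison looks like this.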

I have also been looking for some good resources on NLP (natural language processing); there are some interesting text problems I am working on.

The Best New Apache Spark Presentations

SMACK (Spark, Mesos, Akka, Cassandra, Kafka)

 

 

Related GitHub Repos

https://github.com/socrata-platform/socrata-http

 

 

Upcoming Meetups

Deep dive Avro and Parquet – Read Avro/Write Parquet using Kafka and Spark

Tuesday, Apr 5, 2016, 7:00 PM

NJ Big Data and Hadoop Meetup
3525 Quakerbridge Rd, IBIS Office Plaza, Suite 1400, Hamilton Township, NJ

49 attending

Agenda: 1) Avro and Parquet – when and why to use which format. 2) Data modeling – Avro and Parquet schemas. 3) Workshop – read Avro input from Kafka, transform data in Spark, write the data frame to Parquet, read back from Parquet. Speakers: Timothy Spann – Sr. Solutions Architect, airisDATA; Srinivas Daruna – Data Engineer, airisDATA

Check out this Meetup →

 

Scala + Spark SQL Workshop

Thursday, Mar 10, 2016, 7:00 PM

NJ Big Data and Hadoop Meetup
3525 Quakerbridge Rd, IBIS Office Plaza, Suite 1400, Hamilton Township, NJ

42 attending

Agenda: 1) Scala and Spark – why the functional paradigm; functional programming fundamentals; a language feature hands-on (e.g., functions, collections, pattern matching, implicits – speaker’s choice); tying it back to Spark. 2) Spark SQL – DataFrames and Datasets; logical and physical plans; hands-on workshop. Speakers: Rajiv Singla – Data Engineer, airisDATA; Kristin…

Check out this Meetup →

NJ Data Science – Apache Spark

Princeton, NJ
360 Data Scientists

Large Scale Data Analysis to Improve Business Profitability
• Frameworks – Apache Spark, Hadoop
• Machine learning – SparkML, H2O, R
• Graph processing – GraphX, Titan, Neo4j…

Check out this Meetup Group →

 

Workshop – How to Build Recommendation Engine using Spark 1.6 and HDP

Thursday, Mar 17, 2016, 7:00 PM

Princeton University – Lewis Library Rm 122
Washington Road and Ivy Lane, Princeton, NJ 08544

53 Data Scientists Attending

Agenda: a) Hands-on – build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames, and sqlContext, uses SparkSQL for working with DataFrames, and explores the graphical abilities of Zeppelin. b) Follow along – build a recommendation engine – this will show how to build a predictive analytics (MLlib) …

Check out this Meetup →

NJ Data Science Meetup – How to Build Data Analytics Applications with Spark and Hortonworks

Workshop – How to build data Analytics app & Reco Engine using Spark + Horton

Thursday, Mar 17, 2016, 7:00 PM

Princeton University – Lewis Library Rm 122
Washington Road and Ivy Lane, Princeton, NJ 08544

16 Data Scientists Attending

Agenda: a) Hands-on – build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames, and sqlContext, uses SparkSQL for working with DataFrames, and explores the graphical abilities of Zeppelin. b) Follow along – build a recommendation engine – this will show how to build a whole web app with predictive…

Check out this Meetup →