H2O.ai – The Open Tour – NYC – July 19/20, 2016
Two packed days of demos, keynotes, and training classes on this very cool open source machine learning framework, with over 15 speakers. H2O is an open source machine learning/deep learning framework and UI that runs on top of Hadoop, on Spark, or stand-alone. I have used it against HDP, HDP Spark 1.6, and a standalone Spark 1.6 cluster, and it performed very well. Download H2O for Hadoop or Sparkling Water for Spark here. The product includes an awesome UI / data scientist notebook for rapid development of models. One unique feature is its ability to generate a POJO from a trained model that can then be used in regular Java programs or in a Hive UDF. I will be attending and will report on the interesting talks.
For more information, see the presentation my friend Dr. Fogelson gave in Princeton on using H2O for Predicting Repeat Shoppers.
Contact me for a 20% discount.
H2O supports all the machine learning algorithms you would expect, like GBM, Decision Trees, K-Means, Deep Learning, Naïve Bayes, and more. H2O is very mature and has been in production for years. H2O is certified on the Hortonworks HDP platform.
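For a feel of what these algorithms do under the hood, here is the classic K-Means loop as a toy, single-machine Python sketch (my own illustration, nothing like H2O's distributed implementation): assign each point to its nearest centroid, then move each centroid to its cluster's mean.

```python
def kmeans(points, k, iterations=10):
    """Toy single-machine K-Means on 2-D points (illustration only)."""
    # Naive init: first k points as starting centroids (real code uses random or k-means++).
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                        (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster's mean.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids

# Two obvious blobs: one near (0, 0), one near (10, 10).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(data, k=2))
print([(round(x, 2), round(y, 2)) for x, y in centers])
```

The real value of H2O is that it runs this kind of assign/update iteration in parallel across a cluster instead of in a single Python loop.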
This tutorial is pretty awesome as you can build a POJO and then use it as a Hive UDF.
Check out the project. They also have awesome tutorials to get you started.
Take a look at this very cool Visual Introduction to Machine Learning.
ODPi for Hadoop Standards: The ODPi, together with the ASF, aims to consolidate Hadoop and all its versions. There are too many custom distributions with varying versions of the 20 or so tools that make up Apache Big Data. Being able to move between HDP, CDH, IBM, Pivotal, and MapR seamlessly would be awesome. For now, HDP, Pivotal, and IBM are part of the ODPi.
Structured Data: Connecting a modern relational database to Hadoop is always an architectural challenge that requires decisions; EnterpriseDB (PostgreSQL) has an interesting article on that. It lets you read HDFS/Hive tables from EDB with SQL. (GitHub)
Semistructured Data: Using Apache NiFi with Tesseract for OCR: HP and Google have been fine-tuning Tesseract for a while to handle OCR. Using dataflow technology from the NSA, you can automate OCR tasks on a Mac. Pretty cool. On my machine, I needed to install a few things first.
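For reference, a typical OS X setup looks something like this (my guess at the packages, assuming Homebrew; I don't have the author's exact list):

```shell
# My guess at the prerequisites (Homebrew assumed); adjust to taste.
brew install tesseract      # the OCR engine itself, with English language data
brew install imagemagick    # handy for converting PDFs/scans into images Tesseract accepts
```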
Searching Through PDFs with Tesseract with Apache SOLR
Atlas + Ranger for Tag-Based Policies in Hadoop: Using these new but polished Apache projects to manage everything around security policies in the Hadoop ecosystem. Add to that a cool example with Apache SOLR.
Anyone who hasn’t tried Pig yet might want to check out this cool tutorial, Using Pig for NY Exchange Data. Pig will work on Tez and Spark, so it’s a tool data analysts should embrace.
It’s hard to think of modern Big Data applications without thinking of Scala. A number of interesting resources came out of Scala Days NYC.
Java 8 is still in the race for developing modern data applications, with a number of projects around Spring and Cloud Foundry, including Spring Cloud Stream, which lets you connect microservices with Kafka or RabbitMQ and run it all on Apache YARN. Also see this article.
For those of you lucky enough to have a Community Account on Databricks Cloud, you can check out the new features of Spark 2.0 on display in that platform before release.
An interesting topic for me is fuzzy matching; I’ve seen a few interesting videos and GitHub repos on the subject.
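For quick experiments before reaching for those projects, Python's standard library difflib already does basic fuzzy string matching; a minimal sketch (my example, not from any of the linked talks):

```python
from difflib import SequenceMatcher, get_close_matches

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means an exact (case-insensitive) match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["Jonathan Smith", "Jon Smith", "John Smyth", "Mary Jones"]

# Best candidate for a misspelled query, at a 0.6 similarity cutoff.
best = get_close_matches("Jhon Smith", names, n=1, cutoff=0.6)
print(best)

# Pairwise score between two variants of what may be the same person.
score = round(similarity("Jonathan Smith", "Jon Smith"), 2)
print(score)
```

For anything serious you would reach for token-based or phonetic methods, but SequenceMatcher is a fine baseline to compare them against.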
Am I the only person trying to remove duplicates from data? CSV Data? People?
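A first pass at de-duplicating people in CSV data usually looks like this (a toy sketch of my own, not a real entity-resolution pipeline): normalize each record into a match key, then keep the first row per key.

```python
import csv
import io

def normalize(name: str, email: str) -> tuple:
    """Build a match key: collapse whitespace and lowercase both fields."""
    return (" ".join(name.lower().split()), email.strip().lower())

def dedupe(rows):
    seen = set()
    unique = []
    for row in rows:
        key = normalize(row["name"], row["email"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

raw = """name,email
John Smith,jsmith@example.com
john  smith ,JSMITH@example.com
Mary Jones,mjones@example.com
"""

rows = list(csv.DictReader(io.StringIO(raw)))
unique = dedupe(rows)
print(len(unique))
```

This only catches exact duplicates after normalization; "Jon Smith" vs "John Smith" is where the fuzzy matching above has to take over.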
I have also been looking for some good resources on NLP (Natural Language Processing). There are some interesting text problems I am looking at.
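While I dig for good NLP resources, even the standard library covers the very first steps of a text problem; a toy sketch (my example): tokenize, drop stop words, count.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real NLP libraries ship much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in"}

def top_terms(text: str, n: int = 3):
    """Lowercase, tokenize on letters, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS).most_common(n)

doc = ("Hadoop and Spark dominate big data. "
       "Spark is the fast engine; Hadoop is the storage layer. Spark wins on speed.")
terms = top_terms(doc)
print(terms)
```

Anything past this (stemming, part-of-speech tagging, entity extraction) is where proper NLP toolkits earn their keep.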