Machine Learning, ODPi, Deduping with Scala, OCR

ODPi for Hadoop Standards:   ODPi and the ASF aim to consolidate Hadoop and all its versions.   There are too many custom distributions shipping different versions of the 20 or so tools that make up Apache Big Data.   Being able to move seamlessly between HDP, CDH, IBM, Pivotal and MapR would be awesome.  For now, HDP, Pivotal and IBM are part of ODPi.

Structured Data:  Connecting a modern relational database to Hadoop is always an architectural challenge that requires decisions. EnterpriseDB (PostgreSQL) has an interesting article on that.   It lets you read HDFS/Hive tables from EDB with SQL.  (GitHub)

Semistructured Data:  Using Apache NiFi with Tesseract for OCR:   HP and Google have been fine-tuning Tesseract for a while to handle OCR.   Using NiFi, the dataflow technology that came out of the NSA, you can automate OCR tasks on a Mac.   Pretty cool.  On my machine, I needed to install a few things first:

Tesseract-OCR FAQ

Searching Through PDFs with Tesseract with Apache SOLR
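To give a feel for the OCR step on its own, here is a minimal Python sketch that shells out to the Tesseract command line. It is not the NiFi flow from the linked article. It assumes the tesseract binary is already installed and on the PATH, and the helper names (tesseract_command, ocr) are made up for illustration. In NiFi, a similar command could be wired into a processor that runs external programs.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the CLI call; 'stdout' as the output base prints the text instead of writing a file."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Hypothetical helper: assumes the tesseract binary is installed locally.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout
```

Calling ocr("scan.png") on a scanned page should return the recognized text as a string, which you could then send on to an indexer such as Solr.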

Atlas + Ranger for Tag Based Policies in Hadoop:  These new but polished Apache projects manage everything around security policies in the Hadoop ecosystem.   Add to that a cool example with Apache Solr.

Anyone who hasn’t tried Pig yet might want to check out this cool tutorial:  Using Pig for NY Exchange Data.   Pig will work on Tez and Spark, so it’s a tool Data Analysts should embrace.

It’s hard to think of Modern Big Data Applications without thinking of Scala.   A number of interesting resources have come out after Scala Days NYC.

Java 8 is still in the race for developing Modern Data applications, with a number of projects around Spring and Cloud Foundry.   One of them is Spring Cloud Stream, which lets you connect microservices with Kafka or RabbitMQ, and you can run it on Apache YARN.  Also see this article.

For those of you lucky enough to have a Community Account on the Databricks cloud, you can check out the new features of Spark 2.0 on that platform before the release.

An interesting topic for me is Fuzzy Matching.   I’ve seen a few interesting videos and GitHub repos on that:

Am I the only person trying to remove duplicates from data?   CSV Data?   People?    
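As a rough idea of what fuzzy deduping of CSV data can look like, here is a small Python sketch using only the standard library. It is not taken from any of the linked videos or repos. The sample data, the 0.85 threshold, and the helper names (normalize, similarity, dedupe) are all assumptions for illustration, and the pairwise comparison is only practical for small files.

```python
import csv
import io
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dedupe(rows, key, threshold=0.85):
    """Keep the first row of each group whose `key` values look alike.

    Every row is compared against the rows kept so far, so this is O(n^2).
    The threshold is an arbitrary starting point, not a recommended value.
    """
    kept = []
    for row in rows:
        if not any(similarity(row[key], k[key]) >= threshold for k in kept):
            kept.append(row)
    return kept


# Made-up sample data standing in for a real CSV file.
SAMPLE = """name,city
Jon Smith,New York
John Smith,New York
Jane Doe,Boston
jane  doe,Boston
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
unique = dedupe(rows, key="name")
print([r["name"] for r in unique])
```

On this sample, "John Smith" is treated as a duplicate of "Jon Smith" and the second Jane Doe row is dropped. For real people data you would probably compare more than one column and tune the threshold against known duplicates.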

I have also been looking for some good resources on NLP (Natural Language Processing).   There are some interesting text problems I am looking at.
