Spark Conference Updates 2017

ZONE: BIG DATA

Presentations and news from around the world for Spark 2.x, many coming from the recent Spark Summit talks.

There are a huge number of great presentations on what's changing in Apache Spark 2.x. Here is a chance to catch up on them in PowerPoint and video format.

Simplifying Big Data Applications With Apache Spark 2.0 (YouTube)

The Next AMPLab: RISELab (YouTube)

Spark Performance (YouTube)

Automatic Checkpointing in Spark (YouTube)

Effective Spark with Alluxio (In-Memory) (YouTube)

Dynamic On-the-Fly Modifications of Spark Applications (YouTube)

Making the Switch: Predictive Maintenance on Railway Switches (YouTube)

Data-Aware Spark (Video)

Vegas, the Missing MatPlotLib for Spark (Video)

A Deep Dive into the Catalyst Optimizer (Video)

Lambda Architecture with Spark in the IoT (Video)

OrderedRDD: A Distributed Time Series Analysis Framework for Spark (Video)

A Deep Dive into the Catalyst Optimizer: Hands-on Lab (Video)

SparkLint: a Tool for Monitoring, Identifying and Tuning Inefficient Spark Jobs Across Your Cluster (Video) (GitHub: Spark Metrics) (GitHub: SparkLint)

Performance Characterization of Apache Spark on Scale-up Servers (Video)

Origin-Destination Matrix Using Mobile Network Data with Spark (Video)

How We Built an Event-Time Merge of Two Kafka-Streams with Spark Streaming (Video)

Sparkling Water 2.0: The Next Generation of Machine Learning on Apache Spark (Video)

Boosting Spark Performance on Many-Core Machines (Video)

Spark and Object Stores —What You Need to Know (Video)

Better Together: Fast Data with Apache Spark and Apache Ignite (Video)

Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! (Video)

HBase Conference Slides

The recent HBaseCon has produced a number of excellent presentations that are available for viewing.  I have ranked some of the best and most useful for you to look at.

HBaseCon 2016 Slides

  1. Phoenix Use Case
  2. Apache HBase Just the Basics
  3. Solving Multitenancy and G1GC in Apache HBase
  4. Rolling out Apache HBase for Mobile Offerings at Visa 
  5. Apache Spark on Apache HBase Current and Future
  6. Breaking the Sound Barrier with Persistent Memory
  7. Apache HBase Accelerated In-Memory Flush and Compaction
  8. Improvements to Apache HBase and Its Applications in Alibaba Search
  9. OffHeaping the Apache HBase Read Path

Strata Hadoop World is Here!!!

Strata Hadoop World Training is today (Monday September 26, 2016) and regular tutorials and conferences start tomorrow.

If you can’t attend, many of the keynotes will be streaming live below:

http://strataconf.com/live

 

If you are coming, stop by and say Hi and also download the event app.

Conference Mobile App
Download the event app at http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/content/mobile-app

Upcoming Machine Learning Event

 

 


 

H2O.ai – The Open Tour – NYC – July 19/20, 2016

Two packed days of demos, keynotes and training classes on this very cool open source machine learning framework, with over 15 speakers. I have used it against HDP, HDP Spark 1.6 and a standalone Spark 1.6 cluster, and it performed very well. Download H2O for Hadoop or Sparkling Water for Spark here. The product includes an awesome UI / data scientist notebook for rapid development of models. I will be attending and will report on the interesting talks. H2O is a very interesting open source machine learning/deep learning framework and UI that works on top of Hadoop, Spark or standalone. One unique feature is its ability to generate a POJO from a model that can then be used in regular Java programs or in a Hive UDF.

For more information, see the presentation my friend, Dr. Fogelson, did in Princeton on using H2O for Predicting Repeat Shoppers.

Contact me for a 20% discount.


H2O supports all the machine learning algorithms you would expect, like GBM, decision trees, k-means, deep learning, Naïve Bayes and more. H2O is very mature and has been in production for years, and it is certified on the Hortonworks HDP platform.

This tutorial is pretty awesome as you can build a POJO and then use it as a Hive UDF.
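
As a rough illustration of the POJO idea (not the tutorial's code): once H2O exports a model as a POJO, any JVM code can score rows with it via the h2o-genmodel helper classes, and the same call can sit inside a Hive UDF's evaluate method. A minimal Scala sketch; the generated class name GbmModel and the feature columns are hypothetical, and h2o-genmodel must be on the classpath.

import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

// GbmModel is the hypothetical class that H2O generated for a trained GBM model.
val model = new EasyPredictModelWrapper(new GbmModel())

// Feature column names here are made up for illustration.
val row = new RowData()
row.put("total_spend", "123.45")
row.put("visits", "7")

// For a binomial model, the prediction exposes the label and class probabilities.
val prediction = model.predictBinomial(row)
println(s"label=${prediction.label}, p=${prediction.classProbabilities.mkString(",")}")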

Check out the project.  They also have awesome tutorials to get you started.

Take a look at this very cool Visual Introduction to Machine Learning.

MQTT, IOT, Scala/Java Tools, Reactive Programming, NIFI

MQTT

https://github.com/eBay/Spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala

https://github.com/richards-tech/RTMQTT

 

Parallel HTTP Client

http://www.parallec.io/

 

REST Client at Scale

http://www.parallec.io/

 

Testing

http://gatling.io/#/download

 

IoT NEST Protocol

https://github.com/openthread/openthread

 

MQTT for NIFI

https://github.com/richards-tech/RTNiFiStreamProcessors

https://www.baldengineer.com/mqtt-tutorial.html

https://github.com/gm-spacagna/lanzarote-awesomeness

 

Data Science Manifesto

http://www.datasciencemanifesto.org/

 

Spark Scala Tutorial

https://github.com/tspannhw/spark-scala-tutorial/tree/master/tutorial#code/src/main/scala/sparkworkshop/Intro1-script.scala

 

Set up structor for 6 nodes

https://github.com/hortonworks/structor

 

Spark with ORC

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_spark-guide/content/ch_orc-spark.html

https://github.com/DhruvKumar/spark-workshop

 

https://github.com/abajwa-hw/hdp-datascience-demo

 

Machine Learning, ODPi, Deduping with Scala, OCR

ODPi for Hadoop Standards: The ODPi and the ASF are working to consolidate Hadoop and all its versions. There are too many custom distributions with various versions of the 20 or so tools that make up Apache big data. Being able to move between HDP, CDH, IBM, Pivotal and MapR seamlessly would be awesome. For now, HDP, Pivotal and IBM are part of the ODPi.

Structured Data: Connecting a modern relational database and Hadoop is always an architectural challenge that requires decisions; EnterpriseDB (PostgreSQL) has an interesting article on that. It lets you read HDFS/Hive tables from EDB with SQL. (GitHub)

Semistructured Data: Using Apache NiFi with Tesseract for OCR: HP and Google have been fine-tuning Tesseract for a while to handle OCR. Using dataflow technology from the NSA (NiFi), you can automate OCR tasks on a Mac. Pretty cool. On my machine, I needed to install a few things first:

Tesseract-OCR FAQ

Searching Through PDFs with Tesseract with Apache SOLR

Atlas + Ranger for Tag-Based Policies in Hadoop: Using these new but polished Apache projects for managing everything around security policies in the Hadoop ecosystem. Add to that a cool example with Apache Solr.

Anyone who hasn't tried Pig yet might want to check out this cool tutorial: Using Pig for NY Exchange Data. Pig will work on Tez and Spark, so it's a tool data analysts should embrace.

It’s hard to think of Modern Big Data Applications without thinking of Scala.   A number of interesting resources have come out after Scala Days NYC.

Java 8 is still in the race for developing modern data applications, with a number of projects around Spring and Cloud Foundry, including Spring Cloud Stream, which lets you connect microservices with Kafka or RabbitMQ; you can run this on Apache YARN. Also see this article.

For those of you lucky enough to have a community account on Databricks Cloud, you can check out the new features of Spark 2.0 on that platform before release.

An interesting topic for me is fuzzy matching; I've seen a few interesting videos and GitHub repos on that:

Am I the only person trying to remove duplicates from data? CSV data? People?

I have also been looking for some good resources on NLP (Natural Language Processing). There are some interesting text problems I am looking at.

The Best New Apache Spark Presentations

SMACK

 

 

Related Githubs

https://github.com/socrata-platform/socrata-http

 

 

Upcoming Meetups

Deep dive Avro and Parquet – Read Avro/Write Parquet using Kafka and Spark

Tuesday, Apr 5, 2016, 7:00 PM

NJ Big Data and Hadoop Meetup
3525 Quakerbridge Rd #1400 IBIS Office Plaza, Suite 1400 Hamilton Township, NJ

49 Hadoop Attending

Agenda:
1) Avro and Parquet – when and why to use which format
2) Data modeling – Avro and Parquet schemas
3) Workshop – read Avro input from Kafka, transform data in Spark, write the data frame to Parquet, read back from Parquet (a minimal Parquet sketch follows this listing)
Speakers: Timothy Spann – Sr. Solutions Architect, airisDATA; Srinivas Daruna – Data Engineer, airisDATA

Check out this Meetup →
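
Not the workshop code, but a minimal sketch of the Parquet round-trip portion in a Spark 1.x spark-shell (the Kafka/Avro ingest is omitted; the path and column names are made up):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Pretend this DataFrame came out of the Kafka/Avro ingest step.
val events = Seq(("user1", 42L), ("user2", 7L)).toDF("user", "count")

// Write the data frame to Parquet, then read it back.
events.write.parquet("/tmp/events_parquet")
val restored = sqlContext.read.parquet("/tmp/events_parquet")
restored.show()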

 

Scala + Spark SQL Workshop

Thursday, Mar 10, 2016, 7:00 PM

NJ Big Data and Hadoop Meetup
3525 Quakerbridge Rd #1400 IBIS Office Plaza, Suite 1400 Hamilton Township, NJ

42 Hadoop Attending

Agenda:
1) Scala and Spark – why the functional paradigm? Functional programming fundamentals, a programming feature hands-on (e.g. functions, collections, pattern matching, implicits – speaker's choice), tied back to Spark
2) Spark SQL – data frames and data sets, logical and physical plans, hands-on workshop
Speakers: Rajiv Singla – Data Engineer, airisDATA; Kristin…

Check out this Meetup →

NJ Data Science – Apache Spark

Princeton, NJ
360 Data Scientists

Large Scale Data Analysis to Improve Business Profitability
• Framework – Apache Spark, Hadoop
• Machine learning – SparkML, H2O, R
• Graph Processing – GraphX, Titan, Neo4J…

Check out this Meetup Group →

 

Workshop – How to Build Recommendation Engine using Spark 1.6 and HDP

Thursday, Mar 17, 2016, 7:00 PM

Princeton University – Lewis Library Rm 122
Washington Road and Ivy Lane, Princeton, NJ 08544 Princeton, NJ

53 Data Scientists Attending

Agenda:
a) Hands-on – Build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames, and sqlContext, uses Spark SQL for working with DataFrames, and explores the graphical abilities of Zeppelin.
b) Follow along – Build a Recommendation Engine – This will show how to build a predictive analytics (MLlib) …
(A minimal MLlib ALS sketch follows this listing.)

Check out this Meetup →
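
For reference, the MLlib collaborative-filtering API this workshop is built around looks roughly like the sketch below (the ratings file path and its userId,productId,rating format are hypothetical):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical input file with "userId,productId,rating" lines.
val ratings = sc.textFile("/tmp/ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(",")
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// Train an ALS model (rank 10, 10 iterations, lambda 0.01) and recommend 5 products for user 1.
val model = ALS.train(ratings, 10, 10, 0.01)
model.recommendProducts(1, 5).foreach(println)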

 

 

 

 

NJ Data Science Meetup – How to Build Data Analytics Applications with Spark and Hortonworks

Workshop – How to build data Analytics app & Reco Engine using Spark + Horton

Thursday, Mar 17, 2016, 7:00 PM

Princeton University – Lewis Library Rm 122
Washington Road and Ivy Lane, Princeton, NJ 08544 Princeton, NJ

16 Data Scientists Attending

Agenda:
a) Hands-on – Build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames, and sqlContext, uses Spark SQL for working with DataFrames, and explores the graphical abilities of Zeppelin.
b) Follow along – Build a Recommendation Engine – This will show how to build a whole web app with predictive…

Check out this Meetup →

Scala Days 2016 NYC

The Scala Days 2016 schedule has been announced!

Highlights:

Beyond Shuffling: Scaling Apache Spark
by Holden Karau @holdenkarau

Scala: The Unpredicted Lingua Franca for Data Science
by Andy Petrella @noootsab and Dean Wampler @deanwampler

Build a Recommender System in Apache Spark and Integrate It Using Akka
by Willem Meints @willem_meints

Implementing Microservices with Scala and Akka
by Vaughn Vernon @VaughnVernon

Microservices based off Akka cluster at iHeartRadio
by Kailuo Wang @kailuowang

Building a High-Performance Database with Scala, Akka & Spark
by Evan Chan @evanfchan

Large scale graph analysis using Scala and Akka
by Ben Fonarov @chuwiey

Distributed Real-Time Stream Processing: Why and How
by Petr Zapletal @petr_zapletal

Deep Learning and NLP with Spark
by Andy Petrella @noootsab

 

Fans of Scala, Spark, big data, machine learning, real-time computing, stream processing, functional programming and reactive programming all have great talks to choose from. Tons of great speakers, including the developer of Spark Notebook, top people in Scala and a good representation from leading industry users.


Spark Summit East 2016 Schedule Released

Head on over to Spark Summit East 2016.   The schedule of talks is now up.

The highlights include a talk on Spark 2.0, giving us a peek into the future.

The developer tracks look pretty amazing to me:

There are also great enterprise track talks:

 

Scala Days New York 2016

Scala Days

May 9th-13th, 2016 New York

Scala Days will be in NYC from May 9th through May 11th, 2016. It will be conveniently located midtown at the AMA Executive Conference Center. If you are a Scala developer, a Spark developer, a Java developer looking to learn, or a big data enthusiast, this will be the place to be in May. More than a few presentations from the last few Scala Days have entered my rotation of must-reads.

I will be providing in-depth coverage of the event. This, along with Spark Summit, will be on my must-do list for 2016.

The two days immediately following the event will offer some awesome training opportunities for Scala, Spark, the SMACK stack, microservices, reactive programming and Akka. The price goes up January 20th and then again on March 16th, so put in those requests ASAP.

The program will be posted in February, but it’s guaranteed to be interesting with topics on Scala, Akka and Spark pretty much assured.   Typesafe is involved, so you know there will be good quality content.

Check out some videos from the 2015 Scala Days in San Francisco.

Bold Radius had an awesome talk on Akka at the 2015 event. Be on hand so you can ask the experts and speakers questions and get more in-depth knowledge on these advanced topics.

Looking at previous agendas, like the one in Amsterdam, will be a good preview for you. Topics like “Scala – The Real Spark of Data Science” give you an idea that this will be a very worthwhile conference.

I hope to see you there; I'll post some more details when they come in.

Check with your local meetup for a discount code.   If you are in New Jersey, see me at the NJ Data Science / Spark Meetup and I’ll get you a code.

If you are curious about Scala, check out Typesafe Activator and SBT to get up and running quickly with the full development toolkit.

 

 

 

Free Hadoop, Spark, Big Data Training

Free Hadoop Training List

http://bigdatauniversity.com/courses/introduction-to-solr/

Spark Fundamentals I

 

Spark Fundamentals II

Text Analytics Essentials

Hadoop Fundamentals I

Big Data Fundamentals

Hadoop Developer Day Event

Introduction to Pig

 

http://www.coreservlets.com/hadoop-tutorial/

Accessing Hadoop Data Using Hive

 

Using HBase for Real-time Access to your Big Data – Version 2

 

Introduction to Scala

https://www.udemy.com/hadoopstarterkit/?dtcode=VAN2bNB4geMV

https://www.udemy.com/big-data-basics-hadoop-mapreduce-hive-pig-spark/?dtcode=6oLh2qk4geNz

https://www.udemy.com/data-analytics-using-hadoop-in-2-hours/?dtcode=Wa8s2vV4geNW

https://www.udemy.com/data-analytics-using-hadoop-in-2-hours/learn/

https://www.udemy.com/hadoopstarterkit/learn

https://developer.yahoo.com/hadoop/

http://adooputorialraining180.teachable.com/courses/hadoop-free-course-training-tutorial

http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/

https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training

https://www.mapr.com/services/mapr-academy/hadoop-essentials

https://www.mapr.com/services/mapr-academy/Developing-Hadoop-Applications

https://www.mapr.com/training/hadoop-demand-training/dev-325

https://www.mapr.com/services/mapr-academy/developing-hbase-applications-basics-training-course-on-demand

https://www.mapr.com/services/mapr-academy/developing-hbase-applications-advanced-training-course-on-demand

https://www.udemy.com/hadoopstarterkit/learn/

https://www.mapr.com/services/mapr-academy/apache-spark-essentials

https://www.mapr.com/services/mapr-academy/build-monitor-apache-spark-applications

https://www.mapr.com/services/mapr-academy/apache-hive-essentials-training-course-on-demand

https://www.mapr.com/services/mapr-academy/apache-pig-essentials-training-course-on-demand

https://www.mapr.com/services/mapr-academy/apache-drill-training-course-on-demand

https://www.mapr.com/services/mapr-academy/apache-drill-architecture-training-course-on-demand

MapReduce and YARN

https://www.udemy.com/big-data-basics-hadoop-mapreduce-hive-pig-spark/

https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x

https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x

Pivotal HDB

http://academy.pivotal.io/course/141303

http://www.cloudera.com/content/www/en-us/resources/training/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop.html

 

Tools

Cloudera VM Download

http://www.cloudera.com/content/www/en-us/downloads/quickstart_vms/5-5.html

Cask VM Download

http://www.cloudera.com/content/www/en-us/downloads/cdap.html

 

 

Tools for Troubleshooting, Installation and Setup of Apache Spark Big Data Environments

Validate that you have connectivity and no firewall issues when you are starting. Conn Check is an awesome tool for that.

You may need to set up a number of servers at once; check out Sup.

First, get JDK 1.8. Apache Spark works best with Scala, Java and Python. Get the version of Scala you need: Scala 2.10 is the standard version and is used for the precompiled downloads. You can use Scala 2.11, but you will need to build the package yourself, which requires Apache Maven (a good idea to have anyway). Install Python 2.6 for PySpark. Also download SBT for Scala.

Once everything is installed, a very cool tool to work with Apache Spark is the new Apache Zeppelin. It is great for data exploration and data science experiments; give it a try.

An Example SBT for building a Spark Job

name := "Postgresql Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.postgresql" % "postgresql" % "9.4-1204-jdbc42"
libraryDependencies += "org.mongodb" % "mongo-java-driver" % "3.1.0"
libraryDependencies += "com.stratio.datasource" % "spark-mongodb_2.10" % "0.10.0"

An example of running a Spark Scala Job

sudo /deploy/spark-1.5.1-bin-hadoop2.6/bin/spark-submit --packages com.stratio:spark-mongodb-core:0.8.7  --master spark://10.13.196.41:7077 --class "PGApp" --driver-class-path /deploy/postgresql-9.4-1204.jdbc42.jar  target/scala-2.10/postgresql-project_2.10-1.0.jar  --driver-memory 1G

Items to add to your Spark toolbox:

Security
http://mig.mozilla.org/

Machine Learning
http://systemml.apache.org/

OCR
https://github.com/tesseract-ocr/tesseract

 

Apache Spark Recent Links

http://www.slideshare.net/search/slideshow?searchfrom=header&q=spark

Apache Blur

https://databricks.com/spark-training-resources

https://databricks-training.s3.amazonaws.com/index.html

http://www.infoworld.com/article/2854894/application-development/spark-and-storm-for-real-time-computation.html

http://www.infoq.com/news/2014/01/Spark-Storm-Real-Time-Analytics

http://www.infoq.com/presentations/apache-spark-big-data

Setting up a Standalone Apache Spark 1.5.1 Cluster on Ubuntu

Install All The Things

sudo apt-get install git -y
sudo apt-add-repository ppa:webupd8team/java -y
sudo apt-get update -y
sudo apt-get install oracle-java8-installer -y
sudo apt-get install oracle-java8-set-default 
sudo apt-get install maven gradle -y
sudo apt-get install sbt -y
sudo wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz
sudo tar -xvf spark*.tgz
sudo chmod 755 spark*
sudo apt-get update
sudo apt-get install -y openjdk-7-jdk
sudo apt-get install -y autoconf libtool
sudo apt-get -y install build-essential python-dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
CODENAME=$(lsb_release -cs)

# Add the repository
echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
 sudo tee /etc/apt/sources.list.d/mesosphere.list
sudo apt-get -y update
sudo apt-get -y install mesos

I also installed Apache Mesos for clustering, for a future upgrade from the Spark standalone cluster.

 

For the standalone Spark cluster, I used spark-1.5.1-bin-hadoop2.6.

conf/spark-env.sh
#!/usr/bin/env bash

export SPARK_LOCAL_IP=MYIP

To Start A Node

sbin/start-slave.sh spark://masterIP:7077

Links

Installing Other Tools and Servers on Ubuntu

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list
sudo apt-get update
sudo apt-get install -y mongodb-org
sudo apt-get install -y mongodb-org=3.0.4 mongodb-org-server=3.0.4 mongodb-org-shell=3.0.4 mongodb-org-mongos=3.0.4 mongodb-org-tools=3.0.4
sudo service mongod start
sudo tail -5000 /var/log/mongodb/mongod.log

https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-on-ubuntu-14-04

sudo apt-get update
sudo apt-get install postgresql postgresql-contrib

https://www.digitalocean.com/community/tutorials/how-to-install-and-use-redis

sudo apt-get install build-essential
sudo apt-get install tcl8.5
sudo wget http://download.redis.io/releases/redis-stable.tar.gz
sudo tar xzf redis-stable.tar.gz
cd redis-stable
make
make test
sudo make install
cd utils
sudo ./install_server.sh
sudo service redis_6379 start
redis-cli

http://blog.prabeeshk.com/blog/2014/10/31/install-apache-spark-on-ubuntu-14-dot-04/

http://www.scala-lang.org/download/2.11.7.html

sudo wget http://downloads.typesafe.com/scala/2.11.7/scala-2.11.7.deb
sudo dpkg -i scala-2.11.7.deb

http://www.scala-sbt.org/0.13/tutorial/Installing-sbt-on-Linux.html

echo "deb http://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-get update
sudo apt-get install sbt
sudo apt-get install unzip
curl -s get.gvmtool.net | bash
source "/root/.gvm/bin/gvm-init.sh"
gvm install gradle

 

Github Examples for SpringXD

Example Spring XD Scripts

  • stream create --name rq --definition "rabbit --outputType=text/plain | jdbc --columns='message,host' --initializeDatabase=true" --deploy

Links

Top 12 Tutorials and Workshops for Hadoop and related big data technologies

 

 

 

Spark Summit East 2015 NYC Report

Apache Spark Summit East 2015 opened really smoothly, with everything professionally run and well organized. The event kicked off Wednesday at 9am with keynotes from Databricks in the Grand Ballroom, and a number of great talks on Spark continued with everyone assembled in the one room.

 

Presentations Available

 Some interesting facts:

  • 2014 – 150 committers to Spark
  • 2015 – 500 committers to Spark
  • The code base doubled in that time, and Spark now has over 500 active production deployments.
  • Apache Spark is the most active big data project, more active than Storm or MapReduce
  • Apache Spark is the most active Apache project
  • On-disk sort record for sorting 100 TB (Daytona GraySort benchmark): Spark set the 2014 record using only 207 machines and 23 minutes on the public cloud, beating Yahoo's Hadoop record.

Added in 2014

  • Added Spark SQL
  • Java 8 syntax
  • Python streaming
  • GraphX
  • Random forest
  • Streaming MLlib

New direction for 2015

  • Data Science (high-level interfaces similar to single-machine tools)
  • like R and Python tools, but clustered
  • Platform Interfaces
  • plug in data sources and algorithms
  • ex: Cassandra, AWS, Hadoop
  • DataFrames, in Spark 1.3, optimized automatically via Spark SQL
  • API similar to data frames in R and Pandas
  • Compiled to Java bytecode (faster than the Scala API)
  • Machine Learning Pipelines (see the sketch after this list)
  • scikit-learn-like API
  • featurization, evaluation, parameter search
  • tokenizer
  • hashingTF
  • model evaluation
  • Spark interface in R
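
The pipeline notes above map onto the spark.ml API that arrived around Spark 1.3. A minimal sketch from a spark-shell; the toy text/label data and column names are made up:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical toy training data: free text plus a binary label.
val training = Seq((0L, "spark is fast", 1.0), (1L, "hadoop map reduce", 0.0))
  .toDF("id", "text", "label")

// Featurization (tokenizer + hashingTF) followed by a model, chained as one pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// One fit call trains the whole pipeline.
val model = pipeline.fit(training)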

 

MongoDB with Spark

 

External Data Sources

returns data frames usable in spark apps or SQL

pushes logic into sources

ex: cassandra, hbase, hive, parquet, postgresql, json, mysql, elasticsearch

SELECT * FROM mysql_users u JOIN hive_logs h WHERE u.lang = 'en'
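
As a rough illustration of the query-federation idea in these notes, Spark (1.4+ read API) can register a JDBC table next to a Hive table and join them in one query. A sketch only, assuming a Hive-enabled build; the connection settings, table names and join keys are hypothetical:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Hypothetical MySQL connection, exposed to SQL as a temp table.
val mysqlUsers = hiveContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://dbhost:3306/app",
  "dbtable" -> "users",
  "driver"  -> "com.mysql.jdbc.Driver"
)).load()
mysqlUsers.registerTempTable("mysql_users")

// hive_logs is assumed to already exist in the Hive metastore.
val joined = hiveContext.sql(
  "SELECT * FROM mysql_users u JOIN hive_logs h ON u.id = h.user_id WHERE u.lang = 'en'")
joined.show()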


 

Notes:

 

can join between data sources, query federation, minimize amount of work done to get data out of datasources in Spark 1.3

JDBC datasource

community 3rd party packages

 

bin/spark-shell --packages databricks/spark-csv:0.2

ex: spark-avro, spark-csv, sparkling-water, …

Spark Stack Diagram

Scala, Java, Python, R

DataFrames, ML Pipelines

Spark SQL, Spark Streaming, MLlib, GraphX

Spark Core

Data Sources

Goal: unified engine across data sources, workloads and environments

Harnessing the Power of Spark with Databricks Cloud

accelerating spark adoption

certifying applications and distributions

are free and successful

75 applications

(alpine, elasticsearch, typesafe,)

11 distributions

(Hortonworks, pivotal, ibm, oracle, mapr, sap, datastax, bluedata, stratio, transwarp, guavus)

online courses

Intro to big data with Spark

Scalable Machine Learning

46000 registered

july 2014 – databricks cloud → 3,500 registered users

nov 2014 – launched limited

100 companies using databricks cloud

big data projects are hard

setup and maintain cluster (6-9 months)

data preparation (ingestion, etl) (months)

productize (weeks to months)

exploration / kpi

reports / dashboards (weeks)

insights

statistics, machine learning, graphing, iterative

productionize those learnings

zero management, real-time, unified platform, accelerate from months to days

Open Platform

Hosted on AWS

Spark, Spark Cluster Manager

Workspace (notebooks, dashboards, jobs)

seconds to build/destroy/scale/share clusters

notebooks – interactive visualization

one api, one engine for all workloads (batch, streaming, interactive, ml, graph computations, real-time)

One set of tools

publish dashboard from notebook to production with 1 click

data sources (s3, kafka, kinesis, redshift, hdfs, cassandra, mysql)

external packages (jars, libraries)

download code and run in any spark distributions

ODBC driver to bi tools like tableau, qlik

Example

MyFitnessPal / UnderArmour

36 million recipes

14.5 billion logged foods

5 million food items

80 million registered users

using Spark and DataBricks for 1 year

Spark project for suggested serving sizes, search, food data cleaning

Will use Spark for Ad-targeting/recommendation systems, deep-dive into customer understanding, large-scale ETL

Automatic Lab

IOT

Flood buoys, red light cameras, car.

Rob Fergusson – connected car

wireless tether to phone from device. Drive smarter. Safer. Notifying. Car analysis.

Connect to other things.

Lot of data in the car

Organized app data (postgresql)

noisy time series readings from the car

some in amazon redshift

terabytes

Redshift, where good data goes to die

Developers didn't know the tools, were afraid of losing data, production only

unloaded into CSV to S3

pg_dump to S3

then pulled into spark

deduped data

they analyzed fuel efficiency

collaborative data democracy

engineering expenses are the most, not data storage

will eventually open to 3rd parties for spark databricks cloud

CLOSEST PARTNERS

analytics bi tool

ZoomData

BI tool integrated with Databricks Cloud

Spark – Pivot, Sort, Calculations, Filters, Joins

@zoomdata

zoomdata.cloud.databricks.com

showing 1 billion rows

almost instant performance for query changes with 1 billion rows

3 second attention span for people

scalable, secure, ldap, make available

zoomdata connectors (impala, solr, mongo, oracle, sqlserver, …)

real-time / live data

zoomdata.com/databricks

uncharted (oculus info inc)

PanTera Tool

built on spark/databricks cloud

plot millions/billions of records on maps

generating views on demand

web map zoom/pan to blocks

can see one side of the street

can include social database

All registered Spark Summit attendees will get access to Databricks Cloud next week!

 

palantir (ex-paypal guys)

searching for illegal trading

huge diverse data

trader oversight

improve the risk model

find outlier behavior

compare self vs cohort

have to determine clustering of people by database

generate alerts

improve interface to display them

data integration, analytics, decisions

Land Data

logs, jdbc, streams

get parquet, avro files into S3 or HDFS

data versioning

differentials / snapshot / append / update

Spark Transform, Spark SQL

Pandas Dataframe, Python Script

SchemaRDD

Goldman Sachs

Data Analytics Platform

Matt Glickman

embracing Spark

scalable data analytics platform for the enterprise

they saw it at Strata+Hadoop World NY, Oct 2014

intuitive bindings to Scala, Java, Python, R

relational, functional, iterative API into a lazy-evaluation data pipeline

storage agnostic

lambda closures

power of distributed computing

using scala

Elasticity in 3 dimensions

data storage

compute

users

power of spark is the api abstractions (RDD, dataframe)

Spark is becoming the lingua franca of big data analytics

contribute to open source

Step 1 is DataLake

How to consume and curate

Spark RDD DataFrames

DAG of all datasets

store curated data back to Data Lake

spin up cpu segregated clusters on demand

Embed spark driver in JVM applications like a scala library

use existing JVM IDE

sparkcontext

add jar

get classes from classpath

share spark clusters

Dynamically deploy code to cluster at run-time with lambda closures

enables real-time debugging of code on a distributed cluster

can add breakpoints in lambda spark

can see the data set

Provision machines to run spark

library synchronization

run on same cluster as HDFS?

Cloud data services vs internal

ETL is a big problem

Moving data is hard

scalable storage

data from external vendors

reconstructing vendor databases in data lakes

Cloud Data Exchange

ETL once loading of vendor data by vendors

scalable compute near database

vendors can provision their data

enterprises load their data securely

this would be key for managed cloud data service

run Spark and SQL on this database

use Spark Client API as the new JDBC/ODBC

use Spark APIs

risk of not contributing

accelerated move of enterprise data to the public cloud

Peter Wang, Continuum Analytics

Python and Spark

NumPy, SciPy

PySpark

Anaconda Cluster

Blaze

python > 15 year history in scientific computing

high performance for clustering and linear algebra

pandas

python for data analysis book

ipython

GPU, distributed linear algebra, plotting, deep learning

streamparse → Storm

DeliRoll Architecture – 1 billion rows a second with python beating redshift

AdRoll

./bin/pyspark

help(pyspark)

SQL, Streaming, MLLib

Difficulties

Package Management

outside normal java build toolchain

rapid iteration issues

devops vs data scientist

versioning, deployments

pip install, conda install

operations

Anaconda Cluster (free)

manages Python, Java, R, JS packages across the cluster

EC2, digital ocean, on prem

isolates, cross platforms

manage runtime state across the cluster

python science distribution plus clustering

2 commands to create cluster and submit a job

profile file

destroy cluster with 1 command

Conda is the package manager for Anaconda Python

(language agnostic: directories, interpreter, Spark, Node.js, Java)

sandboxed with containers or virtualization, single directory, linux, windows, mac; versioned

conda environment/sandbox

 

 

3rd platform

 

Spark SQL

HiveContext

DataFrames replace SchemaRDD

Built-in data sources: JSON, JDBC, Parquet, Hive, MySQL, HDFS, S3, PostgreSQL

External data sources: Avro, CSV, HBase

DataFrame API

 

Aggregation

Filter

 

Spark api expressive lambda

Rdd

 

Data Frames

 

Table

Groupby

 

Faster than rdd

 

sqlContext.load("a", "json")

UDF support
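
A minimal sketch of what those two fragments refer to in the Spark 1.3-era API (later replaced by sqlContext.read); the JSON file and its name field are hypothetical:

// Load a data source by path and format name.
val df = sqlContext.load("a.json", "json")
df.registerTempTable("people")

// Register a Scala function as a SQL UDF and use it in a query.
sqlContext.udf.register("strLen", (s: String) => s.length)
sqlContext.sql("SELECT strLen(name) FROM people").show()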

 

Faster

 

Partition parquet automatic

Working on data skipping via min and max summaries

 

Pushing predicates to jdbc

Happen late

Across function

 

 

Machine learning pipeline

 

Jdbc uses thrift hive metastores

 

Data store uses own for parquet

 

Cassandra spark

 

Spark kafka cassandra akka

 

Helena Edelson

helenaedelson on SlideShare

Fault tolerance

Batch and stream

 

Cassandra

Bigtable

Amazon Dynamo paper

 

On spark

Gossip consensus paxos

 

Cql

Datastax

C*

Spark Cassandra connector

Scala api and java

Time series

 

Data locality aware

saveToCassandra

Keyspace, table
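
A minimal sketch of saveToCassandra with the DataStax Spark Cassandra connector (the connector must be on the classpath and spark.cassandra.connection.host set; keyspace, table and columns here are hypothetical):

import com.datastax.spark.connector._

// Write an RDD of tuples into a Cassandra table.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.saveToCassandra("ks", "kv", SomeColumns("key", "value"))

// Read it back as CassandraRow objects; reads are data-locality aware.
val rows = sc.cassandraTable("ks", "kv")
rows.collect().foreach(println)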

 

TwitterUtils

StreamingContext

CassandraRow

Tuples

UDTs

Time series

Primary key has year month day hour

With clustering order by

Timeuuid

 

Data model like query

 

Akka supervisor strategy

Actors

Actor supervisor hierarchy

Reactive streams

Kafka stream saveToCassandra

 

KillrWeather on GitHub (Spark, Cassandra, Kafka)

Graphx facebook spark

PageRank algo

 

GitHub: databricks/reference-apps

 

Graphx

 

Ratings

Users

Products

 

Collaborative filtering

 

Bipartite graph

Mllib

How do you store graph

Complex pipeline

Community detection

Hyperlinks

Page rank

Tables and graphs

 

Graph processing in table-oriented Spark

GraphX API

 

Property graph

Vertex property

 

Create a graph in Scala

 

Vertices

Edges

RDDs to create a graph

Triplets: edges with vertex properties

sendMsg along edges

PageRank is built in
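
The GraphX notes above correspond roughly to this sketch (the vertex and edge data are made up):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Build a property graph from two RDDs: vertices and edges.
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(vertices, edges)

// Triplets are edges joined with their source and destination vertex properties.
graph.triplets.collect().foreach(println)

// PageRank is built in.
graph.pageRank(0.0001).vertices.collect().foreach(println)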

Apache Spark Quick Guide

Coding Examples in Java and Scala

Spark Streaming

Spark SQL

Spark Best Practices

https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details

Spark SQL DataFrames

Spark Packages

Spark Videos

Learning Spark / Tutorials / Workshops

Spark Summit Content

Coding Guides

Tachyon

Spark and NOSQL

File Types

Resources

Spring XD

http://docs.spring.io/spring-xd/docs/current-SNAPSHOT/reference/html/#spark-streaming

Related

Scala

Reactive

Microservices / 12 Factor Apps

Top Resources for Apache Spark in 2014

 

Running Apache Spark on YARN

Spark Logging
Spark Configuration File
Submitting applications to Spark
YARN needs the HADOOP_CONF_DIR environment variable set (see the exports below).
To preload the Spark runtime JAR, add it to HDFS:
hdfs dfs -copyFromLocal lib/spark-assembly-1.1.0-hadoop2.3.0.jar /yarn
hdfs dfs -ls /user/gpadmin
export YARN_CONF_DIR=/my dir
export SPARK_HOME=/usr/share/spark
export HADOOP_CONF_DIR=/etc/gphd/hadoop/conf
hdfs dfs -mkdir -p /user/spark/share/lib
hdfs dfs -put $SPARK_HOME/assembly/lib/spark-assembly_*.jar
/user/spark/share/lib/spark-assembly.jar
export SPARK_JAR=hdfs://pivhdsne.localdomain:8020/user/spark/share/lib/spark-assembly.jar
/usr/share/spark/bin/spark-submit  --num-executors 10  --master yarn-cluster
  --class org.apache.spark.examples.SparkPi
  /usr/share/spark/jars/spark-examples-1.1.0-hadoop2.2.0-gphd-3.0.1.0.jar 10
/usr/share/spark/bin/spark-shell --master yarn-client 
--spark.yarn.jar hdfs://pivhdsne.localdomain:8020/ --verbose
Useful Links
Checking Logs for a YARN App (such as a SPARK job)
yarn logs -applicationId application_1418749874519_0001

 

Spring XD Big Update

https://spring.io/blog/2014/11/19/spring-xd-1-1-m1-and-1-0-2-released

Spring XD 1.1.M1 is a game changer.  Once this is final, it will be the ultimate big data ingestion, export, real-time analytics, batch workflow orchestration and streaming tool.

Loading Tilde Delimited Files into HAWQ Tables with Mixed Columns

XD Job and Stream with SQL

Caveat: The complete field lists are abbreviated for the sake of space; you have to list all the fields you are working with.

First we create a simple filejdbc Spring XD job that loads the raw tilde-delimited file into HAWQ. These fields all come in as TEXT fields, which could be okay for some purposes, but not for ours. We also create an XD stream with a custom sink (see the XML below, no coding) that runs a SQL command to insert from this table, converting into other HAWQ types (like numbers and timestamps). We trigger the secondary stream via a command-line REST POST, but we could have used a timed trigger or many other ways (automated, scripted or manual) to kick it off. You could also create a custom XD job that casts your types and does some manipulation, or do it with a Groovy script transform. There are many options in XD.

Update: Spring XD has added a JDBC source, so you can avoid this job-plus-stream step. I will add a new blog entry when that version of Spring XD is GA. Spring XD is constantly evolving.

jobload.xd

job create loadjob --definition "filejdbc --resources=file:/tmp/xd/input/files/*.* --names=time,userid,dataname,dataname2,
dateTimeField, lastName, firstName, city, state, address1, address2 --tableName=raw_data_tbl --initializeDatabase=true 
--driverClassName=org.postgresql.Driver --delimiter=~ --dateFormat=yyyy-MM-dd-hh.mm.ss --numberFormat=%d 
--username=gpadmin --url=jdbc:postgresql:gpadmin" --deploy
stream create --name streamload --definition "http | hawq-store" --deploy
job launch jobload
clear
job list
stream list

1) The job loads the file into a raw HAWQ table with all text columns.
2) The stream is triggered by a web page hit or a command-line call (it needs hawq-store). It inserts into the real table and truncates the temp one.

triggerrun.sh (BASH shell script for testing)

curl -s -H "Content-Type: application/json" -X POST -d "{id:5}" http://localhost:9000

I added the spring-integration-jdbc jar to /opt/pivotal/spring-xd/xd/lib.

hawq-store.xml (Spring Integration / XD Configuration)

/opt/pivotal/spring-xd/xd/modules/sink/hawq-store.xml

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:int="http://www.springframework.org/schema/integration"
 xmlns:int-jdbc="http://www.springframework.org/schema/integration/jdbc"
 xmlns:jdbc="http://www.springframework.org/schema/jdbc"
 xsi:schemaLocation="http://www.springframework.org/schema/beans
 http://www.springframework.org/schema/beans/spring-beans.xsd
 http://www.springframework.org/schema/integration
 http://www.springframework.org/schema/integration/spring-integration.xsd
 http://www.springframework.org/schema/integration/jdbc
 http://www.springframework.org/schema/integration/jdbc/spring-integration-jdbc.xsd">
<int:channel id="input" />
<int-jdbc:store-outbound-channel-adapter
 channel="input" query="insert into real_data_tbl(time, userid, firstname, ...) select cast(time as datetime), 
cast(userid as numeric), firstname, ... from dfpp_networkfillclicks" data-source="dataSource" />
<bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
 <property name="driverClassName" value="org.postgresql.Driver"/>
 <property name="url" value="jdbc:postgresql:gpadmin"/>
 <property name="username" value="gpadmin"/>
 <property name="password" value=""/>
</bean>
</beans>

createtable.sql  (HAWQ Table)

CREATE TABLE raw_data_tbl
 (
 time text,
 userid text ,
...
  somefield text
 )
 WITH (APPENDONLY=true)
 DISTRIBUTED BY (time);


Spark on Tachyon on Pivotal HD 2.0 (Hadoop 2.2)

The Future Architecture of a Data Lake in Memory Data Exchange Platform using Tachyon and Apache Spark

Tachyon Resources
Big Data Mini-Course Tachyon
Tachyon on Redhat

Spark Resources
Data Exploration with Spark


Run

 scala> var file = sc.textFile("tachyon://localhost:19998/xd/load/test.json")
14/10/15 21:11:23 INFO MemoryStore: ensureFreeSpace(69856) called with curMem=208659, maxMem=308713881
14/10/15 21:11:23 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 68.2 KB, free 294.1 MB)
file: org.apache.spark.rdd.RDD[String] = MappedRDD[9] at textFile at <console>:12

scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
14/10/15 21:11:26 INFO : getFileStatus(/xd/load/test.json): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/xd/load/test.json TPath: tachyon://localhost:19998/xd/load/test.json
14/10/15 21:11:26 INFO FileInputFormat: Total input paths to process : 1
counts: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[14] at reduceByKey at <console>:14

scala> counts.saveAsTextFile("tachyon://localhost:19998/result")
14/10/15 21:12:26 INFO : getWorkingDirectory: /
14/10/15 21:12:26 INFO : getWorkingDirectory: /
14/10/15 21:12:26 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:26 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result does not exist)/result
14/10/15 21:12:26 INFO : File does not exist: tachyon://localhost:19998/result
14/10/15 21:12:26 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/10/15 21:12:26 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/10/15 21:12:26 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/10/15 21:12:26 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/10/15 21:12:26 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/10/15 21:12:26 INFO : getWorkingDirectory: /
14/10/15 21:12:26 INFO : mkdirs(tachyon://localhost:19998/result/_temporary/0, rwxrwxrwx)
14/10/15 21:12:26 INFO SparkContext: Starting job: saveAsTextFile at <console>:17
14/10/15 21:12:26 INFO DAGScheduler: Registering RDD 12 (reduceByKey at <console>:14)
14/10/15 21:12:26 INFO DAGScheduler: Got job 0 (saveAsTextFile at <console>:17) with 2 output partitions (allowLocal=false)
14/10/15 21:12:26 INFO DAGScheduler: Final stage: Stage 0(saveAsTextFile at <console>:17)
14/10/15 21:12:26 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/10/15 21:12:26 INFO DAGScheduler: Missing parents: List(Stage 1)
14/10/15 21:12:26 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[12] at reduceByKey at <console>:14), which has no missing parents
14/10/15 21:12:26 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[12] at reduceByKey at <console>:14)
14/10/15 21:12:26 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/10/15 21:12:26 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:26 INFO TaskSetManager: Serialized task 1.0:0 as 2090 bytes in 2 ms
14/10/15 21:12:26 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:26 INFO TaskSetManager: Serialized task 1.0:1 as 2090 bytes in 0 ms
14/10/15 21:12:26 INFO Executor: Running task ID 1
14/10/15 21:12:26 INFO Executor: Running task ID 0
14/10/15 21:12:26 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:26 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:26 INFO HadoopRDD: Input split: tachyon://localhost:19998/xd/load/test.json:230135+230136
14/10/15 21:12:26 INFO HadoopRDD: Input split: tachyon://localhost:19998/xd/load/test.json:0+230135
14/10/15 21:12:26 INFO : open(tachyon://localhost:19998/xd/load/test.json, 65536)
14/10/15 21:12:26 INFO : open(tachyon://localhost:19998/xd/load/test.json, 65536)
14/10/15 21:12:26 INFO : Folder /mnt/ramdisk/tachyonworker/users/1 was created!
14/10/15 21:12:26 INFO : /mnt/ramdisk/tachyonworker/users/1/48318382080 was created!
14/10/15 21:12:26 INFO : Try to find remote worker and read block 48318382080 from 0, with len 460271
14/10/15 21:12:26 INFO : /mnt/ramdisk/tachyonworker/users/1/48318382080 was created!
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : Try to find remote worker and read block 48318382080 from 0, with len 460271
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : May stream from underlayer fs: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/data/45
14/10/15 21:12:26 INFO : May stream from underlayer fs: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/data/45
14/10/15 21:12:26 INFO : May stream from underlayer fs: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/data/45
14/10/15 21:12:27 INFO : Canceled output of block 48318382080, deleted local file /mnt/ramdisk/tachyonworker/users/1/48318382080
14/10/15 21:12:27 INFO Executor: Serialized size of result for 0 is 786
14/10/15 21:12:27 INFO Executor: Serialized size of result for 1 is 786
14/10/15 21:12:27 INFO Executor: Sending result for 0 directly to driver
14/10/15 21:12:27 INFO Executor: Sending result for 1 directly to driver
14/10/15 21:12:27 INFO Executor: Finished task ID 0
14/10/15 21:12:27 INFO Executor: Finished task ID 1
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 0 in 413 ms on localhost (progress: 1/2)
14/10/15 21:12:27 INFO DAGScheduler: Completed ShuffleMapTask(1, 0)
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 1 in 411 ms on localhost (progress: 2/2)
14/10/15 21:12:27 INFO DAGScheduler: Completed ShuffleMapTask(1, 1)
14/10/15 21:12:27 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/10/15 21:12:27 INFO DAGScheduler: Stage 1 (reduceByKey at <console>:14) finished in 0.419 s
14/10/15 21:12:27 INFO DAGScheduler: looking for newly runnable stages
14/10/15 21:12:27 INFO DAGScheduler: running: Set()
14/10/15 21:12:27 INFO DAGScheduler: waiting: Set(Stage 0)
14/10/15 21:12:27 INFO DAGScheduler: failed: Set()
14/10/15 21:12:27 INFO DAGScheduler: Missing parents for Stage 0: List()
14/10/15 21:12:27 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[15] at saveAsTextFile at <console>:17), which is now runnable
14/10/15 21:12:27 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[15] at saveAsTextFile at <console>:17)
14/10/15 21:12:27 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/10/15 21:12:27 INFO TaskSetManager: Starting task 0.0:0 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:27 INFO TaskSetManager: Serialized task 0.0:0 as 11437 bytes in 0 ms
14/10/15 21:12:27 INFO TaskSetManager: Starting task 0.0:1 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:27 INFO TaskSetManager: Serialized task 0.0:1 as 11437 bytes in 0 ms
14/10/15 21:12:27 INFO Executor: Running task ID 2
14/10/15 21:12:27 INFO Executor: Running task ID 3
14/10/15 21:12:27 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:27 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:27 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/10/15 21:12:27 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/10/15 21:12:27 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/10/15 21:12:27 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 8 ms
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 8 ms
14/10/15 21:12:27 INFO : getWorkingDirectory: /
14/10/15 21:12:27 INFO : getWorkingDirectory: /
14/10/15 21:12:27 INFO : create(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3/part-00001, rw-r--r--, true, 65536, 1, 33554432, org.apache.hadoop.mapred.Reporter$1@f06b03a)
14/10/15 21:12:27 WARN : tachyon.home is not set. Using /mnt/tachyon_default_home as the default value.
14/10/15 21:12:27 INFO : create(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2/part-00000, rw-r--r--, true, 65536, 1, 33554432, org.apache.hadoop.mapred.Reporter$1@f06b03a)
14/10/15 21:12:27 INFO : /mnt/ramdisk/tachyonworker/users/1/56908316672 was created!
14/10/15 21:12:27 INFO : /mnt/ramdisk/tachyonworker/users/1/54760833024 was created!
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000000 TPath: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000001 TPath: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/_temporary/0/task_201410152112_0000_m_000001 does not exist)/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3, tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001)
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/_temporary/0/task_201410152112_0000_m_000000 does not exist)/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2, tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000)
14/10/15 21:12:27 INFO FileOutputCommitter: Saved output of task 'attempt_201410152112_0000_m_000001_3' to tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO SparkHadoopWriter: attempt_201410152112_0000_m_000001_3: Committed
14/10/15 21:12:27 INFO FileOutputCommitter: Saved output of task 'attempt_201410152112_0000_m_000000_2' to tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO SparkHadoopWriter: attempt_201410152112_0000_m_000000_2: Committed
14/10/15 21:12:27 INFO Executor: Serialized size of result for 3 is 825
14/10/15 21:12:27 INFO Executor: Serialized size of result for 2 is 825
14/10/15 21:12:27 INFO Executor: Sending result for 3 directly to driver
14/10/15 21:12:27 INFO Executor: Sending result for 2 directly to driver
14/10/15 21:12:27 INFO Executor: Finished task ID 2
14/10/15 21:12:27 INFO Executor: Finished task ID 3
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 3 in 413 ms on localhost (progress: 1/2)
14/10/15 21:12:27 INFO DAGScheduler: Completed ResultTask(0, 1)
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 2 in 415 ms on localhost (progress: 2/2)
14/10/15 21:12:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/10/15 21:12:27 INFO DAGScheduler: Completed ResultTask(0, 0)
14/10/15 21:12:27 INFO DAGScheduler: Stage 0 (saveAsTextFile at <console>:17) finished in 0.415 s
14/10/15 21:12:27 INFO SparkContext: Job finished: saveAsTextFile at <console>:17, took 0.952281177 s
14/10/15 21:12:27 INFO : listStatus(tachyon://localhost:19998/result/_temporary/0): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : listStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/part-00001): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/part-00001 TPath: tachyon://localhost:19998/result/part-00001
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/part-00001 does not exist)/result/part-00001
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/part-00001
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001/part-00001, tachyon://localhost:19998/result/part-00001)
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : listStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/part-00000): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/part-00000 TPath: tachyon://localhost:19998/result/part-00000
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/part-00000 does not exist)/result/part-00000
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/part-00000
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000/part-00000, tachyon://localhost:19998/result/part-00000)
14/10/15 21:12:27 INFO : delete(tachyon://localhost:19998/result/_temporary, true)
14/10/15 21:12:27 INFO : create(tachyon://localhost:19998/result/_SUCCESS, rw-r--r--, true, 65536, 1, 33554432, null)

[pivhdsne:tachyon-0.5.0]$ hadoop fs -ls /xd/
Found 8 items
-rwxrwxrwx 3 gpadmin hadoop 460179 2014-10-15 19:54 /xd/bigfile.txt
drwxr-xr-x - root hadoop 0 2014-09-24 16:02 /xd/demorabbittapG
drwxrwxrwx - root hadoop 0 2014-10-14 18:33 /xd/w1
drwxrwxrwx - root hadoop 0 2014-10-14 15:34 /xd/w2
drwxrwxrwx - root hadoop 0 2014-10-14 18:33 /xd/w3
drwxrwxrwx - root hadoop 0 2014-10-14 15:33 /xd/w4
drwxrwxrwx - root hadoop 0 2014-10-14 18:33 /xd/w5
drwxrwxrwx - root hadoop 0 2014-10-14 16:57 /xd/w6
[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon tfs ls /xd/load
449.48 KB 10-15-2014 17:17:03:489 Not In Memory /xd/load/test.json
[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon tfs ls /result
244.98 KB 10-15-2014 21:12:27:354 In Memory /result/part-00001
243.57 KB 10-15-2014 21:12:27:356 In Memory /result/part-00000
0.00 B 10-15-2014 21:12:27:625 In Memory /result/_SUCCESS
[pivhdsne:tachyon-0.5.0]$ ls -lt /mnt/ramdisk/tachyonworker/users/1
total 0