Tools for Troubleshooting, Installation and Setup of Apache Spark Big Data Environments

Validate that you have connectivity and no firewall issues before you start.  Conn Check is an awesome tool for that.
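
If you want to script a quick check yourself, here is a minimal Scala sketch that simply tries to open a TCP connection to a host and port (the host and port below are placeholders; swap in your own Spark master address):

import java.net.{InetSocketAddress, Socket}

// Minimal TCP probe: succeeds only if the host is reachable and the
// port is not blocked by a firewall. Host and port are placeholders.
object ConnProbe {
  def canConnect(host: String, port: Int, timeoutMs: Int = 3000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: java.io.IOException => false
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // e.g. the Spark master RPC port
    println("spark master reachable: " + canConnect("10.13.196.41", 7077))
  }
}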

You may need to set up a number of servers at once; check out Sup.

First, get version 1.8 of the JDK.  Apache Spark works best with Scala, Java, and Python.  Get the version of Scala you need: Scala 2.10 is the standard version and is used for the precompiled downloads.  You can use Scala 2.11, but you will need to build the package yourself, and for that you will need Apache Maven (a good idea to have anyway).  Install Python 2.6 or later for PySpark.  Also download SBT for building Scala projects.
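
A quick way to confirm the toolchain matches those versions is a throwaway snippet in the Scala REPL:

// Paste into the Scala REPL (or run with scala) to check versions.
// Expect java.version to start with "1.8" and Scala with "2.10".
println("Java:  " + System.getProperty("java.version"))
println("Scala: " + scala.util.Properties.versionString)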

Once everything is installed, a very cool tool to work with Apache Spark is the new Apache Zeppelin.  It is great for data exploration and data science experiments; give it a try.
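
To give you a taste, here is a minimal sketch of a Zeppelin paragraph (it assumes the %spark interpreter, where sc and sqlContext come predefined; the dataset is made up):

%spark
// sc and sqlContext are provided by Zeppelin's Spark interpreter
import sqlContext.implicits._

// Tiny made-up dataset for exploration
case class Event(user: String, hits: Int)
val df = sc.parallelize(Seq(Event("alice", 3), Event("bob", 7))).toDF()

// Register it so a following %sql paragraph can query and chart it
df.registerTempTable("events")
df.show()

A follow-up paragraph starting with %sql select * from events gets Zeppelin's built-in charting for free.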

An example SBT build file (build.sbt) for building a Spark job:

// build.sbt: %% appends the Scala suffix, so spark-core resolves to spark-core_2.10
name := "Postgresql Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
// JDBC driver plus the MongoDB driver and connector
libraryDependencies += "org.postgresql" % "postgresql" % "9.4-1204-jdbc42"
libraryDependencies += "org.mongodb" % "mongo-java-driver" % "3.1.0"
libraryDependencies += "com.stratio.datasource" % "spark-mongodb_2.10" % "0.10.0"

An example of running a Spark Scala job (note that spark-submit options like --driver-memory must come before the application JAR; anything after the JAR is passed as arguments to the application):

sudo /deploy/spark-1.5.1-bin-hadoop2.6/bin/spark-submit \
  --packages com.stratio:spark-mongodb-core:0.8.7 \
  --master spark://10.13.196.41:7077 \
  --class "PGApp" \
  --driver-class-path /deploy/postgresql-9.4-1204.jdbc42.jar \
  --driver-memory 1G \
  target/scala-2.10/postgresql-project_2.10-1.0.jar

Items to add to your Spark toolbox:

Security: Mozilla InvestiGator (MIG), for investigating remote endpoints at scale
http://mig.mozilla.org/

Machine Learning: Apache SystemML, declarative large-scale machine learning that runs on Spark
http://systemml.apache.org/

OCR: Tesseract, the open source OCR engine
https://github.com/tesseract-ocr/tesseract
