Spark on Tachyon on Pivotal HD 2.0 (Hadoop 2.2)

The Future Architecture of a Data Lake: In-Memory Data Exchange Platform Using Tachyon and Apache Spark

Tachyon Resources
Big Data Mini-Course Tachyon
Tachyon on Redhat

Spark Resources
Data Exploration with Spark


Run

scala> var file = sc.textFile("tachyon://localhost:19998/xd/load/test.json")
14/10/15 21:11:23 INFO MemoryStore: ensureFreeSpace(69856) called with curMem=208659, maxMem=308713881
14/10/15 21:11:23 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 68.2 KB, free 294.1 MB)
file: org.apache.spark.rdd.RDD[String] = MappedRDD[9] at textFile at <console>:12

scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
14/10/15 21:11:26 INFO : getFileStatus(/xd/load/test.json): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/xd/load/test.json TPath: tachyon://localhost:19998/xd/load/test.json
14/10/15 21:11:26 INFO FileInputFormat: Total input paths to process : 1
counts: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[14] at reduceByKey at <console>:14

scala> counts.saveAsTextFile("tachyon://localhost:19998/result")
14/10/15 21:12:26 INFO : getWorkingDirectory: /
14/10/15 21:12:26 INFO : getWorkingDirectory: /
14/10/15 21:12:26 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:26 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result does not exist)/result
14/10/15 21:12:26 INFO : File does not exist: tachyon://localhost:19998/result
14/10/15 21:12:26 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/10/15 21:12:26 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/10/15 21:12:26 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/10/15 21:12:26 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/10/15 21:12:26 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/10/15 21:12:26 INFO : getWorkingDirectory: /
14/10/15 21:12:26 INFO : mkdirs(tachyon://localhost:19998/result/_temporary/0, rwxrwxrwx)
14/10/15 21:12:26 INFO SparkContext: Starting job: saveAsTextFile at <console>:17
14/10/15 21:12:26 INFO DAGScheduler: Registering RDD 12 (reduceByKey at <console>:14)
14/10/15 21:12:26 INFO DAGScheduler: Got job 0 (saveAsTextFile at <console>:17) with 2 output partitions (allowLocal=false)
14/10/15 21:12:26 INFO DAGScheduler: Final stage: Stage 0(saveAsTextFile at <console>:17)
14/10/15 21:12:26 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/10/15 21:12:26 INFO DAGScheduler: Missing parents: List(Stage 1)
14/10/15 21:12:26 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[12] at reduceByKey at <console>:14), which has no missing parents
14/10/15 21:12:26 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[12] at reduceByKey at <console>:14)
14/10/15 21:12:26 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/10/15 21:12:26 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:26 INFO TaskSetManager: Serialized task 1.0:0 as 2090 bytes in 2 ms
14/10/15 21:12:26 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:26 INFO TaskSetManager: Serialized task 1.0:1 as 2090 bytes in 0 ms
14/10/15 21:12:26 INFO Executor: Running task ID 1
14/10/15 21:12:26 INFO Executor: Running task ID 0
14/10/15 21:12:26 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:26 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:26 INFO HadoopRDD: Input split: tachyon://localhost:19998/xd/load/test.json:230135+230136
14/10/15 21:12:26 INFO HadoopRDD: Input split: tachyon://localhost:19998/xd/load/test.json:0+230135
14/10/15 21:12:26 INFO : open(tachyon://localhost:19998/xd/load/test.json, 65536)
14/10/15 21:12:26 INFO : open(tachyon://localhost:19998/xd/load/test.json, 65536)
14/10/15 21:12:26 INFO : Folder /mnt/ramdisk/tachyonworker/users/1 was created!
14/10/15 21:12:26 INFO : /mnt/ramdisk/tachyonworker/users/1/48318382080 was created!
14/10/15 21:12:26 INFO : Try to find remote worker and read block 48318382080 from 0, with len 460271
14/10/15 21:12:26 INFO : /mnt/ramdisk/tachyonworker/users/1/48318382080 was created!
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : Try to find remote worker and read block 48318382080 from 0, with len 460271
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : May stream from underlayer fs: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/data/45
14/10/15 21:12:26 INFO : May stream from underlayer fs: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/data/45
14/10/15 21:12:26 INFO : May stream from underlayer fs: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/data/45
14/10/15 21:12:27 INFO : Canceled output of block 48318382080, deleted local file /mnt/ramdisk/tachyonworker/users/1/48318382080
14/10/15 21:12:27 INFO Executor: Serialized size of result for 0 is 786
14/10/15 21:12:27 INFO Executor: Serialized size of result for 1 is 786
14/10/15 21:12:27 INFO Executor: Sending result for 0 directly to driver
14/10/15 21:12:27 INFO Executor: Sending result for 1 directly to driver
14/10/15 21:12:27 INFO Executor: Finished task ID 0
14/10/15 21:12:27 INFO Executor: Finished task ID 1
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 0 in 413 ms on localhost (progress: 1/2)
14/10/15 21:12:27 INFO DAGScheduler: Completed ShuffleMapTask(1, 0)
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 1 in 411 ms on localhost (progress: 2/2)
14/10/15 21:12:27 INFO DAGScheduler: Completed ShuffleMapTask(1, 1)
14/10/15 21:12:27 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/10/15 21:12:27 INFO DAGScheduler: Stage 1 (reduceByKey at <console>:14) finished in 0.419 s
14/10/15 21:12:27 INFO DAGScheduler: looking for newly runnable stages
14/10/15 21:12:27 INFO DAGScheduler: running: Set()
14/10/15 21:12:27 INFO DAGScheduler: waiting: Set(Stage 0)
14/10/15 21:12:27 INFO DAGScheduler: failed: Set()
14/10/15 21:12:27 INFO DAGScheduler: Missing parents for Stage 0: List()
14/10/15 21:12:27 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[15] at saveAsTextFile at <console>:17), which is now runnable
14/10/15 21:12:27 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[15] at saveAsTextFile at <console>:17)
14/10/15 21:12:27 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/10/15 21:12:27 INFO TaskSetManager: Starting task 0.0:0 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:27 INFO TaskSetManager: Serialized task 0.0:0 as 11437 bytes in 0 ms
14/10/15 21:12:27 INFO TaskSetManager: Starting task 0.0:1 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:27 INFO TaskSetManager: Serialized task 0.0:1 as 11437 bytes in 0 ms
14/10/15 21:12:27 INFO Executor: Running task ID 2
14/10/15 21:12:27 INFO Executor: Running task ID 3
14/10/15 21:12:27 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:27 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:27 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/10/15 21:12:27 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/10/15 21:12:27 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/10/15 21:12:27 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 8 ms
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 8 ms
14/10/15 21:12:27 INFO : getWorkingDirectory: /
14/10/15 21:12:27 INFO : getWorkingDirectory: /
14/10/15 21:12:27 INFO : create(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3/part-00001, rw-r--r--, true, 65536, 1, 33554432, org.apache.hadoop.mapred.Reporter$1@f06b03a)
14/10/15 21:12:27 WARN : tachyon.home is not set. Using /mnt/tachyon_default_home as the default value.
14/10/15 21:12:27 INFO : create(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2/part-00000, rw-r--r--, true, 65536, 1, 33554432, org.apache.hadoop.mapred.Reporter$1@f06b03a)
14/10/15 21:12:27 INFO : /mnt/ramdisk/tachyonworker/users/1/56908316672 was created!
14/10/15 21:12:27 INFO : /mnt/ramdisk/tachyonworker/users/1/54760833024 was created!
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000000 TPath: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000001 TPath: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/_temporary/0/task_201410152112_0000_m_000001 does not exist)/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3, tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001)
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/_temporary/0/task_201410152112_0000_m_000000 does not exist)/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2, tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000)
14/10/15 21:12:27 INFO FileOutputCommitter: Saved output of task 'attempt_201410152112_0000_m_000001_3' to tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO SparkHadoopWriter: attempt_201410152112_0000_m_000001_3: Committed
14/10/15 21:12:27 INFO FileOutputCommitter: Saved output of task 'attempt_201410152112_0000_m_000000_2' to tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO SparkHadoopWriter: attempt_201410152112_0000_m_000000_2: Committed
14/10/15 21:12:27 INFO Executor: Serialized size of result for 3 is 825
14/10/15 21:12:27 INFO Executor: Serialized size of result for 2 is 825
14/10/15 21:12:27 INFO Executor: Sending result for 3 directly to driver
14/10/15 21:12:27 INFO Executor: Sending result for 2 directly to driver
14/10/15 21:12:27 INFO Executor: Finished task ID 2
14/10/15 21:12:27 INFO Executor: Finished task ID 3
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 3 in 413 ms on localhost (progress: 1/2)
14/10/15 21:12:27 INFO DAGScheduler: Completed ResultTask(0, 1)
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 2 in 415 ms on localhost (progress: 2/2)
14/10/15 21:12:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/10/15 21:12:27 INFO DAGScheduler: Completed ResultTask(0, 0)
14/10/15 21:12:27 INFO DAGScheduler: Stage 0 (saveAsTextFile at <console>:17) finished in 0.415 s
14/10/15 21:12:27 INFO SparkContext: Job finished: saveAsTextFile at <console>:17, took 0.952281177 s
14/10/15 21:12:27 INFO : listStatus(tachyon://localhost:19998/result/_temporary/0): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : listStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/part-00001): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/part-00001 TPath: tachyon://localhost:19998/result/part-00001
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/part-00001 does not exist)/result/part-00001
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/part-00001
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001/part-00001, tachyon://localhost:19998/result/part-00001)
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : listStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/part-00000): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/part-00000 TPath: tachyon://localhost:19998/result/part-00000
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/part-00000 does not exist)/result/part-00000
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/part-00000
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000/part-00000, tachyon://localhost:19998/result/part-00000)
14/10/15 21:12:27 INFO : delete(tachyon://localhost:19998/result/_temporary, true)
14/10/15 21:12:27 INFO : create(tachyon://localhost:19998/result/_SUCCESS, rw-r--r--, true, 65536, 1, 33554432, null)
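
The tail of the log above shows a rename-based commit: each task writes into an attempt directory under _temporary, that directory is renamed to a task directory, the part files are renamed into the output directory, _temporary is deleted, and an empty _SUCCESS file is created. The sketch below imitates that sequence on the local filesystem only, to make the order of operations easier to follow. The function name commit_demo and the shortened directory names (attempt_0, task_0) are made up for illustration; this is not Hadoop's or Tachyon's code.

```shell
#!/bin/sh
# Imitate, on the local filesystem, the rename sequence visible in the log.
# commit_demo and its directory names are illustrative assumptions.
commit_demo() {
  out="$1"                                   # final output directory
  attempt="$out/_temporary/0/_temporary/attempt_0"
  task="$out/_temporary/0/task_0"

  mkdir -p "$attempt"                        # like mkdirs(.../_temporary/0), plus the attempt dir
  echo "(word,1)" > "$attempt/part-00000"    # the task writes into its attempt directory

  mv "$attempt" "$task"                      # task commit: attempt dir -> task dir
  mv "$task/part-00000" "$out/part-00000"    # job commit: part file -> output dir
  rm -rf "$out/_temporary"                   # like delete(.../_temporary, true)
  : > "$out/_SUCCESS"                        # like create(.../_SUCCESS): empty marker file
}

work="$(mktemp -d)"
commit_demo "$work/result"
ls -A "$work/result"                         # lists the part file and the _SUCCESS marker
rm -rf "$work"
```

Because the part files only appear in the output directory after the final rename, a reader of /result never sees a half-written part file, and the _SUCCESS marker is the last thing created.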

[pivhdsne:tachyon-0.5.0]$ hadoop fs -ls /xd/
Found 8 items
-rwxrwxrwx 3 gpadmin hadoop 460179 2014-10-15 19:54 /xd/bigfile.txt
drwxr-xr-x - root hadoop 0 2014-09-24 16:02 /xd/demorabbittapG
drwxrwxrwx - root hadoop 0 2014-10-14 18:33 /xd/w1
drwxrwxrwx - root hadoop 0 2014-10-14 15:34 /xd/w2
drwxrwxrwx - root hadoop 0 2014-10-14 18:33 /xd/w3
drwxrwxrwx - root hadoop 0 2014-10-14 15:33 /xd/w4
drwxrwxrwx - root hadoop 0 2014-10-14 18:33 /xd/w5
drwxrwxrwx - root hadoop 0 2014-10-14 16:57 /xd/w6
[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon tfs ls /xd/load
449.48 KB 10-15-2014 17:17:03:489 Not In Memory /xd/load/test.json
[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon tfs ls /result
244.98 KB 10-15-2014 21:12:27:354 In Memory /result/part-00001
243.57 KB 10-15-2014 21:12:27:356 In Memory /result/part-00000
0.00 B 10-15-2014 21:12:27:625 In Memory /result/_SUCCESS
[pivhdsne:tachyon-0.5.0]$ ls -lt /mnt/ramdisk/tachyonworker/users/1
total 0
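
The listings above can also be checked from a script. The following is a minimal sketch that assumes the `tfs ls` column layout shown in this session for tachyon-0.5.0 (size, unit, date, time, status, path); other versions may print a different layout. The helper names not_in_memory and has_success_marker are hypothetical, and the sample text is copied from the output above so the snippet runs without a Tachyon installation.

```shell
#!/bin/sh
# Hypothetical helper: print the paths that a `tachyon tfs ls` listing (stdin)
# reports as "Not In Memory". Assumes the column layout shown in this post.
not_in_memory() {
  awk '$5 == "Not" && $6 == "In" && $7 == "Memory" { print $NF }'
}

# Hypothetical helper: succeed if the listing (stdin) contains a _SUCCESS marker.
has_success_marker() {
  awk '$NF ~ /\/_SUCCESS$/ { found = 1 } END { exit(found ? 0 : 1) }'
}

# Sample listing copied from the session above.
listing='449.48 KB 10-15-2014 17:17:03:489 Not In Memory /xd/load/test.json
244.98 KB 10-15-2014 21:12:27:354 In Memory /result/part-00001
243.57 KB 10-15-2014 21:12:27:356 In Memory /result/part-00000
0.00 B 10-15-2014 21:12:27:625 In Memory /result/_SUCCESS'

printf '%s\n' "$listing" | not_in_memory     # prints /xd/load/test.json

if printf '%s\n' "$listing" | has_success_marker; then
  echo "job committed"
else
  echo "no _SUCCESS marker"
fi
```

Against a live installation the same helpers could read from `./bin/tachyon tfs ls /result` instead of the sample text.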
