Spring XD Setup

References:

http://docs.spring.io/spring-xd/docs/current/reference/html/#_redhat_centos_installation

https://github.com/spring-projects/spring-xd/wiki/Running-Distributed-Mode

https://github.com/spring-projects/spring-xd/wiki/XD-Distributed-Runtime

Requirements:

  •      Apache ZooKeeper 3.4.6
  •      Redis
  •      An RDBMS (MySQL, PostgreSQL, …)

Enrichers:

  •      GemFire (highly recommended for in-memory data grid)
  •      GemFire XD (highly recommended for in-memory database)
  •      RabbitMQ (highly recommended)
  •      Yarn

GemFire as Sources/Sinks

http://docs.spring.io/spring-xd/docs/current/reference/html/#gemfire-source

http://docs.spring.io/spring-xd/docs/current/reference/html/#gemfire-continuous-query

http://docs.spring.io/spring-xd/docs/current/reference/html/#gemfire-server  

GemFire XD can be used via the JDBC sink.

Assuming you have a Hadoop Infrastructure (Preferably PHD)

Apache ZooKeeper is already installed, but you may want to configure your cluster:   http://myjeeva.com/zookeeper-cluster-setup.html   http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html
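
If you do set up an ensemble, the key pieces are the server list in zoo.cfg and a matching myid file on each node. A minimal sketch, assuming a three-node ensemble with placeholder hostnames zk1/zk2/zk3 and the common /etc/zookeeper/conf and /var/lib/zookeeper locations (your paths may differ; tickTime, dataDir, clientPort and the rest are covered in the linked guides):

# on every ZooKeeper node, add the ensemble members to zoo.cfg (hostnames are placeholders)
sudo tee -a /etc/zookeeper/conf/zoo.cfg <<'EOF'
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
EOF
# each node also needs a myid file whose number matches its server.N entry
echo 1 | sudo tee /var/lib/zookeeper/myid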

Installing MySQL

           sudo yum install mysql-server

           http://dev.mysql.com/downloads/repo/yum/
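
The jobs datasource configured later in this post expects an xdjobs database and user. A minimal sketch of creating them, using the names and password from the sample configuration below (adjust for your environment; this assumes a fresh install where the MySQL root account has no password yet):

sudo service mysqld start
# create the jobs database and a user Spring XD can connect as
mysql -u root -e "CREATE DATABASE xdjobs;"
mysql -u root -e "CREATE USER 'xdjobsschema'@'%' IDENTIFIED BY 'xdsecurepassword';"
mysql -u root -e "GRANT ALL PRIVILEGES ON xdjobs.* TO 'xdjobsschema'@'%'; FLUSH PRIVILEGES;"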

Install PostgreSQL (don't install it on the same machine as Greenplum)

           sudo yum install postgresql-server
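
On RHEL/CentOS the PostgreSQL cluster also has to be initialized and started, and the jobs database created. A sketch using the same names as the sample configuration below (you may also need to adjust pg_hba.conf to allow password logins):

sudo service postgresql initdb
sudo service postgresql start
# role and database used by the Spring XD jobs datasource
sudo -u postgres psql -c "CREATE ROLE xdjobsschema LOGIN PASSWORD 'xdsecurepassword';"
sudo -u postgres psql -c "CREATE DATABASE xdjobs OWNER xdjobsschema;"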

Installing Redis

http://redis.docs.pivotal.io/doc/2x/index.html#getting-started/src/install.html#topic_q3g_vzs_yn

           wget -q -O - http://packages.gopivotal.com/pub/rpm/rhel6/app-suite/app-suite-installer | sh

           (see the RabbitMQ section below for notes on running this installer)

           sudo yum install pivotal-redis

           sudo service pivotal-redis-6379 start

           sudo chkconfig --level 35 pivotal-redis-6379 on
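
A quick way to confirm Redis is up (redis-cli should come with the package; 6379 is the port from the service name above):

# should reply PONG
redis-cli -p 6379 ping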

Installing RabbitMQ

http://rabbitmq.docs.pivotal.io/doc/33/index.html#getstart/src/install-getstart.html#install-RHEL

           sudo wget -q -O - http://packages.gopivotal.com/pub/rpm/rhel6/app-suite/app-suite-installer | sh

Depending on permissions, you may have to save that script to a file, chmod 700 it, and run it via sudo ./installer.sh.
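
For example (same installer URL as above; installer.sh is just a local file name):

wget -q -O installer.sh http://packages.gopivotal.com/pub/rpm/rhel6/app-suite/app-suite-installer
chmod 700 installer.sh
sudo ./installer.sh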

           sudo yum search pivotal

         pivotal-rabbitmq-server.noarch : The RabbitMQ server

    sudo yum install pivotal-rabbitmq-server

sudo rabbitmq-plugins enable rabbitmq_management    

    The management plugin may conflict with ports already in use if you are running other services on this machine.

sudo /sbin/service rabbitmq-server start
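
To confirm the broker and the management plugin are up, something like the following works (assuming the default guest/guest credentials, which only work from localhost, and the default management port 15672 for RabbitMQ 3.x):

# broker status
sudo rabbitmqctl status
# management API overview (the UI itself is at http://localhost:15672/)
curl -s -u guest:guest http://localhost:15672/api/overview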

Install Spring-XD

sudo wget -q -O - http://packages.gopivotal.com/pub/rpm/rhel6/app-suite/app-suite-installer | sh

sudo yum install spring-xd

Pivotal Recommendation

It is also recommended to deploy XD containers alongside the Hadoop DataNodes and to use data partitioning.

Setting Your Jobs Database

    Change the datasource; pick one of the configurations below for the easiest setup.

    /opt/pivotal/spring-xd/xd/config

#Config for use with MySQL - uncomment and edit with relevant values for your environment
#spring:
#  datasource:
#    url: jdbc:mysql://mysqlserver:3306/xdjobs
#    username: xdjobsschema
#    password: xdsecurepassword
#    driverClassName: com.mysql.jdbc.Driver
#    validationQuery: select 1

#Config for use with Postgres - uncomment and edit with relevant values for your environment
#spring:
#  datasource:
#    url: jdbc:postgresql://postgresqlserver:5432/xdjobs
#    username: xdjobsschema
#    password: xdsecurepassword
#    driverClassName: org.postgresql.Driver
#    validationQuery: select 1

Test that Spring-XD Single Node Works

    cd /opt/pivotal/springxd/xd/bin

    ./xd-singlenode --hadoopDistro phd20

Test that Spring-XD Shell Works

    cd /opt/pivotal/springxd/shell/bin

    ./xd-shell --hadoopDistro phd20

Set the Environment Variable for Spring XD

export XD_HOME=/opt/pivotal/spring-xd/xd
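
To keep that setting across sessions, you can append it to your shell profile:

echo 'export XD_HOME=/opt/pivotal/spring-xd/xd' >> ~/.bashrc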

For Default Access, I use:

/opt/pivotal/spring-xd/shell/bin/xd-shell --hadoopDistro phd20

For Testing Containers and Admin Servers for Distributed Spring XD (DIRT, the Distributed Integration Runtime)

sudo service spring-xd-admin start

sudo service spring-xd-container start
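
A quick sanity check that the admin server came up and is listening on its default port (9393, the same port the shell and UI use later in this post):

# any HTTP status code (e.g. 200 or a redirect) means the admin server is answering
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9393/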

For Testing Spring XD

http://blog.gopivotal.com/pivotal/products/spring-xd-for-real-time-analytics

https://github.com/spring-projects/spring-xd-samples

Some Spring XD Shell Commands for Testing

hadoop config fs --namenode hdfs://pivhdsne:8020

admin config server http://localhost:9393

runtime containers

runtime modules

hadoop fs ls /xd/

stream create ticktock --definition "time | log"

stream deploy ticktock

stream list
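
When you are done testing, the stream can be torn down from the XD shell as well:

stream undeploy --name ticktock

stream destroy --name ticktock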

Check the Web UI

http://localhost:9393/admin-ui/#/streams/definitions

Monitoring RabbitMQ with Nagios

Nagios

 

Nagios Configuration

  1. Install Prerequisites

Perl must be installed (see the Nagios installation guide), along with CPAN and a few Perl modules from CPAN (http://www.cpan.org/). Once CPAN is installed, the cpan command will fetch these modules from the network; CPAN is commonly available on servers for other administration purposes. The modules could also be downloaded and compiled directly, but that is not as easy.

sudo yum install cpan
sudo cpan
install Math::Calc::Units
install Config::Tiny
install JSON
install Monitoring::Plugin
install LWP::UserAgent
  2. Accept any other modules that CPAN may require or prompt you to install.

Download the Nagios RabbitMQ Plugin as a zip or through git clone.

https://github.com/jamesc/nagios-plugins-rabbitmq
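
For example, via git:

git clone https://github.com/jamesc/nagios-plugins-rabbitmq.git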

  3. Modify Nagios configuration files
    • Copy the Nagios RabbitMQ plugins from the nagios-plugins-rabbitmq/scripts folder to /usr/lib64/nagios/plugins/ (the exact commands are shown after the plugin list below):

List of Plugins

  • check_rabbitmq_aliveness
  • check_rabbitmq_objects
  • check_rabbitmq_partition
  • check_rabbitmq_server
  • check_rabbitmq_watermark
  • check_rabbitmq_connections
  • check_rabbitmq_overview
  • check_rabbitmq_queue
  • check_rabbitmq_shovels
Make sure the Perl scripts have execute permission:   chmod 755 *
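
For example, assuming you cloned the repository into the current directory:

sudo cp nagios-plugins-rabbitmq/scripts/check_rabbitmq_* /usr/lib64/nagios/plugins/
# give the Perl scripts execute permission
sudo chmod 755 /usr/lib64/nagios/plugins/check_rabbitmq_*
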
  4. Register the RabbitMQ Configuration Files

Edit /etc/nagios/nagios.cfg. Your Nagios configuration file could be named differently or located elsewhere depending on how and by whom it was installed.

In the OBJECT CONFIGURATION FILE(S) section, add the following lines, or add the new sections to your existing services and commands files:

cfg_file=/etc/nagios/objects/rabbitmq-services.cfg
cfg_file=/etc/nagios/objects/rabbitmq-commands.cfg
  5. Set Services
    1. Open /etc/nagios/objects/rabbitmq-services.cfg with a text editor and add the following service definition.
# RabbitMQ Check
define service {
            hostgroup_name                   all-servers
            use                              generic-service
            service_description              NAGIOS::RabbitMQ Check
            servicegroups                    NAGIOS
            check_command                    check_rabbitmq_overview
            normal_check_interval            5
            retry_check_interval             0.5
            max_check_attempts               3
}

 

  6. Set Commands
    1. Open /etc/nagios/objects/rabbitmq-commands.cfg with a text editor. You will need to add command definitions for the other plugins listed above if you want to display those checks as well (a sketch follows the example below).
define command {
     command_name   check_rabbitmq_overview
     command_line   $USER1$/./check_rabbitmq_overview -H $HOSTADDRESS$
}
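
The trial run below also uses check_rabbitmq_aliveness and check_rabbitmq_connections; their command definitions can follow the same pattern. A sketch that appends them (the heredoc is quoted so the shell does not expand the Nagios macros):

sudo tee -a /etc/nagios/objects/rabbitmq-commands.cfg <<'EOF'
define command {
     command_name   check_rabbitmq_aliveness
     command_line   $USER1$/check_rabbitmq_aliveness -H $HOSTADDRESS$
}
define command {
     command_name   check_rabbitmq_connections
     command_line   $USER1$/check_rabbitmq_connections -H $HOSTADDRESS$
}
EOF
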
  7. Restart Nagios
    1. Restart services
sudo service nagios restart
sudo service httpd restart
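
It is worth validating the configuration as well; the nagios binary can check it (the binary path may differ on your install):

sudo /usr/sbin/nagios -v /etc/nagios/nagios.cfg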

Trial Run

[baconserver:plugins]$ ./check_rabbitmq_queue -H localhost --queue spring-boot
RABBITMQ_QUEUE OK - messages OK (3002) messages_ready OK (3002) messages_unacknowledged OK (0) consumers OK (0) | messages=3002;; messages_ready=3002;; messages_unacknowledged=0;; consumers=0;;
[baconserver:plugins]$ ./check_rabbitmq_aliveness -H localhost
RABBITMQ_ALIVENESS OK - vhost: /
[baconserver:plugins]$ ./check_rabbitmq_connections -H localhost
RABBITMQ_CONNECTIONS OK - connections OK (0) connections_notrunning OK (0) receive_rate OK (0) send_rate OK (0) | connections=0;; connections_notrunning=0;; receive_rate=0;; send_rate=0;;
[baconserver:plugins]$ ./check_rabbitmq_objects -H localhost
RABBITMQ_OBJECTS OK - Gathered Object Counts | vhost=0;; exchange=11;; binding=9;; queue=2;; channel=0;;
[baconserver:plugins]$ ./check_rabbitmq_overview -H localhost
RABBITMQ_OVERVIEW OK - messages OK (6004) messages_ready OK (6004) messages_unacknowledged OK (0) | messages=6004;; messages_ready=6004;; messages_unacknowledged=0;
[baconserver:plugins]$ ./check_rabbitmq_server -H pivhdsne
RABBITMQ_SERVER OK - Memory OK (52.84%) Process OK (0.02%) FD OK (0.00%) Sockets OK (0.00%) | Memory=52.84%;80;90 Process=0.02%;80;90 FD=0.00%;80;90 Sockets=0.00%;80;90
[baconserver:plugins]$ ./check_rabbitmq_watermark -H pivhdsne
RABBITMQ_WATERMARK OK - mem_alarm disk_free_alarm

Reference

The plugin used in this install: https://github.com/jamesc/nagios-plugins-rabbitmq

These articles were very helpful:

Perl prerequisite: https://www.monitoring-plugins.org/

Spring XD Big Update

https://spring.io/blog/2014/11/19/spring-xd-1-1-m1-and-1-0-2-released

Spring XD 1.1.M1 is a game changer.  Once this is final, it will be the ultimate big data ingestion, export, real-time analytics, batch workflow orchestration and streaming tool.

Loading Tilde Delimited Files into HAWQ Tables with Mixed Columns

XD Job and Stream with SQL

Caveat:  the field lists below are abbreviated for the sake of space; you have to list all of the fields you are actually working with.

First we create a simple filejdbc Spring XD job that loads the raw tilde-delimited file into HAWQ. These fields all come in as TEXT columns, which could be fine for some purposes, but not for ours. We also create an XD stream with a custom sink (see the XML below; no coding required) that runs a SQL command to insert from this table and convert to other HAWQ types (such as numbers and timestamps). We trigger the secondary stream via a command-line REST POST, but we could have used a timed trigger or many other ways (automated, scripted or manual) to kick it off. You could also create a custom XD job that casts your types and does some manipulation, or do it with a Groovy script transform. There are many options in XD.

Update:    Spring XD has since added a JDBC source, so you can avoid this two-step job-plus-stream approach. I will add a new blog entry when that version of Spring XD is GA. Spring XD is constantly evolving.

jobload.xd

job create loadjob --definition "filejdbc --resources=file:/tmp/xd/input/files/*.* --names=time,userid,dataname,dataname2,
dateTimeField, lastName, firstName, city, state, address1, address2 --tableName=raw_data_tbl --initializeDatabase=true 
--driverClassName=org.postgresql.Driver --delimiter=~ --dateFormat=yyyy-MM-dd-hh.mm.ss --numberFormat=%d 
--username=gpadmin --url=jdbc:postgresql:gpadmin" --deploy
stream create --name streamload --definition "http | hawq-store" --deploy
job launch jobload
clear
job list
stream list

1) The job loads the file into a raw HAWQ table with all text columns.
2) The stream is triggered by a web page hit or a command-line call
(it needs the hawq-store sink). It inserts into the real table and truncates the temporary one.

triggerrun.sh (BASH shell script for testing)

curl -s -H "Content-Type: application/json" -X POST -d "{id:5}" http://localhost:9000
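
As mentioned above, a timed trigger works too. For example, a cron entry that fires the same POST every five minutes (the schedule is arbitrary):

*/5 * * * * curl -s -H "Content-Type: application/json" -X POST -d "{id:5}" http://localhost:9000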

I added the spring-integration-jdbc jar to /opt/pivotal/spring-xd/xd/lib.
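
Something along these lines (the exact jar file name and version depend on where you obtained it):

sudo cp spring-integration-jdbc-*.jar /opt/pivotal/spring-xd/xd/lib/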

hawq-store.xml (Spring Integration / XD Configuration)

/opt/pivotal/spring-xd/xd/modules/sink/hawq-store.xml

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:int="http://www.springframework.org/schema/integration"
 xmlns:int-jdbc="http://www.springframework.org/schema/integration/jdbc"
 xmlns:jdbc="http://www.springframework.org/schema/jdbc"
 xsi:schemaLocation="http://www.springframework.org/schema/beans
 http://www.springframework.org/schema/beans/spring-beans.xsd
 http://www.springframework.org/schema/integration
 http://www.springframework.org/schema/integration/spring-integration.xsd
 http://www.springframework.org/schema/integration/jdbc
 http://www.springframework.org/schema/integration/jdbc/spring-integration-jdbc.xsd">
<int:channel id="input" />
<int-jdbc:store-outbound-channel-adapter
 channel="input" query="insert into real_data_tbl(time, userid, firstname, ...) select cast(time as datetime), 
cast(userid as numeric), firstname, ... from dfpp_networkfillclicks" data-source="dataSource" />
<bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
 <property name="driverClassName" value="org.postgresql.Driver"/>
 <property name="url" value="jdbc:postgresql:gpadmin"/>
 <property name="username" value="gpadmin"/>
 <property name="password" value=""/>
</bean>
</beans>

createtable.sql  (HAWQ Table)

CREATE TABLE raw_data_tbl
 (
 time text,
 userid text ,
...
  somefield text
 )
 WITH (APPENDONLY=true)
 DISTRIBUTED BY (time);


Spark on Tachyon on Pivotal HD 2.0 (Hadoop 2.2)

The Future Architecture of a Data Lake in Memory Data Exchange Platform using Tachyon and Apache Spark

Tachyon Resources
Big Data Mini-Course Tachyon
Tachyon on Redhat

Spark Resources
Data Exploration with Spark


Run

 scala> var file = sc.textFile("tachyon://localhost:19998/xd/load/test.json")
14/10/15 21:11:23 INFO MemoryStore: ensureFreeSpace(69856) called with curMem=208659, maxMem=308713881
14/10/15 21:11:23 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 68.2 KB, free 294.1 MB)
file: org.apache.spark.rdd.RDD[String] = MappedRDD[9] at textFile at <console>:12

scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
14/10/15 21:11:26 INFO : getFileStatus(/xd/load/test.json): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/xd/load/test.json TPath: tachyon://localhost:19998/xd/load/test.json
14/10/15 21:11:26 INFO FileInputFormat: Total input paths to process : 1
counts: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[14] at reduceByKey at <console>:14

scala> counts.saveAsTextFile("tachyon://localhost:19998/result")
14/10/15 21:12:26 INFO : getWorkingDirectory: /
14/10/15 21:12:26 INFO : getWorkingDirectory: /
14/10/15 21:12:26 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:26 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result does not exist)/result
14/10/15 21:12:26 INFO : File does not exist: tachyon://localhost:19998/result
14/10/15 21:12:26 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/10/15 21:12:26 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/10/15 21:12:26 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/10/15 21:12:26 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/10/15 21:12:26 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/10/15 21:12:26 INFO : getWorkingDirectory: /
14/10/15 21:12:26 INFO : mkdirs(tachyon://localhost:19998/result/_temporary/0, rwxrwxrwx)
14/10/15 21:12:26 INFO SparkContext: Starting job: saveAsTextFile at <console>:17
14/10/15 21:12:26 INFO DAGScheduler: Registering RDD 12 (reduceByKey at <console>:14)
14/10/15 21:12:26 INFO DAGScheduler: Got job 0 (saveAsTextFile at <console>:17) with 2 output partitions (allowLocal=false)
14/10/15 21:12:26 INFO DAGScheduler: Final stage: Stage 0(saveAsTextFile at <console>:17)
14/10/15 21:12:26 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/10/15 21:12:26 INFO DAGScheduler: Missing parents: List(Stage 1)
14/10/15 21:12:26 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[12] at reduceByKey at <console>:14), which has no missing parents
14/10/15 21:12:26 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[12] at reduceByKey at <console>:14)
14/10/15 21:12:26 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/10/15 21:12:26 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:26 INFO TaskSetManager: Serialized task 1.0:0 as 2090 bytes in 2 ms
14/10/15 21:12:26 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:26 INFO TaskSetManager: Serialized task 1.0:1 as 2090 bytes in 0 ms
14/10/15 21:12:26 INFO Executor: Running task ID 1
14/10/15 21:12:26 INFO Executor: Running task ID 0
14/10/15 21:12:26 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:26 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:26 INFO HadoopRDD: Input split: tachyon://localhost:19998/xd/load/test.json:230135+230136
14/10/15 21:12:26 INFO HadoopRDD: Input split: tachyon://localhost:19998/xd/load/test.json:0+230135
14/10/15 21:12:26 INFO : open(tachyon://localhost:19998/xd/load/test.json, 65536)
14/10/15 21:12:26 INFO : open(tachyon://localhost:19998/xd/load/test.json, 65536)
14/10/15 21:12:26 INFO : Folder /mnt/ramdisk/tachyonworker/users/1 was created!
14/10/15 21:12:26 INFO : /mnt/ramdisk/tachyonworker/users/1/48318382080 was created!
14/10/15 21:12:26 INFO : Try to find remote worker and read block 48318382080 from 0, with len 460271
14/10/15 21:12:26 INFO : /mnt/ramdisk/tachyonworker/users/1/48318382080 was created!
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : Try to find remote worker and read block 48318382080 from 0, with len 460271
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : Block locations:[NetAddress(mHost:localhost, mPort:-1)]
14/10/15 21:12:26 INFO : May stream from underlayer fs: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/data/45
14/10/15 21:12:26 INFO : May stream from underlayer fs: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/data/45
14/10/15 21:12:26 INFO : May stream from underlayer fs: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/data/45
14/10/15 21:12:27 INFO : Canceled output of block 48318382080, deleted local file /mnt/ramdisk/tachyonworker/users/1/48318382080
14/10/15 21:12:27 INFO Executor: Serialized size of result for 0 is 786
14/10/15 21:12:27 INFO Executor: Serialized size of result for 1 is 786
14/10/15 21:12:27 INFO Executor: Sending result for 0 directly to driver
14/10/15 21:12:27 INFO Executor: Sending result for 1 directly to driver
14/10/15 21:12:27 INFO Executor: Finished task ID 0
14/10/15 21:12:27 INFO Executor: Finished task ID 1
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 0 in 413 ms on localhost (progress: 1/2)
14/10/15 21:12:27 INFO DAGScheduler: Completed ShuffleMapTask(1, 0)
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 1 in 411 ms on localhost (progress: 2/2)
14/10/15 21:12:27 INFO DAGScheduler: Completed ShuffleMapTask(1, 1)
14/10/15 21:12:27 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/10/15 21:12:27 INFO DAGScheduler: Stage 1 (reduceByKey at <console>:14) finished in 0.419 s
14/10/15 21:12:27 INFO DAGScheduler: looking for newly runnable stages
14/10/15 21:12:27 INFO DAGScheduler: running: Set()
14/10/15 21:12:27 INFO DAGScheduler: waiting: Set(Stage 0)
14/10/15 21:12:27 INFO DAGScheduler: failed: Set()
14/10/15 21:12:27 INFO DAGScheduler: Missing parents for Stage 0: List()
14/10/15 21:12:27 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[15] at saveAsTextFile at <console>:17), which is now runnable
14/10/15 21:12:27 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[15] at saveAsTextFile at <console>:17)
14/10/15 21:12:27 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/10/15 21:12:27 INFO TaskSetManager: Starting task 0.0:0 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:27 INFO TaskSetManager: Serialized task 0.0:0 as 11437 bytes in 0 ms
14/10/15 21:12:27 INFO TaskSetManager: Starting task 0.0:1 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
14/10/15 21:12:27 INFO TaskSetManager: Serialized task 0.0:1 as 11437 bytes in 0 ms
14/10/15 21:12:27 INFO Executor: Running task ID 2
14/10/15 21:12:27 INFO Executor: Running task ID 3
14/10/15 21:12:27 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:27 INFO BlockManager: Found block broadcast_2 locally
14/10/15 21:12:27 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/10/15 21:12:27 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/10/15 21:12:27 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/10/15 21:12:27 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 8 ms
14/10/15 21:12:27 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 8 ms
14/10/15 21:12:27 INFO : getWorkingDirectory: /
14/10/15 21:12:27 INFO : getWorkingDirectory: /
14/10/15 21:12:27 INFO : create(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3/part-00001, rw-r--r--, true, 65536, 1, 33554432, org.apache.hadoop.mapred.Reporter$1@f06b03a)
14/10/15 21:12:27 WARN : tachyon.home is not set. Using /mnt/tachyon_default_home as the default value.
14/10/15 21:12:27 INFO : create(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2/part-00000, rw-r--r--, true, 65536, 1, 33554432, org.apache.hadoop.mapred.Reporter$1@f06b03a)
14/10/15 21:12:27 INFO : /mnt/ramdisk/tachyonworker/users/1/56908316672 was created!
14/10/15 21:12:27 INFO : /mnt/ramdisk/tachyonworker/users/1/54760833024 was created!
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3 TPath: tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000000 TPath: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000001 TPath: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/_temporary/0/task_201410152112_0000_m_000001 does not exist)/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000001_3, tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001)
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/_temporary/0/task_201410152112_0000_m_000000 does not exist)/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/_temporary/attempt_201410152112_0000_m_000000_2, tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000)
14/10/15 21:12:27 INFO FileOutputCommitter: Saved output of task 'attempt_201410152112_0000_m_000001_3' to tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO SparkHadoopWriter: attempt_201410152112_0000_m_000001_3: Committed
14/10/15 21:12:27 INFO FileOutputCommitter: Saved output of task 'attempt_201410152112_0000_m_000000_2' to tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO SparkHadoopWriter: attempt_201410152112_0000_m_000000_2: Committed
14/10/15 21:12:27 INFO Executor: Serialized size of result for 3 is 825
14/10/15 21:12:27 INFO Executor: Serialized size of result for 2 is 825
14/10/15 21:12:27 INFO Executor: Sending result for 3 directly to driver
14/10/15 21:12:27 INFO Executor: Sending result for 2 directly to driver
14/10/15 21:12:27 INFO Executor: Finished task ID 2
14/10/15 21:12:27 INFO Executor: Finished task ID 3
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 3 in 413 ms on localhost (progress: 1/2)
14/10/15 21:12:27 INFO DAGScheduler: Completed ResultTask(0, 1)
14/10/15 21:12:27 INFO TaskSetManager: Finished TID 2 in 415 ms on localhost (progress: 2/2)
14/10/15 21:12:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/10/15 21:12:27 INFO DAGScheduler: Completed ResultTask(0, 0)
14/10/15 21:12:27 INFO DAGScheduler: Stage 0 (saveAsTextFile at <console>:17) finished in 0.415 s
14/10/15 21:12:27 INFO SparkContext: Job finished: saveAsTextFile at <console>:17, took 0.952281177 s
14/10/15 21:12:27 INFO : listStatus(tachyon://localhost:19998/result/_temporary/0): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : listStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000001
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/part-00001): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/part-00001 TPath: tachyon://localhost:19998/result/part-00001
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/part-00001 does not exist)/result/part-00001
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/part-00001
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000001/part-00001, tachyon://localhost:19998/result/part-00001)
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result TPath: tachyon://localhost:19998/result
14/10/15 21:12:27 INFO : listStatus(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/_temporary/0/task_201410152112_0000_m_000000
14/10/15 21:12:27 INFO : getFileStatus(tachyon://localhost:19998/result/part-00000): HDFS Path: /home/gpadmin/research/tachyon-0.5.0/underfs/result/part-00000 TPath: tachyon://localhost:19998/result/part-00000
14/10/15 21:12:27 INFO : FileDoesNotExistException(message:Failed to getClientFileInfo: /result/part-00000 does not exist)/result/part-00000
14/10/15 21:12:27 INFO : File does not exist: tachyon://localhost:19998/result/part-00000
14/10/15 21:12:27 INFO : rename(tachyon://localhost:19998/result/_temporary/0/task_201410152112_0000_m_000000/part-00000, tachyon://localhost:19998/result/part-00000)
14/10/15 21:12:27 INFO : delete(tachyon://localhost:19998/result/_temporary, true)
14/10/15 21:12:27 INFO : create(tachyon://localhost:19998/result/_SUCCESS, rw-r--r--, true, 65536, 1, 33554432, null)

[pivhdsne:tachyon-0.5.0]$ hadoop fs -ls /xd/
Found 8 items
-rwxrwxrwx 3 gpadmin hadoop 460179 2014-10-15 19:54 /xd/bigfile.txt
drwxr-xr-x - root hadoop 0 2014-09-24 16:02 /xd/demorabbittapG
drwxrwxrwx - root hadoop 0 2014-10-14 18:33 /xd/w1
drwxrwxrwx - root hadoop 0 2014-10-14 15:34 /xd/w2
drwxrwxrwx - root hadoop 0 2014-10-14 18:33 /xd/w3
drwxrwxrwx - root hadoop 0 2014-10-14 15:33 /xd/w4
drwxrwxrwx - root hadoop 0 2014-10-14 18:33 /xd/w5
drwxrwxrwx - root hadoop 0 2014-10-14 16:57 /xd/w6
[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon tfs ls /xd/load
449.48 KB 10-15-2014 17:17:03:489 Not In Memory /xd/load/test.json
[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon tfs ls /result
244.98 KB 10-15-2014 21:12:27:354 In Memory /result/part-00001
243.57 KB 10-15-2014 21:12:27:356 In Memory /result/part-00000
0.00 B 10-15-2014 21:12:27:625 In Memory /result/_SUCCESS
[pivhdsne:tachyon-0.5.0]$ ls -lt /mnt/ramdisk/tachyonworker/users/1
total 0

Trying out Tachyon on Hadoop 2.2

Watch the Videos

Tachyon Resources:
Tachyon with Spark
Interview with Tachyon Lead
Reliable File Sharing in Memory
Haoyuan Li (Project Lead and Co-Creator)
Future Architecture of a Data Lake in Memory
Tachyon: Memory Throughput I/O for Cluster Computing Frameworks
Tachyon meetup
Github for Haoyuan Li
Tachyon Github
xPatterns on Spark/Shark/Tachyon/Mesos
Amplab
Article on Tachyon

Pivotal hosted a talk on Tachyon (see the GitHub project), so I wanted to try it out first.  It turns out to be really easy to install and use, so here's how.  I am following the local-mode directions.

To Install and run Tachyon on PIVHDSNE_VMWARE_VM-2.0.0-52 (Pivotal Hadoop Single Node VM 64-bit RHEL 6)

Download it

wget https://github.com/amplab/tachyon/releases/download/v0.5.0/tachyon-0.5.0-bin.tar.gz
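
Unpack it and work from the extracted directory (which matches the [pivhdsne:tachyon-0.5.0] prompt in the transcripts below):

tar -xzf tachyon-0.5.0-bin.tar.gz
cd tachyon-0.5.0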

Build It

mvn -Dhadoop.version=2.2.0 clean package

Make sure it is built against Apache Hadoop 2.2 (PHD 2.0).

Format, Run it and Run Some Tests

[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon format
Connection to localhost...
Formatting Tachyon Worker @ pivhd.localdomain
Removing local data under folder: /mnt/ramdisk/tachyonworker/
Connection to localhost closed.
Formatting Tachyon Master @ localhost
Formatting JOURNAL_FOLDER: /home/gpadmin/research/tachyon-0.5.0/libexec/../journal/
Formatting UNDERFS_DATA_FOLDER: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/data
Formatting UNDERFS_WORKERS_FOLDER: /home/gpadmin/research/tachyon-0.5.0/libexec/../underfs/tmp/tachyon/workers

[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon-start.sh local
Killed 1 processes
Killed 1 processes
Connection to localhost... Killed 0 processes
Connection to localhost closed.
Formatting RamFS: /mnt/ramdisk (1gb)
Starting master @ localhost
Starting worker @ pivhd.localdomain

./bin/tachyon runTests

This makes sure everything is working; it takes some time.

Results of One of the Tests

/home/gpadmin/research/tachyon-0.5.0/bin/tachyon runTest Basic THROUGH
/BasicFile_THROUGH has been removed
2014-10-15 16:41:03,987 INFO   (TachyonFS.java:connect) - Trying to connect master @ localhost/127.0.0.1:19998
2014-10-15 16:41:04,017 INFO   (MasterClient.java:getUserId) - User registered at the master localhost/127.0.0.1:19998 got UserId 14
2014-10-15 16:41:04,018 INFO   (TachyonFS.java:connect) - Trying to get local worker host : localhost
2014-10-15 16:41:04,025 INFO   (TachyonFS.java:connect) - Connecting local worker @ localhost/127.0.0.1:29998
2014-10-15 16:41:04,044 INFO   (CommonUtils.java:printTimeTakenMs) - createFile with fileId 26 took 58 ms.
2014-10-15 16:41:04,093 INFO   (TachyonFile.java:readRemoteByteBuffer) - Try to find and read from remote workers.
2014-10-15 16:41:04,093 INFO   (TachyonFile.java:readRemoteByteBuffer) - readByteBufferFromRemote() [NetAddress(mHost:localhost, mPort:-1)]
2014-10-15 16:41:04,107 INFO   (TachyonFS.java:createAndGetUserTempFolder) - Folder /mnt/ramdisk/tachyonworker/users/14 was created!
2014-10-15 16:41:04,110 INFO   (BlockOutStream.java:<init>) - /mnt/ramdisk/tachyonworker/users/14/27917287424 was created!
Passed the test!

Check the logs directory:

-rw-r--r-- 1 gpadmin gpadmin 44958 Oct 15 17:18 worker.log@127.0.0.1_10-15-2014
-rw-r--r-- 1 gpadmin gpadmin  9610 Oct 15 17:18 master.log@127.0.0.1_10-15-2014
-rw-r--r-- 1 gpadmin gpadmin   500 Oct 15 17:17 user.log@127.0.0.1_10-15-2014
-rw-r--r-- 1 gpadmin gpadmin   778 Oct 15 17:17 user.log@127.0.0.1_10-15-2014_9
-rw-r--r-- 1 gpadmin gpadmin   500 Oct 15 17:16 user.log@127.0.0.1_10-15-2014_8
-rw-r--r-- 1 gpadmin gpadmin   500 Oct 15 17:16 user.log@127.0.0.1_10-15-2014_7
-rw-r--r-- 1 gpadmin gpadmin   500 Oct 15 17:13 user.log@127.0.0.1_10-15-2014_6
-rw-r--r-- 1 gpadmin gpadmin   778 Oct 15 17:13 user.log@127.0.0.1_10-15-2014_5
-rw-r--r-- 1 gpadmin gpadmin   500 Oct 15 17:13 user.log@127.0.0.1_10-15-2014_4
-rw-r--r-- 1 gpadmin gpadmin   500 Oct 15 17:13 user.log@127.0.0.1_10-15-2014_3
-rw-r--r-- 1 gpadmin gpadmin   500 Oct 15 17:12 user.log@127.0.0.1_10-15-2014_2
-rw-r--r-- 1 gpadmin gpadmin   500 Oct 15 17:12 user.log@127.0.0.1_10-15-2014_1
-rw-r--r-- 1 gpadmin gpadmin  2337 Oct 15 16:00 master.log@127.0.0.1_10-15-2014_1

Test Via URL (http://localhost:19999/home)

Tachyon UI

Logs

Create a Directory

 [pivhdsne:tachyon-0.5.0]$ ./bin/tachyon tfs mkdir /xd
Successfully created directory /xd

View File System

[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon tfs ls /
80.00 B   10-15-2014 16:41:01:053  In Memory      /BasicFile_MUST_CACHE
0.00 B    10-15-2014 16:41:01:560                 /BasicRawTable_MUST_CACHE
80.00 B   10-15-2014 16:41:02:069  In Memory      /BasicFile_TRY_CACHE
0.00 B    10-15-2014 16:41:02:532                 /BasicRawTable_TRY_CACHE
80.00 B   10-15-2014 16:41:03:032  In Memory      /BasicFile_CACHE_THROUGH
0.00 B    10-15-2014 16:41:03:531                 /BasicRawTable_CACHE_THROUGH
80.00 B   10-15-2014 16:41:04:042  In Memory      /BasicFile_THROUGH
0.00 B    10-15-2014 16:41:04:528                 /BasicRawTable_THROUGH
80.00 B   10-15-2014 16:41:05:055  In Memory      /BasicFile_ASYNC_THROUGH
0.00 B    10-15-2014 16:41:05:525                 /BasicRawTable_ASYNC_THROUGH
0.00 B    10-15-2014 16:53:56:062                 /xd

Load a JSON File

[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon tfs copyFromLocal /tmp/xd/test.json /xd/load/test.json
Copied /tmp/xd/test.json to /xd/load/test.json

[pivhdsne:tachyon-0.5.0]$ ./bin/tachyon tfs ls /xd/load
449.48 KB 10-15-2014 17:17:03:489  In Memory      /xd/load/test.json

The Future Architecture of a Data Lake in Memory

The Future Architecture of a Data Lake in Memory (Using Tachyon and Spark) @ Wed 15 Oct 2014 NYC @ Pivotal Labs

I am looking forward to this one tomorrow.

New York Meetup
Title: Evolution of Data Architectures: Pivotal’s Data Lake Vision for 2015
Where: 625 Avenue of Americas, 2nd Floor, New York, NY
When: Weds, October 15th, 2014, 7 PM

 

 

RabbitMQ Performance

Monitoring Hadoop

Hadoop Vaidya for Map Reduce
http://pivotalhd.docs.pivotal.io/doc/2010/PivotalHadoopEnhancements.html#PivotalHadoopEnhancements-Vaidya

Pivotal VRP for HAWQ and Greenplum
http://pivotalvrp.docs.pivotal.io/index.html
http://pivotalvrp.docs.pivotal.io/doc/6000/pdf/PivotalVRP-6.0-User-Guide.pdf

Pivotal Command Center

Lipstick for Pig

Nagios and Ganglia
http://www.analyticsworkbench.com/faq
http://www.slideshare.net/emcacademics/pivotal-operationalizing-1000-node-hadoop-cluster-analytics-workbench

Ganglia for Hadoop

AppDynamics for PHD

Nagios for Hadoop

JMX