Loading Tilde Delimited Files into HAWQ Tables with Mixed Columns

XD Job and Stream with SQL

Caveat: the field lists below are abbreviated to save space; in practice you must list every field you are working with.

First we create a simple filejdbc Spring XD job that loads the raw tilde-delimited file into HAWQ. All of these fields come in as TEXT, which may be fine for some purposes, but not for ours. We also create an XD stream with a custom sink (see the XML below; no coding required) that runs a SQL command to insert from this raw table into the real table, converting values into other HAWQ types (such as numbers and timestamps) along the way. We trigger that second stream with a command-line REST POST, but a timed trigger or many other mechanisms (automated, scripted, or manual) could kick it off just as well. You could also write a custom XD job that casts your types and does some manipulation, or do it with a Groovy script transform. There are many options in XD.

Update: Spring XD has since added a JDBC source, so you can avoid this job-plus-stream step. I will add a new blog entry when that version of Spring XD is GA; Spring XD is constantly evolving.

jobload.xd

job create loadjob --definition "filejdbc --resources=file:/tmp/xd/input/files/*.* --names=time,userid,dataname,dataname2,
dateTimeField,lastName,firstName,city,state,address1,address2 --tableName=raw_data_tbl --initializeDatabase=true
--driverClassName=org.postgresql.Driver --delimiter=~ --dateFormat=yyyy-MM-dd-hh.mm.ss --numberFormat=%d
--username=gpadmin --url=jdbc:postgresql:gpadmin" --deploy
stream create --name streamload --definition "http | hawq-store" --deploy
job launch loadjob
clear
job list
stream list
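
As noted above, the HTTP-triggered stream could instead be driven on a schedule. A sketch using Spring XD's trigger source (the stream name and cron expression here are illustrative; check the trigger module options for your XD version):

```
stream create --name timedload --definition "trigger --cron='0 0 * * * *' | hawq-store" --deploy
```

This fires the hawq-store sink once an hour instead of waiting for a REST POST.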

1) The job loads the file into a raw HAWQ table whose columns are all text.
2) The stream is triggered by a web page hit or a command-line call (it requires the hawq-store sink). It inserts into the real table and truncates the temporary one.
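
The SQL behind step 2 boils down to an insert-select with casts, followed by a truncate of the staging table. A sketch assuming the column names used above (real_data_tbl and its typed columns are illustrative):

```sql
-- Convert the all-text staging rows into typed columns in the real table.
INSERT INTO real_data_tbl (time, userid, firstname, ...)
SELECT CAST(time AS timestamp), CAST(userid AS numeric), firstname, ...
FROM raw_data_tbl;

-- Clear the staging table for the next file load.
TRUNCATE raw_data_tbl;
```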

triggerrun.sh (BASH shell script for testing)

curl -s -H "Content-Type: application/json" -X POST -d '{"id":5}' http://localhost:9000

Note: the spring-integration-jdbc jar must be added to /opt/pivotal/spring-xd/xd/lib before deploying the sink.

hawq-store.xml (Spring Integration / XD Configuration)

/opt/pivotal/spring-xd/xd/modules/sink/hawq-store.xml

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:int="http://www.springframework.org/schema/integration"
 xmlns:int-jdbc="http://www.springframework.org/schema/integration/jdbc"
 xmlns:jdbc="http://www.springframework.org/schema/jdbc"
 xsi:schemaLocation="http://www.springframework.org/schema/beans
 http://www.springframework.org/schema/beans/spring-beans.xsd
 http://www.springframework.org/schema/integration
 http://www.springframework.org/schema/integration/spring-integration.xsd
 http://www.springframework.org/schema/integration/jdbc
 http://www.springframework.org/schema/integration/jdbc/spring-integration-jdbc.xsd">
<int:channel id="input" />
<int-jdbc:store-outbound-channel-adapter
 channel="input" query="insert into real_data_tbl(time, userid, firstname, ...) select cast(time as timestamp),
cast(userid as numeric), firstname, ... from raw_data_tbl" data-source="dataSource" />
<bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
 <property name="driverClassName" value="org.postgresql.Driver"/>
 <property name="url" value="jdbc:postgresql:gpadmin"/>
 <property name="username" value="gpadmin"/>
 <property name="password" value=""/>
</bean>
</beans>

createtable.sql  (HAWQ Table)

CREATE TABLE raw_data_tbl
(
  time text,
  userid text,
  ...
  somefield text
)
WITH (APPENDONLY=true)
DISTRIBUTED BY (time);
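
The raw table above pairs with a typed target table for the insert-select. A sketch of what real_data_tbl might look like (the distribution key and any columns beyond those shown earlier are assumptions):

```sql
CREATE TABLE real_data_tbl
(
  time timestamp,      -- cast from the raw text column
  userid numeric,      -- cast from the raw text column
  firstname text,
  ...
  somefield text
)
WITH (APPENDONLY=true)
DISTRIBUTED BY (userid);
```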
