fluentd PostgreSQL hstore plugin

May be you already know the fluentd ,a very good log collector daemon.

fluentd sends logs with JSON format and have output plugins such as mongod or Amazon S3.

And also, I write the plugin it can output to the PostgreSQL hstore.

HStore is an extension of PostgreSQL which can store information with Key-Value. More about PostgreSQL hstore, see PostgreSQL document .

Here is the repository.

Install plugin

Just type gem install. easy.

% gem install fluent-plugin-pghstore

apache log

For example, if you want to apache access log which is collected using in_tail plugin to postgresql, write this config.

<source>
  type tail
  path /var/log/apache/access_log_sym
  tag apache.access
  format apache
</source>

<match apache.*>
  type pghstore
  database test
</match>

It generates,

      tag       |          time          |      record
----------------+------------------------+---------------------------------------
{apache,access} | 2012-04-01 22:55:15+09 | "code"=>"200",
"host"=>"XXX.XXX.XXX.XXX", "path"=>"/", "size"=>"2608",
   "user"=>"-", "agent"=>"Mozilla/5.0 (Macintosh; Intel Mac OS X
   10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.83 Safari/535.11", "method"=>"GET",
  "referer"=>"-"

fluetnd tag is splitted with "." and stored in the array. yes, PostgreSQL can manage array.

Ehe record column is hstore type. Above example is a little bit difficult to read, but all key and value is stored.

http

Next, add this config,

<source>
  type http
  port 9880

and type curl,

curl -F 'json={"log":"hoge"}' "http://localhost:9880/apache.curl"

then

      tag       |          time          |      record
----------------+------------------------+---------------------------------------
{apache,access} | 2012-04-01 22:55:15+09 | "code"=>"200",
"host"=>"XXX.XXX.XXX.XXX",
{apache,curl}   | 2012-04-01 23:28:44+09 | "log"=>"hoge"

Other schema data is inserted to the record column. hstore can add key dynamically, plugin-hstore can manage any type of input plugin.

What do you want to do?

After data inserted into the PostgreSQL, you can everything using SQL.

Getting UserAgent

SELECT
  COUNT(*) AS c,
  record->'agent'
FROM apache_log
GROUP BY record->'agent'
ORDER BY c;

Access count last 10 minitues

SELECT count(*) FROM apache_log WHERE time > (CURRENT_TIMESTAMP -
interval '10 min')

Statuc code last 10 minitues

SELECT
  count(CASE WHEN record->'code' = '200' THEN 1 ELSE NULL END) AS
  OK_200,
  count(CASE WHEN record->'code' = '301' THEN 1 ELSE NULL END) AS
  MOVED_301,
  count(CASE WHEN record->'code' = '302' THEN 1 ELSE NULL END) AS
  FOUND_302,
  count(CASE WHEN record->'code' = '304' THEN 1 ELSE NULL END) AS
  NOTMODIFIED_304,
  count(CASE WHEN record->'code' = '401' THEN 1 ELSE NULL END) AS
  UNAUTHORIZED_401
FROM apache_log
WHERE time > (CURRENT_TIMESTAMP - interval '10 min')

Limitation

However, nested JSON is not allowed such as...

'json={"log":"hoge", "nest":{"a":"hoge", "b":"hige"}}'

This is difficult because hstore itself is not allowed nest. If you really want to use such kind of structure, you may wait the JSON type in the PostgreSQL 9.2.

Since this plugin uses only one connection, there are possiblity to drop the under highly-loaded circumstance. fluentd itself has retry function, I think it never happen, but not sure.

Finally

mongodb is good but PostgreSQL is also good.