InfluxDB performance evaluation

I gave InfluxDB a try.

Note that the single-node test and the cluster test were run about a month apart. This is not a rigorous benchmark either, so please treat the numbers as a rough reference.

Single-node test

  • DigitalOcean droplet: 1 GB RAM, 30 GB SSD

I started with 512 MB, but the process got killed by the OOM killer, so I bumped it to 1 GB.

Writes

I sent 10 million rows (6 GB) of data like the following via HTTP POST from another node.

The batch size was 30: each POST registered 30 rows at once.

{"int": -74, "str": "ɊƭŷćҏŃȅƒŕƘȉƒŜőȈŃɊҏŷ","uint":3440,"time":1386688205}

Incidentally, the string field is 100 Unicode characters generated with rotunicode.
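For reference, here is a minimal sketch of this kind of batched writer in Python. It assumes the InfluxDB 0.x /db/<db>/series JSON endpoint with its columns/points batching, and the host, database name, and credentials are placeholders; the random-character generator is a stand-in for rotunicode, and exactly how the flat JSON rows above were packed into requests is my guess.

import json
import random
import time
import urllib.request

# Placeholders: host, database name, and credentials are assumptions.
URL = ("http://influx-host:8086/db/stress/series"
       "?u=root&p=root&time_precision=s")
BATCH = 30  # 30 rows per POST, as in the test

def random_row():
    # Stand-in for rotunicode: 100 random Latin Extended-A characters.
    s = "".join(chr(random.randint(0x0100, 0x017F)) for _ in range(100))
    return [random.randint(-128, 127), s, random.randint(0, 65535), int(time.time())]

def post_batch(rows):
    # The 0.x series endpoint takes a JSON list of {name, columns, points}.
    body = json.dumps([{
        "name": "stress",
        "columns": ["int", "str", "uint", "time"],
        "points": rows,
    }]).encode("utf-8")
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()

for _ in range(10_000_000 // BATCH):
    post_batch([random_row() for _ in range(BATCH)])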

Here is the dstat output on the InfluxDB node at the time. The load-generating side was losing outright.

----total-cpu-usage---- -dsk/total- -net/total- ---paging-----system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 52  19  27   0   2   0|   0  1508k|2313k  140k|   0     0 |5180  10k
 40  18  40   0   1   0|   0     0 |1853k  112k|   0     0 |4922  9888
 41  19  36   2   1   0|   0  2740k|1894k  113k|   0     0 |4928  9944
 46  18  34   1   1   0|   0  1504k|2009k  121k|   0     0 |4752  9516
 42  19  38   0   1   0|   0     0 |1830k  110k|   0     0 |5050  10k
 44  20  34   0   2   0|   0     0 |2022k  121k|   0     0 |5536  11k

Just when I thought it was doing well, it turned into this:

88   8   0   3   1   0|   0  6124k|4806k  131k|   0     0 |2655  4280
87   8   0   3   2   0|   0  7232k|4785k  129k|   0     0 |2185  3364
54  11   0  34   1   0|   0  2136k|4784k  129k|   0     0 |4640  8752

It took this long overall:

real    56m35.234s
user    16m19.658s

At this point, the DB was 2.7 GB.

10 million rows in 56 minutes works out, by simple division, to about 2,800 rows per second. But since each request actually carries 30 rows, that is more like 93 requests per second.

As the dstat output above shows, the CPU seems to be the limiting factor. DigitalOcean uses SSDs, so I/O was not a problem here; on an ordinary HDD, I/O might become the bottleneck instead.

Query

Each result is the total time for 10 runs.

select * from stress limit 1000
real    0m0.417s

select max(int) from stress
real    15m50.002s

select max(int) from stress where time < '2013-12-12'
236,400 records
real    0m10.699s

select count(int) from stress where '2013-12-10 23:00:00' < time and time < '2013-12-11 00:00:00'
7,202 records
real    0m0.454s

Incidentally, CPU usage during the max() query was 99%, so I think the CPU is the bottleneck there too.
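For reference, a rough sketch of how these timings can be taken, assuming the 0.x GET /db/<db>/series?q=... query endpoint; host, database name, and credentials are placeholders.

import time
import urllib.parse
import urllib.request

# Placeholders: host, database name, and credentials are assumptions.
BASE = "http://influx-host:8086/db/stress/series?u=root&p=root&q="

queries = [
    "select * from stress limit 1000",
    "select max(int) from stress where time < '2013-12-12'",
]

for q in queries:
    start = time.time()
    for _ in range(10):  # total time over 10 runs, as above
        urllib.request.urlopen(BASE + urllib.parse.quote(q)).read()
    print(f"{q}\n  total over 10 runs: {time.time() - start:.3f}s")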

Cluster test

When people talk about InfluxDB, they talk about clustering. So this time, let's build a cluster.

InfluxDB's clustering procedure was not in the documentation, but I got it working with the following settings in config.toml:

  • First node: comment out seed-servers
  • Second node: seed-servers = ["<first node's address>:8090"]

Once the cluster comes up, the nodes appear under "cluster" in the web UI. A sketch of the second node's config follows.
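As a sketch, the relevant part of config.toml on the second node would look something like this; the address is a placeholder, and everything else stays at its defaults.

# Second node: join the cluster by seeding from the first node (port 8090).
# On the first node itself, leave seed-servers commented out.
seed-servers = ["192.0.2.10:8090"]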

I do not know whether a third machine should also list the second node's address, but I suspect you can add it. The point is that only the first node is special.

I only finally found this out from a mailing-list post.

Results

I threw the same load at it as before, with just one node as the POST destination.

real    35m25.989s
user    14m33.846s

The time did not quite halve, but it got considerably faster: about 4,700 rows per second, or roughly 156 requests per second.

At this point the DB was 1.4 GB on each node; in other words, the data was split almost evenly.

Also, 18 shards had been created.

Query

select * from stress limit 1000
real    0m3.746s

select max(int) from stress where time < '2013-12-12'
236,400 records
real    0m11.530s

Why did everything get slower overall? The first query in particular does not cross shards, and its shard is not on a different server from the one that received the query.

About shards

From InfluxDB 0.5, data is placed as described below.

https://groups.google.com/forum/#!msg/influxdb/3jQQMXmXd6Q/cGcmFjM-f8YJ

That is,

  • Incoming points are divided into shards by their time value, on a fixed interval (seven days by default). Each shard gets its own LevelDB
  • Each shard is assigned to a server
  • Shards are copied according to the replication factor; that is, the number of servers holding each shard increases
  • A shard is further divided by the split value
  • Which split a point goes into is determined by hash(database, seriesName)
  • If split-random is set, the split is chosen at random (it can apparently also be specified by regex?); see the sketch after this list
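As a rough sketch of the placement rules above, as I understand them; the names, the hash choice, and the split count here are my guesses, not InfluxDB's actual implementation.

import hashlib

SHARD_SECONDS = 7 * 24 * 3600  # default shard span: seven days
NUM_SPLITS = 4                 # the "split" setting (guessed value)

def shard_index(timestamp):
    # Points are bucketed into shards by fixed time intervals.
    return timestamp // SHARD_SECONDS

def split_index(database, series_name):
    # Within a shard, the split is picked by hash(database, seriesName).
    digest = hashlib.md5(f"{database},{series_name}".encode()).hexdigest()
    return int(digest, 16) % NUM_SPLITS

# The example point from the write test above:
print(shard_index(1386688205), split_index("stress", "stress"))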

From this we can derive the following:

  • The speed of a single query does not change: apparently only one machine executes a given query, so in the end it comes down to LevelDB
  • The shard size (time span) matters too: queries within one shard are fast, but queries that cross shards get slow
  • Raising the replication factor means more copies, so parallel read performance goes up; however, the number of writes also goes up
  • Setting split improves write performance, since data is distributed further within a shard; but because the data gets divided, queries crossing splits will probably be slow

That is my impression, anyway. I have not verified this properly, so there may be mistakes.

As for this write test, shards assigned to the other node apparently did not have to be written to the receiving node's disk, which accounts for the speedup.

However, I do not understand why queries that should not cross shards got slower. I will wait for someone else to verify that.

So, not a very rigorous evaluation, but it should give you the general feel.