InfluxDB performance evaluation
I gave InfluxDB a try and measured its performance.
Note that the single-node test and the cluster test were run about a month apart, and this is not a rigorous benchmark, so please treat the numbers as a rough reference.
Single-node test
- DigitalOcean droplet: 1 GB RAM, 30 GB SSD
I started with 512 MB, but the process got killed by the OOM killer, so I went up to 1 GB.
Registration
From another node, I sent 10 million lines (6 GB) of the following data via HTTP POST. The batch size was 30, i.e. each POST registered 30 points at once (a sketch of the sender follows below).
{"int": -74, "str": "ɊƭŷćҏŃȅƒŕƘȉƒŜőȈŃɊҏŷ","uint":3440,"time":1386688205}
Incidentally, the string is a 100-character Unicode string generated with rotunicode.
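For reference, a minimal sketch of a load generator along these lines, assuming the 0.x-era HTTP API (POST a JSON array of series objects to /db/<database>/series); the host, database name, credentials, and file name here are placeholders, not necessarily what I used:

import json
import urllib.request

# Placeholder endpoint and credentials (0.x API: POST /db/<database>/series).
URL = "http://influx-host:8086/db/stress/series?u=root&p=root&time_precision=s"
BATCH_SIZE = 30

def post_batch(points):
    # One series object per request; each row in "points" lines up with "columns".
    body = [{
        "name": "stress",
        "columns": ["int", "str", "uint", "time"],
        "points": points,
    }]
    req = urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

batch = []
with open("data.jsonl") as f:  # one JSON object per line, as above
    for line in f:
        d = json.loads(line)
        batch.append([d["int"], d["str"], d["uint"], d["time"]])
        if len(batch) == BATCH_SIZE:
            post_batch(batch)
            batch = []
if batch:
    post_batch(batch)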
Here is the dstat output on the server during the load. The load-generating side was clearly losing, i.e. it could not keep the server fully busy:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
52 19 27 0 2 0| 0 1508k|2313k 140k| 0 0 |5180 10k
40 18 40 0 1 0| 0 0 |1853k 112k| 0 0 |4922 9888
41 19 36 2 1 0| 0 2740k|1894k 113k| 0 0 |4928 9944
46 18 34 1 1 0| 0 1504k|2009k 121k| 0 0 |4752 9516
42 19 38 0 1 0| 0 0 |1830k 110k| 0 0 |5050 10k
44 20 34 0 2 0| 0 0 |2022k 121k| 0 0 |5536 11k
I had hoped it would look more like this:
88 8 0 3 1 0| 0 6124k|4806k 131k| 0 0 |2655 4280
87 8 0 3 2 0| 0 7232k|4785k 129k| 0 0 |2185 3364
54 11 0 34 1 0| 0 2136k|4784k 129k| 0 0 |4640 8752
The whole run took this long:
real 56m35.234s
user 16m19.658s
At this point, the DB was 2.7 GB.
Ten million rows in 56 and a half minutes works out to roughly 2,900 lines per second (10,000,000 / 3,395 s). Since each POST carried 30 lines, that is about 98 requests per second.
As the dstat output above shows, the CPU appears to be the limiting factor. DigitalOcean uses SSDs, so I/O was not a problem here; on an ordinary HDD, I/O might well become the bottleneck instead.
Query
Each figure below is the total time for 10 runs of the query.
- select * from stress limit 1000
  - real 0m0.417s
- select max(int) from stress
  - real 15m50.002s
- select max(int) from stress where time < '2013-12-12'
  - 236,400 records; real 0m10.699s
- select count(int) from stress where '2013-12-10 23:00:00' < time and time < '2013-12-11 00:00:00'
  - 7,202 records; real 0m0.454s
Incidentally, CPU usage during the max() queries was 99%, so I think the CPU is the bottleneck there as well.
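For reference, such queries can be sent over the HTTP API as well. A minimal sketch, assuming the 0.x-era query endpoint (GET /db/<database>/series?q=...); the host and credentials are placeholders:

import urllib.parse
import urllib.request

# Placeholder host and credentials (0.x API: GET /db/<database>/series?q=...).
q = "select max(int) from stress where time < '2013-12-12'"
url = ("http://influx-host:8086/db/stress/series"
       "?u=root&p=root&q=" + urllib.parse.quote(q))
print(urllib.request.urlopen(url).read().decode("utf-8"))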
Cluster test
When people talk about InfluxDB, clustering comes up first, so this time let's build a cluster.
How to set up clustering was not described in the documentation, but I got it working with the following settings in config.toml (a concrete sketch follows the list):
- On the first node, leave seed-servers commented out
- On the second node, set seed-servers = ["<first node's address>:8090"]
Once the cluster comes together, the nodes show up under "cluster" in the web UI.
I have not checked whether a third machine should also list the second node's address, but it probably can be added. The point is that only the first node is special.
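Concretely, the relevant lines of config.toml would look something like this (a sketch; the IP address is a placeholder):

# first node: leave seed-servers commented out
# seed-servers = []

# second node: point seed-servers at the first node's port 8090
seed-servers = ["192.0.2.10:8090"]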
I eventually pieced this together from a post on the mailing list.
Results
I ran the same load as before, POSTing to only one of the two nodes.
real 35m25.989s
user 14m33.846s
It did not quite halve, but it got considerably faster: about 4,700 lines per second, or roughly 156 requests per second.
At this point the database was 1.4 GB on each node, i.e. the data was split almost evenly between the two.
In addition, 18 shards had been created.
Query
- select * from stress limit 1000
  - real 0m3.746s
- select max(int) from stress where time < '2013-12-12'
  - 236,400 records; real 0m11.530s
For some reason queries got slower across the board. The first one in particular does not cross shards, and its shard is not on a different server from the one that received the query.
About shards
From InfluxDB 0.5, data is placed as described in this thread:
https://groups.google.com/forum/#!msg/influxdb/3jQQMXmXd6Q/cGcmFjM-f8YJ
In short (a toy sketch of this placement logic follows the list):
- Shards are split by the timestamp of the incoming data into fixed intervals (seven days by default), and each shard gets its own LevelDB database
- Each shard is assigned to a server
- A shard is replicated according to the replication factor, i.e. the number of servers holding that shard increases
- A shard is further divided according to the split value
- Which split a point goes into is determined by hash(database, seriesName)
- If split-random is set, the split is chosen at random (apparently it can also be matched by a regex?)
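As a toy illustration of that placement logic (this is not InfluxDB's actual code; the hash choice and key format are my assumptions):

import hashlib

SHARD_DURATION = 7 * 24 * 3600  # default shard interval: seven days

def locate(timestamp, database, series_name, split):
    # The point's timestamp picks the shard interval...
    interval = timestamp // SHARD_DURATION
    # ...and hash(database, seriesName) picks the split within it,
    # so all points of one series land in the same split.
    key = "%s,%s" % (database, series_name)
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return interval, h % split

# e.g. the sample point from the write test above:
print(locate(1386688205, "stress", "stress", 4))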
From this we can derive the following:
- The speed of a single query does not change: apparently only one machine processes the query, so in the end it comes down to LevelDB
- The shard size (time span) also matters: a query that stays within one shard is fast, but it gets slow once it crosses shards
- Raising the replication factor makes more copies, so parallel read performance goes up; however, the number of writes goes up too
- Setting split should improve write performance, since data is distributed further within a shard; however, a query that spans the splits will probably be slow
That is my rough understanding; I have not verified it properly, so there may be mistakes.
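If I read that thread correctly, these knobs live in the [sharding] section of config.toml; something like the following (the key names and values here are my guesses from that discussion, not verified):

[sharding]
  replication-factor = 1

  [sharding.short-term]
    duration = "7d"
    split = 1
    # split-random = "/^_.*/"  # regex; matching series get a random split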
As for this write test, shards assigned to the other node apparently did not have to be written locally, which would account for the speedup.
However, I do not understand why queries that should not cross shards got slower; I will leave that for someone else to verify.
So, this was not a particularly rigorous evaluation, but it should give you the general feel.