sql - How to increase performance of the query which counts the records by a specific time interval? -
i use postgresql 9.5. below can find structure of table, query , result of query. increase performance of query. query counts records specific time interval, example: 250 milliseconds
, 1 second
, 22 minutes
, 2 days , 30 minutes
, etc.
the query fast large intervals 60 minutes
small intervals 4 seconds
it's slow.
the important things:
- i work large database (20 million rows , more in query use part of database using
where
clause, example: 1 million or more). - there
id_user_table
,sip
columns inwhere
clause. in cases,where
clause colud include of columns of table, depends on user's choice. at moment i've created b-tree index on
starttime
column:create index starttime_interval on data_store (starttime);
do know ways increase performance of query?
for example, means of:
- creating indexes on columns (which indexes? , how create them?),
- improving query,
- changing settings in postgresql,
- or somethings else.
here's structure of table:
column_name | udt_name | length | is_nullable | key ---------------+-------------+--------+-------------+-------- id | int8 | | no | pk id_user_table | int4 | | no | fk starttime | timestamptz | | no | time | float8 | | no | sip | varchar | 100 | no | dip | varchar | 100 | no | sport | int4 | | yes | dport | int4 | | yes | proto | varchar | 50 | no | totbytes | int8 | | yes | info | text | | yes | label | varchar | 10 | no |
simple select * data_store id_user_table=1 , sip='147.32.84.138' order starttime
returns this:
id | id_user_table | starttime | sip | other columns... -----+---------------+----------------------------+---------------+-------------------- 185 | 1 | 2011-09-12 15:24:03.248+02 | 147.32.84.138 | ... 189 | 1 | 2011-09-12 15:24:03.256+02 | 147.32.84.138 | ... 312 | 1 | 2011-09-12 15:24:06.112+02 | 147.32.84.138 | ... 313 | 1 | 2011-09-12 15:24:06.119+02 | 147.32.84.138 | ... 450 | 1 | 2011-09-12 15:24:09.196+02 | 147.32.84.138 | ... 451 | 1 | 2011-09-12 15:24:09.203+02 | 147.32.84.138 | ... 452 | 1 | 2011-09-12 15:24:09.21+02 | 147.32.84.138 | ...
here's query 4 seconds time intervals:
with generate_period as( select generate_series(date_trunc('second',min(starttime)), date_trunc('second',max(starttime)), interval '4 second') tp data_store id_user_table=1 , sip='147.32.84.138' --other restrictions ), data_series as( select date_trunc('second', starttime) starttime, count(*) ct data_store id_user_table=1 , sip='147.32.84.138' --other restrictions group 1 ) select gp.tp starttime-from, gp.tp + interval '4 second' starttime-to, coalesce(sum(ds.ct),0) ct generate_period gp left join data_series ds on date_trunc('second',ds.starttime) >= gp.tp , date_trunc('second',ds.starttime) < gp.tp + interval '4 second' group 1 order 1;
here's result of query:
starttime-from | starttime-to | ct ------------------------+------------------------+--------- 2011-09-12 15:24:03+02 | 2011-09-12 15:24:07+02 | 4 2011-09-12 15:24:07+02 | 2011-09-12 15:24:11+02 | 3 2011-09-12 15:24:11+02 | 2011-09-12 15:24:15+02 | 0 ... | ... | ...
here's result of explain analyze
received in pgadmin 4 seconds time intervals:
sort (cost=7477837.88..7477838.38 rows=200 width=16) (actual time=1537280.238..1537289.519 rows=60141 loops=1) sort key: gp.tp sort method: external merge disk: 1792kb cte generate_period -> aggregate (cost=166919.73..166924.74 rows=1000 width=8) (actual time=752.301..823.022 rows=60141 loops=1) -> seq scan on data_store (cost=0.00..163427.57 rows=698431 width=8) (actual time=0.034..703.845 rows=679951 loops=1) filter: ((id_user_table = 1) , ((sip)::text = '147.32.84.138'::text)) rows removed filter: 4030687 cte data_series -> groupaggregate (cost=242521.00..250085.18 rows=186076 width=8) (actual time=1233.414..1341.701 rows=57555 loops=1) group key: (date_trunc('second'::text, data_store_1.starttime)) -> sort (cost=242521.00..244267.08 rows=698431 width=8) (actual time=1233.407..1284.110 rows=679951 loops=1) sort key: (date_trunc('second'::text, data_store_1.starttime)) sort method: external sort disk: 11960kb -> seq scan on data_store data_store_1 (cost=0.00..165173.65 rows=698431 width=8) (actual time=0.043..886.224 rows=679951 loops=1) filter: ((id_user_table = 1) , ((sip)::text = '147.32.84.138'::text)) rows removed filter: 4030687 -> hashaggregate (cost=7060817.31..7060820.31 rows=200 width=16) (actual time=1537215.586..1537240.698 rows=60141 loops=1) group key: gp.tp -> nested loop left join (cost=0.00..6957441.76 rows=20675111 width=16) (actual time=1985.731..1536921.862 rows=74443 loops=1) join filter: ((date_trunc('second'::text, ds.starttime) >= gp.tp) , (date_trunc('second'::text, ds.starttime) < (gp.tp + '00:00:04'::interval))) rows removed join filter: 3461357700 -> cte scan on generate_period gp (cost=0.00..20.00 rows=1000 width=8) (actual time=752.303..910.810 rows=60141 loops=1) -> cte scan on data_series ds (cost=0.00..3721.52 rows=186076 width=16) (actual time=0.021..3.716 rows=57555 loops=60141) planning time: 0.258 ms execution time: 1537389.102 ms
update
here's query without with cte
, date_trunc()
expression maybe query easier optimize:
select gp.tp starttime_from, gp.tp + interval '4 second' starttime_to, count(ds.id) (select generate_series(min(starttime), max(starttime), interval '4 second') tp data_store id_user_table=1 , sip='147.32.84.138' --other restrictions ) gp left join data_store ds on ds.starttime >= gp.tp , ds.starttime < gp.tp + interval '4 second' , id_user_table=1 , sip='147.32.84.138' --other restrictions group gp.tp order gp.tp;
the above query faster first query. @ moment b-tree index on starttime
column works it's still not enough. if set 100 milliseconds
time intervals, i've still wait long. 100 milliseconds
range minimal time interval user can set. i've added b-tree index on sip
column doesn't help.
here's result of explain analyze
received in pgadmin 100 ms time intervals:
sort (cost=14672356.96..14672357.46 rows=200 width=16) (actual time=9380.768..9951.074 rows=2405621 loops=1) sort key: (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval)) sort method: external merge disk: 79880kb -> hashaggregate (cost=14672346.81..14672349.31 rows=200 width=16) (actual time=6199.538..7232.962 rows=2405621 loops=1) group key: (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval)) -> nested loop left join (cost=2.02..14284329.59 rows=77603444 width=16) (actual time=0.321..4764.648 rows=3006226 loops=1) -> result (cost=1.58..6.59 rows=1000 width=0) (actual time=0.295..159.147 rows=2405621 loops=1) initplan 1 (returns $0) -> limit (cost=0.43..0.79 rows=1 width=8) (actual time=0.208..0.208 rows=1 loops=1) -> index scan using starttime_interval on data_store (cost=0.43..250437.98 rows=698431 width=8) (actual time=0.204..0.204 rows=1 loops=1) index cond: (starttime not null) filter: ((id_user_table = 1) , ((sip)::text = '147.32.84.138'::text)) rows removed filter: 144 initplan 2 (returns $1) -> limit (cost=0.43..0.79 rows=1 width=8) (actual time=0.050..0.050 rows=1 loops=1) -> index scan backward using starttime_interval on data_store data_store_1 (cost=0.43..250437.98 rows=698431 width=8) (actual time=0.049..0.049 rows=1 loops=1) index cond: (starttime not null) filter: ((id_user_table = 1) , ((sip)::text = '147.32.84.138'::text)) rows removed filter: 23 -> index scan using starttime_interval on data_store ds (cost=0.44..13508.28 rows=77603 width=16) (actual time=0.002..0.002 rows=0 loops=2405621) index cond: ((starttime >= (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval))) , (starttime < ((generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00 (...) filter: ((id_user_table = 1) , ((sip)::text = '147.32.84.138'::text)) rows removed filter: 2 planning time: 1.299 ms execution time: 11641.154 ms
as wrote in comment can using multicolumn index:
create index my_index on data_store (id_user_table, sip, starttime);
this should remove filter: ((id_user_table = 1) , ((sip)::text = '147.32.84.138'::text))
execution plan (and since every such filter executed in loop savings might quite high).
i've prepared alternative query:
select min + (max - min) * (least - 1) starttime_from, min + (max - min) * least starttime_to, count ( select min, max, count(1), least( width_bucket( extract(epoch starttime)::double precision, extract(epoch min)::double precision, extract(epoch max)::double precision, ceil(extract(epoch (max - min))/extract(epoch query_interval))::integer ), ceil(extract(epoch (max - min))/extract(epoch query_interval))::integer ) ( select *, max(starttime) on (), min(starttime) on (), '4 second'::interval query_interval data_store ) subquery2 group least, min, max ) subquery1;
it should avoid nested loop , guess faster. might need adjustment fit desired result (some date truncation?).
Comments
Post a Comment