sql - How to increase performance of the query which counts the records by a specific time interval?


I use PostgreSQL 9.5. Below you can find the structure of the table, the query, and the result of the query. I want to increase the performance of the query. The query counts records by a specific time interval, for example: 250 milliseconds, 1 second, 22 minutes, 2 days and 30 minutes, etc.

The query is fast for large intervals like 60 minutes, but for small intervals like 4 seconds it's slow.

The important things:

  • I work with a large database (20 million rows and more; in the query I use only part of it via the WHERE clause, e.g. 1 million rows or more).
  • There are id_user_table and sip columns in the WHERE clause. In some cases the WHERE clause could include any of the columns of the table; it depends on the user's choice.
  • At the moment I've created a B-tree index on the starttime column:

    create index starttime_interval on data_store (starttime); 

Do you know of ways to increase the performance of the query?

For example, by means of:

  • creating indexes on columns (which indexes, and how to create them?),
  • improving the query,
  • changing settings in PostgreSQL,
  • or something else.

Here's the structure of the table:

   column_name  |   udt_name  | length | is_nullable |  key
 ---------------+-------------+--------+-------------+------
  id            |    int8     |        |     no      |  pk
  id_user_table |    int4     |        |     no      |  fk
  starttime     | timestamptz |        |     no      |
  time          |   float8    |        |     no      |
  sip           |   varchar   |  100   |     no      |
  dip           |   varchar   |  100   |     no      |
  sport         |    int4     |        |     yes     |
  dport         |    int4     |        |     yes     |
  proto         |   varchar   |   50   |     no      |
  totbytes      |    int8     |        |     yes     |
  info          |    text     |        |     yes     |
  label         |   varchar   |   10   |     no      |

A simple SELECT * FROM data_store WHERE id_user_table=1 AND sip='147.32.84.138' ORDER BY starttime returns this:

  id  | id_user_table |          starttime         |      sip      |  other columns...
 -----+---------------+----------------------------+---------------+--------------------
  185 |       1       | 2011-09-12 15:24:03.248+02 | 147.32.84.138 |        ...
  189 |       1       | 2011-09-12 15:24:03.256+02 | 147.32.84.138 |        ...
  312 |       1       | 2011-09-12 15:24:06.112+02 | 147.32.84.138 |        ...
  313 |       1       | 2011-09-12 15:24:06.119+02 | 147.32.84.138 |        ...
  450 |       1       | 2011-09-12 15:24:09.196+02 | 147.32.84.138 |        ...
  451 |       1       | 2011-09-12 15:24:09.203+02 | 147.32.84.138 |        ...
  452 |       1       | 2011-09-12 15:24:09.21+02  | 147.32.84.138 |        ...

Here's the query for 4-second time intervals:

with generate_period as (
    select generate_series(date_trunc('second', min(starttime)),
                           date_trunc('second', max(starttime)),
                           interval '4 second') as tp
    from data_store
    where id_user_table = 1 and sip = '147.32.84.138' --other restrictions
), data_series as (
    select date_trunc('second', starttime) as starttime, count(*) as ct
    from data_store
    where id_user_table = 1 and sip = '147.32.84.138' --other restrictions
    group by 1
)
select gp.tp as "starttime-from",
       gp.tp + interval '4 second' as "starttime-to",
       coalesce(sum(ds.ct), 0) as ct
from generate_period gp
left join data_series ds on date_trunc('second', ds.starttime) >= gp.tp
                        and date_trunc('second', ds.starttime) < gp.tp + interval '4 second'
group by 1
order by 1;

Here's the result of the query:

      starttime-from    |      starttime-to      |   ct
 -----------------------+------------------------+-------
 2011-09-12 15:24:03+02 | 2011-09-12 15:24:07+02 |    4
 2011-09-12 15:24:07+02 | 2011-09-12 15:24:11+02 |    3
 2011-09-12 15:24:11+02 | 2011-09-12 15:24:15+02 |    0
           ...          |           ...          |   ...

Here's the result of EXPLAIN ANALYZE received in pgAdmin for 4-second time intervals:

Sort  (cost=7477837.88..7477838.38 rows=200 width=16) (actual time=1537280.238..1537289.519 rows=60141 loops=1)
  Sort Key: gp.tp
  Sort Method: external merge  Disk: 1792kB
  CTE generate_period
    ->  Aggregate  (cost=166919.73..166924.74 rows=1000 width=8) (actual time=752.301..823.022 rows=60141 loops=1)
          ->  Seq Scan on data_store  (cost=0.00..163427.57 rows=698431 width=8) (actual time=0.034..703.845 rows=679951 loops=1)
                Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                Rows Removed by Filter: 4030687
  CTE data_series
    ->  GroupAggregate  (cost=242521.00..250085.18 rows=186076 width=8) (actual time=1233.414..1341.701 rows=57555 loops=1)
          Group Key: (date_trunc('second'::text, data_store_1.starttime))
          ->  Sort  (cost=242521.00..244267.08 rows=698431 width=8) (actual time=1233.407..1284.110 rows=679951 loops=1)
                Sort Key: (date_trunc('second'::text, data_store_1.starttime))
                Sort Method: external sort  Disk: 11960kB
                ->  Seq Scan on data_store data_store_1  (cost=0.00..165173.65 rows=698431 width=8) (actual time=0.043..886.224 rows=679951 loops=1)
                      Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                      Rows Removed by Filter: 4030687
  ->  HashAggregate  (cost=7060817.31..7060820.31 rows=200 width=16) (actual time=1537215.586..1537240.698 rows=60141 loops=1)
        Group Key: gp.tp
        ->  Nested Loop Left Join  (cost=0.00..6957441.76 rows=20675111 width=16) (actual time=1985.731..1536921.862 rows=74443 loops=1)
              Join Filter: ((date_trunc('second'::text, ds.starttime) >= gp.tp) AND (date_trunc('second'::text, ds.starttime) < (gp.tp + '00:00:04'::interval)))
              Rows Removed by Join Filter: 3461357700
              ->  CTE Scan on generate_period gp  (cost=0.00..20.00 rows=1000 width=8) (actual time=752.303..910.810 rows=60141 loops=1)
              ->  CTE Scan on data_series ds  (cost=0.00..3721.52 rows=186076 width=16) (actual time=0.021..3.716 rows=57555 loops=60141)
Planning time: 0.258 ms
Execution time: 1537389.102 ms
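As a side note to the "changing settings in PostgreSQL" idea above: this plan shows both sorts spilling to disk ("external merge  Disk: 1792kB" and "external sort  Disk: 11960kB"). A minimal sketch of what could be tried per session, assuming enough RAM is available (the 64MB value is only a guess, not a recommendation):

-- The plan reports on-disk sorts; a larger work_mem may let them run in memory.
set work_mem = '64MB';   -- hypothetical value, tune to the machine
-- ... run the query here and compare the EXPLAIN ANALYZE output ...
reset work_mem;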

UPDATE

Here's the query without the WITH CTE and the date_trunc() expression; maybe this query will be easier to optimize:

select gp.tp as starttime_from,
       gp.tp + interval '4 second' as starttime_to,
       count(ds.id)
from (select generate_series(min(starttime), max(starttime), interval '4 second') as tp
      from data_store
      where id_user_table = 1 and sip = '147.32.84.138' --other restrictions
     ) gp
left join data_store ds
       on ds.starttime >= gp.tp and ds.starttime < gp.tp + interval '4 second'
      and id_user_table = 1 and sip = '147.32.84.138' --other restrictions
group by gp.tp
order by gp.tp;

The above query is much faster than the first one. At the moment the B-tree index on the starttime column works, but it's still not enough. If I set 100-millisecond time intervals, I still have to wait a long time. The 100-millisecond range is the minimal time interval the user can set. I've also added a B-tree index on the sip column, but it doesn't help.

Here's the result of EXPLAIN ANALYZE received in pgAdmin for 100 ms time intervals:

Sort  (cost=14672356.96..14672357.46 rows=200 width=16) (actual time=9380.768..9951.074 rows=2405621 loops=1)
  Sort Key: (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval))
  Sort Method: external merge  Disk: 79880kB
  ->  HashAggregate  (cost=14672346.81..14672349.31 rows=200 width=16) (actual time=6199.538..7232.962 rows=2405621 loops=1)
        Group Key: (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval))
        ->  Nested Loop Left Join  (cost=2.02..14284329.59 rows=77603444 width=16) (actual time=0.321..4764.648 rows=3006226 loops=1)
              ->  Result  (cost=1.58..6.59 rows=1000 width=0) (actual time=0.295..159.147 rows=2405621 loops=1)
                    InitPlan 1 (returns $0)
                      ->  Limit  (cost=0.43..0.79 rows=1 width=8) (actual time=0.208..0.208 rows=1 loops=1)
                            ->  Index Scan using starttime_interval on data_store  (cost=0.43..250437.98 rows=698431 width=8) (actual time=0.204..0.204 rows=1 loops=1)
                                  Index Cond: (starttime IS NOT NULL)
                                  Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                                  Rows Removed by Filter: 144
                    InitPlan 2 (returns $1)
                      ->  Limit  (cost=0.43..0.79 rows=1 width=8) (actual time=0.050..0.050 rows=1 loops=1)
                            ->  Index Scan Backward using starttime_interval on data_store data_store_1  (cost=0.43..250437.98 rows=698431 width=8) (actual time=0.049..0.049 rows=1 loops=1)
                                  Index Cond: (starttime IS NOT NULL)
                                  Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                                  Rows Removed by Filter: 23
              ->  Index Scan using starttime_interval on data_store ds  (cost=0.44..13508.28 rows=77603 width=16) (actual time=0.002..0.002 rows=0 loops=2405621)
                    Index Cond: ((starttime >= (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval))) AND (starttime < ((generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00 (...)
                    Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                    Rows Removed by Filter: 2
Planning time: 1.299 ms
Execution time: 11641.154 ms

As I wrote in a comment, you can start by using a multicolumn index:

create index my_index on data_store (id_user_table, sip, starttime); 

This should remove the Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text)) step from the execution plan (and since every such filter is executed in a loop, the savings might be quite high).
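To check whether the new index is actually used, you could run EXPLAIN ANALYZE on the filtered part alone and look for an Index Cond instead of the Filter line. A minimal sketch, reusing the predicates from the question; the concrete time range is only an illustration:

explain (analyze, buffers)
select count(*)
from data_store
where id_user_table = 1
  and sip = '147.32.84.138'
  and starttime >= timestamptz '2011-09-12 15:24:00+02'   -- sample range, not from the question
  and starttime <  timestamptz '2011-09-12 15:25:00+02';
-- With the multicolumn index in place, the plan should show something like
-- Index Cond: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text) AND ...)
-- rather than a Seq Scan with a separate Filter line.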

I've prepared an alternative query:

select
    min + (max - min) * (least - 1) as starttime_from,
    min + (max - min) * least as starttime_to,
    count
from (
    select
        min,
        max,
        count(1),
        least(
            width_bucket(
                extract(epoch from starttime)::double precision,
                extract(epoch from min)::double precision,
                extract(epoch from max)::double precision,
                ceil(extract(epoch from (max - min)) / extract(epoch from query_interval))::integer
            ),
            ceil(extract(epoch from (max - min)) / extract(epoch from query_interval))::integer
        )
    from (
        select
            *,
            max(starttime) over (),
            min(starttime) over (),
            '4 second'::interval as query_interval
        from data_store
    ) subquery2
    group by least, min, max
) subquery1;

It should avoid the nested loop, and my guess is that it will be faster. It might need some adjustment to fit the desired result (some date truncation?).
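For reference, a toy illustration of how width_bucket() assigns bucket numbers, and why the least(...) cap is needed (an operand equal to the upper bound lands in an extra overflow bucket):

-- Divide the range [0, 10) into 5 buckets of width 2 each.
select x, width_bucket(x, 0, 10, 5) as bucket
from (values (0.5), (3.9), (7.0), (10.0)) as t(x);
-- 0.5  -> bucket 1  (covers [0, 2))
-- 3.9  -> bucket 2  (covers [2, 4))
-- 7.0  -> bucket 4  (covers [6, 8))
-- 10.0 -> bucket 6  (overflow: operand equal to the upper bound),
--         which is why the query caps the result with least(..., bucket_count):
--         the row holding max(starttime) would otherwise get its own extra bucket.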

