[postgis-devel] GiST Sorting

Tue Nov 30 08:42:55 PST 2021

I have to agree with Paul on this.  Indexes are usually created once but
queries are done all the time.  The only benefit would be to a spatial
dataset that is constantly updated.  Even then the update performance
increase (if it happens) might be small compared to the query performance.
Most applications would do a massive bulk load (OSM data) then index, then
constant queries.

On Tue, Nov 30, 2021 at 8:08 AM Paul Ramsey <pramsey at cleverelephant.ca>
wrote:

>
>
> > On Nov 29, 2021, at 11:03 PM, Marco Boeringa <marco at boeringa.demon.nl>
> wrote:
> >
> > Hi Paul,
> >
> > Thanks for these insights. In fact, seeing some of the similar
> discussions in the past about build time versus query performance when Han
> was still working on this project, was one of the reasons I initially asked
> about this here on the list, and asked about real world experiences with
> the new versions of PostgreSQL and PostGIS.
> >
> > However, your remark about "flat trees" raises another question for me:
> >
> > - Is the issue you describe data size depended or not? E.g. does the 50%
> penalty on query performance possibly only occur on small datasets, but not
> - or much less so - on something like a Planet size extract of
> OpenStreetMap? Or is this a universal issue?
>
> Note that, with optimal page filling (which the flatter trees get close
> to) you can fit a 250000 record table into an index with only 2 levels.
> These aren't just a little flat, they are really flat.
> I also tested with a 2.7M record table and found the sort-support index
> ran 30% slower. My mental model, which may be wrong, says that the lower
> levels of the tree tend to be the most full, so "smaller" tables (under a
> quarter million records!) bear the brunt of the downgrade, but even larger
> tables see it.
>
> > Having a 50% decrease in query performance of primarily small datasets
> might be acceptable,
>
> Uh, 100% abosutely not. Querying is something the database does all day,
> every day, so every inefficiency is multiplied over all the queries that
> are run on the table For All Time. Bulk indexing is something that happens
> Once.
>
> There is no way that we should push this kind of regression out as the
> default setting. Either we should make a whole other "fast to build but
> slow to query" opclass or just leave the sort-support out of the opclass
> and let people with the "index build time is my constraint" people manually
> add the sort support function to the default opclass.
>
> P
>
> > if it means much larger datasets can be indexed much faster and have
> minimal impact of the same issue. That said, I understand your caution if
> the 50% decrease in query performance is a universal issue with the new
> implementation.
> >
> > Marco
> >
> > Op 30-11-2021 om 00:08 schreef Paul Ramsey:
> >> Sorry to be the rain on the parade, but I think that the results of
> testing on performance for the new index sorting mean we should NOT be
> enabling it by default for this release.
> >>
> >> Basically we have multiple tests now that show that the index is slower
> for query performance, and we even have a convincing working theory for why
> (the improved packing into pages/nodes means that we have very flat trees
> which are expensive to query).
> >>
> >> I think releasing with a "new improved" version that degrades most
> people's systems by 50% is not a good look. We can leave all the code in
> there and just take the sorting function out of the opclass. That way
> people with "build an index and use it once" use cases can still turn the
> feature on, but people with normal use patterns aren't going to silently
> get a kick in the teeth when they upgrade to 3.2.
> >>
> >> If you would like to replicate my results, I provide them below.
> >>
> >> P.
> >>
> >>
> >> Data: https://www.dropbox.com/s/e75g1y9ua1da6qu/roads_rdr.sql.gz?dl=0
> >>
> >> create index roads_rdr_idx on roads_rdr using gist (geom);
> >>
> >> 13/3.2/CREATE 1546
> >> 13/3.1/CREATE 1683
> >> 14/3.2/CREATE 200
> >> 14/3.1/CREATE 1705
> >>
> >> select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
> >>
> >> 13/3.2/QUERY  4600
> >> 13/3.1/QUERY 4540
> >> 14/3.2/QUERY 11200
> >> 14/3.1/QUERY 4700
> >>
> >> select pg_relation_size('roads_rdr_idx');
> >>
> >> 13/3.2/IDXSIZE  5414912
> >> 13/3.1/IDXSIZE 5496832
> >> 14/3.2/IDXSIZE 2940928
> >> 14/3.1/IDXSIZE 5439488
> >> _______________________________________________
> >> postgis-devel mailing list
> >> postgis-devel at lists.osgeo.org
> >> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> > _______________________________________________
> > postgis-devel mailing list
> > postgis-devel at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-devel
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20211130/0183f619/attachment-0001.html>