[postgis-users] Performance compared to large shapefiles?

Brent Wood pcreso at pcreso.com
Wed Apr 25 12:52:48 PDT 2007


--- Jeff Dege <jdege at korterra.com> wrote:

> Someone pointed me to PostGIS as being a tool worth considering.
> 
> We've done our mapping so far with various extensions built on top of
> UMN MapServer (ka-map and openlayers).  The GIS data we've been storing
> in shapefiles.
> 
> We're finding it very difficult to manage acceptable performance when
> working with large shapefiles, where large means >500MB, >3 million
> features.  We've been splitting these both by feature type (pulling the
> features we display at wider zooms into separate files) and by geography
> (tiling).  It's tedious, time-consuming, and performance still isn't
> what we'd wish.
> 
> The docs for PostGIS say that we can expect access times to be about 10%
> greater than working with shapefiles, due to the overhead of
> establishing a database, etc.
> 
> Is this constant?
> 
> What I'd like to be true is that PostGIS would offer indexing
> possibilities that would allow for faster access to subsets of large
> sets of geographic data than we're getting with shapefiles.


We have PostGIS tables with millions of records. I have found that plotting the
entire dataset is pretty much always faster with shapefiles (sometimes
significantly so), but plotting subsets is often faster with PostGIS (also
significantly so).

You can optimise shapefiles by tiling & indexing, but I don't believe this is as
effective as, and it is certainly more cumbersome than, using a "real" database to
manage & query your spatial data.
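
For example, a GiST spatial index lets the planner fetch only the features that
overlap the current map extent, rather than scanning the whole table. A minimal
sketch (the table/column names, SRID & bounding box here are only illustrative):

 -- spatial index on the geometry column
 create index lakes_geom_idx on lakes using gist (lake_geom);

 -- && is the bounding-box overlap operator & will use the index;
 -- setsrid() tags the literal box with the layer's SRID
 select lake_geom
 from lakes
 where lake_geom && setsrid('BOX3D(170 -45, 175 -40)'::box3d, 4326);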

We have no performance issues with PostGIS, and find it far more powerful &
flexible as a data repository.


For example, zoom layers can be generated as:
 "select simplify(lake_geom,1000) from lakes where area(lake_geom) > 50,000;" 
 "select simplify(lake_geom,100) from lakes where area(lake_geom) > 5,000;"
 "select lake_geom from lakes;"

These generate simplified (point-reduced) polygons of lakes above a given size,
to be invoked depending on the zoom scale. In our experience, the on-the-fly data
reduction this affords provides overall rendering gains which outweigh the
initial processing overhead. Overall, it is much faster than using shapefiles.
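
To get a feel for how much data that saves, you can compare vertex counts before
& after simplification with the standard PostGIS npoints() function (the
tolerance & area threshold below just match the example queries above):

 select sum(npoints(lake_geom))                 as full_vertices,
        sum(npoints(simplify(lake_geom, 1000))) as simplified_vertices
 from lakes
 where area(lake_geom) > 50000;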

Thus you have the ability to use functions & operators on the fields, which isn't
possible with shapefiles. You can also construct indices on the results of these
functions, or pre-generate the required feature versions to tweak things a bit
further: each lake feature (record) can have its area stored & indexed, and the
simplified geometries stored in the same table, so you could build a table (for
example) like:

lakes
-----------
gid
name
description
altitude
depth
volume
area
geom_full
geom_1000
geom_100

with three different versions of the geometry ready to use (a sketch of how to
populate it follows below). With shapefiles you could generate three separate
shapefiles, but the DBF files etc. also need to be duplicated, which is a more
complicated approach.
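
Here is a sketch of how such a table might be populated, using the same
simplify() & area() functions as above (the SRID, geometry type & tolerances are
just assumptions to make the example concrete):

 -- add the extra geometry columns (AddGeometryColumn keeps geometry_columns in sync)
 select addgeometrycolumn('lakes', 'geom_1000', 4326, 'MULTIPOLYGON', 2);
 select addgeometrycolumn('lakes', 'geom_100',  4326, 'MULTIPOLYGON', 2);

 -- pre-compute the area & the simplified geometries once, up front
 update lakes
    set area      = area(geom_full),
        geom_1000 = simplify(geom_full, 1000),
        geom_100  = simplify(geom_full, 100);

 -- index the attribute & each geometry so every zoom layer stays fast
 create index lakes_area_idx      on lakes (area);
 create index lakes_geom_full_idx on lakes using gist (geom_full);
 create index lakes_geom_1000_idx on lakes using gist (geom_1000);
 create index lakes_geom_100_idx  on lakes using gist (geom_100);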

We also abstract the database server(s) from the web/map server, so that adding
hardware increases performance: data can be managed on different boxes but
seamlessly queried over the LAN as appropriate. This is not simple with a
file/filesystem approach such as shapefiles, where you can add spindles to an
array, but adding CPU & memory is often harder.

(a mapfile can have a PostgreSQL query for a layer/class run on any available host,
so data doesn't need to be local, & only the results of the query are passed
over the LAN)

In short, I believe PostGIS offers significant advantages over shapefiles
generally, and when properly implemented, can also be faster.


 Cheers,

  Brent Wood


