[postgis-users] Best Practices for Configuring Large Data Sets

pcreso at pcreso.com
Fri May 28 08:36:54 PDT 2010


Hi Bill,

I have not done much quite like you describe, but I am working with PostGIS tables with around 155,000,000 polygons in them, & have got very reasonable performance out of them.

General comments, some probably obvious, but here anyway...


I'm running a quad core AMD 965 with 16GB memory & my db on a RAID 0 partition (software RAID under Linux) across fastish 1TB WD Black hard drives.

You are moving lots of data for a query on such datasets, so you want fast disk; spatial queries are also FPU/CPU intensive. Get a balanced system, but disk I/O will generally be the bottleneck: a fast CPU will still spend much of its time idle, waiting for data.

Use a 64 bit O/S to allow full use of as much memory as you can. With DDR3 now providing fast & affordable 4GB modules, at least 16GB is a good target.

Use the latest versions of Postgres/PostGIS. They have improved indexing & prepared geometry support, which can make many spatial queries around 10x faster than older versions.

Use tablespaces. Put indexes & data on separate physical disks (or arrays). Heads jumping around on drives is the slowest part of the equation.
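
For instance, something along these lines (the path, table & index names are just placeholders for whatever you use):

  -- a tablespace on a second physical disk/array
  CREATE TABLESPACE fastidx LOCATION '/mnt/disk2/pg_idx';

  -- keep the table data on the default tablespace, put the spatial index here
  CREATE INDEX section_geom_gist ON section USING gist (geom) TABLESPACE fastidx;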

Use aspatial columns/indices where possible & pre-populate them. eg: for all records (section, city, county, etc) have an indexed column for state. A "select ... where state='ID'" will be much faster than a "where ST_Contains(state.geom, section.geom)".
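
Roughly like this (table & column names are only illustrative, and a states table with a state_code column is assumed):

  ALTER TABLE section ADD COLUMN state char(2);

  -- one-off spatial join to pre-populate the new column
  UPDATE section s
     SET state = st.state_code
    FROM states st
   WHERE ST_Contains(st.geom, s.geom);

  CREATE INDEX section_state_idx ON section (state);

  -- now a plain btree lookup does the work instead of a spatial join:
  SELECT count(*) FROM section WHERE state = 'ID';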

If you often have queries like "W of Mississippi" then add an indexed boolean column as a flag for this & pre-populate it; again, "select ... where [not] w_mis and ..." is much quicker than "select ... where <one of several geometry based approaches> and ..."
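
A rough sketch, assuming a claims table with lon/lat geometries (the -96 cutoff is Bill's own rule of thumb):

  ALTER TABLE claims ADD COLUMN w_mis boolean;

  -- pre-populate once
  UPDATE claims SET w_mis = (ST_X(ST_Centroid(geom)) < -96);

  -- partial index: only the "true" rows get indexed
  CREATE INDEX claims_w_mis_idx ON claims (w_mis) WHERE w_mis;

  SELECT count(*) FROM claims WHERE w_mis;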

Table partitioning might help, maybe the section table partitioned by state, but configure the server to use the partitions effectively (there is documentation on this, but other approaches can be more effective).
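
At the moment that means table inheritance plus CHECK constraints, very roughly (names & column types are placeholders, and each child table wants its own GiST index):

  CREATE TABLE section_all (gid serial PRIMARY KEY, state char(2), geom geometry);

  CREATE TABLE section_co (CHECK (state = 'CO')) INHERITS (section_all);
  CREATE TABLE section_wy (CHECK (state = 'WY')) INHERITS (section_all);
  -- ...one child table per state...

  -- let the planner skip partitions whose CHECK constraint rules them out
  SET constraint_exclusion = on;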

If you do a lot of section-in-state queries, then having a clustered index on section.state should help.
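
ie: something like this, reusing the btree index from the earlier sketch (CLUSTER is a one-off physical reorder, so re-run it after big loads):

  CLUSTER section USING section_state_idx;
  ANALYZE section;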

Use functional & conditional indexes (see this post for details:
http://postgis.refractions.net/pipermail/postgis-users/2006-February/011198.html)
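
For example (the status column on a hypothetical claims table is just for illustration):

  -- conditional (partial) index: only index the rows you query all the time
  CREATE INDEX claims_active_gist ON claims USING gist (geom)
   WHERE status = 'ACTIVE';

  -- functional index: index an expression rather than a raw column
  CREATE INDEX section_centroid_gist ON section USING gist (ST_Centroid(geom));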

Yes, using pre-configured bounding boxes will help; any way to quickly reduce the result set to be further processed can make a big difference.
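
eg: an area-of-interest filter like the one below lets the && bounding-box test use the GiST index before the exact (expensive) ST_Intersects check runs (coords & SRID are just an example, and ST_MakeEnvelope needs PostGIS 1.5+):

  SELECT count(*)
    FROM claims c
    JOIN section s ON ST_Intersects(c.geom, s.geom)          -- exact test
   WHERE c.geom && ST_MakeEnvelope(-125, 25, -96, 49, 4326); -- cheap bbox test, index-backed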

Don't trust the query planner. While it usually does a pretty good job, with complex spatial queries it can get confused. "explain" is your friend, even if it does talk in some arcane jargon at times. If a query is taking longer than you expected, it can often be tweaked to give faster results.
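
eg:

  EXPLAIN ANALYZE
  SELECT count(*) FROM section WHERE state = 'ID';
  -- watch for unexpected "Seq Scan"s, and for estimated row counts
  -- that are wildly different from the actual ones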

Store ALL your data in a single projection/coord system. You don't want any unnecessary transforms.
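
If some source data arrives in a different SRID, transform it once at load time rather than calling ST_Transform() in every query, eg (the staging table & target SRID are just placeholders):

  INSERT INTO section (state, geom)
  SELECT state, ST_Transform(geom, 4326)
    FROM section_staging;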

Use 2D geometries. I don't know what the overhead of a few extra bytes per coord amounts to, but why suffer it if you don't need to.
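
eg: strip any Z/M values as you load (ST_Force_2D was renamed ST_Force2D in later PostGIS versions; the staging table is again just a placeholder):

  INSERT INTO section (state, geom)
  SELECT state, ST_Force_2D(geom)
    FROM section_staging;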


HTH,

Brent Wood

--- On Sat, 5/29/10, Bill Thoen <bthoen at gisnet.com> wrote:

> From: Bill Thoen <bthoen at gisnet.com>
> Subject: [postgis-users] Best Practices for Configuring Large Data Sets
> To: postgis-users at postgis.refractions.net
> Date: Saturday, May 29, 2010, 2:49 AM
> I've got a database of Public Land
> Survey System (PLSS) data  for most  of the
> contiguous United States, and I need to make it accessible
> to spatial queries using PostgreSQL and PostGIS. My basic
> question is how to physically configure it so that queries
> over any area are as efficient as possible.
>
> Right now it is all stored in Shapefiles,  physically
> separated by state and within each state it's divided into
> three layers: township boundaries, section boundaries and
> quarter-section boundaries. To give you some idea of the
> scale of the entities, townships are square-ish polygons 6
> miles on a side and contain 36 sections. Sections are approx
> 1 sq mi in size. The files take up some space too  (for
> example, the data for Colorado is 862Mb and Wyoming is
> 763Mb.)
>
> I need to use these data to perform queries on other
> national data sets to provide results for requests like:
> "Produce a count of all active mining claims west of the
> Mississippi (west of longitude 96W is good enough) by PLSS
> section that intersect any of the Inventoried Roadless Areas
> (IRAs)."
>
> I think I know how to do queries like this, but it would be
> nice to not have to do them for each state. OTOH, if I
> combine the data so that it doesn't break at the state
> borders, every query is going to involve whopper-sized
> tables and the system might be too slow. I've thought about
> writing the queries starting out by filtering on an "Area of
> Interest" rectangle first, taking advantage of spatial
> indexing, but I have no feel yet for whether that will
> quickly enough reduce the load so that the queries don't
> take months to execute.
>
> So if you have experience with this sort of thing, could I
> get your advice on how to balance files sizes to optimize
> performance and convenience?  Also if you know of any
> "red flag" conditions that I should watch out for where
> things become unstable or performance goes to pot (like MS
> Access' 2Gb limit. You DON'T want to get near that!) I'd
> appreciate knowing about them before stepping into them.
>
> TIA,
> - Bill Thoen
>
> -- - Bill Thoen  GISnet - www.gisnet.com
>  303-786-9961
>


