[postgis-users] Performance problems with 250,000,000 points

Mark Cave-Ayland mark.cave-ayland at ilande.co.uk
Tue Jul 24 08:31:06 PDT 2007


On Tue, 2007-07-24 at 15:26 +0200, Boehm, Andreas wrote:
> Hi list,
> 
> Working with a huge data table, I'm running into massive performance
> problems.
> 
> Here's the scenario:
> There is one table, trees, containing about 250,000,000 random points
> within the square (x = 100000 / y = 100000). The SRID is -1. A second
> table, land_parcel, contains about 900,000 polygons (one ring with 17
> points each). These polygons are uniformly distributed within the
> square from lower left to upper right.
> 
> The task is to read all the trees within a specific land_parcel,
> something like:
> 	select x(t.geom), y(t.geom) 
> 	from land_parcel as l, trees as t 
> 	where (l.lp_id = 123456) 
> 	and (l.geom && t.geom) and Contains(l.geom, t.geom)
> 
> The problem is that the execution of this query takes ages, although
> the result set is very small (approximately 25 trees).
> In my first attempts I measured an average computation time of 366 s
> for selecting one random land_parcel. The maximum computation time I
> measured was about 700 s. In the meantime I've got better results (see
> below).
> 
> My assumption is that the bottleneck is the geo index. Reading the
> first 100 land_parcels in a loop, I got the result for the 1st
> land_parcel after 12 s; the times for the following land_parcels
> decrease (7 s, 6 s, ... 0.5 s).
> There is another thing, too: once you have picked a random land_parcel,
> the computation for a neighbouring land_parcel is much faster.
> 
> PostgreSQL is running on a Windows machine (Win 2003 Server) with an
> Intel Core2 CPU 6700, 2.66 GHz, 2 GB RAM and a 1.1 TB RAID 5. CPU usage
> is low, I/O traffic is high, and memory usage is about 80 MB.
> 
> I've already walked through Mark's tuning guide. Here are some snippets
> of the configuration file:
> 	shared_buffers = 64MB
> 	effective_cache_size = 196MB
> 	random_page_cost = 4
> I've altered the tables:
> 	ALTER TABLE trees ALTER COLUMN geom SET STATISTICS 1000;
> 	ALTER TABLE trees ALTER COLUMN tr_id SET STATISTICS 1000;
> 	ALTER TABLE land_parcel ALTER COLUMN geom SET STATISTICS 1000;
> 	ALTER TABLE land_parcel ALTER COLUMN lp_id SET STATISTICS 1000;
> I've run VACUUM ANALYZE several times.
> I've done a REINDEX on trees and land_parcel.
> I've re-clustered the table trees:
> 	CLUSTER trees_index ON trees;
> 
> The clustering was successful: the average time for one land_parcel is
> now 9.7 s. But this is still too slow. Running the same scenario in
> ArcSDE / Oracle, the average computation time is approximately 1.4 s.
> 
> Does anybody have any further ideas? Or are 250,000,000 points simply
> too many items?
> Thanks
> 
> Andreas


Hi Andreas,

Glad you found the tuning guide useful ;)
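
I assume trees_index is the standard PostGIS GiST index on the geometry
column, i.e. something along the lines of this sketch (your actual
definition may well differ):

-- assumed definition; the && operator in the queries below
-- can only be accelerated by a GiST index like this one
CREATE INDEX trees_index ON trees USING GIST (geom);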

Firstly, break the query down into its two main indexable components
and check the speed and number of rows returned for each, e.g.:


SELECT * FROM land_parcel WHERE lp_id = 123456;

then:

SELECT * FROM trees
WHERE geom && (SELECT geom FROM land_parcel WHERE lp_id = 123456);


to see if that works any better (I suspect the second query will be
faster, since it eliminates the nested loop in your plan). Hopefully
this should be enough information to get you on your way; otherwise we
need to see the EXPLAIN ANALYZE output rather than just EXPLAIN, since
the former also reports the actual execution times and row counts
alongside the planner's estimates, showing exactly why the planner made
its current choice of plan.
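
For example, something along these lines, reusing the query from your
mail (the id 123456 is just the example one you posted):

EXPLAIN ANALYZE
SELECT x(t.geom), y(t.geom)
FROM land_parcel AS l, trees AS t
WHERE l.lp_id = 123456
AND l.geom && t.geom AND Contains(l.geom, t.geom);

The interesting part is how far the estimated row counts are from the
actual ones; a large mismatch on the trees scan would point at the
statistics rather than the index itself.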


Kind regards,

Mark.

-- 
ILande - Open Source Consultancy
http://www.ilande.co.uk




