Deleting duplicate data with PostgreSQL
Here is an update to a query I posted a while back for detecting duplicate data. It allows you to select all but one of the resulting duplicates, for easy deletion. It only works on PostgreSQL, but is pretty neat. It uses a window function!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
DELETE FROM users USING ( SELECT id, first_value(id) OVER ( PARTITION BY name ORDER BY created_at DESC ) first_id FROM users WHERE name IN ( SELECT name FROM users GROUP BY name HAVING count(name) > 1 ) ) dups WHERE dups.id != dups.first_id AND users.id = dups.id; |
The order by is optional, but handy if you need to select a particular row rather than just an arbitrary one. You need an extra sub-query because you can’t have window functions in a where clause.
For more tasty PostgreSQL tricks, check out my Meet PostgreSQL screencast, a steal at only $12 plug plug plug.
PostgreSQL 9 and ruby full text search tricks
I have just released an introduction to PostgreSQL screencast, published through PeepCode. It is over an hour long and covers a large number of juicy topics:
- Setup full text search
- Optimize search with triggers and indexes
- Use Postgres with Ruby on Rails 3
- Optimize indexes by including only the rows that you need
- Use database standards for more reliable queries
- Write powerful reports in only a few lines of code
- Convert an existing MySQL application to use Postgres
It’s a steal at only $12. You can buy it over at PeepCode.
In it, I introduce full text search in postgres, and use a trigger to keep a search vector up to date. I’m not going to cover that here, but the point I get to is:
1 2 3 4 |
CREATE TRIGGER posts_search_vector_refresh BEFORE INSERT OR UPDATE ON posts FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(search_vector, 'pg_catalog.english', body, title); |
That is good for simple models, but what if you want to index child models as well? For instance, we want to include comment authors in the search index. I rolled up my sleeves an came up with this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
CREATE OR REPLACE FUNCTION search_trigger() RETURNS trigger AS $$ DECLARE search TEXT; child_search TEXT; begin SELECT string_agg(author_name, ' ') INTO child_search FROM comments WHERE post_id = new.id; search := ''; search := search || ' ' || coalesce(new.title); search := search || ' ' || coalesce(new.body); search := search || ' ' child_search; new.search_index := to_tsvector(search); return new; end $$ LANGUAGE plpgsql; CREATE TRIGGER posts_search_vector_refresh BEFORE INSERT OR UPDATE ON posts FOR EACH ROW EXECUTE PROCEDURE search_trigger(); |
Getting a bit ugly eh. It might be nice to move that logic back into ruby land, but we have the problem that we need to call a database function to convert our search document into the correct data-type. In this case, a quick work around is to store a search_document in a text field on the model, then use a trigger to only index that field into our search_vector field. The search_document field can then easily be set from your ORM.
Of course, any self-respecting rubyist should hide all this complexity behind a neat interface. I have come up with one using DataMapper that automatically adds the required triggers and indexes via auto-migrations. You use it thusly:
1 2 3 4 5 6 7 8 9 10 |
class Post include DataMapper::Resource include Searchable property :id, Serial property :title, String property :body, Text searchable :title, :body # Provides Post.search('keyword') end |
You can find the Searchable module code over on github. In it you can also find a fugly proof-of-concept for a DSL that generates the above SQL for indexing child models using DataMapper’s rich property model. It worked, but I’m not using it in any production code so I can hardly recommend it. Maybe you want to have a play though.