We use the PostgreSQL database extensively to handle a fairly large amount of data. Our largest single database is over 25G in size, with a fair amount of transactions going through it daily. As such, we've had to do a lot of optimization over time. One of our experiments was turning fsync off on one of our non-critical databases. In retrospect, this probably was not that great an idea ...
This database was a non-critical but fairly write intensive database. It logged a lot of information, largely in the form of inserts. Inserts in postgres can be a bit slow sometimes since a insertions tends to lock the same section of the index until the insert is complete, forcing all inserts to go in sequence. Updates are usually a lot faster if you're updating different rows since they don't all rely on the same section of the index and can often be done simultaneously.
The fsync option slows this down even further, since postgres then waits for the data to be flushed to disk successfully before continuing on with the next operation. Not a problem for low traffic databases but if you attempt to insert hundreds of transactions a second, the milliseconds spent waiting for the disk to write the data completely really hurts. fsync ensures data integrity but at the price of speed, especially in the case of unexpected power failure.
Since this was a non-critical database and losing data wasn't really a problem (we could either recreate it or live without it), we turned fsync off on this database. All went well for months, until we actually did suffer a power failure. During the busiest period possible. Good old Murphy.
At any rate, once we brought everything back up, things seemed to work as usual ... for about 30 minutes. Then we realized our servers were frequently losing connection to this particular database. Investigations revealed that the postgres processes were terminating themselves with messages like "Error: out of memory" or complaints about data inconsistency. Yep, we got our first corrupted Postgres database. The first one I've encountered in over 7 years of using this database.
I have to admit, I had very little clue on how to recover a corrupted database and each database was corrupted slightly differently. Initially it appeared only the indexes were damaged and a reindex removed most of the problems. Later we found that there was some damage to the tables themselves (took a long time to find that) and we attempted to restore through a backup. The Write Ahead Log (WAL) backup proved to be useless. Those were corrupted or inconsistent. Strangely enough, the database could still do a pg_dump, so we just dumped out all the data and reloaded it back in the database. This ultimately fixed everything.
Morale of story - don't turn fsync off unless you really know what you're doing, including how to detect database corruption and fix it. Our biggest problem was that postgres, unlike MySQL, does not scream "Table/database corruption!" immediately. It took us a while to determine what the problem was. Then again, unless you turn off fsync, it is probably something that almost never happens on postgres. I've had tons of corrupted MySQL databases. This is my first corrupted postgres database.
No comments:
Post a Comment