Sunday 23 September 2007

Language learning on FOSS vs OSX

My primary desktop is OSX and for the most part, it suits my purpose well. 99% of the time, everything just works, there are few hassles and things behave the way you expect them to. The Ubuntu releases are close but not quite there yet for me.

As great a desktop as OSX is though, it only shines in the most commonly used functions. In many niche areas, FOSS can provide superior solutions by virtue of the huge variety of applications, developers and freedom to develop on it. One such area that comes to mind these days is language learning.

I'm barely able to read Chinese, and working my way through a website or document is pretty much impossible without a dictionary. I particularly need a dictionary that can immediately translate words I highlight on screen. On OSX, the options are pretty limited. Most solutions are shareware or paid software. Right now I'm using TranslateIt!, which works fairly well. I highlight the word(s) I want translated, hit a keyboard combination and TranslateIt! pops up, hopefully with the translations. It is one extra step though. The paid version of TranslateIt! can show the translation immediately after I highlight a word.

My colleagues on FOSS desktops don't need to pay for this. StarDict comes with most distributions and does the highlight/translate thing right out of the box. It's invaluable in a mixed Chinese/English environment like Exoweb. Besides the software, the power of FOSS shows up in StarDict's dictionaries, which are varied and extremely useful. So much so that TranslateIt! actually uses StarDict dictionaries and all the translations I am using are StarDict's. Without StarDict, TranslateIt! would actually be useless to me.

My favorite StarDict dictionaries include (all zh_CN -> en):

  • cedict-gb dictionary (has pinyin and tone marks. Must have)
  • langdao-ce-gb (a much larger vocabulary but translations sometimes not precise. No pinyin)
  • Chinese idiom dictionary (a dictionary explaining Chinese idioms. Unfortunately the explanations are Chinese->Chinese and some colleagues have said they are suspect)

Two are under the GPL while the cedict library is under its own license similar to cc-by-nc.

While I'm sticking to TranslateIt! until such time as StarDict works natively under OSX, I would simply be unable to read anything at all in Chinese without StarDict's dictionaries.

We Make It Up As We Go

Every now and then I find myself repeating the same thing in weekly chats, so I try to note them down in a blog post so the Exowebers that actually read the Exoweb Planet have a chance to see it. This particular one has to do with how we developed our work processes and best practices. There is no mystical method, no profound MBA insights or deep pondering. Quite simply, we make it up as we go.

That's not to say all our current processes are arbitrary. Every process or practice we have evolved to overcome a perceived problem. We experiment with different processes and keep those that work. Those that fail, we learn from and move on. The entire organization is a work in progress, continuously trying to improve itself.

There are a few core values we do have, which very much reflect those of agile methodology: People over process, working code over documentation, etc. We also believe very much in making this a great place to work. Since you spend a huge portion of your waking hours at work, why wouldn't you want to make it as pleasant, as fun, as challenging a place as possible?

From these core values we simply figure things out as we go. For example, our current process of code reviews was triggered by the realization that our code quality wasn't good enough, that too much bad code and bugs were leaking into production systems. We tried the NASA style group code reviews but found that much too heavy. We then tried having a couple of core code reviewers doing all the work, but found that it did not scale and that the benefits of code review actually were disproportionately accumulating with the core reviewers. Our current method is a lightweight team review process that seems to combine the benefits of code review while reducing the cost. We are likely to make more tweaks and changes in the future, depending on future needs and ideas.

What does this mean for Exowebers? Most important of all, it means that it is the past ideas of all of us that have created the great environment we have. It means it is your future ideas that will make us an even better place to work. You need to pay attention to your environment, be willing to question our processes and methodologies, and contribute new ideas when they come to mind. No process is sacred. Given good enough reason, visible enough benefits, anything can be changed.

You can contribute ideas openly, by floating RFCs by email, or you can quietly suggest them to your team leaders in weekly chats. Whatever method you choose, it is important that the ideas are communicated and considered. Only then can we as a company improve. Only then can we make this place an even better place to work.

Yes, this includes higher salaries too. If you have ideas on how to make us more profitable, we can all share in the profits in the form of higher salaries :)

What Happens When You Turn Fsync Off On Postgres

We use the PostgreSQL database extensively to handle a fairly large amount of data. Our largest single database is over 25 GB in size, with a fair number of transactions going through it daily. As such, we've had to do a lot of optimization over time. One of our experiments was turning fsync off on one of our non-critical databases. In retrospect, this probably was not that great an idea ...

This database was non-critical but fairly write intensive. It logged a lot of information, largely in the form of inserts. Inserts in postgres can be a bit slow sometimes, since insertions tend to lock the same section of the index until the insert is complete, forcing all inserts to go in sequence. Updates are usually a lot faster if you're updating different rows, since they don't all rely on the same section of the index and can often be done simultaneously.

The fsync option slows this down even further, since postgres then waits for the data to be flushed to disk successfully before continuing on with the next operation. Not a problem for low traffic databases, but if you attempt to insert hundreds of transactions a second, the milliseconds spent waiting for the disk to write the data completely really hurt. fsync protects data integrity, especially in the case of unexpected power failure, but at the price of speed.
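To get a feel for the cost described above, here is a minimal sketch in Python. It is not how postgres itself works internally; it only times appending short lines to a file with and without an os.fsync() call after every write. The function names, file names and row count are made up for illustration, and the actual numbers depend heavily on the disk and filesystem.

```python
import os
import tempfile
import time


def timed_appends(path, rows, sync_each_write):
    """Write `rows` short lines to `path`, optionally fsyncing after each one.

    Returns the elapsed wall-clock time in seconds.
    (Hypothetical helper, not part of postgres.)
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        for i in range(rows):
            f.write(b"log entry %d\n" % i)
            if sync_each_write:
                f.flush()             # hand Python's buffer to the OS
                os.fsync(f.fileno())  # block until the OS reports the data is on disk
    return time.perf_counter() - start


def compare(rows=200):
    """Time the same workload with and without fsync after every write."""
    with tempfile.TemporaryDirectory() as tmp:
        synced = timed_appends(os.path.join(tmp, "synced.log"), rows, True)
        unsynced = timed_appends(os.path.join(tmp, "unsynced.log"), rows, False)
    return synced, unsynced


if __name__ == "__main__":
    synced, unsynced = compare()
    print(f"fsync after every write: {synced:.4f}s")
    print(f"no fsync:                {unsynced:.4f}s")
```

On most spinning disks the synced run should be noticeably slower, which is the same trade-off the fsync setting makes: each write waits for the disk, but nothing acknowledged is lost if the power goes out.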

Since this was a non-critical database and losing data wasn't really a problem (we could either recreate it or live without it), we turned fsync off on this database. All went well for months, until we actually did suffer a power failure. During the busiest period possible. Good old Murphy.

At any rate, once we brought everything back up, things seemed to work as usual ... for about 30 minutes. Then we realized our servers were frequently losing connection to this particular database. Investigations revealed that the postgres processes were terminating themselves with messages like "Error: out of memory" or complaints about data inconsistency. Yep, we got our first corrupted Postgres database. The first one I've encountered in over 7 years of using this database.

I have to admit, I had very little idea how to recover a corrupted database, and each database was corrupted slightly differently. Initially it appeared only the indexes were damaged and a reindex removed most of the problems. Later we found that there was some damage to the tables themselves (that took a long time to find) and we attempted to restore from a backup. The Write Ahead Log (WAL) backup proved to be useless: the logs were corrupted or inconsistent. Strangely enough, the database could still do a pg_dump, so we just dumped out all the data and reloaded it back into the database. This ultimately fixed everything.
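For reference, the sequence we ended up with (reindex, dump, reload) can be sketched roughly as below. This is an illustration, not a record of the exact commands we ran: the database names and dump path are hypothetical, and loading into a freshly created database is an assumption made here so the damaged one is left untouched. The script only builds and prints the command lines; it does not execute anything.

```python
import shlex


def recovery_commands(damaged_db, fresh_db, dump_file):
    """Return the command lines for a reindex / dump / reload recovery.

    All three arguments are placeholder names chosen by the caller.
    """
    return [
        # Rebuild every index in the damaged database.
        ["reindexdb", damaged_db],
        # Dump all the data as a plain SQL script.
        ["pg_dump", "-f", dump_file, damaged_db],
        # Create an empty database to load into (assumption: keep the old one around).
        ["createdb", fresh_db],
        # Replay the dump, stopping at the first error instead of carrying on.
        ["psql", "-v", "ON_ERROR_STOP=1", "-d", fresh_db, "-f", dump_file],
    ]


if __name__ == "__main__":
    for argv in recovery_commands("applog", "applog_restored", "/tmp/applog.sql"):
        print(shlex.join(argv))
```

A dump that completes without errors is also a cheap sanity check, since pg_dump has to read every row of every table to produce it.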

Moral of the story: don't turn fsync off unless you really know what you're doing, including how to detect database corruption and fix it. Our biggest problem was that postgres, unlike MySQL, does not scream "Table/database corruption!" immediately. It took us a while to determine what the problem was. Then again, unless you turn off fsync, it is probably something that almost never happens on postgres. I've had tons of corrupted MySQL databases. This is my first corrupted postgres database.