If you're looking for a good "real world" starting point for this type of project, I would recommend language classification. It is close to impossible to determine the natural language of a webpage (French, Russian, etc.) reliably from the HTTP headers alone, as almost everything falsely reports itself as English.
To approach the problem reliably, you need a substantial corpus of documents in each language to compare against (as both Google and Technorati have).
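As a rough illustration of comparing a page against per-language corpora, here is a minimal sketch using character trigram counts and cosine similarity. This is one common technique for the job, not necessarily what Google or Technorati do. The names `trigram_profile`, `build_profiles`, `classify`, and `TOY_CORPUS` are made up for this example, and the two-sentence corpus is a stand-in for the substantial one you would actually need.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, space-normalized text."""
    normalized = " " + " ".join(text.lower().split()) + " "
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram Counters (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profiles(corpus):
    """Turn {language: [documents]} into {language: trigram profile}."""
    return {lang: trigram_profile(" ".join(docs)) for lang, docs in corpus.items()}


def classify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    target = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(target, profiles[lang]))


# Assumption: a toy corpus for illustration only; real use needs far more text.
TOY_CORPUS = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "this is a page written in the english language",
    ],
    "fr": [
        "le renard brun saute par-dessus le chien paresseux",
        "ceci est une page écrite dans la langue française",
    ],
}

profiles = build_profiles(TOY_CORPUS)
print(classify("the dog is in the house", profiles))
print(classify("le chien est dans la maison", profiles))
```

With a corpus this small the result is fragile on short or mixed-language input; accuracy depends almost entirely on how much text per language goes into the profiles.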
From there, you could wrap this in an API that accepts a URL and returns the language the page is written in.
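A URL-in, language-out wrapper could look roughly like the sketch below, which uses only the Python standard library. The function names (`visible_text`, `fetch_html`, `language_of_url`) are hypothetical, and `classify` is assumed to be whatever text classifier you have built: any callable that takes a string and returns a language code. The fetcher is passed in as a parameter so the function can be exercised without network access.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the page's text content with markup, scripts, and styles removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_html(url, timeout=10):
    """Download a page and decode it, falling back to UTF-8 if no charset is declared."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def language_of_url(url, classify, fetch=fetch_html):
    """Fetch a URL, extract its text, and hand it to the supplied classifier."""
    return classify(visible_text(fetch(url)))
```

Note that this ignores the headers entirely and classifies on the page text, which is the point: the declared language is not trusted.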