JustHackIt

Hack With New People. Submit Ideas or Your Bio.

  • 1 point by michaelbuckbee 1 year ago on web page / site classification 0 children

    If you're looking for a good "real world" starting point for this type of project, I would recommend language classification. It is basically impossible to reliably determine the natural language of a webpage (French, Russian, etc.) from the HTTP headers alone, as almost everything reports itself as English regardless of what it is actually written in.

    To approach the problem reliably, you need a substantial corpus of documents in each language that you can compare a page against (like both Google and Technorati have).
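
    As a rough illustration of comparing a page against per-language corpora, here is a minimal sketch that builds character trigram profiles and picks the closest one by cosine similarity. This is one common technique, not necessarily what Google or Technorati use. The three sample sentences in `SAMPLE_CORPUS` are a toy stand-in for a real corpus, and all function names are made up for this example.

```python
import math
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count character trigrams in lowercased, whitespace-normalized text."""
    cleaned = " " + " ".join(text.lower().split()) + " "
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# ASSUMPTION: toy corpus for illustration only. A usable classifier needs
# far more text per language than a single sentence.
SAMPLE_CORPUS = {
    "english": "the quick brown fox jumps over the lazy dog and this is "
               "a sentence written in the english language",
    "french": "le renard brun saute par dessus le chien paresseux et ceci "
              "est une phrase écrite dans la langue française",
    "russian": "быстрая коричневая лиса прыгает через ленивую собаку и это "
               "предложение написано на русском языке",
}

PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLE_CORPUS.items()}


def classify_language(text: str) -> str:
    """Return the corpus language whose trigram profile is most similar."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine_similarity(profile, PROFILES[lang]))
```

    With a corpus this small the result is only meaningful for text that shares vocabulary with the samples; the point is the shape of the comparison, not the accuracy.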

    From there you could wrap this in an API that accepts a URL and returns the language the page is written in.
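
    A URL-in, language-out call could be sketched roughly as follows, using only the Python standard library. The function names and the JSON response shape are assumptions made for this example, and the classifier itself is passed in as a parameter (`classify`) rather than implemented here.

```python
import json
from html.parser import HTMLParser
from typing import Callable
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Strip markup and return the page's text content as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_html(url: str) -> str:
    """Download a page and decode it using the declared charset, if any."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def language_for_url(
    url: str,
    classify: Callable[[str], str],          # hypothetical classifier: text -> language
    fetch: Callable[[str], str] = fetch_html,
) -> str:
    """Return a JSON string naming the language of the page at `url`."""
    text = visible_text(fetch(url))
    return json.dumps({"url": url, "language": classify(text)})
```

    Passing `fetch` and `classify` in as arguments keeps the network call and the corpus-backed classifier swappable, which also makes the function easy to exercise without hitting a live site.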
