I'm looking to create a website classification tool that takes web pages (downloaded from the internet) and tries to classify them based on learned patterns or some other method not yet determined. I think there are Bayesian algorithms for this, but I want to explore it with someone to see if we can create something generic that can be plugged into other tools. If you're interested, let me know.
Done this, using a neural network trained on existing classified URL blacklists. Seems to work well - what to do with it?
Sounds interesting :) I was looking to create something from scratch, but would love to talk more and see how far you've got and the tech/language used. One example would be the classification of news or blog postings, or the ability to find 'similar' items. Not sure how much processing this would require, though I'm looking at AWS (I've already done some stuff with it) and distributing the code across their instances for the processing.
Check out OpenCalais from Thomson Reuters; open web content analysis and classification... RDF, semantics, etc. I have begun to use it in some of my Drupal projects.
pretty interesting work!
--
matt j. sorenson
If you're looking for a good "real world" starting point for this type of project, I would recommend language classification. It is basically impossible to reliably determine the spoken language of a webpage (French, Russian, etc) from the HTTP headers as almost everything falsely reports itself as English.
To approach the problem reliably you need a substantial corpus of documents in each language to compare against (like both Google and Technorati have).
From there you could tie this into an API that would accept a URL and return what language the page was written in.
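For what it's worth, here is a minimal sketch of the corpus-comparison idea, assuming character trigram counts and cosine similarity as the comparison method (nobody in this thread specified one). The SAMPLES corpus, the function names and the three languages are all made up for illustration; a real corpus would need far more text per language, and fetching the URL and stripping the HTML is left out.

```python
import math
from collections import Counter

# Hypothetical toy corpus (an assumption for this sketch): one short sample
# per language. A real system would use a much larger corpus.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sleeps in the house while it is raining",
    "french": "le renard brun saute par dessus le chien paresseux et le chat "
              "dort dans la maison quand il pleut",
    "german": "der schnelle braune fuchs springt und die katze liegt im haus "
              "weil es heute regnet",
}


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# One profile per language, built once from the corpus.
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}


def detect_language(text, profiles=PROFILES):
    """Return the language whose profile is most similar to the text."""
    query = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(query, profiles[lang]))


print(detect_language("the dog is in the house"))
```

An API like the one described would wrap detect_language behind a URL fetch; very short pages or mixed-language pages would likely need a confidence threshold rather than a plain best match.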
If you are thinking of doing text classification, be sure to check out CRM114 or the Lua port of it. It is quite good at this sort of thing.
Also, there is a good Perl library, bow.
Thanks for this. This is exactly what I was looking at, so it's good to hear it's one I should definitely look at further.
I'm thinking of two different approaches:
- one should be easy to implement and only involves the use of the Semantic Hacker API (http://www.semantichacker.com/);
- the other could use LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis). There are already a few open source implementations of this algorithm.
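To give a feel for the LSA option, here is a rough pure-Python sketch, not one of the open source implementations mentioned above. It assumes raw word counts (no tf-idf weighting) and gets the truncated SVD indirectly, by running power iteration on the document-by-document Gram matrix, which is only sensible for tiny inputs. The example documents and all function names are invented for illustration.

```python
import math
import random


def term_document_counts(docs):
    """Build a term-by-document matrix of raw word counts."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[index[word]][j] += 1.0
    return matrix


def gram_matrix(matrix):
    """Document-by-document matrix A^T A of the term-document matrix A."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]


def top_eigenpairs(sym, k, rng, steps=200):
    """Top-k eigenpairs of a symmetric matrix via power iteration + deflation."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        vec = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(steps):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * work[i][j] * vec[j]
                    for i in range(n) for j in range(n))
        pairs.append((value, vec))
        # Deflate: remove the component just found before the next round.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def lsa_document_vectors(docs, k=2, seed=0):
    """Map each document to k latent coordinates (singular value * V entry)."""
    gram = gram_matrix(term_document_counts(docs))
    pairs = top_eigenpairs(gram, k, random.Random(seed))
    return [[math.sqrt(max(value, 0.0)) * vec[d] for value, vec in pairs]
            for d in range(len(docs))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical example documents: three about pets, two about finance.
docs = ["cat dog pet", "dog pet animal", "pet animal cat",
        "stock market trade", "market trade bank"]
vectors = lsa_document_vectors(docs, k=2)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[3]))
```

In this toy case the two pet documents end up close together in the latent space and the pet and finance documents end up far apart. For anything beyond a demo, a proper SVD routine from a numerical library would be the better choice.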
What kind of classification were you thinking? Sounds like an interesting toy project.
Someone mentioned CRM114, which I had been looking at. The idea is to use it to classify by category: for example, sports news, business, tech, that kind of thing.
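As a rough illustration of category classification, here is a small multinomial naive Bayes sketch with add-one smoothing. To be clear, this is the plain Bayes approach mentioned in the original post, not how CRM114 works internally. The class name, the training sentences and the three categories are all invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocab)
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: two short documents per category.
classifier = NaiveBayesClassifier()
classifier.train("sports", "the team won the match with a late goal")
classifier.train("sports", "the striker scored twice in the cup final")
classifier.train("business", "shares fell as the bank reported lower profits")
classifier.train("business", "the company announced a merger and higher revenue")
classifier.train("tech", "the new phone ships with a faster processor")
classifier.train("tech", "developers released a software update for the browser")

print(classifier.classify("the striker scored a late goal"))
```

With real news or blog data you would want proper tokenisation, stop-word handling and a much larger training set, but the scoring loop would look much the same.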
@chrisdew: show your work? I'd love to see more about this.
I'll post some code, when I get home tomorrow evening, UK time.
Chris, I'm in the UK also if you want to chat any time
I've started to put some code up on http://www.finalcog.com - apologies for not organising pretty printing yet. If you want to look at it, you'll need to look at the HTML source - or wait for me to get formatted code displaying nicely in Drupal.
Hi Josh. The idea, right now, is embryonic. My thoughts are that there are a lot of applications for something like this and I think a product could be created that others could tap into for their own needs. I've already done some research on it - will try to get the URLs and post them on here later but I'm looking for people to collaborate with to create something. Oh, Python is my preferred language also - thought I'd mention it :)
Dude, check out:
http://justhackit.slinkset.com/links/Providing_a_place_for_computers_to_track_meaning_and_reason_?c=0
It's Python coded and we already do some of what you mention. Wanna collaborate?
email me! :)
luke at thoughttrail dot com