JustHackIt

Hack With New People. Submit Ideas or Your Bio.

  • web page / site classification

    I'm looking to create a website classification tool that takes web pages (downloaded from the internet) and tries to classify them using learned patterns or some other method not yet determined. I think there are Bayes algorithms for this, but I want to explore it with someone to see if we can create something generic that can be plugged into other tools. If you're interested, let me know.
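
A minimal sketch of the Bayes approach mentioned above, assuming the pages have already been downloaded and stripped down to plain text. It is a multinomial naive Bayes classifier with add-one smoothing; the class name, the labels and the training sentences are all made up for illustration and are not from any existing tool.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training pages, already reduced to plain text.
nb = NaiveBayes()
nb.train("the striker scored a late goal to win the match", "sports")
nb.train("the team lost the cup final on penalties", "sports")
nb.train("shares fell as the bank reported lower profits", "business")
nb.train("the company raised prices and profits rose", "business")

print(nb.classify("a late goal won the cup"))
print(nb.classify("bank shares and profits"))
```

Working in log space avoids underflow on long pages, and the smoothing keeps an unseen word from zeroing out a whole category. A real tool would need far more training text per category than this.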

    3 points by inovade 1 year ago
    • 15 comments
  • 2 points by chrisdew 1 year ago 1 child

    Done this using a neural network, trained on existing classified URL blacklists. Seems to work well. What to do with it?
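
To make the idea concrete, here is a rough sketch of the simplest possible "network": a single logistic unit trained on tokens taken from the URL itself. This is not the code described in the comment above, which was not posted in this thread. The URLs, the labels and the function names are hypothetical, and a real blacklist would supply many thousands of examples.

```python
import math
import re
from collections import defaultdict


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (host and path pieces)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def train(examples, epochs=50, rate=0.5):
    """Fit one logistic unit (a weight per token plus a bias) with stochastic gradient updates."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for url, label in examples:
            tokens = set(url_tokens(url))
            z = bias + sum(weights[t] for t in tokens)
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = label - predicted
            bias += rate * error
            for t in tokens:
                weights[t] += rate * error
    return weights, bias


def score(model, url):
    """Return the estimated probability that the URL belongs on the blacklist."""
    weights, bias = model
    z = bias + sum(weights.get(t, 0.0) for t in set(url_tokens(url)))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical labelled URLs: 1 = blacklisted, 0 = not blacklisted.
EXAMPLES = [
    ("http://casino-bonus.example/poker", 1),
    ("http://free-poker.example/casino/win", 1),
    ("http://news.example/sport/football", 0),
    ("http://blog.example/python/tutorial", 0),
]

model = train(EXAMPLES)
print(round(score(model, "http://poker-casino.example/"), 2))
print(round(score(model, "http://news.example/python"), 2))
```

Scores above 0.5 lean towards the blacklist. Adding hidden layers or page-content features would be the obvious next step, but the training loop keeps the same shape.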

    • 1 point by inovade 1 year ago 0 children

      Sounds interesting :) I was looking to create something from scratch, but would love to talk more and see how far you've got and the tech/language used. One example use would be classifying news or blog posts, or finding 'similar' ones. I'm not sure how much processing this would require, though I'm looking at AWS (I've already done some stuff with it) and distributing the code across their instances for the processing.

  • 1 point by emjayess 1 year ago 0 children

    Check out OpenCalais from Thomson Reuters: open web content analysis and classification... RDF, semantics, etc. I have begun to use it in some of my Drupal projects.

    pretty interesting work!

    --
    matt j. sorenson

  • 1 point by michaelbuckbee 1 year ago 0 children

    If you're looking for a good "real world" starting point for this type of project, I would recommend language classification. It is basically impossible to reliably determine the natural language of a webpage (French, Russian, etc.) from the HTTP headers, as almost everything falsely reports itself as English.

    To approach the issue reliably you need a substantial corpus of documents in each language to compare against (like both Google and Technorati have).

    From there you could tie this into an API that would accept a URL and return what language the page was written in.
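
A small sketch of the comparison step, assuming the page text has already been fetched and stripped of markup. It builds character trigram counts per language and picks the profile with the highest cosine similarity. The three sample sentences are a toy stand-in for a real corpus, and `detect_language` is a made-up name, not an existing API.

```python
import math
import re
from collections import Counter


def trigrams(text):
    """Count overlapping character trigrams in lowercased, whitespace-normalised text."""
    text = " " + re.sub(r"\s+", " ", text.lower()).strip() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy samples only; usable profiles need far more text per language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and this is the language of the page",
    "french": "le renard brun saute par dessus le chien paresseux et ceci est la langue de la page",
    "german": "der schnelle braune fuchs springt auf den faulen hund und das ist die sprache der seite",
}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def detect_language(text):
    """Return the language whose trigram profile is most similar to the text."""
    profile = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(detect_language("this is the dog"))
print(detect_language("le chien est la"))
```

Wrapping `detect_language` behind an HTTP endpoint that downloads the URL first would give the API described above. The download and HTML stripping are left out here.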

  • 1 point by antiismist 1 year ago 1 child

    If you are thinking of doing text classification, be sure to check out CRM114 or the Lua port of it. It is quite good at this sort of thing.

    Also, there is a good Perl library, bow.

    • 1 point by inovade 1 year ago 0 children

      Thanks for this. This is exactly what I was looking at, so it's good to hear it's one I should definitely look at further.

  • 1 point by mesca 1 year ago 0 children

    I'm thinking of two different approaches:
    - one should be easy to implement and only involves the use of the Semantic Hacker API (http://www.semantichacker.com/);
    - the other could use LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis). There are already a few open source implementations of this algorithm.
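
For reference, the core of the LSA approach can be written down briefly. This is a sketch of the standard formulation in my own notation: $X$ is the term-document matrix (rows are terms, columns are documents), $k$ is the number of latent dimensions kept, and $e_j$ is the $j$-th standard basis vector.

```latex
X = U \Sigma V^{\top}, \qquad
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{d}_j = \Sigma_k V_k^{\top} e_j, \qquad
\operatorname{sim}(d_i, d_j) =
  \frac{\hat{d}_i \cdot \hat{d}_j}{\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
```

Pages whose reduced vectors $\hat{d}_i$ and $\hat{d}_j$ have a high cosine similarity are treated as being about similar things, even when they share few exact words.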

  • 1 point by kieranbenton 1 year ago 1 child

    What kind of classification were you thinking? Sounds like an interesting toy project.

    • 1 point by inovade 1 year ago 0 children

      Someone mentioned CRM114, which I had been looking at. I'd use it to classify pages by category: for example, sports news, business, tech, that kind of thing.

  • 1 point by Josh the Jenius 1 year ago 5 children

    @chrisdew: show your work? I'd love to see more about this.

    • 1 point by chrisdew 1 year ago 2 children

      I'll post some code when I get home tomorrow evening, UK time.

      • 1 point by inovade 1 year ago 1 child

        Chris, I'm in the UK also if you want to chat any time

        • 1 point by chrisdew 1 year ago 0 children

          I've started to put some code up on http://www.finalcog.com. Apologies for not organising pretty printing yet. If you want to look at it, you'll need to look at the HTML source, or wait for me to get formatted code displaying nicely in Drupal.

    • 1 point by inovade 1 year ago 1 child

      Hi Josh. The idea, right now, is embryonic. My thoughts are that there are a lot of applications for something like this, and I think a product could be created that others could tap into for their own needs. I've already done some research on it and will try to get the URLs and post them on here later, but I'm looking for people to collaborate with to create something. Oh, Python is my preferred language also, thought I'd mention it :)

      • 1 point by LukeStanley 1 year ago 0 children

        Dude, check out:

        http://justhackit.slinkset.com/links/Providing_a_place_for_computers_to_track_meaning_and_reason_?c=0

        It's Python coded and we already do some of what you mention. Wanna collaborate?

        email me! :)

        luke at thoughttrail dot com
