Dapper: Web Scraping gets a 2.0
I just took a look at Dapper, and the demo shows what’s basically Web Scraping 2.0. After looking at a few sample pages from a site, Dapper analyzes the page structure and lets the user point and click to pick out fields, instead of the old-school way of viewing the source and hand-writing patterns to match against.
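For contrast, the old-school approach looks something like the sketch below: a hand-written regular expression tied to one specific markup. The HTML snippet, the class names, and the `scrape_products` function are all made up for illustration; none of this is Dapper's code.

```python
import re

# Hypothetical markup, as you might find by viewing a page's source.
SAMPLE_HTML = """
<div class="product"><span class="name">Widget</span><span class="price">$9.99</span></div>
<div class="product"><span class="name">Gadget</span><span class="price">$24.50</span></div>
"""

# Hand-written pattern tied to the exact tag and class layout above.
PRODUCT_PATTERN = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)


def scrape_products(html):
    """Return a list of {name, price} dicts, one per match in the page."""
    return [
        {"name": m.group("name"), "price": float(m.group("price"))}
        for m in PRODUCT_PATTERN.finditer(html)
    ]


if __name__ == "__main__":
    for product in scrape_products(SAMPLE_HTML):
        print(product)
```

The pattern only works as long as the tags, class names, and ordering stay exactly as they were when it was written, which is the pain point-and-click field selection is meant to remove.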
I don’t imagine this is in their business plan, but I’d love to see this packaged as a library with APIs for various languages. Currently Dapper can generate XML, JSON or YAML (among other output formats), but you’re still reliant on their server, which may not be appropriate for internal apps within a company.
It’d also be interesting to see how well such an algorithm could adapt to changes in the markup. I remember a company that offered a “view your other accounts” feature for banking websites. They basically had to hand-code scrapers for a huge array of bank and investment sites, and those scrapers kept breaking as companies changed their layouts. I think at one point they had someone working pretty much full time checking and re-checking sites. Just as with example-based machine translation, could automated markup analysis help with site changes?
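Short of a self-adapting algorithm, at least part of that checking and re-checking can be automated. Here is a minimal sketch, assuming a hypothetical hand-coded balance scraper and made-up bank markup: it doesn't repair anything, it only flags which sites a scraper has stopped working on after a layout change.

```python
import re

# Hypothetical hand-coded scraper for one (made-up) bank layout.
BALANCE_PATTERN = re.compile(r'<td id="balance">\$([\d,]+\.\d{2})</td>')


def scrape_balance(html):
    """Return the balance as a float, or None if the pattern no longer matches."""
    match = BALANCE_PATTERN.search(html)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def find_broken_scrapers(pages):
    """Given {site_name: html}, return the sorted names where scraping failed."""
    return sorted(
        name for name, html in pages.items() if scrape_balance(html) is None
    )


# Made-up pages: the second site has moved the balance into a <span>.
PAGES = {
    "old-layout-bank": '<table><tr><td id="balance">$1,234.56</td></tr></table>',
    "redesigned-bank": '<div><span class="balance">$1,234.56</span></div>',
}

if __name__ == "__main__":
    print(find_broken_scrapers(PAGES))
```

A check like this tells you a site changed, but someone still has to rewrite the pattern by hand, which is where automated markup analysis could earn its keep.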