Screen Scraper Tricks: Extracting Data from Difficult Websites

Presented at DEF CON 17 (2009), Aug. 2, 2009, 2 p.m. (50 minutes)

Screen scrapers and data mining bots often encounter problems when extracting data from modern websites. Obstacles like AJAX discourage many bot writers from completing screen-scraping projects. The good news is that you can overcome most challenges if you learn a few tricks. This session describes the (sometimes mind-numbing) roadblocks that can come between you and your ability to apply a screen scraper to a website. You'll discover simple techniques for extracting data from websites that freely employ DHTML, AJAX, complex cookie management, and other techniques. You will also learn how "agencies" create large-scale CAPTCHA solutions. All the tools discussed in this talk are available for free, offer complete customization, and run on multiple platforms.

Presenters:

  • Michael Schrenk
    Michael Schrenk is a webbot developer and the author of "Webbots, Spiders, and Screen Scrapers" (2007, No Starch Press). He has also written for ComputerWorld, php|architect and Web Techniques magazines. Mike has also presented at DEF CON X, XI and XV. He works for a wide range of clients across North America as well as in Russia, Spain and The Netherlands. Stop by www.schrenk.com and say hello.

Links: