Monday, May 23, 2011

Lanapsoft BotDetect CAPTCHA OCR

I have been working on OCR'ing a new captcha.

At first glance this one appears much more difficult than the one I solved earlier (from d2jsp.org), but after a few lines of pre-processing the image cleans up very nicely.





 

After this pre-processing we are left with a single artifact to deal with: the centered vertical line. Since this artifact is in the same place every time, I simply trained the OCR engine to recognize the letters with the line through them. I used a data set of 333 images for the initial training.
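To illustrate why training with the line left in can work, here is a minimal sketch. It is not my actual OCR engine: the 5x5 glyphs, the `LINE_COL` position, and the nearest-template matching are all made up for the example. The idea it shows is only that a fixed-position artifact becomes part of every stored template, so it adds nothing to the distance between a new glyph and the right template.

```python
# Hypothetical 5x5 glyphs; '#' is ink, '.' is background.
CLEAN = {
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
    "O": ["#####", "#...#", "#...#", "#...#", "#####"],
}
LINE_COL = 2  # assumed column of the fixed vertical line artifact


def add_line(glyph, col=LINE_COL):
    """Draw the vertical line artifact through a glyph."""
    return [row[:col] + "#" + row[col + 1:] for row in glyph]


def distance(a, b):
    """Count the pixels that differ between two same-sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def train(samples):
    """Store each labelled sample as-is, artifact included."""
    return dict(samples)


def classify(templates, glyph):
    """Return the label of the closest stored template."""
    return min(templates, key=lambda label: distance(templates[label], glyph))


# Train on glyphs that already have the line through them.
templates = train((label, add_line(g)) for label, g in CLEAN.items())

# A new "L" with the line through it plus one stray noise pixel.
noisy_l = add_line(CLEAN["L"])
noisy_l[1] = "#.#.#"
print(classify(templates, noisy_l))
```

A real engine would use far richer features than raw pixel differences, but the same reasoning applies: nothing has to be removed as long as the artifact never moves.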

I then ran a second data set of 333 new images through the engine to see what kind of recognition rate I was getting. I needed to train 28 of the 333 images, or 8.408% of them.

On the third and final data set I ran an additional 333 new CAPTCHAs through. I had to train 19 of these, giving me a 5.706% training rate for the third data set.
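For anyone checking the arithmetic, the percentages above come out of a quick calculation like the one below. The function name `training_rate` is just something I made up for this snippet, and the "read correctly" figure assumes that every image I had to train was a misread and every other image was read correctly.

```python
def training_rate(trained, total):
    """Percentage of images in a data set that needed manual training."""
    return 100.0 * trained / total


SET_SIZE = 333  # every data set in this post has 333 images

for name, trained in [("second", 28), ("third", 19)]:
    rate = training_rate(trained, SET_SIZE)
    # Assumption: an image that needed no training was read correctly.
    print(f"{name} set: {rate:.3f}% trained, {100 - rate:.3f}% read correctly")
```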

More to come later... 

PS: I'm new to this whole blogging thing. Does anyone know if it is possible to use LaTeX in these posts? Is there a plugin of some sort?

Captcha OCR #1

I have decided to start building a library of captchas that I have successfully written OCR algorithms for. I am planning to use this blog to help me document my steps along the way and keep a sort of running tally of my progress.

The first captcha I decided to do was one from d2jsp.org. It was definitely one of the easier captchas, as it had a constant font, no permutations, no artifacts, and no color differentiation.




My testing has given me 100% accuracy in reading these captchas so far. These captchas required no pre-processing and were just run, as is, through my OCR engine.