Monday, May 23, 2011

Lanapsoft BotDetect CAPTCHA OCR

I have been working on OCR'ing a new captcha.

At first glance this one appears much more difficult than the one I solved earlier (from d2jsp.org), but after a few lines of pre-processing the image cleans up very nicely.


This pre-processing leaves a single artifact to deal with: the centered vertical line. Since the artifact was in the same place every time, I simply trained the OCR engine to recognize the letters with the line through them. I used a data set of 333 images for the initial training.
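For anyone who wants to check the "same place every time" observation on their own samples, here is a rough sketch. It is not the code I used. It assumes each cleaned-up image has already been turned into a grid of 0s and 1s (1 meaning ink) and that the line runs the full height of the image. Both are assumptions, and the function names are made up for illustration.

```python
def full_height_columns(image):
    """Return the set of column indices where every pixel is ink (1).

    `image` is assumed to be a list of rows, each row a list of 0/1 values.
    """
    height = len(image)
    width = len(image[0])
    return {
        x for x in range(width)
        if all(image[y][x] == 1 for y in range(height))
    }


def shared_artifact_columns(images):
    """Return the columns that are fully inked in every image."""
    shared = None
    for image in images:
        columns = full_height_columns(image)
        shared = columns if shared is None else shared & columns
    return shared or set()


# Two tiny hypothetical 3x5 samples; only column 2 is solid in both.
sample_a = [
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
]
sample_b = [
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
]
print(shared_artifact_columns([sample_a, sample_b]))
```

If the shared set comes back as the same one or two columns across a few hundred samples, training the engine on the letters with the line through them is a reasonable shortcut.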

I ran another data set of 333 new images through to see what kind of recognition rates I was getting. I needed to train 28 of the 333 images, or 8.408% of them.

On the third and final data set I ran an additional 333 new CAPTCHAs through. I had to train 19 of these, giving me a 5.706% training rate for the third data set.
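The percentages above are just the number of images I had to train divided by 333. A quick sketch of the arithmetic follows. The implied recognition rate assumes that every image I did not have to train was read correctly, and `training_rate` is only an illustrative name.

```python
def training_rate(trained, total):
    """Percentage of images in a data set that needed training."""
    return 100.0 * trained / total


DATA_SET_SIZE = 333

for name, trained in [("second", 28), ("third", 19)]:
    rate = training_rate(trained, DATA_SET_SIZE)
    # Assumption: images that needed no training were recognized correctly.
    recognized = 100.0 - rate
    print(f"{name} data set: {rate:.3f}% trained, {recognized:.3f}% recognized")
```

This gives 8.408% and 5.706% for the second and third data sets, matching the figures above.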

More to come later... 

PS: I'm new to this whole blogging thing. Does anyone know if it is possible to use LaTeX in these posts, maybe with a plugin of some sort?
