Thursday, September 26, 2013

Datasets : for the love of OCR

I am back to post some information about the data I used.
There is a lot of training data available 'out there'. Initially, I looked into the FIRE dataset from RISOT, thanks to +Abhishek Gupta, and it might be good for testing purposes. But it is difficult to use for training, as the associated font information is not available.
However, here is a link to some data I generated/used.
Also, for testing, most of the images I used were obtained by a simple Google Image search with relevant keywords, like "bangla newspaper" or "bangla poster".
Since Google personalises results (the so-called filter bubble), I suggest using the generic Google site to get past the country defaults, and logging out of your Google account or using an incognito/private window. This should keep the search results reasonably consistent across platforms, locations and points in time, as well as across user preferences. It is not of the utmost importance, though; just a suggestion.

Friday, September 20, 2013

Self Evaluation

During the last few months, I have seen and learnt a lot. I read about the software and the state of the art relevant to my project. I exchanged emails with some pretty awesome people working in the field, some still continuing with their passion and going strong, others caught up in other phases of life. They inspired me with their work and helped me with guidance.
I made use of knowledge acquired in the past, and gained new knowledge that will be useful in the future. I realised how hard it can be to manage time efficiently, especially when juggling matters of health where you are helpless and work that you desperately want to see through to completion.

My project goals transformed over the course, but I believe I have a deeper understanding of the scenario now. I tried several methods/libraries, tested out my hypotheses that were sometimes right, and sometimes very wrong, and most importantly, I got to work on a project that I had designed myself and was close to my heart.
One major benefit of this project is that I now understand how to manage my projects on a bigger scale and in a better way, and I can already feel the difference it has made. Overall it has been a great life experience.

Thursday, September 19, 2013

Final Report : Part 3

The filters I tested produced results that help in preparing a document for OCR.
In tabular format, the results look like this:





Document type   | Median Filter | Gaussian Filters | Greyscale Conversion | Bilateral Filters | Adaptive Thresholding | Resolution Changes | WhiteWashing Algo
Color Images    | G             | R                | G                    | G                 | G                     | N, A               | Works as intended
NewsPaper       | R             | G                | G                    | G                 | R, S                  | N, A               | Works as intended
Books           | R             | G                | G                    | G                 | R, S                  | N, A               | Works as intended
Posters         | G             | N, A             | G                    | G                 | G, S                  | N, A               | Works as intended


Legend :
R : Recommended
G : Good Results, Useful to have
N : Neutral
A : Adverse Effects
S : Improved results with Whitewashing

Explanations :

All tests were performed on random image samples found via a basic image search on Google.
NOTE : Tests mentioned in column N should be performed before tests in column N+1, i.e. the filters should be applied in the left-to-right order of the table columns.

Median Filters:
These are useful for Color Images, and seem to improve the output in general for all documents.
In the case of samples from newspapers and books, they helped greatly in removing noise.
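For illustration, applying a median filter takes only a couple of lines; this sketch assumes OpenCV (cv2) as the imaging library and a 3x3 kernel, neither of which is pinned down in these posts.

    import cv2

    img = cv2.imread("newspaper_sample.png")   # placeholder file name
    denoised = cv2.medianBlur(img, 3)          # 3x3 median filter removes salt-and-pepper noise
    cv2.imwrite("newspaper_median.png", denoised)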

Gaussian Smoothing:
It helped in smoothing the images before greyscale conversion, but in the case of Posters the smoothing often distorted the text as well.
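A minimal sketch of this step, again assuming OpenCV and an illustrative 5x5 kernel:

    import cv2

    img = cv2.imread("poster_sample.png")      # placeholder file name
    # sigma = 0 lets OpenCV derive it from the kernel size; on posters a
    # kernel this large may also blur thin strokes, so it is worth tuning.
    smoothed = cv2.GaussianBlur(img, (5, 5), 0)
    cv2.imwrite("poster_gaussian.png", smoothed)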

Greyscale Conversion :
It is always better to do a greyscale conversion as there is no downside to it.
Often, using a filter to "skeletonize" the input resulted in improved output. But this was also affected by the resolution, so I suggest not using skeletonization on input with very low resolution.
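A rough sketch of the conversion plus skeletonization, assuming OpenCV for the greyscale step and scikit-image for the skeleton (the posts do not name the libraries); skeletonize needs a binary image, so an arbitrary threshold of 128 is used here.

    import cv2
    from skimage.morphology import skeletonize

    grey = cv2.cvtColor(cv2.imread("book_sample.png"), cv2.COLOR_BGR2GRAY)
    ink = grey < 128                            # foreground (dark pixels) as True
    skeleton = skeletonize(ink)                 # thin every stroke down to one pixel
    cv2.imwrite("book_skeleton.png", (skeleton * 255).astype("uint8"))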

Bilateral Filters and Adaptive Thresholding :
These tests were started before the mid-term, and are explained in detail here and here, respectively.
However, I continued checking the combinations, and they were found to be mostly useful.
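For completeness, here is a sketch of these two stages with OpenCV; the filter diameter, sigmas, block size and constant are illustrative guesses rather than the exact settings from my tests.

    import cv2

    grey = cv2.cvtColor(cv2.imread("newspaper_sample.png"), cv2.COLOR_BGR2GRAY)
    # Bilateral filter: smooths flat regions while keeping character edges sharp.
    smooth = cv2.bilateralFilter(grey, 9, 75, 75)
    # Adaptive thresholding: binarises each 11x11 neighbourhood separately,
    # which copes with the uneven lighting common on newspaper and book scans.
    binary = cv2.adaptiveThreshold(smooth, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 11, 2)
    cv2.imwrite("newspaper_binarised.png", binary)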

I will write one last post about me, myself, life, the universe and everything that happened this summer. So Long.


Tuesday, September 17, 2013

Final Report : Part 2

At the beginning of the project, I spent some time finalising the tools I would be using.
Having done so, I started with the pre-processor, and finally, by the mid-term, a working prototype of the Whitewashing code was completed.

During this period, I also realised how difficult it is to generate good data for training/testing purposes. The results were awful in the beginning, when I got boxes for whole words instead of individual characters, but gradually the quality improved. During these trials, I tried out various methods for generating the data and eventually the box files; some of them were:

  1. Ari's trainer helped a lot in learning about the training procedure.
  2. OCR chopper was easy to use for making box files, but not always accurate; it needed a lot of manual editing.
  3. BoxMaker was similar to OCR chopper, but more flexible in terms of size.
  4. I also tried some data from Parichit.
  5. Open source icr, a project related to Ari's trainer mentioned above, was somewhat helpful.
  6. Some amazing work at Silpa inspired and motivated me, but I ended up not using it because I felt it was not easy to incorporate into the project.
  7. Debayan's Tesseract Indic project was a great help and provided much-needed guidance to get started.
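For reference, the standard Tesseract 3.x-era command for producing a box file from a training image is a one-liner; the file names below follow the usual lang.font.expN convention and are placeholders only.

    import subprocess

    # Ask Tesseract to emit a .box file for a rendered training image.
    subprocess.check_call(["tesseract", "ben.sample.exp0.tif", "ben.sample.exp0",
                           "batch.nochop", "makebox"])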
After trying all these methods, I decided to use Cairo and Pango in combination, and though I had some problems initially, it finally worked out.
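As a rough idea of what the rendering side looks like, here is a minimal Pango-Cairo sketch using the PyGObject bindings; the font name and the sample string are placeholders, and the exact bindings I used may differ.

    import cairo
    import gi
    gi.require_version("Pango", "1.0")
    gi.require_version("PangoCairo", "1.0")
    from gi.repository import Pango, PangoCairo

    surface = cairo.ImageSurface(cairo.FORMAT_RGB24, 600, 120)
    ctx = cairo.Context(surface)
    ctx.set_source_rgb(1, 1, 1)                  # white page background
    ctx.paint()

    layout = PangoCairo.create_layout(ctx)
    layout.set_font_description(Pango.font_description_from_string("Lohit Bengali 32"))
    layout.set_text("আমার সোনার বাংলা", -1)        # placeholder sample text
    ctx.set_source_rgb(0, 0, 0)                  # black glyphs
    PangoCairo.show_layout(ctx, layout)
    surface.write_to_png("train_sample.png")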

Further reading made it clear that the characters in a training set should always be jumbled and mixed. I took it one step further and decided to mix up the sizes of the different images as well, and wrote another script for that. This leads to better training.
For the final leg of this, I was trying to write a Python script, but it was not working. In the end, I switched to ImageMagick, which does the trick with a single command; there was no point in making things more complicated than necessary.
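The exact command is not quoted above, but the idea, driven from Python for batch use, looks roughly like this; the directory layout and the resize percentages are made-up examples, and ImageMagick's convert is assumed to be on the PATH.

    import glob
    import subprocess

    # Rescale every generated training image to a few extra sizes.
    for path in glob.glob("training/*.png"):
        for scale in ("50%", "75%", "125%"):
            out = path.replace(".png", "_" + scale.rstrip("%") + ".png")
            subprocess.check_call(["convert", path, "-resize", scale, out])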

Along with this, I continued my work on testing various filters on different types of documents. I will publish the results in the next post.

Monday, September 16, 2013

Final Report : Part 1

Finally, the summer has come to an end, and it's time to look back on the journey.
It turned out to be very different from what I had imagined at the beginning. To sum up, my work towards GSoC 2013 can be categorised into two parts: the whitewashing algorithm, and the data generation method.

My first plan was to have a pre-processor and a post-processor, which would combine and work around the OCR engine to improve the quality of the final result. But after learning about the InfoRescue project, the plan for the post-processor changed, and in the final implementation I have the pre-processor and a system for generating data easily.
For the results, my initial plan was to chart out the performance of the individual systems, but instead I ended up producing a table that helps in pre-processing the documents.
Here, I will sum up my aimed vs. achieved goals in brief, and then in the following post(s) I will explain in detail the arguments and the train of thoughts/events that led to the changes in the plan.

Set Goals :
  1. Pre-processor : shirorekha chopper
  2. Post-processor : CBR-based
  3. Output matrix : performance-based comparison
Achieved Goals :
  1. Pre-processor : Whitewashing code, a modified shirorekha chopper that paints over the shirorekha instead of chopping only at the gaps (a rough sketch of the idea appears below the lists).
  2. Data generation : a method that helps in generating data that may be used for testing, as well as to make the box files needed to train the system.
  3. Output table : several filters and their effects on different types of documents were tested, in the hope of providing a guideline to better pre-process any document that needs to be OCR'ed.
Key extra takeaways :
  • Learnt a lot about OCR software, specifically about Tesseract.
  • Learnt about Pango-Cairo. 
  • Used IPython and Notebooks with EPD. I am definitely going to use them a lot now.
  • Practised modular development as a single developer on this big a project for the first time.
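As mentioned in the achieved goals above, the Whitewashing code paints over the shirorekha; a rough, hypothetical sketch of that core step for a single word image is below. It assumes OpenCV and NumPy, and the ink threshold and band width are illustrative guesses rather than the parameters in my actual code.

    import cv2
    import numpy as np

    def whitewash(word_img):
        """Paint over the shirorekha (header line) of one word image."""
        grey = cv2.cvtColor(word_img, cv2.COLOR_BGR2GRAY)
        ink = grey < 128                              # True where pixels are dark
        row_density = ink.sum(axis=1)                 # dark pixels per row
        top = int(np.argmax(row_density))             # densest row ~ the shirorekha
        band = max(1, word_img.shape[0] // 20)        # guessed thickness of the line
        word_img[max(0, top - band):top + band, :] = 255   # paint that band white
        return word_img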
In the next few posts, I will explain my reasoning for the changes I made to the plan along the way.
So Long.

Sunday, September 15, 2013

code clean up

It was bugging me to put up the whole code in a single file, especially after having used a very nice, modular structure in the notebooks. So I spent the day scrubbing, and I have finally split it up into manageable parts and pushed everything to the repository.
Also included with the code are instructions on how to use it, along with my personal views and suggestions at some points, in the form of README files.
In the future, I would like to continue and maybe develop a GUI to make it easier to use.  
However, it's time to stop coding, and start documenting whatever is left. 

Saturday, September 14, 2013

porting to py completed

It has been a long but productive day. After a compulsory 40-hour week at school, I have finally completed porting all the code from *.ipynb to *.py and collecting all the code from Windows and Fedora in one place. However, I still prefer the notebook format, and would recommend that everyone use the notebooks if possible.
Tomorrow, I will improve the format of the final matrix and publish it. So Long.