How to remove noise?
I have uploaded images:


Please tell me how to remove the extra white regions that are not text. I need only the text. Please help.
More Answers (3)
Walter Roberson
on 18 Apr 2012
Everything in your images appears to be text. English in one patch, something Arabic-like in another patch, and some alphabet I do not recognize in a third patch.
Everything is text in some alphabet.
7 Comments
kash
on 18 Apr 2012
kash
on 18 Apr 2012
Walter Roberson
on 19 Apr 2012
All three areas in your picture are text, so none of them can be removed.
Wasps read the brick walls of my house every year and find the "enter here" signs easily. I don't know the alphabet the wasps are using, but then I don't know the Arabic-like alphabet in your pictures either.
kash
on 19 Apr 2012
Walter Roberson
on 19 Apr 2012
No, I do not know how to find word boundaries for the Arabic-like string, or for wasp-ish, or for Dalek.
kash
on 20 Apr 2012
Geoff
on 19 Apr 2012
As Walter says, everything is text. Are you planning to use character recognition on this?
In your case, I'm going to assume you want to do the filtering only on the supplied images.
I would try something like this:
1. Detect blobs of connected pixels.
2. Go through your blob list and discard any that don't meet some required criteria (width, height, width/height ratio, number of pixels).
3. You are left with a list of all blobs that meet your 'goodness' criteria, and you construct an image using the pixels of those blobs.
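Steps 2 and 3 could be sketched roughly as below. This is only an illustration in plain Python, not tested on the images in this thread: the blob format (a list of (x, y) pixel coordinates per blob), the function names `filter_blobs` and `render`, and every threshold value are assumptions made up for the example and would need tuning.

```python
def filter_blobs(blobs, min_pixels=3, max_pixels=500,
                 min_aspect=0.2, max_aspect=5.0):
    """Keep only blobs whose pixel count and bounding-box shape pass the criteria.

    blobs: list of blobs, each a list of (x, y) pixel coordinates (assumed format).
    The default thresholds are placeholders, not values tuned for any real image.
    """
    kept = []
    for blob in blobs:
        xs = [x for x, _ in blob]
        ys = [y for _, y in blob]
        width = max(xs) - min(xs) + 1
        height = max(ys) - min(ys) + 1
        aspect = width / height  # width/height ratio of the bounding box
        if (min_pixels <= len(blob) <= max_pixels
                and min_aspect <= aspect <= max_aspect):
            kept.append(blob)
    return kept


def render(blobs, width, height):
    """Build a binary image (rows of 0/1) containing only the given blobs."""
    image = [[0] * width for _ in range(height)]
    for blob in blobs:
        for x, y in blob:
            image[y][x] = 1
    return image
```

Calling `render(filter_blobs(blobs), width, height)` would then give an image holding only the blobs that passed. In MATLAB, the Image Processing Toolbox function `regionprops` reports comparable per-blob measurements (Area, BoundingBox) that can be thresholded the same way.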
An easy way to detect blobs is by the Union-Find algorithm: http://en.wikipedia.org/wiki/Disjoint-set_data_structure
There might be a nicer description somewhere else with pretty diagrams... Use Google.
You move through your image and do a Union on the current pixel with:
* the pixel immediately right (x+1, y)
* the pixel immediately down (x, y+1)
* the pixel immediately down and right (x+1,y+1)
* the pixel immediately down and left (x-1, y+1)
I haven't done this for years, and my brain doesn't feel like digging it up right now. But that's somewhere to start.
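For what it's worth, the Union-Find pass described above might look something like this. It is a minimal sketch in plain Python, assuming the binary image is a list of rows of 0/1 values; the names `find`, `union` and `find_blobs` are just for this example. Only foreground pixels are unioned, and only with the four neighbours listed above, which is enough for 8-connectivity because the other four directions are covered when those neighbours are visited.

```python
def find(parent, i):
    """Return the root of i's set, shortening the path as we go (path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the sets containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def find_blobs(image):
    """Group the foreground pixels of a 0/1 image into 8-connected blobs.

    Returns a list of blobs, each a list of (x, y) coordinates.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    parent = list(range(width * height))  # one set per pixel to start with

    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            # right, down, down-right, down-left
            for dx, dy in ((1, 0), (0, 1), (1, 1), (-1, 1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and ny < height and image[ny][nx]:
                    union(parent, y * width + x, ny * width + nx)

    # Collect pixels by the root of their set.
    groups = {}
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                root = find(parent, y * width + x)
                groups.setdefault(root, []).append((x, y))
    return list(groups.values())
```

Each returned blob can then be checked against the size and shape criteria from step 2. MATLAB users with the Image Processing Toolbox can get the same grouping from `bwlabel` or `bwconncomp` without writing the labelling themselves.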
6 Comments
kash
on 19 Apr 2012
Geoff
on 19 Apr 2012
What line? Which image? You gave us two images. You mean the real messy one? You mean the lines above and below "Hydraulics Laboratory"?
Image Analyst
on 20 Apr 2012
Yeah, it's possible (for this image) that color segmentation might produce a better binary image than whatever kash did. But, like you said, I foresee a great deal of trouble with the mix of languages. I'm also pretty sure all the OCR apps in the File Exchange are fairly unsophisticated and not robust, or depend on characters that match very specific templates, or have other limitations. Robust OCR is not an easy task. I did write some simple code to get the letters based on the (admittedly poor) binary image he supplied here, and it's in my Answer in this thread.
kash
on 20 Apr 2012
Geoff
on 20 Apr 2012
Just to be clear here, are you intending to use this method on other images? Any useful method that you can apply to this image may well not work at all on other images. If you just want to do this image, is there a reason you need to "detect" the lines of text? Cos I know what I'd do... Click, click, click, click, click, click, click, click. Eight user-selected co-ordinates = two bounding polygons. Anyway, this thread seems to have gone beyond 'help with MATLAB' and into 'please write my algorithm' territory.
kash
on 20 Apr 2012
Image Analyst
on 19 Apr 2012
Well, I didn't see any white text whatsoever, only some black text interior to some white surround. (You could make it white, though, with inverting, border-clearing, hole-filling, and some other tricks.) So my output image would be all zero. I think this task, doing OCR on text with very strange fonts, graphics, noise and all kinds of other garbage in the image, is beyond kash's abilities or even mine. There are whole OCR companies with dozens of people who have worked on this for decades, and even their accuracy would be low for these kinds of images.
4 Comments
Walter Roberson
on 20 Apr 2012
I think kash's original image would be the easiest one to work with, http://imgur.com/77Dxh .
Isolate the blue box, convert blue to black, and do the OCR on the white.
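A rough sketch of that idea, in plain Python rather than MATLAB and not run against the actual image: it assumes the image is a list of rows of (r, g, b) tuples in the 0-255 range, and the `margin` and `white_level` thresholds are guesses that would need adjusting. The function names are made up for the example.

```python
def is_blue(pixel, margin=50):
    """True if blue clearly dominates red and green (margin is a guessed threshold)."""
    r, g, b = pixel
    return b > r + margin and b > g + margin


def bounding_box(mask):
    """Return (x0, y0, x1, y1) enclosing all true mask pixels, or None if there are none."""
    coords = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def white_text_in_blue_box(rgb, white_level=200):
    """Crop to the blue region and return a 0/1 image: 1 = near-white, 0 = everything else."""
    mask = [[is_blue(p) for p in row] for row in rgb]
    box = bounding_box(mask)
    if box is None:
        return []
    x0, y0, x1, y1 = box
    return [[1 if min(rgb[y][x]) >= white_level else 0
             for x in range(x0, x1 + 1)]
            for y in range(y0, y1 + 1)]
```

The result is a cropped binary image in which the blue background has become 0 (black) and the white lettering 1, which is the form an OCR step would expect. Stray blue pixels elsewhere in the picture would enlarge the bounding box, so a real version would probably want to keep only the largest blue region first.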
Geoff
on 20 Apr 2012
Oh! I never saw that image. I concur... Do the hard stuff BEFORE filtering out all the useful data from your image.
kash
on 20 Apr 2012
Image Analyst
on 20 Apr 2012
One way to do it is in my code that has the comment "% Crop to bounding box."