In A.I., data is power.
A.I. algorithms, and the art of plugging them into each other, are becoming common knowledge through university courses, online training, and, literally, people watching YouTube videos. Artificial Intelligence is open source, and it should be. What you can do to protect your company from competition is build proprietary datasets.
There are plenty of datasets open to the public: Kaggle, other corporate and academic datasets, and many federal and municipal data sources. We use these for lots of projects, but so can everyone else. What you want is to build something "special" that your competitors don't have. For example, you can't beat Google at search because they know what people search for and you don't. Their advantage is the size and depth of their dataset, rather than market share alone.
We often come across requirements to build an artificial intelligence solution where the client needs an image dataset to move forward, but has no images to start off with. They can't simply use an off-the-shelf solution or API, because the typical 1,000 object classes in off-the-shelf convolutional networks are not as broad as one would hope, and a classifier that differentiates between 2 classes can be a lot more powerful than one that juggles 1,000. There are simply fewer chances to make the wrong prediction when the number of "output classes" (types of things the system can see) is small, and so these specialized models tend to work well.
What I want to walk you through today is one way that we build up these custom image datasets. Let’s talk about the case where there are only 2 classes: infected leaves and healthy leaves. The idea is to use A.I. to distinguish between healthy and sickly leaves in a field somewhere.
To start, we install images-scraper and nodejs, and we limit the images we will scrape to non-https URLs. We make sure the URL ends in '.jpg' and that it is well formed in general.
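A minimal sketch of that URL filter (the helper name is hypothetical, not part of images-scraper):

from urllib.parse import urlparse

def isScrapableUrl(url):
    # Keep only plain http URLs that end in '.jpg' and parse cleanly.
    parsed = urlparse(url)
    return (parsed.scheme == 'http'            # non-https only
            and parsed.netloc != ''            # well formed: has a host
            and parsed.path.endswith('.jpg'))  # '.jpg' on the end

What we need next is a set of keywords to scrape, and so we use the following keyword list to start off with: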
keywordList = ['healthy', 'thriving', 'growing', 'living', 'beautiful', 'nourishing', 'tasty', 'green']
baseKeyword = 'leaf'
From here we generate combinations of the keyword list with the base keyword. For example:
('healthy leaf', 'thriving leaf', 'growing leaf', 'living leaf', 'beautiful leaf')
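In code, that combination step can be a one-liner (a minimal sketch, using the variables above):

# Combine each keyword with the base keyword to form the search terms.
searchTerms = [keyword + ' ' + baseKeyword for keyword in keywordList]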
Next, we pass each of the combinations into a separate nodejs thread. The scraper saves the images it collects into a base directory, with a subfolder for each keyword combination. We then run scripts to remove duplicate and empty files, as sketched below.
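The cleanup scripts can be as simple as hashing file contents. Here is a minimal sketch (an assumed implementation, not our exact scripts):

import os
import hashlib

def cleanDirectory(baseDir):
    # Delete zero-byte files, then delete exact duplicates by comparing
    # MD5 hashes of the file contents.
    seenHashes = set()
    for root, dirs, files in os.walk(baseDir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:  # empty file
                os.remove(path)
                continue
            with open(path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if digest in seenHashes:        # exact duplicate
                os.remove(path)
            else:
                seenHashes.add(digest)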
Here is a full example of how we scrape the images for infected leaves.
import lemayScraper as ls

keywordList = ['sick', 'damaged', 'infected', 'dying', 'bacteria', 'virus', 'sickly', 'wilting']
baseKeyword = 'leaf'

ls.scrapeImages(keywordList, baseKeyword)
At this point we have 2 folders, one containing thousands of images of healthy leaves (and a lot of junk) and the other containing thousands of images of infected leaves (and a lot more junk). The next task is to browse the images by hand and delete the ones that are not relevant to leaves (a baby holding an umbrella), then go through the images again and remove the ones that are the wrong type (a drawing of a leaf, a 3D render of a leaf, etc.). Finally, the human operator combs through the images, putting in as much effort as they feel a first pass needs. In later stages we may choose to crop images or do other image cleanup. At this stage the goal is simply to make sure that bad data does not filter into the training data.


At this point the project to train the artificial intelligence for image recognition can begin. We have seen lots of other requirements for collecting image data, and so the solution for your project may differ. Sometimes we look for images that are similar in terms of perceptual hash, as in the following example:

The images above are all strikingly red. Perhaps they come from a scrape for the words "red" and "stop". We can see how similar they are using a perceptual hash. Here is a minimal sketch using the Python imagehash library (the phash call and the file names are assumptions):
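import imagehash
from PIL import Image

# Perceptual hashes: visually similar images get similar hashes.
s1 = imagehash.phash(Image.open('s1.jpg'))  # English stop sign
s2 = imagehash.phash(Image.open('s2.jpg'))  # interdiction sign
s3 = imagehash.phash(Image.open('s3.jpg'))  # French stop sign

print(s1, s2, s3)        # the three 64-bit hashes
print(s1 == s2)          # exact equality of hashes
print('s1-s2', s1 - s2)  # Hamming distance between hashes
print('s1-s3', s1 - s3)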
The result is the following:
ffe2c38f8983073d c3991c6e763899c3 ffe3c3c0c0c1c0ec
False
s1-s2 47
s1-s3 20
In the first line of the result we see that each image has a unique hash. The difference between the English stop sign (s1.jpg) and the interdiction sign (s2.jpg) is more than the difference between the English and French stop signs (s1.jpg and s3.jpg). That is pretty cool! It understood that stop signs are similar using math!
And so we are able to post-process the scraped images to collect the stuff that is similar to other stuff into piles (subclasses) without human intervention. We have lots of other scripts that help to clean up the scraper output. Once we see a good classification result (confusion matrix) during classifier training, we stop monkeying around. The origins of this image scraper, and some crazy ideas on how to use it are discussed in this paper.
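As a sketch of that hash-based grouping (the greedy assignment and the distance threshold are assumptions, not our exact production script):

import os
import imagehash
from PIL import Image

def groupByHash(imageDir, threshold=10):
    # Greedily assign each image to the first pile whose representative
    # hash is within the threshold; otherwise start a new pile.
    piles = []  # each pile: (representative hash, [file paths])
    for name in sorted(os.listdir(imageDir)):
        path = os.path.join(imageDir, name)
        try:
            h = imagehash.phash(Image.open(path))
        except OSError:
            continue  # skip files that are not readable images
        for repHash, paths in piles:
            if h - repHash <= threshold:
                paths.append(path)
                break
        else:
            piles.append((h, [path]))
    return piles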
As for the licenses you need to use data collected by web scraping, and other IP issues, that's way beyond the scope of this article.
And so, in conclusion, scraping images to form a dataset is totally doable. We are all about fast-paced development and showing value early, and scraping images is an essential part of building custom image recognition solutions.
If you liked this article on building image datasets, have a look at some of my most read past articles, like "How to Price an AI Project" and "How to Hire an AI Consultant." In addition to business-related articles, I have also prepared articles on other issues faced by companies looking to adopt deep learning, like "Machine learning without cloud or APIs."
Happy Coding!
-Daniel [email protected] ← Say hi. Lemay.ai 1(855)LEMAY-AI