Scraping Images for your Image Search Engine
The other day I was shopping online for a new mobile phone. Looking at multiple sites, I found that the one thing I kept referring to was the price (of course!). But there was another aspect I kept searching for, and that was an image of the phone I wanted. I later realized that wherever the description didn’t match the image, my trust in the seller was too low to go ahead with the purchase. And the site where I could find high-resolution images that I could zoom in on and view from multiple angles was the site I stayed on the longest. If images play a prominent role in your shopping or browsing behaviour too, then welcome to the world of image search.
In fact, this trend is so dominant in the online ecosystem that Google, the search engine behemoth, offers image search in addition to the regular text query search. Don’t believe me? Then try dragging one of the images you get from a regular search query into the search box to see what I mean.
See the image to the left of the text search box? That is the image that I asked Google to search, and the results were pretty accurate (that is the Asus ZenFone 3 – one of the many phones I was researching to buy).
Image Search Engines
This form of content retrieval is made possible by an image search engine. You need not depend only on a text query to find information; you can also look up similar images based on a source image you provide to the search engine. This is the USP of an image search engine: it is a search engine designed to find information based on an input image and to display the results visually. The technique is mostly used by e-commerce buyers and sellers, either to look up more information about an unknown object in an image or to learn how competitors are positioning a given product.
You might be wondering what cool algorithm or machine learning runs in the background to allow the search engine to return only relevant, matching images. Well, most of the time it is simple: the engine looks at the name of each image file, and an image is collected and displayed as a search result if its name matches the query. This old-fashioned method is the basic way of scraping images. During web scraping, the tool checks whether the filename contains all or part of the search query and, if so, returns that image.
Most developers, designers, and digital marketers follow the convention of renaming the original filename (something like IMG_10092015.jpg) to something meaningful (something like Earl_Grey_Teabag_1332.jpg). This follows Google's guidance that a descriptive filename is one of the signals that helps an image rank. And this is what the image search engine will look for to provide accurate search results.
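The filename check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular search engine works: the function names, the example.com URLs, and the "count the shared words" scoring are all assumptions made for the example.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_tokens(image_url: str) -> set[str]:
    """Split an image's filename (without extension) into lowercase words."""
    stem = PurePosixPath(urlparse(image_url).path).stem
    return {t for t in re.split(r"[^a-z0-9]+", stem.lower()) if t}


def match_by_filename(query: str, image_urls: list[str]) -> list[str]:
    """Return URLs whose filename contains all or part of the query,
    ordered by how many query words the filename contains."""
    query_words = set(query.lower().split())
    scored = []
    for url in image_urls:
        overlap = len(query_words & filename_tokens(url))
        if overlap:  # keep full and partial matches, drop non-matches
            scored.append((overlap, url))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]


# Hypothetical URLs, reusing the filenames mentioned above.
urls = [
    "https://example.com/img/IMG_10092015.jpg",
    "https://example.com/img/Earl_Grey_Teabag_1332.jpg",
    "https://example.com/img/green_tea_teabag.png",
]
print(match_by_filename("earl grey teabag", urls))
```

The camera-style name IMG_10092015.jpg never matches a product query, which is exactly why a meaningful filename matters for this kind of search.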
Of course, this is just one of the ways to find images using an image search engine. The two key ways in which images are searched online are:
- Metadata Search – As outlined in the above section, the image search is executed by looking up the metadata of the image. This metadata can include one or more of the keywords, caption, alt text, or image name.
- Content based retrieval – Under this type of search, the various characteristics of the source image are run through computer programs and specialized software to return relevant results. Instead of the metadata, this type of search uses the content of the image itself. It has several underlying techniques, listed below:
- Query approach – The user provides a source image, and the program looks at characteristics like shape, color, and size.
- Semantic retrieval – The user describes in words the image they want to find. This is a less used option because of the obvious difficulty of matching an image with the description given in the search query.
- Machine learning – Image search using machine learning can be boosted with the help of neural networks and deep learning.
- Third party applications – Some interesting work is happening around improving the accuracy of the results returned for an image query. A case in point is Google's 2006 acquisition of Neven Vision.
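To make the query approach a little more concrete, here is a minimal sketch that compares two images by color alone, using a coarse color histogram and histogram intersection. It is one simple technique chosen for illustration, not the method any specific engine uses. The images are represented as plain lists of RGB tuples (an assumption that keeps the example free of image libraries), and the three tiny "images" are made up.

```python
from collections import Counter

Pixel = tuple[int, int, int]
Histogram = dict[tuple[int, int, int], float]


def color_histogram(pixels: list[Pixel], bins_per_channel: int = 4) -> Histogram:
    """Bucket each RGB pixel into a coarse bin and return each bin's share."""
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_intersection(h1: Histogram, h2: Histogram) -> float:
    """Similarity between 0 and 1; higher means more similar color mix."""
    return sum(min(share, h2.get(bin_, 0.0)) for bin_, share in h1.items())


# Made-up images: mostly one color on a white background.
red_phone = [(250, 10, 10)] * 90 + [(255, 255, 255)] * 10
red_case = [(240, 20, 15)] * 80 + [(255, 255, 255)] * 20
blue_phone = [(10, 10, 250)] * 90 + [(255, 255, 255)] * 10

query = color_histogram(red_phone)
for name, pixels in [("red_case", red_case), ("blue_phone", blue_phone)]:
    score = histogram_intersection(query, color_histogram(pixels))
    print(name, round(score, 2))
```

Color is only one of the characteristics mentioned above; shape and size would need their own features, and a real system would combine several of them.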
Image scraping helps in obtaining images and their associated data from varied sources and then exporting the images and metadata in a structured manner. Common export formats and destinations include Excel, CSV, XML, or a backend database. Scraping the web for images benefits many people, including web developers, designers, content managers, journalists, marketing executives, and bloggers.
When using a spider to scrape images, the program will look for four key things:
- Title of the page
- Publishing date
- The actual image
- The URL of the site
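As a rough sketch of how a spider might collect these four things from a single page and export them as CSV, the example below uses only Python's standard library. Several things here are assumptions: the class and function names are invented, the page is a made-up HTML string rather than a fetched one, and the publishing date is read from an `article:published_time` meta tag, which many pages do not provide.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImagePageParser(HTMLParser):
    """Collect the page title, a publishing date, and image URLs from HTML."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.title = ""
        self.published = None
        self.images = []  # list of (absolute_url, alt_text)
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property") == "article:published_time":
            # Assumption: the page exposes its date through this meta tag.
            self.published = attrs.get("content")
        elif tag == "img" and attrs.get("src"):
            # Resolve relative image paths against the page URL.
            absolute = urljoin(self.page_url, attrs["src"])
            self.images.append((absolute, attrs.get("alt") or ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape_records(page_url: str, html: str) -> list[dict]:
    """Return one record per image found on the page."""
    parser = ImagePageParser(page_url)
    parser.feed(html)
    parser.close()
    return [
        {"page_title": parser.title.strip(), "published": parser.published,
         "image_url": image_url, "alt_text": alt, "page_url": page_url}
        for image_url, alt in parser.images
    ]


# Made-up page; a real spider would download the HTML first.
sample_html = """
<html><head><title>Asus ZenFone 3 review</title>
<meta property="article:published_time" content="2016-10-09"></head>
<body><img src="/img/asus_zenfone_3_front.jpg" alt="Asus ZenFone 3, front view"></body></html>
"""
records = scrape_records("https://example.com/reviews/zenfone-3", sample_html)

# Export the structured records as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

Downloading the page, following links, and saving the image files themselves are left out to keep the sketch short.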
Interested to know what happens next? Then read on.
Analysis of the image search
Once the program has scraped an image and looked at its metadata and associated content, most of the work is done. However, there still remains the important step of verifying the content of the image file. Suppose you search for Superman; you will get various combinations:
- Superman in comics
- Superman in movies
- Christopher Reeve as Superman
- Henry Cavill as Superman
- Superman in movie posters
- Superman and fans
…and so on
This is the classification stage of image search processing. The engine will ask basic questions:
- Does the image have a face?
- Is it the front profile?
- What is the background color?
- What is the foreground color, and what is its frequency/intensity?
- Is it a free or licensed image?
- What is the file size?
- What is the image resolution?
Some image search engines, like Google, go one step further and allow users to upload their own image as the search query.
Several factors determine how accurate the results shown by the image search engine are. If the input image has any of the problems below, the chances of returning accurate results go down significantly:
- Too much noise in the background
- Too many colors in either the foreground or background
- Too little detailing, or
- Lower resolution of the input image
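A few of these checks can be approximated with very simple code. The sketch below flags low resolution, too many colors, and too little detail for an image given as a flat list of RGB tuples. The function name and every threshold are arbitrary assumptions chosen for illustration, and background noise is not measured at all.

```python
def input_quality_warnings(pixels, width, height,
                           min_side=200, min_colors=4, max_colors=48):
    """Return a list of warnings about an input image.

    pixels is a flat list of (r, g, b) tuples of length width * height.
    All thresholds are illustrative assumptions, not established values.
    """
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match width * height")

    warnings = []
    if min(width, height) < min_side:
        warnings.append("low resolution")

    # Count distinct coarse colors (4 levels per channel, 64 bins at most).
    distinct = len({(r // 64, g // 64, b // 64) for r, g, b in pixels})
    if distinct < min_colors:
        warnings.append("too little detail")
    elif distinct > max_colors:
        warnings.append("too many colors")
    return warnings


# Made-up inputs: a small flat gray image and a larger image with 8 colors.
flat_gray = [(128, 128, 128)] * (100 * 100)
eight_colors = [((i % 2) * 255, ((i // 2) % 2) * 255, ((i // 4) % 2) * 255)
                for i in range(300 * 300)]

print(input_quality_warnings(flat_gray, 100, 100))
print(input_quality_warnings(eight_colors, 300, 300))
```

A check like this could run before the search itself, so that a user who uploads a poor image is asked for a better one instead of being shown inaccurate results.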
Now we look at another method of classification, namely clustering. This tries to put all images with similar content together in one group. Carrying forward the above example, clustering will put together all these combinations of Superman and even include related items like Superman vs. Batman or Superman cartoons. Again, this will provide accurate results only if the noise in the image is low and the resolution is high.
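Clustering can be illustrated with a very small example. The sketch below uses a simple "leader" clustering scheme: each image joins the first group whose first member is close enough, otherwise it starts a new group. This is just one easy-to-read technique picked for the example. The filenames, the two-number feature vectors, and the distance threshold are all made up; in practice the features would come from the image content.

```python
import math


def leader_clustering(features: dict[str, list[float]],
                      max_distance: float) -> list[list[str]]:
    """Group images whose feature vectors lie within max_distance
    of a group's first member (its "leader")."""
    clusters = []  # each entry: (leader_vector, [image names])
    for name, vector in features.items():
        for leader, members in clusters:
            if math.dist(leader, vector) <= max_distance:
                members.append(name)
                break
        else:
            # No existing group is close enough, so start a new one.
            clusters.append((vector, [name]))
    return [members for _, members in clusters]


# Hypothetical feature vectors for four scraped images.
features = {
    "superman_comic.jpg": [0.9, 0.1],
    "superman_poster.jpg": [0.8, 0.2],
    "city_skyline.jpg": [0.1, 0.9],
    "superman_cartoon.jpg": [0.85, 0.15],
}
print(leader_clustering(features, max_distance=0.3))
```

With these made-up numbers the three Superman images end up in one group and the skyline in another. Noisy or low-resolution images would produce less reliable features, which is why the grouping degrades in those cases.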
Scraping the images
Getting hold of a large number of images is crucial for building an image search engine. Acquiring huge amounts of data requires a scalable web scraping solution. Web scraping is the most convenient way of acquiring data from the web, be it structured data, URLs, or images. It is better to rely on a web scraping service provider for scraping images for your image search engine.
Before signing off
As is evident, the value provided by an image search engine goes far beyond accuracy. It helps shoppers make an informed purchase decision and get the most out of their web experience. For e-commerce owners, it helps gather crucial intelligence on product assortment at rivals’ stores and keeps them up to date on the data around a specific product. So if most store owners have the iPhone 6s retailing at around $825, you would know that your store too would have to match this price in order to convert web traffic at your e-commerce portal. This way, image search also helps with pricing intelligence.
Planning to acquire data from the web? We’re here to help. Let us know about your requirements.