News aggregation means compiling news articles from different websites and forums into a single database. The practice has been around for quite some time, but news aggregators have started adding features such as showing related news alongside the article you are viewing, or customizing your news feed based on your past usage. At the core of a modern news aggregator, however, is web scraping, and that is what we will be discussing today.
What is News Aggregation?
Most news aggregators go through the following steps to get their content to the masses:
a. They gather data by crawling popular news websites. They also query search engines to find important stories that are covered only by regional or smaller news outlets. All this information is sorted and stored along with links to the original articles.
b. A short intro for every featured article is extracted from the raw data. This is used as a preview; clicking on it sends the user to the actual website. Generally the intro ends up being the first paragraph. It can even be just the heading and a single line, in cases where the news is a single video clip or something else that lacks an introductory paragraph or any textual data.
c. Related articles are clustered so that a user has more to read once they start with a particular article. Often the articles are also sorted along a timeline. Suppose you are reading an article about a court verdict on a land grab issue. Links to all the earlier articles related to the case might also be shown in a sidebar, so that you get the entire picture.
d. Often there is more than one article on a single topic, all carrying exactly the same factual data. In that case, the news aggregator has to decide which article to show, because giving multiple links for the same news is not helpful. The deciding factor is usually which article summarizes the entire context better.
e. You will often see that the link to a news article is accompanied not only by a short text but also by an image or a graph. This visualization is part of the news aggregator's work, and might not be taken from the article itself. The visualization is a simple trick: you see the graph, photo or cartoon and become interested in it, then you read the brief introduction, and eventually you open the link and read the entire article.
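Step (b) above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library's html.parser, not how any particular aggregator does it. The names PreviewParser and build_preview and the 200-character limit are assumptions made for the example, and it assumes the page's first non-empty paragraph tag really is the article's intro, which will not hold on every site.

```python
from html.parser import HTMLParser


class PreviewParser(HTMLParser):
    """Collects the <title> text and the first non-empty <p> of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.first_paragraph = ""
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start collecting at <title>, or at <p> while no intro has been found yet.
        if tag == "title" or (tag == "p" and not self.first_paragraph):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        elif text:
            self.first_paragraph = text
        self._current = None


def build_preview(html, max_chars=200):
    """Return a dict with the page title and a shortened intro (hypothetical helper)."""
    parser = PreviewParser()
    parser.feed(html)
    intro = parser.first_paragraph
    if len(intro) > max_chars:
        # Cut at the last word boundary before the limit.
        intro = intro[:max_chars].rsplit(" ", 1)[0] + "..."
    return {"title": parser.title, "intro": intro}
```

When a page has no paragraph at all (a lone video clip, for instance), intro comes back empty and the preview falls back to the title alone, which matches the heading-only case described in step (b).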
How can Web Scraping benefit News Aggregators?
1. Collect news articles efficiently
Businesses have to get their main product or offering right before they turn to polish and presentation. For news aggregators, that product is the news articles they collect from the internet. Here, web scraping involves not only getting articles from top websites but also searching for specific keywords in local and smaller news media. That way, news aggregators can surface more news for local readers and at the same time give visibility to smaller players who are responsibly covering civic issues and criminal investigations in certain regions.
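As a rough sketch of the keyword search mentioned above, the snippet below filters already-scraped articles by a list of keywords. The function name find_keyword_matches and the "title" and "body" fields are assumptions for illustration; a real pipeline would likely use a proper search index rather than a regular expression.

```python
import re


def find_keyword_matches(articles, keywords):
    """Return the articles whose title or body mentions any keyword.

    `articles` is assumed to be a list of dicts with "title" and "body" keys.
    Matching is case-insensitive and only on whole words or phrases.
    """
    if not keywords:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [a for a in articles if pattern.search(a["title"] + " " + a["body"])]
```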
2. Collect links of articles and videos
When you give a summary of a news item on your news aggregation website, you must also provide the link to the article on the original website. This link might already have been scraped and stored in your database. These links are important because a reader who finds the summary interesting may well want to read the entire article and gain a full understanding of the situation.
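Scraped links are often relative, or carry tracking parameters and fragments, so the same article can end up stored under several URLs. The following is a minimal sketch of cleaning a link before storing it, using only urllib.parse from the standard library. The helper name canonical_link and the choice to strip parameters starting with "utm_" are assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Query parameter prefixes treated as tracking noise (an assumption for this sketch).
TRACKING_PREFIXES = ("utm_",)


def canonical_link(page_url, href):
    """Resolve `href` against the page it was found on and strip noise."""
    absolute = urljoin(page_url, href)
    parts = urlsplit(absolute)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES)
    ]
    # Lowercase the host and drop the fragment; keep path and remaining query.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), "")
    )
```

Storing the cleaned form alongside the summary means two scrapes of the same article map to one database entry.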
3. Build News Timelines
Often, for a single event, you will get more than one news article from different news sites. If it is a big story, new developments may keep coming in every few days or weeks. It is your responsibility to collect all these articles, remove repetition among similar articles by keeping the one with the best summary, and build a timeline of events for the entire episode. A reader can then understand how the event unfolded, what actually happened, how the authorities dealt with it, and what the final outcome was. This way, the reader gets access to a historical timeline of a newsworthy story.
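The deduplicate-then-sort idea can be sketched as follows. This is an illustration under stated assumptions, not a production method: it assumes the articles have already been grouped under one event, measures similarity with a simple word-overlap (Jaccard) ratio, and relies on a hypothetical summary_score field to stand in for "the best summary". The names Article, jaccard and build_timeline and the 0.8 threshold are all invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    title: str
    text: str
    published: date
    summary_score: float  # hypothetical quality score; how to compute it is out of scope


def jaccard(a, b):
    """Word-overlap similarity between two texts, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def build_timeline(articles, threshold=0.8):
    """Drop near-duplicates (keeping the best-scored one), then sort by date."""
    kept = []
    # Visit the best-scored articles first so they win against their duplicates.
    for article in sorted(articles, key=lambda a: a.summary_score, reverse=True):
        if all(jaccard(article.text, k.text) < threshold for k in kept):
            kept.append(article)
    return sorted(kept, key=lambda a: a.published)
```

Comparing every article against every kept article is quadratic, which is fine for the handful of articles on one event but would need a smarter approach across a whole database.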
4. Web scrape comments and news articles
How do you know which article is better written when different versions of it appear on similar news websites? One option is manual intervention, but that is best reserved for unusual situations, since it is costly and cannot be implemented at scale. Instead, one could build an intelligent scraping mechanism, with the help of a web scraping service like PromptCloud, that detects the number of thumbs up and positive comments on an article and delivers only the articles with the best statistics.
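Once thumbs-up counts and comments have been scraped, choosing between versions could look something like the sketch below. Everything here is an assumption for illustration: the "thumbs_up" and "comments" fields, the tiny POSITIVE_WORDS list standing in for real sentiment analysis, and the weight given to positive comments. It does not represent any scraping service's actual API.

```python
import re

# A tiny illustrative word list; a real system would use proper sentiment analysis.
POSITIVE_WORDS = {"great", "good", "helpful", "clear", "excellent"}


def count_positive(comments):
    """Count comments that contain at least one positive word."""
    return sum(
        1
        for comment in comments
        if POSITIVE_WORDS & set(re.findall(r"[a-z']+", comment.lower()))
    )


def engagement_score(version, comment_weight=2.0):
    """Thumbs up plus weighted positive comments (the weight is an assumption)."""
    return version["thumbs_up"] + comment_weight * count_positive(version["comments"])


def best_version(versions):
    """Return the version of a story with the highest engagement score."""
    return max(versions, key=engagement_score)
```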
5. Capture trends among people who read news online
Certain online news sites are more popular than others, although in theory every website covers the same news. You can crawl the top news and news aggregator websites to see what makes their sites click. You can also capture reader behavior on those websites by going through comments, most viewed articles and more. Systematic checks on your competitors can help you remain in business for longer.
News and media is a big business and, like any other business, it needs technology to reduce operational costs and remain viable. Web scraping and intelligent systems can provide this edge to news aggregators.