News aggregation is the practice of compiling news articles from different websites and forums into a single database. While this has been happening for quite some time, news aggregators have started using strategies like showing related news while you are reading an article, or customizing your news feed based on your past usage. But the core of a modern news aggregator is web scraping, and that is what we will be discussing today.
Most news aggregators follow these steps to get their content to the masses:
a. They gather data by crawling popular news websites. They also search for news in search engines to find important stories covered by regional or smaller outlets. All this information is sorted and arranged along with links.
b. A short intro for every featured article is extracted from the raw data. This serves as a preview; clicking it takes the user to the original website. Generally, this ends up being the first paragraph. Where the news is a single video clip or otherwise lacks an introductory paragraph or textual data, the preview may be just the heading and a single line.
c. Related articles are clustered together so that a reader who starts with one article can dig deeper. Articles are often also sorted along a timeline. Suppose you are reading an article about a court verdict in a land-grab case: links to all the earlier articles on the case might be shown in a sidebar so you can get the entire picture.
d. Often there is more than one article on a single topic, carrying exactly the same factual data. In that case, the aggregator has to decide which article to show, because giving multiple links for the same news is not helpful. The decisive factor is usually which article summarizes the entire context better.
e. You will often see that the link for a news article is accompanied not only by a short text but also by an image or a graph. Selecting this visual is part of the aggregator's work, and it might not be taken from the article itself. The visual is a simple hook: you see the graph, photo, or cartoon and become interested in it; then you read the brief introduction; and eventually you open the link and read the entire article.
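Step (b) above, extracting a preview snippet, can be sketched in a few lines. This is a minimal illustration, not a production extractor; the function name and the fallback rules (first paragraph, else the heading alone) follow the description above.

```python
def extract_preview(title: str, body: str, max_chars: int = 200) -> str:
    """Build a preview snippet for an aggregated article.

    Uses the first paragraph of the body when one exists; otherwise
    (e.g. a video-only post with no text) falls back to the title.
    """
    # Treat blank lines as paragraph separators in the scraped text.
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    if not paragraphs:
        return title.strip()
    first = paragraphs[0]
    if len(first) <= max_chars:
        return first
    # Cut at the last word boundary that fits, then add an ellipsis.
    return first[:max_chars].rsplit(" ", 1)[0] + "…"
```

A real aggregator would work on parsed HTML rather than plain text, but the fallback logic stays the same.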
Businesses have to focus on their core product or offering before polishing everything else. For news aggregators, that core is the news articles they collect from the internet. Here, web scraping involves not only getting articles from top websites but also searching for specific keywords in local and smaller news media, so that aggregators can surface more news for local readers and, at the same time, give visibility to smaller outlets that are responsibly covering civic and criminal investigations in their regions.
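The keyword search over smaller outlets could be as simple as a watch-list filter applied to crawled text. A minimal sketch, where the function name and keyword list are illustrative assumptions:

```python
def matches_local_beat(text: str, keywords: list[str]) -> bool:
    """Return True if the article text mentions any watched keyword.

    Case-insensitive substring match; a real system would use
    stemming or a proper search index instead.
    """
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)
```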
When you present a news summary on your aggregation site, you must also provide the link to the article on the original website. This link has likely been scraped and stored in your database already. These links are important because a reader who finds a summary interesting will often want to read the entire article and gain a full understanding of the situation.
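One way to keep the summary and its source link together is a single record keyed by the original URL, so re-crawls update in place rather than duplicating entries. The record fields and store shape below are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class ArticleRecord:
    title: str
    preview: str
    source_url: str   # link back to the original publisher
    published: str    # ISO-8601 date string

# In-memory stand-in for the aggregator's database.
articles: dict[str, ArticleRecord] = {}

def store_article(rec: ArticleRecord) -> None:
    # Keying by source URL means re-crawling a page updates in place.
    articles[rec.source_url] = rec
```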
Often, for a single event, you will get more than one news article from different news sites. For a big event, new developments may keep coming in every few days or weeks. It is your responsibility to collect all these articles, remove repetitions among similar pieces by keeping the one with the best summary, and build a timeline of events for the entire episode, so that a reader can understand how it started, what actually happened, how the authorities dealt with it, and what the final outcome was. This way, the reader gets access to a historical timeline of a newsworthy story.
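The deduplicate-then-order logic above can be sketched generically: group articles by some event key, keep the best-scored article per group, and sort by date. The key and scoring functions are left to the caller, since the source does not prescribe how "best summary" is measured:

```python
from collections import defaultdict

def build_timeline(articles, key_fn, score_fn):
    """Group articles describing the same development, keep the
    best-scored one per group, and return them in date order."""
    groups = defaultdict(list)
    for art in articles:
        groups[key_fn(art)].append(art)
    # One representative per development: the highest-scored article.
    best = [max(group, key=score_fn) for group in groups.values()]
    # ISO-8601 date strings sort correctly as plain strings.
    return sorted(best, key=lambda a: a["date"])
```

For example, scoring by summary length is one crude proxy for "best summary"; engagement signals (discussed below) are another.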
How do you know which article is better written when you have different versions of it on similar news websites? One option is manual intervention, but that should be reserved for unique situations, since manual review is costly and cannot be implemented at scale. Instead, one could build an intelligent scraping mechanism, with the help of a web scraping service like PromptCloud, that detects the number of thumbs-ups and positive comments on an article and delivers only the versions with the best statistics.
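Once engagement counts have been scraped, picking the winning version is a simple maximization. The weighting below (a positive comment counting double a thumbs-up) is purely an illustrative assumption, not a published formula:

```python
def pick_best_version(versions):
    """Choose, among duplicate articles, the one with the strongest
    scraped reader engagement.

    Weighting is an assumption: positive comments count twice as
    much as thumbs-ups.
    """
    def engagement(article):
        return (article.get("thumbs_up", 0)
                + 2 * article.get("positive_comments", 0))
    return max(versions, key=engagement)
```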
Certain online news sites are more popular than others, even though in theory every site covers the same news. You can crawl the top news and aggregator websites to see what makes their sites click with readers. You can also capture customer behaviour on their sites by going through comments, most-viewed articles, and more. Systematic checks on your competitors can help you remain in business for the long run.
News and media is a big business, and like any other business, it needs technology to reduce operational costs and remain viable. Web scraping and intelligent systems can provide this edge to news aggregators.