Machine learning and artificial intelligence are the buzzwords of our time. Upon hearing them, our minds conjure up self-driving cars, robots that help us in our day-to-day lives, or Terminators taking over the world. While all three are possible, most machine learning applications are much simpler: identifying cancerous cells, spotting forest fires in satellite images, or correctly identifying billboards in pictures.
For all such applications, however, a model first has to be built and then applied to the use case. This model needs large amounts of training data; the more, the better. Usually, you'd need millions of rows of data to train a model well enough to apply it to real-world scenarios.
Where do companies get such data? Do you remember how Google asks you to identify zebra crossings, cars, house fronts, traffic signals, and billboards in images to verify that you are human? Google is getting you to label these data points so that it can train its machine learning models. But not everybody is Google, and not everybody has the resources to get millions of people to help train their ML models. So what do people working on research projects or market research do when they want to solve a problem using ML? Where do they get the data?
This article covers 15 data sources that are great for anyone who is building an ML model and needs high-quality data to train it well enough for real-life scenarios.
Google Dataset Search – When it comes to searching for things on the internet, the first word that comes to mind is Google. Hell, most of us even use Google as a way to get around the internet, because we usually forget URLs and links. While almost any dataset on the internet can be found using a normal Google search, finding a particular dataset might not be simple, since everyday users may not know the exact keywords to search for. This is why Google came out with its Google Dataset Search engine. The tool helps make datasets stored across the web universally accessible. Google believes that this global dataset-discovery platform will have two important benefits:
DataStock – DataStock is a solution created by our team at PromptCloud that lets you instantly download clean, ready-to-use datasets from the web. Using these large, clean datasets, you can perform analyses, derive insights, and train machine learning algorithms.
The process of getting datasets is simple and consists of just three steps:
Kaggle – Most people know Kaggle as a website where people come together to solve machine learning problems and compete to build the most accurate predictive models. Machine learning enthusiasts, aspiring data scientists, and professionals in the field come together on this platform to solve problems or optimize solutions. All this work also involves loads of datasets, which are freely available in Kaggle Datasets. It is a great place for aspiring ML engineers to get their hands dirty with datasets that are being used by other people. That way, they can compare their findings with a large number of peers and get authentic feedback and more help.
UCI – The University of California, Irvine's Machine Learning Repository is one of the best places to start for new learners, when you want to run different algorithms and find which is most efficient, or when you just want to try out a new solution. For example, when I was learning how to use TensorFlow in Python, I used the UCI iris dataset to test my code. And since the repository holds so many different varieties of data (many of which have been used in published research papers), there are a lot of use cases you can cover with them.
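To give a feel for how small and convenient the iris dataset is for testing a pipeline, here is a minimal sketch. It uses scikit-learn rather than TensorFlow purely for brevity (that substitution is mine); scikit-learn bundles a copy of the UCI iris data, so no download is needed.

```python
# Sketch: train and evaluate a simple classifier on the UCI iris dataset.
# scikit-learn ships a bundled copy of the data (150 samples, 4 features,
# 3 classes), so this runs offline.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions
print(f"test accuracy: {accuracy:.2f}")
```

Because the dataset is tiny and well behaved, almost any classifier scores highly on it, which is exactly what makes it useful as a smoke test for new code.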
JobsPikr – JobsPikr is another tool that helps you get datasets (also built by our team at PromptCloud). This solution serves clients in the job industry: it helps job boards, recruitment agencies, and other job-matching applications gather job listings, sorted and filtered by factors like location, title, and keywords. It is an automated, no-maintenance, ready-to-use feed that can be customized to your needs. It is updated with fresh data in real time, which makes it a favorite of the recruitment industry, and it offers multiple delivery options such as API, S3, and direct download.
Awesome Public Datasets – Unlike most other sources on this list, this one is a GitHub page. I found it while researching financial data, and quite frankly, I have never seen a data repository this large and well organized. It has datasets (or links to them) under topics ranging from the Human Genome Diversity Project to the Yelp™ Dataset Challenge. So if you have a problem for which you can't find a dataset, or you are looking for a dataset of a particular kind, you can always look for it in this repository.
Nature – One of the world's top journals, Nature is known mainly for the famous research papers published in it, including groundbreaking discoveries like J. Chadwick's "Possible Existence of a Neutron". But as you might already know by now, where there are research papers and scientific discoveries, there are bound to be data repositories. Nature's repository provides links to multiple data sources that have been used in past research, as well as ones that still await wider and deeper exploration.
CMU Datasets – Carnegie Mellon University in Pittsburgh runs some of the best machine learning research in the world. It is also gathering some of the best data sources from around the world, and you can even recommend new sources to them or contribute datasets that you have mined or collected on your own. You can learn more about the datasets they are using here.
Data.gov – Described as the home of the U.S. Government's open data, Data.gov offers not just data but also tools and resources that can help with market research or with building new web or mobile apps targeting a specific audience. Here are the topics that are covered:
NCES – Education is an extremely important aspect of society, and since new policies are debated every year, it's important to have data at hand to ensure that all decisions are data-driven. Companies working on solutions to improve children's education can also use the data to design better products. For such data, the NCES (National Center for Education Statistics) maintains a huge repository.
Labeled Faces in the Wild – Facial recognition is one of the top technologies in development today, and it has found many uses: it does away with manual check-in at airports, helps the police track down dangerous criminals, and is even being used to monitor the social lives of citizens in China. Much of the work consists of creating algorithms, training them on millions of faces, and hoping that the next time a new face appears, the system recognizes it as new and saves it accordingly. Getting facial data for so many people is not easy, so if you are a hobbyist writing code to build models on facial data, the Labeled Faces in the Wild repository hosted by the University of Massachusetts, Amherst contains more than 13,000 images of faces collected from the web.
Visual Genome – Identifying objects in images is something people have been doing for years. However, identifying multiple objects, such as a human being and a motorbike, and then finding the association between them is much harder and has seen far less success. For finding such associations in images and training new algorithms, the Visual Genome project has created a repository with:
xView – xView is one of the largest public datasets of overhead imagery. It contains different types of terrain and locations around the world, annotated with bounding boxes. xView also comes with a pre-trained baseline model, built with the TensorFlow object detection API, along with a PyTorch implementation. This means that when you train a model on the xView dataset, you can compare its performance against the baseline to see whether you have done any better.
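When comparing your model against a baseline on bounding-box annotations like xView's, the standard per-box metric is intersection-over-union (IoU). Here is a minimal sketch; the `(xmin, ymin, xmax, ymax)` box layout is my assumption for illustration, so check the dataset's actual annotation format before reusing it.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned bounding boxes.

    Boxes are (xmin, ymin, xmax, ymax) tuples; this layout is an
    assumption for illustration and may differ from xView's format.
    """
    # Corners of the intersection rectangle (empty if boxes are disjoint).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Identical boxes overlap fully; disjoint boxes not at all.
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0
```

A prediction is typically counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5, which is how detection benchmarks turn box geometry into accuracy numbers.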
World Bank – The World Bank is one of the largest international financial institutions in the world, and its databank is one of the largest repositories of financial data on countries around the globe. The data can be used by corporations wanting to expand into new regions, or by researchers who want to study financial datasets from different parts of the world.
Quandl – If you are looking for data on corporations or for market research, Quandl is a great place to look. It describes itself as the best source for "financial, economic, and alternative datasets, serving investment professionals". It offers different options for different types of clients: core financial data for everyone, and alternative datasets for institutional clients. Datasets range from end-of-day US stock prices to company hiring activity.
Data science, machine learning, and artificial intelligence are all vast fields, but the crucial factor in all three is data. By using the data sources widely available today (most of which originate on the World Wide Web), you can train your models to perform different tasks and make different predictions. And the only way to make your predictive models and artificial intelligence algorithms better is to train them on new datasets. One thing is for sure: "Data is the new oil."