Gregory Piatetsky-Shapiro, Ph.D., is a well-known data scientist, founder of KDnuggets and co-founder of the KDD conferences. He regularly features as a top influencer (LinkedIn Top Voice, 2018) in the knowledge discovery, data mining and data science domains. Gregory has also produced over 60 publications and edited several books and collections on data mining and knowledge discovery.
Fascinating fact: KDnuggets was established way back in 1997!
Recently, we got in touch with him for a quick conversation about the current and future state of data science, its implications for business growth and its key challenges.
We enjoyed conducting this interview, and we’re certain that our readers are going to love the insights shared by Gregory.
[PromptCloud] Generally, data science is considered an intersection of mathematics, statistics and computer science. In terms of importance, how much weight would you assign to these components? And how does domain knowledge come into play?
[Gregory] There is a meme that says “a Data Scientist is someone who knows statistics better than a programmer, and codes better than a statistician”. Data Science is a fundamentally interdisciplinary field, and the classic Venn diagram by Drew Conway from 2010 includes 3 circles: Hacking/Coding, Math/Statistics, and Domain/Business Knowledge.
Mathematics, apart from statistics, has not generally been required for Data Science. However, knowledge of Algebra and Calculus, the fundamentals of math, has recently become much more important for understanding Deep Learning, currently the most successful method for building Machine Learning models.
Understanding the domain/business is probably the most important component for applied Data Science, since it helps to focus on the right questions to ask. A brilliant mathematician can build a great Deep Learning model, but what if she chooses the wrong problem to solve, one that does not help in the real world?
Another Data Science maxim is “More data beats better algorithms”, and my addition to it is “Better question beats more data”.
[PromptCloud] Since data science is a hot topic and undoubtedly a lucrative career option, it is in huge demand in the student community. So, what specific statistical knowledge and/or software tools should companies look for when hiring new recruits for data science?
[Gregory] The most popular languages now for Data Science are Python, R, and SQL, but the top tools also include Anaconda, scikit-learn, TensorFlow, Keras, Apache Spark, RapidMiner, Excel, and Tableau. My recent blog on the 6 components of the open-source Data Science/Machine Learning ecosystem has additional details.
[PromptCloud] What are some of the underused data sets with limited visibility in the machine learning domain? Perhaps valuable data sets that data scientists or machine learning practitioners don’t know about.
[Gregory] I don’t have any specific recommendations, but check data search engines like data.world, the KDnuggets page, or Google Dataset Search. And remember that a valuable dataset should be connected to a valuable problem.
[PromptCloud] How will data science and AI evolve in the next 5-10 years? Can the majority of data science job roles be automated?
[Gregory] Currently, it is a golden age for Data Scientists, with amazing and mostly free/open source tools.
However, in a recent KDnuggets Poll asking “When will most expert-level Predictive Analytics/Data Science tasks currently done by human Data Scientists be automated?”, 51% of voters expected this to happen in 10 years or less.
There are already companies like DataRobot, H2O and more that offer automated solutions.
I expect that such solutions will make Data Scientists more productive but will eliminate demand for lower-level data analysts by enabling more business users to access insights from data science. To stay relevant, higher-level Data Scientists need to focus not only on the latest algorithms, which are easier to automate, but also on understanding the business/domain and helping to ask the right question.
[PromptCloud] What are some of the typical mistakes that companies make while applying data science to solve business problems?
[Gregory] There are two main issues:
1. Overfitting the data
2. Solving the wrong problem
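Editorial illustration (not part of Gregory's answer): the first issue, overfitting, can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not a recipe. The synthetic data, the "true" relationship y = 2x + 1, and the helper names `make_data`, `fit_line`, `fit_memorizer` and `mse` are all hypothetical and chosen only for this sketch. It compares a model that memorizes its training points with a simple straight-line fit, and prints the error of each on the training data and on fresh data.

```python
import random


def make_data(n, rng):
    """Noisy samples of a hypothetical ground truth y = 2x + 1 (assumed for this sketch)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 2.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predicts the label of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(30, rng)
test_x, test_y = make_data(30, rng)

memorizer = fit_memorizer(train_x, train_y)
line = fit_line(train_x, train_y)

print("memorizer train MSE:", mse(memorizer, train_x, train_y))
print("memorizer test MSE: ", mse(memorizer, test_x, test_y))
print("line train MSE:     ", mse(line, train_x, train_y))
print("line test MSE:      ", mse(line, test_x, test_y))
```

The memorizing model reproduces every training label exactly, so its training error is zero, yet it still makes errors on data it has not seen. That gap between training and test error is the usual symptom of overfitting, and it is why models are evaluated on held-out data.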
This sums up the interview. The most prominent theme is domain knowledge, since it enables professionals to ask the right questions.
Note: Gregory could not answer a few other questions because of confidentiality clauses signed in his previous engagements.