Is Data Science The End of Statistics? A Discussion

Here is an interesting discussion on LinkedIn, started by a provocative post "Data Science: The End of Statistics?" What is the relationship between Data Science and Statistics, and in what sense is Statistics "ending"?



Here are the highlights from an interesting discussion in the KDnuggets LinkedIn group.

It was started by a provocative post by Larry Wasserman, a Professor in the Depts. of Statistics and Machine Learning at CMU, who wrote Data Science: The End of Statistics?

As I see newspapers and blogs filled with talk of "Data Science" and "Big Data" I find myself filled with a mixture of optimism and dread. Optimism, because it means statistics is finally a sexy field. Dread, because statistics is being left on the sidelines.

The very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data - a field called statistics - is alarming.

Vincent Granville responded with a post Data Science: The End of Statistics?

Data science is more than statistics: it also encompasses computer science and business concepts, and it's far more than a set of techniques and principles. I could imagine a data scientist not having a degree - this is not possible for a statistician. ...

I am one of the guys who contributed to the adoption of the keyword data science. Ironically, I'm a pure statistician (Ph.D. in statistics, 1993 - computational statistics), although I have changed a lot since 1993; I'm now an entrepreneur. The reason I tried hard to move away from being called a statistician to being called something (anything) else is the American Statistical Association: they killed the keyword statistician and limited the career prospects of future statisticians by making it narrowly and almost exclusively associated with the pharmaceutical industry and small data (where most of the ASA's revenue comes from). They missed the boat - on purpose, I believe - of the new statistical revolution that came along with big data over the last 15 years.

This led to an interesting discussion.

[Image: Data Science Venn Diagram, simplified]

Gregory Piatetsky-Shapiro

I saw this question on the Normal Deviate blog, but I think Data Science is not the end of statistics, just like Computer Science is not the end of Math.

Timothy Vogel

knows data science cannot possibly be the end of statistics simply because so many data scientists proffering that sort of ludicrous rhetorical question seem not to know a thing about the underlying population-sample model for all estimations. Do they really think "big data" means a literal census? Are they implying that in any investigative endeavor the word "census" makes any sense? Are they seriously operating under the impression that any machine-learning routine isn't firmly embedded in the high grass of a statistical procedure? Yikes, if any of that is true!

Sandeep Rajput

Agree completely with Gregory.

I'll offer a slightly different perspective. Until the 1960s or even 1970s, almost all Statisticians were mathematically skilled, because their entire discipline sat in the Math department. With Business booming, the discipline moved into Business Schools and Economics departments: both saw a decline in mathematical rigor.

So, while the insights were obtained by skilled Mathematicians like Fisher, and applications made easier by Neyman and Pearson, what was once hard now became simple. That is science! To quote Larry Wall, "Easy things should be easy, and hard things should be possible". As Vincent notes, maybe ASA got smug.

My opinion is that the academic regression and the great utility of previous research led to a deterioration in the general public's perception of Statistics. So new monikers were needed. Pattern Recognition gave way to Machine Learning. Lift the hood of most Machine Learning algorithms and you'll find the Expectation-Maximization algorithm. Yes, ML rejects or avoids sampling theory, but many of its algorithms invoke the convergence properties used in sampling theory. Advances in computing did make a lot of those algorithms feasible, no doubt about that. But the truth is not linear and straightforward.
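
For readers who have not seen the EM algorithm Sandeep mentions, here is a minimal sketch of EM for a two-component Gaussian mixture in one dimension. The function name, initialization, and parameters below are illustrative choices, not anything proposed in the discussion:

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """Minimal EM for a two-component 1-D Gaussian mixture (illustrative only)."""
    # Crude initialization: place the two means at the lower and upper quartiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    sigma = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])  # mixing weights

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = np.stack([
            w[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
            / (sigma[k] * np.sqrt(2 * np.pi))
            for k in range(2)
        ])
        resp = dens / dens.sum(axis=0)

        # M-step: re-estimate weights, means, and variances from the soft counts.
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        w = nk / len(x)
    return w, mu, sigma

# Example: recover two clusters from simulated data.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 0.5, 500)])
print(em_two_gaussians(x))
```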

The most cited Machine Learning text is by Hastie, Tibshirani and Friedman, all of them Statisticians. Perhaps many of us associate Statistics with "mean, median and mode" and miss the essential meaning of what a Statistic really is. The sharp decline in NSF funding over the past 20 years is also to blame, as it encourages hacks and tweaks and discourages thoughtful fundamental research. Finally, in the post-Lehman world, science and math have taken a beating in public opinion, at least in the US.

Summary: Great success led to smugness; reduced government sponsorship of research; economic hardships eroded faith in science and math.

Translation: Cyclical, Trend and Intervention!

Gregory Piatetsky-Shapiro

Sandeep, the most important theoretical basis for Machine Learning is probably the VC (Vapnik-Chervonenkis) dimension, which did not come from classical statistics. In my opinion, the biggest strength of statistics is its strong theory, but that is also its weakness: machine learning frequently works with heuristic algorithms, and statistics does not.
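
For reference, one standard statement of the VC generalization bound (quoted from textbook learning theory, not from the thread): with probability at least 1 - δ over a sample of size n, every classifier h in a class of VC dimension d satisfies

\[ R(h) \le \hat{R}_n(h) + \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta}}{n}} \]

where R(h) is the true risk and \(\hat{R}_n(h)\) the empirical risk on the sample.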

Sandeep Rajput

Thank you Gregory. Machine Learning has its own original contributions, no doubt. Perhaps some confusion arises from what "Classical Statistics" is. For most it is the Neyman-Pearson lemma. Bayesian Statistics gels better with Machine Learning because its "incremental" learning is very similar to "on-line" learning. MLE was developed by Fisher, who was no frequentist. Fisher Information underpins the "information-theoretic" school in Statistics (the third and smallest camp), which looks like coding theory, with Minimum Description Length (MDL) a frequently spotted metric - itself a measure of sufficiency of sorts, like the VC dimension.
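
For reference (standard definitions, not from the thread): the Fisher information of a parameter θ in a model with density f(x; θ) is

\[ I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log f(X; \theta)\right)^{2}\right] \]

and the MDL principle selects the model minimizing the total code length \(L(\text{model}) + L(\text{data} \mid \text{model})\).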

The field of "Frequentist" statistics arose to deal with 20-30 measurements gathered very painstakingly and in a time when computation was far more expensive than mathematical manipulation (just look at the Student's t-distribution formula). At some point one becomes the victim of their own success. For example, many applied R&D teams extensively use ANNs because their leaders were in grad school 20-25 years ago when ANNs were hot. With some "secret sauce" they are able to show good performance with fully-connected feedforward ANNs.
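
The formula in question is a good illustration: the t statistic itself is simple, but its density, which had to be tabulated by hand in Student's day, involves gamma functions (standard forms, added here for reference):

\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}}, \qquad f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}}, \qquad \nu = n - 1. \]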

At the wider level, every discipline except perhaps Philosophy owes a lot to the scientific tradition over the centuries. Statistical theory becomes much more tractable from a measure-theoretic perspective. Without Lebesgue's pivotal work, measure theory and staples like the FFT (developed by Tukey, a chaired Statistician) would not exist today. And unless the great Gauss had noticed how his land-survey "errors" had the most peculiar shape, maybe information theory would not exist today, for then Shannon would not have written his legendary thesis on the topic. Statistics was derided as simply "counting the combinations" until Laplace and others lent it respectability through (what was then) more rigorous Math. We all stand on the shoulders of giants, and no one field begets another field. All learning theory owes its biggest thanks to Euclid, perhaps. What is "similarity"?

To echo your elegant point about the "CS is the end of Math" propaganda (many of us need to write grant proposals), I'd wager that the greatest strength of Statistics is "its pragmatism and long history of working closely with practical matters". It is the art of asking the right questions that any scientist, Statistician or not, must aspire to.

Vincent Granville

Hi Timothy, you wrote "data scientists seem not to know a thing about the underlying population-sample model for all estimations". You are talking about fake data scientists; see www.analyticbridge.com/profiles/blogs/fake-data-science . Real data scientists know robust sampling very well. Some, like myself, do know all the intricacies of advanced statistical models, but have started to do robust estimation and prediction using statistical modeling without models, or even robust predictions with confidence intervals, without using anything that is taught in statistics classes or textbooks. See www.analyticbridge.com/profiles/blogs/from-chaos-to-clusters-statistical-modeling-without-models .

Note that the initial question was asked by a Carnegie Mellon University professor, not by me. I'm actually both a data scientist and a statistician, but I stopped calling myself a statistician a while back, mostly because it creates the wrong perception about what I am actually doing.

Timothy Vogel

Vincent,

I knew the question was more rhetorical than specifically yours, but my skepticism about the lack of awareness of what constitutes robustness among data scientists was clearly and carefully qualified:

"...knows data science cannot possibly be the end of statistics simply because so many data scientists proferring that sort of ludicrous rhetorical question seem not to know a thing about the underlying population-sample model for all estimations...."

The reference was to data scientists who could pose such a question, not all data scientists.

I also don't understand why "data scientist" is more meaningful to people who misinterpret "statistics" as a profession. If one doesn't get how statistics, as a discipline, relates to data science, how does calling yourself one over the other really matter?

All that has steadily changed for those of us who seek inference from data is the volume of compiled sources that have even a glimmer of a chance of harboring even a grain of evidence for accurately predicting an event or stratifying a population in terms of its likelihood to "be" anything.

As far as I'm concerned, regardless of the label, I have ample opportunities to inform people about "what I do" independent of the label I apply to it. But I do agree that those of us active in the field need to define it for others, not allowing others to define us by the labels they have recently decided to apply to us! ;^D