```r
library(tidyverse)

# Closeread focus tags to attach (at random) to each sentence of the skit
tags_vec <- c(
  '{pan-to="25%,-50%"}',
  '{scale-by="1.5"}',
  # '{zoom-to="3"}'
  '{pan-to="18%,-15%" scale-by="1.5"}'
)

# Read the raw skit (one speaker/sentence pair per row), pair each sentence with
# a randomly drawn Closeread tag, and collapse everything into one long string
out_collapsed <- readr::read_delim('./CL_text.txt') %>%
  setNames(c('Character', 'text')) %>%
  mutate(text = text %>% str_remove_all(' \"|\"')) %>%
  mutate(tag_rand = sample(tags_vec, nrow(.), replace = TRUE)) %>%
  mutate(Character_tag = str_c('[@cr-', Character, ']', tag_rand), .before = 1) %>%
  mutate(Character = str_c(Character, ':')) %>%
  unite(out, c('Character', 'text', 'Character_tag'), sep = ' ') %>%
  pull(out) %>%
  str_c(collapse = '\n \n')

# CL_text %>% count(Character)
# CL_text$text[[1]]
```
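Each element collapsed into `out_collapsed` pairs a speaker label and a sentence with one randomly drawn Closeread focus tag, so an output line would look roughly like `B: Hello, I am a Bayesian practitioner. [@cr-B]{scale-by="1.5"}` (illustrative, not taken from the actual output).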
The Two (or Three) Cultures of Statistics
In 2001, in a paper called ‘Statistical Modeling: The Two Cultures’, Leo Breiman described two cultures1.
“There are two cultures in the use of statistical modeling to reach conclusions from data. One, Data Modeling, assumes that the data are generated by a given stochastic data model.
The other, Machine Learning (ML), uses algorithmic models and treats the data mechanism as unknown”.@cr-paper
A third approach, Bayesian statistics, is not mentioned in Breiman’s paper.
Loosely defined, “Bayesian statistics is a theory in the field of statistics based on the Bayesian interpretation of probability, where probability expresses a degree of belief in an event” 2.
Key Principles For Each Approach
Frequentist / Data Modeling
Machine Learning
Bayesian
- Key Principles:
- Focuses on building models that generalize well to unseen data, often prioritizing predictive performance over interpretability.
- Primarily data-driven, relying on patterns in the data rather than explicit probabilistic assumptions.
- Uses techniques like optimization, cross-validation, and regularization to minimize prediction error.
- Includes supervised, unsupervised, and reinforcement learning approaches.
- Strengths:
- Excels at tasks with large, complex datasets (e.g., image recognition, natural language processing).
- Flexible and capable of modeling highly non-linear relationships.
- Emphasis on real-world predictive accuracy.
- Weaknesses:
- Models can lack interpretability (e.g., black-box nature of deep learning).
- Requires large datasets to perform well.
- May overfit or fail to generalize if not carefully regularized or validated.
- Example Use Cases:
- Fraud detection, recommendation systems, autonomous driving, and more.
- Algorithms like decision trees, neural networks, and support vector machines.
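As a quick illustration of the cross-validation and train-and-test ideas listed above, here is a minimal sketch that is not from the original post; it uses only base R, a built-in dataset, and an ordinary linear model as an arbitrary stand-in for any learner.

```r
# K-fold cross-validation by hand: fit on k-1 folds, measure prediction error
# on the held-out fold, and average the errors across folds.
set.seed(42)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

cv_err <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)            # stand-in for any learner
  mean((predict(fit, newdata = test) - test$mpg)^2)   # held-out mean squared error
})
mean(cv_err)  # cross-validated estimate of prediction error
```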
- Key Principles:
- Focuses on long-run frequencies of events and emphasizes hypothesis testing and confidence intervals.
- Probabilities are interpreted as the long-term frequency of events. Parameters are fixed, unknown quantities.
- Relies on data to compute sample-based estimates and test hypotheses.
- Strengths:
- Simple and well-established frameworks for hypothesis testing (e.g., p-values, t-tests).
- Does not require prior assumptions (like priors in Bayesian statistics).
- Weaknesses:
- Interpretation of results (e.g., p-values) can be unintuitive.
- Less flexible when dealing with small datasets or complex models.
- Example Use Cases:
- Traditional hypothesis testing in experimental science.
- Classical methods like ANOVA, regression, and chi-squared tests.
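For concreteness, a minimal sketch (not from the original post) of the classical workflow named above, a fixed model plus a test statistic, a p-value, and a confidence interval, run on built-in R datasets:

```r
# Two-sample t-test: compare group means and report a p-value
t.test(extra ~ group, data = sleep)

# Ordinary least-squares regression with the usual frequentist summaries
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                  # coefficient estimates, standard errors, p-values
confint(fit, level = 0.95)    # 95% confidence intervals for the parameters
anova(fit)                    # classical ANOVA table for the fitted model
```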
- Key Principles:
- Combines prior knowledge with observed data to update beliefs about parameters using Bayes’ theorem.
- Probabilities represent degrees of belief or uncertainty about events.
- Parameters are treated as random variables with probability distributions.
- Prior distributions reflect existing knowledge or assumptions.
- Strengths:
- Provides a natural framework for updating beliefs as new data is observed.
- Can handle small datasets effectively when combined with informative priors.
- Results are often easier to interpret probabilistically (e.g., “the probability of the parameter being within a certain range”).
- Weaknesses:
- Requires specification of priors, which can be subjective.
- Computationally intensive, especially for complex models.
- Example Use Cases:
- Predictive modeling in medicine (e.g., estimating patient risk).
- Hierarchical modeling and small-sample inference.
- Applications in fields requiring uncertainty quantification, like robotics or weather forecasting.
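To make the prior-to-posterior update concrete, here is a minimal sketch (not from the original post) using the conjugate Beta-Binomial model, so the posterior has a closed form and no MCMC is required; the counts and prior below are made up for illustration.

```r
# Observed data: successes and failures in a small binomial experiment
successes <- 7
failures  <- 3

# Weakly informative Beta(2, 2) prior on the success rate
a0 <- 2; b0 <- 2

# Conjugate update: posterior is Beta(a0 + successes, b0 + failures)
a1 <- a0 + successes
b1 <- b0 + failures

qbeta(c(0.025, 0.975), a1, b1)          # 95% credible interval for the rate
pbeta(0.5, a1, b1, lower.tail = FALSE)  # posterior probability the rate exceeds 0.5
```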
In summary:
- Frequentist Statistics: Best for traditional inferential tasks.
- Bayesian Statistics: Best for uncertainty and small-sample problems.
- Machine Learning: Best for predictive and large-scale applications.
| Aspect | Frequentist | Bayesian | Machine Learning |
|---|---|---|---|
| Focus | Hypothesis testing | Probabilistic inference | Predictive performance |
| Probability | Long-run frequencies | Degree of belief | Often implicit or not used |
| Parameters | Fixed and unknown | Random variables | May be implicit or non-parametric |
| Prior Knowledge | Not used | Explicitly incorporated | Not typically used |
| Dataset Size | Works with moderate to large datasets | Can handle small datasets with priors | Requires large datasets for best results |
| Interpretability | High | Moderate | Often low |
| Applications | Traditional science | Fields needing uncertainty quantification | Industry, big data applications |
Back to Breiman’s two cultures paper…
1. Can the 'Data Modeling' culture be considered Frequentist?
The data modeling culture described by Leo Breiman is not strictly synonymous with frequentist methods, but it aligns closely with many principles of frequentist statistics. Here’s why:
Nuances in comparing Data Modeling with Frequentist statistics:
Assumptions and Parameter Estimation: Data modeling culture often relies on parametric models like linear regression or generalized linear models, which are staples in frequentist methodology. These models estimate parameters based on observed data, with a focus on likelihood functions and hypothesis testing—core concepts in frequentist inference.
Model Interpretability: Like traditional frequentist approaches, the data modeling culture emphasizes interpretability and providing clear, theoretically grounded explanations of relationships within the data.
Role of Probability: Frequentist statistics typically view probabilities as long-run frequencies of events, which aligns with the reliance on fixed, pre-specified models in the data modeling culture. Probabilities here are tied to specific distributions assumed to represent the data-generation process.
Contrast to Bayesian Methods: Bayesian approaches, which incorporate prior beliefs into the model and update these beliefs with observed data, are less central to the traditional data model culture. This difference reinforces the association of the culture with frequentist perspectives, although Bayesian approaches can be used within data modeling paradigms in certain contexts.
Overlap but Not Identity:
While data modeling and frequentist methods often overlap, they are not identical. For instance, data modeling is defined more by the approach of assuming a probabilistic structure for the data and its focus on model-based inference than by strict adherence to frequentist philosophies.
Emerging Perspectives:
Some modern extensions of data modeling, such as penalized regression (e.g., Lasso, ridge), use computational techniques that are not strictly frequentist but can still align with the interpretive goals of the data modeling culture.
In summary, while the data modeling culture aligns strongly with many aspects of frequentist methods, it is better understood as a subset or close associate rather than being fully synonymous.
It reflects the frequentist mindset in its reliance on fixed models, assumptions, and hypothesis testing.
2. Why did Breiman not include Bayesian statistics in his 2001 paper?
Breiman’s 2001 paper primarily aimed to critique the dominance of the data modeling culture and to introduce the algorithmic modeling culture as a more effective paradigm for handling modern, complex data analysis challenges.
The omission of Bayesian statistics likely occurred for a few key reasons related to the paper’s focus and the state of statistical practice at that time:
Focus on Paradigm Shift Between Two Dominant Cultures: Breiman’s primary argument was to contrast the data modeling culture, characterized by traditional parametric models and interpretability, with the algorithmic modeling culture, represented by machine learning techniques that emphasize predictive accuracy and data-driven models. Including Bayesian methods would have introduced a third lens, which might have diluted the clarity of his two-culture argument.
Bayesian Methods’ Marginal Role in Practical Applications (2001): At the time Breiman wrote his paper, Bayesian statistics, though well-established theoretically, had not achieved widespread adoption in many practical or industrial data analysis settings. Computational limitations and a lack of accessible software hindered the popularity of Bayesian methods compared to frequentist approaches and emerging machine learning techniques.
Focus on Assumption-Driven Models: Breiman’s critique of the data modeling culture hinges on its reliance on rigid assumptions about data-generating processes. Bayesian statistics, while offering flexibility through prior distributions, also operates within model-based frameworks. Thus, Breiman might have considered it part of, or closely aligned with, the data modeling culture, even if not explicitly mentioned.
Bayesian Statistics Doesn’t Align with Algorithmic Modeling: Breiman was advocating for algorithmic modeling—methods that rely heavily on data-driven approaches and often do not assume an explicit probabilistic model. Bayesian approaches still heavily involve modeling (e.g., specifying priors and likelihoods) and thus may not align well with the algorithmic philosophy.
Paper’s Argument as a Provocation: Breiman likely structured his paper as a sharp, provocative dichotomy to spark debate within the statistics and machine learning communities. Adding more nuanced considerations, like Bayesian methods, might have complicated the narrative.
Contemporary Perspectives:
In modern data science, Bayesian methods have gained significant traction, aided by advancements in computational techniques (e.g. Markov Chain Monte Carlo) and software tools like Stan and PyMC.
Bayesian frameworks now offer powerful solutions that can bridge elements of the two cultures, suggesting that Breiman’s omission may reflect the context of his time rather than the relevance of Bayesian approaches to today’s data science landscape.
Still confused?
What if…
A Bayesian, a Frequentist, and a Machine Learning practitioner meet at a bar.
What happened next none of them could have predicted…
A fictional conversation between the three practitioners was found on stats.stackexchange3
Keep scrolling down for a conversation between a Bayesian (B), Frequentist (F), and Machine Learner (ML).
Skip it if you are not interested: Section 2 (The skit includes 65 sentences! It’s okay to change your mind.)
ML: Hello, I am a Machine Learning practitioner.
B: Hello, I am a Bayesian practitioner.
F: Hello, I am a Frequentist practitioner.
ML: I hear you guys are good at stuff. Here’s some data.
F: Yes, let’s write down a model and then calculate the MLE.
B: Hey, F, that’s not what you told me yesterday! I had some univariate data and I wanted to estimate the variance, and I calculated the MLE. Then you pounced on me and told me to divide by 𝑛−1 instead of by 𝑛.
F: Ah yes, thanks for reminding me. I often think that I’m supposed to use the MLE for everything, but I’m interested in unbiased estimators and so on.
ML: Eh, what’s this philosophizing about? Will it help me?
F: OK, an estimator is a black box, you put data in and it gives you some numbers out. We frequentists don’t care about how the box was constructed, about what principles were used to design it. For example, I don’t know how to derive the ÷(𝑛−1) rule.
ML: So, what do you care about?
F: Evaluation.
ML: I like the sound of that.
F: A black box is a black box. If somebody claims a particular estimator is an unbiased estimator for 𝜃, then we try many values of 𝜃 in turn, generate many samples from each based on some assumed model, push them through the estimator, and find the average estimated 𝜃. If we can prove that the expected estimate equals the true value, for all values, then we say it’s unbiased.
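As an illustrative aside (not part of the original skit), F's evaluation recipe can be run in a few lines of R for the two variance estimators that B and F argued about earlier; the sample size and the set of true variances below are arbitrary choices.

```r
# For several "true" values of the parameter, simulate many Normal samples,
# push them through each estimator, and compare the average estimate to the
# truth. Dividing by n (the MLE) is biased downward; dividing by n - 1 is not.
set.seed(1)
n <- 10
for (sigma2 in c(1, 4, 9)) {
  ests <- replicate(20000, {
    x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
    c(mle      = sum((x - mean(x))^2) / n,
      unbiased = sum((x - mean(x))^2) / (n - 1))
  })
  print(c(true = sigma2, rowMeans(ests)))  # average estimates vs the true variance
}
```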
ML: Sounds great! It sounds like frequentists are pragmatic people. You judge each black box by its results. Evaluation is key.
F: Indeed! I understand you guys take a similar approach. Cross-validation, or something? But that sounds messy to me.
ML: Messy?
F: The idea of testing your estimator on real data seems dangerous to me. The empirical data you use might have all sorts of problems with it, and might not behave according to the model we agreed upon for evaluation.
ML: What? I thought you said you’d proved some results? That your estimator would always be unbiased, for all 𝜃.
F: Yes. While your method might have worked on one dataset (the dataset with train and test data) that you used in your evaluation, I can prove that mine will always work.
ML: For all datasets?
F: No.
ML: So my method has been cross-validated on one dataset. You haven’t tested yours on any real dataset?
F: That’s right.
ML: That puts me in the lead then! My method is better than yours. It predicts cancer 90% of the time. Your ‘proof’ is only valid if the entire dataset behaves according to the model you assumed.
F: Emm, yeah, I suppose.
ML: And that confidence interval has 95% coverage. But I shouldn’t be surprised if it only contains the correct value of 𝜃 20% of the time?
F: That’s right. Unless the data is truly i.i.d Normal (or whatever), my proof is useless.
ML: So my evaluation is more trustworthy and comprehensive? It only works on the datasets I’ve tried so far, but at least they’re real datasets, warts and all. There you were, trying to claim you were more ‘conservative’ and ‘thorough’ and that you were interested in model-checking and stuff.
B: (interjects) Hey guys, sorry to interrupt. I’d love to step in and balance things up, perhaps demonstrating some other issues, but I really love watching my frequentist colleague squirm.
F: Woah!
ML: OK, children. It was all about evaluation. An estimator is a black box. Data goes in, data comes out. We approve, or disapprove, of an estimator based on how it performs under evaluation. We don’t care about the ‘recipe’ or ‘design principles’ that are used.
F: Yes. But we have very different ideas about which evaluations are important. ML will do train-and-test on real data. Whereas I will do an evaluation that is more general (because it involves a broadly-applicable proof) and also more limited (because I don’t know if your dataset is actually drawn from the modeling assumptions I use while designing my evaluation.)
ML: What evaluation do you use, B?
F: (interjects) Hey. Don’t make me laugh. He doesn’t evaluate anything. He just uses his subjective beliefs and runs with it. Or something.
B: That’s the common interpretation. But it’s also possible to define Bayesianism by the evaluations preferred. Then we can use the idea that none of us care what’s in the black box, we care only about different ways to evaluate.
B: Classic example: Medical test. The result of the blood test is either Positive or Negative. A frequentist will be interested in, of the Healthy people, what proportion get a Negative result. And similarly, what proportion of Sick people will get a Positive. The frequentist will calculate these for each blood testing method that’s under consideration and then recommend that we use the test that got the best pair of scores.
F: Exactly. What more could you want?
B: What about those individuals that got a Positive test result? They will want to know ‘of those that get a Positive result, how many will get Sick?’ and ‘of those that get a Negative result, how many are Healthy?’
ML: Ah yes, that seems like a better pair of questions to ask.
F: HERESY!
B: Here we go again. He doesn’t like where this is going.
ML: This is about ‘priors’, isn’t it?
F: EVIL.
B: Anyway, yes, you’re right ML. In order to calculate the proportion of Positive-result people that are Sick you must do one of two things. One option is to run the tests on lots of people and just observe the relevant proportions. How many of those people go on to die of the disease, for example.
ML: That sounds like what I do. Use train-and-test.
B: But you can calculate these numbers in advance, if you are willing to make an assumption about the rate of Sickness in the population. The frequentist also makes his calculations in advance, but without using this population-level Sickness rate.
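As an illustrative aside (not part of the original skit), here is the calculation B is describing, with made-up numbers: the frequentist quantities are the test's sensitivity and specificity, and adding an assumed population Sickness rate (the prior) gives P(Sick | Positive) via Bayes' theorem.

```r
sensitivity <- 0.95   # P(Positive | Sick)
specificity <- 0.90   # P(Negative | Healthy)
prevalence  <- 0.01   # assumed rate of Sickness in the population (the prior)

# Bayes' theorem: P(Sick | Positive) = P(Positive | Sick) * P(Sick) / P(Positive)
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
sensitivity * prevalence / p_positive   # about 0.088: most Positives are false alarms here
```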
F: MORE UNFOUNDED ASSUMPTIONS.
B: Oh shut up. Earlier, you were found out. ML discovered that you are just as fond of unfounded assumptions as anyone. Your ‘proven’ coverage probabilities won’t stack up in the real world unless all your assumptions stand up. Why is my prior assumption so different? You call me crazy, yet you pretend your assumptions are the work of a conservative, solid, assumption-free analysis.
B: Anyway, ML, as I was saying. Bayesians like a different kind of evaluation. We are more interested in conditioning on the observed data, and calculating the accuracy of our estimator accordingly. We cannot perform this evaluation without using a prior. But the interesting thing is that, once we decide on this form of evaluation, and once we choose our prior, we have an automatic ‘recipe’ to create an appropriate estimator. The frequentist has no such recipe. If he wants an unbiased estimator for a complex model, he doesn’t have any automated way to build a suitable estimator.
ML: And you do? You can automatically build an estimator?
B: Yes. I don’t have an automatic way to create an unbiased estimator, because I think bias is a bad way to evaluate an estimator. But given the conditional-on-data estimation that I like, and the prior, I can connect the prior and the likelihood to give me the estimator.
ML: So anyway, let’s recap. We all have different ways to evaluate our methods, and we’ll probably never agree on which methods are best.
B: Well, that’s not fair. We could mix and match them. If any of us have good labelled training data, we should probably test against it. And generally we all should test as many assumptions as we can. And some ‘frequentist’ proofs might be fun too, predicting the performance under some presumed model of data generation.
F: Yeah guys. Let’s be pragmatic about evaluation. And actually, I’ll stop obsessing over infinite-sample properties. I’ve been asking the scientists to give me an infinite sample, but they still haven’t done so. It’s time for me to focus again on finite samples.
ML: So, we just have one last question. We’ve argued a lot about how to evaluate our methods, but how do we create our methods?
B: Ah. As I was getting at earlier, we Bayesians have the more powerful general method. It might be complicated, but we can always write some sort of algorithm (maybe a naive form of MCMC) that will sample from our posterior.
F: (interjects) But it might have bias.
B: So might your methods. Need I remind you that the MLE is often biased? Sometimes, you have great difficulty finding unbiased estimators, and even when you do you have a stupid estimator (for some really complex model) that will say the variance is negative. And you call that unbiased. Unbiased, yes. But useful, no!
ML: OK guys. You’re ranting again. Let me ask you a question, F. Have you ever compared the bias of your method with the bias of B’s method, when you’ve both worked on the same problem?
F: Yes. In fact, I hate to admit it, but B’s approach sometimes has lower bias and MSE than my estimator!
ML: The lesson here is that, while we disagree a little on evaluation, none of us has a monopoly on how to create estimators that have the properties we want.
B: Yes, we should read each other’s work a bit more. We can give each other inspiration for estimators. We might find that other’s estimators work great, out-of-the-box, on our own problems.
F: And I should stop obsessing about bias. An unbiased estimator might have ridiculous variance. I suppose all of us have to ‘take responsibility’ for the choices we make in how we evaluate and the properties we wish to see in our estimators. We can’t hide behind a philosophy. Try all the evaluations you can. And I will keep sneaking a look at the Bayesian literature to get new ideas for estimators!
B: In fact, a lot of people don’t really know what their own philosophy is. I’m not even sure myself. If I use a Bayesian recipe, and then prove some nice theoretical result, doesn’t that mean I’m a frequentist? A frequentist cares about proofs about performance; he doesn’t care about recipes. And if I do some train-and-test instead (or as well), does that mean I’m a machine learner?
ML: It seems we’re all pretty similar then.
B: I am tired. Where is my drink?
ML: Me too. I am also hungry.
F: I am afraid we are no longer welcome in this bar.
ML: Let’s find the nearest place and take the shortest path to get there.
B: Does anyone have any prior information about a nice place?
About this page:
The conversation section was built from a raw text file with 64 sentences (rows)!
Standard Closeread usage consists of tagging each section with a @cr- tag for each ‘frame’ in the “story”.
A ‘frame’ is like a PowerPoint slide, but with vertical scrolling, so slides (frames) change automatically as you scroll down or up.
Since I was not going to copy-paste 64 Closeread @cr- tags, I leveraged Quarto’s programmatic tooling to generate them iteratively.
The trick was first to combine each sentence of the text with its cr- tag, collapse everything into one long string, and then, via meta-programming, embed it in the Quarto document with a #| results: asis chunk option, as sketched below.
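A minimal sketch of that final step, assuming `out_collapsed` holds the collapsed string built by the chunk at the top of this post:

```r
#| results: asis
# Print the pre-built string as raw document text, so Quarto/Closeread renders each
# "Character: sentence [@cr-...]{...}" line as content rather than as code output.
cat(out_collapsed)
```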
Most of the technical content in this blog was inspired by responses from large language models (LLMs) to questions that had been puzzling me.
The primary educational goal of this post is to showcase how tools like Quarto and Closeread can be leveraged for meta-programming.
Additionally, it highlights that in the current age of LLMs (as of 2024), expertise in a specific domain is not necessarily a prerequisite for sharing valuable insights.
Instead, knowing how to ask the right questions and employing effective storytelling can transform complex technical topics into accessible, engaging resources for a broader, less technical audience.
About:
Dror Berel
Statistician specializing in Causal Inference and Machine Learning.
R programmer
Solving business needs
Personal page
LinkedIn
Blog
Thanks to Preston Burns for review and for contributing the CSS styling.
Footnotes
Leo Breiman, “Statistical Modeling: The Two Cultures,” Statistical Science, Vol. 16, No. 3 (Aug. 2001), pp. 199-215. https://www2.math.uu.se/~thulin/mm/breiman.pdf↩︎