A short post on a question that guest-tweeting at the Biotweeps account on Twitter got me thinking about, featuring a poll.
Imagine this: two scientists (colleagues, if you’re a scientist) are arguing as follows. Say it’s an argument about a classic paper in which much of the data subjected to detailed statistical analyses are quantitative guesses, not hard measurements. This could be in any field of science.
Scientist 1: “Conclusions are what matter most in science. If the data are guesses, but still roughly right, we shouldn’t worry much. The conclusions will still be sound regardless. That’s the high priority, because science advances by ideas gleaned from conclusions, inspiring other scientists.”
Scientist 2: “Data are what matter most in science. If the data are guesses, or flawed in some other way, this is a big problem and scientists must fix it. That’s the high priority, because science advances by data that lead to conclusions, or to more science.”
Who’s right? Have your say in this anonymous poll (please vote first before viewing results!):
link: http://poll.fm/4xf5e
[WordPress is not showing the poll in all browsers, so you may have to click the link]
And if you have more to say and don’t mind being non-anonymous, say more in the Comments: can you convince others of your answer? Or figure out what you think by ruminating in the comments?
I’m genuinely curious what people think. I have my own opinion, which has changed a lot over the past year. And I think it is a very important question that scientists should think about and discuss. I’m not just interested in scientists’ views, though; anyone interested in science should join in.
I can see the poll here.
In my opinion, data have the priority. We can always re-assess old data if they were carefully collected and documented. Old conclusions based on sketchy data (and we may find out about the sketchiness only decades later!) are worthless.
Case in point: there was ample support for a climatic cyclicity in certain sediments of the Vocontian Basin during Jurassic and Cretaceous times. Decades later, a colleague started re-sampling at much tighter intervals, because methods had improved. No cycles.
It turns out that the old sample points happened to hit an irregular curve in such a way that an apparent sine wave was created.
If now we only had the old conclusions and the new conclusions, how would we be able to assess the validity of both publications? Because we have the data, because the old data was carefully documented and published, we can assess, and discount the old studies as methodologically insufficient.
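One mechanism that produces exactly this kind of artifact is classical aliasing. Here is a minimal sketch in Python (purely illustrative numbers, not the actual Vocontian Basin measurements): a genuine 1 m-wavelength variation, sampled every 0.9 m, is indistinguishable from a slow 9 m “cycle” until someone re-samples densely.

```python
import numpy as np

# Minimal aliasing sketch (hypothetical numbers, not the basin data):
# a real variation with a 1 m wavelength, sampled at a coarse 0.9 m
# spacing, masquerades as a slow ~9 m cycle.
coarse_x = np.arange(0, 20, 0.9)                     # sparse sampling depths (m)
coarse_samples = np.sin(2 * np.pi * coarse_x / 1.0)  # true 1 m-wavelength signal

# At these depths the samples exactly trace an apparent 9 m sine wave:
apparent_cycle = -np.sin(2 * np.pi * coarse_x / 9.0)
print(np.allclose(coarse_samples, apparent_cycle, atol=1e-9))  # True

# Dense re-sampling (the modern study) reveals the true short wavelength
# and dissolves the apparent long cycle:
fine_x = np.linspace(0, 20, 2001)
fine_samples = np.sin(2 * np.pi * fine_x / 1.0)
```

From the coarse samples alone, the 9 m cycle is a perfectly consistent reading; only the carefully documented raw data let the later study overturn it.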
Great example, Heinrich; thanks! Odd about the poll: maybe my AdBlock is excluding it, although I set it not to block ads here. Anyway, glad the poll works one way or another.
I think the positions of both hypothetical scientists are off-track. On the one hand, data should not be “guesses” or otherwise indefensible, but on the other hand there is no such thing as perfect data. All data have errors, uncertainty, and sampling imperfections. In hypothetico-deductive science, data do not prove conclusions; data falsify conclusions. The preferred hypothesis is the one that best explains the observed data, taking into account their flaws and uncertainties. If Scientist 1 has taken into account the uncertainties in her data, then her conclusions are likely to be both sound and testable. However, if Scientist 1’s data truly are “guesses”, then the data are probably arising from the hypothesis rather than testing it. Scientist 2, regardless of the quality of the data, will never make a contribution, because data can always be improved ad infinitum.
I would add that I feel strongly that the strongest test of a hypothesis is made with new data, not by reanalysis of published data. Reanalysis of old data may expose analytical errors, but it perpetuates errors in the data, and short-sightedness and limitations in the way the data were assembled. Except in the trivial case where the original author mis-analyzed, existing data will support existing conclusions, whereas new data from new sources may reveal that the original data were unrepresentative or were not the best way of testing the hypothesis. I am, of course, not referring to cases where data are repurposed to address a completely new question in a new way. A study of the functional morphology of a fossil that was collected for biostratigraphy is unlikely to perpetuate errors that arise from uneven temporal sampling; however, reanalysis of the biostratigraphic zonation based on the original collection will perpetuate unrecognized biases in a way that recollecting would not. The position of Scientist 2, which treats amalgamating data as the definition of advances in science, has the potential to stifle scientific progress by channeling scientific effort into reanalysis of the same data rather than creative thinking about how to test questions in new ways.
I therefore favor Scientist 3, who attacks a problem in a new way by collecting new data that focus on the crux of the problem. Scientist 3 examines carefully the biases and uncertainties in her data, but is unafraid to draw forward-thinking, testable conclusions that attempt to explain those data and generalize from them to the extent that flaws allow. Scientist 3 is an excellent scientist because she has generated new data, new ways of thinking, new tests, and new ideas. Most importantly, she has paved the way for Scientist 4 to prove her wrong if he chooses to adopt an equally scientific approach.
If the argument is “data versus conclusions”, then my vote is for data, data, and more data… and for developing new ways of collecting data to add to the original data. I encounter this scenario in my work: the data published for track specimens (for which there are no replicas, and the chances of me physically studying the original specimen are slim to none) are incomplete, averaged, or focus on only the type. My analyses, and the resulting conclusions, are only as well-supported as the most spurious data I use. As long as the data are there, and every effort is made to add to the data and improve the data-collecting techniques, then previous conclusions can be reinterpreted and refined.
The one caveat I make for the “pro-data” stance is that one cannot become so obsessed with getting “just one more data point” that it hobbles one’s reporting. At some point the data one has, and the accompanying interpretations, should be published; future publications can always add to the previous dataset.
I voted for option 3, because the other two options were just a little too simplistic in my reading. Favoring conclusions above all else strikes me as enabling sloppy science; in fact, I dare say that many of the more problematic papers I’ve read in my career (hopefully few or none of them authored by me!) stem from this approach. Scientist 1 (at least as phrased in this poll) strikes me as someone who plays fast and loose with the data, and I wouldn’t be terribly likely to trust anything coming from them.
Scientist 2, on the other hand, strikes me as a version of Young Graduate Student Andy. I think many of us go through this phase (and I still have to fight this tendency), where the slightest flaw in a paper is an indication that the whole work teeters on the edge of falsehood: for example, a character is coded incorrectly in a phylogenetic matrix, or the anatomy of an element was misinterpreted. As I have matured as a scientist, I have gotten away from the “data fundamentalist” ravings of my early days. That said, I still lean a bit to the data-heavy side, if only because reproducibility hinges on the reliability and availability of the data.
Thanks for all the comments so far: these are wonderfully thoughtful and nuanced, just what I was hoping to get! More comments are very welcome!
From the poll Comments:
Oliver Rauhut:
Very well said, David! Data and conclusions are two sides of the same coin in science. Collecting data for the sake of collecting data leads to nothing, and thinking up hypotheses on the basis of other hypotheses (= guesses) might be OK for philosophy, but it does not advance us in the natural sciences either. At least in some areas of our science, I recognize a tendency towards data worshipping: the idea that, if we just collect enough data and then play whatever statistical tricks on them (often without much consideration of whether they make sense, so long as the P-value is right), they will tell us “the truth”. However, there are a number of problems with this approach, and that’s where methodology comes in: you need a null hypothesis to test, and you need the right analytical procedures to carry out such a test. It is in the second step where things get tricky: there are often a number of procedures that can be applied and that will give you some result, but which is most suited to the data you have? This question in turn depends on the problem you want to solve (i.e., the null hypothesis you want to test), so data and hypothesis cannot be separated.
This orientation towards high amounts of data and statistical analysis can also stifle scientific progress: I have had reviews where I was told that the data were insufficient to formulate any hypothesis. So, what are you to do? Collect fossils for another 20 years or so, until you feel you have enough data? In my opinion, it is better to formulate a hypothesis on the basis of sparse data, which can then be tested by additional data, than to just collect data in a hypothesis-free field (how can you tell then whether the data are significant or not?). Furthermore, formulating hypotheses always has the positive potential to make other people think about the problem, maybe because they dislike your hypothesis and want to prove it wrong; if they succeed, scientific progress has been made!
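To put a number on the “if only the P-value is right” worry, here is a minimal simulation (a generic sketch, not tied to any particular study) of running a standard test on many batches of pure noise. With an alpha of 0.05, roughly one batch in twenty will look “significant” by chance alone.

```python
import numpy as np
from scipy import stats

# Sketch of P-value worship: test 100 batches of pure random noise
# against a mean of zero; there is no real effect anywhere, yet some
# tests will come out "significant" at alpha = 0.05 by chance alone.
rng = np.random.default_rng(42)

false_positives = 0
for _ in range(100):
    noise = rng.normal(loc=0.0, scale=1.0, size=30)  # no real signal
    t_stat, p_value = stats.ttest_1samp(noise, popmean=0.0)
    if p_value < 0.05:
        false_positives += 1

print(f"'significant' results from pure noise: {false_positives}/100")
```

Which is exactly Oliver’s point: the test has to be chosen to suit the null hypothesis and the data, not the other way around.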
This poll seems a bit like asking which you like better: flour or bread? Bread can be tasty or awful, depending on how it’s made. Flour, on the other hand, isn’t palatable until it has been processed through a recipe, but it has more potential than bread: it could be bread, but it could also be a cake, or a roux, or cookies…
When I was at an EarthCube workshop in 2013, we were discussing what, exactly, web databases should archive, and it came down to this same dichotomy: data or conclusions. Because, if you think about it, a geologic map is actually a processed data product. I use geologic maps as the fundamental data for some of my paleontological analyses, but the actual data are the individual field observations. The map is a document merging observations and hypotheses to create a continuous geologic surface; it is a conclusion at one level and data at another. Which should be available on the web as a resource for data geoscientists?
For a more paleo example, much of the data in online paleo databases are actually processed data products, not raw data. How many of the identifications in a faunal list are actually hypotheses, informed both by the actual specimens (data) and by hypotheses about the relationship between the fauna under consideration and others that have been previously published? John Alroy has made a convincing case that differing taxonomic approaches among workers usually only add noise to higher-level analyses, but sometimes the cloven hoofprint of hypotheses can have a dramatic overall biasing effect on these apparent ‘data.’ The best example is the geographic bias in Pleistocene identifications of herpetofauna, as documented in the Covert Bias paper of Bell et al. 2009.
tl;dr: Data vs. conclusions is a false dichotomy, because conclusions at one level of science are data at another. Fractals! Always critically analyze the basis of any study.
+1! The idea of processed products being “data” is pretty universal, and problematic for the reasons you mention. I’ve been coming up against this big-time in the past week while trying to read the tea leaves of paleogeographic maps. Same for published reconstructions from CT scan data, paleomagnetic studies, etc. This is why I’ve been hammering hard for “the rawer the available data, the better.”
If I have your data, I can apply whatever techniques I know to derive new conclusions. If all I have is the conclusion, there’s no way to reverse-engineer the data. Thus, although the sexy conclusion is what gets you into a glam mag, your data are what actually provide value downstream. I can build on the data; I can’t build on the conclusion.
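A concrete illustration of that asymmetry is Anscombe’s quartet: four datasets deliberately constructed to share nearly identical summary statistics. A short sketch using two of the four classic sets shows the same “conclusion” emerging from very different data; given only the summaries, the underlying data are unrecoverable.

```python
import numpy as np

# Two of Anscombe's four datasets: very different data, yet nearly
# identical summary statistics (mean, correlation, regression line).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for label, y in [("set I", y1), ("set II", y2)]:
    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, 1)
    print(f"{label}: mean(y) = {y.mean():.2f}, r = {r:.2f}, "
          f"fit: y = {slope:.2f}x + {intercept:.2f}")

# Both lines print mean(y) = 7.50, r = 0.82, fit: y = 0.50x + 3.00.
# The "conclusion" is identical; the data are not, and cannot be
# reconstructed from the conclusion.
```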