Whatever Happened to the One, True Scientific Method?
Back to Walden Two Fan Site homepage

by John Shannonhouse

[This was originally a post on a listserv, copied with permission]

Hello,

I have talked with Richard and others and thought a lot about what constitutes good science and bad science across a large number of fields (both "hard" and "soft"). IMO, some fields have a high tolerance for bad science and/or give too much credit to conclusions that would be considered preliminary or tentative in other fields.

I don't like semantic confusion, so I looked up the word "empirical" to make sure we are all on the same page. As it turns out, I had been told (wrongly) that empirical meant observation of data independent of theory, philosophy, bias, etc. in controlled, reproduced experiments. In fact, controls are not necessary to be empirical. I wish to substitute "scientifically rigorous" for "empirical" in my last post.

By scientifically rigorous, I mean that the observations are recorded in an unbiased manner and always include as complete a description as possible of conditions under which the observations were made. Experiments need to be done with all possible controls. If for some reason a control is not or cannot be done, its absence should be pointed out (the failure to mention such caveats* is the largest failing I see in experiments done in my field; unfortunately, researchers are often hammered pretty hard for caveats they mention, so mentioning them is discouraged; I always mention them and have suffered considerably as a result). Results must be reproducible to be believable. Sometimes the nature of the experiment prevents reproduction (e.g., it is very expensive).

IOW, scientifically rigorous is empirical, controlled, reproduced and includes all caveats.

Being rigorous is distinct from being robust. Rigor refers to method and means carefully or properly done. A robust result is one that is the same or nearly the same in every trial or case (reproducible, having a very low p-value**, consistent with other data). In the case of experiments where controls or reproductions cannot be done, the researcher must either mention the caveats every time or rely on a convergence of evidence (see below).

*Caveat - from the Latin for "let him beware," as in caveat emptor, "let the buyer beware." This is any reason why a conclusion cannot be entirely trusted. It is often a control that has not been done for whatever reason, or the fact that an experiment has not yet been (or may never be) reproduced. It is sometimes a flaw in the experiment's performance (e.g., the temperature was 2°C higher than it was supposed to be, but assuming that it doesn't make a difference, here is my conclusion).

**p-value - a statistical term for the probability of randomly producing a result that varies from the expected result by that much or more. For example, if you flip a coin 4 times and only get heads once, does that mean it is more likely to come up tails? The p-value of this result is 0.3125, or a 31.25% chance of getting one or fewer heads from a fair coin (5 of the 16 equally likely outcomes). In most fields, p-values must be <=0.05 to be considered statistically significant. Keep in mind, though, that a p-value of >0.05 may still be a real result and a p-value of <0.05 might be a result of random chance. p-values are probabilities, not absolutes.
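
The coin-flip figure can be checked by direct counting. What follows is only a minimal sketch in Python, assuming a fair coin and a one-sided question ("this few heads or fewer"); the function name is made up for illustration.

```python
from math import comb

def prob_at_most(k, n, p=0.5):
    """Probability of k or fewer heads in n independent flips,
    where p is the chance of heads on a single flip."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# One or fewer heads in four flips of a fair coin:
# 1 way to get zero heads + 4 ways to get one head, out of 16 outcomes.
p_value = prob_at_most(1, 4)
print(p_value)  # 0.3125
```

Since 0.3125 is far above 0.05, one head in four flips says next to nothing about whether the coin is biased.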

<<Only the extreme physical-science ends of the science spectrum (particularly mathematics and to a lesser extent physics) live close to empirical evidence and logical conclusions. As one passes through chemistry to biology to the social sciences, the "scientific method" changes radically and becomes "much softer," less empirical and logical. There are so-called "proofs" in math; then there is strong "statistical" evidence in some areas of physics; but by the time one gets all the way along the spectrum to the social sciences (even the "firmer" social sciences such as experimental psychology), one is often talking about p <.1 or p <.05.>>
The need to use p-values does not make a conclusion less rigorous. They make a good measure of how robust a conclusion is. If a data set came out supporting or refuting a conclusion based on chance (which often happens with such high p-values), then it should not be reproducible. My science (genetics) relies heavily on statistics to come to many conclusions. After reproducing an experiment, having large trials, etc., the p-values can get down to <0.001 (which means a less than 0.1% chance of seeing this data by chance alone, meaning that it is quite safe to say the correlation is real). This principle applies in clinical trials of a drug to show effectiveness and can apply in some "softer" sciences described below.
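
To illustrate how larger trials drive the p-value down, here is a rough sketch in Python. It assumes the same made-up coin example (a fair coin as the "chance alone" explanation, and 25% heads observed every time); the trial sizes and the function name are arbitrary choices for illustration, not data from any real experiment.

```python
from math import comb

def prob_at_most(k, n, p=0.5):
    """Probability of k or fewer heads in n independent flips,
    where p is the chance of heads on a single flip."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# The same observed proportion (25% heads) at three trial sizes.
p_small = prob_at_most(1, 4)      # 1 head in 4 flips
p_medium = prob_at_most(10, 40)   # 10 heads in 40 flips
p_large = prob_at_most(25, 100)   # 25 heads in 100 flips

for flips, p_value in [(4, p_small), (40, p_medium), (100, p_large)]:
    print(f"{flips:>3} flips, 25% heads: p = {p_value:.2g}")
```

The observed proportion never changes, but the chance of seeing it from a fair coin shrinks at each step, and by 100 flips it is far below 0.001. A real effect survives bigger and repeated trials this way; a fluke generally does not.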

When I was ripping on military scientists, clinical psychologists and geneticists in my last post, it was not about how robust their conclusions were (which is what Richard seems to be talking about). It was about being sloppy with controls, ignoring scientifically rigorous experimental results while clinging to their expert opinions, or stating opinions without mentioning that there is no direct evidence to support them (there is often indirect evidence, but this is usually on par with historical evidence). I had one professor who was great at mentioning caveats to what the prevailing opinions were (the "party line," he called them).

In the clinical sciences, results are sometimes less robust than a researcher would like (not enough to be acceptable to a straight biologist, perhaps). That is no reason why they cannot be rigorously done (what I meant by empirical in my last post).

<<"Evidence" in "softer" social sciences such as archeology and anthropology, and also economics, political science, etc. is even more imaginative--very different indeed from anything remotely acceptable in the physical sciences. And in the clinical "sciences" such as medicine and many aspects of psychology, even clinical "vignettes" and "informed opinions" are considered to have weight. And this is still "science.">>
And this still causes confusion and more tangible problems. IMO, several sciences can be run in a more rigorous fashion than they are (even in hard sciences). There are some who would say murderers have walked away scot free because of an "informed opinion" without experimental evidence or mention of caveat. E.g., the doctor who says "He did not die as a result of poison. I have seen a large number of similar cases and this is an allergic reaction." when the more correct thing to say would be "It could have been poison, but it looks more like an allergic reaction. Here is why it looks more like an allergic reaction..." because evidence is consistent with poison, but more resembles allergies. This sort of expert opinion would allow an investigator or jury to consider other lines of evidence rather than believing poison is impossible, which brings me to:

Convergence of the evidence - in order to be valid, a hypothesis must be consistent with all known data. Even when a hypothesis cannot be tested experimentally, it can often become upgraded to a theory with sufficient convergence of evidence. A hypothesis may make predictions that can be "tested" by comparing evidence from nonlaboratory conditions. Checking a hypothesis this way may require working from many areas of expertise. This falls under falsifiability (see below). Several hypotheses can explain the same phenomenon. They can be ordered from better to worse hypotheses using two principles:

  • Testability - the easier a hypothesis is to test, the better it is. The best hypotheses make predictions that can be used to attempt to disprove the hypothesis. This is called falsifiability. Checking to make sure the predictions of a hypothesis are consistent with existing data is a kind of test to attempt to falsify it. It is easier to falsify most hypotheses than to prove they are true. I consider predictive and explanative value to fall under testability, BTW.

  • Parsimony - if two hypotheses can explain the same phenomenon, then the one that requires the fewest steps or has the higher probability is preferable.

Convergence of evidence is used when experiments cannot be done. It is the origin of the theory of evolution, various theories in cosmology, the earth sciences and other sciences.

Now for something different:
A court of science will need to decide which ideas are better than others. Therefore it will need to beware of that which muddles debates. A few things to watch out for:

  • Going on the offensive:
    A common tactic in debates about hypotheses, theories, etc. is to attack your opponent's position. A moderator must always be careful to make sure that the offender can also outline his position and defend it. The best way to defend an untenable position is to go on the offensive (it works well for tenable positions, too, of course). Make your opponent defend his position so you don't have to defend yours.

  • Convergence of a little evidence:
    It is not enough that a position is supported by some evidence. It must be consistent with all evidence. In many debates where mountains of evidence are available, some will seem to support an untenable position. Don't lose sight of the big picture.

  • Knowledge grows with time; modification of theories is acceptable:
    Archaeopteryx is no longer generally considered a transitional form between birds and reptiles. Many creationists attack evolutionists by saying "they say this is a transitional form, but look at all this evidence it isn't!" No, they SAID, they no longer SAY. Often in debates, someone will attack an opponent's position by putting up a "straw man": an obsolete idea whose failure does not invalidate the hypothesis or theory. They attack it either by claiming that the old position is the current one or by arguing that the other side was wrong before, so it is probably wrong now.

  • Argument over details does not mean a hypothesis is wrong:
    In many fields the details of various theories are being debated even though the overall theory is considered fact. Going back to the evolution example, two major theories of evolutionary rates are gradualism and punctuated equilibrium. Do not mistake debate over which is correct for debate over whether the theory of evolution is correct.

  • Context:
    In any debate (not just scientific), the meaning of a statement is context dependent. I'm sure any of you can think of an example of quoting out of context or telling just part of a story that leaves out profound details.

The above examples were taken from "Why People Believe Weird Things" by Michael Shermer. It gives an excellent outline of two debates that are scientifically resolved, but still burn on because the side with the untenable position uses these tactics (creationists v. evolutionists, Holocaust deniers v. historians). It also contains descriptions of common fallacies that lead to erroneous beliefs.

Here are some others I have seen:

  • An agenda does not invalidate a position:
    Just because someone has something at stake in a debate does not automatically mean they are suspect. Their arguments must stand or fall on the same grounds as if they had no stake. Scientifically rigorous data is valid no matter who produces it.

  • Analogy is not an accurate description:
    Analogies are great for aiding understanding, but few if any analogies are perfect. A hypothesis or theory must be evaluated by itself. It can't be evaluated by analogy.

  • Beware anecdotes and sayings:
    I [recently read the following]:
    <<"There is an old saying that the easiest way to predict a person’s future is to look at their past. This holds true for institutions as well.">>
    I have my own sayings:
    • "Beware anecdotes,"
    • "caveat emptor," and
    • "Really? How do you know that?" (Don't accept "Everybody knows that!")

    Anecdotes and sayings often sound clever and appeal to existing beliefs, so they are believed. The problem is people often take them as evidence or fact without really thinking about why they may (or may not) be true. I think that the above saying is not valid [unproven]. The reason is that it has ambiguous criteria and immediately brings to mind supporting examples, but not contrary examples.

    A perfect example of a bad adage is when someone says "For every study that says one thing, there is another one that says something else." I am sure most of you have heard this one. When you hear it, do you think of any of the myriad examples of all studies on a subject saying the same thing (or nearly the same thing)? Would you have remembered those old studies unless a new one came out contradicting them? Would a study that merely reproduces a previous study's conclusion even be reported in the mass media? I suspect the answers to these questions are no, no and no.

  • Ex cathedra (without evidence):
    Many times people, especially experts, will say something ex cathedra (without giving the evidence). Make sure you get the evidence on every important point. Be fair and allow them time to think about it and look it up. Just because someone is caught off guard or unprepared does not mean that they are wrong.

<<No theory can be ruled out so long as it has some kind of evidence to back it up.>>
It can be ruled out if it is falsified. There is evidence, for example, that protein is the genetic material. We know that protein is not the genetic material, though. There are other interpretations of the data supporting protein as the genetic material, and different experiments have shown that genetic material can be transmitted without protein, ergo that hypothesis can be ruled out. There are famous examples of some evidence supporting all kinds of hypotheses that are known to be false:
  • The sun revolves around the Earth
  • The Second Law of Thermodynamics does not apply in living cells
  • AIDS is a spontaneous, sporadic condition independent of HIV infection etc.

However, as I said, a hypothesis must be consistent with all evidence, not just some. None of the above examples are consistent with all evidence, so we know they are false. That is the power of falsification.

Well, I have written enough for now. I should get back to work.

John Shannonhouse
Department of Genetics
University of Wisconsin-Madison
jlshanno@students.wisc.edu


Previous Post Follows:

Hello,

On 11 April 1999, Gale wrote:

<<The results of a science court are (a) a set of agreed statements, (b) a set of opinions from the technical panel on statements on which there is disagreement by the parties, (c) supporting arguments by the interested parties, and (d) a summary by the moderator.>>
Well, this sounds interesting. I have to state my opinion:
There needs to be an itemized list of all scientifically rigorous evidence in the report from a "science court." Scientifically rigorous evidence should carry more weight than expert opinions. There are examples of scientifically rigorous evidence and expert opinions disagreeing even after clear-cut results are published. I know of studies in clinical psychology that clearly show that expert opinion can be of dubious value (e.g., studies show that treatments where treatment time and cost are determined by the expert judgement of therapists are no more effective than treatments where guidelines cap treatment time and cost). There is certainly no shortage of examples in genetics where long-held opinions across the field were overthrown by scientifically rigorous evidence (e.g., histocompatibility is not determined by many genes scattered across the genome as was originally assumed, but is based almost entirely on a single locus: the MHC). There is one example I can think of where many geneticists have not acknowledged that recent scientifically rigorous evidence flies right in the face of some expert opinions they are stating (and it was a simple experiment that should have been done long ago, too). Without either scientifically rigorous evidence or a convergence of many lines of indirect evidence, expert opinions are just guesses. They may be better guesses than lay guesses, but they are still guesses.

BTW, stating that evidence is scientifically rigorous does not make it so. I had a discussion yesterday in which someone stated ex cathedra that it had been empirically determined that a soldier was no longer effective in combat after a certain length of time. I was surprised that I had not heard about any controlled experiments on the subject, or even that the military had EVER done a controlled experiment about a factor affecting combat effectiveness (as far as I can tell from my roommate's officer's manuals, they normally rely on expert opinion supported by circular arguments and circumstantial evidence). When I questioned him about it, I found out that only one type of serviceman was "tested" (fighter pilots) and it was generalized to all types of combat servicemen (but not, interestingly enough, to non-combat servicemen), that no controlled experiments had been done (the "experimental group" was Allied airmen and the "control group" was Axis airmen) and that at least one other factor he mentioned could account for the difference in performance (I immediately thought of two others). The experts (military officers) had decided by fiat that this was the main factor accounting for the difference between the two groups and called it an empirical determination (this is typical of military "science").

<<We would select the issues, such as nuclear energy in space, find parties who strongly advocate different positions on the issues, find a technical panel>>
What science questions are you planning on answering? This approach sounds like it can rapidly mutate into political/economic issues. Science questions might be:
    (1) Is nuclear energy in space harmful to the biosphere and/or people who work in the space industry?
    (2) If so, what specifically are the harmful effects of nuclear energy in space?
    (3) What facts are not known that need to be known and what experiments can be done to find out those facts?
    (4) What can be done to minimize or eliminate harmful effects of nuclear energy in space? What specifically do the measures do to minimize the harmful effects and to what extent do they reduce the harmful effects?
    (5) Is a given effect harmful?

Economic and political questions are:

  • Do the benefits of nuclear energy in space outweigh its problems?
  • What sorts of harmful effects are significant?
  • What sorts of harmful effects are negligible?
  • Is a given measure to minimize harmful effects of nuclear energy in space worth the cost?

Notice the difference between the two sets of questions. The questions in the first set, with the possible exception of #5, are all objective. They are about data, not what should be done. The second set's questions are subjective. Their answers are directives about what to do, and the distinction between facts and opinions can easily become muddled.

John Shannonhouse
Department of Genetics
University of Wisconsin-Madison
jlshanno@students.wisc.edu



This site was created by ex-member Nexus.


This Walden Two Fan Site is hosted by
Twin Oaks Community
but does not necessarily represent the opinions or policies of Twin Oaks Community.