Gil Kalai, "Bible code" controversy - Statistical analysis

Gil Kalai - "Bible Code" Controversy - Statistical Analysis of the Data from "Bible Code" Experiments

Mailing address: Institute of Mathematics, Hebrew University, Givat-Ram, Jerusalem 91904, Israel
Telephone numbers: Office (972)2-6584729, Home (972)2-6536301, Fax (972)2-5630702.
Email addresses: kalai@math.huji.ac.il, my home page
The "bible code" controversy

"Inspect every piece of pseudo-science and you will find a security blanket, a thumb to suck, a skirt to hold. What have we to offer in exchange? Uncertainty! Insecurity!"
Isaac Asimov in the tenth anniversary issue of The Skeptical Inquirer.

Witztum, Rips and Rosenberg (WRR) describe in a paper in "Statistical Science" (Vol. 9 (1994), p. 429) the outcomes of two experiments which purport to statistically prove the existence of a hidden code in the Book of Genesis. In later preprints they report on further successful experiments and yet another successful experiment was reported by Gans.

WRR's paper showed surprising proximity (according to some notion of distance) between equidistant letter sequences (ELS for short) of names of famous rabbis and their known dates of birth and death. WRR ran their experiments on two lists of rabbis, and the experiments are referred to as the "famous rabbis experiments". Gans (and later WRR themselves) showed a similar phenomenon when the names of the rabbis were matched to the places where they were born or died. (This experiment is known as the "cities of the famous rabbis experiment".)

WRR's fantastic claims raise the question whether the outcomes they describe express their own expectations rather than any real phenomenon.

Indeed, many experiments similar to those reported by WRR, performed by skeptics (McKay, Simon and others), showed no trace of the alleged phenomenon. Moreover, running the "famous rabbis experiments" and the experiment reported by Gans with an appropriate process for data selection left no trace of WRR's phenomenon. See: Brendan McKay's site and Barry Simon's site.
A detailed study by Bar-Hillel, Bar-Natan and McKay (BBM) of WRR's experiments shows the existence of a large "wiggle room" in WRR's experiments and gives plenty of evidence for biased data selection.

A comprehensive paper, "Solving the Bible Code Puzzle", about these findings as well as some of those presented below appeared in "Statistical Science" Vol. 14 (1999) 150-173. See Brendan McKay's site.

A case which is very similar in various respects to the famous rabbis experiment (pointed out to me by B. McKay) is the "Mars effect": the claim by the French psychologist Michel Gauquelin that Mars occupies certain positions in the sky more often at the birth of sports champions than at the birth of ordinary people. The British psychologist Hans Jurgen Eysenck thought that Gauquelin's results were the only reason not to reject astrology completely. Astrologer Robert Hand has stated that the Gauquelin findings are 'one of the strongest threats to mechanist-materialism in existence'. See here for the lovely paper of Jan Willem Nienhuys.

On this site I describe my statistical work with Brendan McKay and Maya Bar-Hillel on the subject. The idea was to study only the statistical outcomes reported by WRR, without getting into the historical or grammatical choices and without carrying out experiments on the Book of Genesis or any other book.
Our conclusion is:
"The results of Witztum, Rips and Rosenberg stretch credibility, even without challenging the validity of their hidden code hypothesis. Our analysis of the results of their replication and control experiments show them to express naive expectations rather than statistical reality."

Our paper

The Two Famous Rabbis Experiments: How Similar is Too Similar?

Appeared as discussion paper number 182 of the Center of Rationality and Interactive Decisions at the Hebrew University of Jerusalem. Copies are available by request from the following address:

The Center of Rationality and Interactive Decisions
Feldman Building, Givat-Ram,
The Hebrew University of Jerusalem
91904 Jerusalem Israel.

My first paper on the subject (August 1997) - pdf version

My first paper on the subject - (bad) html version
A draft of an expanded version - pdf version
The paper with McKay and Bar-Hillel - pdf file without the Figures

Summary of the papers and some discussion:

A lovely paper which discusses many statistical issues that are relevant to the discussion here is:

P. Diaconis, Theories of Data Analysis: from magical thinking through classical statistics, in: Exploring Data Tables, Trends, and Shapes, D. Hoaglin et al. (eds.), Wiley and Sons, New York, 1985.

The first Paper (August 1997):

The basic observation is: The two (original) p-values reported by WRR are too close.

The hypothesis suggested in the paper is: The significance in the second test of Witztum, Rips and Rosenberg (WRR) was achieved via a data selection process, which was stopped when the significance level of the first test was met.

The results further suggest that the data-selection process was carried out (at least in its final stages) by adding or deleting favorable appellations for the Rabbis.
It is also argued that the distributions of pair-distances in the two experiments do not support reasonable interpretations of the original research hypothesis of a hidden text.

A draft of an expanded version (December 1997):

Here I carried out some of the suggestions from the first version. I discovered a dependence between the data of the two experiments that I could not explain: the two distributions of distances are closer together than expected even from two samples of the same distribution.

I point out that WRR's pair-distance distributions are close to distributions obtained by simple simulations of an optimization procedure based on the P2 statistic.

Criticism against my early argument:
Some criticism of my early hypothesis was offered by several people. The main points raised against it are:
1. The initial observation is a-posteriori
This is the most serious criticism against my argument. One can argue that it is always possible to find something which looks unlikely and make a story around it.
The best way to support a theory which is based on an a-posteriori observation is, of course, via a replication.
The subsequent study concerning the similarity of the two experiments of WRR and, more than that, the fact that the same phenomenon (similar p-values) occurred in the "cities experiment" give much additional strength to my hypothesis.
2. The two p-values being close is quite an arbitrary event. One could make a similar claim if one p-value was precisely twice the other or if the ratio between them was close to 3.14159, etc.
It turns out that expecting the outcome of a replication to be similar to the outcomes of the original experiment is a familiar phenomenon which is discussed in the psychological literature.
Tversky and Kahneman, who studied people as intuitive statisticians, showed that people have inflated intuitive expectations of achieving the same significance in a replication as the significance of the original experiment.
3. A p-value of 1/100 is not enough to accuse somebody of tailoring the experiments
But is it enough to raise suspicion?
In any case, our further studies showed further statistical "fingerprints" indicating that WRR's results were tailored.
4. The alternative hypothesis does not quite fit: the inspection paradox
This is a correct and quite interesting point. The expected waiting time for a bus when you arrive at the station at a random time is usually larger than 1/2 the expected gap between two buses. This was overlooked in the first paper.
However, after checking closely the situation at hand it turned out that this mistake is not very damaging.
5. "Your hypothesis suggests that WRR acted stupidly. One thing you cannot blame them is being stupid."
Empirical experiments by Tversky and Kahneman showed that people's (including statistically savvy scientists') statistical expectations are quite different from what can normatively be expected. In this case, it was difficult to know in advance what to expect. Moreover, experienced, statistically savvy, famous scientists made similar mistakes when they fabricated experiments. See the paper of Dorfman - Science, Vol 201 (1978) p. 1177 on the case of Sir Cyril Burt.

6. The excessive similarity between the two experiments may have some explanation according to WRR's research hypothesis.
Of course, everything can be explained as expressing divine intervention. However, note that the striking similarity of WRR's two experiments relates to the false statistical measures WRR initially used and their defunct computer programs. The striking similarity for the two lists of Rabbis in the cities experiment occurs for the initial list of cities, which was later withdrawn as imperfect.
Finally, a paper by J.B.S. Haldane entitled 'The faking of genetical results', which appeared in 1942 in "Eureka", the Cambridge undergraduate mathematics journal, seems quite relevant. (I found this reference, to my surprise, with a discussion and further references in the book "Fourier Analysis" by T.W. Korner, Ch. 82, p. 425.) Haldane quotes his father (an experimental physiologist) as saying "Unless the blood is very thoroughly faked, it will be found that duplicate determinations rarely agree". He continues: "In genetical work also, duplicates rarely agree unless they are faked."
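
The inspection paradox raised in point 4 above can be checked numerically. What follows is only a minimal sketch and is not taken from the papers: the function names, the mean gap of 10 minutes and the choice of exponentially distributed gaps are illustrative assumptions. It compares the simulated mean wait of a passenger who arrives at a uniformly random time with the value E[G^2] / (2 E[G]), which equals half the mean gap only when all gaps G are equal.

```python
import bisect
import random


def theoretical_mean_wait(gaps):
    """Mean wait for a uniformly random arrival time: E[G^2] / (2 E[G])."""
    return sum(g * g for g in gaps) / (2.0 * sum(gaps))


def simulated_mean_wait(gaps, n_arrivals, rng):
    """Simulate passengers arriving uniformly in time; return their mean wait."""
    # Cumulative bus arrival times built from the gaps between buses.
    times = []
    t = 0.0
    for g in gaps:
        t += g
        times.append(t)
    total = times[-1]
    waited = 0.0
    for _ in range(n_arrivals):
        u = rng.uniform(0.0, total)
        # Index of the first bus that arrives at or after time u.
        idx = bisect.bisect_left(times, u)
        waited += times[idx] - u
    return waited / n_arrivals


# Illustrative (assumed) schedules, both with a mean gap of about 10 minutes.
rng = random.Random(1)
regular = [10.0] * 20_000
irregular = [rng.expovariate(1 / 10.0) for _ in range(20_000)]
for name, gaps in (("regular", regular), ("exponential", irregular)):
    print(name,
          round(simulated_mean_wait(gaps, 20_000, rng), 2),
          round(theoretical_mean_wait(gaps), 2))
```

With equal gaps the mean wait is half the gap; with irregular gaps a random arrival is more likely to fall into a long gap, so the mean wait is larger.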
The new (1998) paper with McKay and Bar-Hillel:

The Two Famous Rabbis Experiments: How Similar is Too Similar?

This paper gives much more evidence that WRR's outcomes express WRR's naive expectations. We offer a solution to the mystery of why the two distance distributions are so close together. We also discuss another aspect of WRR's experiment: the control experiments. The most important control experiment is the one suggested by Diaconis. WRR presented (as they expected) a flat histogram for this control experiment, but in the context of their experiment such a flat histogram is unlikely. They also presented an utterly flat histogram for their experiment when they ran it on the Samaritan version of the Book of Genesis. Again this is "too good to be true".
The paper contains statistical analysis of the following observations:
The significance level of WRR's experiment 2 was inordinately similar to that of experiment 1. (p=0.01)
The distribution of the pairwise distances in experiment 2 was inordinately close to that in experiment 1. (p=0.035)
The particular visual display of the pairwise distances as described by histograms was optimal; namely, of all possible histograms like this one (same number of bins, same breadth of bin) none would have yielded a second histogram as close to the first as the one actually used. This supports the explanation that the dependence between the distributions is due to intentional intervention aimed at presenting similar histograms.
The histograms of the control experiment suggested by Diaconis were inordinately flat. (p=0.003)
The histograms of the 3 other control texts reported in WRR's 1987 preprint were inordinately flat. (p=0.003, p=0.017 and p=0.86)
The p-values of Gans' experiment (based on WRR's method at the time) were also inordinately close. (p=0.002)
We also point out that WRR changed their measurement tools during the review process of their paper. These changes were apparently unknown to the referees.
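
To illustrate what "inordinately flat" means, here is a minimal sketch of a left-tail flatness test. It is not the computation from the paper: it assumes, for illustration only, that under the null hypothesis the histogram entries are independent uniform draws over the bins, and the two histograms in the example are hypothetical numbers, not WRR's data. The usual chi-square test looks for histograms that are too uneven; here we ask instead how often a truly uniform sample would look at least as flat as the observed one.

```python
import random


def chi_square_uniform(counts):
    """Chi-square statistic of bin counts against a uniform expectation."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def flatness_p_value(counts, n_sims=2000, seed=0):
    """Monte Carlo estimate of P(simulated statistic <= observed statistic).

    A small value means the histogram is flatter than uniform sampling
    would usually produce (assuming independent draws).
    """
    rng = random.Random(seed)
    n = sum(counts)
    k = len(counts)
    observed = chi_square_uniform(counts)
    hits = 0
    for _ in range(n_sims):
        sim = [0] * k
        for _ in range(n):
            sim[rng.randrange(k)] += 1
        # Small tolerance guards against floating-point ties.
        if chi_square_uniform(sim) <= observed + 1e-9:
            hits += 1
    return hits / n_sims


# Hypothetical histograms with 100 entries in 10 bins (not real data).
very_flat = [10, 11, 9, 10, 10, 11, 9, 10, 10, 10]
ordinary = [4, 15, 9, 12, 7, 14, 8, 13, 6, 12]
print("very flat:", flatness_p_value(very_flat))
print("ordinary: ", flatness_p_value(ordinary))
```

A more careful analysis would have to model the dependence between the entries of the histogram, which this sketch ignores.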

A file of distances

Challenges for the interested reader

1. Did the Maharishi meditation program influence Middle East peace and car accidents in Jerusalem?
The following paper was published in a peer-reviewed scientific journal:
ORME-JOHNSON, D. W.; ALEXANDER, C. N.; DAVIES, J. L.; CHANDLER, H. M.; and LARIMORE, W. E. International peace project in the Middle East: The effect of the Maharishi Technology of the Unified Field. Journal of Conflict Resolution , 32(4): 776-812, 1988.
A rather small group of meditators seemed to have achieved: "Improved Quality of National Life as Measured by Composite Indices Comprising Data on War Intensity in Lebanon, Newspaper Content Analysis of Israeli National Mood, Tel Aviv Stock Index, Automobile Accident Rate in Jerusalem, Number of Fires in Jerusalem, and Maximum Temperature in Jerusalem; Significant Improvement in Each Variable in the Index (Israel, 1983). Decreased War Deaths (Lebanon, 1983)."
The strong correspondence between the number of Transcendental Meditation-Sidhi program participants in the group in Jerusalem and a composite index of all the variables above can only be described as amazing. The graph can be found here.
Challenge: Find out what is going on.
2. Study statistically the changes between the two versions of distances
WRR described in their 1987 preprint all the distances (152 for the first list and 163 for the second) between Rabbis and appellations. The histograms of the Statistical Science paper are based on these distances.
They also supplied computer programs els1.c (and later els2.c) which give somewhat different distances.
WRR claimed that the changes represent a blind debugging process.
The findings of our paper suggest that the defunct distances represent a deliberate effort towards similarity.
Our paper is based only on studying the original list of distances.
Challenge: study our hypothesis based on the two versions of distances.
3. A conjecture on the distribution of distances.
WRR give no clue as to what the distances in their samples will look like, except that their distributions will be skewed towards small distances.
In the second version of my paper I proposed a conjecture for the distributions of distances which is based on the assumption of biased data-selection towards success in a permutation test:
If you have a sample of size n and the product of the numbers in the sample is A, then the probability that a distance x in the sample is smaller than t is
(1 - log t / log A)**(n-1).
The rationale is that, apart from the sample size and the product of the entries, the distances will be "random".
Challenge: Check this conjecture for the distribution of distances for the various "successful" samples described by Witztum, Rips, Gans etc.
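
As a starting point for this challenge, here is a minimal sketch that compares the conjectured formula with a simulation. It rests on one particular reading of "random" which is an assumption made here for illustration: the distances are independent and uniform on (0, 1], conditioned on their product being A. Under that reading the values -log x, divided by -log A, are distributed like uniformly random weights that sum to 1, which is how the sampler below works. The function names and the values n = 10 and A = 1e-6 are hypothetical, not taken from WRR's data.

```python
import math
import random


def conjectured_cdf(t, n, A):
    """P(x < t) = (1 - log t / log A)**(n-1), for 0 < A < 1 and A <= t <= 1."""
    if t <= A:
        return 0.0
    if t >= 1.0:
        return 1.0
    return (1.0 - math.log(t) / math.log(A)) ** (n - 1)


def sample_with_product(n, A, rng):
    """Draw n numbers in (0, 1] whose product is A.

    Uses uniformly random weights w_i summing to 1 (normalized
    exponentials) and sets x_i = A**w_i.
    """
    e = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(e)
    return [A ** (ei / s) for ei in e]


def empirical_cdf(t, n, A, n_samples, rng):
    """Fraction of simulated samples whose first entry is smaller than t."""
    hits = sum(1 for _ in range(n_samples)
               if sample_with_product(n, A, rng)[0] < t)
    return hits / n_samples


# Illustrative (assumed) parameters.
rng = random.Random(2)
n, A = 10, 1e-6
for t in (0.01, 0.1, 0.5, 0.9):
    print(t,
          round(conjectured_cdf(t, n, A), 3),
          round(empirical_cdf(t, n, A, 20_000, rng), 3))
```

To test the conjecture on a real sample, one would replace the simulated values with the reported distances and compare their empirical distribution with conjectured_cdf for the sample's own n and A.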