Page 1
No Basis:
What the Studies Don’t Tell Us
About Same-Sex Parenting
Robert Lerner, Ph.D., and Althea K. Nagai, Ph.D.
Marriage Law Project, Washington, D.C.
January 2001

Table of Contents

Executive Summary
The Good, the Bad, or the Ugly: Formulating the Hypothesis
Compared to What? Methods to Control for Unrelated Effects
Does it Measure Up? Bias, Reliability, and Validity
It all Depends on Who You Ask: Sampling
Just by Chance? Statistical Testing
Give Me More Power: How the Studies Find False Negatives
Evaluation of the Studies
Same-Sex Parenting Studies and the Law
No Balance: Same-Sex Parenting Studies in the News
About the Authors

Executive Summary
It is routinely asserted in courts, journals and the media that it makes
“no difference” whether a child has a mother and a father, two fathers,
or two mothers. Reference is often made to social-scientific studies that
are claimed to have “demonstrated” this.
An objective analysis, however, demonstrates that there is no basis for this
assertion. The studies on which such claims are based are all gravely deficient.
Robert Lerner, Ph.D., and Althea Nagai, Ph.D., professionals in the
field of quantitative analysis, evaluated 49 empirical studies on same-sex
(or homosexual) parenting.
The evaluation looks at how each study carries out six key research
tasks: (1) formulating a hypothesis and research design; (2) controlling
for unrelated effects; (3) measuring concepts (bias, reliability and valid-
ity); (4) sampling; (5) statistical testing; and (6) addressing the problem
of false negatives (statistical power).
Each chapter of the evaluation describes and evaluates how the studies
utilized one of these research steps. Along the way, Lerner and Nagai
offer pointers for how future studies can be more competently done.
Some major problems uncovered in the studies include the following:
Unclear hypotheses and research designs
Missing or inadequate comparison groups
Self-constructed, unreliable and invalid measurements
Non-random samples, including participants who recruit
other participants
Samples too small to yield meaningful results
Missing or inadequate statistical analysis
Lerner and Nagai found at least one fatal research flaw in all forty-
nine studies. As a result, they conclude that no generalizations can reli-
ably be made based on any of these studies. For these reasons the studies
are no basis for good science or good public policy.
Four Appendices follow. Appendix 1 is a bibliography of the studies
and related publications. Appendix 2 is a table that summarizes the
evaluation of each of the studies with regard to each research step. Ap-
pendix 3 (by William C. Duncan) is an overview of how these studies
have been used in the law. Appendix 4 (by Kristina Mirus) describes
how the media has covered these studies.

Foreword
By David Orgon Coolidge
Director, Marriage Law Project
What do existing studies tell us about the impact of same-sex
parenting on children?
Nothing.
That’s right, nothing.
You would never know that, however, if you were to read most
court decisions, law review articles, commission reports or newspa-
per articles. You would hear the opposite.
The point of the study which follows is not to try to answer the
question, “Why is this?” Instead, Robert Lerner and Althea Nagai
have simply evaluated the studies themselves. They have asked: What
are their hypotheses? How do they set about to prove them? What
do they conclude? In formulating, executing and analyzing their re-
search, do these studies get it right?
The results are not pretty. Lerner and Nagai identified 49 empiri-
cal studies on the subject of same-sex parenting.* After going
through them all, inch-by-inch, they found…nothing.
I first saw the need for such an evaluation back in 1996, in Honolulu,
Hawaii. I sat through two weeks of testimony in the same-sex “marriage”
case, Baehr v. Miike. Almost all of the testimony was by social scientists.
It raised questions I could not shake.

* The terms “homosexual” (on the one hand) and “gay and lesbian” (on
the other) are both loaded. The studies evaluated here examine parenting
by same-sex couples in sexual relationships. To avoid distraction I have
used the term “same-sex.”
Many of those questions are larger ones, such as how science and
morality relate. But other questions were more straightforward: Are
these studies well-done by normal standards? Should journals pub-
lish them? Should policymakers rely on them?
The fact of the matter is that many people, including policymakers, are
relying upon these studies in litigation, legislation, scholarly writing,
and in the larger public debate. (To confirm this, see Appendices Three
and Four by Bill Duncan and Kristina Mirus.)
The least that should be done is to take a serious look at the
methodology of the studies. That is what Robert Lerner and Althea
Nagai have done. At the risk of damaging their professional and
academic reputations, they have done this full-scale evaluation. Here
you have the results. You will learn more than you ever wanted to
know about how studies should be designed, implemented and
evaluated. And you will learn how even the best studies of same-sex
parenting fall far short of these standards.
Lerner and Nagai have not only taken apart existing studies, how-
ever. By setting their evaluation in the context of a broader discus-
sion of social-scientific research, they have pointed the way toward
better studies. They are clearing ground so others can go forward.
In the meantime, the rest of us have decisions to make. How
shall we proceed? Lerner and Nagai make no attempt to answer this
question. They have only one point to make: Whatever you do,
don’t do it based on these studies.
Take the time to see what Lerner and Nagai discovered about the
same-sex parenting studies. These authors know a better or worse
study when they see it, and they tell it like it is. Whether we like it
or not, we are all in their debt.

No Basis: What the Studies Don’t Tell Us About Same-Sex Parenting
By Robert Lerner, Ph.D., and Althea K. Nagai, Ph.D.

“[C]hildren with two parents of the same gender are as well adjusted as
children with one of each kind.”

This view, revolutionary in its implications, and unheard of five years
ago, is now commonly asserted by social scientists, lawyers, policymakers
and the media. Numerous studies are routinely offered to show that the
sexual orientation of a couple makes “no difference” to the well-being of
children. The obvious implication of this view is that two gay “dads” or
two lesbian “moms” can raise a child as well as can two married biological
parents. Simply being surrounded by two caring adults is thought to be
enough to raise most children to be healthy, well-adjusted adults.

Is this claim true? Does the research supporting it stand up to scientific
scrutiny? These are the questions discussed in this study. Our approach
concentrates on an analysis of the methodologies used to carry out the
existing same-sex parenting studies. We conclude that the methods used in
these studies are so flawed that the studies prove nothing. Therefore, they
should not be used in legal cases to make any argument about “homosexual
vs. heterosexual” parenting. Their claims have no basis.

What Social Science Requires

Social science research is a complex process, but it follows a series of
well-defined steps. Each of these steps must be carried out properly to
obtain valid conclusions. Just as a chain is only as strong as its weakest
link, the conclusions derived from any research study are only as reliable
as its weakest part.
The typical sequence of social-scientific research involves:
Formulating concepts and research hypotheses
Creating the research design
Establishing measurements for important concepts
Defining the sample and its selection procedures
Collecting the data
Performing statistical tests and analyzing the data, and
Based on the above, hopefully reaching valid conclusions.
The studies discussed here will be analyzed by following the typi-
cal sequence of social science research methods textbooks. Under
each heading we will analyze all the studies to see how well they
meet accepted social science standards. Any failure in the process
(failure to properly design the study, failure to properly measure the
relevant variables, failure to properly control for extraneous variables,
failure to use the proper statistical tests) makes a study scientifically
invalid. Most important of all, if a study claims to find no difference
(i.e., “non-significant results”) and that study failed to carry out one
or more of these research links in the proper manner, its conclusions are
purely and simply invalid. Why? Because failing to carry out one or more
of these essential elements correctly, in and of itself, increases the
chances of finding non-significant results. In other words, if you look
for no findings using flawed methods, it is all the more likely you will
find nothing.
Social Science and Public Policy
With one exception, the authors of these studies wish to influence
public policy to support same-sex marriage and the adoption of
children by homosexual couples. While the authors of these studies
have every right to advocate this point of view, as do those who dis-
agree with them, their wish means that the stakes in obtaining valid
answers to these research questions are very high. It is not enough
for a study to be interesting, or raise important questions about a
subject, or to be provocative. While these criteria may be enough to
get a study published, they are not strong enough to justify dramatic
alterations in long-established public policies. To justify changes in
public policy, studies should be strong enough that policy makers
have faith in the study’s reliability, and confidence that more research
is unlikely to overturn its findings.
This is not an unreasonable requirement. The public policy consequences
of relying on inadequate or insufficient studies can be devastating.

In 1973, a literature review undertaken by social scientists Elizabeth
Hertzog and Cecelia Sudia purported to find that the effects of growing
up in fatherless homes are at most minimal and likely to be due to other
factors. The authors did not stop here. They stated it might be a good
idea to increase community support for single parents rather than
developing policies that forestall the absence of fathers, or that oppose
easy divorce. This study was part of a larger current of expert opinion
proclaiming that growing up in a one-parent family had no negative
consequences for the children living in these arrangements.
With more rigorous research, these interpretations were chal-
lenged and eventually overthrown. Research has demonstrated that
divorce is not the costless exercise for children that many had pro-
claimed it to be. The newer research demonstrated that children
growing up in fatherless families do not do as well financially, in
school, and emotionally both as children and as adults, as those in
families with their married biological parents.
Therefore, the stan-
dards used here, to investigate studies on the impact of same-sex
parenting on children, are necessarily demanding. We owe ourselves
nothing less.
How These Studies Were Selected
All of the articles used in this review deal with same-sex couples
and/or their children. We excluded dissertations, review articles, and
articles in the nonscientific press.
We have only analyzed reports of
original research studies (i.e., real social science). We have tried to be
as exhaustive as possible, although research is exploding in this field.
Working from a variety of angles, we arrived at a final list of 49
studies for analysis, all of which have been published either in
professional journals or as chapters in a book. All present the results
of original research on homosexual parents and/or their children.

Do these 49 studies offer conclusive proof that there is “no difference”
between heterosexual and homosexual households? We believe that these
studies offer no basis for that conclusion, because they are such deeply
flawed pieces of research. The reader is invited to
make his or her own judgment.

Notes to Introduction

1. Harris, 1998, p. 51. Harris cites this body of studies in her
controversial book on child development.
2. For example, sociologist Judith Stacey, writing in a recent issue of
Contemporary Sociology, a book review journal that focuses on sociology
and public policy, writes that “thus far the research on the effects of
lesbian parenting on child development is remarkably positive and
therefore challenging [the status quo] . . .” Stacey, 1999, p. 21.
3. This is not the same as concluding that traditional family
arrangements are better. It simply states that the evidence presented
above does not justify the opposite conclusion.
4. Since vocabulary related to homosexuality is extremely contentious, we
should explain our terminology. We have tried generally to use the term
“same-sex,” since the terms “gay and lesbian” (on the one hand) and
“homosexual and heterosexual” (on the other hand) are so ideologically
polarized. However, the studies themselves use one set of terms or the
other, so the reader should expect a variety of terms.
5. For example, one can have a perfectly selected sample, but concepts
that are so badly defined and poorly measured that one is unable to
conclude anything from the results of the study.
6. Cited and discussed in Popenoe, 1998, pp. 59-61; McLanahan and
Sandefur, 1994, pp. 13-14. A well-known study in the same vein was
sociologist Jessie Bernard’s The Future of Marriage (1972), which became
famous or infamous for its comment that to be happy in a traditional
marriage a woman must be mentally ill (quoted in Whitehead, 1998, p. 51).
7. For detailed discussion of the extensive research literature see the
following works: Waite, 1995; McLanahan and Sandefur, 1994; Popenoe,
1998; Amato and Booth, 1997; and Whitehead, 1996.
8. Dissertations are original studies, not review articles; but if they go unpub-
lished, the most one can say is that they met the minimum standards for receiv-
ing a degree from the university that granted them, and nothing more. Review
articles were excluded because they present no original data for assessment. Ar-
ticles found in the nonscientific press were excluded because their criteria for
publication (e.g., popular interest, immediate policy relevance) are not the same
as those for assessing the scientific credibility of a study.
9. Graciela Ortiz, M.S.W., conducted initial bibliographic research in the summer
of 1998. Additional studies were identified by examining law review articles
published by Wardle (1997) and Ball and Pea (1998), briefs filed in Baker v.
State, the Vermont same-sex “marriage” lawsuit, and Lesbian and Gay Parenting:
A Resource for Psychologists, Washington, D.C.: American Psychological Associa-
tion, 1995.
10. There is also one book, Tasker and Golombok (1997), which is part of the

Chapter 1
The Good, the Bad, or the Ugly: Formulating the Hypothesis

We’ve all heard the slogans: “If you don’t know where you’re going, you
can count on getting there,” or “If you aim for nothing, you’re sure to
hit it.” The same is true for formulating the hypothesis of a research
study: if your goal is to prove no differences, you’re bound to reach it.
But you won’t have proved “no difference,” only no basis.

All good studies begin with careful definitions of key concepts and
careful delineation of the relationship between these concepts.
Formulating the hypothesis is the crux of any scientific design, and its
development requires special care. The hypothesis determines the main
focus of the study, and frames all subsequent research endeavors.

Hypotheses can be Good (affirmative), Bad (fuzzy), or Ugly (null). Of the
49 studies, two are Good, 29 are Bad, and 18 are Ugly. Understanding why
requires Social Science Research Methods 101, which we will sprinkle
throughout this and other chapters.

What is a Good Hypothesis?

All good social science studies have at their core a positive hypothesis
statement. This takes the form of an explicit conceptual relationship
between two variables whereby something (an independent variable)
“causes” something else (a dependent variable). The researcher posits a
direct relationship between the independent and the dependent variables.
The hypothesis can and should be stated as a proposition that takes the
following form: “the greater the a, the greater the b,” where “a” is the
independent variable and “b” is the dependent variable.
Hypothesis statements may be either quantitative or qualitative. Consider
the following example. A study group of children is enrolled in a social
program such as Head Start, while a control group of children is not
enrolled. The independent variable here, therefore, is “enrolled versus
not enrolled.” The dependent variable might be something like “readiness
for school.” Assuming that “readiness for school” is a quantitative
variable (i.e., it can be scored on a scale of three or more points), the
research hypothesis would compare mean levels of school readiness of
those in Head Start with those who are not. Assuming “readiness for
school” is a qualitative (i.e., “yes” or “no”) variable, the research
hypothesis would compare the proportion of Head Start children who are
ready for school with the proportion of non-Head Start children who are
ready. There are many different possible hypotheses a researcher might
have, depending upon the nature of the problem studied and the level of
measurement assumed in the independent and dependent variables.
Applying this view to studying a parent’s sexual identity and its
possible relationship to child outcomes, the investigator should define
conceptually the independent variable (“homosexual versus heterosexual”
identity), the dependent variable (such as the child’s sexual identity,
psychological adjustment, or sexual behavior), and the posited
relationship between the independent variable and the dependent variable.
An example of such a properly stated research hypothesis is: “the
children of homosexual parents are more likely to grow up to be
homosexual than are the children of heterosexual parents.”
Good: The Affirmative Research Hypothesis
Only two of the 49 studies we examined actually contain an explicit
positive hypothesis statement of this sort (Pagelow, 1980, and Miller,
1979). Pagelow (1980) hypothesizes that lesbian mothers are more
oppressed than heterosexual mothers. The researcher then seeks to measure
this by the concept of perceived oppression in the areas of freedom of
association, employment, housing, and child custody.

Miller (1979) comes closest to presenting a hypothesis in the proper
format. Miller asks: A.) “Do gay fathers have children to cover their
homosexuality?” B.) “Do they molest their children?” C.) “Do their
children turn out to be disproportionately homosexual?” D.) “Do they
expose their children to homophobic harassment?” While Miller does not
put his hypotheses in precisely the y = f(x) format, the hypothesis
statements are both specific and decisional (i.e., they can be answered
as either “yes” or “no” regarding the homosexuality of the father).

Thus Miller’s statements can easily be rephrased into the following
testable hypotheses: A.) The reason for gay men having children is to
cover their homosexuality (as opposed to other choices provided by the
investigator, such as he loved the woman, he was confused, he just
wanted children, don’t know); B.) Gay fathers are more likely to molest
children than are straight fathers; C.) Children of gay fathers are more
likely to be homosexual than are children of straight fathers; and D.)
Children of gay fathers are more likely to be exposed to homophobic
harassment than are children of straight fathers. Stated in this form,
the hypotheses can then be verified or refuted by empirical research.
Bad: The Fuzzy Hypothesis
A majority of the studies we examined (29 of them, or 59 percent) failed
to produce a testable hypothesis. Of these, 12 studies rendered their
statement of the research problem in the form of “Are there differences
between homosexual and heterosexual parents?” For example, Bigner and
Jacobsen (1989a) state their research problem as “an examination of
factors that may motivate gay men to become parents, and to explore
whether gay fathers may differ from heterosexual fathers regarding the
value of children in their life as an . . .” Brewaeys et al. (1997) pose
the problem as an examination of “family relationships and
emotional/behavioral and gender role development of 4-8 year old
children” in lesbian donor-inseminated families, compared to heterosexual
families who also conceived their child by donor insemination and to
heterosexual families who conceived their child naturally.

Hypotheses that are stated in the form of looking for possible
differences do not suffice as statements of research hypotheses.
Formulation of a hypothesis in terms of possible differences fails to
address any of the causal questions that guide hypothesis formation. Such
a formulation is purely descriptive in nature and is not “an explicit
conceptual relationship between two variables whereby something (an
independent variable) ‘causes’ something else (a dependent variable).”
This kind of formulation, which may seem commendable in its
caution, fails the “so what?” test. A proper research hypothesis re-
quires the hypothesizing of some kind of causal mechanism operat-
ing in the real world so that some kind of tentative causal
conclusion can be drawn from the research results if they are valid
and the hypothesis test is successful.
Seventeen studies present the research problem in the form of “what are
the characteristics?” For example, Gartrell states, “The aim [of this
study] was to learn about the homes, families, and communities into which
the children were to be born.” Another writes, “The family dynamics and
developmental changes within these families and the implications for the
psychotherapeutic treatment of lesbian mother families are the subject of
this chapter.” Pennington declares, “The purpose of this chapter is to
discuss the major issues confronted by children living in lesbian mother
households.”

Hypotheses that take the form of descriptions of characteristics face a
different problem from that faced by statements of possible differences.
A focus on what is “characteristic” of a population (e.g., the mean,
median, or mode) can obscure causal relations that are not
“characteristic” of the populations studied, but are nonetheless causal
in nature.
For example, sociologists Sara McLanahan and
Gary Sandefur report that 29 percent of young adults from one-par-
ent families dropped out of high school while only 13 percent of
those from two-parent families dropped out. Dropping out is not
“characteristic” of children from either type of family structure, yet
there is little doubt that a causal relationship between the variables
of type of family structure and the propensity to drop out of school
exists (1994, Figure 1, p. 41). This problem can be put in more
general terms. Focusing on characteristics of populations obscures
the necessity for a proper research hypothesis to focus on the
relationship between two variables and not the properties of each of
them considered separately. In this respect, focusing on the
characteristics of an attribute is misleading and hinders the scientific
research enterprise. None of these studies is a priori invalid as an
instance of exploratory research. Compared to studies that state and test
the research hypothesis properly, however, they are much inferior in
their level of research sophistication and precision. This tells us to
look for and expect other problems in later research steps. Authors with
such weaknesses in their formulation of hypotheses are unlikely to
produce any conclusions sufficiently robust to inform public policy
debates with any degree of dependability.
The Ugly: Affirming the Null Hypothesis

The remaining 18 studies explicitly seek to find no differences between
heterosexual and homosexual parents in child outcomes and to make this
formulation a kind of hypothesis statement. While this procedure is
superior in some ways to those used in the other studies, it is also
highly problematic because of the difficulties associated with testing
hypotheses that purport to affirm the null hypothesis.

The authors of the null-hypothesis-affirming studies seek to show that
children raised by homosexual couples are no more likely to grow up to be
homosexual themselves than are those raised by heterosexual couples, that
they are no more likely to grow up with psychological problems than are
children raised by heterosexual couples, or both.
For example, Flaks et al., in their study of 15 lesbian couples, 15
heterosexual couples, and their children, state, “On the basis of prior
research, we expected [to find] no differences between the children of
lesbian and heterosexual parents in any of the areas evaluated.”
In another case, Huggins studied adolescent children of lesbians,
expecting that a parent’s homosexuality would not result in confusion of
the child’s sexual identity, inappropriate gender role behavior, atypical
sexual orientation, or overall psychopathology.
Likewise, Patterson’s studies of donor-inseminated lesbian families all
start with the expectation of finding no differences between
the children of lesbian and heterosexual parents.
The same is also
the case with Tasker and Golombok (1995, 1997).
The “no difference” hypothesis used in the 18 studies discussed above
inverts the usual social science quantitative research procedure, which
would use a positive research hypothesis in the form described above.
This creates two major methodological problems that go unrecognized by
all the authors of these studies save one (Chan et al., 1998):
1) Failing to reject the null hypothesis necessarily leads to an
indeterminate result because one cannot validly “confirm”
the null hypothesis, and
2) Inverting the normal hypothesis testing situation makes it
too easy to fail to reject the null hypothesis, which is the
outcome favored by these researchers.
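Problem (2) can be made concrete with a quick simulation. The effect size (half a standard deviation) and the sample sizes below are arbitrary assumptions chosen purely for illustration: every simulated pair of populations really does differ, yet small samples mostly fail to reject the null, so a researcher content with "no difference" need only run a small study.

```python
import random
import statistics

random.seed(1)

def share_significant(n, effect=0.5, trials=2000):
    """Fraction of simulated two-group studies that reach |t| >= 2.0 when a
    real difference of `effect` standard deviations exists between groups."""
    hits = 0
    for _ in range(trials):
        a = [random.gauss(0, 1) for _ in range(n)]       # comparison group
        b = [random.gauss(effect, 1) for _ in range(n)]  # group with a real effect
        # Pooled standard deviation and the usual two-sample t statistic.
        sp = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
        t = (statistics.mean(b) - statistics.mean(a)) / (sp * (2 / n) ** 0.5)
        hits += abs(t) >= 2.0
    return hits / trials

for n in (10, 30, 100):
    print(f"n = {n:3d} per group: {share_significant(n):.0%} of studies detect the real difference")
```

With 10 subjects per group, only a small minority of simulated studies reach significance; with 100 per group, the vast majority do. An underpowered design thus delivers "no difference found" almost by construction.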
This results in an undue partiality in interpreting their research
findings and in carrying out the research itself. To see all of this
clearly, it is necessary to review the usual statistical testing procedure
in quantitative social science. This procedure requires statistical test-
ing of a positive research hypothesis.
A simplified example may help to visualize what is involved. For
example, suppose a researcher hypothesizes that political liberalism
leads to greater support for abortion rights than does political con-
servatism. One way to test this research hypothesis might be to use a
national sample survey of the American public (e.g., data from the
General Social Survey produced by the National Opinion Research
Center). With this body of data, which consists of individual re-
sponses to questions on a questionnaire, one computes the mean
“support for abortion” score of liberals and the mean “support for
abortion” score for conservatives.
One can assume that if this pro-
cedure is carried out, liberals will have a higher average score than do
conservatives (and in fact, they do). Since it is extremely unlikely that
such a comparison will yield exactly the same average score for both
liberals and conservatives, one must question whether this finding is
a real difference or whether it could be due to chance factors. The
difference in averages alone does not provide sufficient information
to determine this. To answer the question, statisticians
have developed ways of distinguishing between statistically significant
and statistically insignificant differences. Insignificant differences
might be due to sampling error, measurement error, or just random
fluctuations. These are competing explanations that ought to be
considered for any research finding. Another way to state the null claim
is that, if it is true, any difference found in the data is due solely to
random variation. Chance occurrences of this kind do happen; individuals
do win the lottery and draw royal flushes in honest poker games.
To ascertain whether in a given instance random variation explains
the findings in the data, the researcher carries out a statistical hypothesis
test. This requires a research hypothesis
and a null hypothesis.
The research hypothesis is of the kind already discussed. The null
hypothesis is a hypothesis of no difference or no effect. In the abor-
tion example, the research hypothesis is that liberals are more pro-
abortion than are conservatives.
The null hypothesis is that there is
no difference between liberals and conservatives in their support for
abortion rights. Both cannot be true. In carrying out statistical
hypothesis testing, the null hypothesis is a statistical device that
allows for calculation of the value of a test statistic. The test
statistic is used to determine the probability of obtaining data at least
as extreme as the data at hand, assuming the null hypothesis is true.
After the calculations are carried out, the test statistic yields some
number. If that number is greater than a certain preset value, called the
critical value (e.g., t >= 2.00), the null hypothesis is rejected at the
associated level of statistical significance (e.g., p < .05) and the
research hypothesis is accepted.
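The sequence just described (null hypothesis, test statistic, critical value, decision) can be sketched in a few lines. The scores below are invented stand-ins, not actual General Social Survey data:

```python
import math

# Invented stand-in "support for abortion rights" scores, not real GSS data.
liberals = [5, 6, 7, 5, 6, 6, 7, 5, 6, 7, 6, 5]
conservatives = [4, 3, 5, 4, 4, 3, 5, 4, 3, 4, 5, 4]

def pooled_t(a, b):
    """Two-sample t statistic computed under the null hypothesis of equal means."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

t = pooled_t(liberals, conservatives)
CRITICAL = 2.07  # two-tailed critical value for p < .05 at 22 degrees of freedom

if abs(t) >= CRITICAL:
    print(f"t = {t:.2f} >= {CRITICAL}: reject the null, accept the research hypothesis")
else:
    print(f"t = {t:.2f} < {CRITICAL}: fail to reject the null (indeterminate result)")
```

With these invented numbers the statistic clears the critical value, so the null is rejected; had it fallen short, the result would be indeterminate for exactly the reasons discussed below.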
For example, suppose we find that the mean abortion rights score
for liberals is greater than it is for conservatives. In our example, the
relevant test statistic is calculated and the results are checked to see
whether they are statistically significant or not at the preset level of
statistical significance (in fact, there is a substantial statistically sig-
nificant difference between liberals and conservatives when this test
is carried out). Then we reject the null hypothesis and accept the al-
ternative hypothesis that “liberalism” is correlated with “support for
abortion rights.” Of course, carrying out this one hypothesis test
does not end the researcher’s task. In fact, the formal hypothesis test
is just the initial step in analyzing the data. The social scientist then
has to show that this difference is not due to other factors (for ex-
ample, due to differences in education among sample members or to
selection biases in the sample). However, at least he has a statistical
relationship to work with, to try to either explain or explain away in
terms of broader substantive considerations.
There are two subsidiary points that need to be made here. First,
the reason the statistical test situation is conceived in the above man-
ner is to yield a determinate outcome. If the null hypothesis is re-
jected, then the alternative hypothesis is accepted. In our example
above, we reject the null hypothesis of no difference between liberals
and conservative on their support for abortion rights and accept the
alternative hypothesis that liberals have a higher average support for
abortion rights scores than do conservatives which is statistically sig-
nificant (and they really do). Also, statistical tests of this kind place
the burden of proof on the investigator to show support for his re-
search hypothesis. That is why the criterion for rejecting the null hy-
pothesis is demanding. Thus, social science researchers conventionally
use the p<.05 level of statistical significance. It does not have to be
this stringent (1 out of 20), but the practice has evolved so that it
has become the standard social science research convention in the
standard positive hypothesis social science research situation.
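The "1 out of 20" convention can be checked directly by simulation. In this hypothetical sketch (invented data, standard-library Python only), both groups are drawn from the same population, so the null hypothesis is true by construction; a test at the 5 percent level should nonetheless reject it about 5 percent of the time:

```python
# Hypothetical Monte Carlo check of the p < .05 convention: when the
# null hypothesis is TRUE (both groups drawn from the same population),
# a two-sided test at the 5% level rejects roughly 1 time in 20 by chance.
import random
from statistics import mean, variance

random.seed(1)

def two_sample_t(a, b):
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5

trials, rejections = 2000, 0
for _ in range(trials):
    a = [random.gauss(50, 10) for _ in range(30)]
    b = [random.gauss(50, 10) for _ in range(30)]
    if abs(two_sample_t(a, b)) >= 2.00:  # approximate critical value, df = 58
        rejections += 1

print(rejections / trials)  # should land near 0.05
```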
What happens if the null hypothesis cannot be rejected? In this
situation there are always two competing explanations for this result.
The first possible explanation for the failure to reject the null hy-
pothesis is that whatever differences are found really are due to
chance factors, so that no statistical, let alone causal, relationship be-
tween the two variables really exists. A statistician might say that if
the test were repeated an infinite number of times, a zero correlation
or a zero difference between the two groups studied would result.
The examples of the honest poker game and the honest lottery are
relevant here. The second possibility is that the research hypothesis is
true but its truth cannot be ascertained by the research results be-
cause there is some flaw in the study design or in the statistical test
itself, which causes the test statistic to yield a statistically insignifi-
cant result or p value.
Therefore, failing to reject the null hypoth-
esis by itself does not lead to a determinate result. Since every failure
to reject the null hypothesis has two possible explanations, one can-
not simply “accept” the null hypothesis in the same way that one
can “reject” the null hypothesis and “accept” the alternative hypoth-
esis. Further investigation or conducting a new study is always in order.
This is the problem with the 18 studies that explicitly sought to
confirm the null hypothesis as their research hypothesis. These stud-
ies sought to prove the null hypothesis, which, as we have shown, is
not the same thing as failing to reject the null hypothesis. In sub-
stantive terms, their authors seek to show that homosexual parents
produce the same child outcomes as do heterosexual parents. This
means that they desire to be able to “accept” the null hypothesis as
showing that homosexual parenting has no effect on child outcomes
simply on the basis of failing to reject the null hypothesis. This vio-
lates the standard statistical hypothesis testing procedure. It is
wrong because, as we show above, failing to reject the null hypoth-
esis does not necessarily mean that the null is true.
This is not merely a technical flaw in these studies. These investi-
gators report their failure to reject the null hypothesis and falsely
conclude that there is no difference between homosexual and hetero-
sexual parents in child outcomes.
This false conclusion invalidates
the “findings” of no difference between heterosexual and homo-
sexual parents as reported in the research literature that we have sur-
veyed. Only the authors of one study (Chan et al., 1998) showed
any awareness of the problem, but they did nothing to correct for it
or to alter their interpretations of their results because of it. If the
null hypothesis itself becomes the research hypothesis, and some
kind of research hypothesis is to become the new null hypothesis,
then the standard testing situation must be radically altered to ac-
commodate this situation and non-standard statistical tools are
needed in order to reach defensible results.
The studies we surveyed
all failed to do this or even to indicate that they saw the need for do-
ing it. This indicates that their authors’ understanding of the logic
of quantitative social scientific research is suspect. When the hypoth-
esis statement is properly conceptualized, the null hypothesis is used
in conducting statistical tests as the comparison hypothesis to the
one under investigation. It is no substitute for a properly formulated
affirmative hypothesis. It is the objective of properly stated hypoth-
eses, proper design, and proper execution of an empirical research

study to decrease the probability that the relationship uncovered by
the investigator is due to chance.
The goal of genuine social-scientific research, in short, is to make
the null hypothesis less, not more, likely.
Properly speaking, then,
one can never prove the validity of the “null hypothesis.” When you
hear the statement that a study found “no significant difference,”
what this actually means is that, having done some tests, the investi-
gator can only say, “I looked for differences, and haven’t found any-
thing significant yet. But who knows?” In social-scientific terms, the
study “failed to reject the null hypothesis.” It proved nothing.
In summary, in conducting a statistical test of a hypothesis there
are two possible outcomes. The first is to be able to reject the null
hypothesis and accept the research hypothesis that a difference be-
tween the groups does exist that is not likely to be due to chance
factors. The researcher then proceeds to see if his or her hypothesis
can stand up to other tests of its validity, by introducing controls for
extraneous and confounding factors and the like. These are all the
subsequent research steps we will be discussing below.
The second possible outcome is to fail to be able to reject the
null hypothesis. This is NOT the same as showing that no effect ex-
ists. There are many possible reasons why one may fail to reject the
null hypothesis yet be in error in doing so. For example, the sample
used in the study may be too small to reach the appropriate level of
statistical significance for a given effect, the significance level used in
the significance test itself may be set too high, or the research instru-
ments used to measure the independent and dependent variables
may be so highly unreliable that no stable results are possible. Even if
none of these factors can explain the absence of positive results, this
still does not show that no effect exists. The researcher then pro-
ceeds to see if his or her non-finding can stand up to additional tests
of its validity, by introducing controls for extraneous and confound-
ing factors that might cause a spurious non-correlation (see more on this below).
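A small simulation makes the false-negative problem vivid. In this hypothetical sketch (all numbers invented), the two groups really do differ by half a standard deviation, yet with ten subjects per group the test usually fails to reject the null hypothesis; with a hundred per group it almost always succeeds:

```python
# Hypothetical simulation of false negatives: a REAL half-standard-
# deviation difference between groups is usually MISSED with 10 subjects
# per group, and usually FOUND with 100 per group. Invented numbers.
import random
from statistics import mean, variance

random.seed(2)

def two_sample_t(a, b):
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5

def detection_rate(n, trials=1000, delta=5.0, sd=10.0, crit=2.00):
    """Fraction of trials in which a true difference `delta` reaches significance."""
    hits = 0
    for _ in range(trials):
        a = [random.gauss(50 + delta, sd) for _ in range(n)]
        b = [random.gauss(50, sd) for _ in range(n)]
        if abs(two_sample_t(a, b)) >= crit:
            hits += 1
    return hits / trials

print(detection_rate(10))   # low power: the real effect is usually missed
print(detection_rate(100))  # adequate power: the same effect is usually found
```

Every failure to reject in the small-sample runs is a false negative: the effect exists by construction, but the study design cannot see it.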
Precisely because the usual and correct research procedure is to try
to reject the null hypothesis, projects that aim to demonstrate no
significant differences between homosexual and heterosexual parents
and/or their children face serious problems. If the investigator starts

with the goal of finding no differences, the investigator may be too
quick to assert that there is no relationship between the independent
and dependent variables, without taking into account all these other
aspects of a study that may go wrong. Serious social scientific re-
search is complex enough that the subsequent elements of a study
introduce ample opportunities for poor execution. Ironically, each
poorly executed research step, such as setting up comparison
groups, sampling, measurement, and statistical analysis, increases the
likelihood of finding no difference.
What Went Wrong and What Can Be Done About It?
Let us restate briefly the lessons from Step 1: Formulating the Hypothesis.
1) A proper hypothesis defines:
The posited cause (or independent variable),
The posited effect (or dependent variable), and
The posited causal relationship between the two.
Only two of the 49 studies do this.
2) A research project that looks for differences, but fails to
state its hypothesis in any clearly causal form, is off to a
very bad start. This describes 12 studies.
3) A research project that focuses on characteristics, but
fails to state its research hypothesis in a standard form, is
also off to a very bad start. This describes 17 studies.
4) A research project designed to “prove a negative”–the null
hypothesis–is doomed to great difficulty from the start. Yet
18 of the studies take exactly this approach.
If your goal is to prove no differences, you’re bound to reach it,
and the poorer research you do, the more successful you will be. But
there is nothing inevitable about this. There is no reason that proper
research hypotheses cannot be formulated in this area of study, and
then implemented through various methods that we will now describe.
Notes to Chapter 1

1. A study that is properly thought-out at this conceptual level would not make
the kind of mistakes that show up subsequently in much of the research ana-
lyzed in this essay, such as improper sampling, neglect of extraneous variables,
and attempts to “prove” the null hypothesis.
2. See Rosenberg (1968), Hirschi and Selvin (1973), Cook and Campbell
(1979), Davis (1985), Rossi and Freeman (1995), and Nachmias and Nachmias
(1997) for discussions of the conceptual logic of social science research.
3. Mathematically, this is stated y=f(x), where x is the independent variable and
y is the dependent variable.
4. Most causal statements in the social sciences exhibit neither necessary nor suf-
ficient conditions. Hypothesis statements are probabilistic in nature because
most social phenomena are due to the operation of multiple causes (see the
sources cited above for more discussion of this point).
5. This discussion of levels of measurement is relatively crude, but sufficient for
this purpose. Generally investigators distinguish between nominal, ordinal, in-
terval, and ratio scales. See Nachmias and Nachmias, 1997, pp. 158-163.
6. Any hypothesis can be translated into one of three statistical forms suitable
for direct assessment: 1) a correlation between two quantitative variables, 2) a
difference between two qualitative attributes in their mean scores on a quantita-
tive variable, or 3) a difference in (or a ratio of) proportions between two quali-
tative attributes.
7. Notice that this example begs all sorts of definitional questions concerning
homosexual and heterosexual “orientation.” Even a superficial examination of
the literature reveals that sexual orientation is far more complex than a simple bi-
nary attribute like gender. Not only are there different types of homosexuals,
but the self-identification question does not distinguish between identities,
feelings, and behaviors. Nor does it describe these properties in the life histories
of the individuals studied. For a thorough discussion of these questions, see
Laumann et al., 1995, pp. 283-286 and passim.
8. Both of these studies are seriously flawed. Pagelow does not focus on chil-
dren and Miller has no comparison group of heterosexual parents.
9. Pagelow, 1980, p. 191. Unfortunately, Pagelow subsequently used self-con-
structed and unreported measurements, and did no statistical testing of her self-
selected sample. This fatally flaws her study.
10. Miller, 1979, p. 545.

11. Unfortunately, Miller failed to compare his sample to a group of hetero-
sexual fathers and their children. So even though his hypothesis statement is
good, his study is fatally flawed. All the steps are important.
12. These studies are: Golombok et al, 1983; Green, 1978, 1982; Green et al,
1986; Kweskin and Cook, 1982; Lewin and Lyons, 1982; Lyons, 1983;
Mucklow and Phelan, 1979; Bigner and Jacobsen, 1989a, 1989b, 1992;
Brewaeys et al, 1997; Golombok and Tasker, 1996; Koepke et al, 1992; Miller
et al, 1982; and Bailey et al, 1995.
13. P. 166.
14. Brewaeys et al (1997, p. 1349).
15. See above.
16. Morris Rosenberg usefully distinguishes a number of causal relationships
commonly studied in social scientific research including the stimulus-response
and property-response relationships. Studies of the consequences of sexual ori-
entation can be classified as (potential) property-response relationships.
Rosenberg, 1968, pp. 13-17.
17. The studies that characterize the problem in the form of “what are the char-
acteristics?” are: Barret and Robinson, 1990; Bozett, 1980; Cameron, and
Cameron, 1996; Crosbie-Burnett and Helmbrechty, 1993; Gartrell et al, 1996;
Hare, 1994; Hoeffer, 1981; Javaid, 1992; Kirkpatrick et al, 1981; Lewis, 1980;
Lott-Whitehead, and Tully, 1992; McCandlish, 1987; O’Connell, 1993;
Pennington, 1987; Rand et al, 1982; Riddle and Arguelles, 1989; Ross, 1988;
Weeks et al, 1975; West and Turner, 1995; Wyers, 1987.
18. Gartrell et al, p. 274.
19. McCandlish, p. 23.
20. Pennington, p. 59.
21. Hirschi and Selvin (1973, pp. 123-125) develop this point at some length
when they show that focusing on the characteristics of attributes is incompatible
with the fact that social phenomena nearly always exhibit multiple causes.
22. These studies are: Chan et al, 1998; Crosbie-Burnett and Helmbrechty,
1993; Flaks et al, 1995; Green, 1982; Green et al, 1986; Harris and Turner,
1985; Huggins, 1989; Javaid, 1992; Kirkpatrick et al, 1981; Miller, 1979;
McNeill et al, 1998; Miller et al, 1982; Patterson, 1994a, 1994b, 1997; Tasker
and Golombok, 1995, 1997; Turner et al, 1990.
23. 1995, p. 106.

24. 1989 p. 124.
25. E.g., Patterson, 1994a, pp. 157-158.
26. This can be set up in many ways, using many different types of data, and
many different test statistics. The account given here is illustrative only.
27. Statistics textbooks use the terms null and alternative hypothesis. We prefer
the term research hypothesis because the alternative hypothesis is what the re-
searcher is actually interested in.
28. One might object that the research hypothesis stated here is actually one-
sided, while the statistical tests described below are two-sided. While strictly
speaking this is correct, there is generally little difference in outcomes between
the two situations.
29. A two-sided t-test where the null hypothesis is to be rejected if p<= .05 re-
quires a value of 2.00 or greater with a sample size of 50 or more.
30. This is the two-sided 5 percent level of statistical significance. One-sided
tests are sometimes used, but they are vulnerable to a certain degree of manipu-
lation in order to achieve the desired result.
31. There are many possible flaws that could produce this result even when the
null hypothesis should be rejected and the alternative hypothesis accepted.
These include: too stringent a statistical test, too weak a test statistic, insufficient
sample size, measurement unreliability, restriction of variable range, suppressor
third variables, and inadequate comparison groups. Some of these are discussed below.
32. In fact, there were only five studies with any results worth considering at all.
This means that the authors used a heterosexual comparison group and they in-
cluded a sustained degree of multivariate testing. These are Brewaeys et al.
(1997), Chan et al. (1998), Flaks et al. (1995), and Green et al.
(1986). Tasker and Golombok (1995; 1997) is noteworthy as being the only
longitudinal study of the issue, but the authors failed to analyze their data prop-
erly and their samples were vanishingly small.
33. Strictly speaking it is a logically invalid inference.
34. Statisticians have begun to develop practical means for doing this. For ex-
ample, Donohue describes how an alternative hypothesis of a particular magni-
tude can be used as a kind of “null” hypothesis. He then shows how a p-like
value can be calculated that may permit rejection of this “null” hypothesis.
35. There is always the probability, albeit sometimes extremely small (for ex-
ample, one out of 10,000 instances), that any relationship uncovered by the in-
vestigator is due to chance. Needless to say, the smaller the probability that the
relationship is due to chance, the greater confidence we have in its reality, all
other things being equal.
36. This is because the social sciences, like all other sciences, are interested in
discovering laws governing the phenomena in question. There are presumably
an infinite number of ways of getting this wrong.
37. The late Jacob Cohen, a leading psychometrician, pointed out that any at-
tempt to prove the null hypothesis “is always strictly invalid.” Cohen, 1988, p.
16. One can only calculate the proper test statistic and then either reject the
null hypothesis or fail to reject it.

Chapter 2
Compared to What?
Methods to Control for Unrelated Effects
Notes for this section begin on Page 53

Now that you have a hypothesis, you put it to work. You need
operational definitions. These are the specific recipes that translate
the independent and dependent variables that are part of the
researcher’s hypothesis into real world operations or actions. The
process of operationalization, as it is sometimes referred to, involves
translating these variables into concrete measurements that can be
recorded and analyzed. For example: if your hypothesis involves
“gender,” you need an operational definition of this concept. If a re-
searcher plans to carry out a reputable survey, there must be a box in
the survey where a respondent may mark his or her gender.
Without understanding the operational definitions used by a
study, it is impossible to know how the hypothesis was tested. Ad-
ditionally, another researcher cannot replicate the study. To make
matters worse, if operational definitions are imprecise or erroneous
they will have one predictable effect: to increase the probability of
“finding” that there is “no difference.”

Five Kinds of Controls and Why They Matter

If the goal of a study is to test a hypothesis, and one has
operationalized the concepts used in the hypothesis, then the stage
is set for the next step. The researcher must impose various controls
on the research design to eliminate false answers. While the actual
process can be complex, the basic idea is simple: if you want to
show that A “causes” B, you need to get other causes out of the way.

The five key methods for doing this are (1) use a comparison group,
(2) control for extraneous variables, (3) control for suppressor vari-
ables, (4) use pair or group matching and (5) use multivariate statis-
tical tests. These methods build upon one another and most valid
studies use at least three of these methods.
In this chapter, we will define each method, and look at how it
was used, not used, or misused in these studies. In brief, our exami-
nation of the 49 studies disclosed:
Eighteen used no control method of any kind
Seven used only one control method
Fifteen used only three control methods
Eight used four control methods
One used all five control methods
Method One: Use Control Groups
As an absolute minimum, a study of whether parent sexual iden-
tity affects child outcomes needs a study group and a comparison
group. If the independent variable is the sexual orientation of the
parent, there must be at least two groups of parents, homosexual
and heterosexual.
Otherwise it is logically impossible to draw any
conclusions about the possible effects of parental sexual orientation.
Attributes that do not vary are not variables and can explain noth-
ing. Ideally, the study and comparison groups should differ solely
on the single variable of the parent’s sexual orientation. The groups
should be otherwise identical regarding the parent’s level of educa-
tion, the ages of their children, the ages of the parent, etc. These
other features are extraneous variables
whose influence the re-
searcher strives to eliminate as far as possible. The control group
should increase the likelihood that any results uncovered by the in-
vestigator are actually based on differences in parent sexual orientation.
It is disappointing to discover, therefore, that 21 studies (43 per-
cent) had no heterosexual comparison group at all. This makes them
scientifically invalid from the outset.
For example, Green’s study of 37 clinical cases of children raised
by transsexual and homosexual parents (1978) lacks even one hetero-
sexual comparison group. Green notes this as a “limitation,” but
goes on to conclude (albeit tentatively), that children raised by
transsexual or homosexual parents are no different in their sexual
identity than those raised by heterosexuals.
But the lack of a control
group is more than a “limitation.” It makes it impossible for Green
to offer any scientific generalizations.
Similarly, Miller (1979), after formulating a good hypothesis,
fails to include any heterosexual control group in the study on gay
fathers and their children. He recognizes the need for carefully con-
trolled studies,
but his is not one of them. This does not keep him,
however, from making at least two significant claims based on his
study: (1) “[T]here does not appear to be a disproportionate
amount of homosexuality among the children of gay fathers,”
(2) while the children he studied had “problems of sexual acting-
out,” the acting-out is more likely to be a function of divorce, not
of the father’s homosexuality.
This is a reasonable hypothesis. If he
used proper methods, we might be able to see if it is true. He does
not, however, so his claims have no scientific basis.
Lacking comparison groups, these studies tell us nothing about
the effects of differing parental sexual identities on child outcomes.
They should be simply disregarded.
The studies by Charlotte Patterson use comparisons but have
similar flaws. The Patterson study uses a method that does not al-
low controls for extraneous variables.
In her Bay Area Lesbian Fam-
ily Study, Patterson collected data on 37 lesbian families. The study
then compared scores from the study group with national averages
for various psychological tests. The study also statistically compared
the lesbians’ children’s scores on Eder’s “Children’s Self-View Ques-
tionnaire” with those of 60 children from Eder’s original study.
Because there is no control group, Patterson’s method makes it
impossible to know whether the study’s sample of lesbian mothers
differs significantly from the study’s other samples because of “sexual
orientation” or because of something else altogether. Other differ-
ences might account for Patterson’s findings. Comparing samples is
not a substitute for a full-fledged comparison group studied at the
same time and in the same way. Patterson’s colleagues acknowledge
this serious methodological flaw.

Overall, only 28 studies (if we include Patterson’s three studies)
actually compared those in a homosexual study group to at least one
heterosexual comparison group.
Twenty-four studies recruited a
separate heterosexual comparison group, and one used a random
sample of a general population that included both homosexual and
heterosexual parents.
Both are common methods of allowing for
homosexual/heterosexual comparisons.
Method Two: Control for Extraneous Variables
The study group and comparison group should be identical ex-
cept for the independent variable. Unless a controlled experiment is
possible, however, this is nearly impossible to achieve. Given
that a controlled experiment is not feasible, either practically or ethi-
cally, the investigator should then use some form of control for ex-
traneous variables. This will increase the probability that any changes
found in the dependent variable are more likely due to changes in
the independent variable, rather than to other variables.
Storks in Sweden. One common example used in statistics
classes is the relationship between the number of storks and the
birthrate in Swedish counties. Counties with more storks (the inde-
pendent variable) also had higher birthrates (the dependent variable).
If the researcher proceeded mechanically, he or she would errone-
ously conclude that there is a causal connection between storks and
babies. Yet there is a correlation. Where does it come from? A third
variable: rural-urban differences. Rural areas have both a greater
number of storks, and a higher birth rate (see Diagram 2).
[Diagram 2. “Rural County” has a positive relationship with both “Number of Storks” and “County Birthrate”; there is no relationship between storks and birthrate.]

The only way we know that there is no relationship between the
number of storks and the birthrate is because we controlled for ur-
banization. Controlling for extraneous variables, therefore, is abso-
lutely critical in establishing any kind of causal inference in a
non-experimental setting. If extraneous variables are not controlled
for, or are improperly controlled for, the investigator cannot con-
clude that his or her findings have anything (or nothing) to do with
homosexual and heterosexual parental identities. Claiming a causal
relation (or lack thereof) may be the same as claiming that more
storks cause more babies.
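The storks example can be reproduced numerically. This hypothetical sketch (invented parameters) builds counties in which rurality drives both stork counts and birthrates, then shows that the raw correlation is large while the correlation within each rural/urban stratum is near zero:

```python
# Hypothetical storks-and-babies simulation: "rural" drives both stork
# counts and birthrates, so the raw correlation is strongly positive even
# though storks and births are unrelated within each stratum.
import random
from statistics import mean

random.seed(3)

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

counties = []
for _ in range(500):
    rural = random.random() < 0.5
    storks = random.gauss(20 if rural else 5, 3)   # more storks if rural
    births = random.gauss(15 if rural else 10, 2)  # higher birthrate if rural
    counties.append((rural, storks, births))

raw = pearson([s for _, s, _ in counties], [b for _, _, b in counties])

# Control for the extraneous variable: correlate within each stratum.
within = [
    pearson([s for r, s, _ in counties if r == flag],
            [b for r, _, b in counties if r == flag])
    for flag in (True, False)
]

print(round(raw, 2))                  # strongly positive (spurious)
print([round(c, 2) for c in within])  # each near zero
```

An uncontrolled analysis of these data would "find" that storks bring babies; stratifying on the rural/urban variable makes the relationship vanish.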
Only 23 of the studies (46 percent) control for extraneous vari-
ables at all.
The basic rule of thumb is that any time the investigator finds a
substantial or significant difference on some third variable between
the study and the control group, this third variable must be entered
into subsequent statistical analyses as a control variable. Otherwise,
the effect will be missed, and the results of the study will then be in-
valid. For example, Golombok et al. (1983) found a statisti-
cally significant difference between their homosexual and
heterosexual groups regarding education, psychiatric treatment and
contact with fathers. The lesbian mother group had significantly
greater levels of education, had more psychiatric treatment, and their
children had more contact with their fathers. All these variables
should have been taken into account when performing any subse-
quent analysis. If they are not, all subsequent findings could be a
function of the effects of any or all of these variables that produce
statistically significant differences. Since these studies did not do
this, their results are scientifically invalid.
Method Three: Control for Suppressor Variables
There is also a reverse problem, of central importance to the
research we are critiquing, that a study needs to address: spurious
non-correlation. This happens when a third variable, nei-
ther independent nor dependent, causes the false impression that the
independent and dependent variables are unrelated. This is called a
suppressor variable because it statistically suppresses the truth about
the real cause.

The possibility that a suppressor variable is at work increases
when (1) there is no relationship (or a weak one) between the inde-
pendent and dependent variables, and (2) the third variable is posi-
tively associated with either the independent or dependent variable
and negatively associated with the other variable. Because of its dis-
tinctive relationship with the independent and dependent variables,
this suppressor variable masks the true relationship between the in-
dependent and dependent variables. The statistical effect is to create
a false impression that there are “no differences” at work. This false
impression, however, can be easily removed from a study if the sup-
pressor variable is controlled for.
Here’s an example. Imagine someone putting forth the hypoth-
esis that race (the independent variable) affects the likelihood of vot-
ing (the dependent variable). If the study does not use good
controls, it appears that African-Americans are less likely to vote
than are Caucasians. However, if education is controlled for, the re-
lationship reverses itself—African-Americans are more likely to vote
than are Caucasians. This is because: 1.) Education is positively re-
lated to the probability of voting (those with more education are
more likely to vote than those with less), and 2.) Currently, African-
Americans are less educated on average than are Caucasians. Put
slightly differently, it appears that race is the key factor in likelihood
of voting, but it turns out that the real factor is education. One
cause looks real, but it is not. Until controls are used, the other
cause is invisible. Then it turns out to be the real cause.
So what? Here’s the payoff: If one hopes to show “no differ-
ence”–in social science language, that “no effect exists between the
independent variable and the dependent variable”–then it is crucial
that the study identify and control for misleading suppressor vari-
ables. Otherwise real causes, if present, will be missed.
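The voting example can likewise be simulated. In this hypothetical sketch (all probabilities invented for illustration), group B is more likely to vote than group A at every education level, yet appears less likely to vote overall because education, the suppressor, is unevenly distributed:

```python
# Hypothetical suppressor-variable simulation of the voting example.
# At EVERY education level group B votes more than group A, but because
# group A is more educated, the raw comparison points the other way.
import random

random.seed(4)
people = []  # (group, high_education, voted)
for _ in range(20000):
    group = random.choice("AB")
    high_ed = random.random() < (0.8 if group == "A" else 0.3)
    p_vote = (0.8 if high_ed else 0.3) + (0.1 if group == "B" else 0.0)
    people.append((group, high_ed, random.random() < p_vote))

def rate(rows):
    """Proportion of the given rows who voted."""
    return sum(voted for *_, voted in rows) / len(rows)

raw_a = rate([p for p in people if p[0] == "A"])
raw_b = rate([p for p in people if p[0] == "B"])
print(raw_a > raw_b)  # uncontrolled: group A looks more likely to vote

for ed in (False, True):
    a = rate([p for p in people if p[0] == "A" and p[1] == ed])
    b = rate([p for p in people if p[0] == "B" and p[1] == ed])
    print(ed, b > a)  # controlled: group B votes more at each level
```

Until the education variable is held constant, the comparison points in exactly the wrong direction, which is the danger the text describes.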
Unfortunately, only one of the 49 studies, Green et al (1986),
came close to explicitly addressing the problem of spurious non-cor-
relation, or to controlling for suppressor variables. The Green study
incorporates a method for automatically controlling for suppressor
effects, although it does so rather mechanically, and does not discuss
the problem of suppressor effects in studies that seek to show that
“no differences” exist. The Green study carries out an initial regres-
sion analysis, using seven variables to predict the child’s responses to
interviews and tests.
Statistically significant variables were included
in the final regression model,
which automatically includes poten-
tial suppressor effects.
Unfortunately, having done all this, the
Green researchers did not publish their actual regression results, so
we do not know which variables were included or dropped. The au-
thors only mention three findings that were statistically significant
based on their regression analysis.
They do not, however, report any regres-
sion coefficients or standard errors, which would allow the reader to
perform his or her own significance test. This exclusion is extremely
odd. The common statistical practice is to provide a table of regres-
sion coefficients (and either their standard errors or their t-statistics)
when undertaking a multiple regression analysis.
The table displays
all variables entered into the equations, and discloses all subsequent
results, significant and non-significant, in a table so other investiga-
tors can examine the results and reach their own conclusions.
As a
result of the table’s exclusion, there is no way that anyone can treat
Green’s conclusions as scientifically valid. None of the other studies
on homosexual parents control for suppressor variables. But does
this really matter? Let us examine what difference it made in several
studies that failed to address this issue. We will consider two poten-
tial suppressors: prior psychiatric treatment and parental education.
Prior Psychiatric Treatment as a Potential Suppressor. For ex-
ample, the studies of lesbian families in Golombok (1983) found
prior psychiatric treatment to be statistically significantly higher for
the lesbian group than for the heterosexual group.
The studies, however, did not include prior psychiatric treatment
as an extraneous variable to be analyzed simultaneously with the in-
dependent variable. Why they did not is puzzling, for the logic here is quite
simple. If lesbianism is associated with seeing a therapist, and if pa-
rental therapy is positively associated with child outcomes (which
could be the case if it is effective), then it follows that the “non-rela-
tionship” between homosexual orientation and child outcomes
Golombok claims might be spurious. The truth might be quite different.

[Diagram 3. Simple bivariate relationship: “Orientation of the Mother” and “Behavior of the Child,” with no differences found.]

If suppressor variable effects are at work, however, things might
actually be as follows:

[Diagram 4. A potential third variable, “Treatment of the Mother,” is positively related to both the orientation of the mother and the behavior of the child, suppressing a possible negative relationship between the independent and dependent variables.]

It could be, in other words, that the real reason there is “no dif-
ference” is that the lesbian mothers were more likely to have had
prior psychiatric treatment. By not controlling for that suppressor
variable, any possibly negative relationship between a mother’s lesbi-
anism and her child’s behavior would be effectively shielded from
view. All statistical analysis performed by Golombok et al (1983)
and Tasker and Golombok (1995, 1997) should automatically
have included, at minimum, any statistically significant extraneous
variables. This would reveal any potential negative relationships be-
tween the “sexual orientation” of the mother and child outcomes.
Without this analysis, one cannot conclude that there are “no differ-
ences,” only that there is no information.
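The suppressor mechanism at issue can be demonstrated with a small simulation. The sketch below uses entirely hypothetical numbers: a group indicator, a "treatment" variable positively related both to group membership and to the outcome, and a direct negative group effect. The bivariate comparison then shows "no difference" even though controlling for the suppressor reveals a substantial negative relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: treatment is more common in the study group, helps
# the outcome, and group membership itself directly lowers the outcome.
group = rng.integers(0, 2, n)             # 0 = control, 1 = study group
treatment = group + rng.normal(0, 1, n)   # positively related to group
outcome = treatment - group + rng.normal(0, 1, n)

# Bivariate analysis: the two paths cancel, so group looks unrelated
# to outcome ("no differences found").
X1 = np.column_stack([np.ones(n), group])
b1 = np.linalg.lstsq(X1, outcome, rcond=None)[0]
print(f"bivariate group coefficient: {b1[1]:+.2f}")  # near 0

# Controlling for the suppressor exposes the masked negative effect.
X2 = np.column_stack([np.ones(n), group, treatment])
b2 = np.linalg.lstsq(X2, outcome, rcond=None)[0]
print(f"group coefficient, controlling for treatment: {b2[1]:+.2f}")  # near -1
```

Because the omitted variable sits on a positive path that offsets a negative one, leaving it out of the equation manufactures exactly the "non-relationship" discussed above.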
Education as a Potential Suppressor Variable. Even the most
statistically sophisticated homosexual parenting study we examined
overlooks the importance of suppressor effects in explaining their
findings. In their study of families created via donor insemination,
Chan et al (1998) find a statistically significant relationship be-
tween educational level and a parent’s “sexual orientation.” Homo-
sexual parents have significantly more education than heterosexual
parents. This is true when comparing the lesbian biological mothers
with the heterosexual mothers, and when comparing the lesbian so-
cial mothers with the heterosexual fathers. Chan et al also find no
significant differences in subsequent analyses of the relationship be-
tween parent’s “sexual orientation” and the child’s behavior, when
controlling for relationships within the family and the parent’s men-
tal health. Parental education, however, is a potential suppressor
variable because many child outcomes are positively related to paren-
tal education. If lesbian and heterosexual mothers produce similar
child outcomes to their heterosexual counterparts even where the
lesbians are better educated, it may well be the case that when educa-
tional levels are properly equated that heterosexual mothers produce
more favorable outcomes than do their homosexual counterparts. As
far as we can tell, however, education is not entered into subsequent
equations, and therefore remains an unaddressed potential suppressor.
In this instance, what is needed is not simply more research on
homosexual parents and their children. Instead, we need much bet-
ter analysis of existing data. When the researcher is hoping to show
validly that no effect is present, this analysis should reflect at least a
minimal grasp of the role suppressors might play as extraneous variables.
Method Four: Use Matching
In the non-experimental context, such as the studies examined
here, there are a number of ways of controlling potential effects of
extraneous variables. Today, the most widely used method is multi-
variate statistical analysis.
Typically, the investigator draws a ran-
dom sample of respondents, and then statistically controls for the
effects of extraneous variables.
But there is an important alterna-
tive: matching by various methods.
Matching emerges out of the experimental tradition. Classic ex-
perimental design consists of an experimental group and one or
more control groups. Subjects are randomly assigned to groups.
The experimental group receives the treatment, while the control
groups do not. The groups vary only in terms of receiving or not
receiving treatment, and any differences found between the groups are
likely to be due to the experimental treatment.
For obvious ethical
and legal reasons, these experiments cannot be used to investigate
certain research topics. A researcher cannot randomly assign babies
to different family structures, nor can he or she intentionally admin-
ister cancer-causing substances to subjects in an experiment. To get
around these and other limitations,
while still properly comparing
study groups and control groups, social scientists have developed
quasi-experimental designs for selecting respondents (e.g., Campbell
and Stanley, 1966 and Cook and Campbell, 1979).
Matching is often used when the subjects to be investigated are a
rare or hard-to-reach population. If researchers were looking for
adult children raised by gay fathers, common forms of random sam-
pling would yield huge numbers of people that would not fit the
criteria. To avoid such a waste of time and money, researchers often
resort to one of two methods of matching: pair matching or group
matching. Of the 49 studies examined here, 23 used some form of matching.
For reasons that we explore below, group matching used alone is
unwise, and it inhibits the complex analysis necessary to fully ex-
plore and understand the data.
Where a random sample is not pos-
sible, pair matching should be the alternative method used.
Pair matching involves matching pairs of respondents based on
variables believed to affect the outcome variable. This is easy to visu-
alize in a two-group situation. One person in a matched pair is as-
signed to the study group, the other to a control group.
Consider this example, taken from sociologists Peter Rossi and
Howard E. Freeman (1994), of how one might study the effects of
school vouchers. The investigator first chooses particular children as
voucher recipients. Then the investigator draws another child from a
pool of students not receiving the voucher, to be paired with the
voucher recipient. The paired students would be as identical to each
other as possible, in terms of age, sex, number of siblings and
father’s occupation.
This is pair matching.
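A minimal sketch of how such pair matching might be carried out in code; the matching variables and records are hypothetical, echoing the voucher example:

```python
# Toy pair-matcher: for each study-group member, claim the first unused
# control whose matching variables agree exactly; unmatched cases drop out.
def pair_match(study, pool, keys):
    pairs, used = [], set()
    for s in study:
        for i, c in enumerate(pool):
            if i not in used and all(s[k] == c[k] for k in keys):
                pairs.append((s, c))
                used.add(i)
                break
    return pairs

voucher = [{"age": 10, "sex": "F", "siblings": 1},
           {"age": 9,  "sex": "M", "siblings": 0}]
no_voucher = [{"age": 9,  "sex": "M", "siblings": 0},
              {"age": 10, "sex": "F", "siblings": 1},
              {"age": 10, "sex": "M", "siblings": 2}]

pairs = pair_match(voucher, no_voucher, keys=("age", "sex", "siblings"))
print(len(pairs))  # 2 matched pairs; the unmatched pool member is dropped
```

The depletion problem discussed next falls out of the same logic: every study-group member without an exact counterpart in the pool simply disappears from the analysis.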
These techniques are used in a number of studies we examined.
Bigner and Jacobsen (1989a) pair-matched their samples to obtain
precise controls, but when they did this, they depleted much of their
comparison group. They had 33 homosexual males in their study
group, and created the comparison group from a sample of 1,700
fathers, presumed to be heterosexual, drawn from a larger research
project on the social competency of children.
From this sample
they computer-matched the heterosexual respondents with members
of the study group according to age, marital status, income,
ethnicity and education. This left a final sample of only 33 compa-
rable respondents.
A study by Green et al (1986) appears to have pair matched their
samples. They drew their control group from an initial pool of 900
single heterosexual mothers and matched them to the lesbian moth-
ers’ age (± 5 years), race, children’s sex and age (± 12 months),
length of time separated from child’s father, mother’s current mari-
tal status, current family income, mother’s education, and length of
absence of an adult male in the household.
This left 50 lesbian
mothers and 40 single heterosexual mothers.
Pair matching is superior to group matching, as we will discuss
below. As the above examples suggest, pair matching can result in a
dramatic loss of cases when comparable individuals cannot be found,
since cases that are not matched are often dropped from the analysis.
Group matching is sometimes referred to as “block” or “aggre-
gate” matching.
Group matching creates study and comparison
groups with the same proportion of men and women, or rural and
urban inhabitants, depending on the study.
How to Do it Well: Group Matching by Bell, Weinberg,
and Hammersmith.
The classic study of homosexuals by Bell, Weinberg, and
Hammersmith illustrates the difficulties of obtaining samples with
sufficient controls if one relies on group matching. Considerable
effort was first spent expanding the pool of potential homosexual
respondents. “Individuals who offered to be interviewed were placed
in a ‘recruitment pool’ and assigned to various categories, or cells,
on the basis of their recruitment source, age, gender, educational
level and race.”
The team started with a recruitment pool of 3,438
Caucasian homosexual males, 675 Caucasian homosexual females,
316 African-American homosexual males, and 110 African-American
homosexual females. They ended up with final homosexual samples
of 575 Caucasian homosexual males, 229 Caucasian homosexual fe-
males, 111 African-American homosexual males, and 64 African-
American homosexual females.
To obtain heterosexual comparison groups, the investigators
used probability sampling with quotas, a procedure that was devel-
oped by the National Opinion Research Center at the University of
Chicago. First the team used random sampling to select census
tracts. Then they randomly selected city blocks within the chosen
tracts. Quota sampling was then used for each block. This meant
that the interviewer determined if a person satisfied the requirements
to be included in the control groups, by seeing if the heterosexual
person fell into one of 48 needed categories based on age, gender,
education level and race.
The interviewers continued to find respondents until they
reached the pre-set number of cases for a particular category. For ex-
ample, since 25 percent of the Caucasian homosexual male sample
received a high school diploma or less education, the interviewers
needed to obtain a final sample of Caucasian heterosexual males
where 25 percent also obtained the same level of education (which
they obtained). The care with which they obtained their block-
matched samples is seen in Table 1. The percentages of study and
comparison groups are matched in terms of the levels of education.
Table 1.
Study and Comparison Groups in Bell, Weinberg, and
Hammersmith’s Study of Homosexuals.
[The percentages in Table 1 were lost in conversion; its rows are
levels of education (high school or less, some college, college degree
or more) and its columns are white and black, male and female,
homosexual and heterosexual respondents.]
Bell, Weinberg, and Hammersmith were able to obtain almost the
same distribution between Caucasian homosexual and heterosexual
groups. For example, 25 percent of Caucasian homosexual males
have high school degrees or less. They were able to obtain a sample
of Caucasian heterosexual males where 25 percent also had high school
degrees or less. In their final sample of Caucasian homosexual men,
42 percent had a college degree or better. They managed to obtain a
quota sample of Caucasian heterosexual males where 42 percent also
have college degrees or more. Likewise,
23 percent of Caucasian homosexual females have a high school de-
gree or less. They obtained a sample of Caucasian heterosexual fe-
males where 25 percent have a high school degree or less, and so on.
Precision, however, is sacrificed as the subgroups become more
difficult to obtain. Even with their systematic quota sampling, the
investigators were unable to obtain the same percentages of African-
American heterosexual males and females for the three levels of edu-
cation compared to African-American homosexual males and females.
The quality of aggregate matching in controlling for extraneous
effects is as good as the precision of the match. Less than precise
matching introduces biases into the design. This increases the prob-
ability that the results will be non-significant. A process by which the
investigator seeks to obtain roughly equal averages between groups
(mean age of study group and mean age of control groups) is not
sufficient. The distribution of cases across categories must be as iden-
tical as possible.
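The point about averages versus distributions is easy to verify with made-up ages: two groups can have identical means while their distributions fail to match at all:

```python
from collections import Counter
from statistics import mean

# Illustrative ages only: the group means agree exactly...
study_ages = [30, 30, 35, 40, 40]
control_ages = [20, 20, 35, 50, 50]
print(mean(study_ages), mean(control_ages))  # 35 35

# ...but binning the cases shows two very different distributions.
def age_bins(ages):
    labels = ["under 25" if a < 25 else "25-44" if a < 45 else "45+"
              for a in ages]
    return Counter(labels)

print(age_bins(study_ages))    # every case falls in the middle bin
print(age_bins(control_ages))  # cases pile up at the extremes
```

Matching on the mean alone would certify these two groups as comparable; matching on the category-by-category distribution, as Bell, Weinberg, and Hammersmith did, would not.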
How The Parenting Studies Did It: Not So Well.
Twenty-two studies relied on group matching as one means of
controlling the potential effects of extraneous variables.
Of these,
Koepke et al (1992) compares lesbian couples with and without
children, while Riddle and Arguelles (1989) and Turner et al
(1990) compare gay and lesbian parents without any heterosexual
control group.
We will examine a few studies that used group matching to high-
light problems associated with the technique. Studies that solely rely
on group matching without subsequent statistical controls are highly
inferior to those that use both techniques.
The study by Brewaeys et al appears to use aggregate matching on
extraneous variables. Their study group consists of 30 lesbian mother
families, each with a child between 4 and 8 years old conceived by
donor insemination (DI). Brewaeys et al had two control groups.
One comparison group was made up of all heterosexual DI families
with a child born between 1986 and 1990. The investigators made
no attempts to match for extraneous variables for the heterosexual DI
group, thereby seriously compromising the design of their study.
The other comparison group was of heterosexual naturally conceived
families, “matched as closely as possible with respect to the age of the
biological mother, age of the child, family size and birth seniority”
(oldest child was in the study). The exact mechanism of matching is
unclear. We assume they mean that if the age of the mother and of
the child and the size of the natural family fell within the range of
the lesbian family, the family was included in the study. We do not
know if the distributions are the same.
Flaks et al (1995) also group-matched respondents on several ex-
traneous variables. Flaks et al studied 15 lesbian families and 15 het-
erosexual families. Respondents were included if they had one child
between the ages of 3 and 10. In addition, families were matched
“on the variables of sex, age and birth order of the children as well as
on race, educational level and income of the parents.” Although
they do not explicitly indicate their method, we assume Flaks et al
mean “group matched,” because the statistical procedures Flaks et al
use would be wrong had they pair-matched.
Flaks et al present some data regarding these extraneous vari-
ables. The “matching” is extremely imprecise, since one is supposed
to have roughly the same proportion between study and control
groups in each category for the extraneous variable. For example,
Flaks et al report the mean ages of parents. The lesbian biological
mother and the lesbian “social” mother are on average 2.2 years and
3.6 years older than the heterosexual father and the heterosexual
mother respectively. Flaks et al do not report the age distribution
for the four subgroups (lesbian biological mother, lesbian social
mother, heterosexual father, heterosexual mother).
The imprecision is even more evident if the distributions of these
variables are examined, and if data are presented as percentages, not
as raw numbers. In terms of education, Flaks et al report raw num-
bers, which gives the appearance of being roughly the same, because
the sub-samples are so small. We have converted them to percentages
to show the imprecise nature of their group matching process.
Table 2.
Education level, Flaks et al’s sample (15 adults in each subgroup,
converted by us from the reported raw frequencies to percentages).
[The percentages in Table 2 were lost in conversion; its columns are
Bio. Mom, Social Mom, Het. Mom, and Het. Father, and its rows are
High School, Grad School, and Grad Degree.]
The distribution of educational attainment among the four types
of parents does not closely match. This level of “matching” means
that substantial bias is introduced into the study, increasing the
probability of finding non-significant results.
Even less precision is found when we examine employment be-
tween the homosexual and heterosexual parents in Table 3 (15
adults per subgroup). The distribution of employment regarding
heterosexual and lesbian biological mothers is not a close match ei-
ther. None of the lesbian biological mothers are stay-at-home moth-
ers, compared to 27 percent of the heterosexual mothers.
Table 3.
Type of Employment in Flaks et al (1995), converted
into percentages.
[The percentages in Table 3 were lost in conversion; its columns are
Bio. Lesbian Mother, Social Lesbian Mother, Heterosexual Mother,
and Heterosexual Father, and its rows are types of employment.]
“Matching” for individual income also illustrates other problems
with group matching (see Table 4). The precision of matching and
the distribution of respondents across categories is also a function of
how many categories the study has. Flaks et al introduce another
source of bias in their study that, in turn, increases the chance of
non-significant results. They divide income into only two groups
(above and below $55,000). This probably also masks major dispari-
ties in the distribution of individual income. Since 27 percent of
heterosexual mothers do not work, we can assume that a similar pro-
portion of heterosexual mothers have $0 individual income, com-
pared to none of the biological lesbian mothers and only a small
percentage of the social lesbian mothers. The distribution of indi-
vidual income for heterosexual fathers, nevertheless, appears much
higher than that of either lesbian parent.
Table 4.
Individual Income in Flaks et al (1995), converted
into percentages.
[The percentages in Table 4 were lost in conversion; its columns are
the same four parent subgroups as in Table 3, and its rows are the
two income categories.]
In short, the matching of respondents in Flaks et al is noticeably less
precise than that used by Bell et al. Less precision regarding match-
ing of extraneous variables naturally and strongly increases the likeli-
hood of finding non-significant results.
The degree of control used in the parenting studies reviewed here
is primitive. Sociologists and noted evaluation researchers Rossi and
Freeman report “[m]atching has been supplanted to a considerable
extent by the use of statistical controls.”
Biostatistician Joseph
Fleiss, in a leading textbook on epidemiological statistics and re-
search design, recommends that “[m]atching should . . . be on a
small number of characteristics (rarely more than four and preferably
no more than two), with each defined by a small number of catego-
ries. . . If the investigator insists on controlling for biasing factors si-
multaneously, multivariate [statistical] methods . . . have to be used.”
Method Five: Supplement Matching with Statistical Tests
Because matching subjects for rare populations is so difficult and
so limiting, investigators should include additional statistical tests.
At a minimum, differences between groups on various extraneous
variables should be controlled not only by matching but also by
multivariate statistical analysis.
Eight of the studies that use matching fail entirely in this regard.
They relied solely on group matching, without using supplementary
statistical analysis to check for extraneous variable effects. That is, at
a minimum, no attempt was made to see if there was a statistically
significant difference between the homosexual and heterosexual
groups on the variables.
The other 15 studies that used matching fare somewhat better in
terms of at least checking for differences. They used some form of
statistical check on their group matching to see if the extraneous
variables were significantly related to parent’s sexual orientation.
The study by Green et al (1986) is the best in terms of choice of
method for controlling for extraneous effects. As discussed earlier,
Green et al (1986) used pair matching supplemented by statistical
analysis of extraneous variables. They chose to control for extraneous
effects via multiple regression techniques—the optimal method for
doing so.
The other 13 studies rely on t-tests or chi-square statistics to find
variables that are statistically significant, a method that is not as
good as directly entering the variables in a multiple regression equa-
tion, since it makes it impossible to pick up interaction effects be-
tween the independent and extraneous variable.
Despite using statistical tests to check for significant differences
regarding extraneous variables, a number of the important studies
failed to take the next step, which is absolutely critical to avoid in-
valid results from the data analysis. If the investigator should find
statistically significant differences on these extraneous variables, the
proper procedure should be to then enter these extraneous variables
into subsequent prediction equations. Finding statistically signifi-
cant differences but not entering them in subsequent statistical
analysis invalidates the later analyses.
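The required two-step procedure can be sketched on simulated data (the variable names and numbers below are ours, not taken from any of the studies): first test whether an extraneous variable differs across groups, and if it does, carry it into the outcome equation rather than setting it aside:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Simulated data: the study group is better educated, and education
# (not group membership) is what drives the outcome.
group = rng.integers(0, 2, n)
education = 12 + 2 * group + rng.normal(0, 1, n)
outcome = 0.5 * education + rng.normal(0, 1, n)

# Step 1: a t-test-style check for a group difference on the extraneous
# variable (difference in means relative to its standard error).
g0, g1 = education[group == 0], education[group == 1]
se = (g0.var(ddof=1) / g0.size + g1.var(ddof=1) / g1.size) ** 0.5
differs = abs(g1.mean() - g0.mean()) / se > 1.96
print("education differs by group:", differs)

# Step 2: any variable that differs must enter the prediction equation
# alongside the independent variable.
cols = [np.ones(n), group] + ([education] if differs else [])
coefs = np.linalg.lstsq(np.column_stack(cols), outcome, rcond=None)[0]
print(f"group effect with proper controls: {coefs[1]:+.2f}")  # near 0
```

Skipping step 2 would leave the group coefficient confounded with education (in this simulation it would come out near +1.0, a pure artifact of the education gap).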
Despite finding significant differences on several extraneous vari-
ables (e.g., educational levels between lesbian and heterosexual par-
ents, annual household income between couples versus singles),
Chan et al (1998), Miller et al (1982), and Golombok and her
various colleagues (Golombok and Tasker, 1996; Golombok et al,
1983; Tasker and Golombok, 1995; 1997) failed to enter these
variables into subsequent statistical analyses. This makes their later
analyses and conclusions invalid. Hoeffer (1981) finds no signifi-
cant differences using a t-test on marital status, educational level,
and occupation, but she finds a significant difference in support for
feminism (lesbians being more supportive of feminism). Hoeffer,
however, fails to enter feminism as the extraneous variable in the
subsequent analysis, thus invalidating the overall project.
In general, the overwhelming bulk of the studies either failed to
control for these most basic demographic variables or controlled for
them improperly. We will now discuss some more specific examples,
this time looking at how the studies treat specific variables.
Putting It All Together: Variables, Matches, and Statistical Controls
Rossi and Freeman (1995) provide a useful list of a priori charac-
teristics for which investigators frequently match cases and groups.
If the investigator is looking at individuals, common extraneous
variables include: age, sex, education, socio-economic status (in-
come, wealth), occupation (prestige), ethnicity (race, language
groups, religion), intellectual functioning (cognitive ability, knowl-
edge), and labor force participation. When controlling for house-
hold characteristics, the investigator should look at life-cycle stage,
number of household members, number of children, housing ar-
rangements, socio-economic status of members, and ethnicity of
members.
We examined the following extraneous variables to see how many
studies either group matched, pair matched, and/or statistically
controlled for them. These potential extraneous variables are: gender
of the child; the educational level of the parent; the occupation,
income, socio-economic status, or social class of the parent; the
partnership status of the parent (living alone, living with a partner);
and the age of the parent.
Child’s Gender. The child’s gender is the extraneous variable
most frequently controlled. Twenty-one studies report data for girls
and boys separately, or directly control for the child’s gender.
The child’s gender is an important extraneous variable, but con-
trolling for the child’s gender highlights the single biggest problem
with all these studies: The samples are too small for the statistical
tests used. A credible study of homosexual parenting must have, at
minimum, a heterosexual control group. When the investigator con-
trols for the gender of the child, the sample is further divided, to get
four sub-samples. Unless the investigators obtain sufficiently large
sub-samples, the investigators will probably obtain non-significant
results, even if the null hypothesis is in reality false.
For example, Flaks et al (1995) compare boys and girls con-
ceived through donor insemination and raised by lesbian versus het-
erosexual couples. The investigators have eight girls and seven boys
in each group. Given Flaks et al’s two independent variables (sexual
preference of mother and child’s gender), the probability of finding
statistically non-significant differences is extremely high.
Since all these studies have small samples (we will discuss this is-
sue later), the dilemma of introducing extraneous variables, as illus-
trated by using child’s gender, is this: As one introduces more
extraneous controls, one increases the probability of finding non-
significant results because the samples are small (and because the sta-
tistical procedures use up available degrees of freedom). The
investigator may arrive at non-significant results as an artifact of
small samples; the non-significant results may falsely mask the real
relationship. There is no way around this problem except to increase
sample size.
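A rough normal-approximation power calculation makes the dilemma concrete. Assuming, for illustration, a medium effect size (d = 0.5) and a two-sided test at the .05 level:

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n_per_group, d, alpha=0.05):
    """Normal-approximation power of a two-sample test for effect size d."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)   # expected value of the test statistic
    return 1 - NormalDist().cdf(z_crit - ncp)

# Cells of seven or eight children, as in Flaks et al, leave little
# chance of detecting even a genuine medium-sized difference.
print(f"power with 8 per cell:  {approx_power(8, 0.5):.2f}")   # about 0.17
print(f"power with 64 per cell: {approx_power(64, 0.5):.2f}")  # about 0.81
```

On these assumptions, a true difference would reach significance less than one time in five, so a "non-significant" result in such a design carries almost no evidential weight.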
If the investigator chooses not to control for third variables, so as
to maximize the sample size that is statistically tested, the results will
be unconvincing, precisely because these alternative variables were
not introduced. There is no way the investigator can know whether
the relationship between the independent and dependent variables is
in fact due to a third variable that accounts for both the differences
in the independent variable and differences in the dependent variable.
Education. As discussed in the sections on suppressor effects,
education is a critical variable that is conventionally incorporated in
controlling for extraneous effects. We used a fairly generous defini-
tion of “controlling for education.” It included 1) whether the in-
vestigators used education as a potentially confounding variable
statistically, and/or 2) whether the investigator screened respondents
for levels of education and then noted whether there was a difference
or not between the study and comparison groups.
Fourteen studies, while testing for differences in levels of educa-
tion, fail to enter education into subsequent statistical analysis de-
spite the fact that education is found to be statistically significant.
Note that all these studies generally find no differences between the
homosexual study group and the heterosexual control group regard-
ing the dependent variable, but find a significant difference in edu-
cation between the study and control groups. This should have led
the investigators to search for suppressor effects. That is, this situa-
tion raises the possibility that the positive relationship between the
homosexual study group and education is masking the negative rela-
tionship between sexual preference of the parent and the dependent
variable.
For example, Golombok and colleagues studied 27 lesbian
mothers and their 39 children, plus 27 single heterosexual mothers
and their 39 children, first in 1976-1977 and then again in 1992-
1993. Their analysis of the 1976 data found education level to be
significantly higher for lesbian parents compared to heterosexual
parents. Despite this significant difference, education level is not en-
tered into subsequent statistical analysis, thus making reported re-
sults invalid.
Subsequently, the adult children in the 1992-1993 study were
compared with respect to age, gender, ethnicity and education,
with no statistically significant differences found among the
adults (although the adult children from the lesbian group have
generally a higher level of education). The subsequent studies do
not test for differences among the mothers of these adult children,
despite the fact that the initial 1976 sample showed a statistically
significant difference in education (and other variables) between the
lesbian and heterosexual groups. This is a major error. It increases
the likelihood that non-significant results are due to the presence of
mother’s education level as a suppressor.
The same practice of finding significant differences in education
but not subsequently entering it into a multivariate statistical analy-
sis is found in Miller et al’s study of 34 lesbian and 47 heterosexual
mothers. Miller et al (1982) also found that lesbian mothers had
higher levels of education compared to heterosexual mothers. De-
spite the differences being statistically significant, the investigators
failed to enter education (and other statistically significant variables)
into their subsequent statistical analysis. This is another case of ig-
noring possible suppressor effects. Failure to enter education into
their statistical analysis increases the likelihood that the claims of
“no significant differences” between lesbian and heterosexual moth-
ers regarding their views of the caregiving role are not valid.
Brewaeys et al (1997) found significant differences in educa-
tional level among their groups, and did enter education level into
their subsequent statistical analysis as a proper control variable. In
their study of 30 children of donor inseminated (DI) lesbian
couples, 38 children of DI heterosexual couples, and 30 children of
naturally conceiving (NC) heterosexual couples, Brewaeys et al found
the lesbian couples to be significantly better educated than the DI
and NC heterosexual parents. They found no significant differences
in behavioral adjustment, the child-parent relationship, and in gen-
der role development. The problem with the study, however, is also
one of small samples. Smaller samples increase the likelihood of non-
significant results, even if a real relationship exists. Brewaeys et al fur-
ther add to the probability of finding non-significant results by
recruiting as comparison groups an unmatched DI heterosexual
sample but a matched NC sample.
Occupation, Socio-Economic Status, or Social Class. Occupa-
tion is often an extraneous variable, in many ways similar to educa-
tional attainment. Occupational prestige, like educational
attainment, is a source of human capital, enabling individuals to
better function in modern society. The studies failed to explain how
they defined their occupational categories. For example, Hoeffer
finds that 65 percent worked in a “white collar” occupation. What
does “white collar” mean? What occupations make up “white collar”
versus “blue collar”? These classificatory schemes should rely on con-
ventional measures for occupational stratification, but none of the
studies referring to occupation use them. This is poor technique.
Eighteen studies controlled for occupation, income, socio-eco-
nomic status, or social class as a variable.
The studies by Golombok
and colleagues relied on “social class” as the comparable British cat-
egory. In these cases involving British respondents, there is a stan-
dard measure of social class, which they used. In American cases,
investigators poorly control for this variable by claiming they found
no “class” differences between their lesbian and heterosexual groups
regarding class background, or by stating that their subjects are “middle
class” or “upper middle class,” without a proper operational defini-
tion as to what this means.
The most detailed occupational classificatory scheme is found in
Patterson’s Bay Area Family study.
She extensively classifies the oc-
cupations of the mothers—professional occupations, technical and
mechanical, business and sales, and others (e.g., artist), although the
classificatory scheme and the logic behind it is not described. Un-
fortunately, Patterson fails to have a proper heterosexual control
group with which to compare lesbian mothers, so the demographic
descriptions of the lesbian mother study group are scientifically use-
less.
In Patterson’s study, occupation is a very serious potential sup-
pressor variable. While 62 percent of the sample is classified as hav-
ing professional occupations, only 28 percent of the national adult
population is employed in the professional-managerial occupa-
tions. Yet Patterson compares test scores from her sample to na-
tional averages. Because Patterson finds no differences in mean
scores between the children of the lesbian mothers compared to na-
tional averages, one must raise the suppressor variable issue. Since
Patterson’s sample is significantly higher in occupational status as
compared to the national population, occupation is likely to be act-
ing as a major suppressor variable. Patterson cannot respond scien-
tifically to these criticisms of her studies regarding occupation (and
other variables), because the study fails from the start to use a proper
comparison group. Her study’s conclusions, as they stand, are not
valid.
Partner Status of Parent. All future homosexual studies should
follow the design used by Chan et al (1998) to control for partner
status. Studies should use two study groups and two comparison
groups: families headed by a lesbian couple, families headed by a
single lesbian mother versus families headed by a heterosexual
couple, and families headed by a single heterosexual mother. This
four-group design allows simultaneous comparisons of groups based
on the sexual identity of the parent and the number of parents in the
household. Alternatively, studies should pair match subjects on
whether they live with someone or live alone, and statistically adjust
for any remaining group differences.
The traditional two-parent family raising its own biological child
gets short shrift in these studies. Only four studies
(Brewaeys et al, 1997, Flaks, et al., 1995; Miller, 1982, and
Mucklow, 1979) compare lesbian parents to the traditional hetero-
sexual household. Miller (1982) and Mucklow (1979) fail to dis-
tinguish between lesbians living with and without partners.
Brewaeys et al (1997) create a good series of comparisons. They
compare lesbian parents and their child conceived through donor
insemination with a heterosexual couple and their child conceived
through donor insemination, and with a heterosexual couple and
their child conceived the traditional way.
Chan et al (1998), for their part, explicitly rule out making com-
parisons with the traditional family, in which a married couple is
raising their biological children.
This rules out any possible generalizations based on Chan et al’s
findings to the larger population, since they only study donor-in-
seminated families.
Chan et al argue that the child’s relationship to parents in the
traditional family creates a situation of “ownness,” and thus cannot
be compared to the other types of family structure involving donor-
inseminated children. This is a mistake. It would have been a better
research strategy to have included the traditional family as another
comparison group, as well as the single divorced mother and her tra-
ditionally conceived child.
These authors’ approach implicitly concedes that the traditional
family, a heterosexual (monogamous) husband and wife couple rais-
ing their own biological children, is the optimal form of family
structure. This apparent concession contradicts the more general
paradigm of the research studies carried out by one of the above study’s
co-authors, Charlotte Patterson, who has argued elsewhere that the
children of lesbian and gay parents “develop in a normal fashion.”
The other studies with control groups seriously compromised
their studies by mixing lesbians living alone and lesbians living with
partners in one study group, against heterosexual mothers living
alone. Another study compared a mixed group of lesbians with
married heterosexual mothers (Miller et al, 1982), and two com-
pared a mixed group of lesbian and heterosexual mothers (Lewin,
1982; Lyons, 1982).
The studies that used partnered and non-partnered parents in one
group and partnered or non-partnered parents in the other group
increased the likelihood of finding non-significant results. Controls
should have been included either at the beginning of the study,
while finding respondents, or subsequently in statistical analysis.
Age of Child. The age of the child is an extremely important
extraneous variable when the dependent variable is related to the
child’s development, as most of them in fact are. For example, it
would be developmentally appropriate to ask respondents about sexual
preference if they were adolescents but not if they were young chil-
dren. This variable should have been directly controlled via statistical
testing, thus avoiding the problem of improper (i.e., imprecise)
group matching.
Only Green et al (1986) pair matched their subjects, then statisti-
cally controlled for the age of the child through subsequent multiple
regression analysis, along with a host of other possible extraneous
variables.
Twenty studies controlled for the age of the child through group
matching and/or statistical testing.
None found significant differ-
ences in the children’s ages, and therefore none used the variable in
subsequent statistical analysis.
Flaks et al (1995) report the mean ages of the two groups of
children, without statistical comparisons. Others deal with the age
of children improperly. Kirkpatrick et al (1981) report that the
ages of children are similar, but provide no numbers. Javaid (1992)
only presents the age distribution of the children. Lewin and Lyons
report the age range of the children in the study as a whole, but
nothing more.
Patterson (1994a, 1994b, 1997) also reports the children’s mean
ages when statistically comparing scores of Eder’s sample of 60 chil-
dren (5.5 years) with her sample of 35 (6 years, 2 months). She per-
forms a t-test to see if the scores between Eder’s sample and hers
differ, but does not control for the child’s age.
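Such a t-test can be computed from published summary statistics alone, which is why Eder’s reported means and standard deviations suffice. Below is a sketch of the pooled-variance form, using invented numbers rather than Patterson’s or Eder’s actual values; note that nothing in the calculation adjusts for the children’s ages.

```python
import math

# Two-sample t statistic computed from published summary statistics
# alone (pooled-variance form). The numbers below are invented for
# illustration, not Patterson's or Eder's values.

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and degrees of freedom for a pooled two-sample t-test."""
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, n1 + n2 - 2

t, df = pooled_t(mean1=50.0, sd1=10.0, n1=35, mean2=48.0, sd2=9.0, n2=60)
print(round(t, 3), df)
```

A test of this kind compares only the two means; any covariate such as the child’s age must be entered separately, which is the omission criticized here.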
There is one other point to make regarding the treatment of the
ages of the children. Only two studies look at adult children of male
homosexuals. One, by Bailey et al, studies the adult sons of gay fa-
thers. Bailey et al found that 9 percent of the adult sons of homo-
sexuals are gay, as reported by the sons themselves or by their fathers
(when the sons would not respond to the survey). Another, by
Miller (1979), looks at the adolescent and adult children of gay fa-
thers, but like Bailey et al, Miller also fails to compare them to the
proper heterosexual control group. Nevertheless, Miller still con-
cludes, “Evidence in the children’s biographies pointed to problems
of sexual acting-out.” This included premarital pregnancies, abor-
tions, prostitution, etc. Because there is no comparison group, we
cannot tell if this acting out is a function of divorce, or an interac-
tion of divorce and having a gay father, although the findings are
somewhat suggestive.
The other group of studies looked at adult children of lesbian
versus single heterosexual mothers, after initially studying these same
children when they were much younger. Tasker and Golombok at-
tempted to locate the initial group of children of lesbian versus het-
erosexual mothers that they had studied in 1976-1977. They
re-interviewed them as young adults, on a wide series of topics in-
cluding the adult children’s sexual behavior, desires, and sexual identity.
Despite the ridiculously small samples they used, making it ex-
tremely difficult to obtain any statistically significant results, Tasker
and Golombok managed to find statistically significant differences
between the two groups. They find that adult children of lesbian
mothers were significantly more likely to think about having homo-
sexual relations than were the adult children of heterosexual moth-
ers. They also find that two women raised by lesbians were both in a
lesbian relationship and identified themselves as lesbian, while no
women raised by heterosexuals were either in a lesbian relationship
or considered themselves to be lesbian.
These findings are not properly explained by the authors and
contradict the “no-homosexual parent effects” view favored by these
authors in their writings.
What Went Wrong and What Can Be Done About It?
Here are the lessons from Step 2: Controlling for Unrelated Ef-
fects. Recall that the question is, “Do these studies use methods that
justify their asserted scientific conclusions about whether or not
sexual orientation has any impact on childrearing?” To answer this,
we have looked at the general methods that social science would require:
1) You must use a comparison group to draw valid conclu-
sions about the possible effects of something on something
else. Twenty-one of 49 studies had no control groups.
2) You must control for extraneous variables in order to
eliminate false causes because correlation need not mean cau-
sation. Twenty-three of the 49 studies have some kind of
control. But anytime you find a significant difference on
some third variable, you must enter it as a control into sub-
sequent analysis in order to obtain valid results. Of the 12
studies that found this, only one took this step.
3) You must control for suppressor variables in order to
eliminate false (but true-looking) causes because non-corre-
lation need not mean non-causation. Of the 49 studies, only
one even came close to addressing this issue, and that study
failed to even report which variables were dropped or added
in its analysis.
4) You might use some form of matching. If you cannot use
full-blown statistical analysis, you need an alternative basis
for discerning what is or is not a significant difference. The
least accurate method, group matching, was used in 23 stud-
ies (two of which also used pair matching). The best
method, pair matching, was used by only three studies.
Twenty-seven studies used no form of matching at all.
5) You must supplement matching with multivariate statisti-
cal analysis, to test your matching and to deal with suppres-
sor variables. Of the 23 studies that used matching, only 15
did this.
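The pair-matching method of point 4 can be sketched as a greedy nearest-neighbor pairing on a few covariates. The subjects and covariates below (mother’s age, child’s age) are invented:

```python
# Sketch of pair matching (point 4 above): each study-group subject is
# paired with the closest remaining comparison-group subject on the
# chosen covariates. Data and covariates are invented.

study = [(32, 6), (38, 9), (29, 4)]            # (mother_age, child_age)
comparison = [(45, 12), (31, 6), (39, 8), (28, 5), (35, 10)]

def distance(a, b):
    """Euclidean distance over the covariates."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

pairs = []
available = list(comparison)
for subject in study:
    match = min(available, key=lambda c: distance(subject, c))
    available.remove(match)                     # match without replacement
    pairs.append((subject, match))

for s, c in pairs:
    print(s, "matched with", c)
```

Group matching, by contrast, equates only the two groups’ averages on each covariate, which is why it is the less accurate of the two methods.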
If a study does not use all five methods, it is badly flawed re-
search. But our finding is that only one of the 49 studies used all
five methods: Green et al (1986). Even this study failed to explain
its methods or justify its conclusions, as we discussed. In short, these
49 studies were conducted with control methods that are so inad-
equate that they cannot be relied upon for either scientific conclu-
sions or public policy reforms.
Notes to Chapter 2
1. A useful discussion of operational definitions is given in Nachmias and
Nachmias, 1996, pp. 30-32.
2. Ibid.
3. Needless to say, operationalization of variables is often far more complicated
than this.
4. For instance, in their authoritative study of American sexuality, Laumann et
al. (1994) show that the term “homosexual” can be defined by means of desire,
behavior, self-identification, or a mix of the three. Depending on the operational-
ization, the results vary. Laumann, et al. (1994), Chapter 8. Note also that this does not
deal with the question of bisexuality.
5. Sometimes these are called the “experimental” or “treatment” group and the
“control” group. Since neither homosexuality nor heterosexuality can or should
be regarded as an “experiment” or “treatment,” however, we use the term
“study group” to refer to the primary population of the study.
6. These need not be explicitly selected groups, but can result from a random
sample survey. With rare populations, however, selection of members of the rare
group at a disproportionate rate is highly desirable. The issue is discussed at great
length in Chapter 4.
7. Or confounding variables.
8. The studies lacking a heterosexual comparison group are: Bailey et al, 1995;
Barret and Robinson, 1990; Bozett, 1980; Crosbie-Burnett and Helmsbrechty,
1993; Gartrell et al, 1996; Green, 1978; Hare, 1994; Koepke et al, 1992;
Lewis, 1980; Lott-Whitehead and Tully, 1992; McCandlish, 1987; Miller,
1979; O’Connell, 1993; Pennington, 1987; Rand et al, 1982; Riddle and
Arguelles, 1989; Ross, 1988; Turner et al, 1990; Weeks et al, 1975; West and
Turner, 1995; and Wyers, 1987. Bailey et al (1995) compare adult sons of gay
fathers with gay monozygotic and gay dizygotic twins. Koepke et al (1992)
compare lesbian mothers with childless lesbians. Riddle and Arguelles, 1989,
Turner et al, 1990, West and Turner, 1995, and Wyers, 1987 compare gay ver-
sus lesbian parents.
9. p. 696.
10. Ibid.
11. p. 548.
12. p. 547.
13. Ibid.
14. Ibid.
15. Patterson, 1994a, 1994b, “Children of the lesbian baby boom: Behavioral
adjustment, self-concepts, and sex-role identity,” in B. Greene and G.M. Herek
(eds.), Lesbian and gay psychology: Theory, research, and clinical applications,
pp. 156-275; Patterson, 1997, “Lesbian mothers and their children: Findings
from the Bay Area Families Study,” in J. Laird and R.J. Green (eds.), Lesbians
and gays in couples and families: A handbook for therapists (pp. 420-436). New
York: Jossey-Bass.
16. Patterson, 1994a, p. 161. This statistical comparison, using a t-test, between
Patterson’s sample and Eder’s original 60 child participants could be done be-
cause Eder provides the means and standard deviations of his group of 60 chil-
dren (Patterson, 1994a, p. 161).
17. The point is conceded by Chan et al. (1998) who use a comparison group
of heterosexual donor-inseminated parents in their study. This study is discussed below.
18. These studies are: Bigner and Jacobsen, 1989a, 1989b, 1992; Brewaeys et
al, 1997; Cameron and Cameron, 1996; Chan et al, 1998; Flaks et al, 1995;
Golombok and Tasker, 1996; Golombok et al, 1983; Green, 1982; Green et al,
1986; Harris and Turner, 1985; Hoeffer, 1981; Huggins, 1989; Javaid, 1992;
Kirkpatrick et al, 1981; Kweskin and Cook, 1982; Lewin and Lyons, 1982;
Lyons, 1983; McNeill et al, 1998; Miller et al, 1982; Mucklow and Phelan,
1979; Pagelow, 1980; Patterson, 1994a, 1994b, 1997 (with the limitations dis-
cussed above); and Tasker and Golombok, 1995, 1997.
19. Cameron and Cameron, 1996. This study has other problems, however,
which we note below.
20. Where feasible, a random sample taken from a population is the superior
method of obtaining a comparison group. This is also discussed further in the
discussion below.
21. In this case, the two groups are equated by randomly assigning individuals
to the experimental group or the control group. This is the best available re-
search design, because any statistically significant differences between the two
groups can be plausibly attributed to the experimental manipulation (e.g.
Campbell and Stanley, 1966).
22. Stated slightly differently, controlling for extraneous variables reduces the
probability that the relationship between the independent variable and the de-
pendent variable is spurious. Spurious relationships are those that are not true
causal relationships, but are due to the presence of a third variable. The exist-
ence of spurious relationships, caused by extraneous variables, account for the
substantial degree of truth in the old adage “correlation is not causation.”
23. This tells us nothing, of course, about whether the relevant variables were
controlled, or whether the controls were executed properly. These studies are:
Bigner and Jacobsen, 1989a, 1989b, 1992; Brewaeys et al, 1997; Chan et al,
1998; Flaks et al, 1995; Golombok and Tasker, 1996; Golombok et al, 1983;
Green, 1982; Green et al, 1986; Harris and Turner, 1985; Hoeffer, 1981;
Huggins, 1989; Javaid, 1992; Kirkpatrick et al, 1981; Koepke et al, 1992;
Kweskin and Cook, 1982; Lewin and Lyons, 1982; Lyons, 1983; Miller et al,
1982; Pagelow, 1980; Riddle and Arguelles, 1989, Tasker and Golombok,
1995, 1997; and Turner et al, 1990. Koepke, Riddle and Turner lack hetero-
sexual control groups.
24. Rossi and Freeman, 1994, p. 311, provide a standard list of extraneous vari-
ables that should be considered in social science studies; it is discussed below.
25. This also includes subsequent studies by Golombok and Tasker 1996 and
Tasker and Golombok, 1995, 1997. As we shall see next, this applies to non-sig-
nificant as well as significant results.
26. E.g., Hirschi and Selvin, 1973, Rosenberg, 1968, Davis, 1985.
27. Conversely, if one aims to avoid finding real causes, skipping this method
will be helpful. We cannot say what the motives are of the researchers whose
studies we are examining, but they have used (or failed to use) methods in a
way that makes finding real causes very unlikely, and in some cases impossible.
28. The variables are child’s age, child’s age at separation from his or her father,
the child’s age when last adult male was living in house, mother’s education,
mother’s feminist activism, whether mother was lesbian or heterosexual, and
mother’s lesbian political activism.
29. Multiple regression is a very common statistical technique for estimating the
effects of a set of independent variables on a dependent variable. It allows ex-
amination of the effect of each independent variable while controlling for the
effects of all the others.
30. p. 171. Also included were variables closely related to another variable (p.
171). For example, lesbians are more likely to be feminist and lesbian activists
than are heterosexuals. This problem of multicollinearity, of which Green et al.
are unaware, substantially complicates the interpretation of their results.
31. A measure derived from the multiple regression equation, which indicates
how well the dependent variable is predicted by the full set of independent
variables included in the equation.
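Notes 29 and 31 can be made concrete with a minimal hand-rolled sketch: ordinary least squares with two predictors estimates each variable’s effect while holding the other fixed, and R-squared summarizes how well the full set predicts the outcome. The data are invented and constructed so the true slopes are 2 and 0.5.

```python
# Minimal two-predictor multiple regression (ordinary least squares)
# with R-squared, on invented data constructed so the true slopes
# are exactly 2 and 0.5.

def ols_two_predictors(x1, x2, y):
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det   # effect of x1, holding x2 fixed
    b2 = (s11 * s2y - s12 * s1y) / det   # effect of x2, holding x1 fixed
    a = my - b1 * m1 - b2 * m2
    fitted = [a + b1 * u + b2 * v for u, v in zip(x1, x2)]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum(d * d for d in dy)
    r2 = 1 - ss_res / ss_tot             # share of variance explained
    return b1, b2, r2

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 * u + 0.5 * v for u, v in zip(x1, x2)]
print(ols_two_predictors(x1, x2, y))
```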
32. See e.g., Chan et al, 1998, p. 452.
33. Those few findings that are reported are themselves quite interesting. They
report a very large R-squared value of 0.21 (which is equal to a correlation of
0.46) in a regression equation predicting that the longer boys lived without a
man in the house the more likely they would be to mention a woman as a per-
son they would like to be when grown (pp. 178-9). This statistically significant
result (p<.05 level) is strong enough to count as what psychometrician Jacob
Cohen (1988) calls a large effect. This concept is discussed below.
34. There is no a priori list of potential suppressors. Suppressors are a function of
the research problem under investigation. We select these two because they are
relevant to the studies we discuss.
35. Golombok et al, 1983, p. 556.
36. See below in the discussion of individual studies and in note 83.
37. Other statistical problems with Chan et al’s study are discussed in the sum-
maries of individual studies.
38. This often uses multiple regression mentioned above, or its close relative,
the analysis of variance.
39. Another method of controlling for extraneous variables is test factor stan-
dardization, a form of reweighting which is used in demographic analysis (e.g.,
Davis, 1985) but not in any studies discussed here. We discuss sampling and sta-
tistical tests in greater detail in Chapters 4 and 5.
40. E.g., Campbell and Cook, 1979.
41. Experiments often have the problem of extrapolating their findings to the
real world.
42. Bigner and Jacobsen (1989a), Bigner and Jacobsen (1989b), and Green et al
(1986) use some form of pair matching. The balance (there is some overlap) use
group matching. See the list below at note 51.
43. We discuss group matching at length on pp. 16-24. The reliance on group
matching increases the propensity of finding non-significant results. This group-
level matching control for extraneous effects is extremely imprecise and an inad-
equate substitute for statistically controlling for extraneous variables.
44. Id., p. 305.
45. Id., p. 167.
46. We infer this from their subsequent use of McNemar’s chi-square test,
which is used for paired cases. Green et al, p. 170.
47. Id., p. 169.
48. Rossi and Freeman (1994), p. 305.
49. See Bell and Weinberg, pp. 29-40.
50. p. 33; for the actual numbers in the recruitment pools, see Table 2-1,
Appendix C.
51. Categories included, among others: white heterosexual male, with high
school or less, and 25 years old or less; white heterosexual male, with high
school or less, and 26 to 35 years old; white heterosexual male, with high
school or less, and 36 to 45 years old; white heterosexual male, with high
school or less, and 46 years old or more; black heterosexual male, with high
school or less, and 25 years old or less; black heterosexual male, with high
school or less, and 26 to 35 years old, and so on. Id.
52. Some percentages in the table may not add up to 100 percent because of rounding.
53. Despite their quota sampling regarding the comparison groups, Bell et al still
relied on subsequent statistical analyses to further control for extraneous
variable effects.
54. These studies are: Bigner and Jacobsen, 1992; Brewaeys et al, 1997; Chan
et al, 1998; Flaks et al, 1995; Golombok and Tasker, 1996; Golombok et al,
1983; Green, 1982; Green et al, 1986; Harris and Turner, 1985; Hoeffer,
1981; Huggins, 1989; Javaid, 1992; Kirkpatrick et al, 1981; Koepke et al, 1992;
Kweskin and Cook, 1982; Lewin and Lyons, 1982; Lyons, 1983; Miller et al,
1982; Pagelow, 1980; Riddle and Arguelles, 1989; Tasker and Golombok,
1995, 1997; Turner et al, 1990.
55. The investigators report no significant differences between the study and
control groups on mother’s age, child’s age, and the number of children, but
found significant differences in education levels and gender distribution of chil-
dren. Brewaeys et al chose to report data separately for girls and boys rather than
also control for the latter statistically. They did not include education in subse-
quent analyses when controlling for child’s gender. Conversely, they did not in-
clude child’s gender when controlling for education level of the parent. Both
gender of child and education should have been entered into the analysis simul-
taneously as statistical controls, although this increases the probability of finding
non-significant results as a function of sub-samples that are extremely small.
56. p. 107.
57. See Green et al, 1986, p. 170, for the correct statistical procedure if pair-
matching is used.
58. p. 107.
59. Percentages do not add up to 100 percent due to rounding.
60. 1994, p. 303.
61. 1981, p. 134.
62. Bigner and Jacobsen, 1992; Flaks et al, 1995; Javaid, 1992; Kirkpatrick et al,
1981; Lewin and Lyons, 1982; Lyons, 1983; Miller et al, 1982; Pagelow, 1980.
63. Brewaeys et al, 1997; Chan et al, 1998; Golombok and Tasker, 1996;
Golombok et al, 1983; Green, 1982; Green et al, 1986; Harris and Turner,
1985; Hoeffer, 1981; Huggins, 1989; Koepke et al, 1992; Kweskin and Cook,
1982; Riddle and Arguelles, 1989; Tasker and Golombok, 1995, 1997; and
Turner et al, 1990.
64. Brewaeys et al, 1997; Chan et al, 1998; Golombok and Tasker, 1996;
Golombok et al, 1983; Green, 1982; Green et al, 1986; Harris and Turner,
1985; Hoeffer, 1981; Huggins, 1989; Koepke et al, 1992; Kweskin and Cook,
1982; Riddle and Arguelles, 1989, Tasker and Golombok, 1995, 1997; and
Turner et al, 1990.
65. More generally, it is a useful omnibus checklist of extraneous variables.
66. Rossi and Freeman (1994), p. 311.
67. These studies are: Brewaeys et al, 1997; Chan et al, 1998; Flaks et al, 1995;
Golombok and Tasker, 1996; Golombok et al, 1983; Green, 1982; Green et al,
1986; Harris and Turner, 1985; Hoeffer, 1981; Huggins, 1989; Javaid, 1992;
Kirkpatrick et al, 1981; Koepke et al, 1992; Kweskin and Cook, 1982; Lewin
and Lyons, 1982; Lyons, 1983; Pagelow, 1980; Riddle and Arguelles, 1989;
Tasker and Golombok, 1995, 1997; and Turner et al, 1990.
68. We discuss the probabilities of obtaining non-significant results in Ch. 5 (the
logic of statistical testing).
69. The following studies report the educational level of the parents: Bigner
and Jacobsen, 1989a, 1989b; Brewaeys et al, 1997; Chan et al, 1998; Flaks et
al, 1995; Golombok and Tasker, 1996; Golombok et al, 1983; Green, 1982;
Green et al, 1986; Hoeffer, 1981; Javaid, 1992; Kirkpatrick et al, 1981; Koepke
et al, 1992; Kweskin and Cook, 1982; Lewin and Lyons, 1982; Miller et al,
1982; Riddle and Arguelles, 1989, Tasker and Golombok, 1995, 1997; and
Turner et al, 1990.
70. The studies are Chan et al, 1998; Golombok and Tasker, 1996; Golombok
et al, 1983; Miller et al, 1982; and Tasker and Golombok, 1995; 1997. Hoeffer
(1981) and Kweskin and Cook (1982) compare lesbian and heterosexual
mothers with respect to education level but find no statistically significant differ-
ences. Turner et al (1990) and Koepke et al (1992) compare gay and lesbian
parents but find no statistically significant differences regarding education. Flaks
et al (1995), Javaid (1992), Kirkpatrick (1981), and Lewin and Lyons (1982)
only block match for education.
71. See Golombok et al, 1983, Golombok and Tasker, 1996, Tasker and
Golombok, 1995, 1997 for versions of the study.
72. See Golombok et al, 1983.
73. Golombok and Tasker; 1996; and Tasker and Golombok, 1995, 1997.
74. See the earlier footnote for a discussion of the findings on p. 107.
75. Standard measures of occupational prestige include O. D. Duncan’s Socio-
economic Index, Siegel’s (NORC) Prestige Scores, and the Nam-Powers (U.S.
Census) Score. See pp. 327-365 in Delbert C. Miller, Handbook of Research
Design and Social Measurement (1991) for a discussion of nine measures of so-
cioeconomic status involving occupation.
76. These studies are: Bigner and Jacobsen, 1989a, 1989b; Chan et al, 1998;
Flaks et al, 1995; Golombok and Tasker, 1996; Golombok et al, 1983; Green,
1982; Green et al, 1986; Hoeffer, 1981; Javaid, 1992; Kirkpatrick et al, 1981;
Koepke et al, 1992; Kweskin and Cook, 1982; Lewin and Lyons, 1982; Riddle
and Arguelles, 1989, Tasker and Golombok, 1995, 1997; Turner et al, 1990.
77. E.g., 1994a.
78. p. 250.
79. Statistical Abstract, 1996, p. 405.
80. Patterson, 1997, p. 269.
81. These studies that mix lesbians living alone with lesbians living with partners
versus heterosexual mothers living alone are: Golombok and Tasker, 1996,
Golombok et al 1983; Green, 1982; Green et al., 1986; Harris, 1985; Javaid,
1992; Kirkpatrick, 1981; Kweskin, 1981; McNeill et al, 1998; Patterson, 1994a,
1994b, 1997; Tasker and Golombok, 1995, 1997.
82. Bigner and Jacobsen, 1992; Brewaeys et al, 1997; Chan et al, 1998; Flaks et
al, 1995; Golombok and Tasker, 1996; Golombok et al, 1983; Green, 1982;
Green et al, 1986; Hoeffer, 1981; Huggins, 1989; Javaid, 1992; Kirkpatrick et
al, 1981; Kweskin and Cook, 1982; Lewin and Lyons, 1982; Lyons, 1983;
Pagelow, 1980; Riddle and Arguelles, 1989, Tasker and Golombok, 1995,
1997; Turner et al, 1990.
83. The investigators fail to compare adult sons of homosexual fathers to adult
sons of heterosexual fathers.
84. This study is the only one we looked at that used longitudinal design.
85. Tasker and Golombok, 1995, pp. 210-211. They also find that 36 percent
of the adult children of lesbians but only 25 percent of the adult children of
single heterosexual mothers experienced attraction to someone of the same
gender. While this was not statistically significant, it is quite suggestive in light of
the other findings and the extremely small sample sizes (Tasker and Golombok,
1995, pp. 210-211). In fact, a careful examination of Tasker and Golombok’s
(1997) data in Table 6.1, p. 107, suggests a very strong relationship between
the sexual orientation of the mother and the child. Two of the four results are
correctly presented as statistically significant, one is incorrectly presented as statis-
tically insignificant, and the fourth is statistically significant (25 percent differ-
ence between the groups) if a one sided test is used, despite the fact that the
number of cases in each test is only 45.
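The one-sided versus two-sided point can be sketched with a two-proportion z-test on invented counts (not Tasker and Golombok’s data): with groups this small, the same difference can clear the .05 bar one-sided while missing it two-sided.

```python
import math

# Two-proportion z-test sketch: with small groups, a difference can be
# significant one-sided but not two-sided. Counts are invented, not
# Tasker and Golombok's data.

def two_prop_z(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def normal_sf(z):
    """Upper-tail probability of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = two_prop_z(x1=8, n1=22, x2=3, n2=23)
p_one_sided = normal_sf(z)
p_two_sided = 2 * p_one_sided
print(round(z, 2), round(p_one_sided, 3), round(p_two_sided, 3))
```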
Chapter 3
Does it Measure Up?
Bias, Reliability and Validity
Notes for this section begin on Page 67
In Chapters 1 and 2 we have seen that a well-formulated hypoth-
esis is critical, and that the researcher must use certain methods to
control for unrelated effects that may skew a study’s results. We have
also seen that same-sex parenting studies are severely flawed in these
respects.
But what kind of measurements does a study use? This too is cru-
cial. If one assumes that when a person lies, that person will per-
spire, then one might measure the sweat. But if the amount of sweat
is a silly thing to gauge to uncover lying, then it is a poor measure.
Take, for example, astrology. Astrology claims to measure personality
traits by the alignment of the stars and planets. Is the celestial
alignment of stars and planets a silly measure of personality structure?
Regarding a study’s measures, there are three questions that need
to be answered:
1) Is the measure self-constructed?
2) Is it reliable? and finally,
3) Is it valid?
We will provide a brief description of each topic, and look at how
the parenting studies fare. The studies offer us no confidence in
their results.
What Measures Are and Why They Matter
If variables are properly measured, we can say with greater confi-
dence that the differences between the scores of two respondents are
likely due to real differences. If the variables are wrongly measured,
we have a false impression based on errors of measurement. Since no
measure is perfect, however, there will always be some error in the re-
sults. Sometimes the errors are based on the state of the respondent.
The respondent may be sick, or tired, or inattentive in some other
way. Or, the respondent may have previously answered a similar kind
of survey and not be paying much attention. Of course, the respon-
dent may not like the interviewer and give less-than-cooperative an-
swers. The last feature is very important regarding homosexual
parenting studies. The respondent may give what he or she perceives
as the socially desirable answer. Respondents have been known to
conceal their true feelings, actions and attitudes from the interviewer
when they are undesirable. Respondents have also been known to ex-
aggerate or even invent their actions and attitudes when they
believe it may put them in a favorable light.
Even if the respondent’s
state is fine, the essential problem for evaluating the quality of mea-
sures is that one cannot really tell, given a respondent’s answer, what
proportion of that answer is true. Faced with the social desirability
problem, checking for measurement errors, particularly in controver-
sial areas, should be done, but should be done indirectly. To do so,
an evaluator looks for indicators: Has the measure been used before?
Does the measure work again and again? Does it really measure the
thing it claims to measure?
In particular, one should be on the lookout for measurement er-
rors that slant the responses consistently in one direction. After all,
the researcher is supposed to eliminate, as much as possible, mea-
surement errors that produce a systematic bias. Otherwise the mea-
suring instruments will themselves increase the invalidity of the results.
Are the Measures Self-Constructed?
The key to accurate measurement is a scientific consensus that a
measure works. This means that it has been subject to repeated use,
and that the use has confirmed or revised the measure in such a way
that it can be relied upon with confidence. This is even more impor-
tant when the goal of a study is to impact public policy, not just re-
main as an interesting piece of academic research.
Therefore, while it is possible to construct one’s own measures to
use in a study, such a strategy should be a last resort.
Self-constructed measures are generally a bad idea. At the very least, they
offer no reason to trust that anything has yet been accurately and truly
measured. The burden of proof is on researchers who design their
own measures. The more self-constructed or unexplained the mea-
sure, the more dubious the study evaluator should be. For these rea-
sons, we consider all self-constructed measures to be inadequate
without direct, extensive evidence showing otherwise.
Looking at the 49 studies, we find that 23 studies appear to have
created some of the measures used in their studies. We say, “appear
to,” because they do not say. On this question, the studies are en-
tirely silent. The watchwords here should be “Trust and Verify.”
Without further information, these studies should not be trusted.
Are the Measures Reliable?
Reliability is the extent to which repeated applications of the
measure result in the same outcomes. No measuring instrument is
perfectly reliable, but some measures are better than others. Estab-
lished measures of physical variables such as height, weight, and
body temperature are less prone to reliability errors than those in the
social sciences.
For example, if you use a ruler to measure a person’s height at
four different times during a week, the ruler should give you the
same number of inches. In contrast, administering the Scholastic
Aptitude Test (SAT) to the same subject four times will produce
more varied results. The SAT is therefore a less reliable test, com-
pared to using a ruler. On the other hand, compare SAT results to an
individual’s answers to a survey question, such as “Should there be
less regulation of the economy?” asked of the same subject four
times: an individual’s SAT score is far more reliable than the survey
responses. Reliability is a matter of degree.
This means, for better or for worse, there is no standard level of
acceptability when testing for reliability, but there are three basic
methods of assessing the reliability of a measuring instrument: test-
retest, parallel forms, and split halves.
Of the three methods, experts
agree that the test-retest index is the best measure of reliability. In
other words, pick a measure already established in the area and carefully
report upon and study its reliability. Well-known evaluation researchers
and sociologists Peter Rossi and Howard Freeman’s rule of
thumb is that unless a measuring instrument yields the same results
75 to 80 percent of the time, it is not useful.
Context is important. It is one thing to pioneer an exploratory
study. It is another thing to set out to influence policymakers, in-
cluding courts. When the goal is to affect the larger society with the
findings obtained, researchers should be able to show that the mea-
sure has been widely used, in many studies, for a long period of time
with good results.
This is not the case with most of the studies under examination
here. Looking at the 49 studies, we find the following:
Twenty-three studies do not refer at all to tests for reliability.
Five studies reported on the reliability of their measures.
Fifteen studies referenced measures previously used in other studies.
Six studies reported checks for reliability.
This is not to say that the measures are unreliable. We just cannot
say that the measures are reliable. If we cannot say they are reliable,
we cannot recommend public policies.
One good example of a carefully tested set of measures is in
McNeill et al (1998). The investigators had each respondent
complete four inventories to measure family relations and parental
attitudes. These measures were 1) the Index of Family Relations,
2) the Index of Parental Attitudes, 3) the Family Awareness Scale,
and 4) the Dyadic Adjustment Scale. Each measure was developed
and tested in many previous studies over a long period of time.
McNeill et al also report “test-retest reliability of .87 or higher.”
That is, when an individual took the same test a second time after
a reasonable passage of time, the second set of scores correlated at
.87 or higher with the first.
One final point about reliability has to do with the effect of
unreliability on studies that seek to affirm the null hypothesis. Unre-
liable measures tend, as a rule, to lower the magnitude of correlations
and other statistics; this tends to make it easier to fail to reject
the null hypothesis and bias results in favor of finding “no differ-
ences between homosexual and heterosexual parents.”
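This attenuation effect is easy to demonstrate by simulation. The Python sketch below (all data invented for illustration) generates two traits with a genuine population correlation of about .5, then measures both with increasing amounts of error; the observed correlation shrinks toward zero even though the underlying relationship never changes.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation between paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
n = 2000
x_true = [random.gauss(0, 1) for _ in range(n)]
y_true = [0.5 * x + random.gauss(0, 0.866) for x in x_true]  # true r is about .5

for err_sd in (0.0, 1.0, 2.0):  # ever less reliable measurement
    x_obs = [x + random.gauss(0, err_sd) for x in x_true]
    y_obs = [y + random.gauss(0, err_sd) for y in y_true]
    print(f"measurement error sd {err_sd}: observed r = {pearson_r(x_obs, y_obs):.2f}")
```

A real difference masked this way becomes statistically invisible, which is exactly the bias toward “no difference” described above.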
Are the Measures Valid?
Validity is the other major concern regarding measurement. Being
able to replicate a measurement is essential but not sufficient. The
measurement also needs to actually measure what it purports to mea-
sure. For example, do readings on your oven thermometer truly mea-
sure the temperature of your oven? Do readings of PH levels from a
soil testing kit really measure the degree of acidity or alkalinity in
your lawn? Does an individual’s astrological birth sign really measure
personality traits?
There are two kinds of validity: construct validity and empirical
validity.
Construct validity evaluates whether the measure (the reading on
the oven thermometer) is a valid indicator of the underlying con-
struct (the temperature).
Empirical validity (also called predictive validity) evaluates to what
degree a measure correlates empirically with other independent mea-
sures of the same construct.
Here are two examples of how validity might be tested. The first
concerns tests of mathematical ability. Standardized scores on Test X
should track those on other measures of math ability. If
scores on Test X correlate better with a measure that is seemingly un-
related to math ability, such as church attendance, Test X is an in-
valid measure of mathematical ability.
Another example is the SAT. These tests are supposed to measure
the theoretical construct, “academic ability.” To a lesser extent, so
do high school grade-point-averages. SATs are highly correlated
with GPAs. The SATs in turn are highly correlated with first-year
college grades. That these two measures are highly correlated with
each other increases the validity of the SAT as a measure of academic
ability. In contrast, suppose we used another measure, such as num-
ber of high school extra-curricular activities. This measure might
have no validity regarding academic ability. As such, we would not
expect it to predict college performance as measured by a college
GPA. In other words, the SAT would have high predictive validity,
but participation in high school extra-curricular activities would
have little or no such validity.
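The contrast can be simulated directly. In the Python sketch below (all data invented for illustration), one predictor is driven by the same underlying ability as the college outcome, while the other is unrelated to it; only the first shows predictive validity.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation between paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(2)
n = 2000
ability = [random.gauss(0, 1) for _ in range(n)]
sat_score = [a + random.gauss(0, 0.5) for a in ability]    # driven by ability
activities = [random.gauss(0, 1) for _ in range(n)]        # unrelated to ability
college_gpa = [a + random.gauss(0, 0.8) for a in ability]  # outcome driven by ability

print(f"SAT vs. college GPA:        r = {pearson_r(sat_score, college_gpa):.2f}")
print(f"activities vs. college GPA: r = {pearson_r(activities, college_gpa):.2f}")
```

The first correlation is substantial, the second hovers near zero: correlating a measure with an independent measure of the same construct is what assessing empirical validity amounts to.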
How did the 49 studies fare regarding measurement validity?
Twenty studies provide no references or reports of calculations re-
garding validity.
Twenty-nine studies provided references or carried
out calculations regarding validity. Of the 29 studies, four reported
and referenced the validity of their measures. Three presented the
reported validity of their measures. The other 22 merely referenced
the validity of their measures.
What Went Wrong and What Can Be Done About It?
Here are the lessons from Step 3: Use Reliable and Valid Measures.
1) Avoid self-constructed measures. No one has a reason to
trust them.
2) Use reliable measures. Tools must yield consistent results
on repeated application.
3) Use valid measures. Tools must measure the object of
your study.
In a way, it is not a surprise that these studies perform poorly on
the questions of measurement. Consider that the assessment of the
reliability and validity of various IQ tests as measures of intelligence
has been going on for more than 50 years, on large national samples
all around the world (e.g., in different countries, for different age
groups), yet questions are still sometimes raised concerning the valid-
ity and reliability of IQ scores. Because the study of homosexual par-
ents and their children is so new and untested, claims of reliable and
valid measurement are dubious without many careful and repeated
studies of such measures. These are nonexistent. One should con-
sider most measures as applied to homosexual parents and their chil-
dren to be only exploratory. They should form no basis for public
policy recommendations.

Notes to Chapter 3
1. The summaries of our comments can be found on pp. 21 (for Chapter 1) and
52-53 (for Chapter 2).
2. When the interviewer is the researcher also, which is the case in a number of
these studies, the potential problem of response contamination is very great. In
their path-breaking scientific study of sexual behavior in America, Laumann et
al. (1995) used a respondent self-administered form to ask a number of ex-
tremely sensitive questions, so that the interviewers could not influence
respondent’s answers (1995, p. 60).
3. Miller goes so far as to state that creating one’s own measure should be the
action of last resort in any social science research. Miller, 1991, p. 580.
4. This is evidenced by the lack of references regarding their measures. The
studies are: Barret and Robinson, 1990; Bozett, 1980; Cameron and Cameron,
1996; Gartrell et al, 1996; Hare, 1994; Harris and Turner, 1985; Javaid, 1992;
Lewin and Lyons, 1982; Lewis, 1980; Lott-Whitehead and Tully, 1992;
Lyons, 1983; McCandlish, 1987; Miller, 1979; O’Connell, 1993; Pagelow,
1980; Pennington, 1987; Rand et al, 1982; Riddle and Arguelles, 1989; Ross,
1988; Turner and Harris, 1990; Weeks et al, 1975; West and Turner, 1995; and
Wyers, 1987.
5. Rossi and Freeman, p. 230; Nachmias and Nachmias, p. 170-171.
6. We will spare the reader the technical aspects of assessing reliability, since these
are easily found in Nachmias and Nachmias, 170-175.
7. Miller, Handbook of Research Design and Social Measurement, 1991, p. 580.
8. Rossi and Freeman, p. 232.
9. This is evidenced by the lack of references regarding their measures. The
studies are: Barret and Robinson, 1990; Bozett, 1980; Cameron and Cameron,
1996; Gartrell et al, 1996; Hare, 1994; Harris and Turner, 1985; Javaid, 1992;
Lewin and Lyons, 1982; Lewis, 1980; Lott-Whitehead and Tully, 1992;
Lyons, 1983; McCandlish, 1987; Miller, 1979; O’Connell, 1993; Pagelow,
1980; Pennington, 1987; Rand et al, 1982; Riddle and Arguelles, 1989; Ross,
1988; Turner and Harris, 1990; Weeks et al, 1975; West and Turner, 1995; and
Wyers, 1987.
10. Chan et al, 1997; Kirkpatrick et al, 1981; McNeill et al, 1998; Tasker and
Golombok, 1995, 1997. Referencing measures is clearly the best approach in a
relatively new field.
11. The studies are: Bailey et al, 1995; Bigner and Jacobsen, 1989a, 1989b,
1992; Brewaeys et al, 1997; Flaks et al, 1995; Green, 1978; 1982; Green et al,
1986; Golombok and Tasker, 1996; Golombok et al, 1983; Kweskin and Cook,
1982; and Patterson, 1994a, 1994b, 1997. This makes them seem somewhat
more reliable than those lacking any reported or referenced reliability.
12. Crosbie-Burnett and Helmbrechty, 1993; Hoeffer, 1981; Huggins, 1989;
Koepke and Moran, 1992; Miller et al, 1982; and Mucklow and Phelan, 1979.
13. In Hudson, 1992, The Walmyr Assessment Scales, Scoring Manual.
14. In Hudson, 1992, The Walmyr Assessment Scales, Scoring Manual.
15. In Green Kolevzon and Vosler, 1985.
16. In Spanier, 1976.
17. McNeill et al, 1998, p. 60.
18. For example, Cohen provides a brief discussion of this point (1987, p. 537).
19. There are many different labels and differentiated types of validity cited in
the testing literature. Differentiating between construct and empirical validity is
the minimal necessary distinction needed here.
20. Barret and Robinson, 1990; Bozett, 1980; Cameron and Cameron, 1996;
Gartrell et al, 1996; Harris and Turner, 1985; Lewin and Lyons, 1982; Lewis,
1980; Lott-Whitehead and Tully, 1992; Lyons, 1983; McCandlish, 1987;
O’Connell, 1993; Pagelow, 1980; Pennington, 1987; Rand et al, 1982; Riddle
and Arguelles, 1989; Ross, 1988; Turner et al, 1990; Weeks et al, 1975; West
and Turner, 1995; and Wyers, 1987.
21. Tasker and Golombok, 1995, 1997; Brewaeys et al, 1997; Kirkpatrick et al, 1981.
22. Huggins, 1989; Koepke, 1992; Miller et al, 1982.
23. Bailey et al, 1995; Bigner et al, 1989a, 1989b, 1992; Chan et al, 1997;
Crosbie-Burnett and Helmbrechty, 1993; Flaks et al, 1995; Golombok and
Tasker, 1996; Golombok et al, 1983; Green, 1978; 1982; Green et al, 1986;
Hoeffer, 1981; Kweskin and Cook, 1982; and Patterson, 1994a; 1994b; 1997.

Chapter 4
It all Depends On Who You Ask

We have examined hypotheses, controls, and measurements. The
next key issue is sampling. Sampling is a simple concept: choosing
cases to include in your study. The question is: “Have you used a
method from which you can reasonably generalize?”
What Sampling Is and Why It Matters
Sampling is the systematic means by which cases are selected for
inclusion in a study. There are two basic types of samples: probabil-
ity and non-probability samples. The distinction is critical because
one cannot generalize from a non-probability sample.
Probability versus non-probability sampling is a fundamental dis-
tinction in research. The most important fact about the 49 studies
we evaluated is that 48 of them used non-probability samples. If we
exclude the four clinical case studies with five or fewer subjects, we
have 44 deeply flawed quantitative studies using non-probability
samples.
One cannot generalize from these studies. They may give us inter-
esting leads, and suggest possible insights, but nothing reliable can
be inferred from them outside the individuals studied.
In this chapter we explain the difference between probability and
non-probability sampling, describe the types of each, and give
examples of faulty sampling from the parenting studies.

Probability Sampling: The Key to Valid Research
In a probability sample, each unit of the population studied has a
known probability of being included in the sample. Such designs
use randomization methods to select the respondents for a study.
There are three types of probability samples: the simple random
sample, the stratified random sample, and the cluster sample. In the
simple random sample, each unit in the population has an equal
chance of being included in the study. In the stratified random
sample, the population is divided into strata, and each stratum must
be represented in known proportions within the sample. Indepen-
dent samples are selected by random procedure within each stratum,
and each unit must appear in one and only one stratum.
The third type of probability sampling is the cluster sample. The
population is divided into homogenous, geographically defined
groups. A sample of these groups is next drawn by random proce-
dure, and elements within each of these samples are in turn selected
by random procedure. For example, cluster-sampling households in
a major city might first involve separating the census tract into ho-
mogeneous geographical sections. The investigator would then ran-
domly select a sample of these geographical sections (first-stage
cluster sampling), then randomly select city blocks within the
sample of geographical sections (second-stage cluster sampling), and
finally select randomly the households within the sample blocks
(third-stage cluster sampling).
Modern sample surveys typically rely on both clustering and
stratification to obtain probability samples. These complex designs
require calculation of sampling weights, and statistical software for
estimation and testing.
Since the researcher can legitimately draw generalizations about the
larger population from probability samples, the size of the sample is
important. Larger probability samples yield more accurate results.
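As a concrete illustration, here is a minimal Python sketch of proportionate stratified random sampling over an invented sampling frame (the strata names and sizes are hypothetical). Each stratum is represented in known proportion, and within each stratum every unit has an equal chance of selection.

```python
import random

random.seed(3)

# Hypothetical sampling frame: every unit is listed with a known stratum.
population = (
    [("urban", i) for i in range(6000)]
    + [("suburban", i) for i in range(3000)]
    + [("rural", i) for i in range(1000)]
)

def stratified_sample(frame, total_n):
    """Simple random sample within each stratum, proportional to stratum size."""
    strata = {}
    for unit in frame:
        strata.setdefault(unit[0], []).append(unit)
    sample = []
    for units in strata.values():
        k = round(total_n * len(units) / len(frame))
        sample.extend(random.sample(units, k))  # equal chance within the stratum
    return sample

drawn = stratified_sample(population, 500)
print(len(drawn))
```

Because every unit’s inclusion probability is known, estimates from such a sample can be weighted back to the full population, which is exactly what no non-probability design permits.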
Non-Probability Sampling: It Doesn’t Do the Job
As mentioned above, 44 studies used forms of non-probability
sampling. The researchers on homosexuals and homosexual parents
repeatedly blundered by using non-probability samples to
make population estimates.
The First General Problem with Non-Probability Samples. The
first general problem with non-probability surveys has already been
mentioned. One cannot generalize from a non-probability sample,
although Patterson and others repeatedly try to do so. In a review
of the literature, Patterson claims that the estimated American ho-
mosexual population is 10 percent of the adult population based on
information gleaned from the dated Kinsey report.
Patterson puts the percentage of gays and lesbians who are
parents at 10 percent and 20 percent respectively, based on “large-
scale survey studies” such as those by Bell and Weinberg (1978) and
Saghir and Robins (1973).
In the same review, Patterson (1992)
also cites population estimates for the number of lesbians having
borne children after coming out.
Patterson claims that the figures under-represent the actual num-
bers. However, she has no scientific basis for this claim. There is no
way one can make population estimates on volunteer samples of any-
thing. It has nothing to do with discrimination or stigmatization of
homosexuals. It has everything to do with the basic distinction be-
tween a probability and non-probability sample.
The Second General Problem with Non-Probability Samples.
The size of the sample is irrelevant for making estimates, because
population estimates based on non-probability samples are not sci-
entific, despite appearing to be so.
A large non-probability sample
does not give you a better population estimate than a small non-
probability sample. The size of the sample is only relevant for
probability samples, where larger samples allow greater precision of
population estimates.
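This point is easy to verify by simulation. The Python sketch below uses an invented population in which exactly 30 percent of members have some trait, with the illustrative assumption that trait-holders are three times as likely to volunteer for a study. Enlarging the volunteer sample only sharpens the wrong answer; the random sample converges on the truth.

```python
import random

random.seed(4)

# Hypothetical population of 100,000; exactly 30% have the trait.
pop = [1] * 30_000 + [0] * 70_000
# Assumption for illustration: trait-holders are 3x as likely to volunteer.
volunteer_weight = [3 if has_trait else 1 for has_trait in pop]

for n in (100, 1_000, 10_000):
    random_est = sum(random.sample(pop, n)) / n
    volunteer_est = sum(random.choices(pop, weights=volunteer_weight, k=n)) / n
    print(f"n={n:>6}: random sample {random_est:.2f}, volunteer sample {volunteer_est:.2f}")
```

Under these assumptions the volunteer estimate settles near .56 rather than .30, no matter how large n grows: sample size buys precision, never freedom from selection bias.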
The Third General Problem with Non-Probability Samples.
The third general problem with non-probability samples is their
tendency toward bias. While probability samples systematically elimi-
nate this problem of bias through random selection procedures,
non-probability samples do not. Have the investigators overlooked
any obvious biases in their process of sample selection? Where did
the investigators get their participants? How were they found? Were
there any predetermined restrictions on eligibility? How does this af-
fect the final sample used for data collection and analysis, and how
might it affect the results?
For example, it is a sociological “truism” that college students
become more liberal the longer they stay in college. The original
study was done on a sample of college students at Bennington Col-
lege. Even if the study managed to use the complete population of
the college, the results would nevertheless contain biases associated
with the selection of the individual school itself (which may be con-
siderable). Magazine volunteer polls are common forms of non-
probability sampling. In a typical magazine poll, the magazine
reports the results of those who voluntarily respond to a question-
naire in a magazine. The respondents almost certainly differ in sys-
tematic ways from the non-volunteers, first by showing a strong
interest in the subject of the questionnaire. There are also biases in-
herent in those reading the magazine itself, as compared to the gen-
eral population. No matter how large the number of respondents,
findings from such magazine, television, or Internet surveys cannot
be generalized to the larger population.
The study of human sexual behavior is especially plagued by over-
reliance on non-probability sampling. The famous estimates of the
proportion of homosexuals in the U.S. population in the Kinsey re-
ports, for example, are based solely on non-probability samples,
which rely substantially upon volunteers and on heavy sampling of
highly unrepresentative locales such as prisons.
There are times, however, when the researcher must rely on non-
probability sampling. Social scientists sometimes do so if probability
sampling is too expensive or difficult. We will now turn to four
general types.
Specific Types of Non-Probability Samples.
There is no standardized classification of types of non-probability
sampling. The four main types of non-probability sampling are
convenience sampling, purposive sampling, quota sampling, and
snowball sampling.
Convenience sampling is just what it sounds like. One selects
whoever is available, such as students in an introductory psychology
course.
Purposive sampling involves selecting cases that the investigator
believes are representative of the larger population. An example
would be election forecasting. A small number of precincts in each
state are selected, based on the extent to which those precincts mir-
ror the overall state election returns for the previous election. Elec-
tion forecasting rests on the assumption that these precincts still will
mirror the state election returns.
With quota sampling, the investigator tries to select a sample as
similar as possible to the sampling population. An investigator may
seek an equal number of men and women, if the investigator thinks
the population from which the sample is drawn will have an equal
number of men and women. Quota sampling requires the investiga-
tor to use his or her judgment to identify all the important features
that might affect the sampling.
Snowball sampling is another method of selecting cases that is
not strictly speaking a form of sampling. The snowball method is
sometimes used to study rare populations. It presumes that a net-
work exists among members of the particular rare population. The
investigator depends on one member of the network to identify oth-
ers in the same network until a sufficient number of cases is
obtained.
As we said before, the researcher is sometimes forced to use non-
probability samples. Because participants in any non-probability
sample are not randomly selected, identifying where the participants
come from is another critical component for evaluating the design
of a study. Have the investigators overlooked any obvious biases in
their process of sample selection? Where did the investigators get
their participants? How were they found? Were there any predeter-
mined restrictions on eligibility? How does this affect the final
sample used for data collection and analysis, and how might it affect
the results?
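The snowball method described above amounts to a breadth-first traversal of the acquaintance network, which the following Python sketch makes explicit (the network itself is randomly invented for illustration). The sketch also makes the method’s limitation obvious: every case reached is, by construction, socially connected to the initial seed.

```python
import random

random.seed(5)

# Hypothetical acquaintance network among 200 members of a rare population.
network = {i: set() for i in range(200)}
for _ in range(600):
    a, b = random.sample(range(200), 2)
    network[a].add(b)
    network[b].add(a)

def snowball(net, seed_member, target_n):
    """Follow referrals outward from one seed until enough cases are found."""
    found, queue = {seed_member}, [seed_member]
    while queue and len(found) < target_n:
        for contact in net[queue.pop(0)]:
            if contact not in found:
                found.add(contact)
                queue.append(contact)
                if len(found) >= target_n:
                    break
    return found

recruited = snowball(network, seed_member=0, target_n=50)
print(len(recruited))
```

Anyone unconnected to the seed’s network can never enter the sample, which is one reason such samples cannot stand in for the larger population.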
Non-Probability Studies: Potentially Biased Participants
Of the 44 quantitative studies using a non-probability sampling
design, in five studies, the researchers established the initial list of
potential subjects. In three of these, subjects were initially recruited
from the investigators’ own list of clinical cases.
In the other two,
the subjects were drawn from lists of sperm bank patients. The re-
maining 39 studies relied on some form of self-selecting volunteers
for their final pool of subjects.
The inherent problem with relying
on self-selected volunteers is obvious: when either or both the study
and comparison groups know the purpose of the study and have a
large stake in its substantive outcome, one almost inevitably intro-
duces very serious sample selection biases into a study. The partici-
pants have every incentive to paint themselves in the best possible
light. Laumann et al. (1994), authors of a groundbreaking, scientifi-
cally rigorous study of sexual behavior in the United States, reviewed
non-probability based studies of sexual behavior and found that
generalizations “are very likely to be strongly biased in an upward
direction (i.e., overestimating the incidence of certain behaviors) be-
cause the samples are highly self-selected on the very variables of in-
terest.” It is Laumann et al.’s view that the public’s perception of
sexual behavior in the United States is formed by highly visible non-
probability samples, which overestimate substantially the true fre-
quency and types of sexual behavior.
Another well-known
quantitative social scientist has come to a similar conclusion.
These findings have serious implications regarding the studies
under review. All the studies on homosexual parents and their chil-
dren save one used some form of self-selecting non-probability
sample. This means they cannot answer the question of whether
there is “no difference” between homosexual versus heterosexual
parents and their children in the larger population. The fact of self-
selection bias and the problem of overestimation are inherent in
these studies.
We find that investigators drew both heterosexual and homo-
sexual parents from a socially active pool. This introduces biases, be-
cause the general public is comparatively inactive. Belonging to an
organization is more active than subscribing to a newsletter; being a
leader or regular participant is even more so. Ironically, being a study
volunteer is itself associated with a host of demographic traits (higher
levels of education and greater occupational prestige, for example)
that make one different from the general public.
Where did these studies draw subjects? Twenty of the studies re-
lied on some form of snowball sampling, all in combination with
other methods. Study participants were asked to name or contact
others who might fit the sample criteria and be interested in partici-
pating.
Homosexual Participants. Publications and newsletters were also
a major vehicle for recruiting homosexuals but not heterosexuals.
Seventeen studies relied on gay-lesbian or feminist publications for
the homosexual parent sample.
In contrast, one heterosexual
sample was obtained from an advertisement in a feminist newsletter,
which is likely to minimize rather than maximize differences be-
tween homosexual and heterosexual respondents.
Ten studies also recruited homosexual parents from gay-lesbian
parent support groups,
while 11 recruited gay-lesbian parents
through gay and lesbian organizations.
Three studies relied also on
feminist groups for lesbian mothers,
but none did so for the het-
erosexual samples. Three studies relied on day care centers and day
care newsletters to recruit lesbian mothers.
Since hard-to-find populations are expensive and time consuming
to properly survey using some form of probability sampling, it is no
wonder that investigators choose one or another variety of non-
probability sampling. Unfortunately, drawing from a more activist
pool of parents raises the problem of biased results. In particular,
the parents in gay-lesbian parent support groups have every incentive
to give the socially desirable answer to any questions about their
children as a way of justifying their choices and lifestyles. These
highly educated respondents are almost inevitably going to be biased
toward giving the socially desirable answer that would put them-
selves and their children in the most positive light.
Heterosexual Participants. Three studies used random samples
to obtain a heterosexual comparison group.
But the most common
means of recruiting heterosexual mothers was the single parent sup-
port group or newsletter, used by 12 studies (or 80 percent) of
those studies that used a heterosexual comparison group. Other
studies relied on day care centers and day care newsletters to recruit
heterosexual mothers, while two others drew heterosexual mothers
from the local PTA.
These studies did not attempt to correct for
biases, even after acknowledging their existence.
With the exceptions of Cameron and Cameron (1996) and
Bigner and Jacobsen (1989a and 1989b), these studies draw
heavily from sources that seem extremely unrepresentative of single
parents. The most common source of finding heterosexual single
mothers was through single parent organizations and newsletters
(80 percent of the studies using a comparison group). This seems
like an unusual source for single parents. While we do not know
what percentage of single mothers participate in single parent orga-
nizations, we do know that 46 percent of single mothers participate
moderately or often in school activities such as the PTA, school vol-
unteer work, school committee work, or school events.
Single moth-
ers from local PTAs may be a more representative source of
single-mother respondents. In fact, the typical single mother, ac-
cording to calculations from census data, is African-American or
Latino, 33 years of age, with less than one year of college educa-
tion. Since the typical single mother respondent in these studies is
Caucasian with some college education, the comparison sub-samples
are already grossly non-representative of the general single-mother
population.
Similarly, heterosexual participants drawn from institutional
daycare centers and newsletters are also not representative of hetero-
sexual mothers. The majority of children under age six are not in in-
stitutional daycare centers. During the workday, most children under
age six are either cared for by their parents (40 percent) or relatives
(21 percent), or are in home-based daycare (18 percent). Only 31 percent
are in day care centers, nursery schools, Head Start, or some form of
institutional pre-Kindergarten program.
This does not stop some from trying to salvage these studies.
One review by Patterson and Redding goes so far as to totally invert
the scientific standard regarding sampling and populations.
It ar-
gues that criticizing the non-probability sampling studies as non-
representative is unjustified. Why? “[B]ecause researchers do not
know the actual composition of lesbian mothers, gay fathers, or
their children (many of whom choose to remain hidden), and hence
cannot possibly evaluate the degree to which particular samples do
or do not represent the population. At present, there is no more rea-
son to argue that samples do not represent the population of lesbian
mothers, gay fathers, and their children than there is to argue that
they do represent it (italics added).”
Patterson and Redding’s argument turns standard principles of
sampling and statistics on their head. The criticism that these studies
are non-representative is not based on the assumption that samples
do or do not represent the larger population. Non-probability
samples are not random by definition and therefore cannot be used
to generalize to the larger population because there is no way of do-
ing so. To assume that a sample is representative unless shown other-
wise is simply absurd. Representativeness is a matter of science, not
of advocacy.
What Went Wrong and What Can Be Done About It
Here are the lessons from Step 4: Use Valid Samples.
1) Use probability samples. There is no substitute. Only
these offer any basis for scientific generalization to larger,
representative populations.
2) Ignore studies based on non-probability samples when
making policy decisions. They offer little basis for scientific
generalization. Therefore they have no valid implications for
general questions of public policy.
3) Especially ignore studies where participants recruit other
participants. These are so subject to bias, that the limited re-
sults cannot be trusted. Patterson and Redding argue that,
“In the long run, it is not the results obtained from any one
specific sample but the accumulation of findings from many
different samples that will be most meaningful.”
This is a
perfect illustration of the problem with these studies. In the
long run, non-probability samples will yield non-generaliz-
able and biased results, just as they have in the short run.
Nothing plus nothing equals nothing. Meaningful and cor-
rect generalization to the population has nothing to do with
the number of studies done on the subject, nor on the sizes
of their samples, nor on having “many different samples.” It
has to do with correct, that is to say probability, sampling.
Whether researchers have the time and resources to properly
survey populations such as homosexual parents is a question
that only the researchers can answer. But proper research
can be done, and there is no excuse in this age of interdisci-
plinary training and diffusion of elementary knowledge of
the principles of modern scientific statistical methods for the
low quality and inflated claims made by the studies we
evaluate here. Laumann et al (1994) stands as an example of
what can and should be done for the study of the impact
of homosexual parents on their children. They successfully
used probability methods to obtain their study subjects, and
then successfully minimized both bias on the part of the in-
vestigators (because the respondents who are sampled are
not known individually to the researchers) and in the selec-
tion of subjects (because the sample is randomly selected, it
does not consist of volunteers, and because they achieved a
high response rate in obtaining information from the origi-
nal sample).
Any study seeking to survey a special population, especially for
the purpose of influencing public policy decisions, ought to have
available to its research team a professional sampling statistician.
Brought in during the early stages of the planning process, the stat-
istician would assist in creating a proper sampling design, and hope-
fully would prevent the repetition of the flaws present in the studies
we have investigated, and might even prevent the inflated claims for
their findings put out by their authors.
Notes to Chapter 4
1. The only exception is Cameron and Cameron (1996). This study, while using
randomization, did not select a national sample, and because it did not use
stratification, it obtained only 17 adults who reported having homosexual
parents (Cameron and Cameron, 1996, p. 764). Other problems are discussed in
Chapter 5.
2. They are: Bailey et al, 1995; Bigner and Jacobsen, 1989a, 1989b, 1992;
Bozett, 1980; Brewaeys et al, 1997; Chan et al, 1998; Crosbie-Burnett and
Helmbrechty, 1993; Flaks et al, 1995; Gartrell et al, 1996; Golombok and
Tasker, 1996; Golombok et al, 1983; Green, 1978, 1982; Green et al, 1986;
Hare, 1994; Harris and Turner, 1985; Hoeffer, 1981; Huggins, 1989; Javaid,
1992; Kirkpatrick et al, 1981; Koepke et al, 1992; Kweskin and Cook, 1982;
Lewin and Lyons, 1982; Lewis, 1980; Lott-Whitehead and Tully, 1992;
Lyons, 1983; Miller, 1979; McNeill et al, 1998; Miller et al, 1982; Mucklow
and Phelan, 1979; O’Connell, 1993; Pagelow, 1980; Patterson, 1994a, 1994b,
1997; Pennington, 1987; Rand et al, 1982; Riddle and Arguelles, 1989; Tasker
and Golombok, 1995, 1997; Turner et al, 1990; West and Turner, 1995; and
Wyers, 1987.
3. Nachmias and Nachmias, 1996, pp. 185-195 provide a textbook discussion of
types of probability sampling, and examples of drawing a nation-wide sample.
4. See also Kish, 1965; Kalton, 1987; and Schaeffer et al., 1996 for detailed and
technical discussions of modern techniques of probability sampling.
5. Kish, 1965; Kalton, 1987; and Schaeffer et al., 1996, op. cit., for detailed dis-
cussion of the sophisticated sampling methods developed by modern sampling
statisticians.
6. Five used no form of sampling at all.
7. 1992, p. 1026. Patterson obtained the figures from Kinsey, Pomeroy, and
Martin, 1948.
8. Ibid.
9. Ibid.
10. Ibid. Estimates of lesbian mothers are provided by Falk (1989), Gottman
(1990), Hoeffer (1981), and Pennington (1987), while Bozett (1987),
Gottman (1990), and Miller (1979) provide estimates of gay fathers. The num-
ber of children raised by homosexual parents is estimated by Bozett (1987),
Peterson (1984), and Schulenberg (1985).
11. The size of the sample does affect the power of the statistical tests used to
detect statistical significance (this is discussed in far more detail in subsequent
sections on the logic of statistical testing).
12. Laumann et al. provide an explanation as to how the Kinsey 10 percent fig-
ure became accepted as the “right” proportion of homosexuals in the U.S. popu-
lation in an insightful discussion entitled, “The Myth of 10 Percent and the
Kinsey Research” (1994, pp. 287-290).

13. Nachmias and Nachmias, 1996, pp. 184-85.
14. Crosbie-Burnett and Helmbrechty, 1993; Green, 1978; and Pennington, 1987.
15. Brewaeys et al, 1997; Chan et al, 1998.
16. Bailey et al, 1995; Bigner and Jacobsen, 1989a, 1989b, 1992; Bozett,
1980; Flaks et al, 1995; Gartrell et al, 1996; Golombok and Tasker, 1996;
Golombok et al, 1983; Green, 1982; Green et al, 1986; Hare, 1994; Harris and
Turner, 1985; Hoeffer, 1981; Huggins, 1989; Javaid, 1992; Kirkpatrick et al,
1981; Koepke et al, 1992; Kweskin and Cook, 1982; Lewin and Lyons, 1982;
Lewis, 1980; Lott-Whitehead and Tully, 1992; Lyons, 1983; Miller, 1979;
McNeill et al, 1998; Miller et al, 1982; Mucklow and Phelan, 1979; O’Connell,
1993; Pagelow, 1980; Patterson, 1994a, 1994b, 1997; Rand et al, 1982; Riddle
and Arguelles, 1989; Tasker and Golombok, 1995, 1997; Turner et al, 1990;
West and Turner, 1995; and Wyers, 1987.
17. p. 46.
18. 1994, p. 46.
19. Greeley, 1994. Greeley compared two surveys: one a popular non-probability
survey, “The Janus Report,” the other a nationally recognized, probability-based
sample survey, the GSS, which is widely used in the quantitative social
sciences. The first set of results was from “The Janus Report” (Janus and Janus,
1993). These were based on a non-probability sample of 8,000 respondents
gathered unsystematically from various sources, including patients of sex thera-
pists and their friends and acquaintances. The other set of results were from the
General Social Survey (GSS). The GSS is based on a national household-based
probability sample of adults. It has been conducted by the National Opinion Research
Center at the University of Chicago nearly every year for the past 20 years, with
funding from the National Science Foundation (General Social Surveys, 1972-1996:
Cumulative Codebook, 1996). Greeley compared, among other data, the percent-
age of persons reporting they had sex at least once a week. Janus estimates were
much higher than those in GSS. For men in the youngest age group (between
the ages of 18 and 26), 72 percent in the Janus report claimed to have sex at
least once a week versus 57 percent in the GSS. For women in the same age
group, it was 68 percent in Janus versus 58 percent in the GSS. In another com-
parison, 83 percent of Janus men versus 56 percent of GSS men and 68 percent
of Janus women versus 49 percent of GSS women between 39 and 50 years of
age report having sex at least once a week. The overestimates are greatest among
the oldest respondents: 69 percent of Janus men versus 17 percent of GSS men,
and 74 percent of Janus women versus 6 percent of GSS women over 65. The
Janus estimates are four times that of the GSS estimates for men over 65, and
12 times that of the GSS estimates for women. That is, Janus respondents claim
to have sex much more often than those surveyed in the GSS. The overall pat-
tern is one of consistent overestimation by the non-scientific Janus Report.
20. The following studies rely on the snowball technique: Crosbie-Burnett and
Helmbrechty, 1993; Flaks et al, 1995; Gartrell et al, 1996; Hare, 1994; Harris
and Turner, 1985; Huggins, 1989; Javaid, 1992; Kirkpatrick et al, 1981;
Koepke et al, 1992; Lewin and Lyons, 1982; Lott-Whitehead and Tully, 1992;
Miller, 1979; O’Connell, 1993; Patterson, 1994a, 1994b, 1997; Rand et al,
1982; Riddle and Arguelles, 1989; Turner et al, 1990; and West and Turner, 1995.
21. Bailey et al, 1995; Crosbie-Burnett and Helmbrechty, 1993; Flaks et al,
1995; Gartrell et al, 1996; Golombok and Tasker, 1996; Golombok et al, 1983;
Green et al, 1986; Hare, 1994; Harris and Turner, 1985; Kirkpatrick et al,
1981; Lewin and Lyons, 1982; Lewis, 1980; O’Connell 1993; Tasker and
Golombok, 1995; Tasker and Golombok, 1997; Turner et al, 1990; West and
Turner, 1995.
22. Kirkpatrick et al, 1981.
23. Bigner and Jacobsen, 1989a, 1989b, 1992; Crosbie-Burnett and
Helmbrechty, 1993; Flaks et al, 1995; Gartrell et al, 1996; Golombok and
Tasker, 1996; Golombok et al, 1983; Tasker and Golombok, 1995; 1997.
24. Flaks et al, 1995; Gartrell et al, 1996; Golombok and Tasker, 1996;
Golombok et al, 1983; Harris and Turner, 1985; Lott-Whitehead and Tully,
1992; Pagelow, 1980; Tasker and Golombok, 1995, 1997; Turner et al, 1990;
Wyers, 1987.
25. Green et al, 1986; Miller et al, 1982; O’Connell, 1993.
26. Harris and Turner, 1985; Lewin and Lyons, 1992; Turner et al, 1990.
27. Bigner and Jacobsen, 1989a, 1989b; Cameron and Cameron, 1996.
28. These studies are Bigner and Jacobsen, 1992; Golombok and Tasker, 1996;
Golombok et al, 1983; Harris and Turner, 1985; Huggins, 1989; Javaid, 1992;
Kirkpatrick et al, 1981; Kweskin and Cook, 1982; Lewin and Lyons, 1982;
Pagelow, 1980; Tasker and Golombok, 1995, 1997.
29. Harris and Turner, 1985; Lewin and Lyons, 1992.

30. Miller et al, 1982; Mucklow and Phelan, 1979.
31. In contrast, Bell, Weinberg and Hammersmith (1981) attempted to correct
for self-selection bias in their non-probability sample of homosexuals in the fol-
lowing way. They believed (correctly) that a homosexual respondent’s familiar-
ity with various theories of homosexuality might bias their responses. As a filter
question, they asked whether the homosexual respondents had read books or
articles, or attended lectures about homosexuality. Responses were subsequently
controlled for this a priori familiarity. In some cases, it made a significant differ-
ence, in others, it did not, and reporting of results was adjusted accordingly.
Bell, Weinberg, and Hammersmith, 1981, p. 20.
32. Zill and Nord, 1994, p. 46.
33. Zill and Nord, 1994, pp. 14-15.
34. “Regular Child Care Arrangements for Children Under 6 years Old, by
Type of Arrangement,” 1995, Statistical Abstract, 1996, p. 386.
35. Patterson and Redding, 1996, p. 44.
36. Ibid.
37. Ibid.
38. Laumann et al. 1994, “Sampling Procedures and Data Quality,” pp. 549-

Chapter 5
Just by Chance?
Statistical Testing
We come now to the culmination of the social-scientific process:
statistical testing. For a non-social scientist this may sound like a val-
ley rather than a mountaintop. But if hypotheses are properly con-
ceptualized (Chapter 1); if extraneous variables are properly
controlled (Chapter 2); if concepts are properly measured (Chapter
3); and if populations are properly defined and samples properly
drawn (Chapter 4); then we are ready for the process of statistical hy-
pothesis testing. It should be quite straightforward.
The same-sex parenting studies lack scientific rigor at this step. Of
the 49 published studies, we find the following:

• Four are case studies that do not carry out any statistical
analysis of the data.
• Eighteen use only descriptive statistics, which offer no basis
for generalization.
• Five use statistical tests, but fail to apply them to any kind
of control group.
• Twenty-two use statistical tests in comparison with at least
one control group.
• Forty-eight lack sufficient statistical power to validate their
findings.

This means that no scientific generalizations can be reliably made
from these data. The researchers have not shown that the results are
not a function of chance factors. This chapter begins by explaining
different kinds of statistics. It then takes a closer look at inferential
statistics, the most important test for this research. It identifies the
two types of statistical errors, and focuses on how to avoid Type II
error. We conclude with a list of recommendations for good research
methodology in this area.
What Statistical Tests Are and Why They Matter
Quantitative summaries of data come in two flavors: descriptive
and inferential. Descriptive statistics are used to organize and
summarize data; percentages are one example. They are inherently
limited: they do not allow the researcher to generalize scientifically
beyond the findings at hand. Eighteen of the studies use only
descriptive statistics. Descriptive statistics include single-variable
(univariate) measures, such as the mean, median, range, standard
deviation, and interquartile range. They are statistics computed on
one variable, and they describe that variable mathematically.
Descriptive statistics may also appear in two-variable (bivariate)
form. These are usually expressed as percentage differences between
two groups. Less formally, they may take the form of comparing sets
of percentages of one group to another.
For example, Pagelow (1980) compares the percentage of les-
bian mothers and heterosexual mothers with regard to custody
problems, living arrangements, home ownership, income after di-
vorce and job discrimination. Similarly, Javaid (1992) compares 26
children raised by lesbian mothers with 28 children raised by hetero-
sexual mothers, in terms of percentages. Javaid finds that 80 percent
of girls in the heterosexual group (12 of 15) desired marriage and
children for themselves compared to 55 percent of girls in the les-
bian group (6 of 11). Javaid notes that a majority of both groups
desired marriage and children.
But this is the wrong answer to the
question of whether being raised by a lesbian mother makes girls less
likely to want marriage and children for themselves. The proper sci-
entific question is: “Are girls raised by lesbian mothers less likely to
desire marriage and children than girls raised by heterosexual moth-
ers?” Put in methodological terms, the question is whether there is a
statistical association between “being raised by a lesbian mother”
and “wanting to be married with children when grown-up.” It ap-
pears that such an association may exist, but we do not know for cer-
tain, because the samples were far too small and no inferential
statistical tests were used.
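The missing inferential step can be sketched in a few lines. Taking Javaid’s reported counts at face value, Fisher’s exact test (our choice of test, not Javaid’s; it is the standard test for a 2x2 table with samples this small) shows that the observed difference could easily be a chance result:

```python
from scipy.stats import fisher_exact

# Javaid's reported counts: 12 of 15 girls in the heterosexual group and
# 6 of 11 in the lesbian group desired marriage and children.
table = [[12, 3],   # heterosexual group: desired, did not
         [6, 5]]    # lesbian group: desired, did not
odds_ratio, p_value = fisher_exact(table)
# p is about 0.22, far above 0.05: with samples this small, a difference
# of 80 percent versus 55 percent could easily arise by chance.
print(round(p_value, 2))
```

A non-significant result here does not show the groups are the same; it shows only that samples of 15 and 11 cannot distinguish them.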
Because of their limitations, descriptive
statistics have been supplemented with more sophisticated inferential
statistics. Inferential statistics enable the investigator to estimate how
likely it is that the data support the research hypothesis. In other
words, what is the probability that the differences found in the data
are real rather than a function of chance? Results of inferential
statistical testing are typically reported as statistical estimates.
This is confusing to many people, but it is much more accurate.
Because descriptive statistics are irrelevant in questions of generali-
zation, we will focus the rest of the chapter on inferential statistics.
Types of Inferential Statistics Used in the Studies
The simplest types of inferential tests are two-variable (bivariate)
statistics, such as chi-square and t-tests that rely on one independent
variable and one dependent variable. While these measures are a con-
siderable improvement over descriptive statistics alone, they are not
entirely satisfactory. They do not provide as complete a range of ca-
pabilities as is provided by the general linear model (to be discussed
below). They are, however, more easily understood than the more
complex statistics, and they allow a useful degree of analytic
simplification not possible with descriptive statistics such as
percentages alone. We will discuss these tools of data analysis as
used by the studies.
One-Independent One-Dependent Variable Tests. We find that
18 studies use one-independent/one-dependent variable tests to
compare homosexual and heterosexual subsamples.
These studies,
already severely limited by small, self-selected volunteer samples, are
further compromised by using a one-independent variable test.
While they are a considerable improvement over percentages alone,
statistical tests using only one independent variable are not satisfac-
tory, because they provide no way of controlling statistically for ex-
traneous variables. They do not provide as complete a range of
capabilities as is provided by multivariate statistical testing. They are,
however, more easily understood, and allow a greater degree of ana-
lytic sophistication compared to presenting descriptive statistics such
as percentages alone.
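As a sketch of what such a one-independent-variable test looks like in practice, here is a two-group t-test on simulated outcome scores; the data are invented purely for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

# One independent variable (group membership), one dependent variable
# (an outcome score).  Both groups are drawn from the same distribution
# here, so the test should usually find no significant difference.
rng = np.random.default_rng(1)
group_a = rng.normal(50, 10, 30)   # e.g., children of one parent group
group_b = rng.normal(50, 10, 30)   # e.g., children of the other group

t_stat, p_value = ttest_ind(group_a, group_b)
# The test answers only one question: is the difference in means likely
# due to chance?  It offers no way to hold extraneous variables constant.
print(round(t_stat, 2), round(p_value, 2))
```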
Multivariate Models. The preferred method of statistical testing
is reliance on some kind of multivariate statistical test that can math-
ematically express the relationship of two or more variables on a
third. These tests are variations of what statisticians call the general
linear model, and are mathematically related. Multivariate statistics
allow calculation of the effects of any one independent variable,
while holding constant the effects of other variables. A researcher
uses multivariate statistical analysis to control for the
relationship(s) between these other variables and the dependent vari-
able. It is a way to avoid attributing the effects of these additional
variables to the independent variable. In other words, the investiga-
tor should be ruling out spurious correlations or spurious non-cor-
relations. In conceiving of a relationship, the normal approach is to
avoid spurious correlation. “Spurious correlation” is the classic situ-
ation where one is cautioned against confusing correlation with cau-
sation. Here, the situation is that one has a statistically significant
bivariate relationship that has been hypothesized to hold true.
The investigator should control (i.e., statistically adjust) for other
potentially confounding variables to make sure the significant bivari-
ate relationship is still significant. As noted in our earlier discussion
on suppressor variables, spurious non-correlation is the problem of
overlooking suppressor variables. Again, because the interest of the
investigators is in trying to show that something is not the case,
that there is no difference between heterosexual and homosexual
parents, the search for potential suppressor variables is of critical
importance in either establishing or rejecting their hypothesis.
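The danger of spurious correlation can be illustrated with simulated data. In the hypothetical sketch below, group membership has no direct effect on the outcome; an extraneous variable correlated with group drives it, so the raw bivariate group difference is large but vanishes once the extraneous variable is held constant in a multivariate regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Invented data: "group" has no direct effect on the outcome, but a
# confounder (say, household income) differs between groups and is the
# real driver of the outcome.
group = rng.integers(0, 2, n)                     # 0/1 group membership
income = 50 + 10 * group + rng.normal(0, 5, n)    # confounder tied to group
outcome = 2.0 * income + rng.normal(0, 5, n)      # driven only by income

# Bivariate view: the raw group difference looks large (spurious).
raw_diff = outcome[group == 1].mean() - outcome[group == 0].mean()

# Multivariate view: regress outcome on group AND income; the group
# coefficient shrinks toward zero once income is held constant.
X = np.column_stack([np.ones(n), group, income])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
group_effect_adjusted = beta[1]

print(round(raw_diff, 2), round(group_effect_adjusted, 2))
```

The same machinery, run in reverse, is how a suppressor variable can hide a real difference: a bivariate "no difference" can become a difference once the suppressor is controlled.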
Flaws in the Studies Despite Using Inferential Tests
Only five studies use statistical techniques with more than two
variables. Brewaeys et al (1997) use a two-way analysis of variance
(ANOVA), Flaks et al (1995) and Koepke et al (1992) use multi-
variate analysis of variance (MANOVA), while Chan et al (1998)
and Green et al (1986) use multiple regression. That is, they control
for extraneous variables, which they treat statistically as the other
independent variables along with parent sexual preference.
Multivariate without Adequate Control Groups. Koepke et al’s
study, however, does not use a heterosexual comparison group. In-
stead, Koepke et al (1992) compare lesbian couples with and with-
out children. Here, too, the lack of a proper comparison group
cannot be compensated for by using sophisticated statistics. This
study cannot be used as scientific evidence regarding homosexual
parents and their children, although the authors conclude that the
study can be used for such purposes.
The use of MANOVA, which is a statistical test for situations em-
ploying multiple independent and multiple dependent variables in
the same estimation procedure (e.g., Flaks et al, 1995), also intro-
duces a bias in favor of the investigators’ general hypothesis, of find-
ing no differences between homosexual and heterosexual families.
This is because the use of multiple dependent variables as well as
multiple independent variables reduces the degrees of freedom avail-
able for statistical use, and thus reduces the power of the statistical
test for correlations of a given size.
In short, with the same sample
size, and the same overall correlation between independent and de-
pendent variables, it is more difficult to reject the null hypothesis
than it would be using a single dependent variable and multiple
Inferential statistical testing is often misused, but scientific gener-
alizations cannot be made without the use of inferential statistical
testing. Use of inferential statistics alone does not, however, make a
study minimally acceptable. Major defects in basic research design,
measurement, and data collection cannot be corrected by using in-
ferential statistical testing.
Multivariate Without Proper Comparison Groups. Five studies
used inferential statistics within their homosexual samples, but did
not use inferential statistics to compare homosexual and hetero-
sexual samples, a critical design flaw, as discussed previously. These
studies ought not to be considered as providing scientific evidence
regarding the quality of homosexual parenting and/or the condition
of their children because they lack an appropriate comparison group.

Page 88
For example, Bailey et al (1995) used inferential statistical tests
to compare the rate of homosexuality among adult sons raised by
gay fathers with the incidence of gay adoptive brothers, the incidence
of gay monozygotic twins, and the incidence of gay dizygotic twins.
The rate of homosexuality among adult sons of gay fathers was not
statistically different from the rate of homosexuality among adoptive
brothers. The incidence of homosexuality among adult sons of gay
fathers was lower than either the rate of homosexuality among
monozygotic twins or the rate of homosexuality among dizygotic
twins. Moreover, the differences were statistically significant.
Bailey et al, however, failed to compare the homosexuality rate of
adult sons of homosexual fathers to the rate of homosexuality of
adult sons of heterosexual fathers, a serious limitation of their study.
Bailey et al recognize this limitation but nevertheless argue that
male sexual orientation is heritable, not environmental. In their
abstract, the authors state that their “results suggest that any envi-
ronmental influence of gay fathers on their sons’ sexual orientation
is not large.”
In fact, the absence of a heterosexual control group
means that Bailey et al’s results are largely useless.
In another case, Koepke et al (1992) used inferential statistics
and compared lesbian couples with and without children in terms of
the quality of their relationship. Their statistical analysis found les-
bian couples with children had higher relationship satisfaction and
felt better about their sexual relationship compared with lesbian
couples without children. The differences were statistically signifi-
cant. This led the investigators to the conclusion, among others,
that since children thrive in a stable, loving relationship between
caregivers, the stable and loving quality of lesbian relationships
should benefit children. “It is important that educators, clinicians,
and practitioners who work with lesbian families understand that les-
bian relationships can provide a positive family environment for child
rearing (italics ours).”
The investigators, however, failed to question the children in the
study, failed to include heterosexual couples without children and,
most important of all, failed to include a control group of hetero-
sexual couples and their children. They should have compared the
views of the children of the lesbian couples with children of the het-
erosexual group if they are going to draw any remotely valid conclu-
sions about positive family environments. The authors recognize the
design limitations of their study; nevertheless, they make policy
pronouncements regarding lesbian couples and children anyway.
This leaves us with 22 (or 45 percent) of the studies of homo-
sexual parents and their children that use inferential statistics in any
kind of comparison with at least one heterosexual control group.
Despite having a proper heterosexual control group, these studies
are still flawed.
Lack of Published Estimates. The use of a control group and of
statistical significance tests alone is no guarantee of a study’s quality.
The use of inferential statistics is greatly misleading if the investiga-
tors do not publish their statistical estimates, and only report
whether the results were significant or not. For example, Green et al
(1986) carry out several regression analyses. They report the variables
entered into their statistical equations,
but noticeably fail to report
the actual values. These results should be reported, even if they were
not statistically significant. Such reporting is important, because the
differences may generally be consistent even if not significant, and
because the level of significance that is actually achieved may be of
considerable relevance. For example, one’s interpretation of the re-
sults is likely to be quite different if the achieved probability value is
0.06 or 0.5. It is also necessary to publish detailed results because it
gives independent observers a chance to see what was actually done
and to ascertain if standard procedures for analysis of statistical data
were followed or not.
Making It Too Easy. Investigators can also use an overly strin-
gent statistical procedure, further increasing the probability of find-
ing non-significant results. All of the studies by Golombok and
Tasker use what are called Bonferroni corrections, as does Chan et
al (1998). While often useful in studies that explicitly seek to reject
the null hypothesis, they should not be used at all in studies that
seek to show that there is no effect in the data, because they make it too easy
for the investigator to prove his case and not, as the inventors of
these techniques originally intended, to make it more challenging
for the investigator to prove his case. We discuss why this is below.

Chan et al provides an example worth considering at some
length. This relatively sophisticated piece of research studied chil-
dren created via donor insemination and their parents.
The authors
compare lesbian and heterosexual biological mothers, and the non-
biological lesbian parent and the father (non-biological because of
donor insemination). The investigators compare statistically the par-
ents’ self-reports on several demographic characteristics: age, educa-
tional attainment, employment (hours per week), annual individual
income, and annual household income. Differences in educational
level between the lesbian biological mother and heterosexual mother
and between the lesbian non-biological mother and the father were
statistically significant.
The critical level of significance needed to reject the null hypoth-
esis was made extremely stringent. Chan et al (as well as Golombok
and Tasker’s studies) use the Bonferroni correction for their statistical
tests. The ordinary level of statistical significance is 0.05, meaning
that out of 100 t-tests, one may obtain significant results five times
due to chance alone. A p value of less than .01 means that significant
results in one out of 100 tests may be due to chance, and so forth.
Chan et al use t-tests extensively. The t-test is a statistical proce-
dure that compares the means of two groups to see whether the dif-
ference between them is likely to be due to chance or is statistically
significant. Since Chan et al performed t-tests for several variables
for their study and comparison groups, they adjusted their
significance levels using the Bonferroni correction, so that results
were considered statistically significant only when they achieved a
significance level of .003 (3 out of 1,000 t-tests) rather than the
usual .05 level of statistical significance.
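The arithmetic of the correction is simple: with m planned comparisons, each individual test is judged at alpha divided by m. The value m = 15 below is our illustrative assumption, not a figure from Chan et al; it happens to reproduce the .003 per-test threshold just cited:

```python
# Bonferroni correction: with m planned tests, judge each individual
# test at alpha / m so that the chance of any false positive across the
# whole family of tests stays near alpha.  m = 15 is illustrative only.
alpha = 0.05
m = 15
per_test_alpha = alpha / m
familywise = 1 - (1 - per_test_alpha) ** m   # about 0.049, near the nominal 0.05
print(round(per_test_alpha, 3))              # 0.003
```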
There are several problems with the kind of study Chan et al car-
ried out. This study on children resulting from donor insemination
is exploratory, and it also suffers from small and imbalanced
subsamples (e.g., 51 lesbian families versus 25 heterosexual families).
These conditions increase the likelihood of finding non-sig-
nificant results and increase the probability of failing to reject the
null hypothesis when it should be rejected.
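How much power is lost can be sketched with a normal approximation to the two-sample test. The subsample sizes (51 and 25) are those just cited; the “medium” standardized effect size d = 0.5 is our assumption for illustration, not a figure from the study:

```python
from scipy.stats import norm

# Normal-approximation power for a two-sample comparison of means.
def approx_power(d, n1, n2, alpha):
    z_crit = norm.ppf(1 - alpha / 2)          # two-sided critical value
    ncp = d * (n1 * n2 / (n1 + n2)) ** 0.5    # noncentrality parameter
    return norm.cdf(ncp - z_crit)

p_usual = approx_power(0.5, 51, 25, 0.05)     # conventional alpha
p_bonf = approx_power(0.5, 51, 25, 0.003)     # Bonferroni-adjusted alpha
# Power is only about one-half at the usual alpha and drops well below
# that at the Bonferroni-adjusted threshold, so true differences of
# moderate size would usually go undetected.
print(round(p_usual, 2), round(p_bonf, 2))
```

A finding of “no significant difference” under these conditions is therefore close to a foregone conclusion, which is the crux of the argument made here.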
Moreover, there are differences between lesbian and heterosexual
families in the study that favored the lesbian families. The lesbians
were older, had greater individual incomes (e.g., among the biologi-
cal mothers, lesbians earned $46,000 and heterosexuals earned
$31,700), and had greater household incomes ($82,000 versus
$63,200) compared to their heterosexual counterparts. The lesbian
parents, however, worked fewer hours per week than did their het-
erosexual counterparts (e.g., for lesbian non-biological mothers,
36.9 hours per week versus 40.2 hours per week for fathers).
The primary problem in using the Bonferroni correction in these
studies, however, is that it introduces a bias in favor of the investiga-
tors’ own hypotheses. The correction adjusts downward the signifi-
cance level for each individual test so as to increase the probability of
finding non-significant results. In the normal case, where the inves-
tigators wish to reject the null hypothesis, use of the Bonferroni cor-
rection can be very helpful because it counts against the
investigators’ own affirmative research hypothesis. In other words,
the investigator makes the situation more difficult for himself or
herself and, if the results warrant it, can have more confidence than
otherwise in his or her conclusion.
In the circumstances here, however, the investigators intended to
find, as stated in their initial research hypothesis,
that there is no
difference between the homosexual and heterosexual parents. Thus,
applying the Bonferroni correction favors obtaining a finding of no
difference, which is what the investigators are trying to achieve.
Rather than making it more difficult for themselves, the investiga-
tors have misunderstood the logic of scientific procedure and have
made it easier for the results to favor their view that there is no dif-
ference in child outcomes between homosexual and heterosexual do-
nor inseminated parents. Researchers should carry out the following
procedure instead. In situations where the investigators desire to
show that no significant differences exist, we propose that the
logical parallel of employing the Bonferroni correction in the
normal situation is to use a less stringent significance level for reject-
ing the null hypothesis than is normally used. For example, a study
might use a significance value of p < .10. None of the investigators
(including Chan et al) do this, even though Chan et al is the only
research team to mention the problem of accepting the null hypoth-
esis as a major logical problem that their study and all the other
studies face. It is one thing to notice a problem, but another thing
to solve it. Chan et al, perhaps the most technically sophisticated of
the studies, fails in this task.
What Went Wrong and What Needs to be Done
The lessons of Step 5, Using Statistical Tests, are as follows:
1) Do the prior four steps correctly so that your testing will
not be a waste of time. Don’t try to make up for poor de-
sign or controls by using fancy statistical techniques.
2) Use inferential statistics so that you may properly assess
whether your findings are due to chance or not.
3) Properly correct for repeated use of statistical tests, keep-
ing in mind that you are engaged in the difficult task of
“trying to affirm the null.”
4) Make sure statistical tests do not inadvertently favor the
investigator’s hypothesis. One should not use applications
of conventional statistical procedure, developed for affirm-
ing the alternative research hypothesis, without extensive
justification when trying to find “no difference.” When a
routine is developed to find “X,” the investigators looking
for “Not X” should be made to explain why the routine is
applicable to “Not X” as well.
If statistical routines do not follow the steps above, the findings
of “no difference” should be considered unreliable.
Notes to Chapter 5

1. Barret and Robinson, 1990; McCandlish, 1987; Ross, 1988; Weeks et al,
2. Bozett, 1980; Gartrell et al, 1996; Green, 1978, 1982; Hare, 1994; Javaid,
1992; Kirkpatrick et al, 1981; Lewin and Lyons, 1982; Lewis, 1980; Lott-
Whitehead and Tully, 1992; Lyons, 1983; Miller, 1979; O’Connell, 1993;
Pagelow, 1980; Pennington, 1987; Riddle and Arguelles, 1989; West and
Turner, 1995; and Wyers, 1987.